
AutoMQ vs Kafka: An Independent In Depth Evaluation and Comparison by Xiaohongshu

lyx edited this page Jan 17, 2025 · 1 revision

Test Background: The Xiaohongshu message engine team is working closely with the AutoMQ team to build the community and explore cutting-edge cloud-native messaging engine technologies. This article provides a comprehensive evaluation of AutoMQ based on the OpenMessaging Benchmark framework. We welcome everyone to join the community and share their own evaluation experiences.

1. Testing Conclusion

This article compares the performance of the cloud-native messaging engine AutoMQ with that of Apache Kafka® (version 3.4).

Testing Conclusion:

  • Real-time Read/Write: With the same cluster size, AutoMQ's maximum read/write throughput is three times that of Apache Kafka, and the E2E latency is 1/13 of Apache Kafka.

  • Catch-up Read: With the same cluster size, AutoMQ's peak catch-up read throughput is twice that of Apache Kafka, and during the catch-up read, AutoMQ's write throughput and latency remain unaffected.

  • Partition Reassignment: AutoMQ's partition reassignment takes seconds on average, whereas Apache Kafka's partition reassignment takes minutes to hours on average.

2. Testing Configuration

The benchmark is built on an enhanced version of the Linux Foundation's OpenMessaging Benchmark and simulates real user scenarios with dynamic workloads.

2.1 Configuration Parameters

By default, AutoMQ forces data to be flushed to disk before responding, using the following configuration:



acks=all
flush.messages=1

AutoMQ ensures high data durability through the multi-replica mechanism built into EBS, so a multi-replica configuration at the Kafka layer is unnecessary.

For Apache Kafka, version 3.6.0 is used. Following Confluent's recommendations, `flush.messages=1` is not set. Instead, data reliability relies on three replicas with asynchronous flushing from memory (a data center power outage may still cause data loss). The configuration is as follows:



acks=all
replicationFactor=3
min.insync.replicas=2

2.2 Machine Specifications

Each machine has 16 cores, a maximum network bandwidth of 800 MB/s, and a cloud disk with 150 MB/s of bandwidth.

3. Detailed Comparison

3.1 Real-time Read and Write Performance Comparison

This test measures the performance and throughput limits of AutoMQ and Apache Kafka® under the same cluster size and different traffic scales. The test scenarios are as follows:

  1. Deploy 6 data nodes for each system and create a Topic with 100 partitions.

  2. Run 1:1 read/write traffic at 100 MiB/s and at 200 MiB/s (message size = 4 KB, batch size = 200 KB). In addition, test each system for its maximum throughput.

Load files: [tail-read-100mb.yaml], [tail-read-200mb.yaml], [tail-read-900mb.yaml]
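To get a feel for what these workloads mean in message terms, the traffic rates can be converted into messages and batches per second. The sketch below is only illustrative: it assumes the "4 KB" and "200 KB" figures are binary units (4 KiB and 200 KiB), and the function name `workload_rates` is hypothetical, not part of the benchmark.

```python
MIB = 1024 * 1024
KIB = 1024


def workload_rates(throughput_mib_s, message_bytes=4 * KIB, batch_bytes=200 * KIB):
    """Convert a target throughput into message and batch rates.

    Assumes binary units for message and batch size (an assumption,
    the article only says 4kb / 200kb).
    """
    messages_per_sec = throughput_mib_s * MIB / message_bytes
    messages_per_batch = batch_bytes // message_bytes
    batches_per_sec = messages_per_sec / messages_per_batch
    return messages_per_sec, messages_per_batch, batches_per_sec


for rate in (100, 200):
    msgs, per_batch, batches = workload_rates(rate)
    print(f"{rate} MiB/s -> {msgs:,.0f} msg/s, {per_batch} msg/batch, {batches:,.0f} batches/s")
```

Under these assumptions, the 100 MiB/s workload corresponds to 25,600 messages per second packed 50 to a batch, and the 200 MiB/s workload doubles the message rate.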

Detailed Data on Send Duration and E2E Duration:

Analysis:

  1. In a cluster of the same scale, AutoMQ's maximum throughput (870MB/s) is three times that of Apache Kafka (280MB/s).

  2. Under the same cluster scale and traffic (200 MiB/s), AutoMQ's P999 latency is 1/50th that of Apache Kafka, and the E2E latency is 1/13th that of Apache Kafka.

  3. Under the same cluster scale and traffic (200 MiB/s), AutoMQ's bandwidth usage is 1/3rd that of Apache Kafka.

3.2 Comparison of Catch-up Read Performance

Catch-up reading is a common scenario in message and stream systems:

  • In messaging, queues are typically used to decouple business processes and to smooth out traffic peaks and valleys. Peak smoothing requires the message queue to hold upstream data so that downstream consumers can work through it at their own pace. In this case, the downstream is catching up on cold data that is no longer in memory.

  • For streams, periodic batch processing tasks need to scan and compute data from several hours or even a day ago.

  • Additionally, there are failure scenarios: Consumers may go down for several hours and then come back online; consumer logic issues may be fixed, requiring a catch-up on historical data.

Catch-up read performance is judged on two aspects:

  • Catch-up read speed: The faster the catch-up read, the sooner consumers recover from failures and the sooner batch processing tasks produce analytical results.

  • Read/write isolation: Catch-up reads should have minimal impact on the production rate and latency.

Testing

This test measures the catch-up read performance of AutoMQ and Apache Kafka® under the same cluster scale. The test scenario is as follows:

  1. Deploy 6 data nodes for each system and create a Topic with 100 partitions.

  2. Continuously send data at a throughput of 300 MiB/s.

  3. After sending 1 TiB of data, start the consumer to consume from the earliest offset.

Load file: [catch-up-read.yaml]
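The scenario above implies some simple arithmetic: how long it takes to build the 1 TiB backlog at 300 MiB/s, and how long a consumer needs to drain it while the producer keeps writing. The sketch below works this out. The function names are hypothetical, and the consume rate passed to `catch_up_seconds` is a placeholder the reader must supply, since the article does not state an absolute catch-up rate.

```python
MIB_PER_TIB = 1024 * 1024


def backlog_build_seconds(backlog_tib, produce_mib_s):
    """Time needed to accumulate the backlog at a constant produce rate."""
    return backlog_tib * MIB_PER_TIB / produce_mib_s


def catch_up_seconds(backlog_tib, consume_mib_s, produce_mib_s):
    """Time for a consumer to drain the backlog while writes continue.

    The backlog only shrinks at (consume rate - produce rate), so the
    consumer must read faster than the producer writes.
    """
    if consume_mib_s <= produce_mib_s:
        raise ValueError("consumer never catches up: consume rate <= produce rate")
    return backlog_tib * MIB_PER_TIB / (consume_mib_s - produce_mib_s)


build = backlog_build_seconds(1, 300)
print(f"Backlog build time: {build / 60:.1f} min")
```

At 300 MiB/s, producing the 1 TiB backlog takes roughly 58 minutes before the consumer starts. The second function makes explicit why catch-up speed matters: only the margin above the ongoing write rate reduces the backlog.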

Test Results:

Analysis

  • Under the same cluster size, AutoMQ's catch-up read peak is twice that of Apache Kafka.

  • During the catch-up read, AutoMQ's sending throughput was unaffected, with an average send latency increase of approximately 0.4 ms. In contrast, Apache Kafka's sending throughput decreased by 10%, and the average send latency surged to 900 ms. This is because Apache Kafka reads from the disk during catch-up reads and does not perform IO isolation, occupying the cloud disk's read-write bandwidth. This reduces the write bandwidth, leading to a drop in sending throughput. Moreover, reading cold data from the disk contaminates the page cache, further increasing write latency. In comparison, AutoMQ separates reads and writes, utilizing object storage for reads during catch-up, which does not consume disk read-write bandwidth and hence does not affect sending throughput and latency.

3.3 Partition Reassignment Capability Comparison

This test measures the time and impact of reassigning a partition with 30 GiB of data to a node that does not currently have a replica of the partition, under a scenario with regular send and consume traffic. The specific test scenario is as follows:

  1. 2 brokers, with the following setup:

    • 1 single-partition single-replica Topic A, continuously reading and writing at a throughput of 40 MiB/s.

    • 1 four-partition single-replica Topic B, continuously reading and writing at a throughput of 10 MiB/s as background traffic.

  2. After 10 minutes, migrate the only partition of Topic A to another node with a migration throughput limit of 100 MiB/s.

Load file: [partition-reassign.yaml]

Analysis

  • AutoMQ partition migration only requires uploading the buffered data from EBS to S3 to safely open it on the new node. Typically, 500 MiB of data can be uploaded within 2 to 5 seconds. The time taken for AutoMQ partition migration is not dependent on the data volume of the partition. The average migration time is around 2 seconds. During the migration process, AutoMQ returns the NOT_LEADER_OR_FOLLOWER error code to clients. After the migration is complete, the client updates to the new Topic routing table and internally retries sending to the new node. As a result, the send latency for that partition will increase temporarily and will return to normal levels after the migration is complete.

  • Apache Kafka® partition reassignment requires copying the partition's replicas to new nodes. While copying historical data, it must also keep up with newly written data. The reassignment duration is calculated as partition data size / (reassignment throughput limit - partition write throughput). In actual production environments, partition reassignment typically takes hours. In this test, reassigning a 30 GiB partition took 15 minutes. Besides the long reassignment duration, Apache Kafka® reassignment necessitates reading cold data from the disk. Even with throttle settings, it can still cause page cache contention, leading to latency spikes and affecting service quality.
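The reassignment duration formula quoted above can be applied to this test's parameters (30 GiB partition, 100 MiB/s throttle, 40 MiB/s write traffic). This is a minimal sketch with a hypothetical function name. It gives an idealized estimate that assumes the copy runs at exactly the throttle limit for the whole reassignment.

```python
def reassignment_seconds(partition_gib, throttle_mib_s, write_mib_s):
    """Idealized Kafka reassignment time.

    duration = partition size / (reassignment throttle - partition write rate)
    """
    if throttle_mib_s <= write_mib_s:
        raise ValueError("throttle must exceed the write rate or the copy never catches up")
    partition_mib = partition_gib * 1024
    return partition_mib / (throttle_mib_s - write_mib_s)


seconds = reassignment_seconds(partition_gib=30, throttle_mib_s=100, write_mib_s=40)
print(f"Estimated reassignment time: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

With these inputs the formula yields 512 seconds, about 8.5 minutes. The 15 minutes measured in this test is longer, so the formula is best read as a lower bound rather than a prediction.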
