
How AutoMQ Saves Nearly 100% of Kafka's Cross AZ Traffic Costs

lyx2000 edited this page Feb 15, 2025 · 1 revision

TAG: Cross-AZ Traffic Cost, Apache Kafka, WarpStream

Apache Kafka serves as a channel for log collection, CDC, and data lake ingestion. Kafka demands substantial machine resources to handle high-throughput traffic and disk resources to store data, so it accounts for a significant share of the cost of a Big Data system.

When deploying and maintaining Kafka, SREs usually focus on the cost of ECS machines and EBS storage, which can be estimated at deployment time. After running for a while, however, they may find that cross-AZ traffic fees (on AWS and GCP) account for 80% to 90% of the overall bill. Unlike mainstream public clouds in China such as Alibaba Cloud, AWS and Google Cloud charge for all cross-AZ traffic. When a high-traffic application is deployed across multiple AZs, substantial cross-AZ fees follow, and for a traffic-intensive, multi-AZ application like Kafka this traffic cost often dominates.

Apache Kafka's staggering traffic costs

To keep the business serving traffic through a single-AZ (Availability Zone) failure, the high-availability team will typically require multi-AZ deployments for both the application and the Kafka cluster. When a single AZ fails:

  • Stateless applications still retain 2/3 of their capacity and continue to serve external traffic.

  • The Kafka cluster still has 2 replicas, which satisfies min.insync.replicas and keeps send and receive services available.

Although Apache Kafka's multi-AZ deployment architecture can solve the problem of AZ failure disaster recovery, it also brings huge cross-AZ traffic, as shown in the following figure.

  • Produce: Assuming the Producer sets no ShardingKey and partitions are evenly distributed across cluster nodes, at least 2/3 of Produce traffic is sent across AZs; for example, a Producer in AZ1 sends 1/3 of its traffic to AZ2 and another 1/3 to AZ3.

  • Replication: After receiving a message, the Kafka Broker replicates the data to Brokers in the other AZs to ensure durability, generating cross-AZ traffic equal to twice the Produce traffic.

  • Consume: Consumers can avoid cross-AZ traffic by setting client.rack so that they consume from partition replicas in the same AZ.
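
Same-AZ consumption relies on Kafka's rack-aware fetch-from-follower feature (KIP-392). A minimal configuration sketch (the AZ names are placeholders):

```properties
# Broker side: tag each broker with its AZ and enable rack-aware replica selection
broker.rack=us-east-1a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer side: declare the client's AZ so fetches are served by a same-AZ replica
client.rack=us-east-1a
```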

In summary, Apache Kafka's multi-AZ deployment architecture generates cross-AZ traffic equal to at least 2/3 + 2 = 267% of the Produce traffic.

AWS (GCP is similar) prices cross-AZ traffic at $0.01/GB, charged separately on egress and ingress, so the effective rate is $0.02/GB. Taking 3 r6i.large (2C16G) nodes, 30MiB/s of sustained write throughput, and 1 day of retention as an example, Apache Kafka's monthly cross-AZ traffic cost is:

30 * 60 * 60 * 24 * 30 / 1024 * (2/3 + 2) * 0.02 = $4,050 = cross-AZ Produce traffic cost $1,012 + cross-AZ Replication traffic cost $3,038

In comparison:

  • The machine cost is 3 * $0.126/h (r6i.large unit price) * 24 * 30 = $272, only 6.7% of the cross-AZ traffic cost.

  • The storage cost is 30 * 60 * 60 * 24 * 3 (replicas) / 1024 / 0.8 (80% disk utilization) * $0.08 (GP3 price per GB-month) = $759, 18.7% of the cross-AZ traffic cost.
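
The arithmetic above can be checked in a few lines of Python (all prices and assumptions are the article's: 30MiB/s writes, 1-day retention, 3 replicas, AWS list prices):

```python
# Sanity check of the three monthly cost figures above.
SECONDS_PER_DAY = 60 * 60 * 24
SECONDS_PER_MONTH = SECONDS_PER_DAY * 30

# Cross-AZ traffic: (2/3 produce + 2x replication) of the write stream, $0.02/GB
traffic = 30 * SECONDS_PER_MONTH / 1024 * (2 / 3 + 2) * 0.02
# Machines: 3 r6i.large at $0.126/h
machine = 3 * 0.126 * 24 * 30
# Storage: 1 day retained, 3 replicas, 80% disk utilization, GP3 at $0.08/GB-month
storage = 30 * SECONDS_PER_DAY * 3 / 1024 / 0.8 * 0.08

print(round(traffic), round(machine), round(storage))  # 4050 272 759
```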

| - | Traffic | Machine | Storage | Total |
| --- | --- | --- | --- | --- |
| Cost ($/month) | $4,050 | $272 | $759 | $5,081 |
| Cost share | 80% | 5% | 15% | 100% |

AutoMQ saves 90% on traffic costs

As mentioned earlier, Apache Kafka's cross-AZ traffic mainly consists of Produce and Replication. In this article, we introduce how AutoMQ eliminates nearly all of this traffic cost through multi-point writing and a cloud-storage-first architecture.

Multi-point writing

Before introducing how AutoMQ saves cross-AZ traffic costs through multi-point writing, let's review the basic Produce flow:

  1. The Producer first sends a METADATA request to a Broker, which returns the node hosting each partition.

  2. The Producer then sends a PRODUCE request with the messages to the node that owns the target partition.
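
The two steps above can be sketched as follows. This is only a schematic (real clients cache metadata and batch records), and the topic and node names are made up:

```python
# Minimal schematic of the Kafka produce flow: METADATA first, then PRODUCE
# to the partition's owner. The partition-to-node layout here is invented.
METADATA = {  # what a METADATA response conveys: partition -> broker node
    ("orders", 0): "broker-az1",
    ("orders", 1): "broker-az2",
    ("orders", 2): "broker-az3",
}

def produce(topic: str, partition: int, record: bytes) -> str:
    node = METADATA[(topic, partition)]  # step 1: locate the partition's node
    # step 2: a real client now sends a PRODUCE request to `node`
    return f"PRODUCE {len(record)}B -> {node}"

print(produce("orders", 1, b"hello"))  # PRODUCE 5B -> broker-az2
```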

After receiving messages, a Kafka partition performs a series of operations: ordering, assigning offsets, and building time and transaction indexes. For compatibility reasons, AutoMQ did not completely rewrite Kafka the way WarpStream did; it retains Apache Kafka's logical layer to achieve 100% compatibility. A partition is therefore still bound to a single Broker in AutoMQ, so messages sent by a Producer in a different AZ from that Broker inevitably require cross-AZ communication.

For cross-AZ communication, the usual approach is an RPC- or HTTPS-style protocol carried directly over the network, where AWS charges in both directions for a total of $0.02/GB. S3, by contrast, is a region-level service: reads and writes from any AZ in the same region incur no traffic fees, only API and storage charges ($0.005 per 1,000 Puts, $0.0004 per 1,000 Gets, and $0.023 per GB-month of storage). AutoMQ therefore chose S3 as its cross-AZ communication medium on AWS, abstracted a cross-AZ RPC on top of S3, and packaged it into AutoMQ's S3 CROSS ZONE ROUTER component, avoiding cross-AZ traffic costs by using S3 as the communication channel.
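
A quick back-of-envelope comparison of the two channels, assuming data is relayed in 8MiB objects with one Put and one Get each (AWS list prices as quoted above):

```python
# $/GB: direct cross-AZ network transfer vs. relaying through S3.
network_per_gb = 0.01 * 2                          # charged on both in and out
objects_per_gb = 1024 / 8                          # 128 objects per GiB
s3_relay_per_gb = objects_per_gb / 1000 * (0.005 + 0.0004)

print(f"network ${network_per_gb}/GB vs S3 relay ${s3_relay_per_gb:.6f}/GB")
# roughly a 29x difference in S3's favor; the storage cost of the
# short-lived relay objects is negligible
```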

The AutoMQ S3 CROSS ZONE ROUTER is available since version 1.3.0 and is currently in Early Access.

S3 CROSS ZONE ROUTER intercepts METADATA and PRODUCE requests at the KafkaApis layer.

  • METADATA: Identify the Producer's AZ and return only nodes in the same AZ. The Producer thus sends messages only to same-AZ nodes, converging traffic within the AZ.

  • PRODUCE:

    • After receiving a PRODUCE request, the Broker batches the partition data belonging to other AZs and writes it to S3.

    • It then sends the S3 object's metadata to the Broker in the target AZ.

    • The target AZ's Broker reads the data from S3 according to the object metadata and persists it.

    • Once persistence completes, the target Broker responds to the source Broker, which in turn replies to the Producer.

The cross-AZ traffic between the source broker and the target broker is extremely small compared to the data traffic, and can be ignored in cost accounting.

By intercepting METADATA and PRODUCE, AutoMQ achieves multi-point writing to Kafka partitions: a Broker in any AZ can serve PRODUCE requests for any partition, so no cross-AZ traffic fees arise between Producer and Broker.
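
The routing path can be sketched as follows. This is an illustrative sketch, not AutoMQ's actual implementation: the in-memory S3 stand-in, the bucket and key names, and the function boundaries are all invented, and the inter-broker RPC is reduced to a direct call.

```python
import uuid

class InMemoryS3:
    """Stand-in for an S3 client so the sketch runs without AWS; a real
    deployment would use an actual S3 client (e.g. boto3)."""
    def __init__(self):
        self.objects = {}
    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body
    def get_object(self, Bucket, Key):
        return {"Body": self.objects[(Bucket, Key)]}

s3 = InMemoryS3()
BUCKET = "automq-cross-az-demo"  # hypothetical bucket name

def route_produce(source_az: str, target_az: str, batch: bytes) -> dict:
    # Source broker: write the foreign-AZ batch to S3 (one Put, no AZ fee),
    # then send only the tiny object metadata across AZs.
    key = f"router/{source_az}/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=batch)
    meta = {"bucket": BUCKET, "key": key, "size": len(batch)}
    return target_broker_handle(meta)  # stands in for the small cross-AZ RPC

def target_broker_handle(meta: dict) -> dict:
    # Target broker: read the batch back from S3 (one Get, no AZ fee),
    # persist it to the partition, then ack to the source broker/producer.
    batch = s3.get_object(Bucket=meta["bucket"], Key=meta["key"])["Body"]
    # ... append `batch` to the partition log here ...
    return {"status": "persisted", "bytes": len(batch)}

ack = route_produce("az1", "az2", b"record-batch")
print(ack)  # {'status': 'persisted', 'bytes': 12}
```

The key design point the sketch illustrates: only the object metadata (a few hundred bytes) crosses AZs over the network; the bulk data travels through S3, where same-region traffic is free.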

AutoMQ currently batches uploads at 8MiB by default. Using the same 30MiB/s write traffic as in the Apache Kafka example, the monthly S3 API call cost is:

30 * (2/3) * 60 * 60 * 24 * 30 / 8 / 1000 * ($0.005 per thousand Puts + $0.0004 per thousand Gets) = $35
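
The same bill, re-derived in Python (assumptions as stated: 8MiB objects, one Put by the source broker and one Get by the target broker per object):

```python
# Monthly S3 API cost of relaying the cross-AZ share (2/3) of 30 MiB/s.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
relayed_mib = 30 * (2 / 3) * SECONDS_PER_MONTH     # MiB relayed per month
thousand_requests = relayed_mib / 8 / 1000         # 8 MiB per object
api_cost = thousand_requests * (0.005 + 0.0004)    # one Put + one Get per object

print(round(api_cost))  # 35
```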

With multi-point writing, the API cost of using S3 as the cross-AZ communication channel is only 3.5% of the original cross-AZ Produce traffic cost.

Cloud storage first

Finally, let's briefly introduce how the cloud-storage-first strategy helps AutoMQ save replication traffic costs.

AutoMQ stores its data in cloud storage (EBS, S3), which already maintains multiple replicas at the infrastructure layer; S3, for example, provides 11 nines of data durability and AZ-level disaster recovery. AutoMQ therefore needs no additional replication at the application layer, saving 100% of cross-AZ replication traffic.

At the same 30MiB/s write traffic, AutoMQ saves $3,038/month of replication traffic compared with Apache Kafka.

The cloud-storage-first design saves not only replication traffic but also compute and storage costs.

  • Because AutoMQ has no replication traffic, an r6i.large node (99MiB/s bandwidth) can serve 30MiB/s of write traffic, saving 33% of compute resources compared with Apache Kafka.

  • AutoMQ's data lives entirely on S3, at a storage unit price of $0.023 per GB-month. At the same retention as before, the storage cost is 30 * 60 * 60 * 24 / 1024 * 0.023 = $58/month. AutoMQ estimates API call costs at 213 Puts and 640 Gets per GiB of data, i.e. $0.001321 per GiB, so the monthly S3 API cost is 30 * 60 * 60 * 24 * 30 / 1024 * 0.001321 = $100.
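
These two figures can be re-derived the same way (assumptions are the article's: 30MiB/s writes, 1-day retention, 213 Puts and 640 Gets per GiB):

```python
# AutoMQ's monthly S3 storage and API costs at 30 MiB/s with 1-day retention.
SECONDS_PER_DAY = 60 * 60 * 24
storage_cost = 30 * SECONDS_PER_DAY / 1024 * 0.023         # retained GiB * $/GB-month
per_gib_api = 213 / 1000 * 0.005 + 640 / 1000 * 0.0004     # = $0.001321 per GiB
api_cost = 30 * SECONDS_PER_DAY * 30 / 1024 * per_gib_api  # a full month of writes

print(round(storage_cost), round(api_cost))  # 58 100
```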

Cost comparison

AutoMQ vs. Apache Kafka: cost comparison at 30MiB/s write throughput in a multi-AZ scenario

| - | Traffic | Machine | Storage | S3 API | Total |
| --- | --- | --- | --- | --- | --- |
| Apache Kafka | $4,050 | $272 | $759 | $0 | $5,081 |
| AutoMQ | $0 | $90 | $58 | $100 | $248 |

Summary

  • On AWS and GCP, cross-AZ traffic can account for 80% of Apache Kafka's total cost. If AZ-level disaster recovery is not required, deploy a single-AZ cluster wherever possible.

  • For latency-insensitive scenarios (on the order of hundreds of milliseconds), S3 is a low-cost cross-AZ communication channel that can replace direct cross-AZ transfer, saving at least 90% of the cost.
