Fixing high latencies in Kafka caused by hot partitions

How to resolve high consumer lag reading from topics with overloaded partitions

One of the first problems I faced using Kafka was investigating consumer lag that exceeded a minute.

Consumer lag here refers to the delay between when events are published to a topic and when they are read by a consumer.

I pinpointed the problem to a hot partition. Here's how I fixed it.

What are partitions?

Data in a Kafka topic can be split up across partitions.

Partitioning allows data to be written to and read from different partitions (and therefore different broker nodes) in parallel. This is what makes Kafka so performant compared to a system where all data is read and written through a single node.
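
A topic's partition count is fixed when the topic is created (though it can be increased later). As a minimal sketch, assuming a hypothetical user-events topic and a local broker, this is how a topic with multiple partitions could be created using Kafka's Java Admin client:

    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreatePartitionedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // 6 partitions, replication factor 3: producers and consumers
                // can then work against the 6 partitions in parallel.
                NewTopic topic = new NewTopic("user-events", 6, (short) 3);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }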

The issue with data being written to different partitions is that you lose any guarantee of ordering across partitions.

Say your application sends events A, B and C (in that order) to your Kafka topic. If each of these events gets sent to a different partition, they might actually arrive at the topic in a different order (e.g. B, C, A).

This is because each event is sent independently to a different partition, so the final order is influenced by external factors like differing network latencies, or how the producer happened to batch and send the events.

The only way to guarantee ordering is to add a partition key to each event. This key is hashed and used to determine which partition the event is sent to. So while ordering is not guaranteed across partitions, partition keys can be used to guarantee ordering within a partition.

For example, if you wanted events related to a particular user to always be read in order, you could set the user ID as the partition key, guaranteeing that all events for that user are persisted to the same partition.
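
Here's a minimal sketch with the plain Kafka Java producer; the topic name and user ID are made up for illustration. The second constructor argument of ProducerRecord is the key: the default partitioner hashes it, so every event with the same key lands on the same partition and is read back in the order it was sent.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String userId = "user-42"; // hypothetical user ID used as the partition key

                // Same key -> same partition, so A, B and C keep their order.
                producer.send(new ProducerRecord<>("user-events", userId, "A"));
                producer.send(new ProducerRecord<>("user-events", userId, "B"));
                producer.send(new ProducerRecord<>("user-events", userId, "C"));
            }
        }
    }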

What is a hot partition?

What happens if you have a set of particularly active users?

Applying the 80-20 rule as a guideline, your most active 20% of users could be responsible for 80% of your traffic.

These power users publish a disproportionately high number of events, and because their user IDs are the partition keys, those events pile up on the same few partitions. The result is an uneven distribution of data: a handful of "hot" partitions receive far more events than the rest.

Reading from a hot partition can cause significant consumer lag, as the consumer assigned to it struggles to keep up with the volume of events.

How do we solve this problem?

There are a few options to consider:

  1. Reducing the time taken to process each event
  2. Optimising your Kafka configuration to minimise latency
  3. Increasing the number of partitions to spread out the distribution of data further
  4. Changing the partition key to something else that could distribute the data more evenly (see the sketch after this list)
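
To illustrate option 4 only (the helper name, bucket count and key format below are assumptions, not something from my setup), one common approach is to "salt" the key, spreading a single hot user across a few partitions while keeping ordering within each salted key:

    import java.util.concurrent.ThreadLocalRandom;

    public class SaltedKeyExample {
        private static final int BUCKETS = 4; // assumed bucket count; tune per workload

        // Spreads one user's events over up to BUCKETS partitions.
        // Usage: producer.send(new ProducerRecord<>("user-events", saltedKey(userId), payload));
        static String saltedKey(String userId) {
            int bucket = ThreadLocalRandom.current().nextInt(BUCKETS);
            return userId + "-" + bucket;
        }
    }

The trade-off is the one described earlier: ordering is now only guaranteed per salted key, not per user, so this only helps when strict per-user ordering isn't required.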

In my particular case, options 1 and 2 were enough to solve the problem.

To reduce the time taken to process each event, I added a database index to speed up a query that was executed during the event's processing flow. Inefficient algorithms and I/O operations (like network calls or database queries) are the culprits I would scrutinise first.
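
My actual query and schema aren't relevant here, but as a rough sketch (the entity, table and column names are invented), an index on the column that the hot query filters by can be declared directly on a JPA entity:

    import jakarta.persistence.Column;
    import jakarta.persistence.Entity;
    import jakarta.persistence.Id;
    import jakarta.persistence.Index;
    import jakarta.persistence.Table;

    // Hypothetical entity: the index on user_id speeds up the lookup
    // performed while processing each event.
    @Entity
    @Table(name = "user_events", indexes = @Index(name = "idx_user_events_user_id", columnList = "user_id"))
    public class UserEvent {

        @Id
        private Long id;

        @Column(name = "user_id")
        private String userId;

        // getters and setters omitted
    }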

To optimise our Kafka configuration, I changed a concurrency setting on our consumers so that events from multiple partitions are read and processed at the same time. I found that by default, Spring Kafka Listeners and Streams use only one thread to read events from all of their assigned partitions. Setting the concurrency to a value greater than or equal to the number of partitions ensures that all partitions can be read from concurrently.
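
As a sketch of that change (the topic, group ID and the value 6 are placeholders for whatever your partition count is), the concurrency can be set directly on a Spring Kafka listener:

    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class UserEventListener {

        // With concurrency = "6" on a 6-partition topic, Spring Kafka starts six
        // consumer threads, so every partition is read and processed in parallel
        // instead of all partitions sharing a single thread.
        @KafkaListener(topics = "user-events", groupId = "user-events-consumer", concurrency = "6")
        public void onEvent(String event) {
            // process the event
        }
    }

The same setting can also be applied globally on the ConcurrentKafkaListenerContainerFactory via setConcurrency, and Kafka Streams has the equivalent num.stream.threads property.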
