Welcome back to the second blog in the message ordering series. If you haven’t read the first blog, Kafka Message Ordering Part – 1, I recommend going through it before proceeding with this one. In the last blog, I highlighted multiple factors that can disrupt the order of Kafka messages at the producer level itself. In this blog, we will look at the factors at the infrastructure layer and on the consumer side that can alter the order in which messages are received and processed.
Infrastructure Layer Factors: As per the OSI model, the bottom four layers (transport, network, data link, and physical) take care of packet transfer and data reassembly between hosts. The throughput and scalability of message transfer middleware depend heavily on asynchronous message processing, i.e. the decoupling of producers and consumers. When messages are sent continuously one after the other, the inherent nature of data transfer networks means the packets can follow different routes (the least congested, the shortest, or the quickest) to the destination, a perfect recipe for out-of-order messages. Essentially, even if a producer does its best to send all the messages in perfect order, there is still no guarantee that they will be received in the same order at the consumer end.
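To build an intuition for this, here is a minimal Python sketch (hypothetical message names and delays, not real network code) that simulates messages sent in perfect order but routed with varying latencies, so the receiver sees them in arrival order:

```python
import random

def deliver(messages, seed=42):
    """Simulate packets taking routes with different latencies: each
    message is sent one time unit apart but arrives after a random
    delay, and the receiver sees them in arrival order."""
    rng = random.Random(seed)
    # (arrival_time, message): send times are 0, 1, 2, ... plus a delay
    arrivals = [(send_time + rng.uniform(0, 5), msg)
                for send_time, msg in enumerate(messages)]
    arrivals.sort(key=lambda a: a[0])  # receiver processes by arrival time
    return [msg for _, msg in arrivals]

sent = ["m1", "m2", "m3", "m4", "m5"]
received = deliver(sent)
# All messages arrive, but the arrival order can differ from the send order.
```

The sender did everything right, yet the receive order depends entirely on per-packet delays the producer cannot control.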
Consumer Side Factors: Like producers, message consumers also try to make the best use of the system by parallelizing message processing. It is this very intent that jumbles the sequencing of the messages. Here is how:
Kafka Partitions: For increased throughput and scalability, Kafka provides multiple partitions in each topic. Producers publish their messages to partitions, either in a round-robin manner or via key-based hashing, so that a message with a given key is always sent to the same partition. More partitions mean more parallelism, and more parallelism undermines the effectiveness of message ordering measures. Messages in each partition are picked up by individual threads of the consumers. The most basic case can be pictorially represented as:
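Key-based partitioning can be sketched as follows. Note that Kafka's default partitioner actually uses a murmur2 hash of the key bytes; this sketch substitutes `zlib.crc32` to stay dependency-free, and the keys and payloads are hypothetical:

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Kafka's default partitioner hashes the key bytes (murmur2);
    # crc32 stands in here so the sketch runs without Kafka libraries.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

topic = defaultdict(list)  # partition id -> ordered list of messages
events = [("order-17", "created"), ("order-42", "created"),
          ("order-17", "paid"), ("order-17", "shipped")]
for key, payload in events:
    topic[partition_for(key)].append((key, payload))

# Every message keyed "order-17" lands in the same partition, so its
# events stay in publish order within that partition.
p17 = partition_for("order-17")
```

Because the hash of a key is stable, all events for one entity share a partition and keep their relative order; events for different keys may end up in different partitions with no ordering between them.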
In this representation, all the messages are processed by only one consumer in a given consumer group, and Kafka ensures that it hands over the messages in the exact order in which they were received within each partition (i.e. the ordering within each partition is maintained). Kafka guarantees intra-partition ordering, not inter-partition ordering. Here the consumer has to iterate over the partitions to retrieve and process the messages. Unless you implement specific logic, in the case above the messages would be processed out of order: m4, with timestamp t4, is processed immediately after m1, whereas m2, which arrived at t2, should have been processed next.
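The m1/m4 situation can be reproduced in a few lines of Python (the partition contents and timestamps are illustrative, mirroring the picture above). It also shows one example of the "specific logic" needed to recover global order, a timestamp-based merge:

```python
import heapq

# Partitions as delivered by Kafka: intra-partition order is preserved,
# but nothing orders messages across partitions.
partitions = {
    0: [("m1", 1), ("m4", 4)],   # (message, timestamp)
    1: [("m2", 2), ("m3", 3)],
}

# A consumer that simply iterates partition by partition:
processed = [msg for pid in sorted(partitions) for msg, _ in partitions[pid]]
# processed == ["m1", "m4", "m2", "m3"]: m4 (t4) is handled right after
# m1, before m2 (t2), i.e. out of global timestamp order.

# Recovering global order needs extra logic, e.g. merging the (already
# sorted) partitions by timestamp:
merged = [msg for msg, _ in heapq.merge(*partitions.values(),
                                        key=lambda m: m[1])]
# merged == ["m1", "m2", "m3", "m4"]
```

The merge works only because each partition is already sorted by timestamp internally, which is exactly the intra-partition guarantee Kafka provides.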
Parallel Processing: Extending the above point a little further: in most architectures, consumers try to increase message processing throughput by using more threads to process the messages in parallel, which is a guaranteed way of distorting the message sequence. Looking at the picture below:
Each consumer thread is allocated a partition and processes the messages in it, and each of these threads is completely unaware of what the other threads are doing.
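A small threading sketch (hypothetical messages and processing delays) makes the point concrete: each thread preserves the order of its own partition, but the interleaving across partitions is arbitrary from run to run:

```python
import random
import threading
import time

partitions = {0: ["m1", "m4"], 1: ["m2", "m5"], 2: ["m3", "m6"]}
processed, lock = [], threading.Lock()

def consume(pid):
    # Each thread owns one partition and knows nothing about the others.
    for msg in partitions[pid]:
        time.sleep(random.uniform(0, 0.01))  # variable processing time
        with lock:
            processed.append(msg)

threads = [threading.Thread(target=consume, args=(pid,))
           for pid in partitions]
for t in threads:
    t.start()
for t in threads:
    t.join()

# `processed` contains all six messages; per-partition order holds
# (m1 before m4, m2 before m5, m3 before m6), but the interleaving
# across partitions depends on thread scheduling.
```

Run it a few times and the contents of `processed` will differ, which is precisely why multi-threaded consumption cannot preserve a global sequence by itself.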
Re-partitioning: This is the zenith of parallel processing, wherein the messages from all the partitions are redistributed to achieve the desired parallelism. In the previous approach, it was one consumer thread per partition, yielding a maximum parallelism of three. For a moment, assume that only Partition #1 gets all the messages, implying that only one-third of the compute power is being used while the other two consumer threads sit idle. In such use cases, with an unequal message load in each partition, it is sometimes required to distribute the load equally across the available compute instances on the consumer end. For example, if there are three partitions with 100, 200, and 2700 messages respectively (3000 messages in total), it is worthwhile to repartition these 3000 messages into, say, 15 partitions of 200 messages each to speed up the processing:
Each vcore now gets a task (partition) to process, and hence all partitions can be processed in parallel with an equal load on each vcore. Unless you implement specific key-based logic, the repartitioning would destroy even the intra-partition ordering we had earlier.
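The skewed-load example above can be sketched with a simple round-robin redistribution (the message contents are just integers here; a real repartition would be done by the processing framework):

```python
from collections import defaultdict
from itertools import chain

# Skewed load from the example: source partitions with 100, 200 and
# 2700 messages respectively (3000 in total).
source = {0: list(range(100)),
          1: list(range(100, 300)),
          2: list(range(300, 3000))}

# Round-robin repartition into 15 target partitions to equalise load.
target = defaultdict(list)
for i, msg in enumerate(chain.from_iterable(source.values())):
    target[i % 15].append(msg)

sizes = [len(msgs) for msgs in target.values()]
# Every target partition now holds exactly 200 messages, but messages
# that sat consecutively in one source partition are scattered across
# all 15 targets, each processed by an independent consumer, so their
# relative order is no longer enforced anywhere.
```

Only a key-based redistribution (hashing each message's key to pick the target partition, as in the partitioner sketch earlier) would keep all messages of one key together and preserve their relative order.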
Essentially, no matter how carefully you ensure message order in upstream systems, there is a good chance that the sequence will be altered by the subsequent layers, depending on the use case implementation. Hence, while architecting a solution, it is advisable to build in enough flexibility so that, if required, the message order can be restored in downstream systems. I will discuss one of the two solutions in my next blog, Kafka Message Ordering Part – 3.