When working with messaging systems it is a good practice to architect systems that are asynchronous, idempotent, and independent of message sequence (see Saga Pattern). But for some specific use cases like financial transactions, the order needs to be maintained. This four blog series caters to these specific set of use cases, associated challenges as well the suggested solution approaches.
(Spoiler alert) I have suggested using either Apache HBase or Apache Kudu as a transient store to persist the messages temporarily before processing them in order and this essentially means that this would add latencies to the system due to which it can no longer work as a true real-time solution.
Message ordering is a real nemesis for any application wherein the one ask is to process the messages in real-time and on the other hand, ensure that messages are processed in the exact same order in which they were generated. It is very challenging to satisfy both the requirements simultaneously. Message processing is not a challenge at all but processing them in the desired sequence is! To ensure strict in-order message processing the architects have to consciously add additional components which not only mean additional points of failures in the system but increased latency as well. This blog, Kafka Message Ordering Part I, is the first in the four-part series wherein I would discuss the factors distorting the Kafka message order before proceeding with possible solution architectures. There are no right or wrong solutions, my intent here is to highlight the considerations for designing the most appropriate solution for the use case.
Though I have mentioned Kafka here because of its popularity as a distributed event streaming platform, the same discussion applies to any other enterprise integration(EI) middleware responsible for managing streams of records. In reality, Kafka (or any other EI tool) has no power to exert any control over the messages outside its own realm (i.e. where messages are produced or consumed) and that’s the area of our discussion wherein we would see how we can address the message ordering problem outside Kafka cluster.
What is message ordering? In layman’s terminology, it means ensuring the messages are received or processed by consumers in the exact same order in which they were produced by producers. Now let’s understand why this is considered a dreaded problem.
Assume that you have a number of sensors installed on the sea bed forming an IoT cloud for transmitting acoustic data generated by the collision of water waves with the sea bed. These sensors can be used for a wide variety of applications like predicting the presence of oil/gas underneath, prediction of earthquakes, underwater flora fauna analysis, tracking cargo movement, etc. Pictorially it can be represented as:
As depicted the three sensors send a total of six events (2 per sensor) at different times (represented by t1, t2, etc.). The correct sequence of events as per this representation is e1, e2, e3, e4, e5, and e6. The intent is to process these events in the exact same sequence. Do you know what all things can go awry here? At a high level, there could be issues at the producer side, at the consumer side, or during message transmission from producers to consumers.
Producer Side Factors: Though producers are originators of the events, any order-related issues have a crippling effect on the entire subsystem. In general, if something goes wrong at the producer side then there is no easy way to recover from these issues in downstream systems. The best way is to fix the problem at the source (producer) end itself. There could be two main issues at the producer side which can disrupt the message ordering:-
- Clock sync: Generally it is the actual producer of the event/message which sends the timestamp of the event along with the event data. But imagine what impact it could have in the case of tens of thousands of sensors with a slight time difference in microseconds among a subset of sensors on a real-time system. In the picture above if sensor 3 has a clock difference of negative 100 microseconds as compared to sensor 1, then for all the readings taken by sensor 3 within 100 microseconds after sensor 1’s reading would reflect as pre-sensor1 readings (i.e. reading which came before sensor 1 readings) which is incorrect.
- Producer partitioning the events: Most applications have a throughput requirement that defines the number of events/messages sent per second. But dependence on other integrated third-party systems can have a limiting effect on the throughput requirement. To overcome this third-party (or legacy) systems’ limitations, producers parallelize the data collection and data transmission activity which means that data from a single producer is accumulated and transferred in parallel. For example, let’s assume that every sensor produces 2000 events per second and the ask is to send only half of them, 1000 events, every second to the downstream consumers. Let’s also consider that the third-party service can’t send individual events hence has to consolidate multiple events into a single message but due to message size limitation, only 250 events can be packed inside a single message. In total four messages (m1, m2, m3, m4) have been sent per second per sensor each containing 250 events. It looks something like this:
This is a problem now, where either these messages might arrive out-of-order on multiple partitions of a Kafka topic or pulled for processing in the incorrect order or processed by parallel consumer threads in an unwanted sequence. We can minimize these issues but at the cost of something else, I will discuss the solution to these issues in a subsequent blog.
- Multiple producers: This is a slight variation of the problem above wherein there are sensors that are producing the same type of data/events, and at the consumer side the ordering of events is not limited to intra-sensor reading but inter sensor readings which essentially means that the data from multiple sensors could be interwoven by virtue of time lag between the events. If you see the first figure above the intra-sensor sequence is:
Sensor 1: e1, e4
Sensor2: e2, e5
Sensor 3: e3, e6
Whereas the inter-sensor sequence is: e1, e2, e3, e4, e5, e6.
I would discuss the solution architecture after covering other factors which could alter message ordering at the consumer side as well as the infra layer. This is the topic of the next blog, Kafka Message Ordering Part – 2.