LinkedIn software engineer Joel Koshy detailed how he and a team of engineers resolved two Kafka production incidents caused by the interplay of multiple bugs, unusual client behavior, and gaps in monitoring.
The first bug, observed in LinkedIn's change request tracker and deployment platform, manifested as duplicate emails from the service. Koshy noted the root cause was a formatting change to messages and the subsequent termination of the cache-loading process on an offset manager that had been fed a stale offset. Because of the low data volume in the deployment topic's partition, log compaction and purge triggers never fired on that topic, so the stale offset was used as the consumer's starting point; the consumer reread previously consumed messages and triggered the duplicate emails.
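To make the failure mode concrete, the sketch below shows a consumer in a group resuming from whatever offset the offset manager hands back: if that offset is stale, every record between it and the position actually reached earlier is delivered again, so a side effect such as sending an email fires twice unless the handler guards against redelivery. The topic name, group id, and in-memory dedup set are illustrative assumptions, not details from LinkedIn's system.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DeploymentEmailNotifier {

    // Keys already notified, so a replay of old records (e.g. after resuming
    // from a stale committed offset) does not resend emails. Illustrative only.
    private static final Set<String> alreadyNotified = new HashSet<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "deployment-email-notifier"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("deployment-events")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // On restart the group resumes from its committed offset. If that offset
                    // is stale, records already processed are delivered again, so the side
                    // effect has to tolerate redelivery.
                    if (alreadyNotified.add(record.key())) {
                        System.out.println("sending email for " + record.key());
                    }
                }
                consumer.commitSync();
            }
        }
    }
}
```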
The second bug was in a data deployment pipeline in which Hadoop push jobs send data to a non-production Kafka environment, from which it is mirrored to production clusters via Kafka cluster replication. The replication became stuck after an offset fetch returned without a valid checkpoint, indicating that previously checkpointed offsets had been lost. Koshy describes the root cause as
... since the log compaction process had quit a while ago, there were several much older offsets still in the offsets topic. The offset cache load process would have loaded these into the cache. This by itself is fine since the more recent offsets in the log would eventually overwrite the older entries. The problem is that the stale offset cleanup process kicked in during the long offset load and after clearing the old entries also appended tombstones at the end of the log. The offset load process in the meantime continued and loaded the most recent offsets into the cache, but only to remove those when it saw the tombstones. This explains why the offsets were effectively lost.
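For context, a tombstone in a compacted Kafka topic is simply a record whose value is null: once the log cleaner runs, it removes every earlier record with the same key. The minimal producer sketch below (topic and key are placeholders, not the real offsets-topic format) shows what such a delete marker looks like.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value marks the key for deletion during log compaction; the offset
            // manager appends analogous tombstones when it expires stale offset entries.
            producer.send(new ProducerRecord<>("some-compacted-topic", "group/topic/partition-key", null));
            producer.flush();
        }
    }
}
```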
Unclean leader election between Kafka brokers, in which the leader of a partition fails while its replicas are lagging completely behind, results in offset rewinds. Kafka consumers issue fetch requests that specify which offset to consume from, and they checkpoint their offsets for each topic-partition so they can resume from the last checkpoint if they need to restart. Checkpoints can happen when a consumer fails or restarts, or when partitions are added to the topic and the partition distribution across the consumer instances has to change. If a consumer fetch specifies an offset that is out of range for the broker's topic log, it receives an OffsetOutOfRange error and resets its offset to either the latest or the earliest valid offset, according to its auto.offset.reset configuration. Koshy notes that
Resetting to the earliest offset will cause duplicate consumption while resetting to the latest offset means potentially losing some messages that may have arrived between the offset reset and the next fetch.
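A minimal configuration sketch makes the trade-off explicit; the group id and broker address are assumptions for illustration.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetResetConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-group");            // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // What the consumer does when a fetch offset is out of range (OffsetOutOfRange):
        //   "earliest" - rewind to the oldest available offset: nothing is lost, but
        //                previously consumed messages are reprocessed (duplicates).
        //   "latest"   - jump to the log end: no duplicates, but messages that arrived
        //                between the reset and the next fetch are skipped.
        //   "none"     - raise an exception and let the application decide.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe and poll as usual; the reset policy only matters when the
            // committed or requested offset no longer exists on the broker.
        }
    }
}
```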
Koshy also highlighted some best practices for detecting offset rewinds earlier: monitor the cluster's unclean leader election rate; monitor and alert on consumer lag in a way that avoids false positives; and monitor log compaction metrics (especially the max-dirty-ratio sensor) as well as offset-management metrics such as the offset-cache-size, commit-rate, and group-count sensors. Offsets themselves are stored in a replicated, partitioned, and compacted log associated with the internal __consumer_offsets topic, which is populated with a message every time an offset commit request is sent to an offset manager broker. Koshy recommends taking a dump of this internal topic early in the debugging process, before log compaction can remove potentially useful data. Consumer and broker logs are also potentially useful.
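One way to watch consumer lag from the committed offsets themselves is sketched below using the Java AdminClient: it compares each partition's committed offset with the broker's log-end offset. The group id and broker address are assumptions, and the alerting threshold is deliberately left out since tuning it is what avoids false positives.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092";          // placeholder broker
        String groupId = "email-notification-group";  // hypothetical consumer group

        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", bootstrap);

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {

            // Committed offsets for the group, as recorded in __consumer_offsets.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

            for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committed.entrySet()) {
                if (entry.getValue() == null) continue; // no committed offset for this partition
                long lag = endOffsets.get(entry.getKey()) - entry.getValue().offset();
                System.out.printf("%s lag=%d%n", entry.getKey(), lag);
            }
        }
    }
}
```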