Kinesis — Handling Duplicates
Apr 17, 2023
A quick run-through of handling duplicates in Kinesis Data Streams, in a nutshell :)
Producers
- Retries can create duplicates, usually due to network timeouts, i.e. when a producer sends a record to the stream but never receives an acknowledgement because of a network error. The producer will send the record again, and so on, until it receives an acknowledgement.
- Both records will have unique sequence numbers and be seen as two separate records in the stream. To prevent a consumer from consuming the same data twice, we need to embed a unique record ID in the payload. This can then be de-duplicated in the consumer/application logic (see the sketch after this section).
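As an illustration, here is a minimal sketch of the producer side in Python with boto3. The stream name, payload shape, and the `record_id` field are all assumptions for the example; the point is only that the ID is generated once, client-side, before any retries happen, so both copies of a duplicated record carry the same ID.

```python
import json
import uuid

import boto3  # assumes boto3 is installed and AWS credentials are configured

kinesis = boto3.client("kinesis")

def put_with_record_id(stream_name: str, payload: dict, partition_key: str) -> str:
    """Embed a client-generated unique ID so consumers can de-duplicate.

    Even if a retry causes Kinesis to store this payload twice (under two
    different sequence numbers), both copies carry the same record_id.
    """
    record_id = str(uuid.uuid4())
    payload = {**payload, "record_id": record_id}  # hypothetical field name
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=partition_key,
    )
    return record_id
```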
Consumers
- Retries can cause the application to read data twice.
- Consumer retries can happen when the record processors restart, e.g. when:
- Application is deployed.
- Shards are split or merged.
- Worker terminates.
- Instances are added or removed.
- The fix is to make your application idempotent, e.g. via the suggestion in the producer section above: de-duplicate on the unique record ID and keep a record of state, i.e. whether a consumer has 'consumed' a record previously (see the first sketch after this list).
- AWS suggests, where possible, handling duplicates at the final destination rather than in the consumer/application logic. For example, use the unique record ID as a primary key; the database won't allow a duplicate primary key (see the second sketch below).
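A rough sketch of consumer-side idempotency, assuming producers embed a `record_id` as in the earlier example. The in-memory set is purely illustrative; a real application would keep this state in a durable, shared store (e.g. DynamoDB or Redis) so it survives the record-processor restarts listed above.

```python
import json

# Illustrative only: an in-memory set is lost on restart. A real deployment
# would use a durable, shared store so state survives processor restarts.
processed_ids: set[str] = set()

def process_record(raw_data: bytes) -> None:
    record = json.loads(raw_data)
    record_id = record["record_id"]  # the ID embedded by the producer

    # Idempotency check: skip records we have already processed.
    if record_id in processed_ids:
        return

    handle_business_logic(record)  # hypothetical downstream work
    processed_ids.add(record_id)

def handle_business_logic(record: dict) -> None:
    print(f"processing {record['record_id']}")
```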
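And a sketch of AWS's "handle it at the destination" suggestion, using DynamoDB as an example final destination. The table and attribute names are assumptions; the mechanism is a conditional write that fails if the primary key already exists, turning duplicate records into no-ops.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def write_once(table_name: str, record_id: str, payload: str) -> bool:
    """Insert a record keyed by record_id; duplicates are rejected by the table."""
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={"record_id": {"S": record_id}, "payload": {"S": payload}},
            # The condition fails if an item with this primary key already
            # exists, so a retried/duplicate record becomes a no-op.
            ConditionExpression="attribute_not_exists(record_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate: already written
        raise
```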