Kinesis — Handling Duplicates
Apr 17, 2023
A quick run-through of handling duplicates in Kinesis Data Streams, in a nutshell :)
Producers
- Retries can create duplicates, usually due to network timeouts, i.e. when a producer sends a record to the stream but never receives an acknowledgement because of a network error. The producer will send the record again, and so on, until it receives an acknowledgement.
- Both records will have unique sequence numbers and be seen as two separate records in the stream. To prevent a consumer from consuming the same data twice, we need to embed a unique record ID in the payload. This can then be de-duplicated in the consumer/application logic (see the sketch after this section).
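As an illustration, here is a minimal sketch of the producer side in Python with boto3. The stream name, payload shape, and the `record_id` field are all assumptions for the example; the point is only that the ID is generated once, client-side, before any retries happen, so both copies of a duplicated record carry the same ID.

```python
import json
import uuid

import boto3  # assumes boto3 is installed and AWS credentials are configured

kinesis = boto3.client("kinesis")

def put_with_record_id(stream_name: str, payload: dict, partition_key: str) -> str:
    """Embed a client-generated unique ID so consumers can de-duplicate.

    Even if a retry causes Kinesis to store this payload twice (under two
    different sequence numbers), both copies carry the same record_id.
    """
    record_id = str(uuid.uuid4())
    payload = {**payload, "record_id": record_id}  # hypothetical field name
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=partition_key,
    )
    return record_id
```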
Consumers
- Retries can cause the application to read data twice.
- Consumer retries can happen when the record processors restart, e.g. when:
- Application is deployed.
- Shards are split or merged.
- Worker terminates.
- Instances are added or removed.
- The fix is to make your application idempotent, e.g. via the suggestion in the producer section above: de-duplicate on the unique record ID and keep a record of state, i.e. whether a consumer has 'consumed' a record previously (see the first sketch after this list).
- AWS suggests, where possible, handling duplicates at the final destination rather than in the consumer/application logic. For example, use the unique record ID as a primary key; the database won't allow a duplicate primary key (see the second sketch below).
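A rough sketch of consumer-side idempotency, assuming producers embed a `record_id` as in the earlier example. The in-memory set is purely illustrative; a real application would keep this state in a durable, shared store (e.g. DynamoDB or Redis) so it survives the record-processor restarts listed above.

```python
import json

# Illustrative only: an in-memory set is lost on restart. A real deployment
# would use a durable, shared store so state survives processor restarts.
processed_ids: set[str] = set()

def process_record(raw_data: bytes) -> None:
    record = json.loads(raw_data)
    record_id = record["record_id"]  # the ID embedded by the producer

    # Idempotency check: skip records we have already processed.
    if record_id in processed_ids:
        return

    handle_business_logic(record)  # hypothetical downstream work
    processed_ids.add(record_id)

def handle_business_logic(record: dict) -> None:
    print(f"processing {record['record_id']}")
```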
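And a sketch of AWS's "handle it at the destination" suggestion, using DynamoDB as an example final destination. The table and attribute names are assumptions; the mechanism is a conditional write that fails if the primary key already exists, turning duplicate records into no-ops.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def write_once(table_name: str, record_id: str, payload: str) -> bool:
    """Insert a record keyed by record_id; duplicates are rejected by the table."""
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={"record_id": {"S": record_id}, "payload": {"S": payload}},
            # The condition fails if an item with this primary key already
            # exists, so a retried/duplicate record becomes a no-op.
            ConditionExpression="attribute_not_exists(record_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate: already written
        raise
```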