Kinesis — Handling Duplicates

Joshua Callis
1 min read · Apr 17, 2023

A quick run-through of handling duplicates in Kinesis Data Streams, in a nutshell :)

Producers

  • Retries can create duplicates, usually due to network timeouts, i.e. when a producer sends a record to the stream and doesn’t receive an acknowledgement because of a network error. The producer will keep resending the record until it receives one.
  • Each attempt gets its own unique sequence number, so the stream treats the copies as two separate records. To prevent a consumer from consuming the same data twice, we need to embed a unique record ID in the payload; the consumer/application logic can then de-duplicate on that ID (see the sketch below).
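As a rough illustration, here is a minimal Python (boto3) sketch of a producer that embeds a unique record ID before sending. The stream name, field names, and `send` helper are hypothetical, and credentials/region are assumed to be configured:

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")  # assumes region/credentials are configured


def send(payload: dict, stream_name: str = "my-stream") -> None:
    # Generate the unique ID once per logical record, so every
    # resend of this record carries the same ID.
    record_id = str(uuid.uuid4())
    data = json.dumps({"record_id": record_id, **payload}).encode("utf-8")
    kinesis.put_record(
        StreamName=stream_name,
        Data=data,
        PartitionKey=record_id,  # also used as the shard partition key here
    )
```

Using the record ID as the partition key spreads records evenly across shards; if you need per-entity ordering, partition on a business key instead and keep the record ID purely for de-duplication.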

Consumers

  • Retries can cause the application to read the same data twice.
  • Consumer retries can happen when record processors restart, e.g. when:
      • The application is deployed.
      • Shards are split or merged.
      • A worker terminates.
      • Instances are added or removed.
  • The fix is to make your application idempotent, e.g. as suggested in the producer section above: de-duplicate by unique record ID and keep a record of state, i.e. whether a consumer has already processed a given record.
  • AWS suggests handling duplicates at the final destination rather than in the consumer/application logic where possible. For example, use the unique record ID as a primary key, since a database won’t allow a duplicate primary key (see the sketch below).
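As a sketch of that destination-side approach, assuming records carry the record_id embedded by the producer above and the destination is a hypothetical DynamoDB table whose primary key is record_id, a conditional write rejects the duplicate copy:

```python
import json

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("processed-records")  # hypothetical table, PK: record_id


def process(kinesis_record: dict) -> None:
    payload = json.loads(kinesis_record["Data"])  # contains the producer's record_id
    try:
        # The write succeeds only for the first copy of a record; any
        # duplicate with the same record_id fails the condition check.
        table.put_item(
            Item=payload,
            ConditionExpression="attribute_not_exists(record_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate: already written, safe to skip
        raise
```

The same conditional write also makes consumer retries safe: reprocessing a batch after a worker restart or shard split simply trips the condition check and skips the record.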

