# Spark Structured Streaming Data Deduplication using State Store
This is a project showing how the spark in-built HDFS backed state store can be used to deduplicate data in a stream. The related blog post can be found at [https://www.barrelsofdata.com/data-deduplication-spark-state-store](https://www.barrelsofdata.com/data-deduplication-spark-state-store)
Ensure your local hadoop cluster is running ([hadoop cluster tutorial](https://www.barrelsofdata.com/apache-hadoop-pseudo-distributed-mode)) and start two kafka brokers ([kafka tutorial](https://www.barrelsofdata.com/apache-kafka-setup)).