# Spark Structured Streaming Data Deduplication using State Store

This project shows how Spark's built-in HDFS-backed state store can be used to deduplicate data in a stream. The related blog post can be found at https://barrelsofdata.com/data-deduplication-spark-state-store
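Conceptually, streaming deduplication works by keeping keyed state of the records already emitted and evicting that state once a watermark passes it; in Spark, the HDFS-backed state store holds this state between micro-batches. The following is a minimal, self-contained sketch of that idea (hypothetical code, not the project's actual implementation; the class and field names are illustrative):

```scala
// Sketch of watermark-based deduplication. The `seen` set plays the role of
// the HDFS-backed state store; eviction mirrors watermark-driven state cleanup.
case class Event(user: String, ts: Long, temperature: Int)

class Deduplicator(windowSeconds: Long) {
  // Keyed state: (user, event timestamp) pairs already emitted
  private var seen = Set.empty[(String, Long)]
  private var maxTs = Long.MinValue

  /** Emits the event the first time its (user, ts) pair is seen; drops duplicates. */
  def process(e: Event): Option[Event] = {
    maxTs = math.max(maxTs, e.ts)
    // Evict state older than the watermark (maxTs - windowSeconds),
    // similar to what Spark does between micro-batches
    seen = seen.filter { case (_, ts) => ts >= maxTs - windowSeconds }
    if (seen.contains((e.user, e.ts))) None
    else { seen += ((e.user, e.ts)); Some(e) }
  }
}
```

Because the state is bounded by the window, memory use stays proportional to the number of distinct keys inside the watermark, not the whole stream.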

## Build instructions

From the root of the project, execute the commands below.

- Clear all compiled classes, build, and log directories:
  ```shell
  ./gradlew clean
  ```
- Run tests:
  ```shell
  ./gradlew test
  ```
- Build the jar:
  ```shell
  ./gradlew build
  ```

## Run

Ensure your local Hadoop cluster is running (hadoop cluster tutorial) and start two Kafka brokers (kafka tutorial).

- Create the Kafka topic:
  ```shell
  kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 2 --topic streaming-data
  ```
- Start the streaming job:
  ```shell
  spark-submit --master yarn --deploy-mode cluster build/libs/spark-state-store-data-deduplication-1.0.0.jar <KAFKA_BROKER> <KAFKA_TOPIC> <FULL_OUTPUT_PATH> <DEDUPLICATED_OUTPUT_PATH> <WINDOW_SIZE_SECONDS>
  ```
  Example:
  ```shell
  spark-submit --master yarn --deploy-mode client build/libs/spark-state-store-data-deduplication-1.0.0.jar localhost:9092 streaming-data fullOutput deduplicatedOutput 5
  ```
- Feed simulated data to the Kafka topic: open a new terminal and run the shell script located at src/test/resources/dataProducer.sh. It produces two instances of the following JSON structure every second: `{"ts":1594307307,"usr":"user1","tmp":98}`
  ```shell
  cd src/test/resources
  ./dataProducer.sh localhost:9092 streaming-data
  ```
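In essence, the producer emits each record twice, which is exactly the duplication the streaming job later removes. A hypothetical Scala sketch of that record format (the `record` helper is illustrative, not part of the project):

```scala
// Build one record in the JSON shape dataProducer.sh sends
def record(user: String, tempF: Int, ts: Long): String =
  s"""{"ts":$ts,"usr":"$user","tmp":$tempF}"""

// The producer sends every record twice; simulate that here
val lines = Seq.fill(2)(record("user1", 98, 1594307307L))
```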

## View Results

Open a spark-shell and run the following code, changing the paths to where the outputs are stored:

```scala
spark.read.parquet("fullOutput").orderBy("user", "eventTime").show(truncate = false)

spark.read.parquet("deduplicatedOutput").orderBy("user", "eventTime").show(truncate = false)
```