
Spark Structured Streaming Word Count

This project shows how to write a streaming word count program in Apache Spark using Structured Streaming, along with unit test cases. The related blog post can be found at https://www.barrelsofdata.com/spark-structured-streaming-word-count

Build instructions

From the root of the project, run the following commands:

  • To clear all compiled classes, build and log directories
./gradlew clean
  • To run tests
./gradlew test
  • To build the jar
./gradlew shadowJar
  • All combined
./gradlew clean test shadowJar

Run

Ensure your local Hadoop cluster is running (Hadoop cluster tutorial) and start two Kafka brokers (Kafka tutorial).

  • Create the Kafka topic
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 2 --topic streaming-data
  • Start streaming job
spark-submit --master yarn --deploy-mode cluster build/libs/spark-structured-streaming-wordcount-1.0.jar <KAFKA_BROKER> <KAFKA_TOPIC>
Example: spark-submit --master yarn --deploy-mode client build/libs/spark-structured-streaming-wordcount-1.0.jar localhost:9092 streaming-data
  • Optionally, feed simulated data to the Kafka topic
  • Open a new terminal and run the shell script located at src/test/resources/dataProducer.sh
  • The script produces the following JSON structure every second: {"ts":1594307307,"str":"This is an example string"}
cd src/test/resources
./dataProducer.sh localhost:9092 streaming-data
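
For readers who want to see how such a producer can be put together, here is a minimal sketch that emits records in the JSON shape described above. It is not the contents of dataProducer.sh, which may be written differently. The function names make_record and produce_records, and the default of five records, are assumptions made for illustration.

```shell
# Hypothetical stand-in for the data producer; the real dataProducer.sh may differ.

# Print one record in the documented shape: {"ts":<epoch seconds>,"str":"<text>"}
# Note: the text is not JSON-escaped, so it must not contain quotes or backslashes.
make_record() {
  printf '{"ts":%s,"str":"%s"}\n' "$(date +%s)" "$1"
}

# Emit COUNT records, pausing INTERVAL seconds after each one.
# Defaults (assumed): 5 records, 1 second apart.
produce_records() {
  count="${1:-5}"
  interval="${2:-1}"
  i=0
  while [ "$i" -lt "$count" ]; do
    make_record "This is an example string"
    sleep "$interval"
    i=$((i + 1))
  done
}
```

To publish the records, the output could be piped into Kafka's console producer, for example with produce_records 10 1 | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic streaming-data. Older Kafka releases expect --broker-list in place of --bootstrap-server for the console producer.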