# Spark Structured Streaming Word Count

This project shows how to write a streaming word count program in Apache Spark using Structured Streaming. The accompanying blog post can be found at https://barrelsofdata.com/spark-structured-streaming-word-count.
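Conceptually, the job splits each record's `str` field into words and maintains running counts per word. As a rough batch analogue (not the project's code), the same transformation can be sketched on two static records with standard shell tools — the `sed`-based JSON extraction is a simplification for illustration only:

```shell
# Batch analogue of the streaming word count: extract the "str" field from
# each JSON record, split it into words, and count occurrences.
printf '%s\n' \
  '{"ts":1594307307,"str":"spark streaming word count"}' \
  '{"ts":1594307308,"str":"word count"}' \
| sed -E 's/.*"str":"([^"]*)".*/\1/' \
| tr ' ' '\n' \
| sort | uniq -c | sort -rn
```

The streaming job does the equivalent continuously, updating the counts as new records arrive on the Kafka topic.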
## Build instructions

Run the following commands from the root of the project.

- To clear all compiled classes, build and log directories:
  ```shell
  ./gradlew clean
  ```
- To run tests:
  ```shell
  ./gradlew test
  ```
- To build the jar:
  ```shell
  ./gradlew build
  ```
## Run

Ensure your local Hadoop cluster is running (hadoop cluster tutorial) and start two Kafka brokers (kafka tutorial).

- Create the Kafka topic:
  ```shell
  kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 2 --topic streaming-data
  ```
- Start the streaming job:
  ```shell
  spark-submit --master yarn --deploy-mode cluster build/libs/spark-structured-streaming-wordcount-1.0.0.jar <KAFKA_BROKER> <KAFKA_TOPIC>
  ```
  Example:
  ```shell
  spark-submit --master yarn --deploy-mode client build/libs/spark-structured-streaming-wordcount-1.0.0.jar localhost:9092 streaming-data
  ```
- You can feed simulated data to the Kafka topic:
  - Open a new terminal and run the shell script located at src/test/resources/dataProducer.sh.
  - It produces the following JSON structure every second: `{"ts":1594307307,"str":"This is an example string"}`
  ```shell
  cd src/test/resources
  ./dataProducer.sh localhost:9092 streaming-data
  ```
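The actual producer script lives in the repository; as a rough sketch, such a script might look like the following, assuming `kafka-console-producer.sh` is on the PATH (the structure and names here are illustrative, not the repository's script):

```shell
#!/usr/bin/env bash
# Illustrative sketch of a producer like dataProducer.sh (the real script may
# differ). Usage: ./producer.sh <broker> <topic>

# Build one record in the documented shape: {"ts":<epoch seconds>,"str":"..."}
make_record() {
  printf '{"ts":%s,"str":"This is an example string"}\n' "$(date +%s)"
}

# With a broker and topic supplied, emit one record per second to the topic.
if [ -n "$1" ] && [ -n "$2" ]; then
  while true; do
    make_record
    sleep 1
  done | kafka-console-producer.sh --bootstrap-server "$1" --topic "$2"
fi
```

Any producer that writes one such JSON record per message to the topic will work; the streaming job only relies on the `ts` and `str` fields shown above.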