[![Tests](https://barrelsofdata.com/api/v1/git/action/status/fetch/barrelsofdata/spark-structured-streaming-wordcount/tests)](https://git.barrelsofdata.com/barrelsofdata/spark-structured-streaming-wordcount/actions?workflow=workflow.yaml)
[![Build](https://barrelsofdata.com/api/v1/git/action/status/fetch/barrelsofdata/spark-structured-streaming-wordcount/build)](https://git.barrelsofdata.com/barrelsofdata/spark-structured-streaming-wordcount/actions?workflow=workflow.yaml)
# Spark Structured Streaming Word Count
This is a project detailing how to write a streaming word count program in Apache Spark using Structured Streaming. The related blog post can be found at [https://barrelsofdata.com/spark-structured-streaming-word-count](https://barrelsofdata.com/spark-structured-streaming-word-count)
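To make the word count concrete, here is a rough local equivalent of the job's logic as a plain shell pipeline (a sketch only: it assumes the job splits the `str` field of each message on whitespace and counts word occurrences; the exact splitting and normalization are defined in the job itself):

```shell
# Extract the "str" field from one sample message, split it into words,
# and count occurrences -- the batch analogue of the streaming aggregation.
printf '{"ts":1594307307,"str":"This is an example string"}\n' \
  | sed 's/.*"str":"\([^"]*\)".*/\1/' \
  | tr ' ' '\n' \
  | sort | uniq -c
```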
## Build instructions
From the root of the project, execute the commands below.
- To clear all compiled classes and the build and log directories
```shell script
./gradlew clean
```
- To run tests
```shell script
./gradlew test
```
- To build the jar
```shell script
./gradlew build
```
## Run
Ensure your local Hadoop cluster is running ([hadoop cluster tutorial](https://barrelsofdata.com/apache-hadoop-pseudo-distributed-mode)) and start two Kafka brokers ([kafka tutorial](https://barrelsofdata.com/apache-kafka-setup)).
- Create Kafka topic

```shell script
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 2 --topic streaming-data
```
- Start streaming job

```shell script
spark-submit --master yarn --deploy-mode cluster build/libs/spark-structured-streaming-wordcount-1.0.0.jar <KAFKA_BROKER> <KAFKA_TOPIC>

# Example
spark-submit --master yarn --deploy-mode client build/libs/spark-structured-streaming-wordcount-1.0.0.jar localhost:9092 streaming-data
```
- You can feed simulated data to the Kafka topic
  - Open a new terminal and run the shell script located at `src/test/resources/dataProducer.sh`
  - Produces the following JSON structure every second: `{"ts":1594307307,"str":"This is an example string"}`
```shell script
cd src/test/resources
./dataProducer.sh localhost:9092 streaming-data
```
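If you prefer not to use the script, messages of the same shape can be generated with a small loop (a sketch only: the message text matches the example above, and the loop is capped at three messages for illustration; `dataProducer.sh` remains the reference producer):

```shell
# Emit three messages in the same JSON shape as dataProducer.sh, one per second
for i in 1 2 3; do
  printf '{"ts":%s,"str":"This is an example string"}\n' "$(date +%s)"
  sleep 1
done
```

Pipe the loop's output into the standard Kafka console producer (e.g. `kafka-console-producer.sh --bootstrap-server localhost:9092 --topic streaming-data`) to feed the topic.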