Project showing how the Spark HDFS-backed state store can be used to deduplicate streaming data
Spark Boilerplate
This is a boilerplate project for Apache Spark. The related blog post can be found at https://www.barrelsofdata.com/spark-boilerplate-using-scala
Build instructions
From the root of the project, run the following commands:
- To remove all compiled classes and the build and log directories
./gradlew clean
- To run tests
./gradlew test
- To build the jar
./gradlew shadowJar
- To run all of the above in one go
./gradlew clean test shadowJar
Run
spark-submit --master yarn --deploy-mode cluster build/libs/spark-boilerplate-1.0.jar
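For a quick check without a YARN cluster, the same jar can usually be submitted in local mode. The following is a minimal sketch, assuming `spark-submit` is on the PATH and the jar name matches the one above. The `submit_cmd` helper is hypothetical and not part of this project. It only prints the command, so it can be reviewed before anything runs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): assembles the spark-submit
# command for the jar produced by ./gradlew shadowJar.
JAR="build/libs/spark-boilerplate-1.0.jar"

submit_cmd() {
  # $1 selects the target: "local" for a single-machine run,
  # anything else falls back to the YARN cluster command from this README.
  if [ "$1" = "local" ]; then
    # local[*] is kept in single quotes so the shell does not treat it as a glob.
    echo "spark-submit --master 'local[*]' $JAR"
  else
    echo "spark-submit --master yarn --deploy-mode cluster $JAR"
  fi
}

# Print the local-mode command for review.
submit_cmd local
```

To execute the printed command instead of only displaying it, one option is `eval "$(submit_cmd local)"`.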