Spark Data Platform-101 is a basic Apache Spark setup in a Docker container. It is designed for testing Apache Spark and for learning purposes, as a lightweight alternative to virtual machines, which are large and consume far more resources. This Docker application provides the basic features of Apache Spark, such as:
- Spark Shell
- Pyspark Shell
- Jupyter Notebook: http://localhost:4041
- Spark UI: http://localhost:4040
- Spark History Server: http://localhost:18080
1. Clone the repository:
git clone git@github.com:experientlabs/spark-dp-101.git
2. Build the Docker image:
docker build -t spark-dp-101 .
- Here -t tags the image with the name spark-dp-101.
- Here '.' sets the build context to the current directory, so the Dockerfile must be located in the current directory.
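If you want to confirm the build produced an image before moving on, listing local images is a quick check (a convenience step, not part of the original instructions):

```bash
# List local images whose repository name matches spark-dp-101
docker images spark-dp-101
```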
3. Once the image is built, run the following command to start the container in Jupyter Notebook mode.
hostfolder="$(pwd)"
dockerfolder="/home/sparkuser/app"
docker run --rm -d --name spark-container \
-p 4040:4040 -p 4041:4041 -p 18080:18080 \
-v ${hostfolder}/app:${dockerfolder} -v ${hostfolder}/event_logs:/home/spark/event_logs \
spark-dp-101:latest jupyter
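If the container starts but nothing seems reachable, checking its status and logs usually tells you why. The exact log output depends on how the image launches Jupyter, so treat this as a generic sketch:

```bash
# Confirm the container is up and the ports are published
docker ps --filter name=spark-container

# Follow the container logs, e.g. to see the Jupyter startup message and token
docker logs -f spark-container
```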
To run it in spark-shell mode, use the command below (here the last parameter is replaced with spark-shell).
hostfolder="$(pwd)"
dockerfolder="/home/sparkuser/app"
docker run --rm -it --name spark-container \
-p 4040:4040 -p 4041:4041 -p 18080:18080 \
-v ${hostfolder}/app:${dockerfolder} -v ${hostfolder}/event_logs:/home/spark/event_logs \
spark-dp-101:latest spark-shell
Similarly, to run it in pyspark mode, replace the last parameter with pyspark:
hostfolder="$(pwd)"
dockerfolder="/home/sparkuser/app"
docker run --rm -it --name spark-container \
-p 4040:4040 -p 4041:4041 -p 18080:18080 \
-v ${hostfolder}/app:${dockerfolder} -v ${hostfolder}/event_logs:/home/spark/event_logs \
spark-dp-101:latest pyspark
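As a quick smoke test in pyspark mode, you can type a couple of lines at the pyspark prompt; the shell already provides a SparkSession bound to the name spark, so no extra setup is needed:

```python
# Typed at the pyspark prompt: the shell pre-creates a SparkSession named `spark`
spark.range(5).show()   # tiny DataFrame with ids 0..4
print(spark.version)    # version of the running Spark session
```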
Once the container is running, the following services are available:
- Jupyter Notebook: http://localhost:4041
- Spark UI: http://localhost:4040
- Spark History Server: http://localhost:18080
Terminal window after running the docker run command:
Open http://127.0.0.1:4041/notebooks/first_notebook.ipynb and run the code below in the Jupyter notebook to confirm that Spark is working correctly inside the container.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
# create spark session
spark = SparkSession.builder.appName("SparkSample").getOrCreate()
# read text file
df_text_file = spark.read.text("textfile.txt")
df_text_file.show()
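# add a column counting the words in each line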
df_total_words = df_text_file.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df_total_words.show()
# Word count example
df_word_count = df_text_file.withColumn('word', f.explode(f.split(f.col('value'), ' '))).groupBy('word').count().sort('count', ascending=False)
df_word_count.show()
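If you want to keep the result of the word count, a small optional follow-up is sketched below; the output directory name word_counts_csv is only illustrative, and spark.stop() simply ends the session when you are done:

```python
# Persist the word counts as CSV; the output directory name is illustrative
df_word_count.write.mode("overwrite").csv("word_counts_csv", header=True)

# Stop the Spark session when finished
spark.stop()
```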
The features above can also be accessed using docker-compose commands:
- docker-compose up jupyter
- docker-compose up spark-shell
- docker-compose up pyspark
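Typical docker-compose usage around these services looks like the sketch below; it assumes the service names defined in the repository's docker-compose.yml match the ones used above:

```bash
# Start the Jupyter service in the background
docker-compose up -d jupyter

# Follow its logs
docker-compose logs -f jupyter

# Stop and remove the compose-managed containers when done
docker-compose down
```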
This repository is brought to you by ExperientLabs. If you want to contribute, feel free to raise a PR, and if you come across an issue, don't hesitate to raise it.
To copy an external archive (for example, Unity Catalog) into the running container and extract it there:
docker cp /home/sanjeet/Downloads/unitycatalog-0.1.0.tar.gz be3a8857e400:/home/spark/unitycatalog-0.1.0.tar.gz
tar -xf unitycatalog-0.1.0.tar.gz