Spark Data Platform-101 is a basic Apache Spark setup in a Docker container. It is designed for testing Apache Spark and for learning purposes, as a lightweight alternative to virtual machines, which are large and consume far more resources. This Docker application provides the basic features of Apache Spark, such as:
- Spark Shell
- Pyspark Shell
- Jupyter Notebook: http://localhost:4041
- Spark UI: http://localhost:4040
- Spark History Server: http://localhost:18080
1. Clone the repository:
git clone git@github.com:experientlabs/spark-dp-101.git
2. Build the Docker image:
docker build -t spark-dp-101 .
- Here -t tags the image with the name spark-dp-101.
- Here '.' sets the build context to the current directory, so the Dockerfile must be located in the current directory.
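If you want to confirm the build produced an image before moving on, listing local images is a quick check (a convenience step, not part of the original instructions):

```bash
# List local images whose repository name matches spark-dp-101
docker images spark-dp-101
```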
3. Once the image is built, run the following command to start the container in Jupyter Notebook mode.
hostfolder="$(pwd)"
dockerfolder="/home/sparkuser/app"
docker run --rm -d --name spark-container \
-p 4040:4040 -p 4041:4041 -p 18080:18080 \
-v ${hostfolder}/app:${dockerfolder} -v ${hostfolder}/event_logs:/home/spark/event_logs \
spark-dp-101:latest jupyter
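If the container starts but nothing seems reachable, checking its status and logs usually tells you why. The exact log output depends on how the image launches Jupyter, so treat this as a generic sketch:

```bash
# Confirm the container is up and the ports are published
docker ps --filter name=spark-container

# Follow the container logs, e.g. to see the Jupyter startup message and token
docker logs -f spark-container
```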
To run it in spark-shell mode, use the command below (here the last parameter is replaced with spark-shell).
hostfolder="$(pwd)"
dockerfolder="/home/sparkuser/app"
docker run --rm -it --name spark-container \
-p 4040:4040 -p 4041:4041 -p 18080:18080 \
-v ${hostfolder}/app:${dockerfolder} -v ${hostfolder}/event_logs:/home/spark/event_logs \
spark-dp-101:latest spark-shell
Similarly, to run it in pyspark mode, replace the last parameter with pyspark:
hostfolder="$(pwd)"
dockerfolder="/home/sparkuser/app"
docker run --rm -it --name spark-container \
-p 4040:4040 -p 4041:4041 -p 18080:18080 \
-v ${hostfolder}/app:${dockerfolder} -v ${hostfolder}/event_logs:/home/spark/event_logs \
spark-dp-101:latest pyspark
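As a quick smoke test in pyspark mode, you can type a couple of lines at the pyspark prompt; the shell already provides a SparkSession bound to the name spark, so no extra setup is needed:

```python
# Typed at the pyspark prompt: the shell pre-creates a SparkSession named `spark`
spark.range(5).show()   # tiny DataFrame with ids 0..4
print(spark.version)    # version of the running Spark session
```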
Once the container is running, the following services are available:
- Jupyter Notebook: http://localhost:4041
- Spark UI: http://localhost:4040
- Spark History Server: http://localhost:18080
Terminal window after running the docker run command:
Open http://127.0.0.1:4041/notebooks/first_notebook.ipynb and run the code below in the Jupyter notebook to confirm that Spark is working correctly inside the container.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
# create spark session
spark = SparkSession.builder.appName("SparkSample").getOrCreate()
# read text file
df_text_file = spark.read.text("textfile.txt")
df_text_file.show()
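# add a column counting the words in each line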
df_total_words = df_text_file.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df_total_words.show()
# Word count example
df_word_count = df_text_file.withColumn('word', f.explode(f.split(f.col('value'), ' '))).groupBy('word').count().sort('count', ascending=False)
df_word_count.show()
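If you want to keep the result of the word count, a small optional follow-up is sketched below; the output directory name word_counts_csv is only illustrative, and spark.stop() simply ends the session when you are done:

```python
# Persist the word counts as CSV; the output directory name is illustrative
df_word_count.write.mode("overwrite").csv("word_counts_csv", header=True)

# Stop the Spark session when finished
spark.stop()
```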
The features above can also be accessed using docker-compose commands:
- docker-compose up jupyter
- docker-compose up spark-shell
- docker-compose up pyspark
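Typical docker-compose usage around these services looks like the sketch below; it assumes the service names defined in the repository's docker-compose.yml match the ones used above:

```bash
# Start the Jupyter service in the background
docker-compose up -d jupyter

# Follow its logs
docker-compose logs -f jupyter

# Stop and remove the compose-managed containers when done
docker-compose down
```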
This repository is brought to you by ExperientLabs. If you want to contribute, feel free to raise a PR, and if you come across an issue, don't hesitate to raise it.
To copy an external archive (for example, Unity Catalog) into the running container and extract it there:
docker cp /home/sanjeet/Downloads/unitycatalog-0.1.0.tar.gz be3a8857e400:/home/spark/unitycatalog-0.1.0.tar.gz
tar -xf unitycatalog-0.1.0.tar.gz