Spark Usage
This page describes how to use the MongoDB Hadoop Connector with Spark.
- Obtain the MongoDB Hadoop Connector. You can either build it or download the JARs. The releases page also includes instructions for use with Maven and Gradle. For Spark, all you need is the "core" JAR.
- Get a JAR for the MongoDB Java Driver.
These are the only two dependencies needed to build a project using Spark and MongoDB.
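If you construct the `SparkContext` yourself, one way to make both JARs visible to the executors is to list them on the `SparkConf` (passing them to `spark-submit` with `--jars` also works). The sketch below assumes this approach; the class name, paths, and version numbers are placeholders for wherever you keep the JARs.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class ClasspathSetup {
    public static void main(final String[] args) {
        // Placeholder paths and versions -- point these at your own copies of the JARs.
        SparkConf conf = new SparkConf()
            .setAppName("mongo-hadoop-classpath-example")
            .setJars(new String[]{
                "/path/to/mongo-hadoop-core-x.y.z.jar",
                "/path/to/mongo-java-driver-x.y.z.jar"
            });
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... use sc with the connector's InputFormat/OutputFormat classes ...
        sc.stop();
    }
}
```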
This page walks through the basics of setting up your Spark project to use MongoDB as a source or a sink. Although the example code is in Java, the equivalent code in Python or Scala should work as well. At a high level, here's what we're going to do:
- Create a new `Configuration` so we can set options on the MongoDB Hadoop Connector.
- Create an RDD with `newAPIHadoopRDD`, using this `Configuration` and the `InputFormat` class we want, based on whether we're reading from a live cluster or a BSON snapshot.
- When we're ready to save data back into MongoDB or a BSON file, call `saveAsNewAPIHadoopFile` on the RDD with the `OutputFormat` class we want.
Here's a basic runthrough:
TODO
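Pending a full runthrough, here is a minimal sketch in Java of the three steps above against a live cluster. The class name, database, collection names, and URIs are placeholders; double-check the connector option keys (`mongo.input.uri`, `mongo.output.uri`) against the configuration documentation for the release you're using.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public final class SparkMongoRunthrough {
    public static void main(final String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("mongo-hadoop-runthrough"));

        // Step 1: a Configuration carrying the connector options.
        // The URIs are placeholders for your own database and collections.
        Configuration inputConfig = new Configuration();
        inputConfig.set("mongo.input.uri",
            "mongodb://localhost:27017/db.input");

        // Step 2: read from a live cluster. MongoInputFormat yields
        // (Object, BSONObject) pairs, with the document's _id as the key.
        JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
            inputConfig,
            MongoInputFormat.class,
            Object.class,
            BSONObject.class);

        // ... transform `documents` here as needed ...

        // Step 3: write back to MongoDB. MongoOutputFormat ignores the path
        // argument, but saveAsNewAPIHadoopFile still requires one.
        Configuration outputConfig = new Configuration();
        outputConfig.set("mongo.output.uri",
            "mongodb://localhost:27017/db.output");

        documents.saveAsNewAPIHadoopFile(
            "file:///unused",
            Object.class,
            BSONObject.class,
            MongoOutputFormat.class,
            outputConfig);

        sc.stop();
    }
}
```

To work with a BSON snapshot instead of a live cluster, swap in the connector's BSON file formats (`BSONFileInputFormat` / `BSONFileOutputFormat`) and supply a file path rather than a MongoDB URI.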