Spark Usage
This page describes how to use the MongoDB Hadoop Connector with Spark.
- Obtain the MongoDB Hadoop Connector. You can either build it or download the JARs. The releases page also includes instructions for use with Maven and Gradle. For Spark, all you need is the "core" JAR.
- Get a JAR for the MongoDB Java Driver.
These are the only two dependencies needed to build a project using Spark and MongoDB.
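If you construct the `SparkContext` yourself, one way to make both JARs visible to the executors is to list them on the `SparkConf` (passing them to `spark-submit` with `--jars` also works). The sketch below assumes this approach; the class name, paths, and version numbers are placeholders for wherever you keep the JARs.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class ClasspathSetup {
    public static void main(final String[] args) {
        // Placeholder paths and versions -- point these at your own copies of the JARs.
        SparkConf conf = new SparkConf()
            .setAppName("mongo-hadoop-classpath-example")
            .setJars(new String[]{
                "/path/to/mongo-hadoop-core-x.y.z.jar",
                "/path/to/mongo-java-driver-x.y.z.jar"
            });
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... use sc with the connector's InputFormat/OutputFormat classes ...
        sc.stop();
    }
}
```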
This page walks through the basics of setting up your Spark project to use MongoDB as a source or a sink. Although the example code is in Java, the equivalent code in Python or Scala should work as well. At a high level, here's what we're going to do:
- Create a new `Configuration` so we can set options on the MongoDB Hadoop Connector.
- Create an RDD with `newAPIHadoopRDD`, using this `Configuration` and the `InputFormat` class we want, based on whether we're reading from a live cluster or a BSON snapshot.
- When we're ready to save data back into MongoDB or a BSON file, call `saveAsNewAPIHadoopFile` on the RDD with the `OutputFormat` class we want.
Here's a basic runthrough:
TODO
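Pending a full runthrough, here is a minimal sketch in Java of the three steps above against a live cluster. The class name, database, collection names, and URIs are placeholders; double-check the connector option keys (`mongo.input.uri`, `mongo.output.uri`) against the configuration documentation for the release you're using.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public final class SparkMongoRunthrough {
    public static void main(final String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("mongo-hadoop-runthrough"));

        // Step 1: a Configuration carrying the connector options.
        // The URIs are placeholders for your own database and collections.
        Configuration inputConfig = new Configuration();
        inputConfig.set("mongo.input.uri",
            "mongodb://localhost:27017/db.input");

        // Step 2: read from a live cluster. MongoInputFormat yields
        // (Object, BSONObject) pairs, with the document's _id as the key.
        JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
            inputConfig,
            MongoInputFormat.class,
            Object.class,
            BSONObject.class);

        // ... transform `documents` here as needed ...

        // Step 3: write back to MongoDB. MongoOutputFormat ignores the path
        // argument, but saveAsNewAPIHadoopFile still requires one.
        Configuration outputConfig = new Configuration();
        outputConfig.set("mongo.output.uri",
            "mongodb://localhost:27017/db.output");

        documents.saveAsNewAPIHadoopFile(
            "file:///unused",
            Object.class,
            BSONObject.class,
            MongoOutputFormat.class,
            outputConfig);

        sc.stop();
    }
}
```

To work with a BSON snapshot instead of a live cluster, swap in the connector's BSON file formats (`BSONFileInputFormat` / `BSONFileOutputFormat`) and supply a file path rather than a MongoDB URI.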