Commit e16a295

aman-ebay authored and tswast committed

Create python-api-walkthrough.md (#1966)

* Create python-api-walkthrough.md

  This Google Cloud Shell walkthrough is linked to Cloud Dataproc documentation
  to be published at:
  https://cloud.google.com/dataproc/docs/tutorials/python-library-example

* Update python-api-walkthrough.md

1 parent c611792 commit e16a295

1 file changed: +165 −0 lines changed

dataproc/python-api-walkthrough.md

# Use the Python Client Library to call Cloud Dataproc APIs

Estimated completion time: <walkthrough-tutorial-duration duration="5"></walkthrough-tutorial-duration>

## Overview

This [Cloud Shell](https://cloud.google.com/shell/docs/) walkthrough leads you
through the steps to use the
[Google APIs Client Library for Python](http://code.google.com/p/google-api-python-client/)
to programmatically interact with [Cloud Dataproc](https://cloud.google.com/dataproc/docs/).

As you follow this walkthrough, you run Python code that calls
[Cloud Dataproc REST API](https://cloud.google.com/dataproc/docs/reference/rest/)
methods to:

* create a Cloud Dataproc cluster
* submit a small PySpark word-sort job to run on the cluster
* get job status
* tear down the cluster after job completion
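For orientation, here is a minimal sketch of how the client library reaches these REST methods (the project ID is a placeholder, and `list` is used only to show the call pattern; the walkthrough's real code lives in `submit_job_to_cluster.py`):

```python
from googleapiclient.discovery import build

# Build a client for the Cloud Dataproc v1 REST API using
# Application Default Credentials (available in Cloud Shell).
dataproc = build("dataproc", "v1")

# Each REST method hangs off projects().regions(); for example,
# list the clusters in a project's "global" region.
response = dataproc.projects().regions().clusters().list(
    projectId="your-project-id", region="global").execute()
for cluster in response.get("clusters", []):
    print(cluster["clusterName"], cluster["status"]["state"])
```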
## Using the walkthrough

The `submit_job_to_cluster.py` file used in this walkthrough is opened in the
Cloud Shell editor when you launch the walkthrough. You can view
the code as you follow the walkthrough steps.

**For more information**: See [Cloud Dataproc&rarr;Use the Python Client Library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example) for
an explanation of how the code works.

**To reload this walkthrough:** Run the following command from the
`~/python-docs-samples/dataproc` directory in Cloud Shell:

    cloudshell launch-tutorial python-api-walkthrough.md

**To copy and run commands**: Click the "Paste in Cloud Shell" button
(<walkthrough-cloud-shell-icon></walkthrough-cloud-shell-icon>)
on the side of a code box, then press `Enter` to run the command.
## Prerequisites (1)

1. Create or select a Google Cloud Platform project to use for this tutorial.
   * <walkthrough-project-billing-setup permissions=""></walkthrough-project-billing-setup>

1. Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs in your project.
   * <walkthrough-enable-apis apis="dataproc,compute_component,storage-component.googleapis.com"></walkthrough-enable-apis>
## Prerequisites (2)

1. This walkthrough uploads a PySpark file (`pyspark_sort.py`) to a
   [Cloud Storage bucket](https://cloud.google.com/storage/docs/key-terms#buckets) in
   your project.
   * You can use the [Cloud Storage browser page](https://console.cloud.google.com/storage/browser)
     in the Google Cloud Platform Console to view existing buckets in your project.

     &nbsp;&nbsp;&nbsp;&nbsp;**OR**

   * To create a new bucket, run the following command. Your bucket name must be unique.
     ```bash
     gsutil mb -p {{project-id}} gs://your-bucket-name
     ```

1. Set environment variables.

   * Set the name of your bucket.
     ```bash
     BUCKET=your-bucket-name
     ```
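Inside `submit_job_to_cluster.py`, uploading `pyspark_sort.py` to this bucket is a small Cloud Storage call. A minimal sketch, assuming the `google-cloud-storage` package is installed (the function name `upload_pyspark_file` is illustrative, not necessarily the sample's own):

```python
from google.cloud import storage

def upload_pyspark_file(project_id, bucket_name, filename, local_path):
    """Upload the PySpark job file to the Cloud Storage bucket."""
    client = storage.Client(project=project_id)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_filename(local_path)

# Example: upload the local pyspark_sort.py into the bucket set above.
upload_pyspark_file("your-project-id", "your-bucket-name",
                    "pyspark_sort.py", "pyspark_sort.py")
```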
## Prerequisites (3)

1. Set up a Python
   [virtual environment](https://virtualenv.readthedocs.org/en/latest/)
   in Cloud Shell.

   * Create the virtual environment.
     ```bash
     virtualenv ENV
     ```
   * Activate the virtual environment.
     ```bash
     source ENV/bin/activate
     ```

1. Install library dependencies in Cloud Shell.
   ```bash
   pip install -r requirements.txt
   ```
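As a quick check that the install succeeded, you can confirm the two client libraries the sample relies on are importable (this assumes `requirements.txt` includes `google-api-python-client` and `google-cloud-storage`):

```python
# Run inside the activated virtualenv; an ImportError means a
# dependency from requirements.txt is missing.
import googleapiclient.discovery
from google.cloud import storage

print("google-api-python-client importable:", googleapiclient.discovery.build is not None)
print("google-cloud-storage importable:", storage.Client is not None)
```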
## Create a cluster and submit a job

1. Set a name for your new cluster.
   ```bash
   CLUSTER=new-cluster-name
   ```

1. Set a [zone](https://cloud.google.com/compute/docs/regions-zones/#available)
   where your new cluster will be located. You can change the
   "us-central1-a" zone that is pre-set in the following command.
   ```bash
   ZONE=us-central1-a
   ```

1. Run `submit_job_to_cluster.py` with the `--create_new_cluster` flag
   to create a new cluster and submit the `pyspark_sort.py` job
   to the cluster (a sketch of the underlying API calls follows this list).

   ```bash
   python submit_job_to_cluster.py \
       --project_id={{project-id}} \
       --cluster_name=$CLUSTER \
       --zone=$ZONE \
       --gcs_bucket=$BUCKET \
       --create_new_cluster
   ```
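Under the hood, `--create_new_cluster` drives two Dataproc v1 REST calls, `clusters().create` and `jobs().submit`. A minimal sketch of both, with placeholder project, cluster, and bucket names and no error handling (the sample's actual code is in `submit_job_to_cluster.py`):

```python
from googleapiclient.discovery import build

dataproc = build("dataproc", "v1")
project_id = "your-project-id"    # placeholder
region = "global"                 # region paired with a zoneUri-pinned cluster
zone_uri = ("https://www.googleapis.com/compute/v1/projects/"
            "{}/zones/us-central1-a".format(project_id))

# Create the cluster; the API returns a long-running operation.
cluster_body = {
    "projectId": project_id,
    "clusterName": "new-cluster-name",
    "config": {"gceClusterConfig": {"zoneUri": zone_uri}},
}
dataproc.projects().regions().clusters().create(
    projectId=project_id, region=region, body=cluster_body).execute()

# Submit the PySpark job, pointing at the file uploaded to Cloud Storage.
job_body = {
    "job": {
        "placement": {"clusterName": "new-cluster-name"},
        "pysparkJob": {
            "mainPythonFileUri": "gs://your-bucket-name/pyspark_sort.py"
        },
    }
}
result = dataproc.projects().regions().jobs().submit(
    projectId=project_id, region=region, body=job_body).execute()
print("Submitted job ID {}".format(result["reference"]["jobId"]))
```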
## Job Output

Job output in Cloud Shell shows cluster creation, job submission,
job completion, and then tear-down of the cluster:

```
...
Creating cluster...
Cluster created.
Uploading pyspark file to GCS
new-cluster-name - RUNNING
Submitted job ID ...
Waiting for job to finish...
Job finished.
Downloading output file
.....
['Hello,', 'dog', 'elephant', 'panther', 'world!']
...
Tearing down cluster
```
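The `Waiting for job to finish...` lines come from polling the job until it reaches a terminal state. A sketch of such a polling loop (illustrative; not necessarily the sample's exact code):

```python
import time

from googleapiclient.discovery import build

dataproc = build("dataproc", "v1")

def wait_for_job(project_id, region, job_id, poll_seconds=5):
    """Poll jobs().get until the job reaches a terminal state."""
    while True:
        job = dataproc.projects().regions().jobs().get(
            projectId=project_id, region=region, jobId=job_id).execute()
        state = job["status"]["state"]
        if state == "ERROR":
            raise RuntimeError(job["status"].get("details", "job failed"))
        if state == "DONE":
            print("Job finished.")
            return job
        time.sleep(poll_seconds)
```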
## Congratulations on Completing the Walkthrough!
<walkthrough-conclusion-trophy></walkthrough-conclusion-trophy>

---

### Next Steps:

* **View job details from the Console.** View job details by selecting the
  PySpark job from the Cloud Dataproc
  [Jobs page](https://console.cloud.google.com/dataproc/jobs)
  in the Google Cloud Platform Console.

* **Delete resources used in the walkthrough.**
  The `submit_job_to_cluster.py` script deletes the cluster that it created for
  this walkthrough (a sketch of the equivalent API call appears after this list).

  If you created a bucket to use for this walkthrough,
  you can run the following command to delete the
  Cloud Storage bucket (the bucket must be empty):
  ```bash
  gsutil rb gs://$BUCKET
  ```
  You can run the following command to delete the bucket **and all
  objects within it. Note: the deleted objects cannot be recovered.**
  ```bash
  gsutil rm -r gs://$BUCKET
  ```

* **For more information.** See the [Cloud Dataproc documentation](https://cloud.google.com/dataproc/docs/)
  for API reference and product feature information.
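For reference, the cluster tear-down that `submit_job_to_cluster.py` performs corresponds to a single REST call; a minimal sketch with the same placeholder names used above:

```python
from googleapiclient.discovery import build

dataproc = build("dataproc", "v1")

# Delete the walkthrough cluster; like create, this returns an operation.
dataproc.projects().regions().clusters().delete(
    projectId="your-project-id",
    region="global",
    clusterName="new-cluster-name").execute()
```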
