
Commit b2d35bf

leahecole and dandhlee authored
Apply suggestions from code review
Co-authored-by: Dan Lee <[email protected]>
1 parent c182fa8 commit b2d35bf

File tree

3 files changed (+8 / -10 lines)


composer/2022_airflow_summit/DATAPROC_EXPANSION_README.md

Lines changed: 3 additions & 3 deletions
@@ -10,7 +10,7 @@ Go through the tutorial to [Run a data analytics DAG in Google Cloud](https://cl
## About this example

-This directory has a DAG similar to the data analytics DAG found in the [Run a data analytics DAG in Google Cloud](https://cloud.google.com/composer/docs/data-analytics-googlecloud) tutorial, but includes a more complicated data processing step with Dataproc. Instead of answering the question, "How warm was it in Chicago on Thanksgiving for the past 25 years?" we will answer the question, "How have the rainfall patterns changed over the past 25 years in the western part of the US and in Phoenix, AZ?" We define the western part of the US as the [census defined West region](https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf), and chose Phoenix as it is a city that has been affected by climate change in recent years, especially with respect to water.
+This directory has a DAG similar to the data analytics DAG found in the [Run a data analytics DAG in Google Cloud](https://cloud.google.com/composer/docs/data-analytics-googlecloud) tutorial, but includes a more complicated data processing step with Dataproc. Instead of answering the question, "How warm was it in Chicago on Thanksgiving for the past 25 years?" you will answer the question, "How have the rainfall patterns changed over the past 25 years in the western part of the US and in Phoenix, AZ?" For this example, the western part of the US is defined as the [census defined West region](https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf). Phoenix is used in this example because it is a city that has been affected by climate change in recent years, especially with respect to water.

The Dataproc Serverless job uses [arithmetic mean](https://www.weather.gov/abrfc/map#arithmetic_mean) to calculate precipitation and snowfall in the western states, and uses [distance weighting](https://www.weather.gov/abrfc/map#distance_weighting) to focus on the Phoenix specific area.
@@ -23,13 +23,13 @@ The DAG has three steps:
1. Removing any data points that are not from weather stations located in the Western US
2. Removing any data points that are not about snow or other precipitation (data where `ELEMENT` is not `SNOW` or `PRCP`)
3. Convert the values in the `ELEMENT` column (now equal to `SNOW` or `PRCP`) to be in mm, instead of tenths of a mm.
-4. Extract the year from the date so the Date column is left only with the year
+4. Extract the year from the date so the `Date` column is left only with the year
5. Calculate the [arithmetic mean](https://www.weather.gov/abrfc/map#arithmetic_mean) of precipitation and of snowfall
6. Calculate the [distance weighting](https://www.weather.gov/abrfc/map#distance_weighting) for Phoenix.
7. Write the results to tables in BigQuery

## Running this sample
-* Add `data_analytics_dag_expansion.py` to the Composer environent you used in the previous tutorial
+* Add `data_analytics_dag_expansion.py` to the Composer environment you used in the previous tutorial
* Add `data_analytics_process_expansion.py` and `ghcn-stations-processed.csv` to the Cloud Storage bucket you created in the previous tutorial
* Create an empty BigQuery dataset called `precipitation_changes`
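For orientation, the per-year aggregation that steps 1–5 of the README describe boils down to a filter-and-average over the GHCN records. The sketch below is a rough, hypothetical PySpark version of those steps, not the code in `data_analytics_process_expansion.py`; the input path and the `STATE` column are assumptions made for the example (the real job derives the state from the station metadata file).

```python
# Hypothetical sketch of README steps 1-5, assuming the records already carry
# ID, STATE, DATE, ELEMENT and VALUE columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, substring

spark = SparkSession.builder.appName("precipitation-trends-sketch").getOrCreate()
df = spark.read.csv("gs://example-bucket/ghcn-daily.csv", header=True, inferSchema=True)

western_states = ["AZ", "CA", "CO", "ID", "MT", "NM", "NV", "OR", "UT", "WA", "WY"]

annual_prcp_mean = (
    df.filter(col("STATE").isin(western_states))        # 1. western stations only
    .filter(col("ELEMENT").isin("PRCP", "SNOW"))         # 2. keep snow/precipitation rows
    .withColumn("VALUE", col("VALUE") / 10)              # 3. tenths of a mm -> mm
    .withColumn("YEAR", substring(col("DATE"), 1, 4))    # 4. keep only the year
    .filter(col("ELEMENT") == "PRCP")
    .groupBy("YEAR")                                      # 5. arithmetic mean per year
    .agg(avg("VALUE").alias("ANNUAL_PRCP_MEAN"))
    .sort("YEAR")
)
annual_prcp_mean.show(n=50)
```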

composer/2022_airflow_summit/data_analytics_dag_expansion.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
# Dataproc configs
BUCKET_NAME = "{{var.value.gcs_bucket}}"
-PYSPARK_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
+PYSPARK_JAR = "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.26.0.jar"
PROCESSING_PYTHON_FILE = f"gs://{BUCKET_NAME}/data_analytics_process_expansion.py"

BATCH_ID = "data-processing-{{ ts_nodash | lower}}"  # Dataproc serverless only allows lowercase characters
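Elsewhere in the DAG (not shown in the hunk above), these constants feed the Dataproc Serverless batch task. As a rough illustration only, a `DataprocCreateBatchOperator` from the `apache-airflow-providers-google` package could be wired to them roughly as follows; the `task_id`, `region`, and `project_id` values are placeholders, not taken from the actual file.

```python
# Hypothetical sketch only: how BUCKET_NAME / PYSPARK_JAR / PROCESSING_PYTHON_FILE /
# BATCH_ID could feed a Dataproc Serverless batch task. Values marked as
# placeholders are assumptions, not taken from data_analytics_dag_expansion.py.
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

create_batch = DataprocCreateBatchOperator(
    task_id="create_batch",
    project_id="{{var.value.gcp_project}}",  # placeholder
    region="us-central1",                    # placeholder
    batch_id=BATCH_ID,  # Dataproc Serverless requires lowercase batch IDs
    batch={
        "pyspark_batch": {
            "main_python_file_uri": PROCESSING_PYTHON_FILE,
            "jar_file_uris": [PYSPARK_JAR],
        },
    },
)
```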

composer/2022_airflow_summit/data_analytics_process_expansion.py

Lines changed: 4 additions & 6 deletions
@@ -53,7 +53,7 @@
except Py4JJavaError:
    raise Exception(f"Error reading {READ_TABLE}")

-# Since our goal is to focus on the western US, we first filter out non-western states of the US.
+# Since the goal is to focus on the western US, you first filter out non-western states of the US.
# The definition of western US can be found in the following link:
# https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
western_states = ["AZ", "CA", "CO", "ID", "MT", "NM", "NV", "OR", "UT", "WA", "WY"]
@@ -84,7 +84,7 @@
    .agg(avg("VALUE").alias("ANNUAL_PRCP_MEAN"))
    .sort("YEAR")
)
-print("prcp mean table")
+print("PRCP mean table")
prcp_mean_df.show(n=50)

# Each year's arithmetic mean of snowfall
# Each year's arithmetic mean of snowfall
@@ -94,7 +94,7 @@
    .agg(avg("VALUE").alias("ANNUAL_SNOW_MEAN"))
    .sort("YEAR")
)
-print("snow mean table")
+print("SNOW mean table")
snow_mean_df.show(n=50)

# Filter out the states to move on to the distance weighting algorithm (DWA)
# Filter out the states to move on to the distance weighting algorithm (DWA)
@@ -140,9 +140,7 @@ def phx_dw_compute(input_list) -> float:
        factor_sum += distance_factor

    # Contains the weights of each station
-    weights_list = []
-    for val in factor_list:
-        weights_list.append(val / factor_sum)
+    weights_list = [val / factor_sum for val in factor_list]

    dwa_result = 0.0
    for row in input_list:
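The replacement line in this hunk simply normalizes each station's distance factor into a weight. As a self-contained illustration of the distance-weighting idea (not the actual `phx_dw_compute` body; the distances and values below are invented for the example):

```python
# Standalone sketch of inverse distance weighting: nearby stations contribute
# more to the estimate for the target point (Phoenix in this sample).
# All numbers below are made up for illustration.
def idw_estimate(stations: list[tuple[float, float]], power: float = 2.0) -> float:
    """stations holds (distance_km, observed_value) pairs for one year."""
    factor_list = [1.0 / (distance ** power) for distance, _ in stations]
    factor_sum = sum(factor_list)

    # Same shape as the list comprehension in the diff: normalize the factors
    # so the weights sum to 1.
    weights_list = [val / factor_sum for val in factor_list]

    # Weighted average of the observed values.
    return sum(w * value for w, (_, value) in zip(weights_list, stations))


print(idw_estimate([(5.0, 210.0), (12.0, 180.0), (40.0, 95.0)]))
```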
