Make minor editorial changes

zzhengnan · zzhengnan · commit 13b4b8a75a15 · 2020-11-29T23:15:11.000-05:00
diff --git a/04_dataframe.ipynb b/04_dataframe.ipynb
@@ -27,7 +27,7 @@
    "source": [
     "<img src=\"images/pandas_logo.png\" align=\"right\" width=\"28%\">\n",
     "\n",
-    "The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a large subset of the Pandas `DataFrame`. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrames` separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n",
+    "The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a large subset of the Pandas `DataFrame` API. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrames` separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n",
     "\n",
     "**Related Documentation**\n",
     "\n",
@@ -500,7 +500,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "But lets try by passing both to a single `compute` call."
+    "But let's try by passing both to a single `compute` call."
    ]
   },
   {
@@ -524,7 +524,7 @@
     "- the filter (`df[~df.Cancelled]`)\n",
     "- some of the necessary reductions (`sum`, `count`)\n",
     "\n",
-    "To see what the merged task graphs between multiple results look like (and what's shared), you can use the `dask.visualize` function (we might want to use `filename='graph.pdf'` to zoom in on the graph better):"
+    "To see what the merged task graphs between multiple results look like (and what's shared), you can use the `dask.visualize` function (we might want to use `filename='graph.pdf'` to save the graph to disk so that we can zoom in more easily):"
    ]
   },
   {
@@ -560,7 +560,7 @@
     "Dask.dataframe operations use `pandas` operations internally.  Generally they run at about the same speed except in the following two cases:\n",
     "\n",
     "1.  Dask introduces a bit of overhead, around 1ms per task.  This is usually negligible.\n",
-    "2.  When Pandas releases the GIL (coming to `groupby` in the next version) `dask.dataframe` can call several pandas operations in parallel within a process, increasing speed somewhat proportional to the number of cores. For operations which don't release the GIL, multiple processes would be needed to get the same speedup."
+    "2.  When Pandas releases the GIL `dask.dataframe` can call several pandas operations in parallel within a process, increasing speed somewhat proportional to the number of cores. For operations which don't release the GIL, multiple processes would be needed to get the same speedup."
    ]
   },
   {
diff --git a/05_distributed.ipynb b/05_distributed.ipynb
@@ -21,32 +21,32 @@
     "As we have seen so far, Dask allows you to simply construct graphs of tasks with dependencies, as well as have graphs created automatically for you using functional, Numpy or Pandas syntax on data collections. None of this would be very useful, if there weren't also a way to execute these graphs, in a parallel and memory-aware way. So far we have been calling `thing.compute()` or `dask.compute(thing)` without worrying what this entails. Now we will discuss the options available for that execution, and in particular, the distributed scheduler, which comes with additional functionality.\n",
     "\n",
     "Dask comes with four available schedulers:\n",
-    "- \"threaded\": a scheduler backed by a thread pool\n",
+    "- \"threaded\" (aka \"threading\"): a scheduler backed by a thread pool\n",
     "- \"processes\": a scheduler backed by a process pool\n",
     "- \"single-threaded\" (aka \"sync\"): a synchronous scheduler, good for debugging\n",
     "- distributed: a distributed scheduler for executing graphs on multiple machines, see below.\n",
     "\n",
     "To select one of these for computation, you can specify at the time of asking for a result, e.g.,\n",
     "```python\n",
     "myvalue.compute(scheduler=\"single-threaded\")  # for debugging\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "or set the current default, either temporarily or globally\n",
+    "```\n",
+    "\n",
+    "You can also set a default scheduler either temporarily\n",
     "```python\n",
     "with dask.config.set(scheduler='processes'):\n",
     "    # set temporarily for this block only\n",
+    "    # all compute calls within this block will use the specified scheduler\n",
     "    myvalue.compute()\n",
+    "    anothervalue.compute()\n",
+    "```\n",
     "\n",
-    "dask.config.set(scheduler='processes')\n",
+    "Or globally\n",
+    "```python\n",
     "# set until further notice\n",
+    "dask.config.set(scheduler='processes')\n",
     "```\n",
     "\n",
-    "Lets see the difference for the familiar case of the flights data"
+    "Let's try out a few schedulers on the familiar case of the flights data."
    ]
   },
   {
diff --git a/06_distributed_advanced.ipynb b/06_distributed_advanced.ipynb
@@ -36,9 +36,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In chapter Distributed, we showed that executing a calculation (created using delayed) with the distributed executor is identical to any other executor. However, we now have access to additional functionality, and control over what data is held in memory.\n",
+    "In the previous chapter, we showed that executing a calculation (created using delayed) with the distributed executor is identical to any other executor. However, we now have access to additional functionality, and control over what data is held in memory.\n",
     "\n",
-    "To begin, the `futures` interface (derived from the built-in `concurrent.futures`) allow map-reduce like functionality. We can submit individual functions for evaluation with one set of inputs, or evaluated over a sequence of inputs with `submit()` and `map()`. Notice that the call returns immediately, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session."
+    "To begin, the `futures` interface (derived from the built-in `concurrent.futures`) allows map-reduce like functionality. We can submit individual functions for evaluation with one set of inputs, or evaluated over a sequence of inputs with `submit()` and `map()`. Notice that the call returns immediately, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session."
    ]
   },
   {
@@ -542,11 +542,11 @@
     "\n",
     "@delayed\n",
     "def summation(*a):\n",
-    "    return sum(*a)\n",
+    "    return sum(a)\n",
     "\n",
     "ina = [5, 25, 30]\n",
     "inb = [5, 5, 6]\n",
-    "out = summation([ratio(a, b) for (a, b) in zip(ina, inb)])\n",
+    "out = summation(*[ratio(a, b) for (a, b) in zip(ina, inb)])\n",
     "f = c.compute(out)\n",
     "f"
    ]
@@ -586,7 +586,7 @@
    "source": [
     "ina = [5, 25, 30]\n",
     "inb = [5, 0, 6]\n",
-    "out = summation([ratio(a, b) for (a, b) in zip(ina, inb)])\n",
+    "out = summation(*[ratio(a, b) for (a, b) in zip(ina, inb)])\n",
     "f = c.compute(out)\n",
     "c.gather(f)"
    ]
@@ -634,10 +634,10 @@
    "metadata": {},
    "source": [
     "The trouble with this approach is that Dask is meant for the execution of large datasets/computations - you probably can't simply run the whole thing \n",
-    "in one local thread, else you wouldn't have used Dask in the first place. So the code above should only be used on a small part of the data that also exchibits the error. \n",
+    "in one local thread, else you wouldn't have used Dask in the first place. So the code above should only be used on a small part of the data that also exhibits the error. \n",
     "Furthermore, the method will not work when you are dealing with futures (such as `f`, above, or after persisting) instead of delayed-based computations.\n",
     "\n",
-    "As alternative, you can ask the scheduler to analyze your calculation and find the specific sub-task responsible for the error, and pull only it and its dependnecies locally for execution."
+    "As an alternative, you can ask the scheduler to analyze your calculation and find the specific sub-task responsible for the error, and pull only it and its dependencies locally for execution."
    ]
   },
   {
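
Note on the `summation` hunks in `06_distributed_advanced.ipynb`: the function body and its call sites change together, and the two styles are not interchangeable. The sketch below illustrates this in plain Python, leaving out `@delayed` and the notebook's `ratio` helper. The names `summation_old` and `summation_new` and the input values are hypothetical stand-ins, not taken from the notebook.

```python
def summation_old(*a):
    # Pre-commit body: expects ONE list argument and unpacks it into sum().
    return sum(*a)


def summation_new(*a):
    # Post-commit body: sums the tuple of positional arguments directly.
    return sum(a)


# Hypothetical stand-in values for the per-pair results.
values = [1.0, 5.0, 5.0]

old_result = summation_old(values)     # inside the function, a == ([1.0, 5.0, 5.0],)
new_result = summation_new(*values)    # inside the function, a == (1.0, 5.0, 5.0)

# Pairing the old body with the new call style fails, because sum()
# then receives three positional arguments instead of one iterable.
try:
    summation_old(*values)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
```

Both consistent pairings give the same total. The post-commit version matches the usual varargs convention, where each input is passed as its own positional argument.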
