|
115 | 115 | "cell_type": "markdown",
|
116 | 116 | "metadata": {},
|
117 | 117 | "source": [
|
118 |
| - "Before using dask, lets consider the concept of blocked algorithms. We can compute the sum of a large number of elements by loading them chunk-by-chunk, and keeping a running total.\n", |
| 118 | + "Before using dask, let's consider the concept of blocked algorithms. We can compute the sum of a large number of elements by loading them chunk-by-chunk, and keeping a running total.\n", |
119 | 119 | "\n",
|
120 | 120 | "Here we compute the sum of this large array on disk by \n",
|
121 | 121 | "\n",
|
|
133 | 133 | "source": [
|
134 | 134 | "# Compute sum of large array, one million numbers at a time\n",
|
135 | 135 | "sums = []\n",
|
136 |
| - "for i in range(0, 1000000000, 1000000):\n", |
137 |
| - " chunk = dset[i: i + 1000000] # pull out numpy array\n", |
| 136 | + "for i in range(0, 1_000_000_000, 1_000_000):\n", |
| 137 | + " chunk = dset[i: i + 1_000_000] # pull out numpy array\n", |
138 | 138 | " sums.append(chunk.sum())\n",
|
139 | 139 | "\n",
|
140 | 140 | "total = sum(sums)\n",
|
|
152 | 152 | "cell_type": "markdown",
|
153 | 153 | "metadata": {},
|
154 | 154 | "source": [
|
155 |
| - "Now that we've seen the simple example above try doing a slightly more complicated problem, compute the mean of the array, assuming for a moment that we don't happen to already know how many elements are in the data. You can do this by changing the code above with the following alterations:\n", |
| 155 | + "Now that we've seen the simple example above, try doing a slightly more complicated problem. Compute the mean of the array, assuming for a moment that we don't happen to already know how many elements are in the data. You can do this by changing the code above with the following alterations:\n", |
156 | 156 | "\n",
|
157 | 157 | "1. Compute the sum of each block\n",
|
158 | 158 | "2. Compute the length of each block\n",
|
|
182 | 182 | "source": [
|
183 | 183 | "sums = []\n",
|
184 | 184 | "lengths = []\n",
|
185 |
| - "for i in range(0, 1000000000, 1000000):\n", |
186 |
| - " chunk = dset[i: i + 1000000] # pull out numpy array\n", |
| 185 | + "for i in range(0, 1_000_000_000, 1_000_000):\n", |
| 186 | + " chunk = dset[i: i + 1_000_000] # pull out numpy array\n", |
187 | 187 | " sums.append(chunk.sum())\n",
|
188 | 188 | " lengths.append(len(chunk))\n",
|
189 | 189 | "\n",
|
|
216 | 216 | "You can create a `dask.array` `Array` object with the `da.from_array` function. This function accepts\n",
|
217 | 217 | "\n",
|
218 | 218 | "1. `data`: Any object that supports NumPy slicing, like `dset`\n",
|
219 |
| - "2. `chunks`: A chunk size to tell us how to block up our array, like `(1000000,)`" |
| 219 | + "2. `chunks`: A chunk size to tell us how to block up our array, like `(1_000_000,)`" |
220 | 220 | ]
|
221 | 221 | },
|
222 | 222 | {
|
|
226 | 226 | "outputs": [],
|
227 | 227 | "source": [
|
228 | 228 | "import dask.array as da\n",
|
229 |
| - "x = da.from_array(dset, chunks=(1000000,))\n", |
| 229 | + "x = da.from_array(dset, chunks=(1_000_000,))\n", |
230 | 230 | "x"
|
231 | 231 | ]
|
232 | 232 | },
|
233 | 233 | {
|
234 | 234 | "cell_type": "markdown",
|
235 | 235 | "metadata": {},
|
236 | 236 | "source": [
|
237 |
| - "** Manipulate `dask.array` object as you would a numpy array**" |
| 237 | + "**Manipulate `dask.array` object as you would a numpy array**" |
238 | 238 | ]
|
239 | 239 | },
|
240 | 240 | {
|
|
243 | 243 | "source": [
|
244 | 244 | "Now that we have an `Array`, we can perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc.\n",
|
245 | 245 | "\n",
|
246 |
| - "The interface is familiar, but the actual work is different. dask_array.sum() does not do the same thing as numpy_array.sum()." |
| 246 | + "The interface is familiar, but the actual work is different. `dask_array.sum()` does not do the same thing as `numpy_array.sum()`." |
247 | 247 | ]
|
248 | 248 | },
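To make the laziness above concrete, here is a minimal sketch of the difference. A small in-memory NumPy array stands in for the on-disk `dset`; the chunk size is chosen only for illustration.

```python
import numpy as np
import dask.array as da

# A small in-memory array stands in for the on-disk dset here
data = np.arange(1_000_000, dtype="float64")
x = da.from_array(data, chunks=(100_000,))

# sum() only records the work to do; no chunks are processed yet
total = x.sum()

# compute() actually executes the blocked summation and returns a number
result = total.compute()
print(result)
```

Until `.compute()` is called, `total` is a lazy `dask.array` object describing the computation, not its result.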
|
249 | 249 | {
|
|
353 | 353 | "1. Use multiple cores in parallel\n",
|
354 | 354 | "2. Chain operations on a single block before moving on to the next one\n",
|
355 | 355 | "\n",
|
356 |
| - "Dask.array translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads. We'll discuss more about this in the next section.\n", |
| 356 | + "`Dask.array` translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads. We'll discuss more about this in the next section.\n", |
357 | 357 | "\n"
|
358 | 358 | ]
|
359 | 359 | },
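As a sketch of the task-graph idea: chaining several elementwise operations and a reduction produces one graph, which the default threaded scheduler then executes chunk by chunk. The random array and chunk shape below are illustrative choices, not part of the tutorial's data.

```python
import numpy as np
import dask.array as da

# A random in-memory array stands in for real data here
data = np.random.default_rng(0).normal(size=(2_000, 2_000))
x = da.from_array(data, chunks=(500, 500))

# This builds a graph: a subtract/square/sum pipeline per chunk,
# plus a final aggregation step; nothing runs yet
variance_sum = ((x - x.mean()) ** 2).sum()

# compute() executes the graph, by default using a thread pool
value = variance_sum.compute()
```

Because each chunk flows through the whole subtract/square/sum pipeline before aggregation, intermediate full-size arrays are never materialized in memory.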
|
|
410 | 410 | "cell_type": "markdown",
|
411 | 411 | "metadata": {},
|
412 | 412 | "source": [
|
413 |
| - "Performance comparision\n", |
| 413 | + "Performance comparison\n", |
414 | 414 | "---------------------------\n",
|
415 | 415 | "\n",
|
416 | 416 | "The following experiment was performed on a heavy personal laptop. Your performance may vary. If you attempt the NumPy version then please ensure that you have more than 4GB of main memory."
|
|
544 | 544 | "dsets[0]"
|
545 | 545 | ]
|
546 | 546 | },
|
547 |
| - { |
548 |
| - "cell_type": "code", |
549 |
| - "execution_count": null, |
550 |
| - "metadata": {}, |
551 |
| - "outputs": [], |
552 |
| - "source": [] |
553 |
| - }, |
554 | 547 | {
|
555 | 548 | "cell_type": "code",
|
556 | 549 | "execution_count": null,
|
|
597 | 590 | {
|
598 | 591 | "cell_type": "code",
|
599 | 592 | "execution_count": null,
|
600 |
| - "metadata": {}, |
| 593 | + "metadata": { |
| 594 | + "jupyter": { |
| 595 | + "source_hidden": true |
| 596 | + } |
| 597 | + }, |
601 | 598 | "outputs": [],
|
602 | 599 | "source": [
|
603 | 600 | "arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]\n",
|
|
628 | 625 | {
|
629 | 626 | "cell_type": "code",
|
630 | 627 | "execution_count": null,
|
631 |
| - "metadata": {}, |
| 628 | + "metadata": { |
| 629 | + "jupyter": { |
| 630 | + "source_hidden": true |
| 631 | + } |
| 632 | + }, |
632 | 633 | "outputs": [],
|
633 | 634 | "source": [
|
634 | 635 | "x = da.stack(arrays, axis=0)\n",
|
|
783 | 784 | "cell_type": "markdown",
|
784 | 785 | "metadata": {},
|
785 | 786 | "source": [
|
786 |
| - "The [Lennard-Jones](https://en.wikipedia.org/wiki/Lennard-Jones_potential) is used in partical simuluations in physics, chemistry and engineering. It is highly parallelizable.\n", |
| 787 | + "The [Lennard-Jones potential](https://en.wikipedia.org/wiki/Lennard-Jones_potential) is used in particle simulations in physics, chemistry, and engineering. It is highly parallelizable.\n", |
787 | 788 | "\n",
|
788 | 789 | "First, we'll run and profile the Numpy version on 7,000 particles."
|
789 | 790 | ]
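For readers unfamiliar with the linked potential, here is a minimal sketch of the pairwise formula $V(r) = 4\varepsilon\left[(\sigma/r)^{12} - (\sigma/r)^6\right]$; the `epsilon` and `sigma` parameter names and defaults are illustrative assumptions, not the notebook's actual implementation.

```python
import numpy as np

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """V(r) = 4*epsilon*((sigma/r)**12 - (sigma/r)**6), elementwise over r."""
    sr6 = (sigma / r) ** 6
    return 4 * epsilon * (sr6 ** 2 - sr6)

print(lennard_jones(1.0))           # 0.0 at r = sigma
print(lennard_jones(2 ** (1 / 6)))  # close to -epsilon (the minimum)
```

Evaluating this over all pairwise distances is embarrassingly parallel, which is why the problem suits both NumPy vectorization and dask.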
|
|