|
1 | 1 | {
|
2 | 2 | "cells": [
|
3 | 3 | {
|
4 |  | - "attachments": {},
5 | 4 | "cell_type": "markdown",
|
6 | 5 | "id": "e2ac1559-3729-4cf3-acee-d4bb15c6f53d",
|
7 | 6 | "metadata": {
|
8 | 7 | "tags": []
|
9 | 8 | },
|
10 | 9 | "source": [
|
11 |  | - "# Feature Processor Sample Notebook"
| 10 | + "# Amazon SageMaker Feature Store: Feature Processor Introduction" |
12 | 11 | ]
|
13 | 12 | },
|
14 | 13 | {
|
15 |  | - "attachments": {},
16 | 14 | "cell_type": "markdown",
|
17 | 15 | "id": "bfd7d612",
|
18 | 16 | "metadata": {},
|
|
27 | 25 | ]
|
28 | 26 | },
|
29 | 27 | {
|
30 |  | - "attachments": {},
| 28 | + "cell_type": "markdown", |
| 29 | + "id": "c339cb18", |
| 30 | + "metadata": {}, |
| 31 | + "source": [ |
| 32 | + "This notebook demonstrates how to get started with Feature Processor using SageMaker python SDK, create feature groups, perform batch transformation and ingest processed input data to feature groups.\n", |
| 33 | + "\n", |
| 34 | + "We first demonstrate how to use `@feature-processor` decorator to run the job locally and then show how to use `@remote` decorator to execute large batch transform and ingestion on SageMaker training job remotely. Besides, the SDK provides APIs to create scheduled pipelines based on transformation code.\n", |
| 35 | + "\n", |
| 36 | + "If you would like to learn more about Feature Processor, see documentation [Feature Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-feature-processing.html) for more info and examples." |
| 37 | + ] |
| 38 | + }, |
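At a glance, the APIs exercised in the sections below come from two modules of the SageMaker Python SDK. A sketch of the relevant imports:

```python
# Core feature-processing entry points in the SageMaker Python SDK.
from sagemaker.feature_store.feature_processor import (
    feature_processor,   # decorator: load inputs, transform, and ingest to a feature group
    CSVDataSource,       # CSV-on-S3 input source for the decorator
    to_pipeline,         # promote a decorated function to a SageMaker Pipeline
    schedule,            # attach a recurring schedule to that pipeline
    execute,             # trigger a single pipeline execution
    describe,            # inspect a feature-processor pipeline
    list_pipelines,      # enumerate feature-processor pipelines
)
from sagemaker.remote_function import remote  # run the transform as a SageMaker training job
```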
| 39 | + { |
31 | 40 | "cell_type": "markdown",
|
32 | 41 | "id": "a8b4ba90-e512-46bf-bfa9-541213021e86",
|
33 | 42 | "metadata": {
|
34 | 43 | "tags": []
|
35 | 44 | },
|
36 | 45 | "source": [
|
37 |  | - "## Setup For Notebook\n",
38 |  | - "First we create a new kernel to execute this notebook.\n",
| 46 | + "## Setup For Notebook\n" |
| 47 | + ] |
| 48 | + }, |
| 49 | + { |
| 50 | + "cell_type": "markdown", |
| 51 | + "id": "e45c4dd7", |
| 52 | + "metadata": {}, |
| 53 | + "source": [ |
| 54 | + "### Setup Runtime Environment\n", |
39 | 55 | "\n",
|
| 56 | + "First we create a new kernel to execute this notebook.\n", |
40 | 57 | "1. Launch a new terminal in the current image (the '$_' icon at the top of this notebook).\n",
|
41 | 58 | "2. Execute the commands: \n",
|
42 | 59 | "```\n",
|
|
48 | 65 | "3. Return to this notebook and select the kernel with Image: 'Data Science' and Kernel: 'feature-processing-py-3.9'"
|
49 | 66 | ]
|
50 | 67 | },
|
| 68 | + { |
| 69 | + "cell_type": "markdown", |
| 70 | + "id": "a65db47d", |
| 71 | + "metadata": {}, |
| 72 | + "source": [ |
| 73 | + "Alternatively If you are running this notebook on SageMaker Studio, you can execute the following cell to install runtime dependencies." |
| 74 | + ] |
| 75 | + }, |
51 | 76 | {
|
52 | 77 | "cell_type": "code",
|
53 | 78 | "execution_count": null,
|
54 |  | - "id": "73131cc7-1680-4e31-b47a-58d6f9c9236d",
| 79 | + "id": "efbd6006", |
| 80 | + "metadata": { |
| 81 | + "tags": [] |
| 82 | + }, |
| 83 | + "outputs": [], |
| 84 | + "source": [ |
| 85 | + "%%capture\n", |
| 86 | + "\n", |
| 87 | + "!apt-get update\n", |
| 88 | + "!apt-get install openjdk-11-jdk -y\n", |
| 89 | + "%pip install ipykernel" |
| 90 | + ] |
| 91 | + }, |
| 92 | + { |
| 93 | + "cell_type": "code", |
| 94 | + "execution_count": null, |
| 95 | + "id": "7351b428", |
55 | 96 | "metadata": {
|
56 |  | - "scrolled": true,
57 | 97 | "tags": []
|
58 | 98 | },
|
59 | 99 | "outputs": [],
|
|
103 | 143 | " get_ipython().run_cell(cell)"
|
104 | 144 | ]
|
105 | 145 | },
|
| 146 | + { |
| 147 | + "cell_type": "markdown", |
| 148 | + "id": "a303d7bc", |
| 149 | + "metadata": {}, |
| 150 | + "source": [ |
| 151 | + "### Create Feature Groups" |
| 152 | + ] |
| 153 | + }, |
| 154 | + { |
| 155 | + "cell_type": "markdown", |
| 156 | + "id": "f57390a2", |
| 157 | + "metadata": {}, |
| 158 | + "source": [ |
| 159 | + "First we start by creating two feature groups. One feature group is used for storing raw car sales dataset which is located in `data/car_data.csv`. We create another feature group to store aggregated feature values after feature processing, for example average value of `mileage`, `price` and `msrp`." |
| 160 | + ] |
| 161 | + }, |
106 | 162 | {
|
107 | 163 | "cell_type": "code",
|
108 | 164 | "execution_count": null,
|
|
241 | 297 | ]
|
242 | 298 | },
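For orientation, creating a feature group with the SageMaker Python SDK follows this shape. A minimal sketch, assuming the feature group name `car-data`, a record identifier column `id`, an event-time column `ingest_time`, and a placeholder execution role:

```python
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder execution role

# Load the raw car sales data and derive feature definitions from the DataFrame.
# Note: object-dtype columns may need .astype("string") before this call.
car_df = pd.read_csv("data/car_data.csv")

raw_fg = FeatureGroup(name="car-data", sagemaker_session=session)  # assumed name
raw_fg.load_feature_definitions(data_frame=car_df)
raw_fg.create(
    s3_uri=f"s3://{session.default_bucket()}/car-data",  # offline store location
    record_identifier_name="id",            # assumed record identifier column
    event_time_feature_name="ingest_time",  # assumed event-time column
    role_arn=role,
    enable_online_store=False,
)
```

The aggregated feature group can be created the same way from the schema of the transformed output.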
|
243 | 299 | {
|
244 |  | - "attachments": {},
245 | 300 | "cell_type": "markdown",
|
246 | 301 | "id": "75d9c534-7b9d-40da-a99b-54aa8f927f8e",
|
247 | 302 | "metadata": {
|
|
252 | 307 | "\n",
|
253 | 308 | "The following example demonstrates how to use the @feature_processor decorator to load data from Amazon S3 to a SageMaker Feature Group. \n",
|
254 | 309 | "\n",
|
255 |  | - "A @feature_processor decorated function automatically loads data from the configured inputs, applies the feature processing code and ingests the transformed data to a feature group."
| 310 | + "A `@feature_processor` decorated function automatically loads data from the configured inputs, applies the feature processing code and ingests the transformed data to a feature group." |
256 | 311 | ]
|
257 | 312 | },
|
258 | 313 | {
|
|
317 | 372 | ]
|
318 | 373 | },
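A minimal sketch of the pattern, assuming placeholder values for the S3 input location, the output feature group ARN, the column names, and the `aggregate` function name:

```python
from sagemaker.feature_store.feature_processor import CSVDataSource, feature_processor

CAR_SALES_S3_URI = "s3://my-bucket/car-data/"  # placeholder input location
AGG_FEATURE_GROUP_ARN = (
    "arn:aws:sagemaker:us-west-2:111122223333:feature-group/car-data-aggregated"  # placeholder
)

@feature_processor(
    inputs=[CSVDataSource(CAR_SALES_S3_URI)],
    output=AGG_FEATURE_GROUP_ARN,
    target_stores=["OfflineStore"],
)
def aggregate(raw_df):
    # The function receives one Spark DataFrame per configured input; the
    # DataFrame it returns is ingested into the output feature group. Real
    # code must also keep the record identifier and event-time columns.
    return raw_df.groupBy("model").agg({"price": "avg", "mileage": "avg", "msrp": "avg"})

# Calling the decorated function runs load -> transform -> ingest locally.
aggregate()
```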
|
319 | 374 | {
|
320 |  | - "attachments": {},
321 | 375 | "cell_type": "markdown",
|
322 | 376 | "id": "23ef02b7-7b38-4c00-99fb-4caed9773321",
|
323 | 377 | "metadata": {},
|
|
326 | 380 | "\n",
|
327 | 381 | "The following example demonstrates how to run your feature processing code remotely.\n",
|
328 | 382 | "\n",
|
329 |  | - "This is useful if you are working with large data sets that require hardware more powerful than locally available. You can decorate your code with the @remote decorator to run your local Python code as a single or multi-node distributed SageMaker training job. For more information on running your code as a SageMaker training job, see [Run your local code as a SageMaker training job](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html)."
| 383 | + "This is useful if you are working with large data sets that require hardware more powerful than locally available. You can decorate your code with the `@remote` decorator to run your local Python code as a single or multi-node distributed SageMaker training job. For more information on running your code as a SageMaker training job, see [Run your local code as a SageMaker training job](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html)." |
330 | 384 | ]
|
331 | 385 | },
|
332 | 386 | {
|
333 | 387 | "cell_type": "code",
|
334 | 388 | "execution_count": null,
|
335 | 389 | "id": "d1f50d11",
|
336 |  | - "metadata": {},
| 390 | + "metadata": { |
| 391 | + "tags": [] |
| 392 | + }, |
337 | 393 | "outputs": [],
|
338 | 394 | "source": [
|
339 | 395 | "\"\"\"\n",
|
|
417 | 473 | ]
|
418 | 474 | },
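Stacked together, the two decorators take roughly this shape; the Spark configuration, instance settings, and data locations below are assumptions for illustration:

```python
from sagemaker.remote_function import remote
from sagemaker.remote_function.spark_config import SparkConfig
from sagemaker.feature_store.feature_processor import CSVDataSource, feature_processor

@remote(
    spark_config=SparkConfig(),     # run the job as a Spark application
    instance_type="ml.m5.2xlarge",  # assumed instance type
    instance_count=2,               # assumed cluster size
)
@feature_processor(
    inputs=[CSVDataSource("s3://my-bucket/car-data/")],  # placeholder input
    output="arn:aws:sagemaker:us-west-2:111122223333:feature-group/car-data-aggregated",  # placeholder
)
def aggregate(raw_df):
    # Same transform body as the local version; only the decorators change.
    return raw_df.groupBy("model").agg({"price": "avg", "mileage": "avg", "msrp": "avg"})

# Invoking the function now submits a SageMaker training job instead of running locally.
aggregate()
```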
|
419 | 475 | {
|
420 |  | - "attachments": {},
421 | 476 | "cell_type": "markdown",
|
422 | 477 | "id": "11e1a26a-35f1-4477-b71f-17c18c604ea7",
|
423 | 478 | "metadata": {},
|
424 | 479 | "source": [
|
425 | 480 | "## `to_pipeline and schedule`\n",
|
426 | 481 | "\n",
|
427 |  | - "The following example demonstrates how to operationalize your feature processor by promoting it to a SageMaker Pipeline and configuring a schedule to execute it on a regular basis. This example uses the aggregate function defined above."
| 482 | + "The following example demonstrates how to operationalize your feature processor by promoting it to a SageMaker Pipeline and configuring a schedule to execute it on a regular basis. This example uses the aggregate function defined above. Note that to create a pipeline, your function must be annotated with both the `@remote` and `@feature_processor` decorators."
428 | 483 | ]
|
429 | 484 | },
|
430 | 485 | {
|
|
468 | 523 | ")"
|
469 | 524 | ]
|
470 | 525 | },
|
| 526 | + { |
| 527 | + "cell_type": "markdown", |
| 528 | + "id": "83ef5ce7", |
| 529 | + "metadata": {}, |
| 530 | + "source": [ |
| 531 | + "In the following example, we will create and schedule the pipeline using `to_pipeline` and `schedule` method. If you want to test the job before scheduling, you can use `execute` to start only one execution.\n", |
| 532 | + "\n", |
| 533 | + "The SDK also provides two extra methods `describe` and `list_pipelines` for you to get insights about the pipeline info." |
| 534 | + ] |
| 535 | + }, |
471 | 536 | {
|
472 | 537 | "cell_type": "code",
|
473 | 538 | "execution_count": null,
|
|
551 | 616 | ]
|
552 | 617 | },
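Taken together, the pipeline lifecycle calls take roughly this shape; the pipeline name and schedule expression below are assumptions, and `aggregate` is the doubly-decorated function from the earlier sketch:

```python
from sagemaker.feature_store.feature_processor import (
    to_pipeline, schedule, execute, describe, list_pipelines,
)

# Promote the decorated aggregate function to a SageMaker Pipeline.
pipeline_arn = to_pipeline(pipeline_name="car-data-aggregation", step=aggregate)

# Optionally test with a single run before scheduling.
execution_arn = execute(pipeline_name="car-data-aggregation")

# Run on a recurring schedule (EventBridge Scheduler expression syntax).
schedule(pipeline_name="car-data-aggregation", schedule_expression="rate(24 hours)")

# Inspect pipeline metadata and enumerate feature-processor pipelines.
print(describe(pipeline_name="car-data-aggregation"))
print(list_pipelines())
```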
|
553 | 618 | {
|
554 |  | - "attachments": {},
555 | 619 | "cell_type": "markdown",
|
556 | 620 | "id": "be2d9751-e288-42db-b5fa-081939be66aa",
|
557 | 621 | "metadata": {},
|
|
564 | 628 | ]
|
565 | 629 | },
|
566 | 630 | {
|
567 |  | - "attachments": {},
568 | 631 | "cell_type": "markdown",
|
569 | 632 | "id": "0e9af135",
|
570 | 633 | "metadata": {},
|
|
603 | 666 | ]
|
604 | 667 | },
|
605 | 668 | {
|
606 |  | - "attachments": {},
607 | 669 | "cell_type": "markdown",
|
608 | 670 | "id": "6c1ebc50",
|
609 | 671 | "metadata": {},
|
|
645 | 707 | }
|
646 | 708 | ],
|
647 | 709 | "metadata": {
|
| 710 | + "instance_type": "ml.m5.2xlarge", |
648 | 711 | "kernelspec": {
|
649 |  | - "display_name": "Python 3",
| 712 | + "display_name": "Python 3 (TensorFlow 2.10.0 Python 3.9 CPU Optimized)", |
650 | 713 | "language": "python",
|
651 |  | - "name": "python3"
| 714 | + "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-2.10.1-cpu-py39-ubuntu20.04-sagemaker-v1.2" |
652 | 715 | },
|
653 | 716 | "language_info": {
|
654 | 717 | "codemirror_mode": {
|
|
660 | 723 | "name": "python",
|
661 | 724 | "nbconvert_exporter": "python",
|
662 | 725 | "pygments_lexer": "ipython3",
|
663 |  | - "version": "3.9.14"
| 726 | + "version": "3.9.16" |
664 | 727 | }
|
665 | 728 | },
|
666 | 729 | "nbformat": 4,
|