App Search to Open Crawler Migration Notebook #416
Merged

Commits (5)
- 9c43136 Adding new App Search to Open Crawler migration Notebook (mattnowzari)
- 8869e5b Added Notebook to test exemptions list (mattnowzari)
- 90b5fdf Renamed Notebook to match other App Search notebook naming + removed … (mattnowzari)
- 32bff81 Fixed filename in EXEMPT_NOTEBOOKS (mattnowzari)
- ce7b75a Added some links in the intro and outro markdown cells (mattnowzari)
387 changes: 387 additions & 0 deletions
notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb
@@ -0,0 +1,387 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "0cccdff9-5ef4-4bc8-a139-6aefa7609f1e", | ||
"metadata": {}, | ||
"source": [ | ||
"## Hello, future Open Crawler user!\n", | ||
"[](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb)\n", | ||
"\n", | ||
"This notebook is designed to help you migrate your [App Search Web Crawler](https://www.elastic.co/guide/en/app-search/current/web-crawler.html) configurations to Open Crawler-friendly YAML!\n", | ||
"\n", | ||
"We recommend running each cell individually in a sequential fashion, as each cell is dependent on previous cells having been run. Furthermore, we recommend that you only run each cell once as re-running cells may result in errors or incorrect YAML files.\n", | ||
"\n", | ||
"### Setup\n", | ||
"First, let's start by making sure `elasticsearch` and other required dependencies are installed and imported by running the following cell:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "db796f1b-ce29-432b-879b-d517b84fbd9f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install elasticsearch\n", | ||
"\n", | ||
"from getpass import getpass\n", | ||
"from elasticsearch import Elasticsearch\n", | ||
"\n", | ||
"import os\n", | ||
"import yaml" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "08b84987-19c5-4716-9534-1225ea993a9c", | ||
"metadata": {}, | ||
"source": [ | ||
"We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:\n", | ||
"- Your **Elasticsearch Endpoint URL**\n", | ||
"- Your **Elasticsearch Endpoint Port number**\n", | ||
"- An **API key**\n", | ||
"\n", | ||
"You can find your Endpoint URL and port number by visiting your Elasticsearch Overview page in Kibana.\n", | ||
"\n", | ||
"You can create a new API key from the Stack Management -> API keys menu in Kibana. Be sure to copy or write down your key in a safe place, as it will be displayed only once upon creation." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "5de84ecd-a055-4e73-ae4a-1cc2ad4be0a7", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ELASTIC_ENDPOINT = getpass(\"Elastic Endpoint: \")\n", | ||
"ELASTIC_PORT = getpass(\"Port\")\n", | ||
"API_KEY = getpass(\"Elastic Api Key: \")\n", | ||
"\n", | ||
"es_client = Elasticsearch(\n", | ||
" \":\".join([ELASTIC_ENDPOINT, ELASTIC_PORT]),\n", | ||
" api_key=API_KEY,\n", | ||
")\n", | ||
"\n", | ||
"# ping ES to make sure we have positive connection\n", | ||
"es_client.info()[\"tagline\"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "aa8dfff3-713b-44a4-b694-2699eebe665e", | ||
"metadata": {}, | ||
"source": [ | ||
"Hopefully you received our tagline 'You Know, for Search'. If so, we are connected and ready to go!\n", | ||
"\n", | ||
"If not, please double-check your Cloud ID and API key that you provided above. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "4756ffaa-678d-41aa-865b-909038034104", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 1: Get information on all App Search engines and their Web Crawlers\n", | ||
"\n", | ||
"First, we need to establish what Crawlers you have and their basic configuration details.\n", | ||
"This next cell will attempt to pull configurations for every distinct App Search Engine you have in your Elasticsearch instance." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2e6d2f86-7aea-451f-bed2-107aaa1d4783", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# in-memory data structure that maintains current state of the configs we've pulled\n", | ||
"inflight_configuration_data = {}\n", | ||
"\n", | ||
"# get each engine's engine_oid\n", | ||
"app_search_engines = es_client.search(\n", | ||
" index=\".ent-search-actastic-engines_v26\",\n", | ||
")\n", | ||
"\n", | ||
"engine_counter = 1\n", | ||
"for engine in app_search_engines[\"hits\"][\"hits\"]:\n", | ||
" # pprint.pprint(engine)\n", | ||
" source = engine[\"_source\"]\n", | ||
" if not source[\"queued_for_deletion\"]:\n", | ||
" engine_oid = source[\"id\"]\n", | ||
" output_index = source[\"name\"]\n", | ||
"\n", | ||
" # populate a temporary hashmap\n", | ||
" temp_conf_map = {\"output_index\": output_index}\n", | ||
" # pre-populate some necessary fields in preparation for upcoming steps\n", | ||
" temp_conf_map[\"domains_temp\"] = {}\n", | ||
" temp_conf_map[\"output_sink\"] = \"elasticsearch\"\n", | ||
" temp_conf_map[\"full_html_extraction_enabled\"] = False\n", | ||
" temp_conf_map[\"elasticsearch\"] = {\n", | ||
" \"host\": \"\",\n", | ||
" \"port\": \"\",\n", | ||
" \"api_key\": \"\",\n", | ||
" }\n", | ||
" # populate the in-memory data structure\n", | ||
" inflight_configuration_data[engine_oid] = temp_conf_map\n", | ||
" print(f\"{engine_counter}.) {output_index}\")\n", | ||
" engine_counter += 1" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "98657282-4027-40bb-9287-02b4a0381aec", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 2: URLs, Sitemaps, and Crawl Rules\n", | ||
"\n", | ||
"In the next cell, we will need to query Elasticsearch for information about each Crawler's domain URLs, seed URLs, sitemaps, and crawling rules." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "94121748-5944-4dc4-9456-cdc754c7126c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"crawler_counter = 1\n", | ||
"for engine_oid, crawler_config in inflight_configuration_data.items():\n", | ||
" # get each crawler's domain details\n", | ||
" crawler_domains = es_client.search(\n", | ||
" index=\".ent-search-actastic-crawler_domains_v6\",\n", | ||
" query={\"match\": {\"engine_oid\": engine_oid}},\n", | ||
" _source=[\"crawl_rules\", \"id\", \"name\", \"seed_urls\", \"sitemaps\"],\n", | ||
" )\n", | ||
"\n", | ||
" print(f\"{crawler_counter}.) Engine ID {engine_oid}\")\n", | ||
" crawler_counter += 1\n", | ||
"\n", | ||
" for domain_info in crawler_domains[\"hits\"][\"hits\"]:\n", | ||
" source = domain_info[\"_source\"]\n", | ||
"\n", | ||
" # extract values\n", | ||
" domain_oid = str(source[\"id\"])\n", | ||
" domain_url = source[\"name\"]\n", | ||
" seed_urls = source[\"seed_urls\"]\n", | ||
" sitemap_urls = source[\"sitemaps\"]\n", | ||
" crawl_rules = source[\"crawl_rules\"]\n", | ||
"\n", | ||
" print(f\" Domain {domain_url} found!\")\n", | ||
"\n", | ||
" # transform seed, sitemap, and crawl rules into arrays\n", | ||
" seed_urls_list = []\n", | ||
" for seed_obj in seed_urls:\n", | ||
" seed_urls_list.append(seed_obj[\"url\"])\n", | ||
"\n", | ||
" sitemap_urls_list = []\n", | ||
" for sitemap_obj in sitemap_urls:\n", | ||
" sitemap_urls_list.append(sitemap_obj[\"url\"])\n", | ||
"\n", | ||
" crawl_rules_list = []\n", | ||
" for crawl_rules_obj in crawl_rules:\n", | ||
" crawl_rules_list.append(\n", | ||
" {\n", | ||
" \"policy\": crawl_rules_obj[\"policy\"],\n", | ||
" \"type\": crawl_rules_obj[\"rule\"],\n", | ||
" \"pattern\": crawl_rules_obj[\"pattern\"],\n", | ||
" }\n", | ||
" )\n", | ||
"\n", | ||
" # populate a temporary hashmap\n", | ||
" temp_domain_conf = {\"url\": domain_url}\n", | ||
" if seed_urls_list:\n", | ||
" temp_domain_conf[\"seed_urls\"] = seed_urls_list\n", | ||
" print(f\" Seed URls found: {seed_urls_list}\")\n", | ||
" if sitemap_urls_list:\n", | ||
" temp_domain_conf[\"sitemap_urls\"] = sitemap_urls_list\n", | ||
" print(f\" Sitemap URLs found: {sitemap_urls_list}\")\n", | ||
" if crawl_rules_list:\n", | ||
" temp_domain_conf[\"crawl_rules\"] = crawl_rules_list\n", | ||
" print(f\" Crawl rules found: {crawl_rules_list}\")\n", | ||
"\n", | ||
" # populate the in-memory data structure\n", | ||
" inflight_configuration_data[engine_oid][\"domains_temp\"][\n", | ||
" domain_oid\n", | ||
" ] = temp_domain_conf\n", | ||
" print()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "cd711055-17a5-47c6-ba73-ff7e386a393b", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 3: Creating the Open Crawler YAML configuration files\n", | ||
"In this final step, we will create the actual YAML files you need to get up and running with Open Crawler!\n", | ||
"\n", | ||
"The next cell performs some final transformations to the in-memory data structure that is keeping track of your configurations." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e3e080c5-a528-47fd-9b79-0718d17d2560", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Final transform of the in-memory data structure to a form we can dump to YAML\n", | ||
"# for each crawler, collect all of its domain configurations into a list\n", | ||
"for engine_oid, crawler_config in inflight_configuration_data.items():\n", | ||
" all_crawler_domains = []\n", | ||
"\n", | ||
" for domain_config in crawler_config[\"domains_temp\"].values():\n", | ||
" all_crawler_domains.append(domain_config)\n", | ||
" # create a new key called \"domains\" that points to a list of domain configs only - no domain_oid values as keys\n", | ||
" crawler_config[\"domains\"] = all_crawler_domains\n", | ||
" # delete the temporary domain key\n", | ||
" del crawler_config[\"domains_temp\"]\n", | ||
" print(f\"Transform for {engine_oid} complete!\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7681f3c5-2f88-4e06-ad6a-65bedb5634e0", | ||
"metadata": {}, | ||
"source": [ | ||
"#### **Wait! Before we continue onto creating our YAML files, we're going to need your input on a few things.**\n", | ||
"\n", | ||
"In the next cell, please enter the following details about the _Elasticsearch instance you will be using with Open Crawler_. This instance can be Elastic Cloud Hosted, Serverless, or a local instance.\n", | ||
"\n", | ||
"- The Elasticsearch endpoint URL\n", | ||
"- The port number of your Elasticsearch endpoint _(Optional, will default to 443 if left blank)_\n", | ||
"- An API key" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e6cfab1b-cf2e-4a41-9b54-492f59f7f61a", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ENDPOINT = getpass(\"Elasticsearch endpoint URL: \")\n", | ||
"PORT = getpass(\"[OPTIONAL] Elasticsearch endpoint port number: \")\n", | ||
"OUTPUT_API_KEY = getpass(\"Elasticsearch API key: \")\n", | ||
"\n", | ||
"# set the above values in each Crawler's configuration\n", | ||
"for crawler_config in inflight_configuration_data.values():\n", | ||
" crawler_config[\"elasticsearch\"][\"host\"] = ENDPOINT\n", | ||
" crawler_config[\"elasticsearch\"][\"port\"] = int(PORT) if PORT else 443\n", | ||
" crawler_config[\"elasticsearch\"][\"api_key\"] = OUTPUT_API_KEY\n", | ||
"\n", | ||
"# ping ES to make sure we have positive connection\n", | ||
"es_client = Elasticsearch(\n", | ||
" \":\".join([ENDPOINT, PORT]),\n", | ||
" api_key=OUTPUT_API_KEY,\n", | ||
")\n", | ||
"\n", | ||
"es_client.info()[\"tagline\"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "14ed1150-a965-4739-9111-6dfc435ced4e", | ||
"metadata": {}, | ||
"source": [ | ||
"#### **This is the final step! You have two options here:**\n", | ||
"\n", | ||
"- The \"Write to YAML\" cell will create _n_ number of YAML files, one for each Crawler you have.\n", | ||
"- The \"Print to output\" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.\n", | ||
"\n", | ||
"Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files." | ||
] | ||
}, | ||
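{ | ||
"cell_type": "markdown", | ||
"id": "open-crawler-sample-config-md", | ||
"metadata": {}, | ||
"source": [ | ||
"For reference, each generated configuration follows the structure built up in the previous steps. A generated file might look roughly like the sketch below (all values here are illustrative placeholders, not output from your deployment):\n", | ||
"\n", | ||
"```yaml\n", | ||
"output_index: my-search-index\n", | ||
"output_sink: elasticsearch\n", | ||
"full_html_extraction_enabled: false\n", | ||
"elasticsearch:\n", | ||
"  host: https://my-deployment.es.example.cloud\n", | ||
"  port: 443\n", | ||
"  api_key: <your API key>\n", | ||
"domains:\n", | ||
"  - url: https://www.example.com\n", | ||
"    seed_urls:\n", | ||
"      - https://www.example.com/blog\n", | ||
"    sitemap_urls:\n", | ||
"      - https://www.example.com/sitemap.xml\n", | ||
"    crawl_rules:\n", | ||
"      - policy: deny\n", | ||
"        type: begins\n", | ||
"        pattern: /private\n", | ||
"```" | ||
] | ||
}, | ||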
{ | ||
"cell_type": "markdown", | ||
"id": "7a203432-d0a6-4929-a109-5fbf5d3f5e56", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Option 1: Write to YAML file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "0d6df1ff-1b47-45b8-a057-fc10cadf7dc5", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Dump each Crawler's configuration into its own YAML file\n", | ||
"for crawler_config in inflight_configuration_data.values():\n", | ||
" base_dir = os.getcwd()\n", | ||
" file_name = (\n", | ||
" f\"{crawler_config['output_index']}-config.yml\" # autogen a custom filename\n", | ||
" )\n", | ||
" output_path = os.path.join(base_dir, file_name)\n", | ||
"\n", | ||
" if os.path.exists(base_dir):\n", | ||
" with open(output_path, \"w\") as file:\n", | ||
" yaml.safe_dump(crawler_config, file, sort_keys=False)\n", | ||
" print(f\" Wrote {file_name} to {output_path}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "49cedb05-9d4e-4c09-87a8-e15fb326ce12", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Option 2: Print to output" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e99d6c91-914b-4b58-ad7d-3af2db8fff4b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for crawler_config in inflight_configuration_data.values():\n", | ||
" yaml_out = yaml.safe_dump(crawler_config, sort_keys=False)\n", | ||
"\n", | ||
" print(f\"YAML config => {crawler_config['output_index']}-config.yml\\n--------\")\n", | ||
" print(yaml_out)\n", | ||
" print(\n", | ||
" \"--------------------------------------------------------------------------------\"\n", | ||
" )" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "26b35570-f05d-4aa2-82fb-89ec37f84015", | ||
"metadata": {}, | ||
"source": [ | ||
"### Next Steps\n", | ||
"\n", | ||
"Now that the YAML files have been generated, you can visit the Open Crawler GitHub repository to learn more about how to deploy Open Crawler: https://github.com/elastic/crawler#quickstart\n", | ||
"\n", | ||
"Additionally, you can learn more about Open Crawler via the following blog posts:\n", | ||
"- [Open Crawler's promotion to beta release](https://www.elastic.co/search-labs/blog/elastic-open-crawler-beta-release)\n", | ||
"- [How to use Open Crawler with Semantic Text](https://www.elastic.co/search-labs/blog/semantic-search-open-crawler) to easily crawl websites and make them semantically searchable\n", | ||
"\n", | ||
"If you find any problems with this Notebook, please feel free to create an issue in the elasticsearch-labs repository: https://github.com/elastic/elasticsearch-labs/issues" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |