App Search to Open Crawler Migration Notebook #416

Merged: 5 commits, Mar 5, 2025
1 change: 1 addition & 0 deletions bin/find-notebooks-to-test.sh
@@ -31,6 +31,7 @@ EXEMPT_NOTEBOOKS=(
"notebooks/integrations/azure-openai/vector-search-azure-openai-elastic.ipynb"
"notebooks/enterprise-search/app-search-engine-exporter.ipynb",
"notebooks/enterprise-search/elastic-crawler-to-open-crawler-migration.ipynb",
"notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb",
"notebooks/playground-examples/bedrock-anthropic-elasticsearch-client.ipynb",
"notebooks/playground-examples/openai-elasticsearch-client.ipynb",
"notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb",
387 changes: 387 additions & 0 deletions notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb
@@ -0,0 +1,387 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0cccdff9-5ef4-4bc8-a139-6aefa7609f1e",
"metadata": {},
"source": [
"## Hello, future Open Crawler user!\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb)\n",
"\n",
"This notebook is designed to help you migrate your [App Search Web Crawler](https://www.elastic.co/guide/en/app-search/current/web-crawler.html) configurations to Open Crawler-friendly YAML!\n",
"\n",
"We recommend running each cell individually in a sequential fashion, as each cell is dependent on previous cells having been run. Furthermore, we recommend that you only run each cell once as re-running cells may result in errors or incorrect YAML files.\n",
"\n",
"### Setup\n",
"First, let's start by making sure `elasticsearch` and other required dependencies are installed and imported by running the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db796f1b-ce29-432b-879b-d517b84fbd9f",
"metadata": {},
"outputs": [],
"source": [
"!pip install elasticsearch\n",
"\n",
"from getpass import getpass\n",
"from elasticsearch import Elasticsearch\n",
"\n",
"import os\n",
"import yaml"
]
},
{
"cell_type": "markdown",
"id": "08b84987-19c5-4716-9534-1225ea993a9c",
"metadata": {},
"source": [
"We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:\n",
"- Your **Elasticsearch Endpoint URL**\n",
"- Your **Elasticsearch Endpoint Port number**\n",
"- An **API key**\n",
"\n",
"You can find your Endpoint URL and port number by visiting your Elasticsearch Overview page in Kibana.\n",
"\n",
"You can create a new API key from the Stack Management -> API keys menu in Kibana. Be sure to copy or write down your key in a safe place, as it will be displayed only once upon creation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5de84ecd-a055-4e73-ae4a-1cc2ad4be0a7",
"metadata": {},
"outputs": [],
"source": [
"ELASTIC_ENDPOINT = getpass(\"Elastic Endpoint: \")\n",
"ELASTIC_PORT = getpass(\"Port\")\n",
"API_KEY = getpass(\"Elastic Api Key: \")\n",
"\n",
"es_client = Elasticsearch(\n",
" \":\".join([ELASTIC_ENDPOINT, ELASTIC_PORT]),\n",
" api_key=API_KEY,\n",
")\n",
"\n",
"# ping ES to make sure we have positive connection\n",
"es_client.info()[\"tagline\"]"
]
},
{
"cell_type": "markdown",
"id": "aa8dfff3-713b-44a4-b694-2699eebe665e",
"metadata": {},
"source": [
"Hopefully you received our tagline 'You Know, for Search'. If so, we are connected and ready to go!\n",
"\n",
"If not, please double-check your Cloud ID and API key that you provided above. "
]
},
{
"cell_type": "markdown",
"id": "4756ffaa-678d-41aa-865b-909038034104",
"metadata": {},
"source": [
"### Step 1: Get information on all App Search engines and their Web Crawlers\n",
"\n",
"First, we need to establish what Crawlers you have and their basic configuration details.\n",
"This next cell will attempt to pull configurations for every distinct App Search Engine you have in your Elasticsearch instance."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e6d2f86-7aea-451f-bed2-107aaa1d4783",
"metadata": {},
"outputs": [],
"source": [
"# in-memory data structure that maintains current state of the configs we've pulled\n",
"inflight_configuration_data = {}\n",
"\n",
"# get each engine's engine_oid\n",
"app_search_engines = es_client.search(\n",
" index=\".ent-search-actastic-engines_v26\",\n",
")\n",
"\n",
"engine_counter = 1\n",
"for engine in app_search_engines[\"hits\"][\"hits\"]:\n",
" # pprint.pprint(engine)\n",
" source = engine[\"_source\"]\n",
" if not source[\"queued_for_deletion\"]:\n",
" engine_oid = source[\"id\"]\n",
" output_index = source[\"name\"]\n",
"\n",
" # populate a temporary hashmap\n",
" temp_conf_map = {\"output_index\": output_index}\n",
" # pre-populate some necessary fields in preparation for upcoming steps\n",
" temp_conf_map[\"domains_temp\"] = {}\n",
" temp_conf_map[\"output_sink\"] = \"elasticsearch\"\n",
" temp_conf_map[\"full_html_extraction_enabled\"] = False\n",
" temp_conf_map[\"elasticsearch\"] = {\n",
" \"host\": \"\",\n",
" \"port\": \"\",\n",
" \"api_key\": \"\",\n",
" }\n",
" # populate the in-memory data structure\n",
" inflight_configuration_data[engine_oid] = temp_conf_map\n",
" print(f\"{engine_counter}.) {output_index}\")\n",
" engine_counter += 1"
]
},
{
"cell_type": "markdown",
"id": "98657282-4027-40bb-9287-02b4a0381aec",
"metadata": {},
"source": [
"### Step 2: URLs, Sitemaps, and Crawl Rules\n",
"\n",
"In the next cell, we will need to query Elasticsearch for information about each Crawler's domain URLs, seed URLs, sitemaps, and crawling rules."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94121748-5944-4dc4-9456-cdc754c7126c",
"metadata": {},
"outputs": [],
"source": [
"crawler_counter = 1\n",
"for engine_oid, crawler_config in inflight_configuration_data.items():\n",
" # get each crawler's domain details\n",
" crawler_domains = es_client.search(\n",
" index=\".ent-search-actastic-crawler_domains_v6\",\n",
" query={\"match\": {\"engine_oid\": engine_oid}},\n",
" _source=[\"crawl_rules\", \"id\", \"name\", \"seed_urls\", \"sitemaps\"],\n",
" )\n",
"\n",
" print(f\"{crawler_counter}.) Engine ID {engine_oid}\")\n",
" crawler_counter += 1\n",
"\n",
" for domain_info in crawler_domains[\"hits\"][\"hits\"]:\n",
" source = domain_info[\"_source\"]\n",
"\n",
" # extract values\n",
" domain_oid = str(source[\"id\"])\n",
" domain_url = source[\"name\"]\n",
" seed_urls = source[\"seed_urls\"]\n",
" sitemap_urls = source[\"sitemaps\"]\n",
" crawl_rules = source[\"crawl_rules\"]\n",
"\n",
" print(f\" Domain {domain_url} found!\")\n",
"\n",
" # transform seed, sitemap, and crawl rules into arrays\n",
" seed_urls_list = []\n",
" for seed_obj in seed_urls:\n",
" seed_urls_list.append(seed_obj[\"url\"])\n",
"\n",
" sitemap_urls_list = []\n",
" for sitemap_obj in sitemap_urls:\n",
" sitemap_urls_list.append(sitemap_obj[\"url\"])\n",
"\n",
" crawl_rules_list = []\n",
" for crawl_rules_obj in crawl_rules:\n",
" crawl_rules_list.append(\n",
" {\n",
" \"policy\": crawl_rules_obj[\"policy\"],\n",
" \"type\": crawl_rules_obj[\"rule\"],\n",
" \"pattern\": crawl_rules_obj[\"pattern\"],\n",
" }\n",
" )\n",
"\n",
" # populate a temporary hashmap\n",
" temp_domain_conf = {\"url\": domain_url}\n",
" if seed_urls_list:\n",
" temp_domain_conf[\"seed_urls\"] = seed_urls_list\n",
" print(f\" Seed URls found: {seed_urls_list}\")\n",
" if sitemap_urls_list:\n",
" temp_domain_conf[\"sitemap_urls\"] = sitemap_urls_list\n",
" print(f\" Sitemap URLs found: {sitemap_urls_list}\")\n",
" if crawl_rules_list:\n",
" temp_domain_conf[\"crawl_rules\"] = crawl_rules_list\n",
" print(f\" Crawl rules found: {crawl_rules_list}\")\n",
"\n",
" # populate the in-memory data structure\n",
" inflight_configuration_data[engine_oid][\"domains_temp\"][\n",
" domain_oid\n",
" ] = temp_domain_conf\n",
" print()"
]
},
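{
"cell_type": "markdown",
"id": "b3f1c2d4-1111-4a2b-9c3d-5e6f7a8b9c0d",
"metadata": {},
"source": [
"At this point, `inflight_configuration_data` holds one entry per App Search engine. As a rough sketch (the IDs and values below are purely illustrative, not taken from a real deployment), each entry looks like this:\n",
"\n",
"```python\n",
"{\n",
"    \"<engine_oid>\": {\n",
"        \"output_index\": \"my-search-engine\",\n",
"        \"domains_temp\": {\n",
"            \"<domain_oid>\": {\n",
"                \"url\": \"https://www.example.com\",\n",
"                \"seed_urls\": [\"https://www.example.com/blog\"],\n",
"                \"sitemap_urls\": [\"https://www.example.com/sitemap.xml\"],\n",
"                \"crawl_rules\": [{\"policy\": \"allow\", \"type\": \"begins\", \"pattern\": \"/blog\"}],\n",
"            }\n",
"        },\n",
"        \"output_sink\": \"elasticsearch\",\n",
"        \"full_html_extraction_enabled\": False,\n",
"        \"elasticsearch\": {\"host\": \"\", \"port\": \"\", \"api_key\": \"\"},\n",
"    }\n",
"}\n",
"```\n",
"\n",
"Step 3 will flatten `domains_temp` into a `domains` list and fill in the `elasticsearch` section before anything is written to YAML."
]
},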
{
"cell_type": "markdown",
"id": "cd711055-17a5-47c6-ba73-ff7e386a393b",
"metadata": {},
"source": [
"### Step 3: Creating the Open Crawler YAML configuration files\n",
"In this final step, we will create the actual YAML files you need to get up and running with Open Crawler!\n",
"\n",
"The next cell performs some final transformations to the in-memory data structure that is keeping track of your configurations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3e080c5-a528-47fd-9b79-0718d17d2560",
"metadata": {},
"outputs": [],
"source": [
"# Final transform of the in-memory data structure to a form we can dump to YAML\n",
"# for each crawler, collect all of its domain configurations into a list\n",
"for engine_oid, crawler_config in inflight_configuration_data.items():\n",
" all_crawler_domains = []\n",
"\n",
" for domain_config in crawler_config[\"domains_temp\"].values():\n",
" all_crawler_domains.append(domain_config)\n",
" # create a new key called \"domains\" that points to a list of domain configs only - no domain_oid values as keys\n",
" crawler_config[\"domains\"] = all_crawler_domains\n",
" # delete the temporary domain key\n",
" del crawler_config[\"domains_temp\"]\n",
" print(f\"Transform for {engine_oid} complete!\")"
]
},
{
"cell_type": "markdown",
"id": "7681f3c5-2f88-4e06-ad6a-65bedb5634e0",
"metadata": {},
"source": [
"#### **Wait! Before we continue onto creating our YAML files, we're going to need your input on a few things.**\n",
"\n",
"In the next cell, please enter the following details about the _Elasticsearch instance you will be using with Open Crawler_. This instance can be Elastic Cloud Hosted, Serverless, or a local instance.\n",
"\n",
"- The Elasticsearch endpoint URL\n",
"- The port number of your Elasticsearch endpoint _(Optional, will default to 443 if left blank)_\n",
"- An API key"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6cfab1b-cf2e-4a41-9b54-492f59f7f61a",
"metadata": {},
"outputs": [],
"source": [
"ENDPOINT = getpass(\"Elasticsearch endpoint URL: \")\n",
"PORT = getpass(\"[OPTIONAL] Elasticsearch endpoint port number: \")\n",
"OUTPUT_API_KEY = getpass(\"Elasticsearch API key: \")\n",
"\n",
"# set the above values in each Crawler's configuration\n",
"for crawler_config in inflight_configuration_data.values():\n",
" crawler_config[\"elasticsearch\"][\"host\"] = ENDPOINT\n",
" crawler_config[\"elasticsearch\"][\"port\"] = int(PORT) if PORT else 443\n",
" crawler_config[\"elasticsearch\"][\"api_key\"] = OUTPUT_API_KEY\n",
"\n",
"# ping ES to make sure we have positive connection\n",
"es_client = Elasticsearch(\n",
" \":\".join([ENDPOINT, PORT]),\n",
" api_key=OUTPUT_API_KEY,\n",
")\n",
"\n",
"es_client.info()[\"tagline\"]"
]
},
{
"cell_type": "markdown",
"id": "14ed1150-a965-4739-9111-6dfc435ced4e",
"metadata": {},
"source": [
"#### **This is the final step! You have two options here:**\n",
"\n",
"- The \"Write to YAML\" cell will create _n_ number of YAML files, one for each Crawler you have.\n",
"- The \"Print to output\" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.\n",
"\n",
"Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files."
]
},
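{
"cell_type": "markdown",
"id": "c4d5e6f7-2222-4b3c-8d4e-9f0a1b2c3d4e",
"metadata": {},
"source": [
"For reference, each generated configuration should look roughly like the following. All values here are illustrative placeholders; your own files will contain the index names, domains, and credentials gathered above:\n",
"\n",
"```yaml\n",
"output_index: my-search-engine\n",
"output_sink: elasticsearch\n",
"full_html_extraction_enabled: false\n",
"elasticsearch:\n",
"  host: https://my-deployment.es.us-east-1.aws.elastic.cloud\n",
"  port: 443\n",
"  api_key: <REDACTED>\n",
"domains:\n",
"- url: https://www.example.com\n",
"  seed_urls:\n",
"  - https://www.example.com/blog\n",
"  sitemap_urls:\n",
"  - https://www.example.com/sitemap.xml\n",
"  crawl_rules:\n",
"  - policy: allow\n",
"    type: begins\n",
"    pattern: /blog\n",
"```"
]
},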
{
"cell_type": "markdown",
"id": "7a203432-d0a6-4929-a109-5fbf5d3f5e56",
"metadata": {},
"source": [
"#### Option 1: Write to YAML file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d6df1ff-1b47-45b8-a057-fc10cadf7dc5",
"metadata": {},
"outputs": [],
"source": [
"# Dump each Crawler's configuration into its own YAML file\n",
"for crawler_config in inflight_configuration_data.values():\n",
" base_dir = os.getcwd()\n",
" file_name = (\n",
" f\"{crawler_config['output_index']}-config.yml\" # autogen a custom filename\n",
" )\n",
" output_path = os.path.join(base_dir, file_name)\n",
"\n",
" if os.path.exists(base_dir):\n",
" with open(output_path, \"w\") as file:\n",
" yaml.safe_dump(crawler_config, file, sort_keys=False)\n",
" print(f\" Wrote {file_name} to {output_path}\")"
]
},
{
"cell_type": "markdown",
"id": "49cedb05-9d4e-4c09-87a8-e15fb326ce12",
"metadata": {},
"source": [
"#### Option 2: Print to output"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e99d6c91-914b-4b58-ad7d-3af2db8fff4b",
"metadata": {},
"outputs": [],
"source": [
"for crawler_config in inflight_configuration_data.values():\n",
" yaml_out = yaml.safe_dump(crawler_config, sort_keys=False)\n",
"\n",
" print(f\"YAML config => {crawler_config['output_index']}-config.yml\\n--------\")\n",
" print(yaml_out)\n",
" print(\n",
" \"--------------------------------------------------------------------------------\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "26b35570-f05d-4aa2-82fb-89ec37f84015",
"metadata": {},
"source": [
"### Next Steps\n",
"\n",
"Now that the YAML files have been generated, you can visit the Open Crawler GitHub repository to learn more about how to deploy Open Crawler: https://github.com/elastic/crawler#quickstart\n",
"\n",
"Additionally, you can learn more about Open Crawler via the following blog posts:\n",
"- [Open Crawler's promotion to beta release](https://www.elastic.co/search-labs/blog/elastic-open-crawler-beta-release)\n",
"- [How to use Open Crawler with Semantic Text](https://www.elastic.co/search-labs/blog/semantic-search-open-crawler) to easily crawl websites and make them semantically searchable\n",
"\n",
"If you find any problems with this Notebook, please feel free to create an issue in the elasticsearch-labs repository: https://github.com/elastic/elasticsearch-labs/issues"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}