
Commit 4748b24

App Search to Open Crawler Migration Notebook (#416)
1 parent 6f0b47d commit 4748b24

2 files changed: +388 −0 lines changed


bin/find-notebooks-to-test.sh

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ EXEMPT_NOTEBOOKS=(
   "notebooks/integrations/azure-openai/vector-search-azure-openai-elastic.ipynb"
   "notebooks/enterprise-search/app-search-engine-exporter.ipynb"
   "notebooks/enterprise-search/elastic-crawler-to-open-crawler-migration.ipynb"
+  "notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb"
   "notebooks/playground-examples/bedrock-anthropic-elasticsearch-client.ipynb"
   "notebooks/playground-examples/openai-elasticsearch-client.ipynb"
   "notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb"
notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb

Lines changed: 387 additions & 0 deletions
@@ -0,0 +1,387 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0cccdff9-5ef4-4bc8-a139-6aefa7609f1e",
   "metadata": {},
   "source": [
    "## Hello, future Open Crawler user!\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/app-search-crawler-to-open-crawler-migration.ipynb)\n",
    "\n",
    "This notebook is designed to help you migrate your [App Search Web Crawler](https://www.elastic.co/guide/en/app-search/current/web-crawler.html) configurations to Open Crawler-friendly YAML!\n",
    "\n",
    "We recommend running each cell individually in a sequential fashion, as each cell is dependent on previous cells having been run. Furthermore, we recommend that you only run each cell once as re-running cells may result in errors or incorrect YAML files.\n",
    "\n",
    "### Setup\n",
    "First, let's start by making sure `elasticsearch` and other required dependencies are installed and imported by running the following cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db796f1b-ce29-432b-879b-d517b84fbd9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install elasticsearch\n",
    "\n",
    "from getpass import getpass\n",
    "from elasticsearch import Elasticsearch\n",
    "\n",
    "import os\n",
    "import yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08b84987-19c5-4716-9534-1225ea993a9c",
   "metadata": {},
   "source": [
    "We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:\n",
    "- Your **Elasticsearch Endpoint URL**\n",
    "- Your **Elasticsearch Endpoint Port number**\n",
    "- An **API key**\n",
    "\n",
    "You can find your Endpoint URL and port number by visiting your Elasticsearch Overview page in Kibana.\n",
    "\n",
    "You can create a new API key from the Stack Management -> API keys menu in Kibana. Be sure to copy or write down your key in a safe place, as it will be displayed only once upon creation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5de84ecd-a055-4e73-ae4a-1cc2ad4be0a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "ELASTIC_ENDPOINT = getpass(\"Elastic Endpoint: \")\n",
    "ELASTIC_PORT = getpass(\"Port\")\n",
    "API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "es_client = Elasticsearch(\n",
    "    \":\".join([ELASTIC_ENDPOINT, ELASTIC_PORT]),\n",
    "    api_key=API_KEY,\n",
    ")\n",
    "\n",
    "# ping ES to make sure we have positive connection\n",
    "es_client.info()[\"tagline\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa8dfff3-713b-44a4-b694-2699eebe665e",
   "metadata": {},
   "source": [
    "Hopefully you received our tagline 'You Know, for Search'. If so, we are connected and ready to go!\n",
    "\n",
    "If not, please double-check the endpoint, port, and API key that you provided above."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4756ffaa-678d-41aa-865b-909038034104",
   "metadata": {},
   "source": [
    "### Step 1: Get information on all App Search engines and their Web Crawlers\n",
    "\n",
    "First, we need to establish what Crawlers you have and their basic configuration details.\n",
    "This next cell will attempt to pull configurations for every distinct App Search Engine you have in your Elasticsearch instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e6d2f86-7aea-451f-bed2-107aaa1d4783",
   "metadata": {},
   "outputs": [],
   "source": [
    "# in-memory data structure that maintains current state of the configs we've pulled\n",
    "inflight_configuration_data = {}\n",
    "\n",
    "# get each engine's engine_oid\n",
    "app_search_engines = es_client.search(\n",
    "    index=\".ent-search-actastic-engines_v26\",\n",
    ")\n",
    "\n",
    "engine_counter = 1\n",
    "for engine in app_search_engines[\"hits\"][\"hits\"]:\n",
    "    # pprint.pprint(engine)\n",
    "    source = engine[\"_source\"]\n",
    "    if not source[\"queued_for_deletion\"]:\n",
    "        engine_oid = source[\"id\"]\n",
    "        output_index = source[\"name\"]\n",
    "\n",
    "        # populate a temporary hashmap\n",
    "        temp_conf_map = {\"output_index\": output_index}\n",
    "        # pre-populate some necessary fields in preparation for upcoming steps\n",
    "        temp_conf_map[\"domains_temp\"] = {}\n",
    "        temp_conf_map[\"output_sink\"] = \"elasticsearch\"\n",
    "        temp_conf_map[\"full_html_extraction_enabled\"] = False\n",
    "        temp_conf_map[\"elasticsearch\"] = {\n",
    "            \"host\": \"\",\n",
    "            \"port\": \"\",\n",
    "            \"api_key\": \"\",\n",
    "        }\n",
    "        # populate the in-memory data structure\n",
    "        inflight_configuration_data[engine_oid] = temp_conf_map\n",
    "        print(f\"{engine_counter}.) {output_index}\")\n",
    "        engine_counter += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98657282-4027-40bb-9287-02b4a0381aec",
   "metadata": {},
   "source": [
    "### Step 2: URLs, Sitemaps, and Crawl Rules\n",
    "\n",
    "In the next cell, we will need to query Elasticsearch for information about each Crawler's domain URLs, seed URLs, sitemaps, and crawling rules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94121748-5944-4dc4-9456-cdc754c7126c",
   "metadata": {},
   "outputs": [],
   "source": [
    "crawler_counter = 1\n",
    "for engine_oid, crawler_config in inflight_configuration_data.items():\n",
    "    # get each crawler's domain details\n",
    "    crawler_domains = es_client.search(\n",
    "        index=\".ent-search-actastic-crawler_domains_v6\",\n",
    "        query={\"match\": {\"engine_oid\": engine_oid}},\n",
    "        _source=[\"crawl_rules\", \"id\", \"name\", \"seed_urls\", \"sitemaps\"],\n",
    "    )\n",
    "\n",
    "    print(f\"{crawler_counter}.) Engine ID {engine_oid}\")\n",
    "    crawler_counter += 1\n",
    "\n",
    "    for domain_info in crawler_domains[\"hits\"][\"hits\"]:\n",
    "        source = domain_info[\"_source\"]\n",
    "\n",
    "        # extract values\n",
    "        domain_oid = str(source[\"id\"])\n",
    "        domain_url = source[\"name\"]\n",
    "        seed_urls = source[\"seed_urls\"]\n",
    "        sitemap_urls = source[\"sitemaps\"]\n",
    "        crawl_rules = source[\"crawl_rules\"]\n",
    "\n",
    "        print(f\" Domain {domain_url} found!\")\n",
    "\n",
    "        # transform seed, sitemap, and crawl rules into arrays\n",
    "        seed_urls_list = []\n",
    "        for seed_obj in seed_urls:\n",
    "            seed_urls_list.append(seed_obj[\"url\"])\n",
    "\n",
    "        sitemap_urls_list = []\n",
    "        for sitemap_obj in sitemap_urls:\n",
    "            sitemap_urls_list.append(sitemap_obj[\"url\"])\n",
    "\n",
    "        crawl_rules_list = []\n",
    "        for crawl_rules_obj in crawl_rules:\n",
    "            crawl_rules_list.append(\n",
    "                {\n",
    "                    \"policy\": crawl_rules_obj[\"policy\"],\n",
    "                    \"type\": crawl_rules_obj[\"rule\"],\n",
    "                    \"pattern\": crawl_rules_obj[\"pattern\"],\n",
    "                }\n",
    "            )\n",
    "\n",
    "        # populate a temporary hashmap\n",
    "        temp_domain_conf = {\"url\": domain_url}\n",
    "        if seed_urls_list:\n",
    "            temp_domain_conf[\"seed_urls\"] = seed_urls_list\n",
    "            print(f\" Seed URLs found: {seed_urls_list}\")\n",
    "        if sitemap_urls_list:\n",
    "            temp_domain_conf[\"sitemap_urls\"] = sitemap_urls_list\n",
    "            print(f\" Sitemap URLs found: {sitemap_urls_list}\")\n",
    "        if crawl_rules_list:\n",
    "            temp_domain_conf[\"crawl_rules\"] = crawl_rules_list\n",
    "            print(f\" Crawl rules found: {crawl_rules_list}\")\n",
    "\n",
    "        # populate the in-memory data structure\n",
    "        inflight_configuration_data[engine_oid][\"domains_temp\"][\n",
    "            domain_oid\n",
    "        ] = temp_domain_conf\n",
    "        print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd711055-17a5-47c6-ba73-ff7e386a393b",
   "metadata": {},
   "source": [
    "### Step 3: Creating the Open Crawler YAML configuration files\n",
    "In this final step, we will create the actual YAML files you need to get up and running with Open Crawler!\n",
    "\n",
    "The next cell performs some final transformations to the in-memory data structure that is keeping track of your configurations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3e080c5-a528-47fd-9b79-0718d17d2560",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Final transform of the in-memory data structure to a form we can dump to YAML\n",
    "# for each crawler, collect all of its domain configurations into a list\n",
    "for engine_oid, crawler_config in inflight_configuration_data.items():\n",
    "    all_crawler_domains = []\n",
    "\n",
    "    for domain_config in crawler_config[\"domains_temp\"].values():\n",
    "        all_crawler_domains.append(domain_config)\n",
    "    # create a new key called \"domains\" that points to a list of domain configs only - no domain_oid values as keys\n",
    "    crawler_config[\"domains\"] = all_crawler_domains\n",
    "    # delete the temporary domain key\n",
    "    del crawler_config[\"domains_temp\"]\n",
    "    print(f\"Transform for {engine_oid} complete!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7681f3c5-2f88-4e06-ad6a-65bedb5634e0",
   "metadata": {},
   "source": [
    "#### **Wait! Before we continue on to creating our YAML files, we're going to need your input on a few things.**\n",
    "\n",
    "In the next cell, please enter the following details about the _Elasticsearch instance you will be using with Open Crawler_. This instance can be Elastic Cloud Hosted, Serverless, or a local instance.\n",
    "\n",
    "- The Elasticsearch endpoint URL\n",
    "- The port number of your Elasticsearch endpoint _(Optional, will default to 443 if left blank)_\n",
    "- An API key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6cfab1b-cf2e-4a41-9b54-492f59f7f61a",
   "metadata": {},
   "outputs": [],
   "source": [
    "ENDPOINT = getpass(\"Elasticsearch endpoint URL: \")\n",
    "PORT = getpass(\"[OPTIONAL] Elasticsearch endpoint port number: \")\n",
    "OUTPUT_API_KEY = getpass(\"Elasticsearch API key: \")\n",
    "\n",
    "# set the above values in each Crawler's configuration\n",
    "for crawler_config in inflight_configuration_data.values():\n",
    "    crawler_config[\"elasticsearch\"][\"host\"] = ENDPOINT\n",
    "    crawler_config[\"elasticsearch\"][\"port\"] = int(PORT) if PORT else 443\n",
    "    crawler_config[\"elasticsearch\"][\"api_key\"] = OUTPUT_API_KEY\n",
    "\n",
    "# ping ES to make sure we have positive connection\n",
    "es_client = Elasticsearch(\n",
    "    \":\".join([ENDPOINT, PORT]),\n",
    "    api_key=OUTPUT_API_KEY,\n",
    ")\n",
    "\n",
    "es_client.info()[\"tagline\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14ed1150-a965-4739-9111-6dfc435ced4e",
   "metadata": {},
   "source": [
    "#### **This is the final step! You have two options here:**\n",
    "\n",
    "- The \"Write to YAML\" cell will create _n_ YAML files, one for each Crawler you have.\n",
    "- The \"Print to output\" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.\n",
    "\n",
    "Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a203432-d0a6-4929-a109-5fbf5d3f5e56",
   "metadata": {},
   "source": [
    "#### Option 1: Write to YAML file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d6df1ff-1b47-45b8-a057-fc10cadf7dc5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dump each Crawler's configuration into its own YAML file\n",
    "for crawler_config in inflight_configuration_data.values():\n",
    "    base_dir = os.getcwd()\n",
    "    file_name = (\n",
    "        f\"{crawler_config['output_index']}-config.yml\"  # autogen a custom filename\n",
    "    )\n",
    "    output_path = os.path.join(base_dir, file_name)\n",
    "\n",
    "    if os.path.exists(base_dir):\n",
    "        with open(output_path, \"w\") as file:\n",
    "            yaml.safe_dump(crawler_config, file, sort_keys=False)\n",
    "            print(f\" Wrote {file_name} to {output_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49cedb05-9d4e-4c09-87a8-e15fb326ce12",
   "metadata": {},
   "source": [
    "#### Option 2: Print to output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e99d6c91-914b-4b58-ad7d-3af2db8fff4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "for crawler_config in inflight_configuration_data.values():\n",
    "    yaml_out = yaml.safe_dump(crawler_config, sort_keys=False)\n",
    "\n",
    "    print(f\"YAML config => {crawler_config['output_index']}-config.yml\\n--------\")\n",
    "    print(yaml_out)\n",
    "    print(\n",
    "        \"--------------------------------------------------------------------------------\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26b35570-f05d-4aa2-82fb-89ec37f84015",
   "metadata": {},
   "source": [
    "### Next Steps\n",
    "\n",
    "Now that the YAML files have been generated, you can visit the Open Crawler GitHub repository to learn more about how to deploy Open Crawler: https://github.com/elastic/crawler#quickstart\n",
    "\n",
    "Additionally, you can learn more about Open Crawler via the following blog posts:\n",
    "- [Open Crawler's promotion to beta release](https://www.elastic.co/search-labs/blog/elastic-open-crawler-beta-release)\n",
    "- [How to use Open Crawler with Semantic Text](https://www.elastic.co/search-labs/blog/semantic-search-open-crawler) to easily crawl websites and make them semantically searchable\n",
    "\n",
    "If you find any problems with this Notebook, please feel free to create an issue in the elasticsearch-labs repository: https://github.com/elastic/elasticsearch-labs/issues"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
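For reference, here is a rough sketch of what one generated `<engine-name>-config.yml` might look like, based on the fields the notebook populates (`output_index`, `output_sink`, `full_html_extraction_enabled`, `elasticsearch`, and `domains`). The engine name, domain, URLs, endpoint, and crawl rule below are hypothetical placeholders, not part of this commit; your actual values come from your App Search engines and the inputs you provide in the notebook.

```yaml
# Hypothetical example of a generated Open Crawler config; all values are placeholders
output_index: my-website          # App Search engine name, used as the destination index
output_sink: elasticsearch
full_html_extraction_enabled: false
elasticsearch:
  host: https://my-deployment.es.us-east-1.aws.elastic.cloud   # your endpoint URL
  port: 443
  api_key: <YOUR_API_KEY>
domains:
- url: https://www.example.com
  seed_urls:
  - https://www.example.com/blog
  sitemap_urls:
  - https://www.example.com/sitemap.xml
  crawl_rules:
  - policy: deny
    type: begins
    pattern: /internal
```

The Open Crawler quickstart linked in the notebook's Next Steps section covers how to run a crawl against a configuration file like this.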
