Commit d17814e

TOPIC: Parsing PDFs table contents for RAG supporting notebook (#341)

1 parent 2868e7d commit d17814e

File tree: 6 files changed, +1440 −1 lines changed


README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -51,7 +51,7 @@ Try out Playground in Kibana with the following notebooks:
   - [`Document Chunking with Ingest Pipelines`](./notebooks/document-chunking/with-index-pipelines.ipynb)
   - [`Document Chunking with LangChain Splitters`](./notebooks/document-chunking/with-langchain-splitters.ipynb)
   - [`Calculating tokens for Semantic Search (ELSER and E5)`](./notebooks/document-chunking/tokenization.ipynb)
-  - [`Fetch surrounding chucks`](./supporting-blog-content/fetch-surrounding-chunks/fetch-surrounding-chunks.ipynb)
+  - [`Fetch surrounding chunks`](./supporting-blog-content/fetch-surrounding-chunks/fetch-surrounding-chunks.ipynb)
 
 ### Search
```

Lines changed: 34 additions & 0 deletions

@@ -0,0 +1,34 @@

# PDF Parsing - Table Extraction

This Python notebook demonstrates an alternative approach to parsing PDFs, focusing on extracting tables and converting them into a format suitable for search applications such as Retrieval-Augmented Generation (RAG). The notebook uses Azure OpenAI to convert table data from PDFs into plain text for better searchability and indexing.
## Features

- **PDF Table Extraction**: The notebook identifies and parses tables from PDFs.
- **LLM Integration**: Calls Azure OpenAI models to provide a text representation of the extracted tables.
- **Search Optimization**: The parsed table data is processed into a format that can be more easily indexed and searched in Elasticsearch or other vector-based search systems.
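The notebook itself stops at writing the parsed text to a file. As a rough sketch of the indexing step this last bullet alludes to, the snippet below loads that output file into Elasticsearch with the official Python client; the index name, connection details, and blank-line chunking scheme are assumptions, not part of the notebook.

```python
# Hypothetical follow-on step (not in the notebook): index the parsed output
# file into Elasticsearch so the table summaries become searchable.
# Index name, connection details, and the blank-line chunking are assumed.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<es-api-key>")

with open("/tmp/parsed_file.txt") as f:
    # One document per "Page N Text" / "Table N ... Text Representation" section
    sections = [s.strip() for s in f.read().split("\n\n") if s.strip()]

for i, section in enumerate(sections):
    es.index(index="parsed-pdf-demo", id=str(i), document={"content": section})
```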

## Getting Started

### Prerequisites

- Python 3.x
- Output directory
  - Example: `/tmp`
- Parsed output file name
  - Example: `parsed_file.txt`
- Azure account
  - OpenAI deployment
    - Key
      - Example: a330xxxxxxxde9xxxxxx
    - Completions endpoint such as GPT-4o
      - Example: https://exampledeploy.openai.azure.com/openai/deployments/gpt-35-turbo-16k/chat/completions?api-version=2024-08-01-preview

For more information on getting started with Azure OpenAI, check out the official [Azure OpenAI ChatGPT Quickstart](https://learn.microsoft.com/en-us/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line%2Ctypescript%2Cpython-new&pivots=programming-language-studio).
## Example Use Case

This notebook is ideal for use cases where PDFs contain structured tables that need to be converted into plain text for indexing and search applications in environments like Elasticsearch or similar search systems.
@@ -0,0 +1,216 @@

# PDF Parsing - Table Extraction
## Objective

This Python script extracts text and tables from a PDF file, converts the tables into a human-readable text format using Azure OpenAI, and writes the processed content to a text file. The script uses pdfplumber to extract text and table data from each page of the PDF. For tables, it sends a cleaned version (handling any missing or None values) to Azure OpenAI, which generates a natural language summary of the table. The extracted non-table text and the summarized table text are then saved to a text file for easy search and readability.
```python
!pip install pdfplumber pandas
```
This code imports the libraries needed for PDF extraction, data processing, and calling Azure OpenAI over its REST API. It prompts for the Azure OpenAI API key, endpoint, and output file settings with getpass, and sets up the request headers for sending requests to the Azure OpenAI service.
```python
import pdfplumber
import pandas as pd
import requests
import base64
import json
from getpass import getpass
import io  # To create an in-memory file-like object
import os

# Endpoint example:
# https://my-deployment.openai.azure.com/openai/deployments/gpt-4o-global/chat/completions?api-version=2024-08-01-preview
ENDPOINT = getpass("Azure OpenAI Completions Endpoint: ")

API_KEY = getpass("Azure OpenAI API Key: ")

# Directory where the parsed output file will be written
PARSED_PDF_DIRECTORY = getpass("Output directory for parsed PDF: ")

# Name of the parsed output file
PARSED_PDF_FILE_NAME = getpass("Parsed PDF file name: ")

headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}
```
This code defines two functions: extract_table_text_from_openai and parse_pdf_from_url. The extract_table_text_from_openai function builds a request payload, sends a table's plain text to Azure OpenAI for conversion into a human-readable description, and handles the response. The parse_pdf_from_url function downloads a PDF and processes it page by page, extracting both text and tables; it sends each extracted table to Azure OpenAI for summarization and saves all the content (including the summarized tables) to a text file.
```python
def extract_table_text_from_openai(table_text):
    # Payload for the Azure OpenAI request
    payload = {
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are an AI assistant that helps convert tables into a human-readable text.",
                    }
                ],
            },
            {
                "role": "user",
                "content": f"Convert this table to a readable text format:\n{table_text}",
            },
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 4096,
    }

    # Send the request to Azure OpenAI
    try:
        response = requests.post(ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()  # Raise error if the request fails
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    # Process the response
    return (
        response.json()
        .get("choices", [{}])[0]
        .get("message", {})
        .get("content", "")
        .strip()
    )


def parse_pdf_from_url(file_url):
    # Download the PDF file from the URL
    response = requests.get(file_url)
    response.raise_for_status()  # Ensure the request was successful

    # Open the PDF content with pdfplumber using io.BytesIO
    pdf_content = io.BytesIO(response.content)

    # Ensure the directory exists and has write permissions
    os.makedirs(PARSED_PDF_DIRECTORY, mode=0o755, exist_ok=True)

    with pdfplumber.open(pdf_content) as pdf, open(
        os.path.join(PARSED_PDF_DIRECTORY, PARSED_PDF_FILE_NAME), "w"
    ) as output_file:
        for page_num, page in enumerate(pdf.pages, 1):
            print(f"Processing page {page_num}")

            # Extract text content
            text = page.extract_text()
            if text:
                output_file.write(f"Page {page_num} Text:\n")
                output_file.write(text + "\n\n")
                print("Text extracted:", text)

            # Extract tables
            tables = page.extract_tables()
            for idx, table in enumerate(tables):
                print(f"Table {idx + 1} found on page {page_num}")

                # Convert the table into plain text format, skipping the
                # header row and replacing None cells with empty strings
                table_text = "\n".join(
                    [
                        "\t".join(
                            [str(cell) if cell is not None else "" for cell in row]
                        )
                        for row in table[1:]
                    ]
                )

                # Call Azure OpenAI to convert the table into a text representation
                table_description = extract_table_text_from_openai(table_text)

                # Write the text representation to the file
                output_file.write(
                    f"Table {idx + 1} (Page {page_num}) Text Representation:\n"
                )
                output_file.write(table_description + "\n\n")
                print("Text representation of the table:", table_description)
```
```python
# URL of the PDF file
file_url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/refs/heads/sunman/supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/quarterly_report.pdf"

# Call the function to parse the PDF from the URL
parse_pdf_from_url(file_url)
```
Lines changed: 38 additions & 0 deletions

@@ -0,0 +1,38 @@

# Unifying Elastic Vector Database and LLMs for Intelligent Retrieval

## Overview

This notebook demonstrates how to integrate Elasticsearch as a vector database (VectorDB) with search templates and LLM functions to build an intelligent query layer. By leveraging vector search, dynamic query templates, and natural language processing, this approach enhances search precision, adaptability, and efficiency.
## Features

- **Elasticsearch as a VectorDB**: Efficiently stores and retrieves dense vector embeddings for advanced search capabilities.
- **Search Templates**: Dynamically structure queries by mapping user inputs to the appropriate index parameters (see the sketch below).
- **LLM Functions**: Extract key search parameters from natural language queries and inject them into search templates.
- **Hybrid Search**: Combines structured filtering with semantic search to improve search accuracy and relevance.
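For illustration, a stored search template along these lines maps extracted parameters into a hybrid query. This is a minimal sketch: the template id, field names, and geo filter are hypothetical, not taken from the notebook.

```python
# Hypothetical stored search template (id, fields, and filter are illustrative).
# A mustache template lets LLM-extracted parameters be injected safely at query time.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<es-api-key>")

es.put_script(
    id="hotel_search_template",
    script={
        "lang": "mustache",
        "source": {
            "query": {
                "bool": {
                    # Free-text part of the request
                    "must": {"match": {"description": "{{query_text}}"}},
                    # Structured geo filter from the geocoded location
                    "filter": {
                        "geo_distance": {
                            "distance": "{{distance}}km",
                            "location": "{{lat}},{{lon}}",
                        }
                    },
                }
            }
        },
    },
)
```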
## Components

- **Geocode Location Function**: Converts location names into latitude and longitude for geospatial queries.
- **Handle Extract Hotel Search Parameters Function**: Processes extracted search parameters, ensuring essential values like distance are correctly assigned.
- **Call Elasticsearch Function**: Executes structured search queries using dynamically populated search templates.
- **Format and Print Messages Functions**: Enhance query debugging by formatting and displaying responses in a structured format.
- **Run Conversation Function**: Orchestrates interactions between user queries, LLM functions, and Elasticsearch search execution.
- **Search Template Management Functions**: Define, create, and delete search templates to optimize query processing.
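The parameter-extraction step hands the LLM a function definition it can "call". Below is a minimal sketch of such a schema; the function name and fields are assumptions, and the notebook's actual schema may differ.

```python
# Hypothetical function (tool) schema for LLM parameter extraction.
# Name and fields are illustrative, not copied from the notebook.
extract_hotel_search_parameters = {
    "name": "extract_hotel_search_parameters",
    "description": "Extract structured hotel search parameters from a user question.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City or place name, later geocoded to lat/lon",
            },
            "distance": {
                "type": "number",
                "description": "Search radius in kilometers; a default is applied if absent",
            },
            "query_text": {
                "type": "string",
                "description": "Free-text part of the request, e.g. 'quiet, near the beach'",
            },
        },
        "required": ["location", "query_text"],
    },
}
```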
## Usage

1. Set up an Elasticsearch cluster and ensure vector search capabilities are enabled.
2. Define search templates to map query parameters to the index schema.
3. Use LLM functions to extract and refine search parameters from user queries.
4. Run queries using the intelligent query layer to retrieve more relevant and accurate results, as in the sketch below.
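Concretely, step 4 amounts to rendering the stored template with the extracted parameters, roughly like this (index name, template id, and parameter values are assumptions):

```python
# Hypothetical query execution: fill the stored search template with the
# parameters the LLM extracted. Index and template ids are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<es-api-key>")

response = es.search_template(
    index="hotels",
    id="hotel_search_template",
    params={"query_text": "quiet, near the beach", "distance": 5, "lat": 48.86, "lon": 2.35},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name"))
```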
### Prerequisites

- Elastic Cloud instance
  - With ML nodes
- Azure OpenAI
  - Completions endpoint such as GPT-4o
    - For more information on getting started with Azure OpenAI, check out the official [Azure OpenAI ChatGPT Quickstart](https://learn.microsoft.com/en-us/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line%2Ctypescript%2Cpython-new&pivots=programming-language-studio).
  - Azure OpenAI Key
- Google Maps API Key
  - https://developers.google.com/maps/documentation/embed/get-api-key