
Commit 932df8d

Merge pull request #238 from VinciGit00/gpt4-omni
2 parents: d76badd + a6e1813

25 files changed: +722 -42 lines

docs/assets/omniscrapergraph.png (72.2 KB)

docs/assets/omnisearchgraph.png (56.7 KB)

docs/source/scrapers/graph_config.rst

Lines changed: 2 additions & 0 deletions

@@ -10,6 +10,8 @@ Some interesting ones are:
 - `headless`: If set to `False`, the web browser will be opened on the URL requested and close right after the HTML is fetched.
 - `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
 - `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
+- `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
+- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
 
 Proxy Rotation
 ^^^^^^^^^^^^^^
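As a quick illustration (not part of the diff), the two new options sit alongside the existing keys in the graph configuration dictionary. A minimal sketch; the shape of the `proxy` value passed through `loader_kwargs` is an assumption here and depends on the underlying loader:

    # Sketch only: shows where the new keys go in a graph configuration.
    graph_config = {
        "llm": {
            "api_key": "YOUR_OPENAI_KEY",  # placeholder
            "model": "gpt-4o",
        },
        "loader_kwargs": {
            "proxy": {"server": "http://127.0.0.1:8080"},  # assumed proxy shape
        },
        "max_images": 5,  # cap on images analyzed by OmniScraperGraph / OmniSearchGraph
    }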

docs/source/scrapers/graphs.rst

Lines changed: 65 additions & 1 deletion

@@ -3,16 +3,80 @@ Graphs
 
 Graphs are scraping pipelines aimed at solving specific tasks. They are composed by nodes which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
 
-There are currently three types of graphs available in the library:
+There are three types of graphs available in the library:
 
 - **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information from using LLM.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
 
+With the introduction of `GPT-4o`, two new powerful graphs have been created:
+
+- **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
+- **OmniSearchGraph**: similar to `SearchGraph`, but with the ability to scrape images and describe them.
+
 .. note::
 
    They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the :ref:`LLM` and :ref:`Configuration` sections.
 
+OmniScraperGraph
+^^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/omniscrapergraph.png
+   :align: center
+   :width: 90%
+   :alt: OmniScraperGraph
+|
+
+First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the OmniScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
+It will fetch the data from the source and extract the information based on the prompt in JSON format.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import OmniScraperGraph
+
+   graph_config = {
+       "llm": {...},
+   }
+
+   omni_scraper_graph = OmniScraperGraph(
+       prompt="List me all the projects with their titles and image links and descriptions.",
+       source="https://perinim.github.io/projects",
+       config=graph_config
+   )
+
+   result = omni_scraper_graph.run()
+   print(result)
+
+OmniSearchGraph
+^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/omnisearchgraph.png
+   :align: center
+   :width: 80%
+   :alt: OmniSearchGraph
+|
+
+Similar to OmniScraperGraph, we define the graph configuration, create an instance of the OmniSearchGraph class, and run the graph.
+It will create a search query, fetch the first n results from the search engine, run n OmniScraperGraph instances, and return the results in JSON format.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import OmniSearchGraph
+
+   graph_config = {
+       "llm": {...},
+   }
+
+   # Create the OmniSearchGraph instance
+   omni_search_graph = OmniSearchGraph(
+       prompt="List me all Chioggia's famous dishes and describe their pictures.",
+       config=graph_config
+   )
+
+   # Run the graph
+   result = omni_search_graph.run()
+   print(result)
+
 SmartScraperGraph
 ^^^^^^^^^^^^^^^^^
 
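For anyone trying the snippets in this documentation page: the `"llm": {...}` placeholder is filled in by the example scripts added in this same commit. A minimal sketch along those lines (the `OPENAI_APIKEY` variable name and the model choices are taken from those examples):

    import os
    from dotenv import load_dotenv

    load_dotenv()

    graph_config = {
        "llm": {
            "api_key": os.getenv("OPENAI_APIKEY"),
            "model": "gpt-4o",  # the OmniScraper example below uses "gpt-4-turbo"
        },
        "max_images": 5,
    }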

Lines changed: 48 additions & 0 deletions

@@ -0,0 +1,48 @@
+"""
+Basic example of scraping pipeline using OmniScraper
+"""
+
+import os, json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import OmniScraperGraph
+from scrapegraphai.utils import prettify_exec_info
+
+load_dotenv()
+
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+openai_key = os.getenv("OPENAI_APIKEY")
+
+graph_config = {
+    "llm": {
+        "api_key": openai_key,
+        "model": "gpt-4-turbo",
+    },
+    "verbose": True,
+    "headless": True,
+    "max_images": 5
+}
+
+# ************************************************
+# Create the OmniScraperGraph instance and run it
+# ************************************************
+
+omni_scraper_graph = OmniScraperGraph(
+    prompt="List me all the projects with their titles and image links and descriptions.",
+    # also accepts a string with the already downloaded HTML code
+    source="https://perinim.github.io/projects/",
+    config=graph_config
+)
+
+result = omni_scraper_graph.run()
+print(json.dumps(result, indent=2))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = omni_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@
+"""
+Example of OmniSearchGraph
+"""
+
+import os, json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import OmniSearchGraph
+from scrapegraphai.utils import prettify_exec_info
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+openai_key = os.getenv("OPENAI_APIKEY")
+
+graph_config = {
+    "llm": {
+        "api_key": openai_key,
+        "model": "gpt-4o",
+    },
+    "max_results": 2,
+    "max_images": 5,
+    "verbose": True,
+}
+
+# ************************************************
+# Create the OmniSearchGraph instance and run it
+# ************************************************
+
+omni_search_graph = OmniSearchGraph(
+    prompt="List me all Chioggia's famous dishes and describe their pictures.",
+    config=graph_config
+)
+
+result = omni_search_graph.run()
+print(json.dumps(result, indent=2))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = omni_search_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
+

examples/openai/smart_scraper_openai.py

Lines changed: 2 additions & 2 deletions

@@ -19,7 +19,7 @@
 graph_config = {
     "llm": {
         "api_key": openai_key,
-        "model": "gpt-3.5-turbo",
+        "model": "gpt-4o",
     },
     "verbose": True,
     "headless": False,
@@ -30,7 +30,7 @@
 # ************************************************
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description.",
+    prompt="List me all the projects with their description",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects/",
     config=graph_config
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+"""
+Example of ImageToTextNode
+"""
+
+import os
+from dotenv import load_dotenv
+from scrapegraphai.nodes import ImageToTextNode
+from scrapegraphai.models import OpenAIImageToText
+
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+openai_key = os.getenv("OPENAI_APIKEY")
+
+graph_config = {
+    "llm": {
+        "api_key": openai_key,
+        "model": "gpt-4o",
+        "temperature": 0,
+    },
+}
+
+# ************************************************
+# Define the node
+# ************************************************
+
+llm_model = OpenAIImageToText(graph_config["llm"])
+
+image_to_text_node = ImageToTextNode(
+    input="img_url",
+    output=["img_desc"],
+    node_config={
+        "llm_model": llm_model,
+        "headless": False
+    }
+)
+
+# ************************************************
+# Test the node
+# ************************************************
+
+state = {
+    "img_url": [
+        "https://perinim.github.io/assets/img/rotary_pybullet.jpg",
+        "https://perinim.github.io/assets/img/value-policy-heatmaps.jpg",
+    ],
+}
+
+result = image_to_text_node.execute(state)
+
+print(result)

scrapegraphai/graphs/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -13,3 +13,5 @@
 from .json_scraper_graph import JSONScraperGraph
 from .csv_scraper_graph import CSVScraperGraph
 from .pdf_scraper_graph import PDFScraperGraph
+from .omni_scraper_graph import OmniScraperGraph
+from .omni_search_graph import OmniSearchGraph
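With these re-exports in place, both new graphs can be imported straight from the package, which is exactly what the example scripts above do:

    from scrapegraphai.graphs import OmniScraperGraph, OmniSearchGraph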

scrapegraphai/graphs/csv_scraper_graph.py

Lines changed: 2 additions & 2 deletions

@@ -30,8 +30,8 @@ def _create_graph(self):
         Creates the graph of nodes representing the workflow for web scraping.
         """
         fetch_node = FetchNode(
-            input="csv",
-            output=["doc"],
+            input="csv | csv_dir",
+            output=["doc", "link_urls", "img_urls"],
         )
         parse_node = ParseNode(
             input="doc",

scrapegraphai/graphs/deep_scraper_graph.py

Lines changed: 1 addition & 1 deletion

@@ -61,7 +61,7 @@ def _create_graph(self) -> BaseGraph:
         """
         fetch_node = FetchNode(
             input="url | local_dir",
-            output=["doc"]
+            output=["doc", "link_urls", "img_urls"]
         )
         parse_node = ParseNode(
             input="doc",

scrapegraphai/graphs/json_scraper_graph.py

Lines changed: 2 additions & 2 deletions

@@ -54,8 +54,8 @@ def _create_graph(self) -> BaseGraph:
         """
 
         fetch_node = FetchNode(
-            input="json",
-            output=["doc"],
+            input="json | json_dir",
+            output=["doc", "link_urls", "img_urls"],
         )
         parse_node = ParseNode(
             input="doc",
