Gpt-4o integration #238

Merged: 4 commits, May 14, 2024
Binary file added docs/assets/omniscrapergraph.png
Binary file added docs/assets/omnisearchgraph.png
2 changes: 2 additions & 0 deletions docs/source/scrapers/graph_config.rst
@@ -10,6 +10,8 @@ Some interesting ones are:
- `headless`: If set to `False`, the web browser will be opened on the requested URL and closed right after the HTML is fetched.
- `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
- `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
- `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
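As a rough sketch, the two new options can be combined in a configuration like the following; the API key, model name, and proxy address below are illustrative placeholders, not values from this PR:

```python
# Illustrative configuration sketch; the API key, model name, and
# proxy address are placeholders.
graph_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",
        "model": "gpt-4o",
    },
    "max_images": 5,  # cap on images analyzed by OmniScraperGraph / OmniSearchGraph
    "loader_kwargs": {
        # additional parameters forwarded to the Loader class, e.g. a proxy
        "proxy": {"server": "http://127.0.0.1:8080"},
    },
}

print(graph_config["max_images"])
```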

Proxy Rotation
^^^^^^^^^^^^^^
66 changes: 65 additions & 1 deletion docs/source/scrapers/graphs.rst
@@ -3,16 +3,80 @@ Graphs

Graphs are scraping pipelines aimed at solving specific tasks. They are composed of nodes that can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).

There are currently three types of graphs available in the library:
There are three types of graphs available in the library:

- **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information from using an LLM.
- **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using an LLM. It is built on top of SmartScraperGraph.
- **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).

With the introduction of `GPT-4o`, two new powerful graphs have been created:

- **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
- **OmniSearchGraph**: similar to `SearchGraph`, but with the ability to scrape images and describe them.

.. note::

   They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the :ref:`LLM` and :ref:`Configuration` sections.

OmniScraperGraph
^^^^^^^^^^^^^^^^

.. image:: ../../assets/omniscrapergraph.png
   :align: center
   :width: 90%
   :alt: OmniScraperGraph

|

First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the OmniScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
It fetches the data from the source and extracts the information based on the prompt, returning it in JSON format.

.. code-block:: python

   from scrapegraphai.graphs import OmniScraperGraph

   graph_config = {
       "llm": {...},
   }

   omni_scraper_graph = OmniScraperGraph(
       prompt="List me all the projects with their titles and image links and descriptions.",
       source="https://perinim.github.io/projects",
       config=graph_config
   )

   result = omni_scraper_graph.run()
   print(result)

OmniSearchGraph
^^^^^^^^^^^^^^^

.. image:: ../../assets/omnisearchgraph.png
   :align: center
   :width: 80%
   :alt: OmniSearchGraph

|

Similar to OmniScraperGraph, we define the graph configuration, create an instance of the OmniSearchGraph class, and run the graph.
It builds a search query, fetches the first n results from the search engine, runs n OmniScraperGraph instances, and returns the results in JSON format.

.. code-block:: python

   from scrapegraphai.graphs import OmniSearchGraph

   graph_config = {
       "llm": {...},
   }

   # Create the OmniSearchGraph instance
   omni_search_graph = OmniSearchGraph(
       prompt="List me all Chioggia's famous dishes and describe their pictures.",
       config=graph_config
   )

   # Run the graph
   result = omni_search_graph.run()
   print(result)
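The fan-out described above (build a query, fetch the first n results, run one OmniScraperGraph per result, merge the answers) can be sketched in plain Python; `search` and `scrape_one` below are toy stand-ins, not library APIs:

```python
# Toy sketch of the OmniSearchGraph fan-out/merge pattern.
# search() and scrape_one() are hypothetical stand-ins, not library APIs.
def search(prompt):
    # stand-in for the search-engine step: return candidate result URLs
    return ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

def scrape_one(prompt, url):
    # stand-in for running one OmniScraperGraph instance on a single page
    return {"url": url, "answer": f"extracted for: {prompt}"}

def omni_search(prompt, max_results=2):
    urls = search(prompt)[:max_results]                   # first n search hits
    partials = [scrape_one(prompt, url) for url in urls]  # one scraper per URL
    return {"results": partials}                          # merged JSON-style answer

result = omni_search("Chioggia's famous dishes", max_results=2)
```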

SmartScraperGraph
^^^^^^^^^^^^^^^^^

48 changes: 48 additions & 0 deletions examples/openai/omni_scraper_openai.py
@@ -0,0 +1,48 @@
"""
Basic example of a scraping pipeline using OmniScraperGraph
"""

import os, json
from dotenv import load_dotenv
from scrapegraphai.graphs import OmniScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4-turbo",
    },
    "verbose": True,
    "headless": True,
    "max_images": 5
}

# ************************************************
# Create the OmniScraperGraph instance and run it
# ************************************************

omni_scraper_graph = OmniScraperGraph(
    prompt="List me all the projects with their titles and image links and descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = omni_scraper_graph.run()
print(json.dumps(result, indent=2))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = omni_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
45 changes: 45 additions & 0 deletions examples/openai/omni_search_graph_openai.py
@@ -0,0 +1,45 @@
"""
Example of OmniSearchGraph
"""

import os, json
from dotenv import load_dotenv
from scrapegraphai.graphs import OmniSearchGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
    "max_results": 2,
    "max_images": 5,
    "verbose": True,
}

# ************************************************
# Create the OmniSearchGraph instance and run it
# ************************************************

omni_search_graph = OmniSearchGraph(
    prompt="List me all Chioggia's famous dishes and describe their pictures.",
    config=graph_config
)

result = omni_search_graph.run()
print(json.dumps(result, indent=2))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = omni_search_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

4 changes: 2 additions & 2 deletions examples/openai/smart_scraper_openai.py
@@ -19,7 +19,7 @@
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
        "model": "gpt-4o",
    },
    "verbose": True,
    "headless": False,
@@ -30,7 +30,7 @@
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
54 changes: 54 additions & 0 deletions examples/single_node/image2text_node.py
@@ -0,0 +1,54 @@
"""
Example of ImageToTextNode
"""

import os
from dotenv import load_dotenv
from scrapegraphai.nodes import ImageToTextNode
from scrapegraphai.models import OpenAIImageToText

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
        "temperature": 0,
    },
}

# ************************************************
# Define the node
# ************************************************

llm_model = OpenAIImageToText(graph_config["llm"])

image_to_text_node = ImageToTextNode(
    input="img_url",
    output=["img_desc"],
    node_config={
        "llm_model": llm_model,
        "headless": False
    }
)

# ************************************************
# Test the node
# ************************************************

state = {
    "img_url": [
        "https://perinim.github.io/assets/img/rotary_pybullet.jpg",
        "https://perinim.github.io/assets/img/value-policy-heatmaps.jpg",
    ],
}

result = image_to_text_node.execute(state)

print(result)
2 changes: 2 additions & 0 deletions scrapegraphai/graphs/__init__.py
@@ -13,3 +13,5 @@
from .json_scraper_graph import JSONScraperGraph
from .csv_scraper_graph import CSVScraperGraph
from .pdf_scraper_graph import PDFScraperGraph
from .omni_scraper_graph import OmniScraperGraph
from .omni_search_graph import OmniSearchGraph
4 changes: 2 additions & 2 deletions scrapegraphai/graphs/csv_scraper_graph.py
@@ -30,8 +30,8 @@ def _create_graph(self):
        Creates the graph of nodes representing the workflow for web scraping.
        """
        fetch_node = FetchNode(
            input="csv",
            output=["doc"],
            input="csv | csv_dir",
            output=["doc", "link_urls", "img_urls"],
        )
        parse_node = ParseNode(
            input="doc",
2 changes: 1 addition & 1 deletion scrapegraphai/graphs/deep_scraper_graph.py
@@ -61,7 +61,7 @@ def _create_graph(self) -> BaseGraph:
        """
        fetch_node = FetchNode(
            input="url | local_dir",
            output=["doc"]
            output=["doc", "link_urls", "img_urls"]
        )
        parse_node = ParseNode(
            input="doc",
4 changes: 2 additions & 2 deletions scrapegraphai/graphs/json_scraper_graph.py
@@ -54,8 +54,8 @@ def _create_graph(self) -> BaseGraph:
        """

        fetch_node = FetchNode(
            input="json",
            output=["doc"],
            input="json | json_dir",
            output=["doc", "link_urls", "img_urls"],
        )
        parse_node = ParseNode(
            input="doc",