Commit 0b5cdd4

Merge pull request #246 from VinciGit00/main ("reallignment")

2 parents: 20604bd + 51200d6

110 files changed: +4276 / -4111 lines


.github/workflows/release.yml

Lines changed: 4 additions & 7 deletions
```diff
@@ -14,11 +14,8 @@ jobs:
         run: |
           sudo apt update
           sudo apt install -y git
-      - name: Install Python Env and Poetry
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.9'
-      - run: pip install poetry
+      - name: Install the latest version of rye
+        uses: eifinger/setup-rye@v3
      - name: Install Node Env
        uses: actions/setup-node@v4
        with:
@@ -30,8 +27,8 @@ jobs:
          persist-credentials: false
      - name: Build app
        run: |
-          poetry install
-          poetry build
+          rye sync --no-lock
+          rye build
        id: build_cache
        if: success()
      - name: Cache build
```

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -31,4 +31,6 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
 main.py
+*.python-version
+*.lock
```
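The two new `.gitignore` entries keep per-machine tool state out of version control (Rye pins an interpreter in `.python-version` and writes lock files). Git's ignore rules are not exactly `fnmatch` semantics, but for simple patterns like these the behavior coincides; a small sketch with hypothetical file names (`requirements.lock` and `requirements-dev.lock` are the names Rye typically writes, used here only for illustration):

```python
from fnmatch import fnmatch

# The two patterns added to .gitignore in this commit.
patterns = ["*.python-version", "*.lock"]

# Hypothetical working-tree files, for illustration only.
files = [".python-version", "requirements.lock", "requirements-dev.lock", "main.py"]

# A file is ignored if any pattern matches it.
ignored = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(ignored)  # ['.python-version', 'requirements.lock', 'requirements-dev.lock']
```

Note that `*.python-version` also matches the bare `.python-version` (the `*` matches the empty prefix), which is why the file added in this same commit stays untracked on other machines.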

.python-version

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+3.9.19
```

CHANGELOG.md

Lines changed: 237 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 59 additions & 125 deletions
````diff
@@ -8,14 +8,14 @@
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
 Just say which information you want to extract and the library will do it for you!
 
 <p align="center">
   <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
 </p>
 
-
 ## 🚀 Quick install
 
 The reference page for Scrapegraph-ai is available on the official page of pypy: [pypi](https://pypi.org/project/scrapegraphai/).
@@ -39,20 +39,23 @@ Try it directly on the web using Google Colab:
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
 
-Follow the procedure on the following link to setup your OpenAI API key: [link](https://scrapegraph-ai.readthedocs.io/en/latest/index.html).
-
 ## 📖 Documentation
 
 The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
 
-Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
+Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+
+It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
+
+### Case 1: SmartScraper using Local Models
 
-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +70,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434",  # set Ollama URL
-    }
+    },
+    "verbose": True,
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
@@ -82,160 +86,86 @@ print(result)
 
 ```
 
-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:
 
-Note: before using the local model remember to create the docker container!
-```text
-docker-compose up -d
-docker exec -it ollama ollama pull stablelm-zephyr
-```
-You can use which models available on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json",  # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000, # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
 
+### Case 2: SearchGraph using Mixed Models
 
-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY = "YOUR_API_KEY"
-
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
+We use **Groq** for the LLM and **Ollama** for the embeddings.
 
-result = smart_scraper_graph.run()
-print(result)
-```
-
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph
 
+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
     },
-    "headless": False
+    "max_results": 5,
 }
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```
 
+The output will be a list of recipes like the following:
 
-### Case 5: Extracting information using Azure
 ```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings
-
-lm_model_instance = AzureChatOpenAI(
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
-)
-
-embedder_model_instance = AzureOpenAIEmbeddings(
-    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-)
-graph_config = {
-    "llm": {"model_instance": llm_model_instance},
-    "embeddings": {"model_instance": embedder_model_instance}
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
-    event_end_date, event_end_time, location, event_mode, event_category,
-    third_party_redirect, no_of_days,
-    time_in_hours, hosted_or_attending, refreshments_type,
-    registration_available, registration_link""",
-    source="https://www.hmhco.com/event",
-    config=graph_config
-)
+{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
 ```
+### Case 3: SpeechGraph using OpenAI
+
+You just need to pass the OpenAI API key and the model name.
 
-### Case 6: Extracting information using Gemini
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-GOOGLE_APIKEY = "YOUR_API_KEY"
+from scrapegraphai.graphs import SpeechGraph
 
-# Define the configuration for the graph
 graph_config = {
     "llm": {
-        "api_key": GOOGLE_APIKEY,
-        "model": "gemini-pro",
+        "api_key": "OPENAI_API_KEY",
+        "model": "gpt-3.5-turbo",
+    },
+    "tts_model": {
+        "api_key": "OPENAI_API_KEY",
+        "model": "tts-1",
+        "voice": "alloy"
     },
+    "output_path": "audio_summary.mp3",
 }
 
-# Create the SmartScraperGraph instance
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    source="https://perinim.github.io/projects",
-    config=graph_config
+# ************************************************
+# Create the SpeechGraph instance and run it
+# ************************************************
+
+speech_graph = SpeechGraph(
+    prompt="Make a detailed audio summary of the projects.",
+    source="https://perinim.github.io/projects/",
+    config=graph_config,
 )
 
-result = smart_scraper_graph.run()
+result = speech_graph.run()
 print(result)
-```
 
-The output for all 3 the cases will be a dictionary with the extracted information, for example:
-
-```bash
-{
-    'titles': [
-        'Rotary Pendulum RL'
-    ],
-    'descriptions': [
-        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
-    ]
-}
 ```
 
+The output will be an audio file with the summary of the projects on the page.
+
 ## 🤝 Contributing
 
 Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
@@ -253,6 +183,10 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
 
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+## Sponsors
+<p align="center">
+  <a href="https://serpapi.com?utm_source=scrapegraphai"><img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"></a>
+</p>
 
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:
@@ -269,7 +203,7 @@ If you have used our library for research purposes please quote us with the foll
 ## Authors
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors Logos">
+  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
 </p>
 
 | | Contact Info |
````
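The three pipelines the README now documents (`SmartScraperGraph`, `SearchGraph`, `SpeechGraph`) share one call shape: build a `graph_config` dict, construct the graph with a prompt (plus a `source` for the single-page scrapers), and call `run()`. A minimal stand-in illustrating that shape (the `MockGraph` class is hypothetical and performs no scraping or LLM calls, it just echoes its inputs):

```python
# Hypothetical stand-in mirroring the call shape of the ScrapeGraphAI
# graphs shown in the README diff; no network or LLM access involved.
class MockGraph:
    def __init__(self, prompt, config, source=None):
        self.prompt = prompt
        self.source = source  # SearchGraph takes no source; hence the default
        self.config = config

    def run(self):
        # The real graphs return a dict of extracted data; we echo the inputs.
        return {"prompt": self.prompt, "source": self.source}

graph_config = {
    "llm": {"model": "ollama/mistral", "temperature": 0},
    "verbose": True,
}

graph = MockGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config,
)
result = graph.run()
print(result["source"])  # https://perinim.github.io/projects
```

The point of the shared shape is that swapping pipelines (or LLM backends) only changes the class and the `graph_config` contents, not the surrounding code.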

Binary assets changed (images, not rendered in the diff view):

- docs/assets/omniscrapergraph.png (72.2 KB)
- docs/assets/omnisearchgraph.png (56.7 KB)
- two further binary files (51.8 KB and 82 KB; filenames not shown in the capture)
- docs/assets/searchgraph.png (53.3 KB)
- docs/assets/serp_api_logo.png (15.1 KB)
- docs/assets/smartscrapergraph.png (59.7 KB)
- docs/assets/speechgraph.png (48.2 KB)

docs/source/conf.py

Lines changed: 22 additions & 6 deletions
```diff
@@ -14,20 +14,36 @@
 # import all the modules
 sys.path.insert(0, os.path.abspath('../../'))
 
-project = 'scrapegraphai'
-copyright = '2024, Marco Vinciguerra'
-author = 'Marco Vinciguerra'
+project = 'ScrapeGraphAI'
+copyright = '2024, ScrapeGraphAI'
+author = 'Marco Vinciguerra, Marco Perini, Lorenzo Padoan'
+
+html_last_updated_fmt = "%b %d, %Y"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
-extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']
+extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx_wagtail_theme']
 
 templates_path = ['_templates']
 exclude_patterns = []
 
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
-html_theme = 'sphinx_rtd_theme'
-html_static_path = ['_static']
+# html_theme = 'sphinx_rtd_theme'
+html_theme = 'sphinx_wagtail_theme'
+
+html_theme_options = dict(
+    project_name = "ScrapeGraphAI",
+    logo = "scrapegraphai_logo.png",
+    logo_alt = "ScrapeGraphAI",
+    logo_height = 59,
+    logo_url = "https://scrapegraph-ai.readthedocs.io/en/latest/",
+    logo_width = 45,
+    github_url = "https://github.com/VinciGit00/Scrapegraph-ai/tree/main/docs/source/",
+    footer_links = ",".join(
+        ["Landing Page|https://scrapegraphai.com/",
+         "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
+    ),
+)
```
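The `footer_links` value built in the new `html_theme_options` is, as I read the sphinx-wagtail-theme convention, a single comma-separated string of `Label|URL` pairs. A quick check of what the `join` in the diff actually produces, and how such a string splits back into pairs (the splitting code is illustrative, not part of the theme):

```python
# Reproduces the footer_links expression from the conf.py diff.
footer_links = ",".join(
    ["Landing Page|https://scrapegraphai.com/",
     "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
)
print(footer_links)
# Landing Page|https://scrapegraphai.com/,Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro

# Illustrative: split back into (label, url) pairs on "," then the first "|".
pairs = [tuple(item.split("|", 1)) for item in footer_links.split(",")]
```

Because the separator is a bare comma, none of the labels or URLs may themselves contain one.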

docs/source/getting_started/examples.rst

Lines changed: 5 additions & 2 deletions
```diff
@@ -1,7 +1,9 @@
 Examples
 ========
 
-Here some example of the different ways to scrape with ScrapegraphAI
+Let's suppose you want to scrape a website to get a list of projects with their descriptions.
+You can use the `SmartScraperGraph` class to do that.
+The following examples show how to use the `SmartScraperGraph` class with OpenAI models and local models.
 
 OpenAI models
 ^^^^^^^^^^^^^
@@ -78,7 +80,7 @@ After that, you can run the following code, using only your machine resources br
    # ************************************************
 
    smart_scraper_graph = SmartScraperGraph(
-      prompt="List me all the news with their description.",
+      prompt="List me all the projects with their description.",
       # also accepts a string with the already downloaded HTML code
       source="https://perinim.github.io/projects",
       config=graph_config
@@ -87,3 +89,4 @@ After that, you can run the following code, using only your machine resources br
    result = smart_scraper_graph.run()
    print(result)
 
+To find out how you can customize the `graph_config` dictionary, by using different LLM and adding new parameters, check the `Scrapers` section!
```
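The closing note added to examples.rst points readers at customizing `graph_config` with different LLMs and new parameters. One plain-Python way to do that (a sketch of ordinary dict merging, not a ScrapeGraphAI API; the parameter names are taken from the examples in this commit) is to keep a base config and layer overrides on top:

```python
# Base configuration as used in the local-model examples (values illustrative).
base_config = {
    "llm": {"model": "ollama/mistral", "temperature": 0},
}

# Add new top-level parameters without mutating the base dict:
# dict unpacking copies base_config, then the extra keys are applied.
custom_config = {**base_config, "verbose": True, "headless": False}

print(sorted(custom_config))  # ['headless', 'llm', 'verbose']
```

Note this is a shallow merge: the nested `"llm"` dict is shared, so overriding a single nested key (e.g. `temperature`) needs its own inner merge.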
