
Commit fd6142e: Merge pull request #436 from ScrapeGraphAI/support ("Support")

2 parents 8f9f96f + 104d869

File tree: 65 files changed, +2795 / −154 lines


.github/workflows/pylint.yml (11 additions, 15 deletions)

```diff
@@ -1,30 +1,26 @@
-on: [push]
+on:
+  push:
+    paths:
+      - 'scrapegraphai/**'
+      - '.github/workflows/pylint.yml'
 
 jobs:
   build:
     runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.10"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v3
-      with:
-        python-version: ${{ matrix.python-version }}
+    - name: Install the latest version of rye
+      uses: eifinger/setup-rye@v3
     - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        pip install pylint
-        pip install -r requirements.txt
+      run: rye sync --no-lock
     - name: Analysing the code with pylint
-      run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py
+      run: rye run pylint-ci
     - name: Check Pylint score
       run: |
-        pylint_score=$(pylint --disable=all --enable=metrics --output-format=text scrapegraphai/**/*.py scrapegraphai/*.py | grep 'Raw metrics' | awk '{print $4}')
+        pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
         if (( $(echo "$pylint_score < 8" | bc -l) )); then
           echo "Pylint score is below 8. Blocking commit."
           exit 1
         else
           echo "Pylint score is acceptable."
-        fi
+        fi
```

docker-compose.yml (1 addition, 1 deletion)

```diff
@@ -4,7 +4,7 @@ services:
     image: ollama/ollama
     container_name: ollama
     ports:
-      - "5000:5000"
+      - "11434:11434"
     volumes:
      - ollama_volume:/root/.ollama
     restart: unless-stopped
```
Lines changed: 19 additions & 18 deletions

```diff
@@ -1,16 +1,17 @@
 # Local models
+# Local models
 The two websites benchmark are:
 - Example 1: https://perinim.github.io/projects
 - Example 2: https://www.wired.com (at 17/4/2024)
 
 Both are strored locally as txt file in .txt format because in this way we do not have to think about the internet connection
 
-| Hardware           | Model                                   | Example 1 | Example 2 |
-| ------------------ | --------------------------------------- | --------- | --------- |
-| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s    | 26.61s    |
-| Macbook m2 max     | Mistral on Ollama with nomic-embed-text | 8.05s     | 12.17s    |
-| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.87s    | 35.32s    |
-| Macbook m2 max     | Llama3 on Ollama with nomic-embed-text  | 18.36s    | 78.32s    |
+| Hardware               | Model                                   | Example 1 | Example 2 |
+| ---------------------- | --------------------------------------- | --------- | --------- |
+| Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 16.291s   | 38.74s    |
+| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
+| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 12.88s    | 13.84s    |
+| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  |           |           |
 
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama). Indeed the results are the following:
 
@@ -22,20 +23,20 @@ Both are strored locally as txt file in .txt format because in this way we do n
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.22                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 9.53                     | 449          | 272           | 177               | 1                   | 0.00803        |
-| Grooq with nomic-embed-text | 1.99                     | 474          | 284           | 190               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 4.132s                   | 438          | 303           | 135               | 1                   | 0.000724       |
+| gpt-4-turbo-preview             | 6.965s                   | 442          | 303           | 139               | 1                   | 0.0072         |
+| gpt-4-o                         | 4.446s                   | 444          | 305           | 139               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 1.335s                   | 648          | 482           | 166               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.89                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 64.70                    | 3573         | 2199          | 1374              | 1                   | 0.06321        |
-| Grooq with nomic-embed-text | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |
-
-
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 8.836s                   | 1167         | 726           | 441               | 1                   | 0.001971       |
+| gpt-4-turbo-preview             | 21.53s                   | 1205         | 726           | 479               | 1                   | 0.02163        |
+| gpt-4-o                         | 15.27s                   | 1400         | 715           | 685               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 3.82s                    | 2459         | 2192          | 267               | 1                   | 0              |
```
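As a quick sanity check on the Example 1 numbers in the updated table, a throwaway snippet can rank the runs by execution time (values copied verbatim from the table above; this is purely illustrative arithmetic, not part of the commit):

```python
# Execution times (seconds) for Example 1, copied from the benchmark table.
results = {
    "gpt-3.5-turbo": 4.132,
    "gpt-4-turbo-preview": 6.965,
    "gpt-4-o": 4.446,
    "Grooq with nomic-embed-text": 1.335,
}

# Rank fastest-first.
for name, secs in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {secs}s")
```

On these numbers, the Groq run is the fastest by roughly a 3x margin over the closest OpenAI model.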
Lines changed: 53 additions & 0 deletions (new file)

```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
```

examples/extras/example.yml (15 additions, new file)

```yaml
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json",
    # "base_url": "http://localhost:11434",
  },
  "embeddings": {
    "model": "ollama/nomic-embed-text",
    "temperature": 0,
    # "base_url": "http://localhost:11434",
  },
  "verbose": true,
  "headless": false
}
```
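Note that example.yml is written in JSON-style YAML flow syntax with `#` comments, which strict JSON parsers reject but `yaml.safe_load` (as used by load_yml.py in this commit) accepts. A minimal sketch of that behavior, using a trimmed-down copy of the same keys:

```python
import yaml

# JSON-style flow mapping with a '#' comment: valid YAML, not valid JSON.
config_text = """
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json"
    # "base_url": "http://localhost:11434"
  },
  "verbose": true,
  "headless": false
}
"""

config = yaml.safe_load(config_text)
print(config["llm"]["model"], config["verbose"])  # ollama/llama3 True
```

YAML also maps the lowercase literals `true`/`false` to Python booleans, which is why the file can be fed straight into the graph config.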

examples/extras/force_mode.py (54 additions, new file)

```python
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        # "format": "json",  # Ollama needs the format to be specified explicitly
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "force": True,
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

examples/extras/load_yml.py (32 additions, new file)

```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import yaml
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************
with open("example.yml", 'r') as file:
    graph_config = yaml.safe_load(file)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

examples/extras/no_cut.py (43 additions, new file)

```python
"""
This example shows how to do not process the html code in the fetch phase
"""

import os, json
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************


graph_config = {
    "llm": {
        "api_key": "s",
        "model": "gpt-3.5-turbo",
    },
    "cut": False,
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the python code inside the page",
    source="https://www.exploit-db.com/exploits/51447",
    config=graph_config
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

examples/extras/proxy_rotation.py (48 additions, new file)

```python
"""
Basic example of scraping pipeline using SmartScraper
"""

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": "API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "loader_kwargs": {
        "proxy" : {
            "server": "http:/**********",
            "username": "********",
            "password": "***",
        },
    },
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
