v0.5.0 #117

Merged 35 commits on Apr 30, 2024
45b2317
add json examples
VinciGit00 Apr 29, 2024
bec79ba
Merge pull request #104 from VinciGit00/main
VinciGit00 Apr 29, 2024
0b25d9a
Update search_link_node.py
VinciGit00 Apr 29, 2024
674e642
add first new graphs
VinciGit00 Apr 29, 2024
3eacc6f
add paths
VinciGit00 Apr 29, 2024
deb920a
fixing json and example
VinciGit00 Apr 29, 2024
f891732
add xml_example
VinciGit00 Apr 29, 2024
0999708
Merge pull request #106 from VinciGit00/refactor_search_node
PeriniM Apr 29, 2024
7dd5b1a
feat: base groq + requirements + toml update with groq
lurenss Apr 29, 2024
ecaef43
Merge pull request #107 from VinciGit00/main
VinciGit00 Apr 29, 2024
a449ed1
Merge branch 'pre/beta' into 93-groq-model-implementation
VinciGit00 Apr 29, 2024
7a48204
Merge pull request #108 from VinciGit00/93-groq-model-implementation
VinciGit00 Apr 29, 2024
6e0c001
updated lock file
PeriniM Apr 30, 2024
dbbf10f
feat(llm): implemented groq model
PeriniM Apr 30, 2024
d368725
feat: updated requirements.txt
PeriniM Apr 30, 2024
719a353
feat: add co-author
PeriniM Apr 30, 2024
ae2971c
Merge pull request #111 from VinciGit00/groq-implementation
PeriniM Apr 30, 2024
450291f
ci(release): 0.5.0-beta.1 [skip ci]
semantic-release-bot Apr 30, 2024
42ab0aa
feat(fetch): added playwright support
PeriniM Apr 30, 2024
e494455
Merge pull request #113 from VinciGit00/playwright
VinciGit00 Apr 30, 2024
ff7d12f
ci(release): 0.5.0-beta.2 [skip ci]
semantic-release-bot Apr 30, 2024
e0ffc83
feat: add cluade integration
VinciGit00 Apr 30, 2024
b79ef22
Merge pull request #114 from VinciGit00/integration_claude
VinciGit00 Apr 30, 2024
7e81f7c
ci(release): 0.5.0-beta.3 [skip ci]
semantic-release-bot Apr 30, 2024
da2c82a
add json and xml scraper
VinciGit00 Apr 30, 2024
e3d0194
fix: script generator and add new benchmarks
VinciGit00 Apr 30, 2024
14e56f6
ci(release): 0.5.0-beta.4 [skip ci]
semantic-release-bot Apr 30, 2024
59594cb
add grow example
VinciGit00 Apr 30, 2024
da95c18
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
VinciGit00 Apr 30, 2024
8fba7e5
feat(refactor): changed variable names
PeriniM Apr 30, 2024
d592d27
Merge pull request #115 from VinciGit00/101-scrape-json-files
PeriniM Apr 30, 2024
5ac97e2
ci(release): 0.5.0-beta.5 [skip ci]
semantic-release-bot Apr 30, 2024
2dd7817
feat: added verbose flag to suppress print statements
PeriniM Apr 30, 2024
84e6fac
Merge pull request #116 from VinciGit00/feat/verbose_flag
VinciGit00 Apr 30, 2024
9356124
ci(release): 0.5.0-beta.6 [skip ci]
semantic-release-bot Apr 30, 2024
1 change: 0 additions & 1 deletion .gitignore
@@ -29,7 +29,6 @@ venv/
*.google-cookie
examples/graph_examples/ScrapeGraphAI_generated_graph
examples/**/*.csv
examples/**/*.json
main.py
poetry.lock

50 changes: 50 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,53 @@
## [0.5.0-beta.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.5...v0.5.0-beta.6) (2024-04-30)


### Features

* added verbose flag to suppress print statements ([2dd7817](https://github.com/VinciGit00/Scrapegraph-ai/commit/2dd7817cfb37cfbeb7e65b3a24655ab238f48026))

## [0.5.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.4...v0.5.0-beta.5) (2024-04-30)


### Features

* **refactor:** changed variable names ([8fba7e5](https://github.com/VinciGit00/Scrapegraph-ai/commit/8fba7e5490f916b325588443bba3fff5c0733c17))

## [0.5.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.3...v0.5.0-beta.4) (2024-04-30)


### Bug Fixes

* script generator and add new benchmarks ([e3d0194](https://github.com/VinciGit00/Scrapegraph-ai/commit/e3d0194dc93b20dc254fc48bba11559bf8a3a185))

## [0.5.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.2...v0.5.0-beta.3) (2024-04-30)


### Features

* add cluade integration ([e0ffc83](https://github.com/VinciGit00/Scrapegraph-ai/commit/e0ffc838b06c0f024026a275fc7f7b4243ad5cf9))

## [0.5.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.1...v0.5.0-beta.2) (2024-04-30)


### Features

* **fetch:** added playwright support ([42ab0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/42ab0aa1d275b5798ab6fc9feea575fe59b6e767))

## [0.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.1...v0.5.0-beta.1) (2024-04-30)


### Features

* add co-author ([719a353](https://github.com/VinciGit00/Scrapegraph-ai/commit/719a353410992cc96f46ec984a5d3ec372e71ad2))
* base groq + requirements + toml update with groq ([7dd5b1a](https://github.com/VinciGit00/Scrapegraph-ai/commit/7dd5b1a03327750ffa5b2fb647eda6359edd1fc2))
* **llm:** implemented groq model ([dbbf10f](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbbf10fc77b34d99d64c6cd7f74524b6d8e57fa5))
* updated requirements.txt ([d368725](https://github.com/VinciGit00/Scrapegraph-ai/commit/d36872518a6d234eba5f8b7ddca7da93797874b2))


### CI

* **release:** 0.4.0-beta.3 [skip ci] ([d13321b](https://github.com/VinciGit00/Scrapegraph-ai/commit/d13321b2f86d98e2a3a0c563172ca0dd29cdf5fb))

## [0.4.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.0...v0.4.1) (2024-04-28)


38 changes: 37 additions & 1 deletion README.md
@@ -23,6 +23,10 @@ The reference page for Scrapegraph-ai is available on the official PyPI page:
```bash
pip install scrapegraphai
```
You will also need to install Playwright for JavaScript-based scraping:
```bash
playwright install
```
## 🔍 Demo
Official streamlit demo:

@@ -46,6 +50,7 @@ You can use the `SmartScraper` class to extract information from a website using
The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
### Case 1: Extracting information using Ollama
Remember to download the model on Ollama separately!

```python
from scrapegraphai.graphs import SmartScraperGraph

@@ -129,7 +134,38 @@ result = smart_scraper_graph.run()
print(result)
```

### Case 4: Extracting information using Gemini
### Case 4: Extracting information using Groq
```python
import os

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434",
},
"headless": False
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the projects with their description and the author.",
source="https://perinim.github.io/projects",
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
```
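This PR also introduces a verbose flag (commit 2dd7817) to suppress print statements during graph execution. A minimal sketch of how it might slot into the configuration dict above; the top-level key name and its placement are assumptions, not confirmed API:

```python
# Hypothetical configuration extending the Groq example above.
# The "verbose" key mirrors the flag added in commit 2dd7817;
# its exact name and placement here are assumptions.
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "YOUR_GROQ_APIKEY",  # placeholder, not a real key
        "temperature": 0,
    },
    "headless": False,
    "verbose": False,  # suppress intermediate print statements
}

print(graph_config["verbose"])  # False
```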

### Case 5: Extracting information using Gemini
```python
from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"
28 changes: 15 additions & 13 deletions examples/benchmarks/GenerateScraper/Readme.md
@@ -1,4 +1,5 @@
# Local models
The two websites benchmarked are:
- Example 1: https://perinim.github.io/projects
- Example 2: https://www.wired.com (at 17/4/2024)
@@ -9,14 +10,12 @@ The time is measured in seconds

The model run for this benchmark is Mistral on Ollama with nomic-embed-text.

In particular, it is tested with ScriptCreatorGraph.

| Hardware | Model | Example 1 | Example 2 |
| ---------------------- | --------------------------------------- | --------- | --------- |
| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 30.54s | 35.76s |
| Macbook m2 max         | Mistral on Ollama with nomic-embed-text | 18.46s    | 19.59s    |
| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text | 27.82s | 29.98s |
| Macbook m2 max<br> | Llama3 on Ollama with nomic-embed-text | 20.83s | 12.29s |
| Macbook m2 max | Mistral on Ollama with nomic-embed-text | | |
| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text | 27.82s | 29.986s |
| Macbook m2 max<br> | Llama3 on Ollama with nomic-embed-text | | |


**Note**: the Docker examples were not run on devices other than the Macbook because performance is too slow (10 times slower than Ollama).
@@ -25,17 +24,20 @@ In particular, it is tested with ScriptCreatorGraph
Expand All @@ -25,17 +24,20 @@ In particular, is tested with ScriptCreatorGraph
**URL**: https://perinim.github.io/projects
**Task**: List me all the projects with their description.

| Name | Execution time | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ---------------| ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 4.50s | 1897 | 1802 | 95 | 1 | 0.002893 |
| gpt-4-turbo | 7.88s | 1920 | 1802 | 118 | 1 | 0.02156 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 24.21 | 1892 | 1802 | 90 | 1 | 0.002883 |
| gpt-4-turbo-preview | 6.614 | 1936 | 1802 | 134 | 1 | 0.02204 |
| Groq with nomic-embed-text  | 6.71                     | 2201         | 2024          | 177               | 1                   | 0              |
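As a sanity check, the gpt-3.5-turbo row above can be reproduced from its token counts, assuming OpenAI's 2024 list prices of $0.0015 per 1K prompt tokens and $0.002 per 1K completion tokens (the pricing is an assumption; it is not stated in the table):

```python
# Recompute the reported gpt-3.5-turbo cost from the table's token counts.
prompt_tokens = 1802
completion_tokens = 90

# Assumed per-1K-token prices for gpt-3.5-turbo (not stated in the table).
cost = prompt_tokens * 0.0015 / 1000 + completion_tokens * 0.002 / 1000
print(f"{cost:.6f}")  # 0.002883, matching the table's total_cost_USD
```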

### Example 2: Wired
**URL**: https://www.wired.com
**Task**: List me all the articles with their description.

| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | Error (text too long) | - | - | - | - | - |
| gpt-4-turbo         | Error (TPM limit reached) | -            | -             | -                 | -                   | -              |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | | | | | | |
| gpt-4-turbo-preview | | | | | | |
| Groq with nomic-embed-text  |                          |              |               |                   |                     |                |


61 changes: 61 additions & 0 deletions examples/benchmarks/GenerateScraper/benchmark_groq.py
@@ -0,0 +1,61 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
"List me all the articles with their description."]

# ************************************************
# Define the configuration for the graph
# ************************************************

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434", # set ollama URL arbitrarily
},
"headless": False,
"library": "beautifoulsoup"
}


# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
with open(files[i], 'r', encoding="utf-8") as file:
text = file.read()

smart_scraper_graph = ScriptCreatorGraph(
prompt=tasks[i],
source=text,
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
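The execution times reported in the benchmark Readmes are wall-clock seconds; a library-independent sketch of how such timings could be collected around a graph run (the workload below is a stand-in for a real `smart_scraper_graph.run()` call):

```python
import time

def timed_run(task):
    """Run `task` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload instead of a real graph run.
result, seconds = timed_run(lambda: sum(range(1000)))
print(result, f"{seconds:.2f}s")
```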
5 changes: 0 additions & 5 deletions examples/benchmarks/GenerateScraper/benchmark_llama3.py
@@ -2,11 +2,8 @@
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Read the text file
@@ -19,8 +16,6 @@
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("GPT4_KEY")


graph_config = {
"llm": {
28 changes: 14 additions & 14 deletions examples/benchmarks/SmartScraper/Readme.md
@@ -5,37 +5,37 @@ The two websites benchmarked are:

Both are stored locally as .txt files so that we do not have to rely on an internet connection.

In particular, it is tested with SmartScraper.

| Hardware | Moodel | Example 1 | Example 2 |
| Hardware | Model | Example 1 | Example 2 |
| ------------------ | --------------------------------------- | --------- | --------- |
| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s | 26.61s |
| Macbook m2 max | Mistral on Ollama with nomic-embed-text | 8.05s | 12.17s |
| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text | 29.871s | 35.32s |
| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text | 29.87s | 35.32s |
| Macbook m2 max | Llama3 on Ollama with nomic-embed-text | 18.36s | 78.32s |


**Note**: the Docker examples were not run on devices other than the Macbook because performance is too slow (10 times slower than Ollama). The results are the following:

| Hardware | Example 1 | Example 2 |
| ------------------ | --------- | --------- |
| Macbook 14' m1 pro | 139.89s | Too long |
| Macbook 14' m1 pro | 139.89 | Too long |
# Performance on API services
### Example 1: personal portfolio
**URL**: https://perinim.github.io/projects
**Task**: List me all the projects with their description.

| Name | Execution time | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ---------------| ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 5.58s | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo | 9.76s | 445 | 272 | 173 | 1 | 0.00791 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 25.22 | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo-preview | 9.53 | 449 | 272 | 177 | 1 | 0.00803 |
| Groq with nomic-embed-text  | 1.99                     | 474          | 284           | 190               | 1                   | 0              |

### Example 2: Wired
**URL**: https://www.wired.com
**Task**: List me all the articles with their description.

| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 6.50 | 2442 | 2199 | 243 | 1 | 0.003784 |
| gpt-4-turbo | 76.07 | 3521 | 2199 | 1322 | 1 | 0.06165 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 25.89 | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo-preview | 64.70 | 3573 | 2199 | 1374 | 1 | 0.06321 |
| Groq with nomic-embed-text  | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |


57 changes: 57 additions & 0 deletions examples/benchmarks/SmartScraper/benchmark_groq.py
@@ -0,0 +1,57 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
"List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434", # set ollama URL arbitrarily
},
"headless": False
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
with open(files[i], 'r', encoding="utf-8") as file:
text = file.read()

smart_scraper_graph = SmartScraperGraph(
prompt=tasks[i],
source=text,
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
1 change: 0 additions & 1 deletion examples/benchmarks/SmartScraper/benchmark_llama3.py
@@ -2,7 +2,6 @@
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
