v0.5.0 #117

Merged 35 commits on Apr 30, 2024
45b2317
add json examples
VinciGit00 Apr 29, 2024
bec79ba
Merge pull request #104 from VinciGit00/main
VinciGit00 Apr 29, 2024
0b25d9a
Update search_link_node.py
VinciGit00 Apr 29, 2024
674e642
add first new graphs
VinciGit00 Apr 29, 2024
3eacc6f
add paths
VinciGit00 Apr 29, 2024
deb920a
fixing json and example
VinciGit00 Apr 29, 2024
f891732
add xml_example
VinciGit00 Apr 29, 2024
0999708
Merge pull request #106 from VinciGit00/refactor_search_node
PeriniM Apr 29, 2024
7dd5b1a
feat: base groq + requirements + toml update with groq
lurenss Apr 29, 2024
ecaef43
Merge pull request #107 from VinciGit00/main
VinciGit00 Apr 29, 2024
a449ed1
Merge branch 'pre/beta' into 93-groq-model-implementation
VinciGit00 Apr 29, 2024
7a48204
Merge pull request #108 from VinciGit00/93-groq-model-implementation
VinciGit00 Apr 29, 2024
6e0c001
updated lock file
PeriniM Apr 30, 2024
dbbf10f
feat(llm): implemented groq model
PeriniM Apr 30, 2024
d368725
feat: updated requirements.txt
PeriniM Apr 30, 2024
719a353
feat: add co-author
PeriniM Apr 30, 2024
ae2971c
Merge pull request #111 from VinciGit00/groq-implementation
PeriniM Apr 30, 2024
450291f
ci(release): 0.5.0-beta.1 [skip ci]
semantic-release-bot Apr 30, 2024
42ab0aa
feat(fetch): added playwright support
PeriniM Apr 30, 2024
e494455
Merge pull request #113 from VinciGit00/playwright
VinciGit00 Apr 30, 2024
ff7d12f
ci(release): 0.5.0-beta.2 [skip ci]
semantic-release-bot Apr 30, 2024
e0ffc83
feat: add cluade integration
VinciGit00 Apr 30, 2024
b79ef22
Merge pull request #114 from VinciGit00/integration_claude
VinciGit00 Apr 30, 2024
7e81f7c
ci(release): 0.5.0-beta.3 [skip ci]
semantic-release-bot Apr 30, 2024
da2c82a
add json and xml scraper
VinciGit00 Apr 30, 2024
e3d0194
fix: script generator and add new benchmarks
VinciGit00 Apr 30, 2024
14e56f6
ci(release): 0.5.0-beta.4 [skip ci]
semantic-release-bot Apr 30, 2024
59594cb
add grow example
VinciGit00 Apr 30, 2024
da95c18
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
VinciGit00 Apr 30, 2024
8fba7e5
feat(refactor): changed variable names
PeriniM Apr 30, 2024
d592d27
Merge pull request #115 from VinciGit00/101-scrape-json-files
PeriniM Apr 30, 2024
5ac97e2
ci(release): 0.5.0-beta.5 [skip ci]
semantic-release-bot Apr 30, 2024
2dd7817
feat: added verbose flag to suppress print statements
PeriniM Apr 30, 2024
84e6fac
Merge pull request #116 from VinciGit00/feat/verbose_flag
VinciGit00 Apr 30, 2024
9356124
ci(release): 0.5.0-beta.6 [skip ci]
semantic-release-bot Apr 30, 2024
1 change: 0 additions & 1 deletion .gitignore
@@ -29,7 +29,6 @@ venv/
*.google-cookie
examples/graph_examples/ScrapeGraphAI_generated_graph
examples/**/*.csv
examples/**/*.json
main.py
poetry.lock

50 changes: 50 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,53 @@
## [0.5.0-beta.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.5...v0.5.0-beta.6) (2024-04-30)


### Features

* added verbose flag to suppress print statements ([2dd7817](https://github.com/VinciGit00/Scrapegraph-ai/commit/2dd7817cfb37cfbeb7e65b3a24655ab238f48026))

## [0.5.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.4...v0.5.0-beta.5) (2024-04-30)


### Features

* **refactor:** changed variable names ([8fba7e5](https://github.com/VinciGit00/Scrapegraph-ai/commit/8fba7e5490f916b325588443bba3fff5c0733c17))

## [0.5.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.3...v0.5.0-beta.4) (2024-04-30)


### Bug Fixes

* script generator and add new benchmarks ([e3d0194](https://github.com/VinciGit00/Scrapegraph-ai/commit/e3d0194dc93b20dc254fc48bba11559bf8a3a185))

## [0.5.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.2...v0.5.0-beta.3) (2024-04-30)


### Features

* add cluade integration ([e0ffc83](https://github.com/VinciGit00/Scrapegraph-ai/commit/e0ffc838b06c0f024026a275fc7f7b4243ad5cf9))

## [0.5.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.1...v0.5.0-beta.2) (2024-04-30)


### Features

* **fetch:** added playwright support ([42ab0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/42ab0aa1d275b5798ab6fc9feea575fe59b6e767))

## [0.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.1...v0.5.0-beta.1) (2024-04-30)


### Features

* add co-author ([719a353](https://github.com/VinciGit00/Scrapegraph-ai/commit/719a353410992cc96f46ec984a5d3ec372e71ad2))
* base groq + requirements + toml update with groq ([7dd5b1a](https://github.com/VinciGit00/Scrapegraph-ai/commit/7dd5b1a03327750ffa5b2fb647eda6359edd1fc2))
* **llm:** implemented groq model ([dbbf10f](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbbf10fc77b34d99d64c6cd7f74524b6d8e57fa5))
* updated requirements.txt ([d368725](https://github.com/VinciGit00/Scrapegraph-ai/commit/d36872518a6d234eba5f8b7ddca7da93797874b2))


### CI

* **release:** 0.4.0-beta.3 [skip ci] ([d13321b](https://github.com/VinciGit00/Scrapegraph-ai/commit/d13321b2f86d98e2a3a0c563172ca0dd29cdf5fb))

## [0.4.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.0...v0.4.1) (2024-04-28)


38 changes: 37 additions & 1 deletion README.md
@@ -23,6 +23,10 @@ The reference page for Scrapegraph-ai is available on the official PyPI page:
```bash
pip install scrapegraphai
```
You will also need to install Playwright for JavaScript-based scraping:
```bash
playwright install
```
## 🔍 Demo
Official streamlit demo:

@@ -46,6 +50,7 @@ You can use the `SmartScraper` class to extract information from a website using
The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
### Case 1: Extracting information using Ollama
Remember to download the model on Ollama separately!

```python
from scrapegraphai.graphs import SmartScraperGraph

@@ -129,7 +134,38 @@ result = smart_scraper_graph.run()
print(result)
```

### Case 4: Extracting information using Gemini
### Case 4: Extracting information using Groq
```python
import os

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434",
},
"headless": False
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the projects with their description and the author.",
source="https://perinim.github.io/projects",
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
```
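This PR also introduces a verbose flag (commit 2dd7817) to suppress print statements during graph execution. A minimal sketch of how it might slot into the configuration dict above; the top-level key name and its placement are assumptions, not confirmed API:

```python
# Hypothetical configuration extending the Groq example above.
# The "verbose" key mirrors the flag added in commit 2dd7817;
# its exact name and placement here are assumptions.
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "YOUR_GROQ_APIKEY",  # placeholder, not a real key
        "temperature": 0,
    },
    "headless": False,
    "verbose": False,  # suppress intermediate print statements
}

print(graph_config["verbose"])  # False
```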

### Case 5: Extracting information using Gemini
```python
from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"
28 changes: 15 additions & 13 deletions examples/benchmarks/GenerateScraper/Readme.md
@@ -1,4 +1,5 @@
# Local models
The two websites benchmarked are:
- Example 1: https://perinim.github.io/projects
- Example 2: https://www.wired.com (at 17/4/2024)
@@ -9,14 +10,12 @@ The time is measured in seconds

The model run for this benchmark is Mistral on Ollama with nomic-embed-text.

In particular, it is tested with ScriptCreatorGraph.

| Hardware | Model | Example 1 | Example 2 |
| ---------------------- | --------------------------------------- | --------- | --------- |
| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 30.54s | 35.76s |
| Macbook m2 max         | Mistral on Ollama with nomic-embed-text | 18.46s    | 19.59s    |
| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text | 27.82s | 29.98s |
| Macbook m2 max<br> | Llama3 on Ollama with nomic-embed-text | 20.83s | 12.29s |
| Macbook m2 max | Mistral on Ollama with nomic-embed-text | | |
| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text | 27.82s | 29.986s |
| Macbook m2 max<br> | Llama3 on Ollama with nomic-embed-text | | |


**Note**: the Docker examples were not run on devices other than the Macbook because performance is too slow (10 times slower than Ollama).
@@ -25,17 +24,20 @@ In particular, it is tested with ScriptCreatorGraph
Expand All @@ -25,17 +24,20 @@ In particular, is tested with ScriptCreatorGraph
**URL**: https://perinim.github.io/projects
**Task**: List me all the projects with their description.

| Name | Execution time | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ---------------| ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 4.50s | 1897 | 1802 | 95 | 1 | 0.002893 |
| gpt-4-turbo | 7.88s | 1920 | 1802 | 118 | 1 | 0.02156 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 24.21 | 1892 | 1802 | 90 | 1 | 0.002883 |
| gpt-4-turbo-preview | 6.614 | 1936 | 1802 | 134 | 1 | 0.02204 |
| Groq with nomic-embed-text  | 6.71                     | 2201         | 2024          | 177               | 1                   | 0              |
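As a sanity check, the gpt-3.5-turbo row above can be reproduced from its token counts, assuming OpenAI's 2024 list prices of $0.0015 per 1K prompt tokens and $0.002 per 1K completion tokens (the pricing is an assumption; it is not stated in the table):

```python
# Recompute the reported gpt-3.5-turbo cost from the table's token counts.
prompt_tokens = 1802
completion_tokens = 90

# Assumed per-1K-token prices for gpt-3.5-turbo (not stated in the table).
cost = prompt_tokens * 0.0015 / 1000 + completion_tokens * 0.002 / 1000
print(f"{cost:.6f}")  # 0.002883, matching the table's total_cost_USD
```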

### Example 2: Wired
**URL**: https://www.wired.com
**Task**: List me all the articles with their description.

| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | Error (text too long) | - | - | - | - | - |
| gpt-4-turbo         | Error (TPM limit reached) | -            | -             | -                 | -                   | -              |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | | | | | | |
| gpt-4-turbo-preview | | | | | | |
| Groq with nomic-embed-text  |                          |              |               |                   |                     |                |


61 changes: 61 additions & 0 deletions examples/benchmarks/GenerateScraper/benchmark_groq.py
@@ -0,0 +1,61 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
"List me all the articles with their description."]

# ************************************************
# Define the configuration for the graph
# ************************************************

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434", # set ollama URL arbitrarily
},
"headless": False,
"library": "beautifoulsoup"
}


# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
with open(files[i], 'r', encoding="utf-8") as file:
text = file.read()

smart_scraper_graph = ScriptCreatorGraph(
prompt=tasks[i],
source=text,
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
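The execution times reported in the benchmark Readmes are wall-clock seconds; a library-independent sketch of how such timings could be collected around a graph run (the workload below is a stand-in for a real `smart_scraper_graph.run()` call):

```python
import time

def timed_run(task):
    """Run `task` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload instead of a real graph run.
result, seconds = timed_run(lambda: sum(range(1000)))
print(result, f"{seconds:.2f}s")
```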
5 changes: 0 additions & 5 deletions examples/benchmarks/GenerateScraper/benchmark_llama3.py
@@ -2,11 +2,8 @@
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Read the text file
@@ -19,8 +16,6 @@
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("GPT4_KEY")


graph_config = {
"llm": {
28 changes: 14 additions & 14 deletions examples/benchmarks/SmartScraper/Readme.md
@@ -5,37 +5,37 @@ The two websites benchmarked are:

Both are stored locally as .txt files so that we do not have to rely on an internet connection.

In particular, it is tested with SmartScraper.

| Hardware | Moodel | Example 1 | Example 2 |
| Hardware | Model | Example 1 | Example 2 |
| ------------------ | --------------------------------------- | --------- | --------- |
| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s | 26.61s |
| Macbook m2 max | Mistral on Ollama with nomic-embed-text | 8.05s | 12.17s |
| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text | 29.871s | 35.32s |
| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text | 29.87s | 35.32s |
| Macbook m2 max | Llama3 on Ollama with nomic-embed-text | 18.36s | 78.32s |


**Note**: the Docker examples were not run on devices other than the Macbook because performance is too slow (10 times slower than Ollama). The results are the following:

| Hardware | Example 1 | Example 2 |
| ------------------ | --------- | --------- |
| Macbook 14' m1 pro | 139.89s | Too long |
| Macbook 14' m1 pro | 139.89 | Too long |
# Performance on API services
### Example 1: personal portfolio
**URL**: https://perinim.github.io/projects
**Task**: List me all the projects with their description.

| Name | Execution time | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ---------------| ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 5.58s | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo | 9.76s | 445 | 272 | 173 | 1 | 0.00791 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 25.22 | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo-preview | 9.53 | 449 | 272 | 177 | 1 | 0.00803 |
| Groq with nomic-embed-text  | 1.99                     | 474          | 284           | 190               | 1                   | 0              |

### Example 2: Wired
**URL**: https://www.wired.com
**Task**: List me all the articles with their description.

| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 6.50 | 2442 | 2199 | 243 | 1 | 0.003784 |
| gpt-4-turbo | 76.07 | 3521 | 2199 | 1322 | 1 | 0.06165 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 25.89 | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo-preview | 64.70 | 3573 | 2199 | 1374 | 1 | 0.06321 |
| Groq with nomic-embed-text  | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |


57 changes: 57 additions & 0 deletions examples/benchmarks/SmartScraper/benchmark_groq.py
@@ -0,0 +1,57 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
"List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434", # set ollama URL arbitrarily
},
"headless": False
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
with open(files[i], 'r', encoding="utf-8") as file:
text = file.read()

smart_scraper_graph = SmartScraperGraph(
prompt=tasks[i],
source=text,
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
1 change: 0 additions & 1 deletion examples/benchmarks/SmartScraper/benchmark_llama3.py
@@ -2,7 +2,6 @@
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
