Commit 008f8d9

Merge pull request #248 from VinciGit00/main
reallignment
2 parents 42d2ab1 + cffcf80 commit 008f8d9

37 files changed, +2389 -3420 lines changed

.github/workflows/release.yml

Lines changed: 4 additions & 7 deletions
@@ -14,11 +14,8 @@ jobs:
         run: |
           sudo apt update
           sudo apt install -y git
-      - name: Install Python Env and Poetry
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.9'
-      - run: pip install poetry
+      - name: Install the latest version of rye
+        uses: eifinger/setup-rye@v3
       - name: Install Node Env
         uses: actions/setup-node@v4
         with:
@@ -30,8 +27,8 @@ jobs:
           persist-credentials: false
       - name: Build app
         run: |
-          poetry install
-          poetry build
+          rye sync --no-lock
+          rye build
         id: build_cache
         if: success()
       - name: Cache build

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -31,4 +31,6 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
 main.py
+*.python-version
+*.lock
 

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+3.9.19

CHANGELOG.md

Lines changed: 113 additions & 0 deletions
@@ -1,3 +1,116 @@
+## [1.1.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.0.1...v1.1.0) (2024-05-15)
+
+
+### Features
+
+* add turboscraper (alfa) ([51aa109](https://github.com/VinciGit00/Scrapegraph-ai/commit/51aa109e420a71101664906f0849f39ea2a3f91a))
+* new search_graph ([67d5fbf](https://github.com/VinciGit00/Scrapegraph-ai/commit/67d5fbf816275940c89802e033b9e7796436c410))
+
+
+### Docs
+
+* **rye:** replaced poetry with rye ([efb781f](https://github.com/VinciGit00/Scrapegraph-ai/commit/efb781f950b23f442706d54a578230aba9e9796a))
+
+## [1.0.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.0.0...v1.0.1) (2024-05-15)
+
+
+### Bug Fixes
+
+* **searchgraph:** used shallow copy to serialize obj ([096b665](https://github.com/VinciGit00/Scrapegraph-ai/commit/096b665c0152593c19402e555c0850cdd3b2a2c0))
+
+## [1.0.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.11.1...v1.0.0) (2024-05-15)
+
+
+### ⚠ BREAKING CHANGES
+
+* **package manager:** move from poetry to rye
+
+### chore
+
+* **package manager:** move from poetry to rye ([8fc2510](https://github.com/VinciGit00/Scrapegraph-ai/commit/8fc2510b3704990ff96f5f74abb5b800bca9af98)), closes [#198](https://github.com/VinciGit00/Scrapegraph-ai/issues/198)
+
+
+### Docs
+
+* **main-readme:** fixed some typos ([78d1940](https://github.com/VinciGit00/Scrapegraph-ai/commit/78d19402351f18b3ed3a9d7e4200ad22ad0d064a))
+
+## [0.11.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.11.0...v0.11.1) (2024-05-14)
+
+
+### Bug Fixes
+
+* **docs:** requirements-dev ([b0a67ba](https://github.com/VinciGit00/Scrapegraph-ai/commit/b0a67ba387e7d3a3dca7b82fe3e5b39c6a34c3ba))
+
+## [0.11.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.1...v0.11.0) (2024-05-14)
+
+
+### Features
+
+* **parallel-exeuction:** add asyncio event loop dispatcher with semaphore for parallel graph instances ([627cbee](https://github.com/VinciGit00/Scrapegraph-ai/commit/627cbeeb2096eb4cd5da45015d37fceb7fe7840a))
+* **webdriver-backend:** add dynamic import scripts from module and file ([db2234b](https://github.com/VinciGit00/Scrapegraph-ai/commit/db2234bf5d2f2589b080cd4136f33c4f4443bdfb))
+* add gpt-4o ([52a4a3b](https://github.com/VinciGit00/Scrapegraph-ai/commit/52a4a3b22d6871b14801a5edbd28aa32a1a2580d)), closes [#232](https://github.com/VinciGit00/Scrapegraph-ai/issues/232)
+* add new prompt info ([e2350ed](https://github.com/VinciGit00/Scrapegraph-ai/commit/e2350eda6249d8e121344d12c92645a3887a5b76))
+* **proxy-rotation:** add parse (IP address) or search (from broker) functionality for proxy rotation ([2170131](https://github.com/VinciGit00/Scrapegraph-ai/commit/217013181da06abe8d71d9db70e809ea4ebd8236))
+* add support for deepseek-chat ([156b67b](https://github.com/VinciGit00/Scrapegraph-ai/commit/156b67b91e1798f67082123e2c0087d358a32d4d)), closes [#222](https://github.com/VinciGit00/Scrapegraph-ai/issues/222)
+* Add support for passing pdf path as source ([f10f3b1](https://github.com/VinciGit00/Scrapegraph-ai/commit/f10f3b1438e0c625b7f2fa52faeb5a6c12116113))
+* **omni-search:** added omni search graph and updated docs ([fcb3abb](https://github.com/VinciGit00/Scrapegraph-ai/commit/fcb3abb01d505f634309f9ae3c686bbcaab65107))
+* added proxy rotation ([0c36a7e](https://github.com/VinciGit00/Scrapegraph-ai/commit/0c36a7ec1f32ee073d9e0f534a2cb97aba3d7a1f))
+* **safe-web-driver:** enchanced the original `AsyncChromiumLoader` web driver with proxy protection and flexible kwargs and backend ([768719c](https://github.com/VinciGit00/Scrapegraph-ai/commit/768719cce80953fa6cbe283e442420116c438f16))
+* **gpt-4o:** image to text single node test ([90955ca](https://github.com/VinciGit00/Scrapegraph-ai/commit/90955ca52f1e3277072e843fb8d578deea27d09f))
+* revert fetch_node ([864aa91](https://github.com/VinciGit00/Scrapegraph-ai/commit/864aa91326c360992326e04811d272e55eac8355))
+* **batchsize:** tested different batch sizes and systems ([a8d5e7d](https://github.com/VinciGit00/Scrapegraph-ai/commit/a8d5e7db050e15306780ffca47f998ebaf5c1216))
+* update info ([4ed0fb8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4ed0fb89c3e6068190a7775bedcb6ae65ba59d18))
+* **omni-scraper:** working OmniScraperGraph with images ([a296927](https://github.com/VinciGit00/Scrapegraph-ai/commit/a2969276245cbedb97741975ea707dab2695f71e))
+
+
+### Bug Fixes
+
+* **pytest:** add dependency for mocking testing functions ([2f4fd45](https://github.com/VinciGit00/Scrapegraph-ai/commit/2f4fd45700ebf1db0c429b5a6249386d1a111615))
+* add json integration ([0ab31c3](https://github.com/VinciGit00/Scrapegraph-ai/commit/0ab31c3fdbd56652ed306e60109301f60e8042d3))
+* Augment the information getting fetched from a webpage ([f8ce3d5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f8ce3d5916eab926275d59d4d48b0d89ec9cd43f))
+* bug for claude ([d0167de](https://github.com/VinciGit00/Scrapegraph-ai/commit/d0167dee71779a3c1e1e042e17a41134b93b3c78))
+* **fetch_node:** bug in handling local files ([a6e1813](https://github.com/VinciGit00/Scrapegraph-ai/commit/a6e1813ddd36cc8d7c915e6ea0525835d64d10a2))
+* **chromium-loader:** ensure it subclasses langchain's base loader ([b54d984](https://github.com/VinciGit00/Scrapegraph-ai/commit/b54d984c134c8cbc432fd111bb161d3d53cf4a85))
+* fixed bugs for csv and xml ([324e977](https://github.com/VinciGit00/Scrapegraph-ai/commit/324e977b853ecaa55bac4bf86e7cd927f7f43d0d))
+* limit python version to < 3.12 ([a37fbbc](https://github.com/VinciGit00/Scrapegraph-ai/commit/a37fbbcbcfc3ddd0cc66f586f279676b52c4abfe))
+* **proxy-rotation:** removed duplicated arg and passed the loader_kwarhs correctly to the node ([1e9a564](https://github.com/VinciGit00/Scrapegraph-ai/commit/1e9a56461632999c5dc09f5aa930c14c954025ad))
+* **fetch-node:** removed isSoup from default ([0c15947](https://github.com/VinciGit00/Scrapegraph-ai/commit/0c1594737f878ed5672f4c889fdf9b4e0d7ec49a))
+* **proxy-rotation:** removed max_shape duplicate ([5d6d996](https://github.com/VinciGit00/Scrapegraph-ai/commit/5d6d996e8f6132101d4c3af835d74f0674baffa1))
+* **asyncio:** replaced deepcopy with copy due to serialization problems ([dedc733](https://github.com/VinciGit00/Scrapegraph-ai/commit/dedc73304755c2d540a121d143173f60fb448bbb))
+
+
+### chore
+
+* update models_tokens.py with new model configurations ([d9752b1](https://github.com/VinciGit00/Scrapegraph-ai/commit/d9752b1619c6f86fdc407c898c8c9b443a50cb07))
+
+
+### Docs
+
+* add diagram showing general structure/flow of the library ([13ae918](https://github.com/VinciGit00/Scrapegraph-ai/commit/13ae9180ac5e7ef11dad1a210cf8790e797397dd))
+* **refactor:** added proxy-rotation usage and refactor readthedocs ([e256b75](https://github.com/VinciGit00/Scrapegraph-ai/commit/e256b758b2ada641f97b23b1cf6c6b0174563d8a))
+* **refactor:** changed example ([c7ec114](https://github.com/VinciGit00/Scrapegraph-ai/commit/c7ec114274da64f0b61cee80afe908a36ad26b78))
+* **concurrent:** refactor theme and added benchmarck searchgraph ([ced2bbc](https://github.com/VinciGit00/Scrapegraph-ai/commit/ced2bbcdc9672396e3c8afdc1f7f65c4194d29fd))
+* update overview diagram with more models ([b441b30](https://github.com/VinciGit00/Scrapegraph-ai/commit/b441b30a5c60dda105964f69bd4cef06825f5c74))
+
+
+### CI
+
+* **release:** 0.10.0-beta.3 [skip ci] ([ad32298](https://github.com/VinciGit00/Scrapegraph-ai/commit/ad32298e70fc626fd62c897e153b806f79dba9b9))
+* **release:** 0.10.0-beta.4 [skip ci] ([548bff9](https://github.com/VinciGit00/Scrapegraph-ai/commit/548bff9d77c8b4d2aadee40e966a06cc9d7fd4ab))
+* **release:** 0.10.0-beta.5 [skip ci] ([28c9dce](https://github.com/VinciGit00/Scrapegraph-ai/commit/28c9dce7cbda49750172bafd7767fa48a0c33859))
+* **release:** 0.10.0-beta.6 [skip ci] ([460d292](https://github.com/VinciGit00/Scrapegraph-ai/commit/460d292af21fabad3fdd2b66110913ccee22ba91))
+* **release:** 0.11.0-beta.1 [skip ci] ([63c0dd9](https://github.com/VinciGit00/Scrapegraph-ai/commit/63c0dd93723c2ab55df0a66b555e7fbb4716ea77))
+* **release:** 0.11.0-beta.10 [skip ci] ([218b8ed](https://github.com/VinciGit00/Scrapegraph-ai/commit/218b8ede8a22400fd7ba5d1e302ac270f800e67d)), closes [#232](https://github.com/VinciGit00/Scrapegraph-ai/issues/232)
+* **release:** 0.11.0-beta.11 [skip ci] ([8727d03](https://github.com/VinciGit00/Scrapegraph-ai/commit/8727d033841b2a30405f12f19f11cd649ffaf4f1))
+* **release:** 0.11.0-beta.2 [skip ci] ([7ae50c0](https://github.com/VinciGit00/Scrapegraph-ai/commit/7ae50c035e87be9a3d7b5eef42232dae6e345914))
+* **release:** 0.11.0-beta.3 [skip ci] ([106fb12](https://github.com/VinciGit00/Scrapegraph-ai/commit/106fb125316aa3c6dce889963fa423d11bc2c491)), closes [#222](https://github.com/VinciGit00/Scrapegraph-ai/issues/222)
+* **release:** 0.11.0-beta.4 [skip ci] ([4ccddda](https://github.com/VinciGit00/Scrapegraph-ai/commit/4ccddda5ebe8d1b12136571733416ed9f819e4db))
+* **release:** 0.11.0-beta.5 [skip ci] ([353382b](https://github.com/VinciGit00/Scrapegraph-ai/commit/353382b4d33511259f28afd72ef08fe8f682b688))
+* **release:** 0.11.0-beta.6 [skip ci] ([2724d3d](https://github.com/VinciGit00/Scrapegraph-ai/commit/2724d3dd5f7a7dd308e6d441cd8e7a5e085c30c4))
+* **release:** 0.11.0-beta.7 [skip ci] ([f0f7373](https://github.com/VinciGit00/Scrapegraph-ai/commit/f0f73736f75fc28c7bdeb4016ebaca07a40c8c59))
+* **release:** 0.11.0-beta.8 [skip ci] ([fa4edb4](https://github.com/VinciGit00/Scrapegraph-ai/commit/fa4edb47033121b81cdcc1c910f0386cba5a2f2e))
+* **release:** 0.11.0-beta.9 [skip ci] ([d2877d8](https://github.com/VinciGit00/Scrapegraph-ai/commit/d2877d89e5949a01cc90c80028f58735f1fb522e))
+
 ## [0.11.0-beta.11](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.11.0-beta.10...v0.11.0-beta.11) (2024-05-14)
 
 
README.md

Lines changed: 5 additions & 8 deletions
@@ -16,11 +16,6 @@ Just say which information you want to extract and the library will do it for yo
 <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
 </p>
 
-[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
-[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
-[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
-
-
 ## 🚀 Quick install
 
 The reference page for Scrapegraph-ai is available on the official page of pypy: [pypi](https://pypi.org/project/scrapegraphai/).
@@ -44,13 +39,11 @@ Try it directly on the web using Google Colab:
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
 
-Follow the procedure on the following link to setup your OpenAI API key: [link](https://scrapegraph-ai.readthedocs.io/en/latest/index.html).
-
 ## 📖 Documentation
 
 The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
 
-Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
+Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
 There are three main scraping pipelines that can be used to extract information from a website (or local file):
@@ -179,6 +172,10 @@ Feel free to contribute and join our Discord server to discuss with us improveme
 
 Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).
 
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
+[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
+[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
+
 ## 📈 Roadmap
 Check out the project roadmap [here](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)! 🚀
 

docs/source/getting_started/installation.rst

Lines changed: 4 additions & 2 deletions
@@ -25,11 +25,13 @@ The library is available on PyPI, so it can be installed using the following com
 
 It is higly recommended to install the library in a virtual environment (conda, venv, etc.)
 
-If your clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
+If you clone the repository, you can install the library using `rye <https://rye-up.com/>`_. Follow the installation instruction from the website and then run:
 
 .. code-block:: bash
 
-    poetry install
+    rye pin 3.10
+    rye sync
+    rye build
 
 Additionally on Windows when using WSL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

examples/custom_graph_domtree.py

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
+"""
+Example of custom graph using existing nodes
+"""
+
+import os
+from dotenv import load_dotenv
+from scrapegraphai.models import OpenAI
+from scrapegraphai.graphs import BaseGraph
+from scrapegraphai.nodes import FetchNode, GenerateAnswerNode
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+openai_key = os.getenv("OPENAI_APIKEY")
+
+graph_config = {
+    "llm": {
+        "api_key": openai_key,
+        "model": "gpt-3.5-turbo",
+        "temperature": 0,
+        "streaming": True
+    },
+}
+
+# ************************************************
+# Define the graph nodes
+# ************************************************
+
+llm_model = OpenAI(graph_config["llm"])
+
+# define the nodes for the graph
+fetch_node = FetchNode(
+    input="url | local_dir",
+    output=["doc"],
+)
+generate_answer_node = GenerateAnswerNode(
+    input="user_prompt & (relevant_chunks | parsed_doc | doc)",
+    output=["answer"],
+    node_config={"llm": llm_model},
+)
+
+# ************************************************
+# Create the graph by defining the connections
+# ************************************************
+
+graph = BaseGraph(
+    nodes={
+        fetch_node,
+        generate_answer_node,
+    },
+    edges={
+        (fetch_node, generate_answer_node)
+    },
+    entry_point=fetch_node
+)
+
+# ************************************************
+# Execute the graph
+# ************************************************
+
+subtree_text = '''
+div>div -> "This is a paragraph" \n
+div>ul>li>a>span -> "This is a list item 1" \n
+div>ul>li>a>span -> "This is a list item 2" \n
+div>ul>li>a>span -> "This is a list item 3"
+'''
+
+subtree_simplified_html = '''
+<div>
+<div>This is a paragraph</div>
+<ul>
+<li>
+<span>This is a list item 1</span>
+</li>
+<li>
+<span>This is a list item 2</span>
+</li>
+<li>
+<span>This is a list item 3</span>
+</li>
+</ul>
+</div>
+'''
+
+subtree_dict_simple = {
+    "div": {
+        "text": {
+            "content": "This is a paragraph",
+            "path_to_fork": "div>div",
+        },
+        "ul": {
+            "path_to_fork": "div>ul",
+            "texts": [
+                {
+                    "content": "This is a list item 1",
+                    "path_to_fork": "ul>li>a>span",
+                },
+                {
+                    "content": "This is a list item 2",
+                    "path_to_fork": "ul>li>a>span",
+                },
+                {
+                    "content": "This is a list item 3",
+                    "path_to_fork": "ul>li>a>span",
+                }
+            ]
+        }
+    }
+}
+
+
+subtree_dict_complex = {
+    "div": {
+        "text": {
+            "content": "This is a paragraph",
+            "path_to_fork": "div>div",
+            "attributes": {
+                "classes": ["paragraph"],
+                "ids": ["paragraph"],
+                "hrefs": ["https://www.example.com"]
+            }
+        },
+        "ul": {
+            "text1":{
+                "content": "This is a list item 1",
+                "path_to_fork": "ul>li>a>span",
+                "attributes": {
+                    "classes": ["list-item", "item-1"],
+                    "ids": ["item-1"],
+                    "hrefs": ["https://www.example.com"]
+                }
+            },
+            "text2":{
+                "content": "This is a list item 2",
+                "path_to_fork": "ul>li>a>span",
+                "attributes": {
+                    "classes": ["list-item", "item-2"],
+                    "ids": ["item-2"],
+                    "hrefs": ["https://www.example.com"]
+                }
+            }
+        }
+    }
+}
+
+from playwright.sync_api import sync_playwright, Playwright
+
+def run(playwright: Playwright):
+    chromium = playwright.chromium # or "firefox" or "webkit".
+    browser = chromium.launch()
+    page = browser.new_page()
+    page.goto("https://www.wired.com/category/science/")
+    #get accessibilty tree
+    accessibility_tree = page.accessibility.snapshot()
+
+    result, execution_info = graph.execute({
+        "user_prompt": "List me all the latest news with their description.",
+        "local_dir": str(accessibility_tree)
+    })
+
+    # get the answer from the result
+    result = result.get("answer", "No answer found.")
+    print(result)
+    # other actions...
+    browser.close()
+
+with sync_playwright() as playwright:
+    run(playwright)
+
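
Note on examples/custom_graph_domtree.py: the example defines `subtree_text` as a "path -> text" listing but never uses it, and passes `str(accessibility_tree)` to the graph instead. As a rough sketch of how the two might be connected, the snippet below flattens a nested tree into lines of that shape. It is not part of this commit. It assumes each node is a dict with `role`, `name`, and optional `children` keys, which is how an accessibility snapshot is commonly shaped. The helper name `iter_text_paths` and the `sample_tree` data are hypothetical, and role names stand in for the tag names used in `subtree_text`.

```python
from typing import Iterator, Optional


def iter_text_paths(node: Optional[dict], prefix: str = "") -> Iterator[str]:
    """Yield 'role>role>... -> "name"' lines for every named node in the tree.

    Hypothetical helper: assumes each node is a dict with a "role" key,
    an optional "name" key, and an optional "children" list of nodes.
    """
    if not node:
        return
    role = node.get("role", "unknown")
    path = f"{prefix}>{role}" if prefix else role
    name = node.get("name")
    if name:
        # Only nodes that carry text produce an output line.
        yield f'{path} -> "{name}"'
    for child in node.get("children", []):
        yield from iter_text_paths(child, path)


# Hand-written stand-in for a snapshot; a real one would come from the browser.
sample_tree = {
    "role": "WebArea",
    "name": "",
    "children": [
        {"role": "heading", "name": "Science"},
        {
            "role": "list",
            "name": "",
            "children": [
                {"role": "listitem", "name": "This is a list item 1"},
                {"role": "listitem", "name": "This is a list item 2"},
            ],
        },
    ],
}

flattened = "\n".join(iter_text_paths(sample_tree))
print(flattened)
```

Under these assumptions, `flattened` could be passed as the `local_dir` value in place of `str(accessibility_tree)`. Whether that improves the generated answer is untested here.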
