Commit 1219caa

Merge pull request #130 from VinciGit00/main
reallignment
2 parents 431b495 + 2f478f8 commit 1219caa

8 files changed: +149 -21 lines changed

CHANGELOG.md

Lines changed: 22 additions & 1 deletion
@@ -1,10 +1,31 @@
-## [0.6.1-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.0...v0.6.1-beta.1) (2024-05-02)
+## [0.6.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.1...v0.6.2) (2024-05-02)
 
 
 ### Bug Fixes
 
 * add to requirements.txt langchain-aws = "^0.1.2" ([1afa319](https://github.com/VinciGit00/Scrapegraph-ai/commit/1afa31910d25b2735abe0ad09dad433d6c2159fb))
 
+
+### Docs
+
+* **tree:** added roadmap ([c8eeff8](https://github.com/VinciGit00/Scrapegraph-ai/commit/c8eeff873db6c8d23c9e4109ddee46edaa68b92b))
+* **roadmap:** open contributions ([4441505](https://github.com/VinciGit00/Scrapegraph-ai/commit/4441505b239fa819032469f148115bb3392b15ea))
+* typo ([faa3498](https://github.com/VinciGit00/Scrapegraph-ai/commit/faa3498fa7694ee3309eeed479d8f1bc4b1c7b97))
+
+
+### CI
+
+* **release:** 0.6.1-beta.1 [skip ci] ([75a4042](https://github.com/VinciGit00/Scrapegraph-ai/commit/75a4042a232a5b69fd38d1666fea9633b4fd015e))
+
+## [0.6.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.0...v0.6.1) (2024-05-02)
+
+
+
+### Bug Fixes
+
+* gemini errror ([2ea54ea](https://github.com/VinciGit00/Scrapegraph-ai/commit/2ea54eab1d070e177c7d5ecfcc032b325fbd7c12))
+
+
 ## [0.6.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.2...v0.6.0) (2024-05-02)
 
 

README.md

Lines changed: 38 additions & 2 deletions
@@ -168,7 +168,38 @@ result = smart_scraper_graph.run()
 print(result)
 ```
 
-### Case 5: Extracting information using Gemini
+
+### Case 5: Extracting information using Azure
+```python
+from langchain_openai import AzureChatOpenAI
+from langchain_openai import AzureOpenAIEmbeddings
+
+llm_model_instance = AzureChatOpenAI(
+    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
+    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
+)
+
+embedder_model_instance = AzureOpenAIEmbeddings(
+    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
+    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
+)
+graph_config = {
+    "llm": {"model_instance": llm_model_instance},
+    "embeddings": {"model_instance": embedder_model_instance}
+}
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
+    event_end_date, event_end_time, location, event_mode, event_category,
+    third_party_redirect, no_of_days,
+    time_in_hours, hosted_or_attending, refreshments_type,
+    registration_available, registration_link""",
+    source="https://www.hmhco.com/event",
+    config=graph_config
+)
+```
+
+### Case 6: Extracting information using Gemini
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
 GOOGLE_APIKEY = "YOUR_API_KEY"
@@ -215,6 +246,11 @@ Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegra
 [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
 [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
 
+## 📈 Roadmap
+Check out the project roadmap [here](docs/README.md)! 🚀
+
+Wanna visualize the roadmap in a more interactive way? Check out the [markmap](https://markmap.js.org/repl) visualization by copy pasting the markdown content in the editor!
+
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
 
@@ -249,4 +285,4 @@ ScrapeGraphAI is licensed under the MIT License. See the [LICENSE](https://githu
 ## Acknowledgements
 
 - We would like to thank all the contributors to the project and the open-source community for their support.
-- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
+- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.

docs/README.md

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
---
2+
title: ScrapGraphAI Roadmap
3+
markmap:
4+
colorFreezeLevel: 2
5+
maxWidth: 500
6+
---
7+
8+
# **ScrapGraphAI Roadmap**
9+
10+
## **Short-Term Goals**
11+
12+
- Integration with more llm APIs
13+
14+
- Test proxy rotation implementation
15+
16+
- Add more search engines inside the SearchInternetNode
17+
18+
- Improve the documentation (ReadTheDocs)
19+
- [Issue #102](https://github.com/VinciGit00/Scrapegraph-ai/issues/102)
20+
21+
- Create tutorials for the library
22+
23+
## **Medium-Term Goals**
24+
25+
- Node for handling API requests
26+
27+
- Improve SearchGraph to look into the first 5 results of the search engine
28+
29+
- Make scraping more deterministic
30+
- Create DOM tree of the website
31+
- HTML tag text embeddings with tags metadata
32+
- Study tree forks from root node
33+
- How do we use the tags parameters?
34+
35+
- Create scraping folder with report
36+
- Folder contains .scrape files, DOM tree files, report
37+
- Report could be a HTML page with scraping speed, costs, LLM info, scraped content and DOM tree visualization
38+
- We can use pyecharts with R-markdown
39+
40+
- Scrape multiple pages of the same website
41+
- Create new node that instantiate multiple graphs at the same time
42+
- Make graphs run in parallel
43+
- Scrape only relevant URLs from user prompt
44+
- Use the multi dimensional DOM tree of the website for retrieval
45+
- [Issue #112](https://github.com/VinciGit00/Scrapegraph-ai/issues/112)
46+
47+
- Crawler graph
48+
- Scrape all the URLs with the same domain in all the pages
49+
- Build many DOM trees and link them together
50+
- Save the multi dimensional tree in a file
51+
52+
- Compare two DOM trees to assess the similarity
53+
- Save the DOM tree of the scraped website in a file as a sort of cache to be used to compare with future website structure
54+
- Create similarity metrics with multiple DOM trees (overall tree? only relevant tags structure?)
55+
56+
- Nodes for handling authentication
57+
- Use Selenium or Playwright to handle authentication
58+
- Passes the cookies to the other nodes
59+
60+
- Nodes that attaches to an open browser
61+
- Use Selenium or Playwright to attach to an open browser
62+
- Navigate inside the browser and scrape the content
63+
64+
- Nodes for taking screenshots and understanding the page layout
65+
- Use Selenium or Playwright to take screenshots
66+
- Use LLM to asses if it is a block-like page, paragraph-like page, etc.
67+
- [Issue #88](https://github.com/VinciGit00/Scrapegraph-ai/issues/88)
68+
69+
## **Long-Term Goals**
70+
71+
- Automatic generation of scraping pipelines from a given prompt
72+
73+
- Create API for the library
74+
75+
- Finetune a LLM for html content
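
The roadmap item on comparing two DOM trees does not say which metric is intended. As one possible reading only, the sketch below reduces each page to the set of its root-to-node tag paths and compares the sets with Jaccard similarity. The names `TagPathCollector`, `tag_paths` and `dom_similarity` are hypothetical and not part of ScrapeGraphAI; only the standard library is used.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collect the root-to-node tag path of every element in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def dom_similarity(html_a, html_b):
    """Jaccard similarity of the two documents' tag-path sets (0.0 to 1.0)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    cached = "<html><body><div><p>a</p></div></body></html>"
    fresh = "<html><body><div><p>b</p><span>c</span></div></body></html>"
    print(dom_similarity(cached, fresh))
```

A set of paths ignores text content and sibling order, so it only flags structural drift between a cached page and a fresh one; a metric restricted to "relevant tags", as the roadmap asks, would need a different representation.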

examples/azure/smart_scraper_azure_openai.py

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
from scrapegraphai.utils import prettify_exec_info
1111

1212

13-
## required environment variable in .env
13+
# required environment variable in .env
1414
# AZURE_OPENAI_ENDPOINT
1515
# AZURE_OPENAI_CHAT_DEPLOYMENT_NAME
1616
# MODEL_NAME
@@ -45,8 +45,11 @@
4545
}
4646

4747
smart_scraper_graph = SmartScraperGraph(
48-
prompt="List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, event_end_date, event_end_time, location, event_mode, event_category, third_party_redirect, no_of_days,
49-
time_in_hours, hosted_or_attending, refreshments_type, registration_available, registration_link",
48+
prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
49+
event_end_date, event_end_time, location, event_mode, event_category,
50+
third_party_redirect, no_of_days,
51+
time_in_hours, hosted_or_attending, refreshments_type,
52+
registration_available, registration_link""",
5053
# also accepts a string with the already downloaded HTML code
5154
source="https://www.hmhco.com/event",
5255
config=graph_config
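
The Azure example depends on several environment variables being present in `.env`. A minimal sketch, assuming a hypothetical helper `missing_env_vars` (not part of ScrapeGraphAI), of how a script might report unset variables before building the graph. The variable names come from the snippets in this commit; whether all four are required at once is an assumption.

```python
import os

# Names taken from the snippets in this commit; treating all four as required
# is an assumption made for illustration.
REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
)


def missing_env_vars(names, environ=None):
    """Hypothetical helper: return the names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_AZURE_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Azure variables are set.")
```

Checking up front gives one readable message instead of a `KeyError` from the first `os.environ[...]` lookup that fails.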

examples/gemini/csv_scraper_gemini.py

Lines changed: 3 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -19,20 +19,13 @@
1919
# ************************************************
2020
# Define the configuration for the graph
2121
# ************************************************
22+
gemini_key = os.getenv("GOOGLE_APIKEY")
2223

2324
graph_config = {
2425
"llm": {
25-
"model": "ollama/mistral",
26-
"temperature": 0,
27-
"format": "json", # Ollama needs the format to be specified explicitly
28-
# "model_tokens": 2000, # set context length arbitrarily
29-
"base_url": "http://localhost:11434",
26+
"api_key": gemini_key,
27+
"model": "gemini-pro",
3028
},
31-
"embeddings": {
32-
"model": "ollama/nomic-embed-text",
33-
"temperature": 0,
34-
"base_url": "http://localhost:11434",
35-
}
3629
}
3730

3831
# ************************************************
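
Because `os.getenv` returns `None` when `GOOGLE_APIKEY` is unset, the new config would silently carry `"api_key": None`. A small sketch, assuming a hypothetical wrapper `build_gemini_config` that is not in the repository, of building the same dictionary while failing fast on a missing key:

```python
import os


def build_gemini_config(environ=None):
    """Hypothetical wrapper: build the graph_config from this diff, or raise."""
    env = os.environ if environ is None else environ
    gemini_key = env.get("GOOGLE_APIKEY")
    if not gemini_key:
        raise RuntimeError("GOOGLE_APIKEY is not set")
    return {
        "llm": {
            "api_key": gemini_key,
            "model": "gemini-pro",
        },
    }


if __name__ == "__main__":
    try:
        config = build_gemini_config()
        print("Configured model:", config["llm"]["model"])
    except RuntimeError as exc:
        print("Configuration error:", exc)
```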

examples/gemini/smart_scraper_gemini.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
graph_config = {
1919
"llm": {
2020
"api_key": gemini_key,
21-
"model": "gpt-3.5-turbo",
21+
"model": "gemini-pro",
2222
},
2323
}
2424

pyproject.toml

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,7 @@
11
[tool.poetry]
22
name = "scrapegraphai"
33

4-
version = "0.6.1b1"
5-
4+
version = "0.6.2"
65

76
description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
87
authors = [

scrapegraphai/helpers/models_tokens.py

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,9 @@
2020
"gpt-4-32k-0613": 32768,
2121
},
2222
"azure": {
23-
"gpt-3.5-turbo": 4096
23+
"gpt-3.5-turbo": 4096,
24+
"gpt-4": 8192,
25+
"gpt-4-32k": 32768
2426
},
2527
"gemini": {
2628
"gemini-pro": 128000,
@@ -65,4 +67,3 @@
6567
"cohere.embed-multilingual-v3": 512
6668
}
6769
}
68-
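
The mapping is a two-level dictionary keyed by provider and then by model name. A minimal sketch of a lookup against an excerpt of it, using only the values visible in this diff; the helper `context_window` is hypothetical and the library may resolve token limits differently.

```python
# Excerpt of the mapping with the values shown in the diff above.
models_tokens = {
    "azure": {
        "gpt-3.5-turbo": 4096,
        "gpt-4": 8192,
        "gpt-4-32k": 32768,
    },
    "gemini": {
        "gemini-pro": 128000,
    },
}


def context_window(provider, model, default=None):
    """Hypothetical helper: return a model's token limit, or `default`."""
    return models_tokens.get(provider, {}).get(model, default)


if __name__ == "__main__":
    print(context_window("azure", "gpt-4"))        # 8192
    print(context_window("azure", "gpt-4-turbo"))  # None (not in the excerpt)
```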
