Skip to content

reallignment #130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 3, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 22 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,31 @@
## [0.6.1-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.0...v0.6.1-beta.1) (2024-05-02)
## [0.6.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.1...v0.6.2) (2024-05-02)


### Bug Fixes

* add to requirements.txt langchain-aws = "^0.1.2" ([1afa319](https://github.com/VinciGit00/Scrapegraph-ai/commit/1afa31910d25b2735abe0ad09dad433d6c2159fb))


### Docs

* **tree:** added roadmap ([c8eeff8](https://github.com/VinciGit00/Scrapegraph-ai/commit/c8eeff873db6c8d23c9e4109ddee46edaa68b92b))
* **roadmap:** open contributions ([4441505](https://github.com/VinciGit00/Scrapegraph-ai/commit/4441505b239fa819032469f148115bb3392b15ea))
* typo ([faa3498](https://github.com/VinciGit00/Scrapegraph-ai/commit/faa3498fa7694ee3309eeed479d8f1bc4b1c7b97))


### CI

* **release:** 0.6.1-beta.1 [skip ci] ([75a4042](https://github.com/VinciGit00/Scrapegraph-ai/commit/75a4042a232a5b69fd38d1666fea9633b4fd015e))

## [0.6.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.0...v0.6.1) (2024-05-02)



### Bug Fixes

* gemini errror ([2ea54ea](https://github.com/VinciGit00/Scrapegraph-ai/commit/2ea54eab1d070e177c7d5ecfcc032b325fbd7c12))


## [0.6.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.2...v0.6.0) (2024-05-02)


Expand Down
40 changes: 38 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -168,7 +168,38 @@ result = smart_scraper_graph.run()
print(result)
```

### Case 5: Extracting information using Gemini

### Case 5: Extracting information using Azure
```python
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings

lm_model_instance = AzureChatOpenAI(
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
graph_config = {
"llm": {"model_instance": llm_model_instance},
"embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
event_end_date, event_end_time, location, event_mode, event_category,
third_party_redirect, no_of_days,
time_in_hours, hosted_or_attending, refreshments_type,
registration_available, registration_link""",
source="https://www.hmhco.com/event",
config=graph_config
)
```

### Case 6: Extracting information using Gemini
```python
from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"
Expand Down Expand Up @@ -215,6 +246,11 @@ Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegra
[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)

## 📈 Roadmap
Check out the project roadmap [here](docs/README.md)! 🚀

Wanna visualize the roadmap in a more interactive way? Check out the [markmap](https://markmap.js.org/repl) visualization by copy pasting the markdown content in the editor!

## ❤️ Contributors
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)

Expand Down Expand Up @@ -249,4 +285,4 @@ ScrapeGraphAI is licensed under the MIT License. See the [LICENSE](https://githu
## Acknowledgements

- We would like to thank all the contributors to the project and the open-source community for their support.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
75 changes: 75 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
---
title: ScrapGraphAI Roadmap
markmap:
colorFreezeLevel: 2
maxWidth: 500
---

# **ScrapGraphAI Roadmap**

## **Short-Term Goals**

- Integration with more llm APIs

- Test proxy rotation implementation

- Add more search engines inside the SearchInternetNode

- Improve the documentation (ReadTheDocs)
- [Issue #102](https://github.com/VinciGit00/Scrapegraph-ai/issues/102)

- Create tutorials for the library

## **Medium-Term Goals**

- Node for handling API requests

- Improve SearchGraph to look into the first 5 results of the search engine

- Make scraping more deterministic
- Create DOM tree of the website
- HTML tag text embeddings with tags metadata
- Study tree forks from root node
- How do we use the tags parameters?

- Create scraping folder with report
- Folder contains .scrape files, DOM tree files, report
- Report could be a HTML page with scraping speed, costs, LLM info, scraped content and DOM tree visualization
- We can use pyecharts with R-markdown

- Scrape multiple pages of the same website
- Create new node that instantiate multiple graphs at the same time
- Make graphs run in parallel
- Scrape only relevant URLs from user prompt
- Use the multi dimensional DOM tree of the website for retrieval
- [Issue #112](https://github.com/VinciGit00/Scrapegraph-ai/issues/112)

- Crawler graph
- Scrape all the URLs with the same domain in all the pages
- Build many DOM trees and link them together
- Save the multi dimensional tree in a file

- Compare two DOM trees to assess the similarity
- Save the DOM tree of the scraped website in a file as a sort of cache to be used to compare with future website structure
- Create similarity metrics with multiple DOM trees (overall tree? only relevant tags structure?)

- Nodes for handling authentication
- Use Selenium or Playwright to handle authentication
- Passes the cookies to the other nodes

- Nodes that attaches to an open browser
- Use Selenium or Playwright to attach to an open browser
- Navigate inside the browser and scrape the content

- Nodes for taking screenshots and understanding the page layout
- Use Selenium or Playwright to take screenshots
- Use LLM to asses if it is a block-like page, paragraph-like page, etc.
- [Issue #88](https://github.com/VinciGit00/Scrapegraph-ai/issues/88)

## **Long-Term Goals**

- Automatic generation of scraping pipelines from a given prompt

- Create API for the library

- Finetune a LLM for html content
9 changes: 6 additions & 3 deletions examples/azure/smart_scraper_azure_openai.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from scrapegraphai.utils import prettify_exec_info


## required environment variable in .env
# required environment variable in .env
# AZURE_OPENAI_ENDPOINT
# AZURE_OPENAI_CHAT_DEPLOYMENT_NAME
# MODEL_NAME
Expand Down Expand Up @@ -45,8 +45,11 @@
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, event_end_date, event_end_time, location, event_mode, event_category, third_party_redirect, no_of_days,
time_in_hours, hosted_or_attending, refreshments_type, registration_available, registration_link",
prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
event_end_date, event_end_time, location, event_mode, event_category,
third_party_redirect, no_of_days,
time_in_hours, hosted_or_attending, refreshments_type,
registration_available, registration_link""",
# also accepts a string with the already downloaded HTML code
source="https://www.hmhco.com/event",
config=graph_config
Expand Down
13 changes: 3 additions & 10 deletions examples/gemini/csv_scraper_gemini.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,20 +19,13 @@
# ************************************************
# Define the configuration for the graph
# ************************************************
gemini_key = os.getenv("GOOGLE_APIKEY")

graph_config = {
"llm": {
"model": "ollama/mistral",
"temperature": 0,
"format": "json", # Ollama needs the format to be specified explicitly
# "model_tokens": 2000, # set context length arbitrarily
"base_url": "http://localhost:11434",
"api_key": gemini_key,
"model": "gemini-pro",
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434",
}
}

# ************************************************
Expand Down
2 changes: 1 addition & 1 deletion examples/gemini/smart_scraper_gemini.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
graph_config = {
"llm": {
"api_key": gemini_key,
"model": "gpt-3.5-turbo",
"model": "gemini-pro",
},
}

Expand Down
3 changes: 1 addition & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
[tool.poetry]
name = "scrapegraphai"

version = "0.6.1b1"

version = "0.6.2"

description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
authors = [
Expand Down
5 changes: 3 additions & 2 deletions scrapegraphai/helpers/models_tokens.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,9 @@
"gpt-4-32k-0613": 32768,
},
"azure": {
"gpt-3.5-turbo": 4096
"gpt-3.5-turbo": 4096,
"gpt-4": 8192,
"gpt-4-32k": 32768
},
"gemini": {
"gemini-pro": 128000,
Expand Down Expand Up @@ -65,4 +67,3 @@
"cohere.embed-multilingual-v3": 512
}
}