Commit 066e77d

Merge branch 'main' into pre/beta
2 parents: 407f1ce + 11ae717

File tree: 7 files changed (+77, −16 lines)


CHANGELOG.md

Lines changed: 4 additions & 1 deletion

@@ -33,6 +33,8 @@
 
 * implement ScrapeGraph class for only web scraping automation ([612c644](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/612c644623fa6f4fe77a64a5f1a6a4d6cd5f4254))
 * Implement SmartScraperMultiParseMergeFirstGraph class that scrapes a list of URLs and merge the content first and finally generates answers to a given prompt. ([3e3e1b2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3e3e1b2f3ae8ed803d03b3b44b199e139baa68d4))
+=======
+## [1.26.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.6...v1.26.7) (2024-10-19)
 
 
 ### Bug Fixes
@@ -70,6 +72,8 @@
 * add conditional node structure to the smart_scraper_graph and implemented a structured way to check condition ([cacd9cd](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/cacd9cde004dace1a7dcc27981245632a78b95f3))
 
 
+* removed tokenizer ([a184716](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/a18471688f0b79f06fb7078b01b68eeddc88eae4))
+
 ## [1.26.6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.5...v1.26.6) (2024-10-18)
 
 ## [1.26.6-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.5...v1.26.6-beta.1) (2024-10-14)
@@ -79,7 +83,6 @@
 * remove variable "max_result" not being used in the code ([e76a68a](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/e76a68a782e5bce48d421cb620d0b7bffa412918))
 
 * refactoring of gpt2 tokenizer ([44c3f9c](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/44c3f9c98939c44caa86dc582242819a7c6a0f80))
->>>>>>> main
 
 ## [1.26.5](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.4...v1.26.5) (2024-10-13)
 
docs/source/introduction/overview.rst

Lines changed: 40 additions & 1 deletion

@@ -22,6 +22,45 @@ This flexibility ensures that scrapers remain functional even when website layou
 We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
 as well as local models which can run on your machine using **Ollama**.
 
+AI Models and Token Limits
+==========================
+
+ScrapeGraphAI supports a wide range of AI models from various providers. Each model has a specific token limit, which is important to consider when designing your scraping pipelines. Here's an overview of the supported models and their token limits:
+
+OpenAI Models
+-------------
+- GPT-3.5 Turbo (16,385 tokens)
+- GPT-4 (8,192 tokens)
+- GPT-4 Turbo Preview (128,000 tokens)
+
+Azure OpenAI Models
+-------------------
+- GPT-3.5 Turbo (16,385 tokens)
+- GPT-4 (8,192 tokens)
+- GPT-4 Turbo Preview (128,000 tokens)
+
+Google AI Models
+----------------
+- Gemini Pro (128,000 tokens)
+- Gemini 1.5 Pro (128,000 tokens)
+
+Anthropic Models
+----------------
+- Claude Instant (100,000 tokens)
+- Claude 2 (200,000 tokens)
+- Claude 3 (200,000 tokens)
+
+Mistral AI Models
+-----------------
+- Mistral Large (128,000 tokens)
+- Open Mistral 7B (32,000 tokens)
+- Open Mixtral 8x7B (32,000 tokens)
+
+For a complete list of supported models and their token limits, please refer to the API documentation.
+
+Understanding token limits is crucial for optimizing your scraping tasks. Larger token limits allow for processing more text in a single API call, which can be beneficial for scraping lengthy web pages or documents.
+
+
 Library Diagram
 ===============
 
@@ -95,4 +134,4 @@ Sponsors
 .. image:: ../../assets/transparent_stat.png
    :width: 15%
    :alt: Stat Proxies
-   :target: https://dashboard.statproxies.com/?refferal=scrapegraph
+   :target: https://dashboard.statproxies.com/?refferal=scrapegraph
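The token-limit overview added above lends itself to a quick programmatic check before dispatching a scrape. The sketch below is illustrative only: the `TOKEN_LIMITS` dictionary and the `fits_in_context` helper are hypothetical stand-ins mirroring the figures listed in the doc; the library's real per-model limits live in `scrapegraphai.helpers.models_tokens`.

```python
# Illustrative sketch: token limits as listed in the overview section above.
# TOKEN_LIMITS and fits_in_context are hypothetical glue code, not the
# library's API; the real limits live in scrapegraphai.helpers.models_tokens.
TOKEN_LIMITS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-turbo-preview": 128_000,
    "claude-2": 200_000,
    "open-mistral-7b": 32_000,
}

def fits_in_context(model: str, estimated_tokens: int) -> bool:
    """Return True if an input of `estimated_tokens` fits the model's window."""
    limit = TOKEN_LIMITS.get(model)
    if limit is None:
        raise KeyError(f"Unknown model: {model}")
    return estimated_tokens <= limit

print(fits_in_context("gpt-4", 5_000))           # 5,000 fits in 8,192
print(fits_in_context("gpt-3.5-turbo", 20_000))  # 20,000 exceeds 16,385
```

A check like this is why the section stresses token limits: a page that overflows the context window has to be chunked before it can be summarized in one call.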

docs/source/modules/modules.rst

Lines changed: 3 additions & 0 deletions

@@ -5,3 +5,6 @@ scrapegraphai
    :maxdepth: 4
 
    scrapegraphai
+
+   scrapegraphai.helpers.models_tokens
+
(new file)

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
+scrapegraphai.helpers.models_tokens module
+==========================================
+
+.. automodule:: scrapegraphai.helpers.models_tokens
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+This module contains a comprehensive dictionary of AI models and their corresponding token limits. The `models_tokens` dictionary is organized by provider (e.g., OpenAI, Azure OpenAI, Google AI, etc.) and includes various models with their maximum token counts.
+
+Example usage:
+
+.. code-block:: python
+
+   from scrapegraphai.helpers.models_tokens import models_tokens
+
+   # Get the token limit for GPT-4
+   gpt4_limit = models_tokens['openai']['gpt-4']
+   print(f"GPT-4 token limit: {gpt4_limit}")
+
+   # Check the token limit for a specific model
+   model_name = "gpt-3.5-turbo"
+   if model_name in models_tokens['openai']:
+       print(f"{model_name} token limit: {models_tokens['openai'][model_name]}")
+   else:
+       print(f"{model_name} not found in the models list")
+
+This information is crucial for users to understand the capabilities and limitations of different AI models when designing their scraping pipelines.

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@ name = "scrapegraphai"
 version = "1.27.0b7"
 
 
+
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
 authors = [
     { name = "Marco Vinciguerra", email = "[email protected]" },

scrapegraphai/utils/tokenizer.py

Lines changed: 0 additions & 8 deletions

@@ -6,7 +6,6 @@
 from langchain_ollama import ChatOllama
 from langchain_mistralai import ChatMistralAI
 from langchain_core.language_models.chat_models import BaseChatModel
-from transformers import GPT2TokenizerFast
 
 def num_tokens_calculus(string: str, llm_model: BaseChatModel) -> int:
     """
@@ -24,13 +23,6 @@ def num_tokens_calculus(string: str, llm_model: BaseChatModel) -> int:
         from .tokenizers.tokenizer_ollama import num_tokens_ollama
         num_tokens_fn = num_tokens_ollama
 
-    elif isinstance(llm_model, GPT2TokenizerFast):
-        def num_tokens_gpt2(text: str, model: BaseChatModel) -> int:
-            tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
-            tokens = tokenizer.encode(text)
-            return len(tokens)
-        num_tokens_fn = num_tokens_gpt2
-
     else:
         from .tokenizers.tokenizer_openai import num_tokens_openai
         num_tokens_fn = num_tokens_openai
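After this removal, `num_tokens_calculus` keeps a plain `isinstance` dispatch: Ollama (and similarly Mistral) models route to dedicated counters, and everything else falls back to the OpenAI tokenizer. Below is a self-contained sketch of that pattern; the stub classes and the word/character heuristics are placeholders for langchain's real chat-model classes and the scrapegraphai tokenizer modules.

```python
# Self-contained sketch of the dispatch left in num_tokens_calculus after
# the GPT-2 branch was removed. OllamaModel/OtherModel and the two stub
# counters are placeholders for langchain's chat models and the real
# scrapegraphai tokenizer modules.
class OllamaModel:
    pass

class OtherModel:
    pass

def num_tokens_ollama_stub(text: str, model) -> int:
    return len(text.split())           # stand-in for the Ollama counter

def num_tokens_openai_stub(text: str, model) -> int:
    return max(1, len(text) // 4)      # rough chars-per-token heuristic

def num_tokens_calculus_sketch(text: str, llm_model) -> int:
    """Pick a counter based on the model type, mirroring the real dispatch."""
    if isinstance(llm_model, OllamaModel):
        num_tokens_fn = num_tokens_ollama_stub
    else:                              # default branch: OpenAI-style counting
        num_tokens_fn = num_tokens_openai_stub
    return num_tokens_fn(text, llm_model)

print(num_tokens_calculus_sketch("count these tokens", OllamaModel()))  # 3
```

Keeping a single `else` fallback means unrecognized model types still get a token count instead of raising, which is the behavior the deleted GPT-2 branch was complicating.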

scrapegraphai/utils/tokenizers/tokenizer_ollama.py

Lines changed: 1 addition & 6 deletions

@@ -3,7 +3,6 @@
 """
 from langchain_core.language_models.chat_models import BaseChatModel
 from ..logging import get_logger
-from transformers import GPT2TokenizerFast
 
 def num_tokens_ollama(text: str, llm_model:BaseChatModel) -> int:
     """
@@ -22,12 +21,8 @@ def num_tokens_ollama(text: str, llm_model:BaseChatModel) -> int:
 
     logger.debug(f"Counting tokens for text of {len(text)} characters")
 
-    if isinstance(llm_model, GPT2TokenizerFast):
-        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
-        tokens = tokenizer.encode(text)
-        return len(tokens)
-
     # Use langchain token count implementation
     # NB: https://github.com/ollama/ollama/issues/1716#issuecomment-2074265507
     tokens = llm_model.get_num_tokens(text)
     return tokens
+
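With the `GPT2TokenizerFast` branch gone, `num_tokens_ollama` reduces to delegating to the model's own `get_num_tokens` (langchain's `BaseChatModel` API). A minimal sketch under that assumption, with a hypothetical `FakeChatModel` standing in for `ChatOllama` so the snippet runs without a local Ollama server:

```python
# Minimal sketch of the simplified num_tokens_ollama: it simply delegates
# to the model's own token counter. FakeChatModel is a stand-in for
# langchain's ChatOllama; counting whitespace-separated words is only for
# illustration — real models apply their own tokenizer.
class FakeChatModel:
    def get_num_tokens(self, text: str) -> int:
        return len(text.split())

def num_tokens_ollama(text: str, llm_model) -> int:
    """Count tokens by deferring to the model (see ollama/ollama#1716)."""
    tokens = llm_model.get_num_tokens(text)
    return tokens

print(num_tokens_ollama("hello scraping world", FakeChatModel()))  # 3
```

Deferring to the model keeps the count consistent with whatever tokenizer the served model actually uses, which the linked ollama issue notes a hard-coded GPT-2 tokenizer cannot guarantee.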
0 commit comments
