SearchScraper #30

Merged · 7 commits · Feb 3, 2025 · Changes from all commits
23 changes: 12 additions & 11 deletions README.md
@@ -9,15 +9,15 @@
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
</p>

-Official SDKs for the ScrapeGraph AI API - Intelligent web scraping powered by AI. Extract structured data from any webpage with natural language prompts.
+Official SDKs for the ScrapeGraph AI API - Intelligent web scraping and search powered by AI. Extract structured data from any webpage or perform AI-powered web searches with natural language prompts.

Get your [API key](https://scrapegraphai.com)!

## πŸš€ Quick Links

- [Python SDK Documentation](scrapegraph-py/README.md)
- [JavaScript SDK Documentation](scrapegraph-js/README.md)
- [API Documentation](https://docs.scrapegraphai.com)
- [Website](https://scrapegraphai.com)

## πŸ“¦ Installation
@@ -34,31 +34,31 @@ npm install scrapegraph-js

## 🎯 Core Features

- πŸ€– **AI-Powered Extraction**: Use natural language to describe what data you want
- πŸ€– **AI-Powered Extraction & Search**: Use natural language to extract data or search the web
- πŸ“Š **Structured Output**: Get clean, structured data with optional schema validation
- πŸ”„ **Multiple Formats**: Extract data as JSON, Markdown, or custom schemas
- ⚑ **High Performance**: Concurrent processing and automatic retries
- πŸ”’ **Enterprise Ready**: Production-grade security and rate limiting

## πŸ› οΈ Available Endpoints

### πŸ” SmartScraper
Extract structured data from any webpage using natural language prompts.
### πŸ€– SmartScraper
Using AI to extract structured data from any webpage or HTML content with natural language prompts.

### πŸ” SearchScraper
Perform AI-powered web searches with structured results and reference URLs.

### πŸ“ Markdownify
Convert any webpage into clean, formatted markdown.

### πŸ’» LocalScraper
Extract information from a local HTML file using AI.
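
Each endpoint above maps to a single SDK call. As a quick illustration of the headline addition, here is a minimal sketch of SearchScraper through the Python client; the method name and the `result`/`reference_urls` response keys are taken from the SDK examples further down:

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# One call performs the AI-powered search and returns a structured payload.
response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?"
)

print(response["result"])          # synthesized answer
print(response["reference_urls"])  # sources the answer was drawn from
```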


## 🌟 Key Benefits

- πŸ“ **Natural Language Queries**: No complex selectors or XPath needed
- 🎯 **Precise Extraction**: AI understands context and structure
- πŸ”„ **Adaptive Scraping**: Works with dynamic and static content
- πŸ”„ **Adaptive Processing**: Works with both web content and direct HTML
- πŸ“Š **Schema Validation**: Ensure data consistency with Pydantic/TypeScript
- ⚑ **Async Support**: Handle multiple requests efficiently
- πŸ” **Source Attribution**: Get reference URLs for search results

## πŸ’‘ Use Cases

@@ -67,13 +67,14 @@ Extract information from a local HTML file using AI.
- πŸ“° **Content Aggregation**: Convert articles to structured formats
- πŸ” **Data Mining**: Extract specific information from multiple sources
- πŸ“± **App Integration**: Feed clean data into your applications
+- 🌐 **Web Research**: Perform AI-powered searches with structured results

## πŸ“– Documentation

For detailed documentation and examples, visit:
- [Python SDK Guide](scrapegraph-py/README.md)
- [JavaScript SDK Guide](scrapegraph-js/README.md)
- [API Documentation](https://docs.scrapegraphai.com)

## πŸ’¬ Support & Feedback

81 changes: 53 additions & 28 deletions scrapegraph-py/README.md
@@ -4,7 +4,7 @@
[![Python Support](https://img.shields.io/pypi/pyversions/scrapegraph-py.svg)](https://pypi.org/project/scrapegraph-py/)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Documentation Status](https://readthedocs.org/projects/scrapegraph-py/badge/?version=latest)](https://docs.scrapegraphai.com)

<p align="left">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
@@ -20,7 +20,7 @@ pip install scrapegraph-py

## πŸš€ Features

- πŸ€– AI-powered web scraping
- πŸ€– AI-powered web scraping and search
- πŸ”„ Both sync and async clients
- πŸ“Š Structured output with Pydantic schemas
- πŸ” Detailed logging
@@ -40,21 +40,36 @@ client = Client(api_key="your-api-key-here")

## πŸ“š Available Endpoints

### πŸ” SmartScraper
### πŸ€– SmartScraper

-Scrapes any webpage using AI to extract specific information.
+Extract structured data from any webpage or HTML content using AI.

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

-# Basic usage
+# Using a URL
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)

+# Or using HTML content
+html_content = """
+<html>
+    <body>
+        <h1>Company Name</h1>
+        <p>We are a technology company focused on AI solutions.</p>
+    </body>
+</html>
+"""
+
+response = client.smartscraper(
+    website_html=html_content,
+    user_prompt="Extract the company description"
+)

print(response)
```

@@ -80,46 +95,56 @@ response = client.smartscraper(

</details>
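
The collapsed details block above (an optional output schema for SmartScraper, per its summary) is not expanded in this diff. For completeness, a hedged sketch of schema-constrained extraction, assuming `smartscraper` accepts the same `output_schema` parameter that `searchscraper` documents below; `PageInfo` is a hypothetical schema for illustration:

```python
from pydantic import BaseModel, Field
from scrapegraph_py import Client


class PageInfo(BaseModel):
    heading: str = Field(description="The main heading of the page")
    description: str = Field(description="The page's lead description")


client = Client(api_key="your-api-key-here")

# Assumption: smartscraper mirrors searchscraper's output_schema parameter.
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description",
    output_schema=PageInfo,
)
print(response)
```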

### πŸ“ Markdownify
### πŸ” SearchScraper

Converts any webpage into clean, formatted markdown.
Perform AI-powered web searches with structured results and reference URLs.

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.markdownify(
website_url="https://example.com"
response = client.searchscraper(
user_prompt="What is the latest version of Python and its main features?"
)

print(response)
print(f"Answer: {response['result']}")
print(f"Sources: {response['reference_urls']}")
```

### πŸ’» LocalScraper

Extracts information from HTML content using AI.
<details>
<summary>Output Schema (Optional)</summary>

```python
from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

html_content = """
<html>
<body>
<h1>Company Name</h1>
<p>We are a technology company focused on AI solutions.</p>
<div class="contact">
<p>Email: [email protected]</p>
</div>
</body>
</html>
"""
class PythonVersionInfo(BaseModel):
version: str = Field(description="The latest Python version number")
release_date: str = Field(description="When this version was released")
major_features: list[str] = Field(description="List of main features")

response = client.searchscraper(
user_prompt="What is the latest version of Python and its main features?",
output_schema=PythonVersionInfo
)
```

</details>

response = client.localscraper(
user_prompt="Extract the company description",
website_html=html_content
### πŸ“ Markdownify

Converts any webpage into clean, formatted markdown.

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.markdownify(
website_url="https://example.com"
)

print(response)
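
The schema example above ends at the request. Here is a sketch of one way to consume the schema-constrained response: `model_validate` is the same pydantic v2 call used in the async schema example later in this PR, and the try/except allows for results that do not parse cleanly.

```python
from pydantic import BaseModel, Field, ValidationError
from scrapegraph_py import Client


class PythonVersionInfo(BaseModel):
    version: str = Field(description="The latest Python version number")
    release_date: str = Field(description="When this version was released")
    major_features: list[str] = Field(description="List of main features")


client = Client(api_key="your-api-key-here")

response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?",
    output_schema=PythonVersionInfo,
)

try:
    # Validate the raw result against the schema to get a typed object.
    info = PythonVersionInfo.model_validate(response["result"])
    print(f"Version: {info.version} (released {info.release_date})")
except ValidationError as exc:
    # Fall back to the unvalidated payload if the shape does not match.
    print(f"Could not validate result: {exc}")
    print(response["result"])
```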
@@ -177,7 +202,7 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
## πŸ”— Links

- [Website](https://scrapegraphai.com)
- [Documentation](https://docs.scrapegraphai.com)
- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)

---
46 changes: 46 additions & 0 deletions scrapegraph-py/examples/async/async_searchscraper_example.py
@@ -0,0 +1,46 @@
"""
Example of using the async searchscraper functionality to search for information concurrently.
"""

import asyncio

from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")


async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key-here")

    # List of search queries
    queries = [
        "What is the latest version of Python and what are its main features?",
        "What are the key differences between Python 2 and Python 3?",
        "What is Python's GIL and how does it work?",
    ]

    # Create tasks for concurrent execution
    tasks = [sgai_client.searchscraper(user_prompt=query) for query in queries]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for query {i+1}: {response}")
        else:
            print(f"\nSearch {i+1}:")
            print(f"Query: {queries[i]}")
            print(f"Result: {response['result']}")
            print("Reference URLs:")
            for url in response["reference_urls"]:
                print(f"- {url}")

    await sgai_client.close()


if __name__ == "__main__":
    asyncio.run(main())
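
One caveat about the pattern above: `asyncio.gather` fires every request at once. Below is a minimal sketch of capping in-flight requests with a semaphore; the `AsyncClient` calls are the ones used above, while the limit of 2 is an arbitrary illustrative choice.

```python
import asyncio

from scrapegraph_py import AsyncClient


async def bounded_search(client: AsyncClient, sem: asyncio.Semaphore, query: str):
    # Hold a semaphore slot for the duration of the request, so at most
    # sem's initial value of searches run concurrently.
    async with sem:
        return await client.searchscraper(user_prompt=query)


async def main():
    client = AsyncClient(api_key="your-api-key-here")
    sem = asyncio.Semaphore(2)  # illustrative cap, not an API requirement
    queries = [
        "What is the latest version of Python and what are its main features?",
        "What is Python's GIL and how does it work?",
    ]
    responses = await asyncio.gather(
        *(bounded_search(client, sem, q) for q in queries),
        return_exceptions=True,
    )
    for query, response in zip(queries, responses):
        if isinstance(response, Exception):
            print(f"{query} -> error: {response}")
        else:
            print(f"{query} -> {response['result']}")
    await client.close()


if __name__ == "__main__":
    asyncio.run(main())
```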
119 changes: 119 additions & 0 deletions scrapegraph-py/examples/async/async_searchscraper_schema_example.py
@@ -0,0 +1,119 @@
"""
Example of using the async searchscraper functionality with output schemas for extraction.
"""

import asyncio
from typing import List

from pydantic import BaseModel

from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")


# Define schemas for extracting structured data
class PythonVersionInfo(BaseModel):
    version: str
    release_date: str
    major_features: List[str]


class PythonComparison(BaseModel):
    key_differences: List[str]
    backward_compatible: bool
    migration_difficulty: str


class GILInfo(BaseModel):
    definition: str
    purpose: str
    limitations: List[str]
    workarounds: List[str]


async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key-here")

    # Define search queries with their corresponding schemas
    searches = [
        {
            "prompt": "What is the latest version of Python? Include the release date and main features.",
            "schema": PythonVersionInfo,
        },
        {
            "prompt": "Compare Python 2 and Python 3, including backward compatibility and migration difficulty.",
            "schema": PythonComparison,
        },
        {
            "prompt": "Explain Python's GIL, its purpose, limitations, and possible workarounds.",
            "schema": GILInfo,
        },
    ]

    # Create tasks for concurrent execution
    tasks = [
        sgai_client.searchscraper(
            user_prompt=search["prompt"],
            output_schema=search["schema"],
        )
        for search in searches
    ]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for search {i+1}: {response}")
        else:
            print(f"\nSearch {i+1}:")
            print(f"Query: {searches[i]['prompt']}")
            # print(f"Raw Result: {response['result']}")

            try:
                # Try to extract structured data using the schema
                result = searches[i]["schema"].model_validate(response["result"])

                # Print extracted structured data
                if isinstance(result, PythonVersionInfo):
                    print("\nExtracted Data:")
                    print(f"Python Version: {result.version}")
                    print(f"Release Date: {result.release_date}")
                    print("Major Features:")
                    for feature in result.major_features:
                        print(f"- {feature}")

                elif isinstance(result, PythonComparison):
                    print("\nExtracted Data:")
                    print("Key Differences:")
                    for diff in result.key_differences:
                        print(f"- {diff}")
                    print(f"Backward Compatible: {result.backward_compatible}")
                    print(f"Migration Difficulty: {result.migration_difficulty}")

                elif isinstance(result, GILInfo):
                    print("\nExtracted Data:")
                    print(f"Definition: {result.definition}")
                    print(f"Purpose: {result.purpose}")
                    print("Limitations:")
                    for limit in result.limitations:
                        print(f"- {limit}")
                    print("Workarounds:")
                    for workaround in result.workarounds:
                        print(f"- {workaround}")
            except Exception as e:
                print(f"\nCould not extract structured data: {e}")

            print("\nReference URLs:")
            for url in response["reference_urls"]:
                print(f"- {url}")

    await sgai_client.close()


if __name__ == "__main__":
    asyncio.run(main())
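
The per-schema `isinstance` branches above are explicit but repetitive. Because every schema is a Pydantic model, the printing could instead be collapsed into one generic helper via `model_dump()` (pydantic v2, the same API family as the `model_validate` call above). A sketch of that alternative:

```python
from pydantic import BaseModel


def print_model(result: BaseModel) -> None:
    # Dump any validated model to a plain dict and print each field
    # uniformly, instead of one isinstance branch per schema.
    for field_name, value in result.model_dump().items():
        if isinstance(value, list):
            print(f"{field_name}:")
            for item in value:
                print(f"- {item}")
        else:
            print(f"{field_name}: {value}")
```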