687 smartscrapermulticoncatgraph error with bedrock #689
Conversation
…ttps://github.com/ScrapeGraphAI/Scrapegraph-ai into 687-smartscrapermulticoncatgraph-error-with-bedrock
@VinciGit00 Currently the iterator node is built from a single shared instance:

```python
smart_scraper_instance = SmartScraperGraph(
    prompt="",
    source="",
    config=self.copy_config,
    schema=self.copy_schema
)

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": smart_scraper_instance,
    }
)
```

versus passing the class and config so each iteration builds its own instance:

```python
graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
        "scraper_schema": self.copy_schema,
    }
)
```

NOTE: I have assumed the "scraper_schema" key; not sure that is how the schema gets passed in.
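The difference between the two configurations above (a pre-built shared instance versus the class plus its config) can be sketched in plain Python. `MiniGraph` and `run_iterator` below are hypothetical stand-ins, not scrapegraphai APIs; the point is only that instantiating per URL avoids sharing per-instance state across iterations:

```python
class MiniGraph:
    """Hypothetical stand-in for a scraper graph with per-instance state."""

    def __init__(self, source, config):
        self.source = source
        self.config = config
        self.results = []  # per-instance state; shared if the instance is reused

    def run(self):
        self.results.append(f"scraped:{self.source}")
        return self.results

def run_iterator(graph_cls, urls, config):
    # Build one fresh graph per URL so no state leaks between iterations.
    return [graph_cls(source=url, config=config).run() for url in urls]

out = run_iterator(MiniGraph, ["http://a", "http://b"], {"format": "json"})
print(out)  # [['scraped:http://a'], ['scraped:http://b']]
```

Reusing one `MiniGraph` instance for both URLs would instead accumulate both results in the same `results` list, which is the kind of leakage the class-plus-config form avoids.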
Hi @rjbks, take a look now at the changed files.
@VinciGit00 This:

```python
graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
        "scraper_schema": self.copy_schema,
    }
)
```

Should be this:

```python
graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
    },
    schema=self.copy_schema,
)
```

Aside from that, I have applied your updates (and fixed my mistake), and it looks better. Here is the output:

```python
{'products': {'item_1': AIMessage(content='{\n  "fellowshipPrograms": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 147, 'completion_tokens': 17, 'total_tokens': 164}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 147, 'completion_tokens': 17, 'total_tokens': 164}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-a1065115-b8d9-424e-ac9d-c7a331a52beb-0', usage_metadata={'input_tokens': 147, 'output_tokens': 17, 'total_tokens': 164}),
'item_2': AIMessage(content='Based on the scraped content, here is the information on the Infectious Disease Fellowship program offered:\n\n{\n  "program_name": "Infectious Disease Fellowship Program",\n  "institution": "MedStar Washington Hospital Center",\n  "duration": "2 years",\n  "current_status": "Active",\n  "date_range": "NA",\n  "key_features": [\n    "Based at largest and busiest hospital in Washington D.C.",\n    "Level I trauma center",\n    "Rotations at National Institutes of Health and Children\'s National Medical Center",\n    "Large outpatient HIV clinic",\n    "Designated Ebola Treatment Center",\n    "On-site microbiology laboratory",\n    "Transplant and device-associated infections service",\n    "Funding for conferences and board review courses",\n    "Antimicrobial stewardship curriculum"\n  ]\n}', additional_kwargs={'usage': {'prompt_tokens': 4225, 'completion_tokens': 209, 'total_tokens': 4434}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 4225, 'completion_tokens': 209, 'total_tokens': 4434}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-e57f3e2b-0ef7-42ca-b6c5-bb891b9dc729-0', usage_metadata={'input_tokens': 4225, 'output_tokens': 209, 'total_tokens': 4434})}}
```

One thing, however: usually the output is the parsed JSON, but in this case it looks like an intermediate result. I guess this is because no schema was supplied? Although the contents of
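The "intermediate result" above is a raw `AIMessage` whose `content` is a JSON string, sometimes wrapped in prose (as in `item_2`). Absent an output parser, downstream code would have to extract the JSON manually, roughly like this stdlib-only sketch (`extract_json` is illustrative, not a library function):

```python
import json
import re

def extract_json(content: str) -> dict:
    """Pull the first JSON object out of a model reply that may include
    surrounding prose, as in the item_2 output above."""
    match = re.search(r"\{.*\}", content, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = 'Based on the scraped content, here is the info:\n\n{\n  "fellowshipPrograms": "NA"\n}'
print(extract_json(raw))  # {'fellowshipPrograms': 'NA'}
```

This is fragile (it assumes one well-formed JSON object in the reply), which is exactly why a proper output parser is preferable.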
@rjbks look at it now, please.

Looks good. Any thoughts on my last question?

Are you using a schema?

I think the main reason is the model.

No schema; is that why?

You think it is because I am using Claude 3.5 Sonnet?

Please use the schema.

I have to close the PR; let me know if everything is OK.
🎉 This PR is included in version 1.21.2-beta.2 🎉 The release is available on:

Your semantic-release bot 📦🚀
@VinciGit00 Even with a schema this returns the same response. It does not respect the schema provided.

```python
import os
import json
import enum
from typing import List, Optional, Literal

import boto3
from pydantic import BaseModel, Field

from scrapegraphai.graphs import SmartScraperMultiConcatGraph

client = boto3.client("bedrock-runtime", region_name="us-west-2")

graph_config = {
    "llm": {
        "client": client,
        "model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
        "temperature": 0.0
    },
    "format": "json",
    "verbose": False
}

class ProgramType(str, enum.Enum):
    RESIDENCY = "residency"
    FELLOWSHIP = "fellowship"

class Program(BaseModel):
    link: str = Field(description="The link (url) where this program was found")
    type: ProgramType = Field(description="The type of program this is.")  # , pattern=r'residency|fellowship'
    specialty: str = Field(description="The medical specialty on which this program focuses.")
    sub_specialty: Optional[str] = Field(description="The medical sub-specialty on which this program focuses.")
    name: str = Field(description="The name of the program.")
    institution_name: Optional[str] = Field(description="The name of the institution hosting the residency/fellowship program.")
    institution_link: Optional[str] = Field(description="The link (url) to the institution hosting the residency program.")
    institution_address: Optional[str] = Field(description="The address of the residency or fellowship program.")
    director: Optional[str] = Field(description="The director's name running/managing the program.")
    director_phone: Optional[str] = Field(description="The phone number of the residency or fellowship program.")
    director_email: Optional[str] = Field(description="The email address of the residency or fellowship program.")

class Result(BaseModel):
    program_links: Optional[List[str]] = Field(description="Links to fellowship or residency programs offered by this institution")
    programs: List[Program] = Field(description="List of programs (fellowships or residencies) found on this page.")

graph = SmartScraperMultiConcatGraph(
    prompt="Find information on all Fellowship and Residency programs offered, current and historic.",
    source=[
        "https://www.childrensdmc.org/health-professionals/just-for-doctors/fellowships/infectious-diseases",
        "https://www.medstarhealth.org/education/fellowship-programs/infectious-disease"
    ],
    schema=Result,
    config=graph_config
)

print(graph.run())
```

Output:

```python
{'products': {'item_1': AIMessage(content='{\n  "Fellowship_Programs": "NA",\n  "Residency_Programs": "NA",\n  "Current_Programs": "NA",\n  "Historic_Programs": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 145, 'completion_tokens': 48, 'total_tokens': 193}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 145, 'completion_tokens': 48, 'total_tokens': 193}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-84a21bcf-6af6-4be9-a0fb-2f8a2f952d63-0', usage_metadata={'input_tokens': 145, 'output_tokens': 48, 'total_tokens': 193}), 'item_2': AIMessage(content='Here is the JSON response based on the website content:\n\n{\n  "Fellowship Programs": [\n    {\n      "Name": "Infectious Disease Fellowship Program",\n      "Duration": "2 years",\n      "Location": "MedStar Washington Hospital Center",\n      "Description": "The MedStar Washington Hospital Center Infectious Diseases program is a 2-year training program based in the largest and busiest hospital in our nation\'s capital. It offers clinical experience in general ID consults, transplant and device-associated infections, and outpatient HIV care.",\n      "Rotations": [\n        "General consult team",\n        "Transplant / orthopedic and cardiac device infection team",\n        "NIH (stem cell transplant and highly-immunocompromised patients)",\n        "Pediatric ID consult service at Children\'s National Medical Center"\n      ],\n      "Special Features": [\n        "Joint curriculum with NIH",\n        "Ryan White-funded HIV clinic",\n        "Designated Ebola Treatment Center",\n        "On-site microbiology laboratory",\n        "Funding for conferences and board review courses",\n        "Antimicrobial stewardship curriculum"\n      ]\n    }\n  ],\n  "Residency Programs": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 4223, 'completion_tokens': 295, 'total_tokens': 4518}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 4223, 'completion_tokens': 295, 'total_tokens': 4518}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-b871daac-9436-4615-9032-bf6d19e2272d-0', usage_metadata={'input_tokens': 4223, 'output_tokens': 295, 'total_tokens': 4518})}}
```
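A simplified stand-in for the `Result` schema above (with `programs` as plain dicts to keep the sketch self-contained) shows why the free-form Bedrock reply cannot be coerced into the schema: the required `programs` field simply isn't present in what the model returned. This uses plain pydantic, independent of how scrapegraphai wires its parser:

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class ResultSketch(BaseModel):
    program_links: Optional[List[str]] = None
    programs: List[dict]  # required, unlike anything in the model's reply

# Shaped like the ad-hoc Bedrock reply above:
bad = {"Fellowship_Programs": "NA", "Residency_Programs": "NA"}
try:
    ResultSketch(**bad)
    schema_ok = True
except ValidationError:
    schema_ok = False
print(schema_ok)  # False: the required 'programs' field is missing

# A schema-shaped payload validates cleanly:
good = ResultSketch(program_links=None, programs=[])
print(good.programs)  # []
```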
Looks like in the generate answer node (line 56), the Bedrock client always resolves to the last `else` branch, so it gets no output parser and empty format instructions:

```python
if self.node_config.get("schema", None) is not None:
    if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
        self.llm_model = self.llm_model.with_structured_output(
            schema=self.node_config["schema"]
        )
        output_parser = get_structured_output_parser(self.node_config["schema"])
        format_instructions = "NA"
    else:
        if not isinstance(self.llm_model, ChatBedrock):
            output_parser = get_pydantic_output_parser(self.node_config["schema"])
            format_instructions = output_parser.get_format_instructions()
        else:
            output_parser = None
            format_instructions = ""
else:
    if not isinstance(self.llm_model, ChatBedrock):
        output_parser = JsonOutputParser()
        format_instructions = output_parser.get_format_instructions()
    else:
        output_parser = None
        format_instructions = ""
```

UPDATE 1: Removing the `ChatBedrock` special case so the schema path always builds a parser:

```python
if self.node_config.get("schema", None) is not None:
    if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
        self.llm_model = self.llm_model.with_structured_output(
            schema=self.node_config["schema"]
        )
        output_parser = get_structured_output_parser(self.node_config["schema"])
        format_instructions = "NA"
    else:
        output_parser = get_pydantic_output_parser(self.node_config["schema"])
        format_instructions = output_parser.get_format_instructions()
```

Plus altering the prompt with an additional instruction for JSON output produced the JSON correctly:

```python
{'products': {'item_1': {'program_links': None, 'programs': []}, 'item_2': {'program_links': None, 'programs': [{'link': 'https://www.medstarhealth.org/education/fellowship-programs/infectious-disease', 'type': 'fellowship', 'specialty': 'Infectious Disease', 'sub_specialty': None, 'name': 'Infectious Disease Fellowship Program', 'institution_name': 'MedStar Washington Hospital Center', 'institution_link': 'https://www.medstarhealth.org/', 'institution_address': 'Washington, D.C.', 'director': 'Saumil Doshi', 'director_phone': '(202) 877-7164', 'director_email': '[email protected]'}]}}}
```

This is not ideal, as the idea here is not to rely on prompt tuning to ensure JSON output. I have had success using `instructor` along with the `AnthropicBedrock` client to ensure JSON output; that client is available from the `anthropic` library directly.
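The "prompt tuning" workaround mentioned above amounts to appending a JSON-only instruction to the user prompt and then parsing the reply yourself. A minimal sketch of that pattern (the suffix wording and helper names are illustrative, not what ScrapeGraphAI ships):

```python
import json

# Hypothetical JSON-only instruction appended to the user prompt.
JSON_SUFFIX = (
    "\nRespond ONLY with a valid JSON object for the requested fields, "
    "with no prose before or after it."
)

def build_prompt(user_prompt: str) -> str:
    # Append the JSON-only instruction to the user prompt.
    return user_prompt + JSON_SUFFIX

def parse_reply(reply: str) -> dict:
    # Works only if the model actually honored the instruction.
    return json.loads(reply)

prompt = build_prompt("Find information on all Fellowship programs offered.")
parsed = parse_reply('{"programs": []}')
print(parsed)  # {'programs': []}
```

The weakness is exactly the one raised in the comment: nothing enforces the instruction, so a chatty model still breaks `parse_reply`, whereas a structured-output mechanism fails closed.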
OK @rjbks, can you please make a pull request to fix the error?
@VinciGit00 Regarding `if not isinstance(self.llm_model, ChatBedrock):` — is there a specific JSON output parser for ChatBedrock, or is there no output parser for Bedrock at all, and if so, what was the reason? If there is no output parser in place, my "fix" involves removing the conditional mentioned above plus this suggestion from Anthropic, which is essentially telling the LLM "JSON please" and appending that to the user prompt. I would think we don't want this.
No, there is not a specific output parser; the generic one is used.
🎉 This PR is included in version 1.24.0 🎉 The release is available on:

Your semantic-release bot 📦🚀