
687 smartscrapermulticoncatgraph error with bedrock #689


Merged

Conversation

VinciGit00
Collaborator

No description provided.

@rjbks

rjbks commented Sep 22, 2024

@VinciGit00
To fully resolve this issue (aside from the further JSON formatting problem noted in the issue), there needs to be an update to the SmartScraperMultiConcatGraph code in _create_graph at line 63:

smart_scraper_instance = SmartScraperGraph(
    prompt="",
    source="",
    config=self.copy_config,
    schema=self.copy_schema
)

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": smart_scraper_instance,
    }
)

smart_scraper_instance is not needed; instead, pass the uninstantiated SmartScraperGraph class to GraphIteratorNode like this:

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
        "scraper_schema": self.copy_schema,
    }
)

NOTE: I have assumed the "scraper_schema" key; I am not sure that is how the schema gets passed in.
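For illustration, here is a minimal, hypothetical sketch of why passing the uninstantiated class works (FakeScraperGraph and iterate_graphs are my own names, not ScrapeGraphAI code): the iterator node can construct a fresh graph per URL from the class plus the shared config.

```python
# Hypothetical sketch, not the actual GraphIteratorNode implementation:
# passing the class (not an instance) lets the iterator build one fresh
# scraper graph per URL from the shared config.
class FakeScraperGraph:
    def __init__(self, prompt, source, config, schema=None):
        self.prompt, self.source = prompt, source
        self.config, self.schema = config, schema

    def run(self):
        # A real graph would scrape the page; here we just echo the inputs.
        return {"source": self.source, "prompt": self.prompt}


def iterate_graphs(graph_cls, prompt, urls, config, schema=None):
    # One instance per URL, so per-URL state never leaks between runs.
    return [
        graph_cls(prompt=prompt, source=url, config=config, schema=schema).run()
        for url in urls
    ]


results = iterate_graphs(FakeScraperGraph, "find programs",
                         ["https://a.example", "https://b.example"], {})
print(results)
```

A single shared instance would have its `source` overwritten on each iteration; instantiating per URL avoids that.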

@VinciGit00
Collaborator Author

hi @rjbks take a look now at the changed files

@rjbks

rjbks commented Sep 22, 2024

> hi @rjbks take a look now at the changed files

@VinciGit00
I made a mistake, this:

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
        "scraper_schema": self.copy_schema,
    }
)

Should be this:

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
    },
    schema=self.copy_schema,
)

Aside from that, I have applied your updates (and fixed my mistake), and it looks better. Here is the output:

{'products': {'item_1': AIMessage(content='{\n  "fellowshipPrograms": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 147, 'completion_tokens': 17, 'total_tokens': 164}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 147, 'completion_tokens': 17, 'total_tokens': 164}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-a1065115-b8d9-424e-ac9d-c7a331a52beb-0', usage_metadata={'input_tokens': 147, 'output_tokens': 17, 'total_tokens': 164}),
  'item_2': AIMessage(content='Based on the scraped content, here is the information on the Infectious Disease Fellowship program offered:\n\n{\n  "program_name": "Infectious Disease Fellowship Program",\n  "institution": "MedStar Washington Hospital Center",\n  "duration": "2 years",\n  "current_status": "Active",\n  "date_range": "NA",\n  "key_features": [\n    "Based at largest and busiest hospital in Washington D.C.",\n    "Level I trauma center",\n    "Rotations at National Institutes of Health and Children\'s National Medical Center",\n    "Large outpatient HIV clinic",\n    "Designated Ebola Treatment Center",\n    "On-site microbiology laboratory",\n    "Transplant and device-associated infections service",\n    "Funding for conferences and board review courses",\n    "Antimicrobial stewardship curriculum"\n  ]\n}', additional_kwargs={'usage': {'prompt_tokens': 4225, 'completion_tokens': 209, 'total_tokens': 4434}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 4225, 'completion_tokens': 209, 'total_tokens': 4434}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-e57f3e2b-0ef7-42ca-b6c5-bb891b9dc729-0', usage_metadata={'input_tokens': 4225, 'output_tokens': 209, 'total_tokens': 4434})}}

One thing, however: usually the output is the parsed JSON, but in this case it looks like an intermediate result. I guess this is because no schema was supplied? Although the contents of graph.run()["item_2"].content are still not properly formatted.
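As a stopgap while the node returns raw AIMessage content like the above, a small post-processing helper can pull the JSON object out of prose-wrapped output. This extract_json is a hypothetical workaround sketch of mine, not part of ScrapeGraphAI:

```python
import json

def extract_json(text: str) -> dict:
    """Best-effort: parse the outermost {...} span in model output.
    Handles the common case above, where the model prefixes the JSON
    with prose like 'Based on the scraped content, here is ...'."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

content = 'Based on the scraped content, here is the info:\n\n{\n  "duration": "2 years"\n}'
parsed = extract_json(content)
print(parsed)  # {'duration': '2 years'}
```

This is brittle (it assumes a single top-level object and valid JSON between the outermost braces), which is why a proper output parser for Bedrock, discussed below, is the real fix.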

@VinciGit00
Collaborator Author

@rjbks look at it now please

@rjbks

rjbks commented Sep 22, 2024

> @rjbks look at it now please

Looks good, any thoughts on my last question?

@VinciGit00
Collaborator Author

are you using a schema?

@VinciGit00
Collaborator Author

I think the main reason is because of the model

@rjbks

rjbks commented Sep 22, 2024

> are you using a schema?

No schema, is that why?

@rjbks

rjbks commented Sep 22, 2024

> I think the main reason is because of the model

You think it is because I am using Claude 3.5 sonnet?

@VinciGit00
Collaborator Author

Please use the schema

@VinciGit00
Collaborator Author

I have to close the PR, let me know if everything is ok

@VinciGit00 VinciGit00 merged commit 390ad82 into pre/beta Sep 23, 2024
3 checks passed
@VinciGit00 VinciGit00 deleted the 687-smartscrapermulticoncatgraph-error-with-bedrock branch September 23, 2024 06:22

🎉 This PR is included in version 1.21.2-beta.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

@rjbks

rjbks commented Sep 23, 2024

@VinciGit00 Even with a schema this returns the same response. It does not respect the schema provided.

import os
import json
import enum
from typing import List, Optional, Literal

from pydantic import BaseModel, Field

import boto3

from scrapegraphai.graphs import SmartScraperMultiConcatGraph

client = boto3.client("bedrock-runtime", region_name="us-west-2")
graph_config = {
    "llm": {
        "client": client,
        "model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
        "temperature": 0.0
    },
    'format': 'json',
    'verbose': False
}


class ProgramType(str, enum.Enum):
    RESIDENCY = "residency"
    FELLOWSHIP = "fellowship"


class Program(BaseModel):
    link: str = Field(description="The link (url) where this program was found")
    type: ProgramType = Field(description="The type of program this is.")  #, pattern=r'residency|fellowship')
    specialty: str = Field(description="The medical specialty on which this program focuses.")
    sub_specialty: Optional[str] = Field(description="The medical sub-specialty on which this program focuses.")
    name: str = Field(description='The name of the program.')
    institution_name: Optional[str] = Field(description='The name of the institution hosting the residency/fellowship program.')
    institution_link: Optional[str] = Field(description="The link (url) to the institution hosting the residency program.")
    institution_address: Optional[str] = Field(description='The address of the residency or fellowship program.')
    director: Optional[str] = Field(description='The director\'s name running/managing the program.')
    director_phone: Optional[str] = Field(description='The phone number of the residency or fellowship program.')
    director_email: Optional[str] = Field(description='The email address of the residency or fellowship program.')



class Result(BaseModel):
    program_links: Optional[List[str]] = Field(description="Links to fellowship or residency programs offered by this institution")
    programs: List[Program] = Field(description="List of programs (fellowships or residencies) found on this page.")


graph = SmartScraperMultiConcatGraph(
    prompt="Find information on all Fellowship and Residency programs offered, current and historic.",
    source=[
        "https://www.childrensdmc.org/health-professionals/just-for-doctors/fellowships/infectious-diseases",
        "https://www.medstarhealth.org/education/fellowship-programs/infectious-disease"
    ],
    schema=Result,
    config=graph_config
)
print(graph.run())

Output:

{'products': {'item_1': AIMessage(content='{\n  "Fellowship_Programs": "NA",\n  "Residency_Programs": "NA",\n  "Current_Programs": "NA",\n  "Historic_Programs": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 145, 'completion_tokens': 48, 'total_tokens': 193}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 145, 'completion_tokens': 48, 'total_tokens': 193}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-84a21bcf-6af6-4be9-a0fb-2f8a2f952d63-0', usage_metadata={'input_tokens': 145, 'output_tokens': 48, 'total_tokens': 193}), 'item_2': AIMessage(content='Here is the JSON response based on the website content:\n\n{\n  "Fellowship Programs": [\n    {\n      "Name": "Infectious Disease Fellowship Program",\n      "Duration": "2 years",\n      "Location": "MedStar Washington Hospital Center",\n      "Description": "The MedStar Washington Hospital Center Infectious Diseases program is a 2-year training program based in the largest and busiest hospital in our nation\'s capital. 
It offers clinical experience in general ID consults, transplant and device-associated infections, and outpatient HIV care.",\n      "Rotations": [\n        "General consult team",\n        "Transplant / orthopedic and cardiac device infection team", \n        "NIH (stem cell transplant and highly-immunocompromised patients)",\n        "Pediatric ID consult service at Children\'s National Medical Center"\n      ],\n      "Special Features": [\n        "Joint curriculum with NIH",\n        "Ryan White-funded HIV clinic",\n        "Designated Ebola Treatment Center",\n        "On-site microbiology laboratory",\n        "Funding for conferences and board review courses",\n        "Antimicrobial stewardship curriculum"\n      ]\n    }\n  ],\n  "Residency Programs": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 4223, 'completion_tokens': 295, 'total_tokens': 4518}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 4223, 'completion_tokens': 295, 'total_tokens': 4518}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-b871daac-9436-4615-9032-bf6d19e2272d-0', usage_metadata={'input_tokens': 4223, 'output_tokens': 295, 'total_tokens': 4518})}}

@rjbks

rjbks commented Sep 23, 2024

@VinciGit00

Looks like in GenerateAnswerNode, at line 56, the Bedrock client always resolves to the last else statement, where output_parser and format_instructions are None and "" respectively:

        if self.node_config.get("schema", None) is not None:
            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
                self.llm_model = self.llm_model.with_structured_output(
                    schema=self.node_config["schema"]
                )
                output_parser = get_structured_output_parser(self.node_config["schema"])
                format_instructions = "NA"
            else:
                if not isinstance(self.llm_model, ChatBedrock):
                    output_parser = get_pydantic_output_parser(self.node_config["schema"])
                    format_instructions = output_parser.get_format_instructions()
                else:
                    output_parser = None
                    format_instructions = ""
        else:
            if not isinstance(self.llm_model, ChatBedrock):
                output_parser = JsonOutputParser()
                format_instructions = output_parser.get_format_instructions()
            else:
                output_parser = None
                format_instructions = ""

UPDATE 1:
In GenerateAnswerNode.execute, changing this at line 45:

       if self.node_config.get("schema", None) is not None:
            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
                self.llm_model = self.llm_model.with_structured_output(
                    schema=self.node_config["schema"]
                )
                output_parser = get_structured_output_parser(self.node_config["schema"])
                format_instructions = "NA"
            else:
                output_parser = get_pydantic_output_parser(self.node_config["schema"])
                format_instructions = output_parser.get_format_instructions()

plus altering the prompt with an additional instruction for JSON output, produced the JSON correctly:

{'products': {'item_1': {'program_links': None, 'programs': []}, 'item_2': {'program_links': None, 'programs': [{'link': 'https://www.medstarhealth.org/education/fellowship-programs/infectious-disease', 'type': 'fellowship', 'specialty': 'Infectious Disease', 'sub_specialty': None, 'name': 'Infectious Disease Fellowship Program', 'institution_name': 'MedStar Washington Hospital Center', 'institution_link': 'https://www.medstarhealth.org/', 'institution_address': 'Washington, D.C.', 'director': 'Saumil Doshi', 'director_phone': '(202) 877-7164', 'director_email': '[email protected]'}]}}}

This is not ideal, as the idea here is not to need to rely on prompt tuning to ensure JSON output. I have had success using instructor along with the AnthropicBedrock client to ensure JSON output; it is available from the anthropic library directly.
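For reference, the prompt-tuning stopgap described in the update above could look roughly like this (a sketch under my own naming; with_json_instructions is hypothetical, not ScrapeGraphAI code): build a JSON skeleton from the schema's field descriptions and append an explicit format instruction to the user prompt.

```python
import json

def with_json_instructions(prompt: str, schema_fields: dict) -> str:
    # Build a JSON skeleton from field-name -> description pairs and
    # append an explicit "JSON only" instruction to the user prompt.
    skeleton = json.dumps(
        {name: f"<{desc}>" for name, desc in schema_fields.items()}, indent=2
    )
    return (
        f"{prompt}\n\n"
        "Respond ONLY with a JSON object matching this structure, no prose:\n"
        f"{skeleton}"
    )

fields = {
    "program_links": "links to fellowship or residency programs",
    "programs": "list of programs found on this page",
}
augmented = with_json_instructions("Find all Fellowship and Residency programs.", fields)
print(augmented)
```

As noted, this is brittle compared to a structured-output parser or instructor's Bedrock support, but it mirrors the workaround that produced the correctly parsed output above.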

@VinciGit00
Collaborator Author

ok, @rjbks please can you make the pull request to adjust the error?

@rjbks

rjbks commented Sep 23, 2024

> ok, @rjbks please can you make the pull request to adjust the error?

@VinciGit00
Yes, I would need a little bit of guidance however. For example, why is there this conditional check:

if not isinstance(self.llm_model, ChatBedrock):

Is there a specific JSON output parser for ChatBedrock, or is there no output parser for Bedrock, and if so, what was the reason? If there is no output parser in place, my "fix" involves removing the conditional mentioned above plus this suggestion from Anthropic, which is essentially telling the LLM "JSON please" and appending that to the user prompt. I would think we don't want this.

@VinciGit00
Collaborator Author

No, there is not a specific output parser; it's generic.


🎉 This PR is included in version 1.24.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
