
687 smartscrapermulticoncatgraph error with bedrock #689


Merged

Conversation

VinciGit00
Collaborator

No description provided.

@rjbks

rjbks commented Sep 22, 2024

@VinciGit00
To fully resolve this issue (aside from the further JSON formatting problem noted in the issue), there needs to be an update to the SmartScraperMultiConcatGraph code in _create_graph at line 63:

smart_scraper_instance = SmartScraperGraph(
    prompt="",
    source="",
    config=self.copy_config,
    schema=self.copy_schema
)

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": smart_scraper_instance,
    }
)

smart_scraper_instance is not needed; instead, pass the uninstantiated SmartScraperGraph class to GraphIteratorNode like this:

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
        "scraper_schema": self.copy_schema,
    }
)

NOTE: I have assumed the "scraper_schema" key; I am not sure that is how the schema gets passed in.
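For illustration, here is a minimal, hypothetical sketch of why passing the uninstantiated class works (FakeScraperGraph and iterate_graphs are my own names, not ScrapeGraphAI code): the iterator node can construct a fresh graph per URL from the class plus the shared config.

```python
# Hypothetical sketch, not the actual GraphIteratorNode implementation:
# passing the class (not an instance) lets the iterator build one fresh
# scraper graph per URL from the shared config.
class FakeScraperGraph:
    def __init__(self, prompt, source, config, schema=None):
        self.prompt, self.source = prompt, source
        self.config, self.schema = config, schema

    def run(self):
        # A real graph would scrape the page; here we just echo the inputs.
        return {"source": self.source, "prompt": self.prompt}


def iterate_graphs(graph_cls, prompt, urls, config, schema=None):
    # One instance per URL, so per-URL state never leaks between runs.
    return [
        graph_cls(prompt=prompt, source=url, config=config, schema=schema).run()
        for url in urls
    ]


results = iterate_graphs(FakeScraperGraph, "find programs",
                         ["https://a.example", "https://b.example"], {})
print(results)
```

A single shared instance would have its `source` overwritten on each iteration; instantiating per URL avoids that.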

@VinciGit00
Collaborator Author

hi @rjbks take a look now at the changed files

@rjbks

rjbks commented Sep 22, 2024

> hi @rjbks take a look now at the changed files

@VinciGit00
I made a mistake, this:

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
        "scraper_schema": self.copy_schema,
    }
)

Should be this:

graph_iterator_node = GraphIteratorNode(
    input="user_prompt & urls",
    output=["results"],
    node_config={
        "graph_instance": SmartScraperGraph,
        "scraper_config": self.copy_config,
    },
    schema=self.copy_schema,
)

Aside from that, I have applied your updates (and fixed my mistake), and it looks better. Here is the output:

{'products': {'item_1': AIMessage(content='{\n  "fellowshipPrograms": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 147, 'completion_tokens': 17, 'total_tokens': 164}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 147, 'completion_tokens': 17, 'total_tokens': 164}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-a1065115-b8d9-424e-ac9d-c7a331a52beb-0', usage_metadata={'input_tokens': 147, 'output_tokens': 17, 'total_tokens': 164}),
  'item_2': AIMessage(content='Based on the scraped content, here is the information on the Infectious Disease Fellowship program offered:\n\n{\n  "program_name": "Infectious Disease Fellowship Program",\n  "institution": "MedStar Washington Hospital Center",\n  "duration": "2 years",\n  "current_status": "Active",\n  "date_range": "NA",\n  "key_features": [\n    "Based at largest and busiest hospital in Washington D.C.",\n    "Level I trauma center",\n    "Rotations at National Institutes of Health and Children\'s National Medical Center",\n    "Large outpatient HIV clinic",\n    "Designated Ebola Treatment Center",\n    "On-site microbiology laboratory",\n    "Transplant and device-associated infections service",\n    "Funding for conferences and board review courses",\n    "Antimicrobial stewardship curriculum"\n  ]\n}', additional_kwargs={'usage': {'prompt_tokens': 4225, 'completion_tokens': 209, 'total_tokens': 4434}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 4225, 'completion_tokens': 209, 'total_tokens': 4434}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-e57f3e2b-0ef7-42ca-b6c5-bb891b9dc729-0', usage_metadata={'input_tokens': 4225, 'output_tokens': 209, 'total_tokens': 4434})}}

One thing, however: usually the output is the parsed JSON, but in this case it looks like an intermediate result. I guess this is because no schema was supplied? Although the contents of graph.run()["item_2"].content are still not properly formatted.
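As a stopgap while the node returns raw AIMessage content like the above, a small post-processing helper can pull the JSON object out of prose-wrapped output. This extract_json is a hypothetical workaround sketch of mine, not part of ScrapeGraphAI:

```python
import json

def extract_json(text: str) -> dict:
    """Best-effort: parse the outermost {...} span in model output.
    Handles the common case above, where the model prefixes the JSON
    with prose like 'Based on the scraped content, here is ...'."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

content = 'Based on the scraped content, here is the info:\n\n{\n  "duration": "2 years"\n}'
parsed = extract_json(content)
print(parsed)  # {'duration': '2 years'}
```

This is brittle (it assumes a single top-level object and valid JSON between the outermost braces), which is why a proper output parser for Bedrock, discussed below, is the real fix.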

@VinciGit00
Collaborator Author

@rjbks look at it now please

@rjbks

rjbks commented Sep 22, 2024

> @rjbks look at it now please

Looks good, any thoughts on my last question?

@VinciGit00
Collaborator Author

are you using a schema?

@VinciGit00
Collaborator Author

I think the main reason is because of the model

@rjbks

rjbks commented Sep 22, 2024

> are you using a schema?

No schema, is that why?

@rjbks

rjbks commented Sep 22, 2024

> I think the main reason is because of the model

You think it is because I am using Claude 3.5 sonnet?

@VinciGit00
Collaborator Author

Please use the schema

@VinciGit00
Collaborator Author

I have to close the PR, let me know if everything is ok

@VinciGit00 VinciGit00 merged commit 390ad82 into pre/beta Sep 23, 2024
3 checks passed
@VinciGit00 VinciGit00 deleted the 687-smartscrapermulticoncatgraph-error-with-bedrock branch September 23, 2024 06:22

🎉 This PR is included in version 1.21.2-beta.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

@rjbks

rjbks commented Sep 23, 2024

@VinciGit00 Even with a schema this returns the same response. It does not respect the schema provided.

import os
import json
import enum
from typing import List, Optional, Literal

from pydantic import BaseModel, Field

import boto3

from scrapegraphai.graphs import SmartScraperMultiConcatGraph

client = boto3.client("bedrock-runtime", region_name="us-west-2")
graph_config = {
    "llm": {
        "client": client,
        "model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
        "temperature": 0.0
    },
    'format': 'json',
    'verbose': False
}


class ProgramType(str, enum.Enum):
    RESIDENCY = "residency"
    FELLOWSHIP = "fellowship"


class Program(BaseModel):
    link: str = Field(description="The link (url) where this program was found")
    type: ProgramType = Field(description="The type of program this is.")  #, pattern=r'residency|fellowship')
    specialty: str = Field(description="The medical specialty on which this program focuses.")
    sub_specialty: Optional[str] = Field(description="The medical sub-specialty on which this program focuses.")
    name: str = Field(description='The name of the program.')
    institution_name: Optional[str] = Field(description='The name of the institution hosting the residency/fellowship program.')
    institution_link: Optional[str] = Field(description="The link (url) to the institution hosting the residency program.")
    institution_address: Optional[str] = Field(description='The address of the residency or fellowship program.')
    director: Optional[str] = Field(description='The director\'s name running/managing the program.')
    director_phone: Optional[str] = Field(description='The phone number of the residency or fellowship program.')
    director_email: Optional[str] = Field(description='The email address of the residency or fellowship program.')



class Result(BaseModel):
    program_links: Optional[List[str]] = Field(description="Links to fellowship or residency programs offered by this institution")
    programs: List[Program] = Field(description="List of programs (fellowships or residencies) found on this page.")


graph = SmartScraperMultiConcatGraph(
    prompt="Find information on all Fellowship and Residency programs offered, current and historic.",
    source=[
        "https://www.childrensdmc.org/health-professionals/just-for-doctors/fellowships/infectious-diseases",
        "https://www.medstarhealth.org/education/fellowship-programs/infectious-disease"
    ],
    schema=Result,
    config=graph_config
)
print(graph.run())

Output:

{'products': {'item_1': AIMessage(content='{\n  "Fellowship_Programs": "NA",\n  "Residency_Programs": "NA",\n  "Current_Programs": "NA",\n  "Historic_Programs": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 145, 'completion_tokens': 48, 'total_tokens': 193}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 145, 'completion_tokens': 48, 'total_tokens': 193}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-84a21bcf-6af6-4be9-a0fb-2f8a2f952d63-0', usage_metadata={'input_tokens': 145, 'output_tokens': 48, 'total_tokens': 193}), 'item_2': AIMessage(content='Here is the JSON response based on the website content:\n\n{\n  "Fellowship Programs": [\n    {\n      "Name": "Infectious Disease Fellowship Program",\n      "Duration": "2 years",\n      "Location": "MedStar Washington Hospital Center",\n      "Description": "The MedStar Washington Hospital Center Infectious Diseases program is a 2-year training program based in the largest and busiest hospital in our nation\'s capital. 
It offers clinical experience in general ID consults, transplant and device-associated infections, and outpatient HIV care.",\n      "Rotations": [\n        "General consult team",\n        "Transplant / orthopedic and cardiac device infection team", \n        "NIH (stem cell transplant and highly-immunocompromised patients)",\n        "Pediatric ID consult service at Children\'s National Medical Center"\n      ],\n      "Special Features": [\n        "Joint curriculum with NIH",\n        "Ryan White-funded HIV clinic",\n        "Designated Ebola Treatment Center",\n        "On-site microbiology laboratory",\n        "Funding for conferences and board review courses",\n        "Antimicrobial stewardship curriculum"\n      ]\n    }\n  ],\n  "Residency Programs": "NA"\n}', additional_kwargs={'usage': {'prompt_tokens': 4223, 'completion_tokens': 295, 'total_tokens': 4518}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, response_metadata={'usage': {'prompt_tokens': 4223, 'completion_tokens': 295, 'total_tokens': 4518}, 'stop_reason': 'end_turn', 'model_id': 'anthropic.claude-3-5-sonnet-20240620-v1:0'}, id='run-b871daac-9436-4615-9032-bf6d19e2272d-0', usage_metadata={'input_tokens': 4223, 'output_tokens': 295, 'total_tokens': 4518})}}

@rjbks

rjbks commented Sep 23, 2024

@VinciGit00

Looks like in GenerateAnswerNode, at line 56, the Bedrock client always resolves to the last else statement, where output_parser and format_instructions are None and "" respectively:

        if self.node_config.get("schema", None) is not None:
            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
                self.llm_model = self.llm_model.with_structured_output(
                    schema=self.node_config["schema"]
                )
                output_parser = get_structured_output_parser(self.node_config["schema"])
                format_instructions = "NA"
            else:
                if not isinstance(self.llm_model, ChatBedrock):
                    output_parser = get_pydantic_output_parser(self.node_config["schema"])
                    format_instructions = output_parser.get_format_instructions()
                else:
                    output_parser = None
                    format_instructions = ""
        else:
            if not isinstance(self.llm_model, ChatBedrock):
                output_parser = JsonOutputParser()
                format_instructions = output_parser.get_format_instructions()
            else:
                output_parser = None
                format_instructions = ""

UPDATE 1:
In GenerateAnswerNode.execute, changing this at line 45:

       if self.node_config.get("schema", None) is not None:
            if isinstance(self.llm_model, (ChatOpenAI, ChatMistralAI)):
                self.llm_model = self.llm_model.with_structured_output(
                    schema=self.node_config["schema"]
                )
                output_parser = get_structured_output_parser(self.node_config["schema"])
                format_instructions = "NA"
            else:
                output_parser = get_pydantic_output_parser(self.node_config["schema"])
                format_instructions = output_parser.get_format_instructions()

plus altering the prompt with an additional instruction for JSON output, produced the JSON correctly:

{'products': {'item_1': {'program_links': None, 'programs': []}, 'item_2': {'program_links': None, 'programs': [{'link': 'https://www.medstarhealth.org/education/fellowship-programs/infectious-disease', 'type': 'fellowship', 'specialty': 'Infectious Disease', 'sub_specialty': None, 'name': 'Infectious Disease Fellowship Program', 'institution_name': 'MedStar Washington Hospital Center', 'institution_link': 'https://www.medstarhealth.org/', 'institution_address': 'Washington, D.C.', 'director': 'Saumil Doshi', 'director_phone': '(202) 877-7164', 'director_email': '[email protected]'}]}}}

This is not ideal, as the idea here is not to need to rely on prompt tuning to ensure JSON output. I have had success using instructor along with the AnthropicBedrock client to ensure JSON output; it is available from the anthropic library directly.
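For reference, the prompt-tuning stopgap described in the update above could look roughly like this (a sketch under my own naming; with_json_instructions is hypothetical, not ScrapeGraphAI code): build a JSON skeleton from the schema's field descriptions and append an explicit format instruction to the user prompt.

```python
import json

def with_json_instructions(prompt: str, schema_fields: dict) -> str:
    # Build a JSON skeleton from field-name -> description pairs and
    # append an explicit "JSON only" instruction to the user prompt.
    skeleton = json.dumps(
        {name: f"<{desc}>" for name, desc in schema_fields.items()}, indent=2
    )
    return (
        f"{prompt}\n\n"
        "Respond ONLY with a JSON object matching this structure, no prose:\n"
        f"{skeleton}"
    )

fields = {
    "program_links": "links to fellowship or residency programs",
    "programs": "list of programs found on this page",
}
augmented = with_json_instructions("Find all Fellowship and Residency programs.", fields)
print(augmented)
```

As noted, this is brittle compared to a structured-output parser or instructor's Bedrock support, but it mirrors the workaround that produced the correctly parsed output above.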

@VinciGit00
Collaborator Author

ok, @rjbks please can you make the pull request to adjust the error?

@rjbks

rjbks commented Sep 23, 2024

> ok, @rjbks please can you make the pull request to adjust the error?

@VinciGit00
Yes, I would need a little bit of guidance however. For example, why is there this conditional check:

if not isinstance(self.llm_model, ChatBedrock):

Is there a specific JSON output parser for ChatBedrock, or is there no output parser for Bedrock, and if so, what was the reason? If there is no output parser in place, my "fix" involves removing the conditional mentioned above plus this suggestion from Anthropic, which is essentially telling the LLM "JSON please" and appending that to the user prompt. I would think we don't want this.

@VinciGit00
Collaborator Author

No, there is not a specific output parser; it's generic.


🎉 This PR is included in version 1.24.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
