Skip to content

Screenshot scraper integration #558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 20, 2024
Merged

Screenshot scraper integration #558

merged 10 commits into from
Aug 20, 2024

Conversation

VinciGit00
Copy link
Collaborator

No description provided.

browser.close()

for screenshot_data in screenshot_data_list:
screenshots.append(screenshot_data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why having duplicate data structures screenshots and screenshot_data_list?

keep only one of the two, and drop the other


capture_screenshot(0, screenshot_counter)
screenshot_counter += 1
capture_screenshot(viewport_height, screenshot_counter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infinite-scrolling web pages (with dynamic JS rendering) may have undefined behavior with the viewport height, as that is bound to change as well as you scroll through the page in the browser.

looks fine for now, but I would maybe add a disclaimer in the docs saying screenshot capturing might not work as well for those; collecting only two screenshots is fine as well, but that number might better be a function of the viewport height, so that we don't miss parts of the page content in case a page is very very long (although not infinite-scrolling)


api_key = self.node_config.get("config", {}).get("llm", {}).get("api_key", "")

supported_models = ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all open VLMs would work fine as well if run with ollama or similar, same goes for gemini-pro

why such a strict set of models?

P.S. - turn it into a set instead of a list

}

payload = {
"model": "gpt-4o-mini",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why hardcoding the mini version if a support model was passed by the user?

is not supported. Supported models are:
{', '.join(supported_models)}.""")

for image_data in images:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better off using the async API and processing the images in parallel so not to waste time for nothing

"""
fetch_screen_node = FetchScreenNode(
input="url",
output=["imgs"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why putting the key here if the node doesn't use it and hardcodes it as screenshots?

)
generate_answer_from_image_node = GenerateAnswerFromImageNode(
input="doc",
output=["parsed_doc"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same goes here, neither the input nor output arguments are really used by the node

}

return state
def execute(self, state: dict) -> dict:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrap the function with a safer guard on the running event loop, e.g.:

def execute(self, state: dict) -> dict:
    """
    Wrapper to run the asynchronous execute_async function in a synchronous context.
    """
    try:
        eventloop = asyncio.get_event_loop()
    except RuntimeError:
        eventloop = None

    if eventloop and eventloop.is_running():
        state = eventloop.run_until_complete(self.execute_async(state))
    else:
        state = asyncio.run(self.execute_async(state))

    return state

@VinciGit00 VinciGit00 merged commit 860fde8 into pre/beta Aug 20, 2024
3 checks passed
@VinciGit00 VinciGit00 deleted the screenshot_scraper branch August 20, 2024 09:34
Copy link

🎉 This issue has been resolved in version 1.14.0-beta.13 🎉

The release is available on:

Your semantic-release bot 📦🚀

Copy link

🎉 This issue has been resolved in version 1.14.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants