RAG Web Browser

apify/rag-web-browser
Web browser for OpenAI Assistants API and RAG pipelines, similar to a web browser in ChatGPT. It queries Google Search, scrapes the top N pages from the results, and returns their cleaned content as Markdown for further processing by an LLM. It can also scrape individual URLs.

Doesn't seem to remove images from returned markdown

Open
zognotadog opened this issue
21 days ago

The crawler seems to be outputting image data in the markdown even though it's meant to strip it. We confirmed this with a few runs on our own site and on apify.com.

Has something gone wrong? The Website Content Crawler does not have this issue.

-------- RAG Web Browser Result ---------------

Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon Skip to content Contact sales Log in Get started # Your full‑stack platform for web scraping Apify is the largest ecosystem where developers build, deploy, and publish data extraction and web automation tools

-------- Website Content Crawler Result for Apify.com --------

Full-stack web scraping and data extraction platform Apify is the largest ecosystem where developers build, deploy, and publish data extraction and web automation tools. We call them Actors. TikTok Data Extractor clockworks/free-tiktok-scraper Extract data about videos, users, and channels based on hashtags or...

jiri.spilka

Hi, thank you for using RAG Web Browser.
I appreciate your detailed explanation.

The RAG Web Browser has a slightly different configuration. To keep settings simple, it outputs raw page content without transformation, unlike the Website Content Crawler, which uses the readableText option. This option can sometimes remove content and isn’t 100% reliable. Instead, in RAG Web Browser, we let the LLM determine what content is useful by setting "htmlTransformer": "none".

When I run Website Content Crawler with "htmlTransformer": "none", I receive similar output to the RAG Web Browser.

  • RAG Web Browser: run
    "Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon\n\nSkip to content
  • Website Content Crawler: run
    "Apify: Full-stack web scraping and data extraction platform\n\nSkip to content
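
For anyone who wants to reproduce the comparison above, here is a minimal sketch of the two Website Content Crawler run inputs. It assumes the `startUrls`, `maxCrawlPages`, and `htmlTransformer` input fields; please check the Actor's input schema before relying on them. The helper name `build_crawler_input` is hypothetical and only used for illustration.

```python
import json


def build_crawler_input(url: str, html_transformer: str = "none") -> dict:
    """Build a run input that crawls only `url` with the given HTML transformer.

    Field names are assumed from the Website Content Crawler input schema.
    """
    return {
        "startUrls": [{"url": url}],
        "maxCrawlPages": 1,  # limit the crawl to the start page
        "htmlTransformer": html_transformer,
    }


# "none" keeps the raw page content, as RAG Web Browser does.
raw_input = build_crawler_input("https://apify.com")

# "readableText" is the option mentioned above for Website Content Crawler.
readable_input = build_crawler_input("https://apify.com", "readableText")

print(json.dumps(raw_input, indent=2))
print(json.dumps(readable_input, indent=2))
```

Either dictionary could then be passed as the run input when starting `apify/website-content-crawler`, for example through the Apify API client, to compare the two outputs side by side.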

Interestingly, Website Content Crawler still does a bit more processing even with this setting. It should be possible to make both Actors produce identical output, but I ran into an issue when testing this and couldn't quickly figure out the cause.

Would you like to use RAG Web Browser with this configuration? If so, I can look into that further.

Apologies for any inconvenience. Jiri

Developer
Maintained by Apify
