
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 3.7 (41)
Pricing: Pay per usage
Total users: 57K
Monthly users: 8K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 days ago
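To give a sense of how the Actor is typically configured, here is a minimal sketch of an input (the URL is a placeholder; the option names are taken from the issue below):

{
  "startUrls": [{ "url": "https://example.com", "method": "GET" }],
  "crawlerType": "playwright:firefox",
  "htmlTransformer": "readableText",
  "saveMarkdown": true
}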
Incomplete Web Scraping Results for a Webflow website
Closed
I attempted to scrape the homepage of my website, which is built using Webflow. While the output httpStatusCode shows 200 and some text from the website is successfully extracted, a significant portion of the data is missing from the output. I have enabled the "Save screenshots (headless browser only)" option, and the screenshots confirm that the website is fully loaded during the scraping process.
Despite setting the "Wait for dynamic content" option to 60 seconds, the scraping results remain incomplete. Below are the full input settings I used for the scraping attempt.
The issue is that not all content from the website is being captured, even though it appears fully loaded in the screenshots. Could this be related to dynamic content or other rendering issues with Webflow-based websites?
Below is my input configuration:
{ "aggressivePrune": false, "clickElementsCssSelector": "[aria-expanded="false"]", "clientSideMinChangePercentage": 15, "crawlerType": "playwright:firefox", "debugLog": true, "debugMode": true, "dynamicContentWaitSecs": 60, "expandIframes": true, "ignoreCanonicalUrl": false, "keepUrlFragments": false, "maxCrawlDepth": 0, "maxScrollHeightPixels": 10000000, "proxyConfiguration": { "useApifyProxy": true, "apifyProxyGroups": [] }, "readableTextCharThreshold": 100, "removeCookieWarnings": true, "removeElementsCssSelector": "footer, script, style, noscript, img[src^='data:'], [role="alertdialog"]\n", "renderingTypeDetectionPercentage": 10, "requestTimeoutSecs": 120, "respectRobotsTxtFile": true, "saveFiles": false, "saveHtml": false, "saveHtmlAsFile": true, "saveMarkdown": true, "saveScreenshots": true, "startUrls": [ { "url": "MY_SITE", "method": "GET" } ], "useSitemaps": false, "includeUrlGlobs": [], "excludeUrlGlobs": [], "maxCrawlPages": 9999999, "initialConcurrency": 0, "maxConcurrency": 200, "initialCookies": [], "maxSessionRotations": 10, "maxRequestRetries": 5, "minFileDownloadSpeedKBps": 128, "waitForSelector": "", "softWaitForSelector": "", "keepElementsCssSelector": "", "htmlTransformer": "readableText", "maxResults": 9999999 }
Hello, and thank you for your interest in this Actor!
Website Content Crawler tries to clean the page content before storing it in the dataset, using various libraries (this is customizable). By default, we use Mozilla Readability to remove menus, navbars, and similar elements. In some cases this doesn't work as it should and the HTML cleaner removes too much content, which is what's happening here, too.
You can fix this by setting the HTML processing > HTML transformer input option (htmlTransformer via the API) to None (none). This way, the Actor will store all the content from the page, minus the elements targeted by the removeElementsCssSelector option.
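For reference, a minimal sketch of the changed part of the input (only htmlTransformer changes; the removeElementsCssSelector value is kept from your configuration, and all other options can stay as they are):

{
  "htmlTransformer": "none",
  "removeElementsCssSelector": "footer, script, style, noscript, img[src^='data:'], [role=\"alertdialog\"]"
}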
You can check out my fixed run for your input here (link to the run).
I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!