Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
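For illustration, a minimal input configuration could look like the sketch below. All field names appear in the full configuration later on this page; the start URL is a placeholder and the values are illustrative, not recommended defaults:

{
  "startUrls": [
    { "url": "https://example.com", "method": "GET" }
  ],
  "crawlerType": "playwright:firefox",
  "htmlTransformer": "readableText",
  "saveMarkdown": true,
  "maxCrawlPages": 100
}

The extracted content is stored in a dataset, from which it can be exported or loaded into a downstream LLM pipeline.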

Rating: 3.7 (41)

Pricing: Pay per usage

Total users: 57K

Monthly users: 8K

Runs succeeded: >99%

Issues response: 7.8 days

Last modified: 2 days ago


Incomplete Web Scraping Results for a Webflow website (Closed)

sllintestacc opened this issue 3 days ago

I attempted to scrape the homepage of my website, which is built using Webflow. While the output httpStatusCode shows 200 and some text from the website is successfully extracted, a significant portion of the data is missing from the output. I have enabled the "Save screenshots (headless browser only)" option, and the screenshots confirm that the website is fully loaded during the scraping process.

Despite setting the "Wait for dynamic content" option to 60 seconds, the scraping results remain incomplete.

The issue is that not all content from the website is being captured, even though the page appears fully loaded in the screenshots. Could this be related to dynamic content or other rendering issues with Webflow-based websites?

Below is my input configuration:

{ "aggressivePrune": false, "clickElementsCssSelector": "[aria-expanded="false"]", "clientSideMinChangePercentage": 15, "crawlerType": "playwright:firefox", "debugLog": true, "debugMode": true, "dynamicContentWaitSecs": 60, "expandIframes": true, "ignoreCanonicalUrl": false, "keepUrlFragments": false, "maxCrawlDepth": 0, "maxScrollHeightPixels": 10000000, "proxyConfiguration": { "useApifyProxy": true, "apifyProxyGroups": [] }, "readableTextCharThreshold": 100, "removeCookieWarnings": true, "removeElementsCssSelector": "footer, script, style, noscript, img[src^='data:'], [role="alertdialog"]\n", "renderingTypeDetectionPercentage": 10, "requestTimeoutSecs": 120, "respectRobotsTxtFile": true, "saveFiles": false, "saveHtml": false, "saveHtmlAsFile": true, "saveMarkdown": true, "saveScreenshots": true, "startUrls": [ { "url": "MY_SITE", "method": "GET" } ], "useSitemaps": false, "includeUrlGlobs": [], "excludeUrlGlobs": [], "maxCrawlPages": 9999999, "initialConcurrency": 0, "maxConcurrency": 200, "initialCookies": [], "maxSessionRotations": 10, "maxRequestRetries": 5, "minFileDownloadSpeedKBps": 128, "waitForSelector": "", "softWaitForSelector": "", "keepElementsCssSelector": "", "htmlTransformer": "readableText", "maxResults": 9999999 }

jindrich.bar replied:

Hello, and thank you for your interest in this Actor!

Website Content Crawler tries to clean the page content before storing it in the dataset, using various libraries (this is customizable). By default, we use Mozilla Readability to remove menus, navbars, etc. In some cases this doesn't work as it should and the HTML cleaner removes too much content, which is what's happening here, too.

You can fix this by setting the HTML processing > HTML transformer input option (or htmlTransformer via API) to None / none. This way, the Actor will store all the content from the page, minus the elements targeted by the removeElementsCssSelector option.
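For example, keeping the rest of the input above unchanged, the relevant field would become the following (a minimal sketch of the change only, not the full input):

{
  "htmlTransformer": "none"
}

With this setting, the elements listed in removeElementsCssSelector are still stripped before the content is saved, so the output isn't completely raw HTML.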

You can check out my fixed run for your input here (link to the run).

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!