Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

It crawls only the bottom half of the page.

Closed

dkampien opened this issue

I'm trying to crawl: https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html

The output starts from "Main Region" instead of the top of the page. I tried all the settings in the HTML processing section.

The article id is furo-main-content.

Run ID: TEXjTO6PDnDNAl0fK

What could be wrong?

Jiří Spilka (jiri.spilka)

Hi, thank you for using the Website Content Crawler.

I can see that you’ve experimented a lot and mostly got it right.

In this case, the issue occurred because the Readable Text transformer removed the header, likely because of its name, "Header". To avoid this, it’s better to disable the transformer entirely by setting "htmlTransformer": "none". You can refer to my example run and configuration.
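As a rough illustration of that setting, here is a minimal Python sketch of a run input with the transformer disabled. Only `htmlTransformer` comes from this thread; the `startUrls` shape and the commented-out `apify-client` call are assumptions about a typical setup, and `build_run_input` is a hypothetical helper name.

```python
def build_run_input(start_url: str) -> dict:
    """Build a run input that disables the HTML transformer.

    Assumption: start URLs are passed as a list of {"url": ...} objects.
    """
    return {
        "startUrls": [{"url": start_url}],
        # Setting from the thread: keep the HTML as-is so the
        # "header" section is not stripped out.
        "htmlTransformer": "none",
    }


run_input = build_run_input(
    "https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html"
)

# Assumed way to start the run with the apify-client package
# (left commented out so the sketch runs without a token):
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_API_TOKEN>")
# run = client.actor("apify/website-content-crawler").call(run_input=run_input)
```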

Please let me know if you have any further questions. I’ll close this issue for now, but feel free to reach out if you need more help. I’d be happy to help. Jiri

dkampien

Oh, it seems that this particular page has a section with the id "header", where it talks about a Blender-related header.

But I still can't seem to get the right combination of settings.

The main content selector is #furo-main-content, and I also want to include the #header section. If I leave the HTML transformer set to "readable text", for some reason it won't recognize #header and still outputs a partial page without that section.

However, if I set the HTML transformer to none, it gives me the whole page, but the links break from:

[interaction mode].(https://docs.blender.org/manual/en/4.3/editors/3dview/modes.html) to
[interaction mode].(../3dview/modes.html)

(I added a dot so Apify won't format the links.)

Any idea how to fix this?

Also, is there any way to get rid of the anchor links in the headings? [¶].(#interface "Link to this heading") I want to feed multiple pages into an LLM, and those anchor links increase the token count considerably across my dataset.

Thank you. Keep up the good work.

Jiří Spilka (jiri.spilka)

Hi, thank you for the detailed explanation and example.

You are correct that if you select any other HTML processor, the header gets removed. However, when you choose "htmlTransformer": "none", the HTML remains intact, preserving all relative links.

I’ll forward this internally to explore whether we should add functionality to convert relative URLs into absolute ones.
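Until something like that exists in the Actor, the conversion can be done as a post-processing step on the Markdown output. The following is a minimal sketch, not part of the crawler: `absolutize_links` is a hypothetical helper, and it assumes simple inline Markdown links of the form `[text](target)` with an optional quoted title.

```python
import re
from urllib.parse import urljoin

# Matches inline Markdown links: [text](target) or [text](target "title").
MD_LINK = re.compile(r'\[([^\]]*)\]\(([^)\s]+)((?:\s+"[^"]*")?)\)')


def absolutize_links(markdown: str, page_url: str) -> str:
    """Resolve every relative link target against the URL of the crawled page."""

    def fix(match: re.Match) -> str:
        text, target, title = match.groups()
        # urljoin leaves absolute URLs unchanged and resolves "../" segments.
        return f"[{text}]({urljoin(page_url, target)}{title})"

    return MD_LINK.sub(fix, markdown)


page = "https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html"
print(absolutize_links("[interaction mode](../3dview/modes.html)", page))
# [interaction mode](https://docs.blender.org/manual/en/4.3/editors/3dview/modes.html)
```

The page URL has to be the one the Markdown was extracted from, since `../` is resolved relative to it.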

Regarding the anchor, it contains a link to a section, such as [¶](https://docs.blender.org/manual/en/4.3/editors/outliner/selecting.html#properties-editor-sync "Link to this heading"). It should not be removed, as doing so would result in losing valuable links.
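If, for a particular dataset, the token cost matters more than keeping these section links, they can still be stripped afterwards from the Markdown output. This is only a sketch under assumptions: `strip_heading_anchors` is a hypothetical helper, and it relies on the anchors always using the `¶` text and the "Link to this heading" title seen in this thread.

```python
import re

# Matches the heading permalink, e.g. [¶](#interface "Link to this heading"),
# together with any spaces or tabs directly before it.
HEADING_ANCHOR = re.compile(r'[ \t]*\[¶\]\([^)\s]*\s+"Link to this heading"\)')


def strip_heading_anchors(markdown: str) -> str:
    """Remove heading permalink markers; all other links are left untouched."""
    return HEADING_ANCHOR.sub("", markdown)


print(strip_heading_anchors('# Interface [¶](#interface "Link to this heading")'))
# # Interface
```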

While writing this response, I realized there might be a better solution. I used the Web Scraper, which is flexible enough to extract exactly what you need.

Please see my example run (aborted early to avoid wasting resources). In the input, I included a pageFunction that extracts the desired content and converts it to Markdown. You can copy and paste the input into your Web Scraper to see how it works.

Also, make sure to configure the glob pattern correctly, such as https://docs.blender.org/manual/en/4.3/**, to ensure the crawler doesn’t scrape unwanted content.
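To sanity-check which URLs a glob like this is meant to cover, a quick local test can help. The sketch below is a simplified approximation of glob semantics (`**` matches anything, `*` matches within one path segment), not the matcher the crawler actually uses; `glob_to_regex` and `in_scope` are hypothetical helper names.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simple glob into a regex (approximation, see lead-in)."""
    escaped = re.escape(glob)
    # Protect "**" with a placeholder so the single-"*" rule does not touch it.
    escaped = escaped.replace(r"\*\*", "\x00")
    escaped = escaped.replace(r"\*", "[^/]*")  # "*" stays within one segment
    escaped = escaped.replace("\x00", ".*")    # "**" may span segments
    return re.compile(escaped)


def in_scope(url: str, glob: str) -> bool:
    """Return True if the whole URL matches the glob."""
    return glob_to_regex(glob).fullmatch(url) is not None


GLOB = "https://docs.blender.org/manual/en/4.3/**"
for url in (
    "https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html",
    "https://docs.blender.org/manual/en/4.2/index.html",
):
    print(url, in_scope(url, GLOB))
```

With this glob, the first URL is in scope and the second (a different manual version) is not.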

Let me know if this helps. Jiri
