Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
It crawls only the bottom half of the page.
I'm trying to crawl: https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html
The output starts from "Main Region" instead of the top of the page. I tried all the settings in the HTML processing section.
The article id is furo-main-content.
Run ID: TEXjTO6PDnDNAl0fK
What could be wrong?
jiri.spilka
Hi, thank you for using the Website Content Crawler.
I can see that you’ve experimented a lot and mostly got it right.
In this case, the issue occurred because the Readable Text transformer removed the header, likely due to its name, "Header".
To avoid this, it’s better to disable the transformer entirely by setting "htmlTransformer": "none". You can refer to my example run and configuration.
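As a minimal sketch, the setting above could sit in an Actor input like the one below. Only "htmlTransformer": "none" comes from this thread; the `startUrls` shape is an assumption based on the usual Apify input format, so check it against the Actor's input schema before relying on it.

```python
import json

# Hypothetical run input; only "htmlTransformer" is taken from the reply above.
run_input = {
    # Assumed shape of the start URL list.
    "startUrls": [
        {"url": "https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html"}
    ],
    # Disable the HTML transformer so sections named "header" are not dropped.
    "htmlTransformer": "none",
}

# Print as JSON, ready to paste into a JSON input editor.
print(json.dumps(run_input, indent=2))
```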
Please let me know if you have any further questions. I’ll close this issue for now, but feel free to reach out if you need more help. I’d be happy to help. Jiri
dkampien
Oh, it seems that particular page has a section with the id "header", where it talks about a Blender-related header.
But I still can't seem to get the right combination of settings.
The main content selector is #furo-main-content, and I also want to include #header for that section. If I leave the HTML transformer set to "readable text", for some reason it won't recognize #header and still outputs a partial page without that section.
However, if I set the HTML transformer to none, it gives me the whole page, but the links break, going from:
- [interaction mode].(https://docs.blender.org/manual/en/4.3/editors/3dview/modes.html) to
- [interaction mode].(../3dview/modes.html)
(I added a dot so Apify won't format the links.)
Any idea how to fix this?
Also, is there any way to get rid of the anchor links in the headers, such as [¶].(#interface "Link to this heading")? I want to feed multiple pages into an LLM, and those anchor links add considerably to the token count across my dataset.
Thank you. Keep up the good work.
jiri.spilka
Hi, thank you for the detailed explanation and example.
You are correct that if you select any other HTML processor, the header gets removed. However, when you choose "htmlTransformer": "none", the HTML remains intact, preserving all relative links.
I’ll forward this internally to explore whether we should add functionality to convert relative URLs into absolute ones.
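Until something like that exists, relative links could be resolved in a post-processing step outside the Actor. The following is only a sketch under assumptions: `absolutize_links` is a hypothetical helper, it handles simple inline Markdown links only, and it assumes you know the URL of the page each Markdown document came from.

```python
import re
from urllib.parse import urljoin

# Inline Markdown links: [text](target "optional title")
_MD_LINK = re.compile(r'\[([^\]]*)\]\(([^)\s]+)((?:\s+"[^"]*")?)\)')


def absolutize_links(markdown: str, page_url: str) -> str:
    """Resolve every link target in `markdown` against `page_url`."""

    def _fix(match: re.Match) -> str:
        text, target, title = match.groups()
        # urljoin leaves absolute URLs unchanged and resolves "../" segments.
        return f"[{text}]({urljoin(page_url, target)}{title})"

    return _MD_LINK.sub(_fix, markdown)


page = "https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html"
sample = "See [interaction mode](../3dview/modes.html)."
print(absolutize_links(sample, page))
```

For the example from the question, `../3dview/modes.html` resolves to `https://docs.blender.org/manual/en/4.3/editors/3dview/modes.html`.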
Regarding the anchor, it contains a link to a section, such as [¶](https://docs.blender.org/manual/en/4.3/editors/outliner/selecting.html#properties-editor-sync "Link to this heading"). It should not be removed, as doing so would result in losing valuable links.
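That said, if token count matters more than the per-heading links for a particular dataset, the markers can be stripped after the crawl. This is a hedged sketch, not a feature of the Actor: `strip_heading_anchors` is a hypothetical helper, and it assumes the marker always uses the "¶" link text shown above.

```python
import re

# Permalink marker appended to headings, e.g.
# [¶](#interface "Link to this heading"), plus any spaces or tabs before it.
_HEADING_ANCHOR = re.compile(r'[ \t]*\[¶\]\([^)\s]*(?:\s+"[^"]*")?\)')


def strip_heading_anchors(markdown: str) -> str:
    """Remove "¶" permalink links while leaving all other links untouched."""
    return _HEADING_ANCHOR.sub("", markdown)


sample = '# Interface [¶](#interface "Link to this heading")\n\nSee [modes](../3dview/modes.html).'
print(strip_heading_anchors(sample))
```

Ordinary links in the body text are left as they are, so only the per-heading permalinks are lost.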
While writing this response, I realized there might be a better solution. I used the Web Scraper, which is flexible enough to extract exactly what you need.
Please see my example run (aborted early to avoid wasting resources). In the input, I included a pageFunction that extracts the desired content and converts it to Markdown. You can copy and paste the input into your Web Scraper to see how it works.
Also, make sure to configure the glob pattern correctly, such as https://docs.blender.org/manual/en/4.3/**, to ensure the crawler doesn’t scrape unwanted content.
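To get a feel for which URLs a pattern like that keeps in scope, here is a simplified approximation of glob matching. It is an assumption-laden sketch and not the matcher the crawler actually uses: `glob_to_regex` is a hypothetical helper that treats `**` as "anything, including slashes" and `*` as "anything within one path segment".

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified URL glob into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")      # any characters, slashes included
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")   # any characters within one segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


pattern = glob_to_regex("https://docs.blender.org/manual/en/4.3/**")
in_scope = "https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html"
out_of_scope = "https://docs.blender.org/manual/en/4.2/index.html"

print(bool(pattern.fullmatch(in_scope)))
print(bool(pattern.fullmatch(out_of_scope)))
```

Under these assumed semantics, pages from other manual versions fall outside the pattern, which is the point of scoping the glob to the 4.3 path.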
Let me know if this helps. Jiri