Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Trying to scrape the target URL for an "Apply" button
I'm trying to scrape this website. All of the data extracts well EXCEPT for the "Apply" button. It resolves to an external website and I'm struggling to capture that target URL. Please help!
jakub.kopecky
Hi, thank you for using the Website Content Crawler.
In this case, the button was removed by the Readable Text transformer. The button is an <a> tag in HTML, and this transformer removes navigation elements by default. To avoid this, try setting the HTML transformer in the HTML processing settings to None or, in the JSON input, set "htmlTransformer": "none".
See this example run.
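For reference, here is a minimal sketch of that JSON input built from Python. It assumes the official apify-client package and the Actor ID apify/website-content-crawler. The start URL, the sample Markdown string, and the helper names (build_run_input, find_link_targets, run_crawler) are hypothetical and only for illustration. It also assumes the crawler's output items contain Markdown text in which links appear as [text](url); if that holds for your run, a small regex can pull out the "Apply" target.

```python
import json
import re


def build_run_input(start_url: str) -> dict:
    """Build Actor input that disables the HTML transformer so links are kept."""
    return {
        "startUrls": [{"url": start_url}],
        # "none" skips the Readable Text transformer, so <a> elements survive.
        "htmlTransformer": "none",
    }


def find_link_targets(markdown: str, label: str) -> list[str]:
    """Return URLs of Markdown links whose text contains `label` (case-insensitive)."""
    links = re.findall(r"\[([^\]]*)\]\(([^)\s]+)\)", markdown)
    return [url for text, url in links if label.lower() in text.lower()]


def run_crawler(token: str, start_url: str) -> list[dict]:
    """Run the Actor and return its dataset items (needs `pip install apify-client`)."""
    from apify_client import ApifyClient  # third-party; imported lazily on purpose

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_run_input(start_url)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # Hypothetical URL; replace it with the page that contains the button.
    print(json.dumps(build_run_input("https://example.com/jobs/123"), indent=2))
    # Hypothetical Markdown, standing in for one crawled item's text.
    sample = "Great role. [Apply](https://careers.example.org/apply?id=123)"
    print(find_link_targets(sample, "apply"))
```

Only build_run_input and find_link_targets run offline. run_crawler needs a valid API token and starts a real Actor run, so treat it as a starting point and not a tested integration.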
Please let me know if you have any further questions. I’ll close this issue for now, but feel free to reach out if you need more help.
Jakub