
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
4.6 (38)
Monthly users: 6.4k
Runs succeeded: >99%
Response time: 4.6 days
Last modified: a day ago
Exclude Start URL and Disallowed Paths from Output + Return Clean JSON Structure
Open
Hi,
I'm opening this issue because I hope you can help me figure out the configuration I need to achieve my goals with this Actor.
I'm testing the actor and I'm trying to use it to extract structured data from RepVue company profile pages, but I'm running into a couple of problems I hope you can help with.
Here's my setup:
- I'm starting from a filtered listing page: https://www.repvue.com/companies?sort_key=quota_attainment&sort_direction=asc&per_page=100&metro_locations.slug=united-states-other
- I'm using Playwright: Firefox and a custom extendOutputFunction that enqueues links from the listing page and extracts company profile data.
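Not a fix yet, just to make the goal concrete: a minimal sketch of the kind of record such an extendOutputFunction could return, using the structured shape described further down in this issue. The `buildCompanyRecord` helper, its `fields` argument, and the `START_URL_PREFIX` guard are all hypothetical, not part of the Actor's API; the raw values would come from whatever selectors the page-context code actually uses.

```javascript
// Hypothetical helper: maps raw extracted fields into the structured
// record and skips the listing (start) page entirely.
// START_URL_PREFIX and the shape of `fields` are assumptions.
const START_URL_PREFIX = 'https://www.repvue.com/companies?';

function buildCompanyRecord(url, fields) {
  // Return null for the listing page so it never reaches the dataset.
  if (url.startsWith(START_URL_PREFIX)) return null;
  return {
    url,
    'company name': fields.name ?? null,
    'company website': fields.website ?? null,
    'company description': fields.description ?? null,
    'company industry': fields.industry ?? null,
    'company funding': fields.funding ?? null,
    'company size': fields.size ?? null,
    'company location': fields.location ?? null,
    'company quota attainment': fields.quotaAttainment ?? null,
  };
}
```

Missing fields fall back to null rather than being dropped, so every record has the same keys.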
Issues I want to fix:
- Exclude the listing page from the output:
  I don't want the startUrl (listing page) to appear in the results.
- Respect robots.txt rules:
  I want to automatically exclude any page that would be disallowed under the site's robots.txt, including:
  - /companies/compare/-vs-
  - /user/*
  - /monitoring
  - /api/*
  - any URLs ending in .json or .js
- Clean structured JSON output:
  Instead of returning the full text content, I want a structured object for each company like this, or at least something similar:
  {
    "url": "https://www.repvue.com/companies/Shogun",
    "company name": "Shogun",
    "company website": "https://getshogun.com/",
    "company description": "...",
    "company industry": "Internet",
    "company funding": "Venture Capital",
    "company size": "228",
    "company location": "United States",
    "company quota attainment": "31%"
  }
- I attached images from the page where you can see this information.
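As a stopgap while waiting for a configuration answer, the robots.txt exclusions above can be approximated client-side. This is a hedged sketch that hard-codes the listed disallow rules as simple prefix/suffix checks on the URL path; it is not a real robots.txt parser, and the patterns should be re-checked against the live https://www.repvue.com/robots.txt.

```javascript
// Sketch: URL filter mirroring the disallow rules listed above.
// Patterns are transcribed by hand, not parsed from robots.txt.
const disallowedPrefixes = [
  '/companies/compare/-vs-',
  '/user/',
  '/monitoring',
  '/api/',
];
const disallowedSuffixes = ['.json', '.js'];

function isDisallowed(url) {
  const path = new URL(url).pathname;
  return (
    disallowedPrefixes.some((prefix) => path.startsWith(prefix)) ||
    disallowedSuffixes.some((suffix) => path.endsWith(suffix))
  );
}
```

A filter like this could be applied either before enqueuing a link or when deciding whether to push a result.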
What I'm asking for:
- A suggestion or update to the input so that:
- It does not include the start URL in the results.
- It ignores pages that would be disallowed by robots.txt; RepVue's robots.txt is at https://www.repvue.com/robots.txt.
- It returns clean structured JSON rather than full raw page text.
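For reference, here is a rough sketch of what the Actor input might look like under these assumptions: the field names (crawlerType, includeUrlGlobs, excludeUrlGlobs) and the "playwright:firefox" value are my reading of the Website Content Crawler input schema and should be verified against the current schema, and the globs are hand-translated from the robots.txt rules above.

```json
{
  "startUrls": [
    { "url": "https://www.repvue.com/companies?sort_key=quota_attainment&sort_direction=asc&per_page=100&metro_locations.slug=united-states-other" }
  ],
  "crawlerType": "playwright:firefox",
  "includeUrlGlobs": [{ "glob": "https://www.repvue.com/companies/*" }],
  "excludeUrlGlobs": [
    { "glob": "https://www.repvue.com/companies/compare/*-vs-*" },
    { "glob": "https://www.repvue.com/user/**" },
    { "glob": "https://www.repvue.com/monitoring**" },
    { "glob": "https://www.repvue.com/api/**" },
    { "glob": "**/*.json" },
    { "glob": "**/*.js" }
  ]
}
```

Note that the start URL itself still has to be crawled so its links can be enqueued, so if it shows up in the dataset it may need to be filtered out after the run rather than via the input.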
Let me know if you need the full input JSON; I'm happy to share it again. Thanks a lot!
rudy-seo
Hello!
It's been almost a week, and I really need a solution to continue with my current project.
Could somebody give me a hand?
Pricing
Pricing model
Pay per usage
This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.