Website Content Crawler


Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.6 (38)
Pricing: Pay per usage
Monthly users: 6.4k
Runs succeeded: >99%
Response time: 4.6 days
Last modified: a day ago


Exclude Start URL and Disallowed Paths from Output + Return Clean JSON Structure

Open · rudy-seo opened this issue 7 days ago

Hi,

I'm opening this issue because I hope you can help me figure out the configuration I need to achieve my goals with this Actor.

I'm testing the Actor to extract structured data from RepVue company profile pages, but I'm running into a couple of problems I hope you can help with.

Here's my setup:

Issues I want to fix:

  1. Exclude the listing page from the output:
    I don’t want the startUrl (the listing page) to appear in the results.
  2. Respect robots.txt rules:
    I want to automatically exclude any page that would be disallowed under the site’s robots.txt (see the input sketch after this list), including:
    • /companies/compare/-vs-
    • /user/*
    • /monitoring
    • /api/*
    • any URLs ending in .json or .js
  3. Clean structured JSON output:
    Instead of returning the full text content, I want a structured object for each company, like this (or something similar):

      {
        "url": "https://www.repvue.com/companies/Shogun",
        "company name": "Shogun",
        "company website": "https://getshogun.com/",
        "company description": "...",
        "company industry": "Internet",
        "company funding": "Venture Capital",
        "company size": "228",
        "company location": "United States",
        "company quota attainment": "31%"
      }

  • I attached images from the page where you can see this information.
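For the robots.txt exclusions, something along these lines might work as the run input. This is a minimal sketch with several assumptions: the listing URL is a placeholder, the excludeUrlGlobs field name and glob object shape are taken from the Actor's public input schema as I understand it and should be verified against the current version, and the glob patterns themselves may need tuning. If the current schema exposes a dedicated "respect robots.txt" toggle, that is the more direct route; the globs below just mirror the listed disallow rules manually.

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run_input = {
    # Placeholder listing page; replace with the real start URL.
    "startUrls": [{"url": "https://www.repvue.com/companies"}],
    # Mirror RepVue's robots.txt disallow rules as exclude globs.
    # Field name and glob shape assumed from the Actor's input schema;
    # verify both, and tune the patterns, before relying on them.
    "excludeUrlGlobs": [
        {"glob": "https://www.repvue.com/companies/compare/*-vs-*"},
        {"glob": "https://www.repvue.com/user/**"},
        {"glob": "https://www.repvue.com/monitoring**"},
        {"glob": "https://www.repvue.com/api/**"},
        {"glob": "**/*.json"},
        {"glob": "**/*.js"},
    ],
}

# Start the Actor and wait for the crawl to finish.
run = client.actor("apify/website-content-crawler").call(run_input=run_input)
print("Run finished, dataset:", run["defaultDatasetId"])
```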

What I'm asking for:

  • A suggestion or update to the input so that:
    • It does not include the start URL in the results.
    • It ignores pages that would be disallowed by robots.txt (RepVue’s robots.txt is at https://www.repvue.com/robots.txt).
    • It returns clean, structured JSON rather than full raw page text (see the post-processing sketch below).

Let me know if you need the full input JSON, happy to share it again. Thanks a lot!
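On the first and third asks, one workable pattern is to let the crawl finish and then filter and reshape the dataset items. Below is a minimal sketch under stated assumptions: the dataset ID and start URL are placeholders, items are assumed to carry url and text fields as in the Actor's standard output, and parse_profile with its single regex is purely illustrative, because the Actor returns page content rather than named company fields, so each structured field has to be extracted from the text yourself (with regexes, an HTML parser, or an LLM pass).

```python
import re

from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

# Placeholder: the listing page used as the startUrl.
START_URL = "https://www.repvue.com/companies"


def parse_profile(item: dict) -> dict:
    """Illustrative only: pull structured fields out of the page text."""
    text = item.get("text", "")
    # Hypothetical pattern; adjust to how the figure actually appears on the page.
    quota = re.search(r"(\d+%)\s*Quota Attainment", text, re.IGNORECASE)
    return {
        "url": item["url"],
        "company name": item["url"].rstrip("/").rsplit("/", 1)[-1],
        "company quota attainment": quota.group(1) if quota else None,
        # Remaining fields would be parsed the same way.
    }


# Fetch the finished run's dataset, drop the start URL, and reshape the rest.
profiles = [
    parse_profile(item)
    for item in client.dataset("<DATASET_ID>").iterate_items()
    if item["url"].rstrip("/") != START_URL.rstrip("/")  # exclude the listing page
]
```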


rudy-seo · 2 days ago

Hello!

It's been almost a week, and I really need a solution so I can continue with my current project.

Could somebody give me a hand?

Pricing

Pricing model: Pay per usage

This Actor is free to use; you pay only for the Apify platform usage it consumes.