Extended GPT Scraper

drobnikj/extended-gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Not crawling other pages besides start urls

Closed

convincing_bush opened this issue
9 months ago

I tried a variety of configurations using the default link selector but haven't been able to get the crawler to go past the first page provided by the start URL:

  • "linkSelector": "a[href]"
  • Globs: tried setting to [] and excluding the property entirely
  • Crawling depth: set to 0 for unlimited and tried with a given value of 2
  • No content selector is used.

Any tips for getting this portion to work? Otherwise it works great!

convincing_bush

9 months ago

Figured this out: I think you need to provide at least one glob for it to work.
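
For anyone landing here, a minimal sketch of an input with one glob supplied might look like the following. The field names `linkSelector` and `globs` come from this thread; `https://example.com/` is a placeholder, and the object form of `startUrls` and the plain-string form of `globs` are assumptions, so check the Actor's input schema for the exact shape it expects.

```json
{
  "startUrls": [{ "url": "https://example.com/" }],
  "linkSelector": "a[href]",
  "globs": ["https://example.com/**"]
}
```

The glob here is deliberately broad (everything under the start domain), so that the link selector alone decides which links are candidates.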

shiraklein-justt

8 months ago

Hi, I am also facing the same issue.

I want to scrape only the URLs that contain certain strings. For example, take https://news.ycombinator.com as the start URL and "ask" as the string; the scraper should then scrape the page https://news.ycombinator.com/ask.

I tried the configurations below, but the crawler didn't go past the first page provided by the start URL.

  1. startUrls: https://news.ycombinator.com/, "globs": [], "linkSelector": "a[href*=ask]"
  2. startUrls: https://news.ycombinator.com/, "globs": ["*ask*"], "linkSelector": "a[href]"
  3. startUrls: https://news.ycombinator.com/, "globs": ["*ask*"], "linkSelector": "a[href*=ask]"
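
One untested variation on the configurations above is a glob that spells out the full URL prefix instead of `*ask*`. This rests on two assumptions the thread does not confirm: that globs are compared against the absolute URL of each link, and that a single `*` does not match across `/`, as is the case in many glob implementations. The object form of `startUrls` is likewise assumed.

```json
{
  "startUrls": [{ "url": "https://news.ycombinator.com/" }],
  "linkSelector": "a[href]",
  "globs": ["https://news.ycombinator.com/ask*"]
}
```

If those assumptions hold, `*ask*` would fail to match https://news.ycombinator.com/ask because of the slashes in the URL, while the pattern above would match it.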
Developer
Maintained by Apify