Extended GPT Scraper


Developed by Jakub Drobník · Maintained by Apify

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Rating: 4.6 (4)
Pricing: Pay per usage


Total users: 1.5K
Monthly users: 33
Runs succeeded: 99%
Last modified: 6 months ago


Not crawling other pages besides start URLs

Closed

convincing_bush opened this issue 2 years ago

I tried a variety of configurations using the default link selector, but I haven't been able to get the crawler to go past the first page provided by the start URL:

- "linkSelector": "a[href]"
- Globs: tried setting to [] and also excluding the property entirely
- Crawling depth: set to 0 for unlimited, and also tried a value of 2
- No content selector is used

Any tips for getting this portion to work? Otherwise it works great!

convincing_bush · 2 years ago

Figured this out, I think you need to provide at least one glob for it to work.
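
For anyone landing here with the same problem, a minimal input along these lines should let the crawler follow links beyond the start URL. The "linkSelector" and "globs" values follow the snippets quoted in this thread; the "startUrls" object shape and the "maxCrawlingDepth" field name are assumptions that may differ from the actor's current input schema:

    {
        "startUrls": [{ "url": "https://example.com/" }],
        "linkSelector": "a[href]",
        "globs": ["https://example.com/**"],
        "maxCrawlingDepth": 2
    }

Per the comment above, the important part is the non-empty "globs" array: a link found by the selector is apparently only enqueued when its URL also matches at least one glob.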

shiraklein-justt · 2 years ago

Hi, I am also facing the same issue.

I want to scrape only the URLs that contain certain strings. For example, take https://news.ycombinator.com as the start URL and define the string to be "ask"; the scraper should then scrape the page https://news.ycombinator.com/ask.

I tried the configurations below, but the crawler didn't go past the first page provided by the start URL.

  1. startUrls: https://news.ycombinator.com/, "globs": [], "linkSelector": "a[href*=ask]"
  2. startUrls: https://news.ycombinator.com/, "globs": ["*ask*"], "linkSelector": "a[href]"
  3. startUrls: https://news.ycombinator.com/, "globs": ["*ask*"], "linkSelector": "a[href*=ask]"
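
A likely culprit here is the glob pattern rather than the link selector. If the actor matches globs with minimatch-style rules, as Apify/Crawlee-based crawlers generally do, a single "*" does not cross "/" boundaries, so "*ask*" never matches https://news.ycombinator.com/ask. A glob written against the full URL should behave as expected; here is a sketch for this Hacker News example, under the same schema assumptions as the snippet above:

    {
        "startUrls": [{ "url": "https://news.ycombinator.com/" }],
        "linkSelector": "a[href]",
        "globs": ["https://news.ycombinator.com/ask*"]
    }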