Extended GPT Scraper

drobnikj/extended-gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Not crawling other pages besides start urls

Closed

convincing_bush opened this issue
a year ago

I tried a variety of configurations using the default link selector but haven't been able to get the crawler to go past the first page provided by the start URL:

  • "linkSelector": "a[href]"
  • Globs: tried setting to [] and excluding the property entirely
  • Crawling depth: set to 0 for unlimited, and also tried a value of 2
  • No content selector is used

Any tips for getting this portion to work? Otherwise it works great!

convincing_bush

a year ago

Figured this out. I think you need to provide at least one glob for it to work.

shiraklein-justt

a year ago

Hi, I am also facing the same issue.

I want to scrape only the URLs that contain certain strings. For example, take https://news.ycombinator.com as the start URL and "ask" as the string; the scraper should then scrape the page https://news.ycombinator.com/ask.

I tried the configurations below, but the crawler didn't go past the first page provided by the start url.

  1. startUrls: https://news.ycombinator.com/, "globs": [], "linkSelector": "a[href*=ask]"
  2. startUrls: https://news.ycombinator.com/, "globs": ["*ask*"], "linkSelector": "a[href]"
  3. startUrls: https://news.ycombinator.com/, "globs": ["*ask*"], "linkSelector": "a[href*=ask]"
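
One possible explanation for configurations 2 and 3, offered as a sketch rather than a confirmed diagnosis: many glob matchers treat a single `*` as "any characters except `/`", so a pattern such as `*ask*` can never match a full URL, which always contains slashes. The Python below is not the Actor's code. The `glob_to_regex` and `matches` helpers are hypothetical, and whether the Actor applies these semantics is an assumption. It only illustrates how such path-aware matching would treat the patterns above.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a glob into a regex (simplified, illustrative only).

    Assumed semantics: `**` matches any characters, including `/`;
    a single `*` matches any characters except `/`.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")      # double star crosses path separators
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")   # single star stops at a slash
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def matches(glob: str, url: str) -> bool:
    """Return True if the whole URL matches the glob."""
    return glob_to_regex(glob).fullmatch(url) is not None


url = "https://news.ycombinator.com/ask"
for pattern in ("*ask*",
                "https://news.ycombinator.com/ask*",
                "https://news.ycombinator.com/**"):
    print(pattern, "->", matches(pattern, url))
```

Under these assumed semantics, `*ask*` does not match the URL, while the two patterns that spell out the host do. If the Actor behaves the same way, a glob such as `https://news.ycombinator.com/ask*` may be worth trying.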
Developer
Maintained by Apify
Actor metrics
  • 77 monthly users
  • 44 stars
  • 99.6% runs succeeded
  • 3.4 days response time
  • Created in Jun 2023
  • Modified 7 days ago