Web Scraper
apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts data from pages using a provided JavaScript code. The actor supports both recursive crawling and lists of URLs and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.

Error with "pseudoUrls": is it deprecated?

Closed

ArturoS opened this issue
3 months ago

I am trying to use a regex to filter which links get selected. Using pseudoUrls for this is not working, even though testing the pseudo-URL reports OK for the match.

I use this pseudo-URL: "pseudoUrls": [ { "purl": "http[s?]://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias?combine=&field_estado_value%5B0%5D=Proximamente&field_estado_value%5B1%5D=Abierta&field_tipo_convocatoria_value%5BAyuda/subvencion%5D=Ayuda/subvencion&page=[\d*]" } ], and I am trying to match URLs like this one:

<a href="http://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias?combine=&amp;field_estado_value%5B0%5D=Proximamente&amp;field_estado_value%5B1%5D=Abierta&amp;field_tipo_convocatoria_value%5BAyuda/subvencion%5D=Ayuda/subvencion&amp;page=41" title="Ir a pagina 41"> 41 Pagina

But I got this warning: 2024-03-02T14:39:16.217Z WARN pseudoUrls option is deprecated, use globs or regexps instead

And I don't know how to use "regexps" in the web console instead of pseudoUrls.

Thanks in advance.

Yes, they are deprecated (I think for more than a year now), but they should still work as before. As a deprecated option, though, we are unlikely to fix anything in it, so I would suggest switching to globs.

My guess is your problem is the escaping of special characters (e.g. & vs &amp;).
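To illustrate that guess, here is a minimal Python sketch. It is not how Web Scraper works internally, and the regex is a hypothetical pattern written for this example. It shows that the href as written in the page source contains `&amp;`, while the URL a browser resolves contains a plain `&`, so a pattern written against one form will not match the other.

```python
import html
import re

# The href as it appears in the page source quoted in the question
# (HTML entities still encoded).
raw_href = (
    "http://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias"
    "?combine=&amp;field_estado_value%5B0%5D=Proximamente"
    "&amp;field_estado_value%5B1%5D=Abierta"
    "&amp;field_tipo_convocatoria_value%5BAyuda/subvencion%5D=Ayuda/subvencion"
    "&amp;page=41"
)

# Decoding the entities gives the URL the link actually points to.
decoded_href = html.unescape(raw_href)

# Hypothetical pattern for the pagination links, written with a plain "&".
pattern = re.compile(
    r"^https?://planderecuperacion\.gob\.es"
    r"/como-acceder-a-los-fondos/convocatorias\?.*&page=\d+$"
)

matches_raw = bool(pattern.match(raw_href))          # source form, with "&amp;"
matches_decoded = bool(pattern.match(decoded_href))  # resolved form, with "&"

print("raw source matches:    ", matches_raw)
print("decoded href matches:  ", matches_decoded)
```

If a pattern is copied from the HTML source, replacing each `&amp;` with `&` first is the safer starting point.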

ArturoS

3 months ago

I finally managed to make it work; it was more of an HTML problem than anything else. Thank you anyway :D. Nonetheless, I would love to see the regexps option in the web console instead of pseudoUrls, since I guess globs only accept wildcards and not regular expressions...
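The difference between a wildcard and a regular expression can be sketched in Python. This is only an illustration: `fnmatch` is Python's own wildcard matcher, not the glob implementation the actor uses, and both the shortened URLs and the patterns below are hypothetical.

```python
import re
from fnmatch import fnmatchcase

BASE = "http://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias"

# Two shortened, hypothetical URLs: one with a numeric page, one without.
numeric_page = BASE + "?combine=&page=41"
other_page = BASE + "?combine=&page=abc"

# Wildcard pattern: "*" stands for any run of characters.
wildcard = BASE + "*&page=*"

# Regular expression: the page parameter must consist of digits only.
regex = re.compile(
    r"^https?://planderecuperacion\.gob\.es"
    r"/como-acceder-a-los-fondos/convocatorias\?.*&page=\d+$"
)

for url in (numeric_page, other_page):
    glob_hit = fnmatchcase(url, wildcard)
    regex_hit = bool(regex.match(url))
    print(f"{url}\n  wildcard: {glob_hit}  regex: {regex_hit}")
```

The wildcard accepts both URLs, while the regex accepts only the one with a numeric page. For pagination links a wildcard is often enough, as long as no unwanted URLs share the same prefix.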
