Web Scraper
Crawls arbitrary websites using the Chrome browser and extracts data from pages using provided JavaScript code. The actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.
I am trying to use a regex to filter which links get selected, using pseudoUrls, but it's not working, even though when I test the pseudoUrls the detection reports OK.
I use this pseudoURL: "pseudoUrls": [ { "purl": "http[s?]://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias?combine=&field_estado_value%5B0%5D=Proximamente&field_estado_value%5B1%5D=Abierta&field_tipo_convocatoria_value%5BAyuda/subvencion%5D=Ayuda/subvencion&page=[\d*]" } ], and I am trying to catch this kind of URL:
<a href="http://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias?combine=&field_estado_value%5B0%5D=Proximamente&field_estado_value%5B1%5D=Abierta&field_tipo_convocatoria_value%5BAyuda/subvencion%5D=Ayuda/subvencion&page=41" title="Ir a pagina 41"> 41 Pagina
But I got this warning:
2024-03-02T14:39:16.217Z WARN pseudoUrls option is deprecated, use globs or regexps instead
And I don't know how to use "regexps" in the web console instead of pseudoUrls.
Thanks in advance.
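For reference, the pseudo-URL in the question can be written as an ordinary JavaScript regular expression. The following is only a minimal sketch of the pattern itself: `escapeRegExp`, `paginationRe`, and the sample `hrefs` list are hypothetical names introduced for illustration, and how the pattern would be handed to the actor is not covered here.

```javascript
// Sketch: regular-expression counterpart of the pseudo-URL from the question.
// All names below are illustrative assumptions, not part of the actor's API.
const base = 'planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias';
const query =
  'combine=&field_estado_value%5B0%5D=Proximamente' +
  '&field_estado_value%5B1%5D=Abierta' +
  '&field_tipo_convocatoria_value%5BAyuda/subvencion%5D=Ayuda/subvencion';

// Escape characters that have a special meaning in a regular expression.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// http or https, the fixed path and query, then a numeric page parameter.
// Note: \d+ requires at least one digit (the pseudo-URL used \d*).
const paginationRe = new RegExp(
  `^https?://${escapeRegExp(base)}\\?${escapeRegExp(query)}&page=(\\d+)$`
);

// Hypothetical sample of link targets found on a page.
const hrefs = [
  `http://${base}?${query}&page=41`,
  `https://${base}?${query}&page=2`,
  `http://${base}?combine=&page=3`, // different filters, should not match
  'http://planderecuperacion.gob.es/',
];

// Keep only matching links and read the captured page number.
const pages = hrefs
  .map((href) => paginationRe.exec(href))
  .filter(Boolean)
  .map((match) => Number(match[1]));

console.log(pages); // [ 41, 2 ]
```

Escaping the literal parts programmatically avoids having to hand-escape the `.` and `?` characters in the URL.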
Yes, they are deprecated (for more than a year now, I think), but they should still work as before. That said, since it is a deprecated option, we are unlikely to fix anything in it, so I would definitely suggest adopting globs.
My guess is that your problem is the escaping of special characters (e.g. &amp; vs &).
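To illustrate the escaping guess above: if a link is copied from raw HTML source, the ampersands in its query string are written as `&amp;`, and the URL no longer has the parameters a pattern expects. The sketch below assumes a shortened variant of the URL from the question; `decodeAmp` and `pageOf` are hypothetical helpers, and the decoder handles only this one entity.

```javascript
// Sketch: the same link as it appears in raw HTML source (entity-escaped).
// Shortened, hypothetical variant of the URL from the question.
const rawHref =
  'http://planderecuperacion.gob.es/como-acceder-a-los-fondos/convocatorias' +
  '?combine=&amp;field_estado_value%5B0%5D=Proximamente&amp;page=41';

// Minimal decoder for this one entity only (an assumption for illustration;
// a real HTML parser decodes all entities).
const decodeAmp = (s) => s.replace(/&amp;/g, '&');

// Read the "page" query parameter with the standard URL API.
const pageOf = (href) => new URL(href).searchParams.get('page');

console.log(pageOf(rawHref)); // null, the parameter is named "amp;page"
console.log(pageOf(decodeAmp(rawHref))); // "41"
```

A pattern written with plain `&` therefore matches only the decoded form of the link.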
I finally managed to make it work; it was more of an HTML problem than anything else. Thank you anyway :D. Nonetheless, I would love to see the regexps option in the web console instead of pseudoUrls, since I guess globs only accept wildcards and not regular expressions...
- 3.7k monthly users
- 98.8% runs succeeded
- 3.6 days response time
- Created in Mar 2019
- Modified about 1 month ago