
RegExp Scraper

Try for free

30 minutes trial then $25.00/month - No credit card required now

ib4ngz/regexp-scraper

This actor scrapes data from a list of provided URLs using regular expressions for precise, customizable pattern matching. It handles both static and dynamic web pages and supports depth-based crawling to follow links and extract data from multiple levels of linked pages.

Features

  • RegEx-Based Scraping: Allows precise and customizable data extraction using regular expressions to match specific patterns on web pages.
  • Static and Dynamic Page Support: Capable of handling both static content and dynamically loaded content (e.g., JavaScript-rendered pages).
  • Depth-Based Crawling: Supports crawling with configurable depth, allowing you to scrape data from multiple levels of linked pages.
  • Flexible Input Configuration: Accepts a list of starting URLs and provides advanced configuration for customizing the crawl behavior.
  • Proxy Configuration: Advanced proxy support for anonymous scraping and bypassing IP restrictions.
  • Unique Dataset: Ensures only unique matches are saved by preventing duplicates during the crawling process.

Input Schema

  • startUrls: A list of URLs to start the crawling process from. These URLs will be used as entry points for the scraper.
  • maxDepth: The maximum depth for crawling. It defines how many levels of linked pages will be crawled starting from the start URLs.
  • patterns: Regular expressions (RegEx) used to extract data from HTML content. Each pattern should be written on a new line.
  • crawlerType: The type of crawler to use.
    • Crawlee + Cheerio: A fast crawler that uses Cheerio for parsing HTML content. It does not execute JavaScript and is suitable for static pages.
    • Crawlee + Puppeteer + Chrome: A slower crawler that uses Puppeteer and headless Chrome to render JavaScript and load dynamic content, making it suitable for JavaScript-heavy websites.
  • proxyConfiguration: Configuration for using proxies to anonymize requests and avoid IP blocking.
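A minimal example input assembled from the fields above. The values are illustrative, and the exact accepted shapes are assumptions in this sketch (e.g. whether `startUrls` entries are plain strings or `{ "url": … }` objects, the exact `crawlerType` enum value, and the `{ "useApifyProxy": true }` proxy convention); check the actor's input form for the authoritative schema:

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 2,
  "patterns": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
  "crawlerType": "Crawlee + Cheerio",
  "proxyConfiguration": { "useApifyProxy": true }
}
```

Here `patterns` holds a single email-matching expression; multiple patterns would each go on their own line.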

Dataset Schema

  • match: The match found based on the provided regular expression. It is displayed as text.
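For example, running the actor with an email-matching pattern could produce dataset items shaped like the following (the values are made up for illustration):

```json
[
  { "match": "alice@example.com" },
  { "match": "bob@example.com" }
]
```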

How to Use

  1. Configure Input:

    • Provide a list of startUrls where the scraper should begin its operation. These URLs should be valid web addresses that the scraper will visit.
    • Set the maxDepth to control how deep the crawler will follow links on each page. A depth of 1 means only the start page will be scraped, while higher values will scrape linked pages.
  2. Set Regex Patterns:

    • Define the regex patterns in the patterns field. These patterns will be used to search through the HTML content of the scraped pages. Each pattern should be on a new line.
  3. Choose Crawler Type:

    • Select a crawlerType based on your needs:
      • Crawlee + Cheerio: Suitable for fast scraping of static HTML pages.
      • Crawlee + Puppeteer + Chrome: Required for scraping JavaScript-heavy websites that rely on dynamic content.
  4. Advanced Configuration (Optional):

    • Optionally configure proxy settings under proxyConfiguration to route your requests through a proxy.
  5. Run the Actor:

    • Run the actor, and it will start crawling the provided URLs, applying regex patterns to find matches in the page content.
    • The results will be saved in the dataset for later analysis.
  6. Review Results:

    • After execution, you can access the results through the Output tab. The matches found by the regex patterns will be displayed here.
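To make the matching and deduplication behavior concrete, the sketch below shows what applying regex patterns with a "unique dataset" guarantee amounts to: each configured pattern is run against the fetched HTML, and only the first occurrence of each match is kept. This is an illustration of the described behavior, not the actor's actual source code; the `html` string and the email pattern are invented for the example.

```javascript
// Sample HTML standing in for a fetched page (illustrative only).
const html = `
  <div>Contact: alice@example.com</div>
  <div>Support: bob@example.com</div>
  <div>Contact: alice@example.com</div>
`;

// One pattern per entry, mirroring the one-pattern-per-line "patterns" field.
// The global flag is required for matchAll.
const patterns = [
  /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
];

const seen = new Set();
const dataset = [];
for (const pattern of patterns) {
  for (const [match] of html.matchAll(pattern)) {
    if (!seen.has(match)) {      // skip duplicates across the whole crawl
      seen.add(match);
      dataset.push({ match });   // one dataset item per unique match
    }
  }
}

console.log(dataset);
// → [ { match: 'alice@example.com' }, { match: 'bob@example.com' } ]
```

The duplicate `alice@example.com` is dropped, so the dataset ends up with two items, matching the "Unique Dataset" feature described above.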

Conclusion

The RegExp Scraper is a powerful tool designed to scrape both static and dynamic content from websites using flexible regex patterns. Whether you're scraping simple static HTML pages or more complex sites that rely on JavaScript for rendering content, this actor provides the necessary capabilities to handle different types of web scraping tasks.

With customizable options such as crawling depth, regex pattern matching, and a choice between different crawler types, users can tailor the scraping process to their specific needs. Proxy integration and structured dataset storage also make this actor well suited to large-scale scraping operations.

By following the provided input and configuration guidelines, users can easily deploy the RegExp Scraper to gather valuable data from a wide variety of websites. Whether you're a developer, data scientist, or anyone looking to extract structured information from the web, this actor offers a robust and flexible solution for your web scraping needs.

Developer
Maintained by Community

Actor Metrics

  • 1 monthly user
  • 1 star
  • Created in Jan 2025
  • Modified 11 hours ago