
RegExp Scraper

Try for free

30 minutes trial then $25.00/month - No credit card required now

ib4ngz/regexp-scraper

This actor scrapes data from a list of provided URLs using regular expressions for precise, customizable pattern matching. It handles both static and dynamic web pages and supports depth-based crawling to follow links and extract data from multiple levels of linked pages.

Features

  • RegEx-Based Scraping: Allows precise and customizable data extraction using regular expressions to match specific patterns on web pages.
  • Static and Dynamic Page Support: Capable of handling both static content and dynamically loaded content (e.g., JavaScript-rendered pages).
  • Depth-Based Crawling: Supports crawling with configurable depth, allowing you to scrape data from multiple levels of linked pages.
  • Flexible Input Configuration: Accepts a list of starting URLs and provides advanced configuration for customizing the crawl behavior.
  • Proxy Configuration: Advanced proxy support for anonymous scraping and bypassing IP restrictions.
  • Unique Dataset: Ensures only unique matches are saved by preventing duplicates during the crawling process.

Input Schema

  • startUrls: A list of URLs to start the crawling process from. These URLs will be used as entry points for the scraper.
  • maxDepth: The maximum depth for crawling. It defines how many levels of linked pages will be crawled starting from the start URLs.
  • patterns: Regular expressions (RegEx) used to extract data from HTML content. Each pattern should be written on a new line.
  • crawlerType: The type of crawler to use.
    • Crawlee + Cheerio: A fast crawler that uses Cheerio for parsing HTML content. It does not execute JavaScript and is suitable for static pages.
    • Crawlee + Puppeteer + Chrome: A slower crawler that uses Puppeteer and headless Chrome to render JavaScript and load dynamic content, making it suitable for JavaScript-heavy websites.
  • proxyConfiguration: Configuration for using proxies to anonymize requests and avoid IP blocking.
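A minimal example input assembled from the fields above. The values are illustrative, and the exact accepted shapes are assumptions in this sketch (e.g. whether `startUrls` entries are plain strings or `{ "url": … }` objects, the exact `crawlerType` enum value, and the `{ "useApifyProxy": true }` proxy convention); check the actor's input form for the authoritative schema:

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 2,
  "patterns": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
  "crawlerType": "Crawlee + Cheerio",
  "proxyConfiguration": { "useApifyProxy": true }
}
```

Here `patterns` holds a single email-matching expression; multiple patterns would each go on their own line.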

Dataset Schema

  • match: The match found based on the provided regular expression. It is displayed as text.
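For example, running the actor with an email-matching pattern could produce dataset items shaped like the following (the values are made up for illustration):

```json
[
  { "match": "alice@example.com" },
  { "match": "bob@example.com" }
]
```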

How to Use

  1. Configure Input:

    • Provide a list of startUrls where the scraper should begin its operation. These URLs should be valid web addresses that the scraper will visit.
    • Set the maxDepth to control how deep the crawler will follow links on each page. A depth of 1 means only the start page will be scraped, while higher values will scrape linked pages.
  2. Set Regex Patterns:

    • Define the regex patterns in the patterns field. These patterns will be used to search through the HTML content of the scraped pages. Each pattern should be on a new line.
  3. Choose Crawler Type:

    • Select a crawlerType based on your needs:
      • Crawlee + Cheerio: Suitable for fast scraping of static HTML pages.
      • Crawlee + Puppeteer + Chrome: Required for scraping JavaScript-heavy websites that rely on dynamic content.
  4. Advanced Configuration (Optional):

    • Optionally configure proxy settings under proxyConfiguration to route your requests through a proxy.
  5. Run the Actor:

    • Run the actor, and it will start crawling the provided URLs, applying regex patterns to find matches in the page content.
    • The results will be saved in the dataset for later analysis.
  6. Review Results:

    • After execution, you can access the results through the Output tab. The matches found by the regex patterns will be displayed here.
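To make the matching and deduplication behavior concrete, the sketch below shows what applying regex patterns with a "unique dataset" guarantee amounts to: each configured pattern is run against the fetched HTML, and only the first occurrence of each match is kept. This is an illustration of the described behavior, not the actor's actual source code; the `html` string and the email pattern are invented for the example.

```javascript
// Sample HTML standing in for a fetched page (illustrative only).
const html = `
  <div>Contact: alice@example.com</div>
  <div>Support: bob@example.com</div>
  <div>Contact: alice@example.com</div>
`;

// One pattern per entry, mirroring the one-pattern-per-line "patterns" field.
// The global flag is required for matchAll.
const patterns = [
  /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
];

const seen = new Set();
const dataset = [];
for (const pattern of patterns) {
  for (const [match] of html.matchAll(pattern)) {
    if (!seen.has(match)) {      // skip duplicates across the whole crawl
      seen.add(match);
      dataset.push({ match });   // one dataset item per unique match
    }
  }
}

console.log(dataset);
// → [ { match: 'alice@example.com' }, { match: 'bob@example.com' } ]
```

The duplicate `alice@example.com` is dropped, so the dataset ends up with two items, matching the "Unique Dataset" feature described above.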

Conclusion

The RegExp Scraper is a powerful tool designed to scrape both static and dynamic content from websites using flexible regex patterns. Whether you're scraping simple static HTML pages or more complex sites that rely on JavaScript for rendering content, this actor provides the necessary capabilities to handle different types of web scraping tasks.

With customizable options such as crawling depth, regex pattern matching, and a choice between different crawler types, users can tailor the scraping process to their specific needs. Proxy integration and structured dataset storage also make this actor well suited to large-scale scraping operations.

By following the provided input and configuration guidelines, users can easily deploy the RegExp Scraper to gather valuable data from a wide variety of websites. Whether you're a developer, data scientist, or anyone looking to extract structured information from the web, this actor offers a robust and flexible solution for your web scraping needs.

Developer
Maintained by Community

Actor Metrics

  • 1 monthly user
  • 1 star
  • Created in Jan 2025
  • Modified 11 hours ago