Web Scraper Toolkit

Extract structured data from any webpage -- metadata, links, headlines, images, tables, full text, or custom CSS selectors. Scrape up to 10 URLs in a single run with 8 flexible extraction modes. No browser required, no API keys needed.

Built for developers, data analysts, content marketers, and anyone who needs to pull structured data from the web quickly and reliably.


What It Does

Web Scraper Toolkit fetches public web pages and extracts data in one of 8 modes. You can grab just the metadata (title, description, OG tags), extract all links on a page, pull out headlines, collect images with alt text, parse HTML tables into structured rows, extract clean body text, or target specific elements using custom CSS selectors. The "full" mode combines metadata, headlines, links, images, and tables in a single pass.

Each URL is processed independently, and results are pushed to the Apify dataset one by one. If a URL fails, the others still succeed -- you never lose partial results.
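
Downstream, you can split successes from failures in a couple of lines. A minimal sketch, assuming failed URLs carry an "error" field alongside "url" (inspect a failed item in your dataset to confirm the exact field name):

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com", "https://does-not-exist.invalid"],
    "mode": "metadata",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

# "error" as the failure marker is an assumption -- confirm it against a real failed item
failed = [i for i in items if "error" in i]
succeeded = [i for i in items if "error" not in i]
print(f"{len(succeeded)} succeeded, {len(failed)} failed")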

Key Capabilities

  • 8 Extraction Modes: full, metadata, links, headlines, images, tables, text, custom
  • Batch Processing: Scrape up to 10 URLs per run
  • Custom CSS Selectors: Target any element on the page with standard CSS selector syntax
  • Automatic Redirect Handling: Follows HTTP redirects transparently
  • Link Resolution: Relative URLs are automatically resolved to absolute URLs
  • Deduplication: Link extraction removes duplicate URLs automatically
  • Graceful Error Handling: Failed URLs are reported with error messages, other URLs continue processing
  • Lightweight: No browser rendering -- pure HTTP + HTML parsing for fast, cost-effective execution

What Data You Get

Common Fields (All Modes)

Field      Type    Description
url        string  The URL that was scraped
status     number  HTTP status code
timestamp  number  Unix timestamp of when the scrape occurred

Metadata Mode

Field                 Type    Description
metadata.title        string  Page title from the <title> tag
metadata.description  string  Meta description content
metadata.ogImage      string  Open Graph image URL
metadata.ogTitle      string  Open Graph title
metadata.canonical    string  Canonical URL
metadata.language     string  Page language from the lang attribute
metadata.url          string  The requested URL

Links Mode

Field         Type    Description
links         array   Array of link objects
links[].url   string  Absolute URL of the link
links[].text  string  Anchor text (null if empty)
count         number  Total number of unique links found

Headlines Mode

Field             Type    Description
headlines         array   Array of headline objects
headlines[].tag   string  HTML tag (h1, h2, or h3)
headlines[].text  string  Headline text content
count             number  Total number of headlines found

Images Mode

Field         Type    Description
images        array   Array of image objects
images[].url  string  Absolute URL of the image
images[].alt  string  Alt text (null if missing)
count         number  Total number of images found

Tables Mode

Field              Type    Description
tables             array   Array of table objects
tables[].headers   array   Column headers from <th> elements
tables[].rows      array   Array of row arrays (each row is an array of cell text)
tables[].rowCount  number  Number of data rows
count              number  Total number of tables found

Text Mode

Field   Type    Description
text    string  Clean body text with scripts, styles, nav, footer, and header removed
length  number  Character count of the extracted text

Custom Mode

Field           Type    Description
results         array   Array of matched element objects
results[].text  string  Text content of the matched element
results[].tag   string  HTML tag name of the matched element
count           number  Total number of matched elements

Full Mode

Returns metadata, headlines, links (top 50), images (top 20), and tables all in one result object.


How to Use

Basic Usage

  1. Open the Web Scraper Toolkit on Apify
  2. Enter your URLs as a JSON array (e.g., ["https://example.com"])
  3. Select a scraping mode (default: full)
  4. Click "Start"
  5. View results in the "Dataset" tab

Custom CSS Selector

  1. Set mode to custom
  2. Enter your CSS selector in the "CSS Selector" field (e.g., .article-title, #main-content p, table.data-table tr)
  3. The actor extracts text content and tag name for every matching element

Input Configuration

urls (array, required)
    JSON array of URLs to scrape. Maximum 10 URLs per run. Each URL must be a publicly accessible webpage. Example: ["https://example.com", "https://github.com"]

mode (string, optional, default: full)
    Extraction mode. One of: full (metadata + headlines + links + images + tables), metadata (page title, description, OG tags, canonical, language), links (all unique links with anchor text), headlines (H1, H2, H3 headings), images (all images with alt text), tables (HTML tables parsed into rows), text (clean body text), custom (elements matching a CSS selector).

selector (string, optional)
    CSS selector for custom mode. Supports any valid CSS selector syntax: element selectors (div, p), class selectors (.class-name), ID selectors (#id), attribute selectors ([data-type="value"]), combinators (div > p, ul li), pseudo-classes (:first-child, :nth-child(2)). Ignored when mode is not custom.
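
Putting these together, a complete input object for a custom-selector run looks like this:

{
  "urls": ["https://example.com", "https://github.com"],
  "mode": "custom",
  "selector": "h1, h2, p"
}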

Output Examples

Full Mode

{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "metadata": {
    "title": "Example Domain",
    "description": null,
    "ogImage": null,
    "ogTitle": null,
    "canonical": null,
    "language": null,
    "url": "https://example.com"
  },
  "headlines": [
    { "tag": "h1", "text": "Example Domain" }
  ],
  "links": [
    { "url": "https://www.iana.org/domains/example", "text": "More information..." }
  ],
  "images": [],
  "tables": []
}

Links Mode

{
  "url": "https://news.ycombinator.com",
  "status": 200,
  "timestamp": 1713264000000,
  "links": [
    { "url": "https://news.ycombinator.com/newest", "text": "new" },
    { "url": "https://news.ycombinator.com/front", "text": "past" },
    { "url": "https://news.ycombinator.com/newcomments", "text": "comments" },
    { "url": "https://some-article.com/post", "text": "Show HN: My new project" }
  ],
  "count": 187
}

Tables Mode

{
  "url": "https://en.wikipedia.org/wiki/List_of_countries",
  "status": 200,
  "timestamp": 1713264000000,
  "tables": [
    {
      "headers": ["Country", "Population", "Area (km2)"],
      "rows": [
        ["China", "1,425,671,352", "9,596,961"],
        ["India", "1,428,627,663", "3,287,263"]
      ],
      "rowCount": 195
    }
  ],
  "count": 1
}

Custom Mode

{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "results": [
    { "text": "Example Domain", "tag": "h1" },
    { "text": "This domain is for use in illustrative examples.", "tag": "p" }
  ],
  "count": 2
}

Cost Estimation

This actor uses the pay-per-event pricing model. You are charged per URL scraped.

Action          Events     Estimated Cost
Scrape 1 URL    1 event    ~$0.01 - $0.03 per URL
Scrape 10 URLs  10 events  ~$0.10 - $0.30 per run

A typical run uses minimal compute (128 MB RAM, 1-3 seconds per URL) because no browser is involved.


Integration Guide

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Scrape metadata from multiple URLs
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://github.com", "https://gitlab.com", "https://bitbucket.org"],
    "mode": "metadata",
})
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"{item['url']}: {item['metadata']['title']}")

# Extract all links from a page
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://news.ycombinator.com"],
    "mode": "links",
})
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"Found {item['count']} links")
    for link in item["links"][:10]:
        print(f"  {link['text']}: {link['url']}")

# Custom CSS selector extraction
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com"],
    "mode": "custom",
    "selector": "h1, h2, p",
})

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Full extraction
const run = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://github.com'],
    mode: 'full',
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].metadata.title);
console.log(`Headlines: ${items[0].headlines.length}`);
console.log(`Links: ${items[0].links.length}`);

// Extract tables
const tableRun = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_programming_languages'],
    mode: 'tables',
});
const { items: tableItems } = await client.dataset(tableRun.defaultDatasetId).listItems();
tableItems[0].tables.forEach((table) => {
    console.log(`Table with ${table.rowCount} rows, headers: ${table.headers.join(', ')}`);
});

Apify API (cURL)

# Start a run
curl -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "mode": "metadata"}'

# Get results
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json" \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Use Cases

  • Content Monitoring: Track headlines and text changes on competitor websites
  • Link Analysis: Extract all outbound/inbound links from a page for SEO research
  • Data Collection: Scrape HTML tables from Wikipedia, government sites, or any public data source
  • Social Media Preview: Check OG tags and metadata before sharing links
  • Research Automation: Collect structured data from multiple pages in one run
  • Image Auditing: Find all images on a page and check for missing alt text
  • Custom Extraction: Use CSS selectors to target specific page elements for any use case
  • Price Monitoring: Extract product prices from e-commerce pages on a schedule to track price changes (see the sketch after this list)
  • News Aggregation: Scrape headlines from multiple news sources and compile a daily digest
  • Accessibility Auditing: Extract all images and check for missing alt text across your site's pages
  • Sitemap Verification: Scrape links from key pages to verify your internal linking structure matches your sitemap
  • Academic Research: Collect structured data from public data portals and government websites
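
As a sketch of the price-monitoring case: the snippet below runs custom mode against a hypothetical .product-price selector and compares the result with the previous run. The product URL, selector, and state file are placeholders to adapt to the page you are tracking.

from pathlib import Path
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Hypothetical product page and selector -- replace with your own
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://shop.example.com/product/123"],
    "mode": "custom",
    "selector": ".product-price",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
results = items[0].get("results", [])
price = results[0]["text"] if results else None

# Compare against the price recorded by the previous scheduled run
state = Path("last_price.txt")
previous = state.read_text().strip() if state.exists() else None
if price and price != previous:
    print(f"Price changed: {previous} -> {price}")
    state.write_text(price)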

Integration with Other Tools

Zapier

  1. Create a Zap with your desired trigger (schedule, new spreadsheet row, webhook, etc.)
  2. Add an action: Apify -- Run Actor
  3. Select lazymac/web-scraper-toolkit and configure URLs and mode
  4. Add downstream actions to send extracted data to Google Sheets, Slack, email, Airtable, or any Zapier-connected app
  5. Map fields like metadata.title, links[].url, or headlines[].text to your destination columns

Make (Integromat)

  1. Create a new scenario with the Apify module
  2. Select "Run an Actor" and choose lazymac/web-scraper-toolkit
  3. Use an iterator to process each scraped URL result individually
  4. Route data to Google Sheets, a REST API, database, or notification service based on conditions

Google Sheets Integration

from apify_client import ApifyClient
import gspread

# Scrape metadata from multiple pages
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com", "https://github.com"],
    "mode": "metadata",
})
dataset = client.dataset(run["defaultDatasetId"])
results = list(dataset.iterate_items())

# Write to Google Sheets (authenticates with a service-account key file)
gc = gspread.service_account(filename="creds.json")
sheet = gc.open("Web Data").sheet1
sheet.append_row(["URL", "Title", "Description", "OG Image", "Language"])
for r in results:
    m = r.get("metadata", {})
    sheet.append_row([r["url"], m.get("title"), m.get("description"), m.get("ogImage"), m.get("language")])

Webhooks

Set up an Apify webhook with event ACTOR.RUN.SUCCEEDED to receive a notification when scraping completes. The webhook payload includes the run ID and dataset ID, allowing you to fetch results immediately from your backend.
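
A minimal receiver sketch, assuming the default webhook payload template (which nests the run object under resource) and a Flask endpoint you host yourself:

from flask import Flask, request
from apify_client import ApifyClient

app = Flask(__name__)
client = ApifyClient("YOUR_API_TOKEN")

@app.post("/apify-webhook")
def apify_webhook():
    payload = request.get_json(force=True)
    # Default payload template: the run object is nested under "resource"
    dataset_id = payload["resource"]["defaultDatasetId"]
    items = list(client.dataset(dataset_id).iterate_items())
    print(f"Scrape finished: {len(items)} results")
    return "", 204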

Scheduled Monitoring Pipeline

# Schedule this as a daily cron job to track headline changes
RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://news.ycombinator.com"], "mode": "headlines"}')
echo "$RESULT" | jq -r '.[0].headlines[].text' > today_headlines.txt
touch yesterday_headlines.txt  # no-op except on the first run
diff yesterday_headlines.txt today_headlines.txt > headline_changes.txt
cp today_headlines.txt yesterday_headlines.txt

CI/CD Pipeline Integration

Add link validation to your deployment pipeline:

# GitHub Actions example
- name: Check Links After Deploy
  run: |
    RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
      -H "Authorization: Bearer ${{ secrets.APIFY_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{"urls": ["${{ env.DEPLOY_URL }}"], "mode": "links"}')
    COUNT=$(echo "$RESULT" | jq '.[0].count')
    echo "Found $COUNT links on deployed page"

Tips and Tricks

  1. Use metadata mode for quick page audits. If you only need the title, description, and OG tags, the metadata mode is the fastest option. It skips link, image, and table extraction entirely.

  2. Combine modes across multiple runs. Run once with headlines mode and once with links mode to get targeted datasets. This is more efficient than full mode if you only need specific data types.

  3. Use custom CSS selectors for precision extraction. Instead of parsing the entire page, target exactly the elements you need. For example, .product-price on e-commerce pages or article p for blog content.

  4. Batch related URLs together. Scrape up to 10 URLs per run to minimize API calls and overhead. Group URLs by site or purpose for cleaner dataset organization.

  5. Check the status code in results. A 200 status means the page loaded successfully. A 301/302 means it was redirected. A 403/404 means access was denied or the page does not exist. Always filter by status in your downstream processing (see the sketch after this list).

  6. Export tables directly to CSV. The tables mode output is already structured with headers and rows, making it trivial to convert to CSV for spreadsheet import, as shown in the sketch after this list.

  7. Use text mode for content analysis. The text extraction removes nav, footer, header, scripts, and styles, giving you clean body content. This is ideal for word count analysis, sentiment analysis, or content comparison.

  8. Schedule regular scrapes for monitoring. Use Apify's built-in scheduler to run the actor daily or weekly on specific pages. Track changes by comparing datasets over time.
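
A minimal sketch combining tips 5 and 6: run tables mode, keep only items with a 200 status, then write each extracted table to a CSV file. Field names follow the Tables Mode output documented above.

import csv
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://en.wikipedia.org/wiki/List_of_countries"],
    "mode": "tables",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

# Tip 5: keep only pages that loaded successfully
ok_items = [i for i in items if i.get("status") == 200]

# Tip 6: headers + rows map directly onto CSV
for item in ok_items:
    for n, table in enumerate(item.get("tables", [])):
        with open(f"table_{n}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(table["headers"])
            writer.writerows(table["rows"])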


Frequently Asked Questions

Q: Does this actor render JavaScript? A: No. It fetches raw HTML using a lightweight HTTP client. For JavaScript-rendered pages (SPAs built with React, Vue, Angular), you may not get the full content. Consider using a browser-based scraper for such sites.

Q: What is the maximum number of URLs per run? A: 10 URLs per run. For larger batches, run the actor multiple times programmatically using the Apify API or schedule multiple runs.
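
A batching sketch with the Python client (the URL list is hypothetical):

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
all_urls = [f"https://example.com/page/{n}" for n in range(25)]  # hypothetical list

results = []
# One actor run per chunk of 10 URLs
for start in range(0, len(all_urls), 10):
    run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
        "urls": all_urls[start:start + 10],
        "mode": "metadata",
    })
    results.extend(client.dataset(run["defaultDatasetId"]).iterate_items())

print(f"Scraped {len(results)} pages in total")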

Q: How does the actor handle failed URLs? A: Each URL is processed independently. If one URL fails (timeout, DNS error, HTTP error), it is reported with an error message, and the remaining URLs continue processing normally.

Q: Can I scrape pages behind authentication? A: No. The actor can only access publicly available URLs. Pages requiring login will return the login page instead of the actual content.

Q: What CSS selectors are supported in custom mode? A: All standard CSS selectors are supported, including element (div), class (.class), ID (#id), attribute ([href]), combinators (div > p, ul li), and pseudo-classes (:first-child, :nth-of-type(2)). The selector is passed to cheerio, which implements the CSS Selectors Level 3 specification.

Q: Are relative URLs in link extraction resolved to absolute? A: Yes. All relative URLs are automatically resolved to full absolute URLs using the page's base URL.

Q: How does deduplication work in links mode? A: Links are deduplicated by URL. If the same URL appears multiple times with different anchor text, only the first occurrence is kept.

Q: What content is removed in text mode? A: Script tags, style tags, <nav>, <footer>, and <header> elements are removed before extracting body text. This gives you the main content without navigation, boilerplate, or code.

Q: Can I export results to CSV or Excel? A: Yes. Apify datasets support export to JSON, CSV, XML, and Excel formats. After the run completes, use the dataset export API or download directly from the Apify Console.
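
For example, the dataset items endpoint accepts a format parameter. Fetching CSV from Python (the requests library is assumed):

import requests

dataset_id = "DATASET_ID"  # taken from the finished run
resp = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    params={"format": "csv"},
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},
)
resp.raise_for_status()
with open("results.csv", "wb") as f:
    f.write(resp.content)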

Q: Is there a timeout per URL? A: Yes, each URL has a 15-second timeout. If a page does not respond within 15 seconds, it is skipped with an error message.

Q: Can I use this actor with the Apify CLI? A: Yes. Install the Apify CLI (npm install -g apify-cli), then run: apify call lazymac/web-scraper-toolkit -i '{"urls": ["https://example.com"], "mode": "metadata"}'. Results are saved to the local dataset.

Q: Does the actor handle rate limiting? A: The actor processes URLs sequentially, which naturally avoids hitting rate limits. For sites with aggressive rate limiting, consider adding a proxy configuration or reducing the number of URLs per run.

Q: Can I scrape PDF files or images? A: No. The actor is designed for HTML web pages only. It sends Accept: text/html headers and parses the response as HTML. Non-HTML responses (PDFs, images, JSON APIs) will either fail or return empty results.

Q: What happens if a URL returns a redirect loop? A: The actor follows redirects up to a reasonable limit. If a redirect loop is detected (too many redirects), the URL is reported with an error message and processing continues with the next URL.

Q: Can I extract data from iframes? A: No. The actor fetches and parses only the main page HTML. Content inside iframes (including embedded videos, maps, and third-party widgets) is not included in the extraction.

Q: How do I scrape more than 10 URLs? A: Run the actor multiple times with batches of 10 URLs each. You can automate this with the Apify API in a loop (as in the batching sketch above), or set up multiple scheduled runs with different URL batches.

Q: What encoding does the actor support? A: The actor handles UTF-8 encoded pages by default. Most modern web pages use UTF-8. Pages with other encodings (ISO-8859-1, Shift_JIS, etc.) may have character display issues in the output.


Limitations

  • Does not execute JavaScript (static HTML analysis only)
  • Maximum 10 URLs per run
  • Cannot access authenticated or paywalled pages
  • 15-second timeout per URL
  • Full mode limits links to 50 and images to 20 per URL
  • Custom mode returns text content only, not HTML

Changelog

  • v1.0 - Initial release with 8 extraction modes and pay-per-event pricing