Web Scraper Toolkit

Extract structured data from any webpage -- metadata, links, headlines, images, tables, full text, or custom CSS selectors. Scrape up to 10 URLs in a single run with 8 flexible extraction modes. No browser required, no API keys needed.

Built for developers, data analysts, content marketers, and anyone who needs to pull structured data from the web quickly and reliably.


What It Does

Web Scraper Toolkit fetches public web pages and extracts data in one of 8 modes. You can grab just the metadata (title, description, OG tags), extract all links on a page, pull out headlines, collect images with alt text, parse HTML tables into structured rows, extract clean body text, or target specific elements using custom CSS selectors. The "full" mode combines metadata, headlines, links, images, and tables in a single pass.

Each URL is processed independently, and results are pushed to the Apify dataset one by one. If a URL fails, the others still succeed -- you never lose partial results.
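
Downstream, you can split successes from failures in a couple of lines. A minimal sketch, assuming failed URLs carry an "error" field alongside "url" (inspect a failed item in your dataset to confirm the exact field name):

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com", "https://does-not-exist.invalid"],
    "mode": "metadata",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

# "error" as the failure marker is an assumption -- confirm it against a real failed item
failed = [i for i in items if "error" in i]
succeeded = [i for i in items if "error" not in i]
print(f"{len(succeeded)} succeeded, {len(failed)} failed")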

Key Capabilities

  • 8 Extraction Modes: full, metadata, links, headlines, images, tables, text, custom
  • Batch Processing: Scrape up to 10 URLs per run
  • Custom CSS Selectors: Target any element on the page with standard CSS selector syntax
  • Automatic Redirect Handling: Follows HTTP redirects transparently
  • Link Resolution: Relative URLs are automatically resolved to absolute URLs
  • Deduplication: Link extraction removes duplicate URLs automatically
  • Graceful Error Handling: Failed URLs are reported with error messages, other URLs continue processing
  • Lightweight: No browser rendering -- pure HTTP + HTML parsing for fast, cost-effective execution

What Data You Get

Common Fields (All Modes)

Field      Type    Description
url        string  The URL that was scraped
status     number  HTTP status code
timestamp  number  Unix timestamp of when the scrape occurred

Metadata Mode

Field                 Type    Description
metadata.title        string  Page title from the <title> tag
metadata.description  string  Meta description content
metadata.ogImage      string  Open Graph image URL
metadata.ogTitle      string  Open Graph title
metadata.canonical    string  Canonical URL
metadata.language     string  Page language from the lang attribute
metadata.url          string  The requested URL

Links Mode

Field         Type    Description
links         array   Array of link objects
links[].url   string  Absolute URL of the link
links[].text  string  Anchor text (null if empty)
count         number  Total number of unique links found

Headlines Mode

Field             Type    Description
headlines         array   Array of headline objects
headlines[].tag   string  HTML tag (h1, h2, or h3)
headlines[].text  string  Headline text content
count             number  Total number of headlines found

Images Mode

Field         Type    Description
images        array   Array of image objects
images[].url  string  Absolute URL of the image
images[].alt  string  Alt text (null if missing)
count         number  Total number of images found

Tables Mode

Field              Type    Description
tables             array   Array of table objects
tables[].headers   array   Column headers from <th> elements
tables[].rows      array   Array of row arrays (each row is an array of cell text)
tables[].rowCount  number  Number of data rows
count              number  Total number of tables found

Text Mode

Field   Type    Description
text    string  Clean body text with scripts, styles, nav, footer, and header removed
length  number  Character count of the extracted text

Custom Mode

Field           Type    Description
results         array   Array of matched element objects
results[].text  string  Text content of the matched element
results[].tag   string  HTML tag name of the matched element
count           number  Total number of matched elements

Full Mode

Returns metadata, headlines, links (top 50), images (top 20), and tables all in one result object.


How to Use

Basic Usage

  1. Open the Web Scraper Toolkit on Apify
  2. Enter your URLs as a JSON array (e.g., ["https://example.com"])
  3. Select a scraping mode (default: full)
  4. Click "Start"
  5. View results in the "Dataset" tab

Custom CSS Selector

  1. Set mode to custom
  2. Enter your CSS selector in the "CSS Selector" field (e.g., .article-title, #main-content p, table.data-table tr)
  3. The actor extracts text content and tag name for every matching element

Input Configuration

urls (array, required)
    JSON array of URLs to scrape. Maximum 10 URLs per run. Each URL must be a publicly accessible webpage. Example: ["https://example.com", "https://github.com"]

mode (string, optional, default: full)
    Extraction mode. One of: full (metadata + headlines + links + images + tables), metadata (page title, description, OG tags, canonical, language), links (all unique links with anchor text), headlines (H1, H2, H3 headings), images (all images with alt text), tables (HTML tables parsed into rows), text (clean body text), custom (elements matching a CSS selector).

selector (string, optional)
    CSS selector for custom mode. Supports any valid CSS selector syntax: element selectors (div, p), class selectors (.class-name), ID selectors (#id), attribute selectors ([data-type="value"]), combinators (div > p, ul li), pseudo-classes (:first-child, :nth-child(2)). Ignored when mode is not custom.
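
Putting these together, a complete input object for a custom-selector run looks like this:

{
  "urls": ["https://example.com", "https://github.com"],
  "mode": "custom",
  "selector": "h1, h2, p"
}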

Output Examples

Full Mode

{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "metadata": {
    "title": "Example Domain",
    "description": null,
    "ogImage": null,
    "ogTitle": null,
    "canonical": null,
    "language": null,
    "url": "https://example.com"
  },
  "headlines": [
    { "tag": "h1", "text": "Example Domain" }
  ],
  "links": [
    { "url": "https://www.iana.org/domains/example", "text": "More information..." }
  ],
  "images": [],
  "tables": []
}

Links Mode

{
  "url": "https://news.ycombinator.com",
  "status": 200,
  "timestamp": 1713264000000,
  "links": [
    { "url": "https://news.ycombinator.com/newest", "text": "new" },
    { "url": "https://news.ycombinator.com/front", "text": "past" },
    { "url": "https://news.ycombinator.com/newcomments", "text": "comments" },
    { "url": "https://some-article.com/post", "text": "Show HN: My new project" }
  ],
  "count": 187
}

Tables Mode

{
  "url": "https://en.wikipedia.org/wiki/List_of_countries",
  "status": 200,
  "timestamp": 1713264000000,
  "tables": [
    {
      "headers": ["Country", "Population", "Area (km2)"],
      "rows": [
        ["China", "1,425,671,352", "9,596,961"],
        ["India", "1,428,627,663", "3,287,263"]
      ],
      "rowCount": 195
    }
  ],
  "count": 1
}

Custom Mode

{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "results": [
    { "text": "Example Domain", "tag": "h1" },
    { "text": "This domain is for use in illustrative examples.", "tag": "p" }
  ],
  "count": 2
}

Cost Estimation

This actor uses the pay-per-event pricing model. You are charged per URL scraped.

Action          Events     Estimated Cost
Scrape 1 URL    1 event    ~$0.01 - $0.03 per URL
Scrape 10 URLs  10 events  ~$0.10 - $0.30 per run

A typical run uses minimal compute (128 MB RAM, 1-3 seconds per URL) because no browser is involved.


Integration Guide

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Scrape metadata from multiple URLs
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://github.com", "https://gitlab.com", "https://bitbucket.org"],
    "mode": "metadata",
})
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"{item['url']}: {item['metadata']['title']}")

# Extract all links from a page
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://news.ycombinator.com"],
    "mode": "links",
})
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"Found {item['count']} links")
    for link in item["links"][:10]:
        print(f"  {link['text']}: {link['url']}")

# Custom CSS selector extraction
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com"],
    "mode": "custom",
    "selector": "h1, h2, p",
})

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Full extraction
const run = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://github.com'],
    mode: 'full',
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].metadata.title);
console.log(`Headlines: ${items[0].headlines.length}`);
console.log(`Links: ${items[0].links.length}`);

// Extract tables
const tableRun = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_programming_languages'],
    mode: 'tables',
});
const { items: tableItems } = await client.dataset(tableRun.defaultDatasetId).listItems();
tableItems[0].tables.forEach((table) => {
    console.log(`Table with ${table.rowCount} rows, headers: ${table.headers.join(', ')}`);
});

Apify API (cURL)

# Start a run
curl -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "mode": "metadata"}'

# Get results
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json" \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Use Cases

  • Content Monitoring: Track headlines and text changes on competitor websites
  • Link Analysis: Extract all outbound/inbound links from a page for SEO research
  • Data Collection: Scrape HTML tables from Wikipedia, government sites, or any public data source
  • Social Media Preview: Check OG tags and metadata before sharing links
  • Research Automation: Collect structured data from multiple pages in one run
  • Image Auditing: Find all images on a page and check for missing alt text
  • Custom Extraction: Use CSS selectors to target specific page elements for any use case
  • Price Monitoring: Extract product prices from e-commerce pages on a schedule to track price changes (see the sketch after this list)
  • News Aggregation: Scrape headlines from multiple news sources and compile a daily digest
  • Accessibility Auditing: Extract all images and check for missing alt text across your site's pages
  • Sitemap Verification: Scrape links from key pages to verify your internal linking structure matches your sitemap
  • Academic Research: Collect structured data from public data portals and government websites
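
As a sketch of the price-monitoring case: the snippet below runs custom mode against a hypothetical .product-price selector and compares the result with the previous run. The product URL, selector, and state file are placeholders to adapt to the page you are tracking.

from pathlib import Path
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Hypothetical product page and selector -- replace with your own
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://shop.example.com/product/123"],
    "mode": "custom",
    "selector": ".product-price",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
results = items[0].get("results", [])
price = results[0]["text"] if results else None

# Compare against the price recorded by the previous scheduled run
state = Path("last_price.txt")
previous = state.read_text().strip() if state.exists() else None
if price and price != previous:
    print(f"Price changed: {previous} -> {price}")
    state.write_text(price)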

Integration with Other Tools

Zapier

  1. Create a Zap with your desired trigger (schedule, new spreadsheet row, webhook, etc.)
  2. Add an action: Apify -- Run Actor
  3. Select lazymac/web-scraper-toolkit and configure URLs and mode
  4. Add downstream actions to send extracted data to Google Sheets, Slack, email, Airtable, or any Zapier-connected app
  5. Map fields like metadata.title, links[].url, or headlines[].text to your destination columns

Make (Integromat)

  1. Create a new scenario with the Apify module
  2. Select "Run an Actor" and choose lazymac/web-scraper-toolkit
  3. Use an iterator to process each scraped URL result individually
  4. Route data to Google Sheets, a REST API, database, or notification service based on conditions

Google Sheets Integration

from apify_client import ApifyClient
import gspread

# Scrape metadata from multiple pages
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com", "https://github.com"],
    "mode": "metadata",
})
dataset = client.dataset(run["defaultDatasetId"])
results = list(dataset.iterate_items())

# Write to Google Sheets (authenticates with a service-account key file)
gc = gspread.service_account(filename="creds.json")
sheet = gc.open("Web Data").sheet1
sheet.append_row(["URL", "Title", "Description", "OG Image", "Language"])
for r in results:
    m = r.get("metadata", {})
    sheet.append_row([r["url"], m.get("title"), m.get("description"), m.get("ogImage"), m.get("language")])

Webhooks

Set up an Apify webhook with event ACTOR.RUN.SUCCEEDED to receive a notification when scraping completes. The webhook payload includes the run ID and dataset ID, allowing you to fetch results immediately from your backend.
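
A minimal receiver sketch, assuming the default webhook payload template (which nests the run object under resource) and a Flask endpoint you host yourself:

from flask import Flask, request
from apify_client import ApifyClient

app = Flask(__name__)
client = ApifyClient("YOUR_API_TOKEN")

@app.post("/apify-webhook")
def apify_webhook():
    payload = request.get_json(force=True)
    # Default payload template: the run object is nested under "resource"
    dataset_id = payload["resource"]["defaultDatasetId"]
    items = list(client.dataset(dataset_id).iterate_items())
    print(f"Scrape finished: {len(items)} results")
    return "", 204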

Scheduled Monitoring Pipeline

# Schedule this as a daily cron job to track headline changes
RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://news.ycombinator.com"], "mode": "headlines"}')
echo "$RESULT" | jq -r '.[0].headlines[].text' > today_headlines.txt
touch yesterday_headlines.txt  # no-op except on the first run
diff yesterday_headlines.txt today_headlines.txt > headline_changes.txt
cp today_headlines.txt yesterday_headlines.txt

CI/CD Pipeline Integration

Add link validation to your deployment pipeline:

# GitHub Actions example
- name: Check Links After Deploy
  run: |
    RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
      -H "Authorization: Bearer ${{ secrets.APIFY_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{"urls": ["${{ env.DEPLOY_URL }}"], "mode": "links"}')
    COUNT=$(echo "$RESULT" | jq '.[0].count')
    echo "Found $COUNT links on deployed page"

Tips and Tricks

  1. Use metadata mode for quick page audits. If you only need the title, description, and OG tags, the metadata mode is the fastest option. It skips link, image, and table extraction entirely.

  2. Combine modes across multiple runs. Run once with headlines mode and once with links mode to get targeted datasets. This is more efficient than full mode if you only need specific data types.

  3. Use custom CSS selectors for precision extraction. Instead of parsing the entire page, target exactly the elements you need. For example, .product-price on e-commerce pages or article p for blog content.

  4. Batch related URLs together. Scrape up to 10 URLs per run to minimize API calls and overhead. Group URLs by site or purpose for cleaner dataset organization.

  5. Check the status code in results. A 200 status means the page loaded successfully. A 301/302 means it was redirected. A 403/404 means access was denied or the page does not exist. Always filter by status in your downstream processing (see the sketch after this list).

  6. Export tables directly to CSV. The tables mode output is already structured with headers and rows, making it trivial to convert to CSV for spreadsheet import, as shown in the sketch after this list.

  7. Use text mode for content analysis. The text extraction removes nav, footer, header, scripts, and styles, giving you clean body content. This is ideal for word count analysis, sentiment analysis, or content comparison.

  8. Schedule regular scrapes for monitoring. Use Apify's built-in scheduler to run the actor daily or weekly on specific pages. Track changes by comparing datasets over time.
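
A minimal sketch combining tips 5 and 6: run tables mode, keep only items with a 200 status, then write each extracted table to a CSV file. Field names follow the Tables Mode output documented above.

import csv
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://en.wikipedia.org/wiki/List_of_countries"],
    "mode": "tables",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

# Tip 5: keep only pages that loaded successfully
ok_items = [i for i in items if i.get("status") == 200]

# Tip 6: headers + rows map directly onto CSV
for item in ok_items:
    for n, table in enumerate(item.get("tables", [])):
        with open(f"table_{n}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(table["headers"])
            writer.writerows(table["rows"])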


Frequently Asked Questions

Q: Does this actor render JavaScript? A: No. It fetches raw HTML using a lightweight HTTP client. For JavaScript-rendered pages (SPAs built with React, Vue, Angular), you may not get the full content. Consider using a browser-based scraper for such sites.

Q: What is the maximum number of URLs per run? A: 10 URLs per run. For larger batches, run the actor multiple times programmatically using the Apify API or schedule multiple runs.
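
A batching sketch with the Python client (the URL list is hypothetical):

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
all_urls = [f"https://example.com/page/{n}" for n in range(25)]  # hypothetical list

results = []
# One actor run per chunk of 10 URLs
for start in range(0, len(all_urls), 10):
    run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
        "urls": all_urls[start:start + 10],
        "mode": "metadata",
    })
    results.extend(client.dataset(run["defaultDatasetId"]).iterate_items())

print(f"Scraped {len(results)} pages in total")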

Q: How does the actor handle failed URLs? A: Each URL is processed independently. If one URL fails (timeout, DNS error, HTTP error), it is reported with an error message, and the remaining URLs continue processing normally.

Q: Can I scrape pages behind authentication? A: No. The actor can only access publicly available URLs. Pages requiring login will return the login page instead of the actual content.

Q: What CSS selectors are supported in custom mode? A: All standard CSS selectors are supported, including element (div), class (.class), ID (#id), attribute ([href]), combinators (div > p, ul li), and pseudo-classes (:first-child, :nth-of-type(2)). The selector is passed to cheerio, which implements the CSS Selectors Level 3 specification.

Q: Are relative URLs in link extraction resolved to absolute? A: Yes. All relative URLs are automatically resolved to full absolute URLs using the page's base URL.

Q: How does deduplication work in links mode? A: Links are deduplicated by URL. If the same URL appears multiple times with different anchor text, only the first occurrence is kept.

Q: What content is removed in text mode? A: Script tags, style tags, <nav>, <footer>, and <header> elements are removed before extracting body text. This gives you the main content without navigation, boilerplate, or code.

Q: Can I export results to CSV or Excel? A: Yes. Apify datasets support export to JSON, CSV, XML, and Excel formats. After the run completes, use the dataset export API or download directly from the Apify Console.
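
For example, the dataset items endpoint accepts a format parameter. Fetching CSV from Python (the requests library is assumed):

import requests

dataset_id = "DATASET_ID"  # taken from the finished run
resp = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    params={"format": "csv"},
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},
)
resp.raise_for_status()
with open("results.csv", "wb") as f:
    f.write(resp.content)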

Q: Is there a timeout per URL? A: Yes, each URL has a 15-second timeout. If a page does not respond within 15 seconds, it is skipped with an error message.

Q: Can I use this actor with the Apify CLI? A: Yes. Install the Apify CLI (npm install -g apify-cli), then run: apify call lazymac/web-scraper-toolkit -i '{"urls": ["https://example.com"], "mode": "metadata"}'. Results are saved to the local dataset.

Q: Does the actor handle rate limiting? A: The actor processes URLs sequentially, which naturally avoids hitting rate limits. For sites with aggressive rate limiting, consider adding a proxy configuration or reducing the number of URLs per run.

Q: Can I scrape PDF files or images? A: No. The actor is designed for HTML web pages only. It sends Accept: text/html headers and parses the response as HTML. Non-HTML responses (PDFs, images, JSON APIs) will either fail or return empty results.

Q: What happens if a URL returns a redirect loop? A: The actor follows redirects up to a reasonable limit. If a redirect loop is detected (too many redirects), the URL is reported with an error message and processing continues with the next URL.

Q: Can I extract data from iframes? A: No. The actor fetches and parses only the main page HTML. Content inside iframes (including embedded videos, maps, and third-party widgets) is not included in the extraction.

Q: How do I scrape more than 10 URLs? A: Run the actor multiple times with batches of 10 URLs each. You can automate this with the Apify API in a loop (as in the batching sketch above), or set up multiple scheduled runs with different URL batches.

Q: What encoding does the actor support? A: The actor handles UTF-8 encoded pages by default. Most modern web pages use UTF-8. Pages with other encodings (ISO-8859-1, Shift_JIS, etc.) may have character display issues in the output.


Limitations

  • Does not execute JavaScript (static HTML analysis only)
  • Maximum 10 URLs per run
  • Cannot access authenticated or paywalled pages
  • 15-second timeout per URL
  • Full mode limits links to 50 and images to 20 per URL
  • Custom mode returns text content only, not HTML

Changelog

  • v1.0 - Initial release with 8 extraction modes and pay-per-event pricing