Universal Web Scraper - Extract Any URL

Pricing: $30.00 / 1,000 web scrape results
Developer: lazymac
Rating: 0.0 (0 reviews)
Actor stats: 0 bookmarked · 5 total users · 2 monthly active users · last modified 5 days ago

Pay-per-result web scraper with CSS selector extraction for product catalogs, competitor pricing, news aggregation, and lead generation. Fast (1-3 s per page), no browser required, no API keys needed.
Web Scraper Toolkit
Extract structured data from any webpage -- metadata, links, headlines, images, tables, full text, or custom CSS selectors. Scrape up to 10 URLs in a single run with 8 flexible extraction modes. No browser required, no API keys needed.
Built for developers, data analysts, content marketers, and anyone who needs to pull structured data from the web quickly and reliably.
What It Does
Web Scraper Toolkit fetches public web pages and extracts data in one of 8 modes. You can grab just the metadata (title, description, OG tags), extract all links on a page, pull out headlines, collect images with alt text, parse HTML tables into structured rows, extract clean body text, or target specific elements using custom CSS selectors. The "full" mode combines metadata, headlines, links, images, and tables in a single pass.
Each URL is processed independently, and results are pushed to the Apify dataset one by one. If a URL fails, the others still succeed -- you never lose partial results.
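Because each item lands in the dataset independently, downstream code should expect a mix of successful and failed results. A minimal sketch of separating the two (the `error` field name on failed items is an assumption; the `status` field comes from the documented output schema):

```python
def split_results(items):
    """Separate successful scrape items from failed ones.

    Assumes failed items carry an "error" field (an assumption about this
    actor's output) and that successful items report an HTTP status < 400.
    """
    ok, failed = [], []
    for item in items:
        if item.get("error") or item.get("status", 0) >= 400:
            failed.append(item)
        else:
            ok.append(item)
    return ok, failed

items = [
    {"url": "https://example.com", "status": 200},
    {"url": "https://missing.example", "error": "timeout"},
]
ok, failed = split_results(items)
print(len(ok), len(failed))  # 1 1
```

Filtering this way lets a scheduled job retry only the failed URLs in a follow-up run.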
Key Capabilities
- 8 Extraction Modes: full, metadata, links, headlines, images, tables, text, custom
- Batch Processing: Scrape up to 10 URLs per run
- Custom CSS Selectors: Target any element on the page with standard CSS selector syntax
- Automatic Redirect Handling: Follows HTTP redirects transparently
- Link Resolution: Relative URLs are automatically resolved to absolute URLs
- Deduplication: Link extraction removes duplicate URLs automatically
- Graceful Error Handling: Failed URLs are reported with error messages, other URLs continue processing
- Lightweight: No browser rendering -- pure HTTP + HTML parsing for fast, cost-effective execution
What Data You Get
Common Fields (All Modes)
| Field | Type | Description |
|---|---|---|
| url | string | The URL that was scraped |
| status | number | HTTP status code |
| timestamp | number | Unix timestamp in milliseconds of when the scrape occurred |
Metadata Mode
| Field | Type | Description |
|---|---|---|
| metadata.title | string | Page title from the `<title>` tag |
| metadata.description | string | Meta description content |
| metadata.ogImage | string | Open Graph image URL |
| metadata.ogTitle | string | Open Graph title |
| metadata.canonical | string | Canonical URL |
| metadata.language | string | Page language from the `lang` attribute |
| metadata.url | string | The requested URL |
Links Mode
| Field | Type | Description |
|---|---|---|
| links | array | Array of link objects |
| links[].url | string | Absolute URL of the link |
| links[].text | string | Anchor text (null if empty) |
| count | number | Total number of unique links found |
Headlines Mode
| Field | Type | Description |
|---|---|---|
| headlines | array | Array of headline objects |
| headlines[].tag | string | HTML tag (h1, h2, or h3) |
| headlines[].text | string | Headline text content |
| count | number | Total number of headlines found |
Images Mode
| Field | Type | Description |
|---|---|---|
| images | array | Array of image objects |
| images[].url | string | Absolute URL of the image |
| images[].alt | string | Alt text (null if missing) |
| count | number | Total number of images found |
Tables Mode
| Field | Type | Description |
|---|---|---|
| tables | array | Array of table objects |
| tables[].headers | array | Column headers from `<th>` elements |
| tables[].rows | array | Array of row arrays (each row is an array of cell text) |
| tables[].rowCount | number | Number of data rows |
| count | number | Total number of tables found |
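Because each table object already carries `headers` and `rows`, it maps directly onto Python's `csv` module. A sketch converting one tables-mode object to a CSV string (the sample data is illustrative):

```python
import csv
import io

def table_to_csv(table):
    """Render one tables-mode object (headers + rows) as a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buf.getvalue()

# Illustrative item matching the documented schema
table = {
    "headers": ["Country", "Population"],
    "rows": [["China", "1,425,671,352"], ["India", "1,428,627,663"]],
    "rowCount": 2,
}
print(table_to_csv(table))
```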
Text Mode
| Field | Type | Description |
|---|---|---|
| text | string | Clean body text with scripts, styles, nav, footer, and header removed |
| length | number | Character count of the extracted text |
Custom Mode
| Field | Type | Description |
|---|---|---|
| results | array | Array of matched element objects |
| results[].text | string | Text content of the matched element |
| results[].tag | string | HTML tag name of the matched element |
| count | number | Total number of matched elements |
Full Mode
Returns metadata, headlines, links (top 50), images (top 20), and tables all in one result object.
How to Use
Basic Usage
- Open the Web Scraper Toolkit on Apify
- Enter your URLs as a JSON array (e.g., `["https://example.com"]`)
- Select a scraping mode (default: `full`)
- Click "Start"
- View results in the "Dataset" tab
Custom CSS Selector
- Set mode to `custom`
- Enter your CSS selector in the "CSS Selector" field (e.g., `.article-title`, `#main-content p`, `table.data-table tr`)
- The actor extracts text content and tag name for every matching element
Input Configuration
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | array | Yes | -- | JSON array of URLs to scrape. Maximum 10 URLs per run. Each URL must be a publicly accessible webpage. Example: `["https://example.com", "https://github.com"]` |
| mode | string | No | full | Extraction mode. One of: `full` (metadata + headlines + links + images + tables), `metadata` (page title, description, OG tags, canonical, language), `links` (all unique links with anchor text), `headlines` (H1, H2, H3 headings), `images` (all images with alt text), `tables` (HTML tables parsed into rows), `text` (clean body text), `custom` (elements matching a CSS selector). |
| selector | string | No | -- | CSS selector for `custom` mode. Supports any valid CSS selector syntax: element selectors (`div`, `p`), class selectors (`.class-name`), ID selectors (`#id`), attribute selectors (`[data-type="value"]`), combinators (`div > p`, `ul li`), pseudo-classes (`:first-child`, `:nth-child(2)`). Ignored when mode is not `custom`. |
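Putting the three fields together, a complete input object for custom mode might look like this (the URLs and selector are illustrative):

```json
{
  "urls": ["https://example.com", "https://example.org"],
  "mode": "custom",
  "selector": ".article-title, #main-content p"
}
```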
Output Example
Full Mode
```json
{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "metadata": {
    "title": "Example Domain",
    "description": null,
    "ogImage": null,
    "ogTitle": null,
    "canonical": null,
    "language": null,
    "url": "https://example.com"
  },
  "headlines": [{ "tag": "h1", "text": "Example Domain" }],
  "links": [{ "url": "https://www.iana.org/domains/example", "text": "More information..." }],
  "images": [],
  "tables": []
}
```
Links Mode
```json
{
  "url": "https://news.ycombinator.com",
  "status": 200,
  "timestamp": 1713264000000,
  "links": [
    { "url": "https://news.ycombinator.com/newest", "text": "new" },
    { "url": "https://news.ycombinator.com/front", "text": "past" },
    { "url": "https://news.ycombinator.com/newcomments", "text": "comments" },
    { "url": "https://some-article.com/post", "text": "Show HN: My new project" }
  ],
  "count": 187
}
```
Tables Mode
```json
{
  "url": "https://en.wikipedia.org/wiki/List_of_countries",
  "status": 200,
  "timestamp": 1713264000000,
  "tables": [
    {
      "headers": ["Country", "Population", "Area (km2)"],
      "rows": [
        ["China", "1,425,671,352", "9,596,961"],
        ["India", "1,428,627,663", "3,287,263"]
      ],
      "rowCount": 195
    }
  ],
  "count": 1
}
```
Custom Mode
```json
{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "results": [
    { "text": "Example Domain", "tag": "h1" },
    { "text": "This domain is for use in illustrative examples.", "tag": "p" }
  ],
  "count": 2
}
```
Cost Estimation
This actor uses the pay-per-event pricing model. You are charged per URL scraped.
| Action | Event | Estimated Cost |
|---|---|---|
| Scrape 1 URL | 1 event | ~$0.01 - $0.03 per URL |
| Scrape 10 URLs | 10 events | ~$0.10 - $0.30 per run |
Typical run uses minimal compute (128 MB RAM, 1-3 seconds per URL) because there is no browser involved.
Integration Guide
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Scrape metadata from multiple URLs
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://github.com", "https://gitlab.com", "https://bitbucket.org"],
    "mode": "metadata"
})
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"{item['url']}: {item['metadata']['title']}")
```

```python
# Extract all links from a page
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://news.ycombinator.com"],
    "mode": "links"
})
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"Found {item['count']} links")
    for link in item['links'][:10]:
        print(f"  {link['text']}: {link['url']}")
```

```python
# Custom CSS selector extraction
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com"],
    "mode": "custom",
    "selector": "h1, h2, p"
})
```
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Full extraction
const run = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://github.com'],
    mode: 'full',
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].metadata.title);
console.log(`Headlines: ${items[0].headlines.length}`);
console.log(`Links: ${items[0].links.length}`);

// Extract tables
const tableRun = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_programming_languages'],
    mode: 'tables',
});
const { items: tableItems } = await client.dataset(tableRun.defaultDatasetId).listItems();
tableItems[0].tables.forEach(table => {
    console.log(`Table with ${table.rowCount} rows, headers: ${table.headers.join(', ')}`);
});
```
Apify API (cURL)
```bash
# Start a run
curl -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "mode": "metadata"}'

# Get results
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json" \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```
Use Cases
- Content Monitoring: Track headlines and text changes on competitor websites
- Link Analysis: Extract all outbound/inbound links from a page for SEO research
- Data Collection: Scrape HTML tables from Wikipedia, government sites, or any public data source
- Social Media Preview: Check OG tags and metadata before sharing links
- Research Automation: Collect structured data from multiple pages in one run
- Image Auditing: Find all images on a page and check for missing alt text
- Custom Extraction: Use CSS selectors to target specific page elements for any use case
- Price Monitoring: Extract product prices from e-commerce pages on a schedule to track price changes
- News Aggregation: Scrape headlines from multiple news sources and compile a daily digest
- Accessibility Auditing: Extract all images and check for missing alt text across your site's pages
- Sitemap Verification: Scrape links from key pages to verify your internal linking structure matches your sitemap
- Academic Research: Collect structured data from public data portals and government websites
Integration with Other Tools
Zapier
- Create a Zap with your desired trigger (schedule, new spreadsheet row, webhook, etc.)
- Add an action: Apify -- Run Actor
- Select `lazymac/web-scraper-toolkit` and configure URLs and mode
- Add downstream actions to send extracted data to Google Sheets, Slack, email, Airtable, or any Zapier-connected app
- Map fields like `metadata.title`, `links[].url`, or `headlines[].text` to your destination columns
Make (Integromat)
- Create a new scenario with the Apify module
- Select "Run an Actor" and choose `lazymac/web-scraper-toolkit`
- Use an iterator to process each scraped URL result individually
- Route data to Google Sheets, a REST API, database, or notification service based on conditions
Google Sheets Integration
```python
from apify_client import ApifyClient
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Scrape metadata from multiple pages
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com", "https://github.com"],
    "mode": "metadata"
})
dataset = client.dataset(run["defaultDatasetId"])
results = list(dataset.iterate_items())

# Write to Google Sheets
scope = ["https://spreadsheets.google.com/feeds"]
creds = ServiceAccountCredentials.from_json_keyfile_name("creds.json", scope)
gc = gspread.authorize(creds)
sheet = gc.open("Web Data").sheet1
sheet.append_row(["URL", "Title", "Description", "OG Image", "Language"])
for r in results:
    m = r.get("metadata", {})
    sheet.append_row([r["url"], m.get("title"), m.get("description"), m.get("ogImage"), m.get("language")])
```
Webhooks
Set up an Apify webhook with event ACTOR.RUN.SUCCEEDED to receive a notification when scraping completes. The webhook payload includes the run ID and dataset ID, allowing you to fetch results immediately from your backend.
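As a sketch, a backend handler can pull the dataset ID out of the webhook body and fetch items from there. The payload shape below (a `resource` object carrying `defaultDatasetId`) reflects Apify's standard webhook format, but verify it against the payloads you actually receive:

```python
import json

def dataset_id_from_webhook(body: str) -> str:
    """Extract the dataset ID from an Apify webhook payload.

    Assumes the standard payload shape where the run object sits under
    "resource" and carries "defaultDatasetId".
    """
    payload = json.loads(body)
    return payload["resource"]["defaultDatasetId"]

# Example payload, truncated to the fields used here
body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"id": "RUN_ID", "defaultDatasetId": "DATASET_ID"},
})
print(dataset_id_from_webhook(body))  # DATASET_ID
```

With the dataset ID in hand, the results can be fetched via the dataset items endpoint shown in the cURL example above.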
Scheduled Monitoring Pipeline
```bash
# Schedule this as a daily cron job to track headline changes
RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://news.ycombinator.com"], "mode": "headlines"}')

echo "$RESULT" | jq '.[0].headlines[] | .text' > today_headlines.txt
diff yesterday_headlines.txt today_headlines.txt > headline_changes.txt
cp today_headlines.txt yesterday_headlines.txt
```
CI/CD Pipeline Integration
Add link validation to your deployment pipeline:
```yaml
# GitHub Actions example
- name: Check Links After Deploy
  run: |
    RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
      -H "Authorization: Bearer ${{ secrets.APIFY_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{"urls": ["${{ env.DEPLOY_URL }}"], "mode": "links"}')
    COUNT=$(echo $RESULT | jq '.[0].count')
    echo "Found $COUNT links on deployed page"
```
Tips and Tricks
- Use metadata mode for quick page audits. If you only need the title, description, and OG tags, `metadata` mode is the fastest option; it skips link, image, and table extraction entirely.
- Combine modes across multiple runs. Run once with `headlines` mode and once with `links` mode to get targeted datasets. This is more efficient than `full` mode if you only need specific data types.
- Use custom CSS selectors for precision extraction. Instead of parsing the entire page, target exactly the elements you need: for example, `.product-price` on e-commerce pages or `article p` for blog content.
- Batch related URLs together. Scrape up to 10 URLs per run to minimize API calls and overhead. Group URLs by site or purpose for cleaner dataset organization.
- Check the status code in results. A 200 status means the page loaded successfully, a 301/302 means it was redirected, and a 403/404 means access was denied or the page does not exist. Always filter by status in your downstream processing.
- Export tables directly to CSV. The tables mode output is already structured with headers and rows, making it trivial to convert to CSV for spreadsheet import.
- Use text mode for content analysis. Text extraction removes nav, footer, header, scripts, and styles, giving you clean body content that is ideal for word count analysis, sentiment analysis, or content comparison.
- Schedule regular scrapes for monitoring. Use Apify's built-in scheduler to run the actor daily or weekly on specific pages, and track changes by comparing datasets over time.
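For the monitoring tip, a change report can be as simple as a set difference between yesterday's and today's headline texts. A sketch, with input lists that mimic headlines-mode output:

```python
def headline_changes(yesterday, today):
    """Return (added, removed) headline texts between two headlines-mode results."""
    old = {h["text"] for h in yesterday}
    new = {h["text"] for h in today}
    return sorted(new - old), sorted(old - new)

yesterday = [{"tag": "h2", "text": "Old story"}, {"tag": "h2", "text": "Stable story"}]
today = [{"tag": "h2", "text": "New story"}, {"tag": "h2", "text": "Stable story"}]
added, removed = headline_changes(yesterday, today)
print(added)    # ['New story']
print(removed)  # ['Old story']
```

The same pattern works for links or images: load two dataset snapshots, compare the relevant field as sets, and report the difference.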
Frequently Asked Questions
Q: Does this actor render JavaScript? A: No. It fetches raw HTML using a lightweight HTTP client. For JavaScript-rendered pages (SPAs built with React, Vue, Angular), you may not get the full content. Consider using a browser-based scraper for such sites.
Q: What is the maximum number of URLs per run? A: 10 URLs per run. For larger batches, run the actor multiple times programmatically using the Apify API or schedule multiple runs.
Q: How does the actor handle failed URLs? A: Each URL is processed independently. If one URL fails (timeout, DNS error, HTTP error), it is reported with an error message, and the remaining URLs continue processing normally.
Q: Can I scrape pages behind authentication? A: No. The actor can only access publicly available URLs. Pages requiring login will return the login page instead of the actual content.
Q: What CSS selectors are supported in custom mode?
A: All standard CSS selectors are supported, including element (div), class (.class), ID (#id), attribute ([href]), combinators (div > p, ul li), and pseudo-classes (:first-child, :nth-of-type(2)). The selector is passed to cheerio, which implements the CSS Selectors Level 3 specification.
Q: Are relative URLs in link extraction resolved to absolute? A: Yes. All relative URLs are automatically resolved to full absolute URLs using the page's base URL.
Q: How does deduplication work in links mode? A: Links are deduplicated by URL. If the same URL appears multiple times with different anchor text, only the first occurrence is kept.
Q: What content is removed in text mode?
A: Script tags, style tags, <nav>, <footer>, and <header> elements are removed before extracting body text. This gives you the main content without navigation, boilerplate, or code.
Q: Can I export results to CSV or Excel? A: Yes. Apify datasets support export to JSON, CSV, XML, and Excel formats. After the run completes, use the dataset export API or download directly from the Apify Console.
Q: Is there a timeout per URL? A: Yes, each URL has a 15-second timeout. If a page does not respond within 15 seconds, it is skipped with an error message.
Q: Can I use this actor with the Apify CLI?
A: Yes. Install the Apify CLI (npm install -g apify-cli), then run: apify call lazymac/web-scraper-toolkit -i '{"urls": ["https://example.com"], "mode": "metadata"}'. Results are saved to the local dataset.
Q: Does the actor handle rate limiting? A: The actor processes URLs sequentially, which naturally avoids hitting rate limits. For sites with aggressive rate limiting, consider adding a proxy configuration or reducing the number of URLs per run.
Q: Can I scrape PDF files or images?
A: No. The actor is designed for HTML web pages only. It sends Accept: text/html headers and parses the response as HTML. Non-HTML responses (PDFs, images, JSON APIs) will either fail or return empty results.
Q: What happens if a URL returns a redirect loop? A: The actor follows redirects up to a reasonable limit. If a redirect loop is detected (too many redirects), the URL is reported with an error message and processing continues with the next URL.
Q: Can I extract data from iframes? A: No. The actor fetches and parses only the main page HTML. Content inside iframes (including embedded videos, maps, and third-party widgets) is not included in the extraction.
Q: How do I scrape more than 10 URLs? A: Run the actor multiple times with batches of 10 URLs each. You can automate this using the Apify API in a loop, or set up multiple scheduled runs with different URL batches.
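The batching loop from this answer can be sketched as follows; the chunk size comes from the documented 10-URL limit, and the commented actor call mirrors the Python integration example above:

```python
def chunk(urls, size=10):
    """Split a URL list into batches no larger than the per-run limit."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

# Example: 25 URLs become batches of 10, 10, and 5
urls = [f"https://example.com/page/{n}" for n in range(25)]
batches = chunk(urls)
print([len(b) for b in batches])  # [10, 10, 5]

# Each batch then becomes one actor run (sketch, not executed here):
# for batch in batches:
#     client.actor("lazymac/web-scraper-toolkit").call(
#         run_input={"urls": batch, "mode": "metadata"})
```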
Q: What encoding does the actor support? A: The actor handles UTF-8 encoded pages by default. Most modern web pages use UTF-8. Pages with other encodings (ISO-8859-1, Shift_JIS, etc.) may have character display issues in the output.
Limitations
- Does not execute JavaScript (static HTML analysis only)
- Maximum 10 URLs per run
- Cannot access authenticated or paywalled pages
- 15-second timeout per URL
- Full mode limits links to 50 and images to 20 per URL
- Custom mode returns text content only, not HTML
Changelog
- v1.0 - Initial release with 8 extraction modes and pay-per-event pricing


