Web Scraper
Deprecated
Pricing: $35.00 / 1,000 results
Rating: 5.0 (1)
Developer: Rush
Maintained by Community
Actor stats
- Bookmarked: 3
- Total users: 33
- Monthly active users: 0
- Last modified: 3 months ago
Web Scraper — Extract Content from Any Website
Extract structured content from any public web page. Enter one or more URLs, optionally crawl deeper into the site, and receive clean, flat JSON records containing titles, headings, text, links, images, tables, Open Graph metadata, and JSON-LD structured data.
What this actor does
Give it a list of URLs — it visits each page and extracts:
- Page metadata — title, meta description, language, canonical URL
- Headings — H1, H2, H3 texts
- Text content — all paragraph text combined into a single field
- Links — up to 50 links with anchor text and destination URL
- Images — up to 20 images with source URL and alt text
- Tables — up to 5 tables with headers and row data
- SEO data — Open Graph title, description, and image; JSON-LD structured data
Every URL produces one record with the same fixed set of fields, making the output easy to use in spreadsheets, databases, and automation tools like Make, Zapier, or n8n.
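The difference between the total counts and the capped lists can be illustrated with a short sketch. This is not the actor's implementation: it uses Python's standard `html.parser` as a stand-in, covers only a few of the fields, and the names `PageExtractor`, `extract`, and `MAX_LINKS` are hypothetical.

```python
from html.parser import HTMLParser

MAX_LINKS = 50  # cap on returned links, as stated in this README


class PageExtractor(HTMLParser):
    """Collects title, H1 texts, paragraph texts, and links from one page."""

    TRACKED = {"title", "h1", "p", "a"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = []
        self.paragraphs = []
        self.links = []
        self.link_count = 0
        self._open = {}    # tag name -> text chunks seen while the tag is open
        self._href = None  # href of the currently open <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open[tag] = []
            if tag == "a":
                self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text belongs to every tracked tag that is currently open,
        # so link text inside a paragraph counts for both.
        for chunks in self._open.values():
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # untracked tag, or a tag that was never opened
        text = " ".join("".join(self._open.pop(tag)).split())
        if tag == "title":
            self.title = text
        elif tag == "h1":
            self.h1.append(text)
        elif tag == "p" and text:
            self.paragraphs.append(text)  # this sketch skips empty paragraphs
        elif tag == "a" and self._href:
            self.link_count += 1  # count every link...
            if len(self.links) < MAX_LINKS:  # ...but keep only the first 50
                self.links.append({"text": text, "href": self._href})


def extract(html):
    """Return a partial record shaped like the actor's output."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "h1": parser.h1,
        "textContent": " ".join(parser.paragraphs),
        "paragraphCount": len(parser.paragraphs),
        "linkCount": parser.link_count,
        "links": parser.links,
    }
```

With this approach, a page containing 60 links would report a `linkCount` of 60 while `links` holds only the first 50. The sketch assumes well-formed markup with explicit closing tags.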
Deep crawling
Set Crawl Depth greater than 0 to automatically discover and extract content from pages linked within the same website. Depth 0 means only the URLs you provide; depth 1 also extracts pages linked from those, and so on. Use Maximum Pages to control the total number of pages extracted.
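The interaction between Crawl Depth and Maximum Pages can be pictured as a breadth-first traversal with a shared page budget. The following is a minimal sketch under stated assumptions, not the actor's actual code: the `crawl` function, the in-memory `LINKS` graph, and the same-host check are all hypothetical and were chosen so the example runs without network access.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages and their outgoing links.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",  # different site, never followed
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}


def crawl(start_urls, crawl_depth=0, max_pages=50):
    """Return (url, depth) pairs in breadth-first order."""
    start = list(dict.fromkeys(start_urls))  # drop duplicate start URLs
    seen = set(start)
    queue = deque((url, 0) for url in start)
    visited = []
    # The page budget is shared by start URLs and discovered links.
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= crawl_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in LINKS.get(url, []):
            same_site = urlparse(link).netloc == urlparse(url).netloc
            if same_site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

In this sketch, `crawl(["https://example.com/"], crawl_depth=0)` visits only the start URL, while `crawl_depth=1` adds the two same-site pages linked from it and skips the link to the other domain.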
Getting started
- Add URLs — paste one or more web page URLs into the input field
- Set crawl depth (optional) — increase to discover linked pages within the same site
- Run the actor — click Start and wait for extraction to finish
- Download results — export as JSON, CSV, or Excel from the Dataset tab
The actor works out of the box with default settings; most use cases need no further configuration.
Input fields
| Field | Required | Default | Description |
|---|---|---|---|
| URLs to Scrape | Yes | — | One or more web page URLs to extract content from |
| Maximum Pages | No | 50 | Total page limit across all URLs and discovered links |
| Crawl Depth | No | 0 | How many levels of links to follow (0 = only provided URLs) |
| Scroll Page | No | Off | Scroll each page before extracting (for sites that load more content as you scroll) |
| Force Browser Mode | No | Off | Always use a full browser instead of fast extraction (for JavaScript-heavy sites) |
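When starting runs programmatically, the defaults in the table above can be captured in a small helper. This is a sketch only: the README lists display labels, not the underlying input keys, so `startUrls`, `maxPages`, `crawlDepth`, `scrollPage`, and `forceBrowser` are hypothetical names, as is the `build_input` function. Check the actor's input schema for the real keys.

```python
def build_input(urls, max_pages=50, crawl_depth=0,
                scroll_page=False, force_browser=False):
    """Build a run-input dict using the defaults from the input table.

    All key names in the returned dict are hypothetical placeholders.
    """
    urls = list(urls)
    if not urls:
        raise ValueError("At least one URL is required")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    if crawl_depth < 0:
        raise ValueError("crawl_depth cannot be negative")
    return {
        "startUrls": urls,              # URLs to Scrape (required)
        "maxPages": max_pages,          # Maximum Pages
        "crawlDepth": crawl_depth,      # Crawl Depth
        "scrollPage": scroll_page,      # Scroll Page
        "forceBrowser": force_browser,  # Force Browser Mode
    }
```

Calling `build_input(["https://example.com"])` yields an input that extracts only the given URL, with scrolling and browser mode off.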
Output fields
Each extracted page produces a record with these fields:
| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| depth | integer | Crawl depth (0 = start URL, 1 = linked from start, etc.) |
| title | string | Page title |
| description | string | Meta description |
| language | string | Language code (e.g. en, zh-TW) |
| canonicalUrl | string | Canonical URL if specified |
| h1 | array | H1 heading texts |
| h2 | array | H2 heading texts |
| h3 | array | H3 heading texts |
| textContent | string | All paragraph text combined |
| paragraphCount | integer | Total paragraphs on page |
| linkCount | integer | Total links on page |
| links | array | Links with text and href (up to 50) |
| imageCount | integer | Total images on page |
| images | array | Images with src and alt (up to 20) |
| tableCount | integer | Number of tables |
| tables | array | Table data with headers and rows (up to 5) |
| ogTitle | string | Open Graph title |
| ogDescription | string | Open Graph description |
| ogImage | string | Open Graph image URL |
| jsonLd | array | JSON-LD structured data |
| scrapedAt | string | Extraction timestamp (ISO 8601) |
| error | string | Error message if extraction failed |
Output example
{
  "url": "https://example.com",
  "depth": 0,
  "title": "Example Domain",
  "description": "",
  "language": "en",
  "canonicalUrl": "",
  "h1": ["Example Domain"],
  "h2": [],
  "h3": [],
  "textContent": "This domain is for use in illustrative examples in documents.",
  "paragraphCount": 2,
  "linkCount": 1,
  "links": [
    { "text": "More information...", "href": "https://www.iana.org/domains/example" }
  ],
  "imageCount": 0,
  "images": [],
  "tableCount": 0,
  "tables": [],
  "ogTitle": "",
  "ogDescription": "",
  "ogImage": "",
  "jsonLd": [],
  "scrapedAt": "2026-03-09T12:00:00.000Z"
}
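Because every record has the same fields, records downloaded as JSON can be reduced to a compact summary table with a few lines of code. The sketch below is one possible approach using Python's standard `csv` module. The column selection, the " | " separator for H1 texts, and the `records_to_csv` name are illustrative choices, not part of the actor.

```python
import csv
import io

# Columns kept in the summary; nested arrays other than h1 are left out.
COLUMNS = ["url", "depth", "title", "h1", "paragraphCount",
           "linkCount", "imageCount", "tableCount", "scrapedAt"]


def records_to_csv(records):
    """Render records as CSV text, joining H1 texts into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {column: record.get(column, "") for column in COLUMNS}
        row["h1"] = " | ".join(record.get("h1", []))
        writer.writerow(row)
    return buffer.getvalue()
```

Passing a list containing the example record above would produce a header line followed by one row for `https://example.com`.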
Dataset views
The output includes three pre-configured views in the Apify Console:
- Overview — URL, depth, title, description, content counts, timestamp
- Content — URL, title, H1 headings, text content, OG title and image
- SEO Data — URL, title, meta description, canonical URL, OG metadata, language
Tips
- For sites that load content dynamically (infinite scroll), enable Scroll Page
- Use Crawl Depth to explore a site beyond the URLs you provide — depth 1 is usually enough for most sites
- Set Maximum Pages to control costs when crawling large sites
- Enable Force Browser Mode if a site requires JavaScript to display content — most sites work without it
- Pages that block automated access or return errors will still appear in the output with an error field
FAQ
Q: Why are some fields empty?
Not every page uses all HTML elements. A page without <p> tags will have empty textContent, and a page without <h1> will have an empty h1 array. This is normal behavior.
Q: Can I extract custom elements with CSS selectors?
This actor extracts a fixed set of fields from every page. For custom CSS selector extraction, consider using Apify's dedicated Cheerio Scraper or building a custom actor with Crawlee.
Q: Does this actor follow links?
Yes, if you set Crawl Depth to 1 or higher. With depth 0 (default), it only extracts content from the exact URLs you provide. Links are followed only within the same website domain.
Q: When should I enable Force Browser Mode?
Most websites work with the default fast extraction mode. Enable it only if you notice missing content on JavaScript-heavy websites that show blank pages without a browser.
Q: What happens if a page fails to load?
The record is still created with all fields set to empty/zero values and an error field describing the issue. This ensures every input URL produces output.
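Since failed pages stay in the dataset, downstream code usually needs to separate them from successful ones. A minimal sketch follows; it assumes that a missing or empty error value means the page was extracted successfully, and the `split_by_status` name is hypothetical.

```python
def split_by_status(records):
    """Separate successful records from failed ones using the error field."""
    succeeded, failed = [], []
    for record in records:
        # Assumption: a missing or empty error value means success.
        if record.get("error"):
            failed.append(record)
        else:
            succeeded.append(record)
    return succeeded, failed
```

The failed list can then be logged or retried, for example with Force Browser Mode enabled.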
Q: How does the page limit work with deep crawling?
The Maximum Pages limit applies to the total across all start URLs and all discovered links. For example, with 3 start URLs and a limit of 50, the actor will extract at most 50 pages total.
Disclaimer
This actor is intended for extracting content from publicly accessible web pages. Users are responsible for ensuring their use complies with applicable laws and the terms of service of the websites they access. The author is not liable for any misuse.
This actor does not bypass authentication, CAPTCHAs, or access restrictions. It does not use proxies or residential IP addresses.