Web Scraper

Deprecated

Pricing: $35.00 / 1,000 results

Rating: 5.0 (1)

Developer: Rush (Maintained by Community)

Actor stats

  • Bookmarked: 3
  • Total users: 33
  • Monthly active users: 0
  • Last modified: 3 months ago

Web Scraper — Extract Content from Any Website

Extract structured content from any public web page. Enter one or more URLs, optionally crawl deeper into the site, and receive clean, flat JSON records containing titles, headings, text, links, images, tables, Open Graph metadata, and JSON-LD structured data.


What this actor does

Give it a list of URLs — it visits each page and extracts:

  • Page metadata — title, meta description, language, canonical URL
  • Headings — H1, H2, H3 texts
  • Text content — all paragraph text combined into a single field
  • Links — up to 50 links with anchor text and destination URL
  • Images — up to 20 images with source URL and alt text
  • Tables — up to 5 tables with headers and row data
  • SEO data — Open Graph title, description, and image; JSON-LD structured data

Every URL produces one record with the same fixed set of fields, making the output easy to use in spreadsheets, databases, and automation tools like Make, Zapier, or n8n.
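
To make the field list above concrete, here is a minimal sketch of this kind of extraction using only Python's standard `html.parser`. It is not the actor's implementation: the `PageExtractor` class and `extract` function are hypothetical names, only a subset of the fields is covered, and the 50-link cap is taken from the list above.

```python
from html.parser import HTMLParser

MAX_LINKS = 50  # cap taken from the field list above


class PageExtractor(HTMLParser):
    """Collects the title, H1-H3 texts, paragraph texts, and links."""

    TRACKED = ("title", "h1", "h2", "h3", "p", "a")

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = {"h1": [], "h2": [], "h3": []}
        self.paragraphs = []
        self.links = []
        self._open = {}  # tag -> text pieces collected while the tag is open
        self._href = ""

    def handle_starttag(self, tag, attrs):
        if tag not in self.TRACKED:
            return
        if tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return  # anchors without a destination are skipped in this sketch
        self._finish(tag)  # tolerate an unclosed earlier tag of the same kind
        if tag == "a":
            self._href = href
        self._open[tag] = []

    def handle_data(self, data):
        # Text inside <p><a>...</a></p> counts for both the paragraph and the link.
        for pieces in self._open.values():
            pieces.append(data)

    def handle_endtag(self, tag):
        self._finish(tag)

    def _finish(self, tag):
        if tag not in self._open:
            return
        text = " ".join("".join(self._open.pop(tag)).split())
        if tag == "title":
            self.title = text
        elif tag in self.headings:
            self.headings[tag].append(text)
        elif tag == "p":
            self.paragraphs.append(text)
        else:  # tag == "a"
            self.links.append({"text": text, "href": self._href})


def extract(html):
    """Return a partial record with the same field names as the actor's output."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    for tag in list(parser._open):  # flush anything left unclosed
        parser._finish(tag)
    return {
        "title": parser.title,
        "h1": parser.headings["h1"],
        "h2": parser.headings["h2"],
        "h3": parser.headings["h3"],
        "textContent": " ".join(p for p in parser.paragraphs if p),
        "paragraphCount": len(parser.paragraphs),
        "linkCount": len(parser.links),
        "links": parser.links[:MAX_LINKS],
    }
```

Note how `linkCount` reports every link found while `links` is truncated, which mirrors the "up to 50" wording above.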

Deep crawling

Set Crawl Depth to a value greater than 0 to discover and extract pages linked within the same website. At depth 0, only the URLs you provide are extracted; depth 1 also extracts the pages they link to, depth 2 the pages linked from those, and so on. Use Maximum Pages to cap the total number of pages extracted.
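
The interaction between Crawl Depth and Maximum Pages can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not the actor's code: `plan_crawl` and the `links_on` callback (URL in, list of linked URLs out) are hypothetical, and "same website" is approximated here by comparing hostnames.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_urls, links_on, crawl_depth=0, max_pages=50):
    """Return (url, depth) pairs in breadth-first visiting order.

    links_on is a hypothetical callable: url -> list of URLs linked from it.
    """
    queue = deque()
    seen = set()
    for url in start_urls:  # start URLs sit at depth 0
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    visited = []
    while queue and len(visited) < max_pages:  # one shared page budget
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= crawl_depth:
            continue  # depth limit reached: do not follow links from this page
        host = urlparse(url).hostname
        for link in links_on(url):
            # Assumption: "same website" means the same hostname.
            if link not in seen and urlparse(link).hostname == host:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With `crawl_depth=0` the function returns only the start URLs, and `max_pages` bounds the total across all start URLs and discovered links.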

Getting started

  1. Add URLs — paste one or more web page URLs into the input field
  2. Set crawl depth (optional) — increase to discover linked pages within the same site
  3. Run the actor — click Start and wait for extraction to finish
  4. Download results — export as JSON, CSV, or Excel from the Dataset tab

The actor works out of the box with its default settings; most use cases need no further configuration.
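
The Dataset tab already exports CSV, so the following is only relevant if you download JSON and post-process it yourself. It is a small sketch, with the hypothetical helper `records_to_csv`, of how the list-valued fields (such as `h1` or `links`) could be kept in a single spreadsheet cell by storing them as JSON strings.

```python
import csv
import io
import json


def records_to_csv(records):
    """Serialize records to CSV text; list and dict values become JSON strings."""
    if not records:
        return ""
    # Union of keys in first-seen order, so an "error" column appears
    # even if only some records carry it.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()
        })
    return out.getvalue()
```

Reading a cell back with `json.loads` restores the original list.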

Input fields

| Field | Required | Default | Description |
|---|---|---|---|
| URLs to Scrape | Yes | - | One or more web page URLs to extract content from |
| Maximum Pages | No | 50 | Total page limit across all URLs and discovered links |
| Crawl Depth | No | 0 | How many levels of links to follow (0 = only provided URLs) |
| Scroll Page | No | Off | Scroll each page before extracting (for sites that load more content as you scroll) |
| Force Browser Mode | No | Off | Always use a full browser instead of fast extraction (for JavaScript-heavy sites) |
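
The defaults in the table can be captured in a small helper that builds a run input. This page lists only the display names of the fields, so the JSON keys below (`urls`, `maxPages`, `crawlDepth`, `scrollPage`, `forceBrowser`) are hypothetical placeholders; check the actor's input schema for the real keys before using anything like this.

```python
# Hypothetical key names; only the default values come from the table above.
DEFAULTS = {
    "maxPages": 50,
    "crawlDepth": 0,
    "scrollPage": False,    # "Off" in the table
    "forceBrowser": False,  # "Off" in the table
}


def build_input(urls, **overrides):
    """Merge user overrides onto the documented defaults and validate them."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one URL is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"urls": urls, **DEFAULTS, **overrides}
    if run_input["maxPages"] < 1:
        raise ValueError("maxPages must be at least 1")
    if run_input["crawlDepth"] < 0:
        raise ValueError("crawlDepth cannot be negative")
    return run_input
```

For example, `build_input(["https://example.com"], crawlDepth=1)` keeps the page limit at 50 and only raises the depth.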

Output fields

Each extracted page produces a record with these fields:

| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| depth | integer | Crawl depth (0 = start URL, 1 = linked from start, etc.) |
| title | string | Page title |
| description | string | Meta description |
| language | string | Language code (e.g. en, zh-TW) |
| canonicalUrl | string | Canonical URL if specified |
| h1 | array | H1 heading texts |
| h2 | array | H2 heading texts |
| h3 | array | H3 heading texts |
| textContent | string | All paragraph text combined |
| paragraphCount | integer | Total paragraphs on page |
| linkCount | integer | Total links on page |
| links | array | Links with text and href (up to 50) |
| imageCount | integer | Total images on page |
| images | array | Images with src and alt (up to 20) |
| tableCount | integer | Number of tables |
| tables | array | Table data with headers and rows (up to 5) |
| ogTitle | string | Open Graph title |
| ogDescription | string | Open Graph description |
| ogImage | string | Open Graph image URL |
| jsonLd | array | JSON-LD structured data |
| scrapedAt | string | Extraction timestamp (ISO 8601) |
| error | string | Error message if extraction failed |
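
Because every record has the same shape, a downstream pipeline can verify it cheaply before loading data. The sketch below encodes the table above as a type map; `check_record` is a hypothetical helper, and treating `error` as optional is an assumption based on it appearing only when extraction fails.

```python
# Field names and types as listed in the table above.
EXPECTED_TYPES = {
    "url": str, "depth": int, "title": str, "description": str,
    "language": str, "canonicalUrl": str,
    "h1": list, "h2": list, "h3": list,
    "textContent": str, "paragraphCount": int,
    "linkCount": int, "links": list,
    "imageCount": int, "images": list,
    "tableCount": int, "tables": list,
    "ogTitle": str, "ogDescription": str, "ogImage": str,
    "jsonLd": list, "scrapedAt": str,
}
OPTIONAL_TYPES = {"error": str}  # assumption: present only on failed pages


def check_record(record):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    for field, expected in OPTIONAL_TYPES.items():
        if field in record and not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems
```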

Output example

{
  "url": "https://example.com",
  "depth": 0,
  "title": "Example Domain",
  "description": "",
  "language": "en",
  "canonicalUrl": "",
  "h1": ["Example Domain"],
  "h2": [],
  "h3": [],
  "textContent": "This domain is for use in illustrative examples in documents.",
  "paragraphCount": 2,
  "linkCount": 1,
  "links": [{ "text": "More information...", "href": "https://www.iana.org/domains/example" }],
  "imageCount": 0,
  "images": [],
  "tableCount": 0,
  "tables": [],
  "ogTitle": "",
  "ogDescription": "",
  "ogImage": "",
  "jsonLd": [],
  "scrapedAt": "2026-03-09T12:00:00.000Z"
}

Dataset views

The output includes three pre-configured views in the Apify Console:

  • Overview — URL, depth, title, description, content counts, timestamp
  • Content — URL, title, H1 headings, text content, OG title and image
  • SEO Data — URL, title, meta description, canonical URL, OG metadata, language
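
If you work with downloaded JSON rather than the Console, the three views can be approximated by projecting each record onto a subset of fields. The mapping below is my reading of the view descriptions onto the output field names (for instance, "content counts" is assumed to mean the four `*Count` fields); the `VIEWS` table and `project` function are hypothetical.

```python
# Assumed mapping from the view descriptions above to output field names.
VIEWS = {
    "overview": ["url", "depth", "title", "description", "paragraphCount",
                 "linkCount", "imageCount", "tableCount", "scrapedAt"],
    "content": ["url", "title", "h1", "textContent", "ogTitle", "ogImage"],
    "seo": ["url", "title", "description", "canonicalUrl",
            "ogTitle", "ogDescription", "ogImage", "language"],
}


def project(record, view):
    """Return only the fields that belong to the named view."""
    return {field: record.get(field) for field in VIEWS[view]}
```

For example, `project(record, "content")` yields a six-field dictionary suitable for a content audit, leaving out counts and SEO metadata.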

Tips

  • For sites that load content dynamically (infinite scroll), enable Scroll Page
  • Use Crawl Depth to explore a site beyond the URLs you provide — depth 1 is usually enough for most sites
  • Set Maximum Pages to control costs when crawling large sites
  • Enable Force Browser Mode if a site requires JavaScript to display content — most sites work without it
  • Pages that block automated access or return errors will still appear in the output with an error field
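
As a rough illustration of how Maximum Pages bounds cost, the listed price of $35.00 per 1,000 results can be turned into an upper estimate. This sketch assumes each extracted page is billed as exactly one result and that nothing else is charged, which this page does not confirm; `max_cost` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_1000_RESULTS = Decimal("35.00")  # listed price on this page


def max_cost(max_pages):
    """Upper bound in USD for a run, assuming one billed result per page."""
    return (PRICE_PER_1000_RESULTS * max_pages / 1000).quantize(Decimal("0.01"))
```

Under these assumptions, the default limit of 50 pages caps a run at `max_cost(50)`, which is $1.75.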

FAQ

Q: Why are some fields empty?
A: Not every page uses all HTML elements. A page without <p> tags will have empty textContent, and a page without <h1> will have an empty h1 array. This is normal behavior.

Q: Can I extract custom elements with CSS selectors?
A: This actor extracts a fixed set of fields from every page. For custom CSS selector extraction, consider using Apify's dedicated Cheerio Scraper or building a custom actor with Crawlee.

Q: Does this actor follow links?
A: Yes, if you set Crawl Depth to 1 or higher. With depth 0 (default), it only extracts content from the exact URLs you provide. Links are followed only within the same website domain.

Q: When should I enable Force Browser Mode?
A: Most websites work with the default fast extraction mode. Enable it only if you notice missing content on JavaScript-heavy websites that show blank pages without a browser.

Q: What happens if a page fails to load?
A: The record is still created with all fields set to empty/zero values and an error field describing the issue. This ensures every input URL produces output.

Q: How does the page limit work with deep crawling?
A: The Maximum Pages limit applies to the total across all start URLs and all discovered links. For example, with 3 start URLs and a limit of 50, the actor will extract at most 50 pages total.
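
Since failed pages still produce a record, downstream code usually needs to separate them from successful ones. A minimal sketch, assuming a failure is signalled by a non-empty `error` field (successful records may omit the field or leave it empty); `split_by_error` is a hypothetical helper.

```python
def split_by_error(records):
    """Partition records into (succeeded, failed) based on the error field."""
    succeeded, failed = [], []
    for record in records:
        # A missing or empty error field is treated as success.
        (failed if record.get("error") else succeeded).append(record)
    return succeeded, failed
```

The URLs of failed records, `[r["url"] for r in failed]`, can then be fed into a later run, for example with Force Browser Mode enabled.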


Disclaimer

This actor is intended for extracting content from publicly accessible web pages. Users are responsible for ensuring their use complies with applicable laws and the terms of service of the websites they access. The author is not liable for any misuse.

This actor does not bypass authentication, CAPTCHAs, or access restrictions. It does not use proxies or residential IP addresses.

