TheCrawler — Scrape Everything from Any Page

Scrape any webpage with JS rendering (Playwright) or fast HTTP (Cheerio). Extract text, links, images, meta, headings, tables, JSON-LD, emails, phones, OG/Twitter cards, social links. LLM-ready markdown with heading-aware chunking. CSS selectors, recursive crawling, URL filtering. $0.003/page.


Developer: Manchitt Sanan (maintained by Community)

Universal Web Scraper — Extract Everything from Any Page

Scrape any webpage and extract every data point: text content, links, images, meta tags, headings (h1-h6), HTML tables, JSON-LD structured data, email addresses, and phone numbers. CSS selector targeting for specific content. Recursive crawling to follow internal links. $0.003/page.


What it extracts per page

| Data | Description |
| --- | --- |
| Text | All visible text (scripts/styles stripped), up to 50K chars |
| Links | Every `<a>` tag: href, anchor text, internal/external flag |
| Images | Every `<img>`: src, alt text, width, height |
| Meta tags | All `<meta>`: description, og:title, keywords, robots, etc. |
| Headings | All h1-h6 with level and text |
| Tables | HTML tables as structured arrays (headers + rows) |
| JSON-LD | Schema.org structured data from `<script type="application/ld+json">` |
| Emails | Email addresses found anywhere in the HTML |
| Phones | Phone numbers (7+ digits) found in the HTML |
| Selected | Content matching your CSS selector |

Every extraction type can be toggled on/off.
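
To make the "headers + rows" shape of the Tables output concrete, here is a minimal Python sketch using only the standard library. It is not the actor's implementation (which is built on Cheerio); the class name `TableParser`, the helper `extract_tables`, and the exact output keys are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as {"headers": [...], "rows": [[...], ...]}.

    Hypothetical illustration only; output keys are assumed, not documented.
    """

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None             # cells of the <tr> currently being read
        self._cell = None            # text fragments of the current <th>/<td>
        self._is_header_row = False  # True once the current row contains a <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append({"headers": [], "rows": []})
        elif tag == "tr" and self.tables:
            self._row = []
            self._is_header_row = False
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []
            if tag == "th":
                self._is_header_row = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            table = self.tables[-1]
            # The first row containing <th> cells becomes the header row.
            if self._is_header_row and not table["headers"]:
                table["headers"] = self._row
            else:
                table["rows"].append(self._row)
            self._row = None


def extract_tables(html):
    """Return a list of tables found in the given HTML string."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables
```

This sketch makes no attempt to handle nested tables or `rowspan`/`colspan`, so it flattens structure in much the same way the Limitations section warns about.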


Quick start

Scrape a single page:

{
  "urls": ["https://example.com"]
}

Crawl a site (follow links):

{
  "urls": ["https://example.com"],
  "maxDepth": 2,
  "maxPages": 50
}

Target specific content:

{
  "urls": ["https://example.com"],
  "cssSelector": ".main-content"
}
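
One plausible reading of how `maxDepth` and `maxPages` interact in the crawl example is a breadth-first traversal that follows internal links only. The Python sketch below simulates that on an in-memory link map instead of real HTTP requests. The function `crawl_order` and the `links` dictionary are hypothetical, and the actor's actual queueing order may differ.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_urls, links, max_depth=0, max_pages=100):
    """Return the URLs that would be visited, in visiting order.

    `links` is a hypothetical map {url: [outgoing urls]} standing in for
    the links a real fetch of each page would discover.
    """
    allowed_hosts = {urlparse(u).netloc for u in start_urls}
    queue = deque((u, 0) for u in start_urls)
    seen = set(start_urls)
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            # Follow internal links only, and never enqueue a URL twice.
            if urlparse(target).netloc in allowed_hosts and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With `max_depth=0` only the listed URLs are visited, matching the default described in the Input table; raising `max_depth` widens the crawl until `max_pages` cuts it off.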

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | array | (required) | URLs to scrape |
| extractText | boolean | true | Visible text content |
| extractLinks | boolean | true | All links with anchor text |
| extractImages | boolean | true | All images with alt/dimensions |
| extractMeta | boolean | true | Meta tags |
| extractHeadings | boolean | true | h1-h6 headings |
| extractTables | boolean | true | HTML tables as arrays |
| extractStructuredData | boolean | true | JSON-LD schema.org data |
| extractEmails | boolean | true | Email addresses |
| extractPhones | boolean | true | Phone numbers |
| cssSelector | string | (optional) | Target specific element |
| maxDepth | integer | 0 | 0 = listed URLs only; 1+ = follow links |
| maxPages | integer | 100 | Max pages to scrape in total |
| dryRun | boolean | false | Scrape without charges |
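
If you generate inputs from code, a small helper can apply the documented defaults and catch misspelled field names before a run starts. This is a sketch in Python: the field names and defaults come from the table above, while `build_input` and `DEFAULTS` are hypothetical names and not part of the actor.

```python
import json

# Defaults copied from the Input table; the helper itself is hypothetical.
DEFAULTS = {
    "extractText": True,
    "extractLinks": True,
    "extractImages": True,
    "extractMeta": True,
    "extractHeadings": True,
    "extractTables": True,
    "extractStructuredData": True,
    "extractEmails": True,
    "extractPhones": True,
    "maxDepth": 0,
    "maxPages": 100,
    "dryRun": False,
}
OPTIONAL_FIELDS = {"cssSelector"}  # optional, no default


def build_input(urls, **overrides):
    """Merge user overrides onto the documented defaults."""
    if not urls:
        raise ValueError("urls is required")
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {"urls": list(urls), **DEFAULTS, **overrides}


if __name__ == "__main__":
    # Links-only crawl, two levels deep, without charges.
    run_input = build_input(
        ["https://example.com"],
        extractText=False,
        extractImages=False,
        maxDepth=2,
        dryRun=True,
    )
    print(json.dumps(run_input, indent=2))
```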

Pricing

$0.003 per page scraped (pay-per-event pricing).

  • Errors and dry runs are never charged.
  • 100 pages = $0.30
  • 1,000 pages = $3.00
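
The arithmetic behind these figures can be sketched in a few lines of Python. The function name `estimate_cost` is hypothetical, and it assumes that only successfully scraped pages outside dry-run mode count as billable, as the bullets above suggest.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.003")  # USD per page, from the pricing section


def estimate_cost(successful_pages, dry_run=False):
    """Estimate the charge in USD for a run.

    Assumes failed pages are already excluded from `successful_pages`
    and that dry runs are free.
    """
    if dry_run or successful_pages <= 0:
        return Decimal("0")
    return PRICE_PER_PAGE * successful_pages


if __name__ == "__main__":
    for pages in (100, 1000):
        print(f"{pages} pages = ${estimate_cost(pages):.2f}")
```

`Decimal` is used instead of `float` so that small per-page prices add up without rounding drift.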

Performance

  • Uses CheerioCrawler — pure HTTP, no headless browser
  • Fast: 100-500 pages/minute depending on target site
  • Low memory: 256MB handles most scraping jobs

Limitations

  • No JavaScript rendering. This scraper reads the initial HTML response. Content injected by JavaScript (React, Vue, Angular SPAs) won't be captured. For JS-heavy sites, use a Playwright-based scraper.
  • Email/phone extraction uses regex — may include false positives from code snippets or malformed patterns.
  • Tables are extracted as flat text arrays. Complex nested tables may not parse correctly.
  • Rate limiting. Crawlee handles basic rate limiting, but aggressive crawling may trigger bot protection.
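
To illustrate why regex-based email and phone extraction produces false positives, the Python sketch below runs two simple patterns over a small HTML fragment. Both patterns are assumptions made for illustration; the actor's real expressions are not documented here.

```python
import re

# Hypothetical patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{5,}\d")


def find_emails(html):
    """Return unique email-like strings, sorted alphabetically."""
    return sorted(set(EMAIL_RE.findall(html)))


def find_phones(html):
    """Return phone-like strings that contain at least 7 digits."""
    return [
        match.strip()
        for match in PHONE_RE.findall(html)
        if sum(ch.isdigit() for ch in match) >= 7
    ]


if __name__ == "__main__":
    sample = (
        "<p>Contact: sales@example.com or +1 (555) 010-9999</p>"
        '<img src="logo@2x.png">'
        "<span>Order 20240115123456</span>"
    )
    print(find_emails(sample))  # includes the image filename logo@2x.png
    print(find_phones(sample))  # includes the order number
```

In this sample the retina image filename is picked up as an email and the order number as a phone, so downstream code should validate matches (for example against known TLDs or expected phone formats) before relying on them.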