Crawl4ai

Extract page content (markdown/HTML/text), metadata, and link stats. Uses crawl4ai.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Kael Odin (Maintained by Community)
Actor stats: 1 bookmarked · 2 total users · 1 monthly active user · last modified 5 days ago

Website Content Extractor

Apify Actor: extract page content (markdown/HTML/text), metadata, and link stats. Uses crawl4ai.

Quick start

```shell
pip install -e ".[dev]"
crawl4ai-setup
python -m crawl4ai_actor.main
```

Input: `startUrls` (required), `maxPages`, `maxDepth`, `waitUntil`, `waitForSelector`, `cssSelector`, etc. Full schema: `.actor/input_schema.json`.

Output: dataset with `url`, `success`, `content`, `title`, `content_length`, `links_internal_count`, etc. Run summary in Storage → Key-value store (`runSummary`), including `failedUrls` for retries.
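The `failedUrls` list in `runSummary` can seed a follow-up run that retries only the pages that failed. A minimal sketch, assuming `runSummary` is a plain dict with a `failedUrls` array of URL strings (field names as described above; the exact record shape may differ):

```python
def build_retry_input(run_summary, base_input=None):
    """Assemble an Actor input that re-crawls only the URLs that failed."""
    failed = run_summary.get("failedUrls", [])
    # Deduplicate while preserving order, in case a URL failed more than once.
    seen = set()
    start_urls = [u for u in failed if not (u in seen or seen.add(u))]
    retry_input = dict(base_input or {})  # carry over the original run options
    retry_input["startUrls"] = start_urls
    return retry_input
```

Passing the original run input as `base_input` keeps options such as `waitUntil` consistent between the first run and the retry.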

Options (high level)

| Option | Purpose |
| --- | --- |
| `crawlMode` | `full` (default) \| `discover_only`: returns URLs + links only, no content |
| `includeLinkUrls` | Include `links_internal` / `links_external` arrays in each item |
| `waitUntil` | `domcontentloaded` \| `load` \| `networkidle` (SPA/slow sites) |
| `pageLoadWaitSecs` | Extra delay before capture |
| `waitForSelector` | Wait for a CSS selector (or `css:`/`js:` prefix) |
| `cssSelector` | Extract only this region (e.g. `main`, `.article`) |
| `virtualScrollSelector` | Infinite-scroll container to expand |
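The options above can be assembled into a run input programmatically. A minimal sketch (keys taken from the table; the chosen defaults are assumptions for illustration, not the Actor's authoritative defaults):

```python
def make_run_input(start_urls, *, crawl_mode="full", wait_until="domcontentloaded",
                   page_load_wait_secs=0, css_selector=None, include_link_urls=False):
    """Build an input dict for the Actor from the high-level options."""
    run_input = {
        "startUrls": list(start_urls),
        "crawlMode": crawl_mode,
        "waitUntil": wait_until,
        "includeLinkUrls": include_link_urls,
    }
    # Only emit optional keys when set, so the schema defaults apply otherwise.
    if page_load_wait_secs:
        run_input["pageLoadWaitSecs"] = page_load_wait_secs
    if css_selector:
        run_input["cssSelector"] = css_selector
    return run_input
```

For example, `make_run_input(["https://..."], wait_until="networkidle", page_load_wait_secs=2)` reproduces the SPA/slow-site example below.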

Example (SPA / slow site):

```json
{ "startUrls": ["https://..."], "waitUntil": "networkidle", "pageLoadWaitSecs": 2 }
```

Example (discover links only):

```json
{ "startUrls": ["https://..."], "crawlMode": "discover_only", "maxPages": 100 }
```
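A `discover_only` run can seed a later full crawl: collect the internal links from the dataset items, then pass them as `startUrls` of a second run. A sketch, assuming items carry `url`, `success`, and a `links_internal` array of URL strings (the arrays require `includeLinkUrls: true`):

```python
def internal_links_from_items(items):
    """Collect unique internal link URLs from discover_only dataset items."""
    seen = set()
    links = []
    for item in items:
        if not item.get("success"):
            continue  # skip pages that failed to load
        for url in item.get("links_internal", []):
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links
```

The returned list preserves discovery order, which keeps the second run's crawl order close to the site's own link structure.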

Run locally / Docker

```shell
docker build -t website-content-extractor .
```

Regression

```shell
UX_MATRIX_GROUP=core python scripts/ux_matrix.py
```

Reports are written to `scripts/ux_matrix_output.json` and `scripts/ux_matrix_report.txt` (both gitignored).