JS Content Crawler Lite for RAG

Pricing

Pay per usage

Extract clean Markdown, text, metadata, links, and diagnostics from web pages. Static-first for low cost, Browserless rendering only when needed.

Developer

George Kioko

Maintained by Community

Last modified

5 days ago

Extract clean Markdown, text, metadata, links, headings, images, and quality diagnostics from URL lists. The actor is static-first to keep per-page cost low and uses Browserless only when JavaScript rendering is required.

Why this actor

Many RAG crawlers run a full browser for every page. That is expensive and slow. This actor tries normal HTTP extraction first, checks content quality, then escalates to Browserless only when renderMode requires it or static output is blocked or too thin.
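
The escalation rule described above can be sketched as a small decision function. This is only an illustration: StaticProbe and shouldEscalate are hypothetical names, and treating "too thin" as a word count below minWords is an assumption, not something this page confirms about the actor's internals.

```typescript
// Hypothetical sketch of the static-first escalation rule; not the actor's real code.
type RenderMode = "auto" | "never" | "always";

interface StaticProbe {
  blocked: boolean;   // static response looked like a block or challenge page
  wordCount: number;  // visible words extracted from the static HTML
}

function shouldEscalate(mode: RenderMode, probe: StaticProbe, minWords: number): boolean {
  if (mode === "never") return false;  // rendering is disabled by the caller
  if (mode === "always") return true;  // renderMode requires rendering
  // "auto": escalate only when the static output is blocked or too thin
  // (assumption: "too thin" means fewer visible words than minWords).
  return probe.blocked || probe.wordCount < minWords;
}
```

Under these assumptions, a 200-word static page in auto mode stays on the cheaper static lane, while a 10-word shell page is handed to Browserless.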

Inputs

  • urls: Required list of HTTP or HTTPS URLs. Duplicates are processed once.
  • renderMode: auto, never, or always. Default is auto.
  • minWords: Minimum visible words for content_ready. Default is 80.
  • timeoutMs: Per-request timeout. Default is 30000.
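
As a rough illustration of how the inputs and their documented defaults fit together, the sketch below builds a complete input object on the caller's side. CrawlerInput and withDefaults are hypothetical names, and the exact-string deduplication is an assumption; the actor may normalize URLs before comparing them.

```typescript
// Hypothetical client-side helper; field names and defaults come from the list above.
interface CrawlerInput {
  urls: string[];
  renderMode?: "auto" | "never" | "always";
  minWords?: number;
  timeoutMs?: number;
}

function withDefaults(input: CrawlerInput): Required<CrawlerInput> {
  return {
    // Assumption: duplicates are detected by exact string match.
    urls: input.urls.filter((url, i, all) => all.indexOf(url) === i),
    renderMode: input.renderMode ?? "auto",
    minWords: input.minWords ?? 80,
    timeoutMs: input.timeoutMs ?? 30000,
  };
}
```

Passing only urls therefore yields renderMode "auto", minWords 80, and timeoutMs 30000.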

Output fields

  • url, finalUrl, title, description
  • markdown, text, wordCount
  • headings, links, images
  • sourceLane: static, browserless, or none
  • qualityState: content_ready, blocked, low_content, fetch_failed, render_failed, render_unavailable, or invalid_url
  • qualityReasons: Diagnostic reason codes
  • billingState, chargedEvent
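
For RAG ingestion, one plausible way to consume these rows is to keep only those marked content_ready. The sketch below types a subset of the fields listed above; CrawlRow and ragDocuments are hypothetical names, and the field types (for example, markdown as an optional string) are assumptions, since this page does not document the exact schema.

```typescript
// Hypothetical consumer-side types covering a subset of the output fields.
type QualityState =
  | "content_ready" | "blocked" | "low_content" | "fetch_failed"
  | "render_failed" | "render_unavailable" | "invalid_url";

interface CrawlRow {
  url: string;
  qualityState: QualityState;
  sourceLane: "static" | "browserless" | "none";
  markdown?: string;          // assumption: may be absent on diagnostic rows
  qualityReasons?: string[];  // assumption: list of reason-code strings
}

// Keep only rows that are ready for indexing and carry Markdown content.
function ragDocuments(rows: CrawlRow[]): { id: string; text: string }[] {
  return rows
    .filter((row) => row.qualityState === "content_ready" && !!row.markdown)
    .map((row) => ({ id: row.url, text: row.markdown as string }));
}
```

Rows in any other qualityState are skipped here, but their qualityReasons remain available for debugging why a page was not usable.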

Pricing events

  • page-extracted: $0.005 per successful static extraction
  • js-page-rendered: $0.015 per successful Browserless-rendered extraction

Failed, blocked, invalid, and low-quality rows are pushed with diagnostics but are not billed.
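
The two event prices give a quick way to estimate the cost of a run from its output rows. The sketch below assumes each row's chargedEvent holds one of the event names above, or is empty for unbilled rows; that is a guess about the field's contents, and priceMills and estimateCostUsd are hypothetical helpers.

```typescript
// Prices are kept in thousandths of a dollar so the sum stays in integers.
function priceMills(event: string | null | undefined): number {
  switch (event) {
    case "page-extracted": return 5;     // $0.005 per static extraction
    case "js-page-rendered": return 15;  // $0.015 per rendered extraction
    default: return 0;                   // unbilled or unknown rows cost nothing
  }
}

function estimateCostUsd(chargedEvents: (string | null | undefined)[]): number {
  let mills = 0;
  for (const event of chargedEvents) {
    mills += priceMills(event);
  }
  return mills / 1000;
}
```

For example, 100 static extractions and 20 rendered ones come to 100 × 5 + 20 × 15 = 800 thousandths of a dollar, or $0.80.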

Browserless setup

Set these environment variables when rendering is needed:

BROWSERLESS_BASE_URL=http://127.0.0.1:3000
BROWSERLESS_TOKEN=your-token

If Browserless is not configured and a page needs rendering, the actor emits a diagnostic row and does not charge a rendered event.

Local smoke test

npm install
apify run -i '{"urls":["https://example.com"],"renderMode":"never","minWords":10}'

Browserless smoke test (the environment variables below use PowerShell syntax):

$env:BROWSERLESS_BASE_URL="http://127.0.0.1:3000"
$env:BROWSERLESS_TOKEN="your-token"
apify run -i '{"urls":["https://example.com"],"renderMode":"always","minWords":10}'