Compliance-Grade Web Intelligence for AI Agents
Pricing
from $2.00 / 1,000 reasoning packs
The scraper that AI agents trust. Extract grounded facts with citations, entities, claims, and RAG chunks. Built for LangChain, LlamaIndex, and AutoGPT. Quality scoring, auto-citations, 6 task modes.
The scraper that AI agents trust. Extract grounded facts with source citations, not hallucinated summaries.
🚨 Colorado SB 25B-004: AI Compliance Deadline June 30, 2026
Colorado Senate Bill 25B-004 establishes first-in-nation AI transparency and accountability requirements. Effective June 30, 2026, organizations deploying AI systems must:
- Document data sources used by AI agents
- Maintain audit trails for AI-generated decisions
- Provide source citations when AI systems use web-scraped data
- Enable human review of AI outputs with traceable provenance
This actor is purpose-built for SB 25B-004 compliance:
| Requirement | How This Actor Helps |
|---|---|
| Source Documentation | Every extraction includes sourceBlockId + exactQuote |
| Audit Trails | Full provenance object with timestamps, hashes, status codes |
| Citation Generation | Auto-generated APA, MLA, Chicago, and inline citations |
| Human Review | Quality scores and summaries enable efficient oversight |
| Change Monitoring | contentHash and materialityScore track source changes |
Denver-based businesses: Get ahead of compliance before June 30th. This actor creates the audit evidence your AI systems need.
Why AI Agents Need This Scraper
Traditional scrapers return raw HTML or unstructured text. AI agents need grounded intelligence:
| Traditional Scraper | Compliance Web Intel |
|---|---|
| Raw HTML dump | Clean markdown + semantic blocks |
| No source tracking | Every fact has sourceBlockId + exact quote |
| Single output format | Reasoning Pack with 6 structured components |
| No change detection | Content hashing + materiality scoring |
| Hallucination-prone | Citation-ready extractions |
Built for: LangChain, LlamaIndex, AutoGPT, CrewAI, n8n AI nodes, custom RAG pipelines
What You Get: The Reasoning Pack
Every URL produces a Reasoning Pack - a structured bundle optimized for AI agent consumption:
```
ReasoningPack
├── content/              # Clean, structured content
│   ├── markdown          # Readability-processed markdown
│   ├── outline[]         # H1→H6 heading tree
│   └── blocks[]          # Semantic blocks with IDs
│
├── extraction/           # Grounded intelligence
│   ├── facts[]           # Statements with sourceBlockId + exactQuote
│   ├── entities[]        # People, orgs, products, locations, money, dates
│   ├── claims[]          # Marketing claims, guarantees, compliance statements
│   ├── pricing[]         # Prices with tier names and periods
│   └── contactInfo       # Emails, phones, addresses, social links
│
├── schema/               # Structured data detection
│   ├── detectedType      # Article, Product, LocalBusiness, FAQ, etc.
│   ├── confidence        # 0-1 detection confidence
│   └── normalizedFields  # Title, author, price, rating, etc.
│
├── monitoring/           # Change detection
│   ├── contentHash       # SHA-256 for deduplication
│   └── materialityScore  # 0-1 significance of changes
│
├── provenance/           # Full audit trail
│   ├── htmlHash          # Original HTML fingerprint
│   ├── fetchedAt         # ISO timestamp
│   ├── statusCode        # HTTP response code
│   └── renderMode        # static/javascript/browser
│
└── chunks[]              # RAG-ready segments
    ├── text              # 500-900 token chunks
    ├── tokenCount        # Exact token count
    └── metadata          # URL, section, blockIds, position
```
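As a sketch of how an agent might consume a pack, the snippet below filters facts by confidence and builds citation strings. Field names follow the pack layout; the pack dict itself is a hypothetical, trimmed stand-in for a real dataset item.

```python
# A hypothetical Reasoning Pack item, trimmed to the fields used below
# (field names follow the pack layout; values are illustrative).
pack = {
    "url": "https://example.com/pricing",
    "extraction": {
        "facts": [
            {
                "statement": "Company processes 1 million requests daily",
                "sourceBlockId": "blk_14_def456",
                "exactQuote": "Our platform handles over 1 million API requests every day...",
                "confidence": 0.92,
            },
            {
                "statement": "Founded in 2015",
                "sourceBlockId": "blk_2_aaa111",
                "exactQuote": "Since 2015 we have...",
                "confidence": 0.55,
            },
        ]
    },
}

# Keep only facts confident enough for the agent to cite.
grounded = [f for f in pack["extraction"]["facts"] if f["confidence"] >= 0.8]
citations = [
    f'{f["statement"]} (source: {pack["url"]}#{f["sourceBlockId"]})' for f in grounded
]
```

The 0.8 threshold is an arbitrary choice for the sketch; pick whatever cutoff suits your agent's risk tolerance.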
Task Modes: Pre-configured for Common AI Workflows
1. competitor_teardown
Best for: Competitive intelligence agents, market research bots
Focuses on: pricing pages, features, about, testimonials, comparison pages
Output includes:
- Pricing tiers with features
- Marketing claims with citations
- Positioning statements
- Competitor differentiators
{"taskMode": "competitor_teardown","startUrls": [{"url": "https://competitor.com"}],"maxPages": 20}
2. compliance_discovery
Best for: AI governance agents, policy monitoring bots, legal research
Focuses on: privacy policy, terms of service, legal notices, accessibility statements
Output includes:
- GDPR/CCPA compliance claims
- Data handling statements
- Legal disclaimers with exact quotes
- Policy change detection
{"taskMode": "compliance_discovery","startUrls": [{"url": "https://company.com/privacy"}],"alertWebhook": "https://your-webhook.com/policy-changes"}
3. local_seo_audit
Best for: Local SEO agents, citation building bots, GEO optimization
Focuses on: contact pages, about, locations, services, reviews
Output includes:
- NAP (Name, Address, Phone) extraction
- Schema.org markup detection
- Service area identification
- Business hours parsing
{"taskMode": "local_seo_audit","startUrls": [{"url": "https://local-business.com"}],"maxPages": 10}
4. sales_research
Best for: Account research agents, sales enablement bots
Focuses on: about, team, leadership, news, press, careers, case studies
Output includes:
- Company signals (hiring, funding, expansion)
- Key personnel extraction
- Technology stack indicators
- Recent news with citations
{"taskMode": "sales_research","startUrls": [{"url": "https://prospect-company.com"}],"maxPages": 15}
5. docs_extraction
Best for: Documentation-to-API agents, knowledge base builders
Focuses on: docs, API references, guides, tutorials, help articles
Output includes:
- Structured procedure extraction
- Code example identification
- Parameter documentation
- Step-by-step instructions
{"taskMode": "docs_extraction","startUrls": [{"url": "https://docs.example.com"}],"maxPages": 50}
6. pricing_intelligence
Best for: Price monitoring agents, market analysis bots
Focuses on: pricing pages, plans, packages, enterprise quotes
Output includes:
- Tier-by-tier pricing breakdown
- Feature comparisons
- Discount indicators
- Enterprise contact triggers
{"taskMode": "pricing_intelligence","startUrls": [{"url": "https://saas-company.com/pricing"}],"maxPages": 5}
Integration Examples
LangChain RAG Pipeline
```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

# Load Reasoning Packs from Apify
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"]["markdown"],
        metadata={
            "url": item["url"],
            "title": item["content"]["title"],
            "entities": item["extraction"]["entities"],
            "claims": item["extraction"]["claims"],
        },
    ),
)
docs = loader.load()

# Or use pre-chunked RAG segments
chunked_loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: [
        Document(page_content=chunk["text"], metadata=chunk["metadata"])
        for chunk in item["chunks"]
    ],
)
```
n8n Workflow Integration
{"nodes": [{"name": "Run Compliance Scraper","type": "n8n-nodes-base.apify","parameters": {"actorId": "ai_solutionist/compliance-web-intel","input": {"startUrls": [{"url": "{{ $json.targetUrl }}"}],"taskMode": "competitor_teardown"}}},{"name": "Process Reasoning Pack","type": "n8n-nodes-base.code","parameters": {"code": "return items[0].json.extraction.claims.map(c => ({ claim: c.text, source: c.exactQuote }))"}}]}
Direct API Call
```bash
curl -X POST "https://api.apify.com/v2/acts/ai_solutionist~compliance-web-intel/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://example.com" }],
    "taskMode": "general",
    "maxPages": 10
  }'
```
Why Grounded Extraction Matters for AI
The Hallucination Problem
When AI agents scrape websites and summarize content, they often:
- Misattribute statements
- Conflate information from multiple sources
- Generate plausible-sounding but incorrect facts
- Lose the ability to cite sources
The Grounded Solution
Every extraction in Compliance Web Intel includes:
{"facts": [{"id": "fact_a1b2c3","statement": "Company processes 1 million requests daily","sourceBlockId": "blk_14_def456","exactQuote": "Our platform handles over 1 million API requests every day...","confidence": 0.92,"category": "statistic"}]}
Your AI agent can now:
- Cite sources: "According to [exactQuote] from [url]..."
- Verify claims: Cross-reference sourceBlockId with original content
- Audit decisions: Full provenance chain for compliance
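The cite-then-verify pattern can be sketched in a few lines. The fact dict mirrors the example JSON above; the blocks dict is a hypothetical stand-in for the pack's content.blocks, keyed by block ID.

```python
# A fact as emitted by the actor (values copied from the example above).
fact = {
    "statement": "Company processes 1 million requests daily",
    "sourceBlockId": "blk_14_def456",
    "exactQuote": "Our platform handles over 1 million API requests every day...",
}
url = "https://example.com"

# Cite: quote the exact source text rather than a paraphrase.
citation = f'According to "{fact["exactQuote"]}" from {url}'

# Verify: confirm the quote really appears in the referenced block
# (this dict stands in for content.blocks from the same pack).
blocks = {
    "blk_14_def456": "Our platform handles over 1 million API requests every day, worldwide."
}
quote = fact["exactQuote"].rstrip(".")  # drop the trailing ellipsis
is_grounded = quote in blocks.get(fact["sourceBlockId"], "")
```

If `is_grounded` comes back False, the agent can flag the fact for human review instead of citing it.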
Input Schema
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to crawl (objects with a url property) |
| taskMode | string | general | Preset configuration for extraction focus |
| maxPages | integer | 50 | Maximum pages to crawl per run |
| maxDepth | integer | 3 | How deep to follow links |
| includePatterns | array | [] | Only crawl URLs matching these regex patterns |
| excludePatterns | array | [] | Skip URLs matching these regex patterns |
| minRelevanceScore | number | 0.5 | URL relevance threshold (0-1) |
| ragChunkSize | integer | 750 | Target tokens per RAG chunk |
| diffBaseline | string | null | Previous run ID for change detection |
| alertWebhook | string | null | Webhook for material change alerts |
| proxyConfig | object | Apify Proxy | Proxy configuration |
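Putting several of these parameters together, a run restricted to blog pages with smaller chunks might look like this (the URL and patterns are illustrative, not defaults):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskMode": "general",
  "maxPages": 25,
  "maxDepth": 2,
  "includePatterns": ["/blog/.*"],
  "excludePatterns": [".*\\?page=.*"],
  "minRelevanceScore": 0.6,
  "ragChunkSize": 600
}
```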
Output Format
Full Reasoning Pack Example
{"runId": "abc123def456","url": "https://example.com/pricing","domain": "example.com","crawlTimestamp": "2026-01-28T12:00:00.000Z","taskMode": "pricing_intelligence","content": {"markdown": "# Pricing Plans\n\nChoose the plan that's right for you...","title": "Pricing - Example Company","metaDescription": "Simple, transparent pricing for teams of all sizes.","wordCount": 1247,"readingTimeMinutes": 6,"outline": [{"level": 1, "text": "Pricing Plans", "id": "h_0_abc", "children": [{"level": 2, "text": "Starter", "id": "h_1_def", "children": []},{"level": 2, "text": "Professional", "id": "h_2_ghi", "children": []},{"level": 2, "text": "Enterprise", "id": "h_3_jkl", "children": []}]}],"blocks": [{"id": "blk_0_mno","type": "heading","text": "Pricing Plans","tokenCount": 3,"position": 0},{"id": "blk_1_pqr","type": "paragraph","text": "Choose the plan that's right for your team. All plans include...","tokenCount": 42,"position": 1}]},"extraction": {"facts": [{"id": "fact_001","statement": "Professional plan includes unlimited API calls","sourceBlockId": "blk_5_xyz","exactQuote": "Professional: Unlimited API calls, priority support, and advanced analytics.","confidence": 0.88,"category": "claim"}],"entities": [{"id": "ent_001", "text": "$29/month", "type": "money", "sourceBlockId": "blk_3_abc", "confidence": 1.0},{"id": "ent_002", "text": "support@example.com", "type": "email", "sourceBlockId": "blk_12_def", "confidence": 1.0}],"claims": [{"id": "claim_001","text": "#1 rated solution","type": "marketing","sourceBlockId": "blk_2_ghi","exactQuote": "The #1 rated solution for growing teams","sentiment": "positive"}],"pricing": [{"id": "price_001","amount": 29,"currency": "USD","period": "month","tierName": "Starter","sourceBlockId": "blk_3_abc","rawText": "$29/mo"},{"id": "price_002","amount": 99,"currency": "USD","period": "month","tierName": "Professional","sourceBlockId": "blk_5_def","rawText": "$99/month"}],"contactInfo": {"emails": ["support@example.com", 
"sales@example.com"],"phones": ["1-800-555-0123"],"addresses": [],"socialLinks": {"twitter": "https://twitter.com/example", "linkedin": "https://linkedin.com/company/example"}},"dates": [{"id": "date_001","text": "January 15, 2026","parsed": "2026-01-15T00:00:00.000Z","context": "...pricing effective January 15, 2026...","sourceBlockId": "blk_8_xyz"}]},"schema": {"detectedType": "SoftwareApplication","confidence": 0.92,"normalizedFields": {"title": "Pricing - Example Company","description": "Simple, transparent pricing for teams of all sizes.","price": "$29"},"rawJsonLd": {"@type": "SoftwareApplication", "name": "Example App", ...},"microdata": null},"monitoring": {"contentHash": "sha256:a1b2c3d4e5f6...","previousHash": null,"diffSummary": null,"materialityScore": 0,"changedSections": [],"alertTriggered": false},"provenance": {"requestHeaders": {},"responseHeaders": {"content-type": "text/html; charset=utf-8"},"renderMode": "static","proxyUsed": "datacenter","htmlHash": "sha256:9f8e7d6c5b4a...","htmlSizeBytes": 45678,"fetchedAt": "2026-01-28T12:00:00.000Z","statusCode": 200,"redirectChain": []},"chunks": [{"id": "chunk_001","text": "Pricing Plans. Choose the plan that's right for your team. 
All plans include unlimited users, 99.9% uptime SLA, and 24/7 support...","tokenCount": 687,"metadata": {"url": "https://example.com/pricing","title": "Pricing - Example Company","section": "Pricing Plans","blockIds": ["blk_0_mno", "blk_1_pqr", "blk_2_stu"],"position": 0,"totalChunks": 3}}],"quality": {"overallScore": 88,"completeness": 83,"confidence": 78,"citationCoverage": 100,"structureQuality": 90},"summary": {"url": "https://example.com/pricing","title": "Pricing - Example Company","oneLiner": "SoftwareApplication page from example.com with pricing and feature information.","keyFacts": ["Professional plan includes unlimited API calls","All plans include 24/7 support","Enterprise tier offers custom integrations"],"keyEntities": ["Example Company (organization)", "$29/month (money)", "$99/month (money)"],"topClaims": ["The #1 rated solution for growing teams"],"pricingSummary": "3 pricing tiers from $29/month to $299/month","qualityScore": 88,"recommendedActions": ["Track pricing changes with diffBaseline parameter","Use RAG chunks for efficient embedding"]},"citation": {"apa": "Example Company. (2026). Pricing - Example Company. Retrieved January 28, 2026, from https://example.com/pricing","mla": "\"Pricing - Example Company.\" example.com, January 28, 2026, https://example.com/pricing.","chicago": "\"Pricing - Example Company.\" example.com. Accessed January 28, 2026. 
https://example.com/pricing.","inline": "[Pricing - Example Company](https://example.com/pricing) (accessed 2026-01-28)","markdown": "> Source: [Pricing - Example Company](https://example.com/pricing)\n> Retrieved: January 28, 2026\n> Domain: example.com"},"langchainMetadata": {"source": "https://example.com/pricing","title": "Pricing - Example Company","domain": "example.com","crawl_date": "2026-01-28T12:00:00.000Z","word_count": 1247,"detected_type": "SoftwareApplication","entities_count": 12,"claims_count": 3,"has_pricing": true,"quality_score": 88,"content_hash": "sha256:a1b2c3d4e5f6..."},"processingTimeMs": 1247,"warnings": [],"errors": []}
Frequently Asked Questions
How is this different from Apify's Website Content Crawler?
Website Content Crawler returns raw content. Compliance Web Intel adds:
- Grounded extraction with source citations
- Semantic chunking optimized for RAG
- Entity/claim detection with confidence scores
- Schema.org detection and normalization
- Change monitoring with materiality scoring
Can I use this for monitoring competitor pricing?
Yes! Use taskMode: "pricing_intelligence" with diffBaseline set to a previous run ID. The actor will detect pricing changes and calculate a materialityScore. Set up an alertWebhook to get notified of significant changes.
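For example, a follow-up monitoring run might look like this (the run ID and webhook URL are placeholders):

```json
{
  "taskMode": "pricing_intelligence",
  "startUrls": [{ "url": "https://saas-company.com/pricing" }],
  "diffBaseline": "PREVIOUS_RUN_ID",
  "alertWebhook": "https://your-webhook.com/pricing-alerts"
}
```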
How do RAG chunks work?
Chunks are created by:
- Grouping content by section (using headings as boundaries)
- Splitting on sentence boundaries
- Targeting 500-900 tokens per chunk
- Adding 50-token overlap between chunks
- Including full metadata (URL, section, block IDs)
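The packing-with-overlap steps above can be sketched roughly as follows. This is a simplified illustration that counts whitespace-separated words as tokens; it is not the actor's actual implementation.

```python
def chunk_sentences(sentences, target_tokens=750, overlap_tokens=50):
    """Greedily pack sentences into chunks near the target token count,
    carrying a small tail of sentences forward as overlap."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # crude whitespace token count
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until ~overlap_tokens.
            tail, tail_count = [], 0
            for prev in reversed(current):
                tail_count += len(prev.split())
                tail.insert(0, prev)
                if tail_count >= overlap_tokens:
                    break
            current, count = tail, tail_count
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy corpus: 300 sentences of 7 whitespace tokens each.
sentences = [f"Sentence {i} has exactly six words here." for i in range(300)]
chunks = chunk_sentences(sentences, target_tokens=100, overlap_tokens=14)
```

With these toy numbers, each chunk stays at or under 100 tokens and the last two sentences of each chunk reappear at the start of the next.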
What's the difference between facts and claims?
- Facts: Verifiable statements (statistics, definitions, processes)
- Claims: Subjective assertions (marketing claims, guarantees, testimonials)
Both include source citations for verification.
How do I handle JavaScript-rendered pages?
The actor uses Cheerio (static HTML) by default. For JS-heavy pages, the crawler will still extract what's available in the initial HTML. For fully dynamic SPAs, consider using Apify's Puppeteer/Playwright scrapers first, then processing the HTML through this actor.
Use Cases by Industry
SaaS / Technology
- Competitor feature tracking
- Pricing intelligence
- Documentation-to-API conversion
- Market positioning analysis
Legal / Compliance
- Privacy policy monitoring
- Terms of service change detection
- Regulatory compliance evidence
- AI governance audit trails
Marketing / SEO
- Content gap analysis
- Local SEO citation audits
- Competitive positioning research
- Schema markup validation
Sales / Business Development
- Account research automation
- Company signal detection
- Contact information extraction
- News and press monitoring
Research / Intelligence
- Due diligence automation
- Market research compilation
- Academic source collection
- Fact-checking support
Pricing
This actor uses Apify's standard compute unit pricing. Typical costs:
- 10 pages: ~0.01 compute units ($0.005)
- 100 pages: ~0.1 compute units ($0.05)
- 1,000 pages: ~1 compute unit ($0.50)
Actual costs depend on page complexity, proxy usage, and processing time.
Support & Feedback
- Issues: GitHub Issues
- Email: jason@jasonpellerin.com
- Apify Discord: @ai_solutionist
About the Author
AI Solutionist builds automation infrastructure for the AI-native enterprise. This actor embodies the HyperCognate philosophy:
Question assumptions. Obsess over detail. Plan like Da Vinci. Craft solutions that sing. Iterate relentlessly. Simplify ruthlessly.
Changelog
v1.0.0 (2026-01-28)
- Initial release
- 6 task modes for common AI workflows
- Grounded extraction with source citations
- RAG-ready chunking
- Schema.org detection
- Change monitoring infrastructure
License
MIT License - Use freely in commercial and open-source projects.
The scraper that AI agents trust. Because intelligence without provenance is just hallucination.