Compliance-Grade Web Intelligence for AI Agents

Pricing: from $2.00 / 1,000 reasoning packs

Developer: Jason Pellerin (Maintained by Community)

The scraper AI agents trust. Extract grounded facts with citations, entities, claims & RAG chunks. Built for LangChain, LlamaIndex, AutoGPT. Quality scoring, auto-citations, 6 task modes.

The scraper that AI agents trust. Extract grounded facts with source citations, not hallucinated summaries.

Actor Quality Score · RAG Ready · AI Agent Compatible · Colorado SB 25B-004 Ready


🚨 Colorado SB 25B-004: AI Compliance Deadline June 30, 2026

Colorado Senate Bill 25B-004 establishes first-in-the-nation AI transparency and accountability requirements. Effective June 30, 2026, organizations deploying AI systems must:

  • Document data sources used by AI agents
  • Maintain audit trails for AI-generated decisions
  • Provide source citations when AI systems use web-scraped data
  • Enable human review of AI outputs with traceable provenance

This actor is purpose-built for SB 25B-004 compliance:

| Requirement | How This Actor Helps |
|---|---|
| Source Documentation | Every extraction includes sourceBlockId + exactQuote |
| Audit Trails | Full provenance object with timestamps, hashes, status codes |
| Citation Generation | Auto-generated APA, MLA, Chicago, and inline citations |
| Human Review | Quality scores and summary enable efficient oversight |
| Change Monitoring | contentHash and materialityScore track source changes |

Denver-based businesses: Get ahead of compliance before June 30th. This actor creates the audit evidence your AI systems need.
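
For example, the audit evidence can be exported straight from a run's dataset. A minimal sketch (not the actor's own tooling), assuming the official apify-client Python package and the dataset ID of a finished run; field names follow the output format documented below, and the CSV layout is purely illustrative:

import csv
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Flatten the provenance and citation fields of each Reasoning Pack into a CSV audit log
with open("ai_audit_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "fetchedAt", "statusCode", "contentHash", "htmlHash", "citation_apa"])
    for item in client.dataset("YOUR_DATASET_ID").iterate_items():
        prov = item["provenance"]
        writer.writerow([
            item["url"],
            prov["fetchedAt"],
            prov["statusCode"],
            item["monitoring"]["contentHash"],
            prov["htmlHash"],
            item["citation"]["apa"],
        ])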


Why AI Agents Need This Scraper

Traditional scrapers return raw HTML or unstructured text. AI agents need grounded intelligence:

| Traditional Scraper | Compliance Web Intel |
|---|---|
| Raw HTML dump | Clean markdown + semantic blocks |
| No source tracking | Every fact has sourceBlockId + exact quote |
| Single output format | Reasoning Pack with 6 structured components |
| No change detection | Content hashing + materiality scoring |
| Hallucination-prone | Citation-ready extractions |

Built for: LangChain, LlamaIndex, AutoGPT, CrewAI, n8n AI nodes, custom RAG pipelines


What You Get: The Reasoning Pack

Every URL produces a Reasoning Pack - a structured bundle optimized for AI agent consumption:

ReasoningPack
├── content/              # Clean, structured content
│   ├── markdown          # Readability-processed markdown
│   ├── outline[]         # H1→H6 heading tree
│   └── blocks[]          # Semantic blocks with IDs
├── extraction/           # Grounded intelligence
│   ├── facts[]           # Statements with sourceBlockId + exactQuote
│   ├── entities[]        # People, orgs, products, locations, money, dates
│   ├── claims[]          # Marketing claims, guarantees, compliance statements
│   ├── pricing[]         # Prices with tier names and periods
│   └── contactInfo       # Emails, phones, addresses, social links
├── schema/               # Structured data detection
│   ├── detectedType      # Article, Product, LocalBusiness, FAQ, etc.
│   ├── confidence        # 0-1 detection confidence
│   └── normalizedFields  # Title, author, price, rating, etc.
├── monitoring/           # Change detection
│   ├── contentHash       # SHA-256 for deduplication
│   └── materialityScore  # 0-1 significance of changes
├── provenance/           # Full audit trail
│   ├── htmlHash          # Original HTML fingerprint
│   ├── fetchedAt         # ISO timestamp
│   ├── statusCode        # HTTP response code
│   └── renderMode        # static/javascript/browser
└── chunks[]              # RAG-ready segments
    ├── text              # 500-900 token chunks
    ├── tokenCount        # Exact token count
    └── metadata          # URL, section, blockIds, position
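
A minimal sketch of consuming Reasoning Packs in Python, assuming the official apify-client package and a finished run's dataset ID; it only touches fields documented in this README:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Each dataset item is one Reasoning Pack
for item in client.dataset("YOUR_DATASET_ID").iterate_items():
    print(item["url"], "- quality", item["quality"]["overallScore"])
    for fact in item["extraction"]["facts"]:
        # Each fact is grounded: it points at the block it came from and the exact quote
        print(f'  {fact["statement"]}  [{fact["sourceBlockId"]}]')
        print(f'    "{fact["exactQuote"]}"')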

Task Modes: Pre-configured for Common AI Workflows

1. competitor_teardown

Best for: Competitive intelligence agents, market research bots

Focuses on: pricing pages, features, about, testimonials, comparison pages

Output includes:

  • Pricing tiers with features
  • Marketing claims with citations
  • Positioning statements
  • Competitor differentiators
{
  "taskMode": "competitor_teardown",
  "startUrls": [{"url": "https://competitor.com"}],
  "maxPages": 20
}

2. compliance_discovery

Best for: AI governance agents, policy monitoring bots, legal research

Focuses on: privacy policy, terms of service, legal notices, accessibility statements

Output includes:

  • GDPR/CCPA compliance claims
  • Data handling statements
  • Legal disclaimers with exact quotes
  • Policy change detection
{
  "taskMode": "compliance_discovery",
  "startUrls": [{"url": "https://company.com/privacy"}],
  "alertWebhook": "https://your-webhook.com/policy-changes"
}

3. local_seo_audit

Best for: Local SEO agents, citation building bots, GEO optimization

Focuses on: contact pages, about, locations, services, reviews

Output includes:

  • NAP (Name, Address, Phone) extraction
  • Schema.org markup detection
  • Service area identification
  • Business hours parsing
{
  "taskMode": "local_seo_audit",
  "startUrls": [{"url": "https://local-business.com"}],
  "maxPages": 10
}

4. sales_research

Best for: Account research agents, sales enablement bots

Focuses on: about, team, leadership, news, press, careers, case studies

Output includes:

  • Company signals (hiring, funding, expansion)
  • Key personnel extraction
  • Technology stack indicators
  • Recent news with citations
{
  "taskMode": "sales_research",
  "startUrls": [{"url": "https://prospect-company.com"}],
  "maxPages": 15
}

5. docs_extraction

Best for: Documentation-to-API agents, knowledge base builders

Focuses on: docs, API references, guides, tutorials, help articles

Output includes:

  • Structured procedure extraction
  • Code example identification
  • Parameter documentation
  • Step-by-step instructions
{
  "taskMode": "docs_extraction",
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 50
}

6. pricing_intelligence

Best for: Price monitoring agents, market analysis bots

Focuses on: pricing pages, plans, packages, enterprise quotes

Output includes:

  • Tier-by-tier pricing breakdown
  • Feature comparisons
  • Discount indicators
  • Enterprise contact triggers
{
  "taskMode": "pricing_intelligence",
  "startUrls": [{"url": "https://saas-company.com/pricing"}],
  "maxPages": 5
}

Integration Examples

LangChain RAG Pipeline

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

# Load Reasoning Packs from Apify, one Document per page
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"]["markdown"],
        metadata={
            "url": item["url"],
            "title": item["content"]["title"],
            "entities": item["extraction"]["entities"],
            "claims": item["extraction"]["claims"],
        },
    ),
)
docs = loader.load()

# Or use the pre-chunked RAG segments: map each item to a list of chunk
# Documents, then flatten the nested lists that load() returns
chunked_loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: [
        Document(page_content=chunk["text"], metadata=chunk["metadata"])
        for chunk in item["chunks"]
    ],
)
chunk_docs = [d for item_docs in chunked_loader.load() for d in item_docs]
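
LlamaIndex Ingestion

The same dataset also drops into LlamaIndex. A rough sketch, not an official integration: it assumes the apify-client package, the llama-index "core" import layout, and a default embedding/LLM provider already configured in your environment:

from apify_client import ApifyClient
from llama_index.core import Document, VectorStoreIndex

client = ApifyClient("YOUR_APIFY_TOKEN")

# One LlamaIndex Document per pre-chunked RAG segment
documents = [
    Document(text=chunk["text"], metadata=chunk["metadata"])
    for item in client.dataset("YOUR_DATASET_ID").iterate_items()
    for chunk in item["chunks"]
]

index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("Which pricing tiers are offered?"))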

n8n Workflow Integration

{
  "nodes": [
    {
      "name": "Run Compliance Scraper",
      "type": "n8n-nodes-base.apify",
      "parameters": {
        "actorId": "ai_solutionist/compliance-web-intel",
        "input": {
          "startUrls": [{"url": "{{ $json.targetUrl }}"}],
          "taskMode": "competitor_teardown"
        }
      }
    },
    {
      "name": "Process Reasoning Pack",
      "type": "n8n-nodes-base.code",
      "parameters": {
        "code": "return items[0].json.extraction.claims.map(c => ({ claim: c.text, source: c.exactQuote }))"
      }
    }
  ]
}

Direct API Call

curl -X POST "https://api.apify.com/v2/acts/ai_solutionist~compliance-web-intel/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://example.com"}],
    "taskMode": "general",
    "maxPages": 10
  }'
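
A rough Python sketch of the rest of that flow over the same REST API: poll the run started above until it finishes, then download the Reasoning Packs with the requests package. The endpoint paths follow Apify's v2 API:

import time
import requests

TOKEN = "YOUR_TOKEN"
ACTOR = "ai_solutionist~compliance-web-intel"

# Start the run (same input as the curl example)
run = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/runs?token={TOKEN}",
    json={"startUrls": [{"url": "https://example.com"}], "taskMode": "general", "maxPages": 10},
).json()["data"]

# Poll until the run reaches a terminal status
while run["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
    time.sleep(5)
    run = requests.get(f"https://api.apify.com/v2/actor-runs/{run['id']}?token={TOKEN}").json()["data"]

# Each dataset item is one Reasoning Pack
items = requests.get(
    f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items?token={TOKEN}&clean=true"
).json()
print(len(items), "reasoning packs")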

Why Grounded Extraction Matters for AI

The Hallucination Problem

When AI agents scrape websites and summarize content, they often:

  • Misattribute statements
  • Conflate information from multiple sources
  • Generate plausible-sounding but incorrect facts
  • Lose the ability to cite sources

The Grounded Solution

Every extraction in Compliance Web Intel includes:

{
  "facts": [
    {
      "id": "fact_a1b2c3",
      "statement": "Company processes 1 million requests daily",
      "sourceBlockId": "blk_14_def456",
      "exactQuote": "Our platform handles over 1 million API requests every day...",
      "confidence": 0.92,
      "category": "statistic"
    }
  ]
}

Your AI agent can now do all of the following (a small helper sketch follows the list):

  • Cite sources: "According to [exactQuote] from [url]..."
  • Verify claims: Cross-reference sourceBlockId with original content
  • Audit decisions: Full provenance chain for compliance
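
Putting the bullets above into code, a minimal helper that turns one grounded fact into a citable sentence; it uses only fields documented in the Reasoning Pack, and the cite_fact name is purely illustrative:

def cite_fact(pack: dict, fact: dict) -> str:
    # Combine the pack-level inline citation with the fact's exact quote and provenance
    return (
        f'According to {pack["citation"]["inline"]}: "{fact["exactQuote"]}" '
        f'(block {fact["sourceBlockId"]}, confidence {fact["confidence"]:.2f})'
    )

# e.g. cite_fact(pack, pack["extraction"]["facts"][0]) for a loaded Reasoning Pack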

Input Schema

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to crawl (objects with a url property) |
| taskMode | string | general | Preset configuration for extraction focus |
| maxPages | integer | 50 | Maximum pages to crawl per run |
| maxDepth | integer | 3 | How deep to follow links |
| includePatterns | array | [] | Only crawl URLs matching these regex patterns |
| excludePatterns | array | [] | Skip URLs matching these regex patterns |
| minRelevanceScore | number | 0.5 | URL relevance threshold (0-1) |
| ragChunkSize | integer | 750 | Target tokens per RAG chunk |
| diffBaseline | string | null | Previous run ID for change detection |
| alertWebhook | string | null | Webhook for material change alerts |
| proxyConfig | object | Apify Proxy | Proxy configuration |
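
An illustrative input that exercises most of these parameters, started through the apify-client Python package; every value is a placeholder rather than a recommendation, and the proxy object uses the standard Apify Proxy shape, which may differ for your setup:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "startUrls": [{"url": "https://example.com"}],
    "taskMode": "compliance_discovery",
    "maxPages": 25,
    "maxDepth": 2,
    "includePatterns": ["/privacy", "/terms", "/legal"],
    "excludePatterns": ["\\.pdf$"],
    "minRelevanceScore": 0.6,
    "ragChunkSize": 750,
    "diffBaseline": None,                    # a previous run ID enables change detection
    "alertWebhook": None,                    # or an HTTPS endpoint for material-change alerts
    "proxyConfig": {"useApifyProxy": True},  # standard Apify Proxy configuration object
}

run = client.actor("ai_solutionist/compliance-web-intel").call(run_input=run_input)
print("Dataset:", run["defaultDatasetId"])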

Output Format

Full Reasoning Pack Example

{
  "runId": "abc123def456",
  "url": "https://example.com/pricing",
  "domain": "example.com",
  "crawlTimestamp": "2026-01-28T12:00:00.000Z",
  "taskMode": "pricing_intelligence",
  "content": {
    "markdown": "# Pricing Plans\n\nChoose the plan that's right for you...",
    "title": "Pricing - Example Company",
    "metaDescription": "Simple, transparent pricing for teams of all sizes.",
    "wordCount": 1247,
    "readingTimeMinutes": 6,
    "outline": [
      {"level": 1, "text": "Pricing Plans", "id": "h_0_abc", "children": [
        {"level": 2, "text": "Starter", "id": "h_1_def", "children": []},
        {"level": 2, "text": "Professional", "id": "h_2_ghi", "children": []},
        {"level": 2, "text": "Enterprise", "id": "h_3_jkl", "children": []}
      ]}
    ],
    "blocks": [
      {
        "id": "blk_0_mno",
        "type": "heading",
        "text": "Pricing Plans",
        "tokenCount": 3,
        "position": 0
      },
      {
        "id": "blk_1_pqr",
        "type": "paragraph",
        "text": "Choose the plan that's right for your team. All plans include...",
        "tokenCount": 42,
        "position": 1
      }
    ]
  },
  "extraction": {
    "facts": [
      {
        "id": "fact_001",
        "statement": "Professional plan includes unlimited API calls",
        "sourceBlockId": "blk_5_xyz",
        "exactQuote": "Professional: Unlimited API calls, priority support, and advanced analytics.",
        "confidence": 0.88,
        "category": "claim"
      }
    ],
    "entities": [
      {"id": "ent_001", "text": "$29/month", "type": "money", "sourceBlockId": "blk_3_abc", "confidence": 1.0},
      {"id": "ent_002", "text": "support@example.com", "type": "email", "sourceBlockId": "blk_12_def", "confidence": 1.0}
    ],
    "claims": [
      {
        "id": "claim_001",
        "text": "#1 rated solution",
        "type": "marketing",
        "sourceBlockId": "blk_2_ghi",
        "exactQuote": "The #1 rated solution for growing teams",
        "sentiment": "positive"
      }
    ],
    "pricing": [
      {
        "id": "price_001",
        "amount": 29,
        "currency": "USD",
        "period": "month",
        "tierName": "Starter",
        "sourceBlockId": "blk_3_abc",
        "rawText": "$29/mo"
      },
      {
        "id": "price_002",
        "amount": 99,
        "currency": "USD",
        "period": "month",
        "tierName": "Professional",
        "sourceBlockId": "blk_5_def",
        "rawText": "$99/month"
      }
    ],
    "contactInfo": {
      "emails": ["support@example.com", "sales@example.com"],
      "phones": ["1-800-555-0123"],
      "addresses": [],
      "socialLinks": {"twitter": "https://twitter.com/example", "linkedin": "https://linkedin.com/company/example"}
    },
    "dates": [
      {
        "id": "date_001",
        "text": "January 15, 2026",
        "parsed": "2026-01-15T00:00:00.000Z",
        "context": "...pricing effective January 15, 2026...",
        "sourceBlockId": "blk_8_xyz"
      }
    ]
  },
  "schema": {
    "detectedType": "SoftwareApplication",
    "confidence": 0.92,
    "normalizedFields": {
      "title": "Pricing - Example Company",
      "description": "Simple, transparent pricing for teams of all sizes.",
      "price": "$29"
    },
    "rawJsonLd": {"@type": "SoftwareApplication", "name": "Example App", ...},
    "microdata": null
  },
  "monitoring": {
    "contentHash": "sha256:a1b2c3d4e5f6...",
    "previousHash": null,
    "diffSummary": null,
    "materialityScore": 0,
    "changedSections": [],
    "alertTriggered": false
  },
  "provenance": {
    "requestHeaders": {},
    "responseHeaders": {"content-type": "text/html; charset=utf-8"},
    "renderMode": "static",
    "proxyUsed": "datacenter",
    "htmlHash": "sha256:9f8e7d6c5b4a...",
    "htmlSizeBytes": 45678,
    "fetchedAt": "2026-01-28T12:00:00.000Z",
    "statusCode": 200,
    "redirectChain": []
  },
  "chunks": [
    {
      "id": "chunk_001",
      "text": "Pricing Plans. Choose the plan that's right for your team. All plans include unlimited users, 99.9% uptime SLA, and 24/7 support...",
      "tokenCount": 687,
      "metadata": {
        "url": "https://example.com/pricing",
        "title": "Pricing - Example Company",
        "section": "Pricing Plans",
        "blockIds": ["blk_0_mno", "blk_1_pqr", "blk_2_stu"],
        "position": 0,
        "totalChunks": 3
      }
    }
  ],
  "quality": {
    "overallScore": 88,
    "completeness": 83,
    "confidence": 78,
    "citationCoverage": 100,
    "structureQuality": 90
  },
  "summary": {
    "url": "https://example.com/pricing",
    "title": "Pricing - Example Company",
    "oneLiner": "SoftwareApplication page from example.com with pricing and feature information.",
    "keyFacts": [
      "Professional plan includes unlimited API calls",
      "All plans include 24/7 support",
      "Enterprise tier offers custom integrations"
    ],
    "keyEntities": ["Example Company (organization)", "$29/month (money)", "$99/month (money)"],
    "topClaims": ["The #1 rated solution for growing teams"],
    "pricingSummary": "3 pricing tiers from $29/month to $299/month",
    "qualityScore": 88,
    "recommendedActions": [
      "Track pricing changes with diffBaseline parameter",
      "Use RAG chunks for efficient embedding"
    ]
  },
  "citation": {
    "apa": "Example Company. (2026). Pricing - Example Company. Retrieved January 28, 2026, from https://example.com/pricing",
    "mla": "\"Pricing - Example Company.\" example.com, January 28, 2026, https://example.com/pricing.",
    "chicago": "\"Pricing - Example Company.\" example.com. Accessed January 28, 2026. https://example.com/pricing.",
    "inline": "[Pricing - Example Company](https://example.com/pricing) (accessed 2026-01-28)",
    "markdown": "> Source: [Pricing - Example Company](https://example.com/pricing)\n> Retrieved: January 28, 2026\n> Domain: example.com"
  },
  "langchainMetadata": {
    "source": "https://example.com/pricing",
    "title": "Pricing - Example Company",
    "domain": "example.com",
    "crawl_date": "2026-01-28T12:00:00.000Z",
    "word_count": 1247,
    "detected_type": "SoftwareApplication",
    "entities_count": 12,
    "claims_count": 3,
    "has_pricing": true,
    "quality_score": 88,
    "content_hash": "sha256:a1b2c3d4e5f6..."
  },
  "processingTimeMs": 1247,
  "warnings": [],
  "errors": []
}

Frequently Asked Questions

How is this different from Apify's Website Content Crawler?

Website Content Crawler returns raw content. Compliance Web Intel adds:

  • Grounded extraction with source citations
  • Semantic chunking optimized for RAG
  • Entity/claim detection with confidence scores
  • Schema.org detection and normalization
  • Change monitoring with materiality scoring

Can I use this for monitoring competitor pricing?

Yes! Use taskMode: "pricing_intelligence" with diffBaseline set to a previous run ID. The actor will detect pricing changes and calculate a materialityScore. Set up an alertWebhook to get notified of significant changes.
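
A sketch of that two-pass workflow with the apify-client Python package: the second run passes the first run's ID as diffBaseline, then checks monitoring.materialityScore on each page. The 0.5 threshold is illustrative, not a built-in default:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
actor = client.actor("ai_solutionist/compliance-web-intel")

base_input = {
    "startUrls": [{"url": "https://saas-company.com/pricing"}],
    "taskMode": "pricing_intelligence",
    "maxPages": 5,
}

baseline = actor.call(run_input=base_input)  # first pass establishes the baseline
followup = actor.call(run_input={
    **base_input,
    "diffBaseline": baseline["id"],          # compare against the baseline run
    "alertWebhook": "https://your-webhook.com/pricing-changes",
})

for item in client.dataset(followup["defaultDatasetId"]).iterate_items():
    if item["monitoring"]["materialityScore"] > 0.5:
        print("Material change on", item["url"], item["monitoring"]["diffSummary"])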

How do RAG chunks work?

Chunks are created by the following steps (a rough sketch of the process follows the list):

  1. Grouping content by section (using headings as boundaries)
  2. Splitting on sentence boundaries
  3. Targeting 500-900 tokens per chunk
  4. Adding 50-token overlap between chunks
  5. Including full metadata (URL, section, block IDs)
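
A rough approximation of that process in Python; this is not the actor's implementation, and it counts whitespace-separated words instead of real tokens:

import re

def chunk_section(section_title: str, text: str, target: int = 750, overlap: int = 50):
    # Split one section's text into sentence-aligned chunks near the target size
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)
        if sum(len(s.split()) for s in current) >= target:
            chunks.append({"section": section_title, "text": " ".join(current)})
            tail = " ".join(" ".join(current).split()[-overlap:])  # carry a small overlap forward
            current = [tail]
    if current:
        chunks.append({"section": section_title, "text": " ".join(current)})
    return chunks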

What's the difference between facts and claims?

  • Facts: Verifiable statements (statistics, definitions, processes)
  • Claims: Subjective assertions (marketing claims, guarantees, testimonials)

Both include source citations for verification.

How do I handle JavaScript-rendered pages?

The actor uses Cheerio (static HTML) by default. For JS-heavy pages, the crawler will still extract what's available in the initial HTML. For fully dynamic SPAs, consider using Apify's Puppeteer/Playwright scrapers first, then processing the HTML through this actor.


Use Cases by Industry

SaaS / Technology

  • Competitor feature tracking
  • Pricing intelligence
  • Documentation-to-API conversion
  • Market positioning analysis

Legal / Compliance

  • Privacy policy monitoring
  • Terms of service change detection
  • Regulatory compliance evidence
  • AI governance audit trails

Marketing / SEO

  • Content gap analysis
  • Local SEO citation audits
  • Competitive positioning research
  • Schema markup validation

Sales / Business Development

  • Account research automation
  • Company signal detection
  • Contact information extraction
  • News and press monitoring

Research / Intelligence

  • Due diligence automation
  • Market research compilation
  • Academic source collection
  • Fact-checking support

Pricing

This actor uses Apify's standard compute unit pricing. Typical costs:

  • 10 pages: 0.01 compute units ($0.005)
  • 100 pages: 0.1 compute units ($0.05)
  • 1000 pages: 1 compute unit ($0.50)

Actual costs depend on page complexity, proxy usage, and processing time.
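
For a back-of-the-envelope estimate using the figures above (roughly 1,000 pages per compute unit at $0.50 per unit):

def estimate_cost_usd(pages: int, usd_per_cu: float = 0.50, pages_per_cu: int = 1000) -> float:
    # Linear estimate only; real costs vary with rendering, proxies, and page size
    return pages / pages_per_cu * usd_per_cu

print(estimate_cost_usd(250))  # 0.125 -> roughly $0.13 for 250 pages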


Support & Feedback


About the Author

AI Solutionist builds automation infrastructure for the AI-native enterprise. This actor embodies the HyperCognate philosophy:

Question assumptions. Obsess over detail. Plan like Da Vinci. Craft solutions that sing. Iterate relentlessly. Simplify ruthlessly.

Other actors by AI Solutionist:


Changelog

v1.0.0 (2026-01-28)

  • Initial release
  • 6 task modes for common AI workflows
  • Grounded extraction with source citations
  • RAG-ready chunking
  • Schema.org detection
  • Change monitoring infrastructure

License

MIT License - Use freely in commercial and open-source projects.


The scraper that AI agents trust. Because intelligence without provenance is just hallucination.