Compliance-Grade Web Intelligence for AI Agents
Pricing
from $2.00 / 1,000 reasoning packs
The scraper that AI agents trust. Extract grounded facts with citations, entities, claims, and RAG chunks. Built for LangChain, LlamaIndex, and AutoGPT. Quality scoring, auto-citations, 6 task modes.
The scraper that AI agents trust. Extract grounded facts with source citations, not hallucinated summaries.
🚨 Colorado SB 25B-004: AI Compliance Deadline June 30, 2026
Colorado Senate Bill 25B-004 establishes first-in-nation AI transparency and accountability requirements. Effective June 30, 2026, organizations deploying AI systems must:
- Document data sources used by AI agents
- Maintain audit trails for AI-generated decisions
- Provide source citations when AI systems use web-scraped data
- Enable human review of AI outputs with traceable provenance
This actor is purpose-built for SB 25B-004 compliance:
| Requirement | How This Actor Helps |
|---|---|
| Source Documentation | Every extraction includes sourceBlockId + exactQuote |
| Audit Trails | Full provenance object with timestamps, hashes, status codes |
| Citation Generation | Auto-generated APA, MLA, Chicago, and inline citations |
| Human Review | Quality scores and summaries enable efficient oversight |
| Change Monitoring | contentHash and materialityScore track source changes |
Denver-based businesses: Get ahead of compliance before June 30th. This actor creates the audit evidence your AI systems need.
Why AI Agents Need This Scraper
Traditional scrapers return raw HTML or unstructured text. AI agents need grounded intelligence:
| Traditional Scraper | Compliance Web Intel |
|---|---|
| Raw HTML dump | Clean markdown + semantic blocks |
| No source tracking | Every fact has sourceBlockId + exact quote |
| Single output format | Reasoning Pack with 6 structured components |
| No change detection | Content hashing + materiality scoring |
| Hallucination-prone | Citation-ready extractions |
Built for: LangChain, LlamaIndex, AutoGPT, CrewAI, n8n AI nodes, custom RAG pipelines
What You Get: The Reasoning Pack
Every URL produces a Reasoning Pack - a structured bundle optimized for AI agent consumption:
```
ReasoningPack
├── content/              # Clean, structured content
│   ├── markdown          # Readability-processed markdown
│   ├── outline[]         # H1→H6 heading tree
│   └── blocks[]          # Semantic blocks with IDs
│
├── extraction/           # Grounded intelligence
│   ├── facts[]           # Statements with sourceBlockId + exactQuote
│   ├── entities[]        # People, orgs, products, locations, money, dates
│   ├── claims[]          # Marketing claims, guarantees, compliance statements
│   ├── pricing[]         # Prices with tier names and periods
│   └── contactInfo       # Emails, phones, addresses, social links
│
├── schema/               # Structured data detection
│   ├── detectedType      # Article, Product, LocalBusiness, FAQ, etc.
│   ├── confidence        # 0-1 detection confidence
│   └── normalizedFields  # Title, author, price, rating, etc.
│
├── monitoring/           # Change detection
│   ├── contentHash       # SHA-256 for deduplication
│   └── materialityScore  # 0-1 significance of changes
│
├── provenance/           # Full audit trail
│   ├── htmlHash          # Original HTML fingerprint
│   ├── fetchedAt         # ISO timestamp
│   ├── statusCode        # HTTP response code
│   └── renderMode        # static/javascript/browser
│
└── chunks[]              # RAG-ready segments
    ├── text              # 500-900 token chunks
    ├── tokenCount        # Exact token count
    └── metadata          # URL, section, blockIds, position
```
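As a sketch of how an agent might consume a pack, the snippet below filters facts by confidence and builds citation strings. Field names follow the pack layout; the pack dict itself is a hypothetical, trimmed stand-in for a real dataset item.

```python
# A hypothetical Reasoning Pack item, trimmed to the fields used below
# (field names follow the pack layout; values are illustrative).
pack = {
    "url": "https://example.com/pricing",
    "extraction": {
        "facts": [
            {
                "statement": "Company processes 1 million requests daily",
                "sourceBlockId": "blk_14_def456",
                "exactQuote": "Our platform handles over 1 million API requests every day...",
                "confidence": 0.92,
            },
            {
                "statement": "Founded in 2015",
                "sourceBlockId": "blk_2_aaa111",
                "exactQuote": "Since 2015 we have...",
                "confidence": 0.55,
            },
        ]
    },
}

# Keep only facts confident enough for the agent to cite.
grounded = [f for f in pack["extraction"]["facts"] if f["confidence"] >= 0.8]
citations = [
    f'{f["statement"]} (source: {pack["url"]}#{f["sourceBlockId"]})' for f in grounded
]
```

The 0.8 threshold is an arbitrary choice for the sketch; pick whatever cutoff suits your agent's risk tolerance.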
Task Modes: Pre-configured for Common AI Workflows
1. competitor_teardown
Best for: Competitive intelligence agents, market research bots
Focuses on: pricing pages, features, about, testimonials, comparison pages
Output includes:
- Pricing tiers with features
- Marketing claims with citations
- Positioning statements
- Competitor differentiators
{"taskMode": "competitor_teardown","startUrls": [{"url": "https://competitor.com"}],"maxPages": 20}
2. compliance_discovery
Best for: AI governance agents, policy monitoring bots, legal research
Focuses on: privacy policy, terms of service, legal notices, accessibility statements
Output includes:
- GDPR/CCPA compliance claims
- Data handling statements
- Legal disclaimers with exact quotes
- Policy change detection
{"taskMode": "compliance_discovery","startUrls": [{"url": "https://company.com/privacy"}],"alertWebhook": "https://your-webhook.com/policy-changes"}
3. local_seo_audit
Best for: Local SEO agents, citation building bots, GEO optimization
Focuses on: contact pages, about, locations, services, reviews
Output includes:
- NAP (Name, Address, Phone) extraction
- Schema.org markup detection
- Service area identification
- Business hours parsing
{"taskMode": "local_seo_audit","startUrls": [{"url": "https://local-business.com"}],"maxPages": 10}
4. sales_research
Best for: Account research agents, sales enablement bots
Focuses on: about, team, leadership, news, press, careers, case studies
Output includes:
- Company signals (hiring, funding, expansion)
- Key personnel extraction
- Technology stack indicators
- Recent news with citations
{"taskMode": "sales_research","startUrls": [{"url": "https://prospect-company.com"}],"maxPages": 15}
5. docs_extraction
Best for: Documentation-to-API agents, knowledge base builders
Focuses on: docs, API references, guides, tutorials, help articles
Output includes:
- Structured procedure extraction
- Code example identification
- Parameter documentation
- Step-by-step instructions
{"taskMode": "docs_extraction","startUrls": [{"url": "https://docs.example.com"}],"maxPages": 50}
6. pricing_intelligence
Best for: Price monitoring agents, market analysis bots
Focuses on: pricing pages, plans, packages, enterprise quotes
Output includes:
- Tier-by-tier pricing breakdown
- Feature comparisons
- Discount indicators
- Enterprise contact triggers
{"taskMode": "pricing_intelligence","startUrls": [{"url": "https://saas-company.com/pricing"}],"maxPages": 5}
Integration Examples
LangChain RAG Pipeline
```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

# Load Reasoning Packs from Apify
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"]["markdown"],
        metadata={
            "url": item["url"],
            "title": item["content"]["title"],
            "entities": item["extraction"]["entities"],
            "claims": item["extraction"]["claims"],
        },
    ),
)
docs = loader.load()

# Or use pre-chunked RAG segments
chunked_loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: [
        Document(page_content=chunk["text"], metadata=chunk["metadata"])
        for chunk in item["chunks"]
    ],
)
```
n8n Workflow Integration
{"nodes": [{"name": "Run Compliance Scraper","type": "n8n-nodes-base.apify","parameters": {"actorId": "ai_solutionist/compliance-web-intel","input": {"startUrls": [{"url": "{{ $json.targetUrl }}"}],"taskMode": "competitor_teardown"}}},{"name": "Process Reasoning Pack","type": "n8n-nodes-base.code","parameters": {"code": "return items[0].json.extraction.claims.map(c => ({ claim: c.text, source: c.exactQuote }))"}}]}
Direct API Call
```bash
curl -X POST "https://api.apify.com/v2/acts/ai_solutionist~compliance-web-intel/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://example.com" }],
    "taskMode": "general",
    "maxPages": 10
  }'
```
Why Grounded Extraction Matters for AI
The Hallucination Problem
When AI agents scrape websites and summarize content, they often:
- Misattribute statements
- Conflate information from multiple sources
- Generate plausible-sounding but incorrect facts
- Lose the ability to cite sources
The Grounded Solution
Every extraction in Compliance Web Intel includes:
{"facts": [{"id": "fact_a1b2c3","statement": "Company processes 1 million requests daily","sourceBlockId": "blk_14_def456","exactQuote": "Our platform handles over 1 million API requests every day...","confidence": 0.92,"category": "statistic"}]}
Your AI agent can now:
- Cite sources: "According to [exactQuote] from [url]..."
- Verify claims: Cross-reference sourceBlockId with original content
- Audit decisions: Full provenance chain for compliance
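The cite-then-verify pattern can be sketched in a few lines. The fact dict mirrors the example JSON above; the blocks dict is a hypothetical stand-in for the pack's content.blocks, keyed by block ID.

```python
# A fact as emitted by the actor (values copied from the example above).
fact = {
    "statement": "Company processes 1 million requests daily",
    "sourceBlockId": "blk_14_def456",
    "exactQuote": "Our platform handles over 1 million API requests every day...",
}
url = "https://example.com"

# Cite: quote the exact source text rather than a paraphrase.
citation = f'According to "{fact["exactQuote"]}" from {url}'

# Verify: confirm the quote really appears in the referenced block
# (this dict stands in for content.blocks from the same pack).
blocks = {
    "blk_14_def456": "Our platform handles over 1 million API requests every day, worldwide."
}
quote = fact["exactQuote"].rstrip(".")  # drop the trailing ellipsis
is_grounded = quote in blocks.get(fact["sourceBlockId"], "")
```

If `is_grounded` comes back False, the agent can flag the fact for human review instead of citing it.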
Input Schema
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to crawl (objects with a url property) |
| taskMode | string | general | Preset configuration for extraction focus |
| maxPages | integer | 50 | Maximum pages to crawl per run |
| maxDepth | integer | 3 | How deep to follow links |
| includePatterns | array | [] | Only crawl URLs matching these regex patterns |
| excludePatterns | array | [] | Skip URLs matching these regex patterns |
| minRelevanceScore | number | 0.5 | URL relevance threshold (0-1) |
| ragChunkSize | integer | 750 | Target tokens per RAG chunk |
| diffBaseline | string | null | Previous run ID for change detection |
| alertWebhook | string | null | Webhook for material change alerts |
| proxyConfig | object | Apify Proxy | Proxy configuration |
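Putting several of these parameters together, a run restricted to blog pages with smaller chunks might look like this (the URL and patterns are illustrative, not defaults):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskMode": "general",
  "maxPages": 25,
  "maxDepth": 2,
  "includePatterns": ["/blog/.*"],
  "excludePatterns": [".*\\?page=.*"],
  "minRelevanceScore": 0.6,
  "ragChunkSize": 600
}
```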
Output Format
Full Reasoning Pack Example
{"runId": "abc123def456","url": "https://example.com/pricing","domain": "example.com","crawlTimestamp": "2026-01-28T12:00:00.000Z","taskMode": "pricing_intelligence","content": {"markdown": "# Pricing Plans\n\nChoose the plan that's right for you...","title": "Pricing - Example Company","metaDescription": "Simple, transparent pricing for teams of all sizes.","wordCount": 1247,"readingTimeMinutes": 6,"outline": [{"level": 1, "text": "Pricing Plans", "id": "h_0_abc", "children": [{"level": 2, "text": "Starter", "id": "h_1_def", "children": []},{"level": 2, "text": "Professional", "id": "h_2_ghi", "children": []},{"level": 2, "text": "Enterprise", "id": "h_3_jkl", "children": []}]}],"blocks": [{"id": "blk_0_mno","type": "heading","text": "Pricing Plans","tokenCount": 3,"position": 0},{"id": "blk_1_pqr","type": "paragraph","text": "Choose the plan that's right for your team. All plans include...","tokenCount": 42,"position": 1}]},"extraction": {"facts": [{"id": "fact_001","statement": "Professional plan includes unlimited API calls","sourceBlockId": "blk_5_xyz","exactQuote": "Professional: Unlimited API calls, priority support, and advanced analytics.","confidence": 0.88,"category": "claim"}],"entities": [{"id": "ent_001", "text": "$29/month", "type": "money", "sourceBlockId": "blk_3_abc", "confidence": 1.0},{"id": "ent_002", "text": "support@example.com", "type": "email", "sourceBlockId": "blk_12_def", "confidence": 1.0}],"claims": [{"id": "claim_001","text": "#1 rated solution","type": "marketing","sourceBlockId": "blk_2_ghi","exactQuote": "The #1 rated solution for growing teams","sentiment": "positive"}],"pricing": [{"id": "price_001","amount": 29,"currency": "USD","period": "month","tierName": "Starter","sourceBlockId": "blk_3_abc","rawText": "$29/mo"},{"id": "price_002","amount": 99,"currency": "USD","period": "month","tierName": "Professional","sourceBlockId": "blk_5_def","rawText": "$99/month"}],"contactInfo": {"emails": ["support@example.com", 
"sales@example.com"],"phones": ["1-800-555-0123"],"addresses": [],"socialLinks": {"twitter": "https://twitter.com/example", "linkedin": "https://linkedin.com/company/example"}},"dates": [{"id": "date_001","text": "January 15, 2026","parsed": "2026-01-15T00:00:00.000Z","context": "...pricing effective January 15, 2026...","sourceBlockId": "blk_8_xyz"}]},"schema": {"detectedType": "SoftwareApplication","confidence": 0.92,"normalizedFields": {"title": "Pricing - Example Company","description": "Simple, transparent pricing for teams of all sizes.","price": "$29"},"rawJsonLd": {"@type": "SoftwareApplication", "name": "Example App", ...},"microdata": null},"monitoring": {"contentHash": "sha256:a1b2c3d4e5f6...","previousHash": null,"diffSummary": null,"materialityScore": 0,"changedSections": [],"alertTriggered": false},"provenance": {"requestHeaders": {},"responseHeaders": {"content-type": "text/html; charset=utf-8"},"renderMode": "static","proxyUsed": "datacenter","htmlHash": "sha256:9f8e7d6c5b4a...","htmlSizeBytes": 45678,"fetchedAt": "2026-01-28T12:00:00.000Z","statusCode": 200,"redirectChain": []},"chunks": [{"id": "chunk_001","text": "Pricing Plans. Choose the plan that's right for your team. 
All plans include unlimited users, 99.9% uptime SLA, and 24/7 support...","tokenCount": 687,"metadata": {"url": "https://example.com/pricing","title": "Pricing - Example Company","section": "Pricing Plans","blockIds": ["blk_0_mno", "blk_1_pqr", "blk_2_stu"],"position": 0,"totalChunks": 3}}],"quality": {"overallScore": 88,"completeness": 83,"confidence": 78,"citationCoverage": 100,"structureQuality": 90},"summary": {"url": "https://example.com/pricing","title": "Pricing - Example Company","oneLiner": "SoftwareApplication page from example.com with pricing and feature information.","keyFacts": ["Professional plan includes unlimited API calls","All plans include 24/7 support","Enterprise tier offers custom integrations"],"keyEntities": ["Example Company (organization)", "$29/month (money)", "$99/month (money)"],"topClaims": ["The #1 rated solution for growing teams"],"pricingSummary": "3 pricing tiers from $29/month to $299/month","qualityScore": 88,"recommendedActions": ["Track pricing changes with diffBaseline parameter","Use RAG chunks for efficient embedding"]},"citation": {"apa": "Example Company. (2026). Pricing - Example Company. Retrieved January 28, 2026, from https://example.com/pricing","mla": "\"Pricing - Example Company.\" example.com, January 28, 2026, https://example.com/pricing.","chicago": "\"Pricing - Example Company.\" example.com. Accessed January 28, 2026. 
https://example.com/pricing.","inline": "[Pricing - Example Company](https://example.com/pricing) (accessed 2026-01-28)","markdown": "> Source: [Pricing - Example Company](https://example.com/pricing)\n> Retrieved: January 28, 2026\n> Domain: example.com"},"langchainMetadata": {"source": "https://example.com/pricing","title": "Pricing - Example Company","domain": "example.com","crawl_date": "2026-01-28T12:00:00.000Z","word_count": 1247,"detected_type": "SoftwareApplication","entities_count": 12,"claims_count": 3,"has_pricing": true,"quality_score": 88,"content_hash": "sha256:a1b2c3d4e5f6..."},"processingTimeMs": 1247,"warnings": [],"errors": []}
Frequently Asked Questions
How is this different from Apify's Website Content Crawler?
Website Content Crawler returns raw content. Compliance Web Intel adds:
- Grounded extraction with source citations
- Semantic chunking optimized for RAG
- Entity/claim detection with confidence scores
- Schema.org detection and normalization
- Change monitoring with materiality scoring
Can I use this for monitoring competitor pricing?
Yes! Use taskMode: "pricing_intelligence" with diffBaseline set to a previous run ID. The actor will detect pricing changes and calculate a materialityScore. Set up an alertWebhook to get notified of significant changes.
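For example, a follow-up monitoring run might look like this (the run ID and webhook URL are placeholders):

```json
{
  "taskMode": "pricing_intelligence",
  "startUrls": [{ "url": "https://saas-company.com/pricing" }],
  "diffBaseline": "PREVIOUS_RUN_ID",
  "alertWebhook": "https://your-webhook.com/pricing-alerts"
}
```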
How do RAG chunks work?
Chunks are created by:
- Grouping content by section (using headings as boundaries)
- Splitting on sentence boundaries
- Targeting 500-900 tokens per chunk
- Adding 50-token overlap between chunks
- Including full metadata (URL, section, block IDs)
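The packing-with-overlap steps above can be sketched roughly as follows. This is a simplified illustration that counts whitespace-separated words as tokens; it is not the actor's actual implementation.

```python
def chunk_sentences(sentences, target_tokens=750, overlap_tokens=50):
    """Greedily pack sentences into chunks near the target token count,
    carrying a small tail of sentences forward as overlap."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # crude whitespace token count
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until ~overlap_tokens.
            tail, tail_count = [], 0
            for prev in reversed(current):
                tail_count += len(prev.split())
                tail.insert(0, prev)
                if tail_count >= overlap_tokens:
                    break
            current, count = tail, tail_count
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy corpus: 300 sentences of 7 whitespace tokens each.
sentences = [f"Sentence {i} has exactly six words here." for i in range(300)]
chunks = chunk_sentences(sentences, target_tokens=100, overlap_tokens=14)
```

With these toy numbers, each chunk stays at or under 100 tokens and the last two sentences of each chunk reappear at the start of the next.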
What's the difference between facts and claims?
- Facts: Verifiable statements (statistics, definitions, processes)
- Claims: Subjective assertions (marketing claims, guarantees, testimonials)
Both include source citations for verification.
How do I handle JavaScript-rendered pages?
The actor uses Cheerio (static HTML) by default. For JS-heavy pages, the crawler will still extract what's available in the initial HTML. For fully dynamic SPAs, consider using Apify's Puppeteer/Playwright scrapers first, then processing the HTML through this actor.
Use Cases by Industry
SaaS / Technology
- Competitor feature tracking
- Pricing intelligence
- Documentation-to-API conversion
- Market positioning analysis
Legal / Compliance
- Privacy policy monitoring
- Terms of service change detection
- Regulatory compliance evidence
- AI governance audit trails
Marketing / SEO
- Content gap analysis
- Local SEO citation audits
- Competitive positioning research
- Schema markup validation
Sales / Business Development
- Account research automation
- Company signal detection
- Contact information extraction
- News and press monitoring
Research / Intelligence
- Due diligence automation
- Market research compilation
- Academic source collection
- Fact-checking support
Pricing
This actor uses Apify's standard compute unit pricing. Typical costs:
- 10 pages: ~0.01 compute units ($0.005)
- 100 pages: ~0.1 compute units ($0.05)
- 1,000 pages: ~1 compute unit ($0.50)
Actual costs depend on page complexity, proxy usage, and processing time.
Support & Feedback
- Issues: GitHub Issues
- Email: jason@jasonpellerin.com
- Apify Discord: @ai_solutionist
About the Author
AI Solutionist builds automation infrastructure for the AI-native enterprise. This actor embodies the HyperCognate philosophy:
Question assumptions. Obsess over detail. Plan like Da Vinci. Craft solutions that sing. Iterate relentlessly. Simplify ruthlessly.
Changelog
v1.0.0 (2026-01-28)
- Initial release
- 6 task modes for common AI workflows
- Grounded extraction with source citations
- RAG-ready chunking
- Schema.org detection
- Change monitoring infrastructure
License
MIT License - Use freely in commercial and open-source projects.
The scraper that AI agents trust. Because intelligence without provenance is just hallucination.