AI Context Scraper

Under maintenance

Pricing

from $50.00 / 1,000 results

AI Context Scraper is a production-grade Apify Actor that gathers high-quality coding context from the Web, GitHub, and StackOverflow for AI agents and RAG systems. It uses NVIDIA Nemotron 3 Super to synthesize documents, code snippets, and patterns into actionable implementation guidance.

Rating: 0.0 (0 reviews)

Developer: Varun Chopra (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 2 monthly active users · last modified 7 hours ago

AI Context Scraper — Production-Grade Developer Knowledge Engine


Overview

A production-grade Apify Actor that intelligently compiles high-quality coding context for AI agents, developer copilots, and engineering RAG systems. Transforms any coding task into structured, LLM-optimized context with documentation, code examples, implementation patterns, and best practices.

Powered by NVIDIA Nemotron 3 Super (120B) via OpenRouter for LLM synthesis.

🚀 Quick Start

```bash
# Run via Apify CLI
apify actors call wheat_tourist/ai-context-scraper \
    -f input.json -t 300 -o
```

with `input.json`:

```json
{
    "task": "Build a Python FastAPI endpoint for file upload to S3",
    "max_sources": 10,
    "enable_llm_synthesis": true
}
```

Or call via the Apify Console.

✅ Production Verified

The actor has been battle-tested with real-world queries. Here are actual metrics from a live run:

| Metric | Value |
| --- | --- |
| Search queries executed | 12 |
| Sources discovered | 5 |
| Pages crawled | 4 |
| Code snippets extracted | 92 |
| Critical context chunks | 11 |
| StackOverflow answers | 4 |
| LLM tokens used | 8,471 |
| Total execution time | 112 s |
| Errors | 0 |

🔑 Key Features

Multi-Source Knowledge Mining

  • Web Search: DuckDuckGo (via ddgs) with documentation prioritization and rate-limited timeouts
  • GitHub Intelligence: Repository and code search with star-based ranking
  • StackOverflow Q&A: High-score accepted answers from the developer community
  • Documentation Priority: Boosted ranking for official docs (Python, AWS, FastAPI, etc.)
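As a rough illustration of how documentation prioritization can work, the sketch below boosts the score of results from official documentation domains before sorting. The domain list and boost value are assumptions for illustration, not the actor's actual configuration:

```python
# Hypothetical sketch of documentation-priority ranking.
# Domain list and boost factor are illustrative assumptions.
OFFICIAL_DOC_DOMAINS = ("docs.python.org", "docs.aws.amazon.com", "fastapi.tiangolo.com")

def rank_results(results):
    """Sort search results, boosting hits from official documentation sites."""
    def score(result):
        base = result.get("score", 0.0)
        if any(domain in result["url"] for domain in OFFICIAL_DOC_DOMAINS):
            base += 0.5  # assumed boost for official docs
        return base
    return sorted(results, key=score, reverse=True)
```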

LLM RAG Synthesis

  • Model: nvidia/nemotron-3-super-120b-a12b:free (configurable)
  • Actionable Guidance: Synthesizes gathered context into implementation-ready code with open questions
  • Automatic Prompting: Builds token-optimized context prompts with code snippets, patterns, and SO answers
  • Graceful Degradation: If the LLM call fails, the pipeline safely returns raw structured context
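The graceful-degradation behaviour described above can be sketched as a wrapper that falls back to the raw structured context whenever synthesis raises. The function and field names here are illustrative, not the actor's internal API:

```python
# Illustrative sketch: if LLM synthesis fails for any reason, return the raw
# structured context instead of crashing the pipeline.
def synthesize_with_fallback(context, call_llm):
    """`call_llm` is any callable turning context into guidance; it may raise."""
    try:
        guidance = call_llm(context)
        return {"context": context, "llm_guidance": guidance}
    except Exception as exc:  # broad by design: synthesis must never kill the run
        return {"context": context, "llm_guidance": None, "llm_error": str(exc)}
```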

Advanced Intelligence

  • Semantic Relevance Filtering: sentence-transformers/all-MiniLM-L6-v2 embeddings for precision ranking
  • Relevance Bucketization: Context classified as Critical / Helpful / Noise (noise dropped)
  • Implementation Pattern Detection: Automatically identifies auth, caching, async, database patterns
  • Content Deduplication: MinHash/shingling-based near-duplicate removal
  • Code Quality Scoring: Ranks snippets by completeness, relevance, and documentation
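The bucketization step can be pictured as simple thresholding on the relevance score; the cut-off values below (0.5 / 0.3) are assumptions for illustration and may differ from the actor's real thresholds:

```python
# Sketch of relevance bucketization: Critical / Helpful kept, Noise dropped.
# Thresholds are illustrative assumptions.
def bucketize(chunks, critical=0.5, helpful=0.3):
    """Classify scored chunks as critical/helpful; discard noise."""
    buckets = {"critical": [], "helpful": []}
    for chunk in chunks:
        if chunk["relevance_score"] >= critical:
            buckets["critical"].append(chunk)
        elif chunk["relevance_score"] >= helpful:
            buckets["helpful"].append(chunk)
        # anything below `helpful` is treated as noise and dropped
    return buckets
```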

Enterprise & Security

  • SSRF Protection: CIDR-based private IP blocking (RFC 1918 + IPv6 link-local)
  • Input Sanitization: Pattern-based injection detection (script tags, eval, data URIs)
  • Secret Redaction: Automatic redaction of tokens/keys in logs
  • Caching Layer: Apify KV store with monotonic-clock TTL (immune to container clock skew)
  • Observability: Per-phase timing, relevance buckets, cache hit rates, error tracking
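The SSRF check above can be sketched with the standard-library `ipaddress` module; a real deployment would also resolve hostnames and re-check after redirects, which this minimal sketch omits:

```python
import ipaddress

# Minimal sketch of CIDR-based SSRF protection: reject private, loopback,
# and link-local addresses (covers RFC 1918 and IPv6 link-local ranges).
def is_blocked_ip(ip_str):
    addr = ipaddress.ip_address(ip_str)
    return addr.is_private or addr.is_loopback or addr.is_link_local
```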

📊 Output Structure

````json
{
  "task": "Build a FastAPI endpoint for S3 uploads",
  "relevant_context": [
    {
      "source": "https://fastapi.tiangolo.com/tutorial/request-files/",
      "bucket": "critical",
      "relevance_score": 0.66,
      "why_it_matters": "High-relevance documentation chunk",
      "key_detail": "# Request Files\nYou can define files to be uploaded..."
    }
  ],
  "context": {
    "concepts": [...],
    "code_snippets": [...],
    "api_references": [...],
    "best_practices": [...],
    "implementation_patterns": [...],
    "stackoverflow_answers": [...]
  },
  "llm_guidance": {
    "content": "## Task\n- Build a FastAPI endpoint...\n\n## Implementation\n```python\n...\n```",
    "model": "nvidia/nemotron-3-super-120b-a12b:free",
    "tokens_used": 8471,
    "finish_reason": "stop"
  },
  "metrics": {
    "timing": { "total_seconds": 112.24, "search_seconds": 8.51, ... },
    "counts": { "queries": 12, "sources_found": 5, "pages_scraped": 4, ... },
    "relevance_buckets": { "critical_chunks": 11, "helpful_chunks": 5, ... },
    "quality": { "avg_chunk_relevance": 0.462, "avg_snippet_relevance": 0.723 },
    "errors": []
  }
}
````
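A downstream consumer might reduce a record with this shape to just the critical sources. This helper is an illustrative sketch against the schema shown above, not part of the actor:

```python
# Illustrative consumer of the output record: collect the source URLs of
# chunks the actor bucketed as "critical".
def critical_sources(record):
    return [
        chunk["source"]
        for chunk in record.get("relevant_context", [])
        if chunk["bucket"] == "critical"
    ]
```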

⚙️ Configuration

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `task` | string | (required) | Coding task description |
| `max_sources` | integer | 10 | Maximum sources to scrape (3–50) |
| `allowed_domains` | array | `[]` | Domain whitelist (empty = all) |
| `include_github` | boolean | true | Enable GitHub repository mining |
| `include_github_code_search` | boolean | true | Enable authenticated GitHub code search |
| `github_token` | string | null | GitHub token (or `GITHUB_TOKEN` env var) |
| `github_code_languages` | array | `[]` | Target languages for code search |
| `include_stackoverflow` | boolean | true | Enable StackOverflow Q&A mining |
| `max_code_snippets` | integer | 20 | Maximum code snippets to return (1–100) |
| `enable_cache` | boolean | true | Enable caching for faster repeated runs |
| `chunk_size` | integer | 500 | Token limit per LLM chunk (100–2000) |
| `enable_llm_synthesis` | boolean | true | Enable LLM-powered context synthesis |
| `openrouter_api_key` | string | null | OpenRouter API key (or `OPENROUTER_API_KEY` env var) |
| `openrouter_model` | string | `nvidia/nemotron-3-super-120b-a12b:free` | Model ID for LLM synthesis |

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `OPENROUTER_API_KEY` | Yes (for LLM) | OpenRouter API key. Can also be passed as input. |
| `GITHUB_TOKEN` | No | GitHub personal access token for code search. |
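Since each key can arrive either as an input parameter or as an environment variable, resolution presumably gives the explicit input precedence. A minimal sketch of that fallback (the function name is hypothetical):

```python
import os

# Illustrative precedence sketch: an explicit input value wins; otherwise
# fall back to the environment variable named in the table above.
def resolve_key(input_value, env_var="OPENROUTER_API_KEY", env=os.environ):
    return input_value or env.get(env_var)
```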

Set via Apify secrets:

```bash
apify secrets add openrouter_api_key "sk-or-v1-..."
apify secrets add github_token "ghp_..."
```

🏗️ Architecture

Module Structure

```text
src/
├── __main__.py             # Entry point with Pydantic input validation
├── orchestrator.py         # Pipeline coordinator with metrics & error recovery
├── search.py               # DDGS search with query expansion & timeouts
├── github_miner.py         # GitHub repo + code search
├── stackoverflow_miner.py  # StackOverflow Q&A mining
├── crawler.py              # Async HTTP crawler with robots.txt, retry, rate-limit
├── extractor.py            # Content + code extraction (readability + BeautifulSoup)
├── pattern_detector.py     # Implementation pattern detection
├── relevance.py            # Semantic ranking with embeddings + bucketization
├── chunker.py              # LLM-optimized text chunking (tiktoken)
├── deduplicator.py         # Near-duplicate content removal (MinHash)
├── llm_synthesizer.py      # OpenRouter LLM RAG synthesis
├── cache_manager.py        # Apify KV store caching (monotonic TTL)
├── security.py             # Input validation, SSRF protection, secret redaction
├── metrics.py              # Observability and telemetry (dict-based phase tracking)
├── exceptions.py           # Custom exception hierarchy
└── formatter.py            # Final output formatting
```

Pipeline Flow

1. Task Input
2. Task Understanding & Query Expansion (12+ search queries)
3. Multi-Source Discovery (Web + GitHub + StackOverflow)
4. Async Crawling (semaphore-limited, robots.txt-aware)
5. Content Extraction (readability + BeautifulSoup)
6. Code & Pattern Extraction
7. Content Deduplication (MinHash shingling)
8. Semantic Relevance Ranking (sentence-transformers)
9. Relevance Bucketization (Critical / Helpful / Noise)
10. LLM Context Synthesis (NVIDIA Nemotron 3 Super via OpenRouter)
11. Structured Context Output + LLM Guidance + Metrics
12. Caching for Future Runs
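The semaphore-limited crawling step can be sketched with plain `asyncio`; here `fetch` is a stand-in for the real HTTP call and the concurrency limit of 5 is an assumed value:

```python
import asyncio

# Sketch of semaphore-limited concurrent crawling. `fetch` stands in for the
# real HTTP client call; `limit` bounds how many requests run at once.
async def crawl_all(urls, fetch, limit=5):
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```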

🧪 Testing

```bash
# Unit tests (121 tests, runs in ~12s)
pytest -v

# Live regression tests (6 tests, hits real network)
pytest -m live --no-cov -v

# All tests
pytest -m "" -v
```

🛠️ Deployment

Build & Push

```bash
# Initialize (first time only)
git init && git add -A && git commit -m "Initial commit"

# Deploy to Apify (builds Docker image remotely)
apify push
```

Docker Build (local)

```bash
docker build -t ai-context-scraper .
docker run -e OPENROUTER_API_KEY=sk-or-v1-... ai-context-scraper
```

The Dockerfile pre-downloads the sentence-transformers model during build so cold starts are fast.

📈 Performance

  • Async Architecture: Concurrent crawling with semaphore limits
  • Smart Caching: Task-level caching with monotonic-clock TTL (immune to container clock drift)
  • Batch Processing: Embeddings computed in batches for efficiency
  • Rate Limiting: Configurable requests/second with sleep-outside-semaphore optimization
  • Timeout Protection: asyncio.wait_for() on all external calls
  • Broad Error Recovery: Catches 8 exception types without crashing the pipeline
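The monotonic-clock TTL idea mentioned above can be sketched as follows; the class is illustrative (the actor's `cache_manager.py` wraps the Apify KV store instead), but it shows why `time.monotonic()` makes expiry immune to wall-clock adjustments:

```python
import time

# Sketch of a TTL cache keyed on the monotonic clock. time.monotonic() never
# jumps backwards, so expiry is unaffected by container wall-clock drift.
class MonotonicTTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for deterministic testing
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value
```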

🔒 Security

  • SSRF Protection: ipaddress module CIDR checks against all RFC 1918, loopback, link-local, and IPv6 private ranges
  • Input Validation: Pydantic models with strict typing + regex-based injection detection
  • Secret Redaction: Automatic redaction of ghp_*, sk-or-v1-*, Bearer * tokens in logs
  • Content Sanitization: readability-lxml for safe HTML parsing
  • SEO Spam Filtering: Multi-keyword detection (sponsored, affiliate, promo, etc.)
  • Domain Whitelisting: Optional domain restrictions
  • API Key Validation: Format checks with suspicious pattern detection
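Secret redaction along the lines described above might look like the sketch below. The regexes mirror the token shapes named in the list (GitHub `ghp_*`, OpenRouter `sk-or-v1-*`, `Bearer` headers) but are assumptions; the actor's actual patterns may differ:

```python
import re

# Illustrative log-redaction sketch for the token shapes named above.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]+"),        # GitHub personal access tokens
    re.compile(r"sk-or-v1-[A-Za-z0-9-]+"),  # OpenRouter API keys
    re.compile(r"Bearer\s+\S+"),            # Authorization headers
]

def redact(text):
    """Replace any matched secret with a placeholder before logging."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```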

📦 Dependencies

| Package | Purpose |
| --- | --- |
| `apify` | Actor runtime |
| `httpx` | Async HTTP client |
| `beautifulsoup4` | HTML parsing |
| `readability-lxml` | Content extraction |
| `ddgs` | DuckDuckGo search |
| `markdownify` | HTML → Markdown |
| `sentence-transformers` | Semantic embeddings |
| `tiktoken` | Token counting |
| `pydantic` | Input validation |
| `rapidfuzz` | Lexical similarity fallback |

🎯 Use Cases

  • AI Coding Agents: Power your coding agent with real-time context about libraries, patterns, and best practices
  • Developer Copilots: Provide your IDE extension with rich, structured coding context
  • RAG Systems: Build retrieval-augmented generation pipelines with curated developer knowledge
  • Engineering Onboarding: Generate comprehensive learning materials for new team members
  • Code Review Assistance: Fetch implementation patterns and best practices to guide reviews

📝 License

MIT License — see LICENSE file for details.

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.


Built for production use by AI infrastructure teams. Actor ID: 2OBJzyOtx1FyGGt2f | Latest Build: 0.1.16 | Model: NVIDIA Nemotron 3 Super