Ai Powered Scraper

Pricing

$10.00 / 1,000 results

Developed by

Dev with Bobby

Maintained by Community

AI Powered Scraper using LangChain and OpenAI.

Last modified

3 days ago

AI Powered Scraper using LangChain and OpenAI

Intelligent web scraping that answers questions about crawled content using advanced AI

This Actor crawls websites and then answers questions about the content it collects. It uses LangChain.js and OpenAI to build a question-answering pipeline over the crawled pages.

What it does

  1. Smart Web Crawling - Scrapes websites with multiple crawler types and respects robots.txt
  2. Content Vectorization - Converts web content into searchable vector embeddings using OpenAI
  3. Intelligent Caching - Stores vector indices to speed up subsequent runs on the same content
  4. AI-Powered Q&A - Answers questions about the scraped content using OpenAI's language models
  5. Source Citations - Provides references to original sources for all answers
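
As a rough illustration of steps 2, 4, and 5, the sketch below ranks pages against a question by cosine similarity and returns their URLs as sources. It is not the Actor's code: the bag-of-words `embed` function is a toy stand-in for OpenAI embeddings, a linear scan stands in for the HNSWLib index, and the example URLs and helper names are assumptions made for illustration.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, pages: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Return the k page URLs most similar to the query, best match first."""
    q = embed(query)
    scored = [(url, cosine(q, embed(text))) for url, text in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical crawled pages, keyed by URL so each hit doubles as a citation.
pages = {
    "https://example.com/pricing": "Pricing plans start at ten dollars per month",
    "https://example.com/about": "Our team builds tools for developers",
    "https://example.com/docs": "Install guide for command line tools",
}
top = retrieve("What do the pricing plans cost?", pages, k=2)
```

In a full pipeline, the retrieved passages would be passed to the language model as context, and their URLs would be reported alongside the answer.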

Key Features

Advanced Crawling Options

  • Multiple Crawler Types: Choose between adaptive switching, raw HTTP (Cheerio), headless browser (Playwright), or experimental JavaScript rendering (JSDOM)
  • Sitemap Integration: Automatically discover and load URLs from sitemap.xml files
  • Robots.txt Compliance: Respects website crawling restrictions
  • Request Control: Configurable delays and retry logic to avoid overwhelming servers
  • Custom User Agents: Set custom identification for your crawler
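
To make the sitemap and robots.txt options concrete, here is a minimal sketch that uses only the Python standard library: it reads URLs from a sitemap document and drops those that robots.txt disallows for a given user agent. The sample robots.txt, the sample sitemap, the user agent string, and the helper names are all assumptions for illustration; the Actor's own implementation may differ.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and sitemap.xml contents for example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private/report</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns) if loc.text]


def allowed_urls(urls, robots_txt, user_agent):
    """Keep only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


discovered = sitemap_urls(SITEMAP_XML)
crawlable = allowed_urls(discovered, ROBOTS_TXT, "my-crawler/1.0")
```

Here `crawlable` keeps the home page and `/docs` but drops the URL under `/private/`.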

AI-Powered Analysis

  • Question Answering: Ask any question about the crawled content
  • Source Attribution: Get citations for where information was found
  • Context-Aware: Retrieves the most relevant crawled passages by vector similarity search before generating an answer
  • Caching System: Reuses processed content for faster subsequent queries
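
One way such a cache could work is sketched below. How the Actor actually keys and stores its vector indices is not documented here, so the key derivation (a hash of the normalized start URLs), the in-memory store, and the names `cache_key` and `IndexCache` are assumptions.

```python
import hashlib


def cache_key(start_urls):
    """Derive a stable key from the start URLs (order and trailing slashes ignored)."""
    normalized = sorted(u.strip().rstrip("/") for u in start_urls)
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
    return digest[:16]


class IndexCache:
    """In-memory stand-in for a persistent store of vector indices."""

    def __init__(self):
        self._store = {}

    def get_or_build(self, start_urls, build, force_recrawl=False):
        """Return the cached index, rebuilding on a miss or when forced."""
        key = cache_key(start_urls)
        if force_recrawl or key not in self._store:
            self._store[key] = build()
        return self._store[key]


builds = []


def build_index():
    """Pretend to crawl and vectorize; records each rebuild."""
    builds.append("built")
    return {"build_number": len(builds)}


cache = IndexCache()
first = cache.get_or_build(["https://example.com/"], build_index)
second = cache.get_or_build(["https://example.com"], build_index)  # cache hit
third = cache.get_or_build(["https://example.com"], build_index, force_recrawl=True)
```

The second call reuses the first index because both URL lists normalize to the same key; the third call rebuilds because a re-crawl is forced.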

Input Configuration

Required Settings

  • Start URLs: One or more websites to crawl
  • OpenAI API Key: Your OpenAI API key, used for embeddings and the language model
  • Query: The question you want to ask about the crawled content

Advanced Options

  • Max Pages: Limit the number of pages to crawl (default: 3)
  • Force Re-crawl: Ignore cached data and crawl fresh content
  • Load URLs from Sitemaps: Automatically discover pages via sitemap.xml
  • Respect robots.txt: Honor website crawling restrictions (recommended)
  • Crawler Type: Choose your preferred crawling method
  • Request Delay: Time between requests in milliseconds
  • Max Retries: Number of retry attempts for failed requests
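
The settings above could be assembled into a run input along these lines. This is a sketch only: the key names (`startUrls`, `openaiApiKey`, `query`, `maxPages`, `respectRobotsTxt`) are hypothetical, so check the Actor's input schema for the real field names. Only the default of 3 for Max Pages comes from the description above.

```python
from __future__ import annotations


def build_input(start_urls: list[str], openai_api_key: str, query: str,
                max_pages: int = 3, **advanced) -> dict:
    """Assemble a run input; every key name here is hypothetical."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if not openai_api_key:
        raise ValueError("an OpenAI API key is required")
    if not query.strip():
        raise ValueError("a query is required")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    run_input = {
        "startUrls": [{"url": u} for u in start_urls],
        "openaiApiKey": openai_api_key,
        "query": query.strip(),
        "maxPages": max_pages,
    }
    # Optional settings are passed through unchanged.
    run_input.update(advanced)
    return run_input


example = build_input(
    ["https://example.com"],
    "YOUR_OPENAI_API_KEY",
    "What does the product cost?",
    respectRobotsTxt=True,
)
```

Validating the three required settings before starting a run avoids paying for a crawl that cannot produce an answer.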

Perfect For

  • Research: Gather and analyze information from multiple web sources
  • Content Analysis: Ask specific questions about website content
  • Competitive Intelligence: Analyze competitor websites and documentation
  • Knowledge Base Creation: Build searchable knowledge from web content
  • Due Diligence: Research companies, products, or topics across multiple sources

Output Format

The Actor provides structured results including:

  • Question: Your original query
  • Answer: AI-generated response based on crawled content
  • Sources: List of web pages with URLs, titles, and relevant excerpts
  • Metadata: Total documents processed and scraping timestamp
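
A result with these four parts could be consumed as in the sketch below. The record is placeholder data, and its key names (`question`, `answer`, `sources`, `metadata`, and the nested fields) are assumed from the list above rather than taken from the Actor's actual output.

```python
# Placeholder record; the field names are assumptions, not the Actor's schema.
result = {
    "question": "What does the product cost?",
    "answer": "Plans start at ten dollars per month.",
    "sources": [
        {
            "url": "https://example.com/pricing",
            "title": "Pricing",
            "excerpt": "Plans start at ten dollars per month.",
        }
    ],
    "metadata": {"totalDocuments": 3, "scrapedAt": "2024-01-01T00:00:00Z"},
}


def format_answer(record):
    """Render a result as plain text with numbered source citations."""
    lines = [f"Q: {record['question']}", f"A: {record['answer']}", "Sources:"]
    for number, source in enumerate(record["sources"], start=1):
        lines.append(f"  [{number}] {source['title']} <{source['url']}>")
    return "\n".join(lines)


report = format_answer(result)
```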

Getting Started

  1. Set up OpenAI: Get your API key from the OpenAI Platform
  2. Configure Input: Add your target URLs and question
  3. Choose Settings: Select crawler type and other preferences
  4. Run Actor: Start crawling and get AI-powered answers

Performance Tips

  • Use the adaptive crawler for the best balance of speed and compatibility
  • Enable sitemap loading for comprehensive website coverage
  • Set appropriate request delays to respect server limits
  • Use force re-crawl only when content has significantly changed
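
To show how a request delay and a retry limit interact, here is a minimal sketch of a retry loop with a fixed pause between attempts. The function name, its parameters, and the fixed-delay policy are assumptions for illustration; the Actor's crawler may schedule retries differently.

```python
import time


def fetch_with_retries(fetch, url, max_retries=2, delay_ms=500, sleep=time.sleep):
    """Call fetch(url), retrying on OSError up to max_retries times.

    Waits delay_ms milliseconds before each retry. The sleep function is
    injectable so the loop can be exercised without actually waiting.
    """
    attempts = 0
    while True:
        try:
            return fetch(url)
        except OSError:
            if attempts >= max_retries:
                raise
            attempts += 1
            sleep(delay_ms / 1000)


# Simulated flaky server: fails twice, then succeeds.
failures_left = [2]
pauses = []


def flaky_fetch(url):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


body = fetch_with_retries(flaky_fetch, "https://example.com", sleep=pauses.append)
```

With `max_retries=2`, the call survives two failures, pausing half a second before each retry, and raises the error if a third failure occurs.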

Privacy & Ethics

  • Respects robots.txt by default
  • Configurable request delays to avoid server overload
  • No data retention beyond your Apify account
  • Transparent source attribution in all results

Technical Details

Built with:

  • LangChain.js - AI application framework
  • OpenAI - Embeddings and language models
  • Apify SDK - Web scraping infrastructure
  • HNSWLib - Efficient vector similarity search
