Intelligent Website Scrapper

Developed by

HappiTap

Maintained by Community

An intelligent website scraper that uses LangChain and an LLM to extract and process content based on high-level goals such as summarization, product extraction, service extraction, and FAQ extraction.

Pricing: Pay per event
Total users: 5
Monthly users: 5
Runs succeeded: >99%
Last modified: 8 days ago

Intelligent Website Scraper

An Apify actor that uses LangChain and an LLM to scrape and process website content based on high-level goals.

Features

  • Universal Website Scraping: Works on any website, not limited to specific platforms
  • Intelligent Content Processing: Uses LangChain + OpenAI to extract and summarize content
  • Multiple Task Types: Supports summarization, product extraction, service extraction, and FAQ extraction
  • Configurable Crawling: Optional internal link following with depth control
  • Clean Content Extraction: Removes scripts, styles, and irrelevant content

Input

The actor accepts the following input format:

{
    "startUrls": [
        { "url": "https://example.com" }
    ],
    "taskType": "extractServices",
    "maxDepth": 1,
    "followInternalLinks": false
}

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrls | Array | Yes | - | Array of objects with url property |
| taskType | String | No | summarize | Type of content processing task |
| maxDepth | Number | No | 1 | Maximum depth for internal link following |
| followInternalLinks | Boolean | No | false | Whether to follow internal links |
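
The parameters and defaults listed above could be applied and checked along these lines. This is a minimal sketch, not the actor's actual code: `normalizeInput` and `TASK_TYPES` are hypothetical names, and the rule that `maxDepth` must be a non-negative integer is an assumption.

```javascript
// Hypothetical helper: the actor's real validation in src/main.js may differ.
const TASK_TYPES = ['summarize', 'extractProducts', 'extractServices', 'extractFAQs'];

function normalizeInput(input) {
    const {
        startUrls,
        taskType = 'summarize',      // documented default
        maxDepth = 1,                // documented default
        followInternalLinks = false, // documented default
    } = input ?? {};

    // startUrls is the only required field.
    if (!Array.isArray(startUrls) || startUrls.length === 0) {
        throw new Error('startUrls must be a non-empty array of { url } objects');
    }
    for (const entry of startUrls) {
        if (typeof entry?.url !== 'string') {
            throw new Error('each startUrls entry needs a string "url" property');
        }
        new URL(entry.url); // throws a TypeError on a malformed URL
    }
    if (!TASK_TYPES.includes(taskType)) {
        throw new Error(`taskType must be one of: ${TASK_TYPES.join(', ')}`);
    }
    // Assumption: maxDepth is a non-negative integer.
    if (!Number.isInteger(maxDepth) || maxDepth < 0) {
        throw new Error('maxDepth must be a non-negative integer');
    }
    if (typeof followInternalLinks !== 'boolean') {
        throw new Error('followInternalLinks must be a boolean');
    }
    return { startUrls, taskType, maxDepth, followInternalLinks };
}
```

Calling `normalizeInput({ startUrls: [{ url: "https://example.com" }] })` fills in the three optional fields with their defaults, while an input without `startUrls` is rejected before any page is loaded.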

Supported Task Types

| Task Type | Description |
|---|---|
| summarize | Summarize entire site content |
| extractProducts | Identify and extract product-related sections |
| extractServices | Extract service listings or offerings |
| extractFAQs | Pull FAQ-like content from the page |
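
One plausible way to turn a task type into an LLM request is a lookup from task type to an instruction string. The sketch below is illustrative only: `TASK_INSTRUCTIONS` and `buildPrompt` are hypothetical names, and the instruction wording is an assumption rather than the prompts the actor actually sends.

```javascript
// Illustrative instructions: the wording is an assumption, not the actor's real prompts.
const TASK_INSTRUCTIONS = new Map([
    ['summarize', 'Summarize the content of this website.'],
    ['extractProducts', 'Identify and list the products described on this page.'],
    ['extractServices', 'List the services or offerings described on this page.'],
    ['extractFAQs', 'Extract question-and-answer pairs from any FAQ-like content.'],
]);

// Combine the task instruction with the cleaned page text into a single prompt.
function buildPrompt(taskType, pageText) {
    const instruction = TASK_INSTRUCTIONS.get(taskType);
    if (instruction === undefined) {
        throw new Error(`Unsupported taskType: ${taskType}`);
    }
    return `${instruction}\n\nPage content:\n${pageText}`;
}
```

A `Map` is used instead of a plain object so that inherited keys such as `constructor` are never mistaken for a supported task type.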

Output

The actor outputs structured data for each processed URL:

{
    "url": "https://example.com",
    "title": "Example Website",
    "taskType": "extractServices",
    "processedContent": "AI-processed content based on task type...",
    "rawContent": "First 1000 characters of raw content...",
    "scrapedAt": "2024-01-01T00:00:00.000Z",
    "metadata": {
        "wordCount": 1500,
        "linksFound": 25,
        "imagesFound": 10
    }
}
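
To make the shape of this record concrete, here is a minimal sketch of how it might be assembled from a page's cleaned text. `buildRecord` and its parameter names are hypothetical, and counting words over the full text (rather than the truncated `rawContent`) is an assumption.

```javascript
// Hypothetical helper that assembles one output record per processed URL.
function buildRecord({ url, title, taskType, processedContent, text, linksFound, imagesFound }) {
    // Split on whitespace; filter(Boolean) drops the empty string produced by empty input.
    const words = text.trim().split(/\s+/).filter(Boolean);
    return {
        url,
        title,
        taskType,
        processedContent,
        rawContent: text.slice(0, 1000),     // only the first 1000 characters are kept
        scrapedAt: new Date().toISOString(), // e.g. "2024-01-01T00:00:00.000Z"
        metadata: {
            wordCount: words.length,         // assumption: counted over the full text
            linksFound,
            imagesFound,
        },
    };
}
```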

Environment Variables

| Variable | Required | Description |
|---|---|---|
| OPENAI_API_KEY | Yes | Your OpenAI API key for LangChain integration |
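
Because the key is required, it can help to fail fast at startup instead of on the first LLM call. A minimal sketch, assuming a hypothetical `requireEnv` helper (the actor may handle this differently):

```javascript
// Hypothetical helper: return the variable's value or throw a descriptive error.
// The env parameter defaults to process.env and can be replaced in tests.
function requireEnv(name, env = process.env) {
    const value = env[name];
    if (typeof value !== 'string' || value.trim() === '') {
        throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
}

// Typical use at startup:
// const apiKey = requireEnv('OPENAI_API_KEY');
```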

Example Usage

Basic Summarization

{
    "startUrls": [
        { "url": "https://example.com" }
    ],
    "taskType": "summarize"
}

Extract Services with Internal Link Following

{
    "startUrls": [
        { "url": "https://example.com" }
    ],
    "taskType": "extractServices",
    "followInternalLinks": true,
    "maxDepth": 2
}

Extract Products from Multiple URLs

{
    "startUrls": [
        { "url": "https://shop1.com" },
        { "url": "https://shop2.com" }
    ],
    "taskType": "extractProducts"
}

How It Works

  1. Content Extraction: Uses Puppeteer to load pages and Cheerio to extract clean content
  2. Intelligent Processing: LangChain processes content based on the specified task type
  3. Structured Output: Returns processed content with metadata and original URL
  4. Optional Crawling: Can follow internal links to gather more comprehensive data
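
The optional crawling step can be pictured as a filter that decides which links on a page get queued next. The sketch below is an illustration under stated assumptions, not the actor's code: `selectLinksToFollow` is a hypothetical name, start URLs are assumed to be at depth 0, "internal" is assumed to mean "same hostname", and `maxDepth` is assumed to be the deepest level that may be queued.

```javascript
// Hypothetical helper: pick the links found on one page that should be crawled next.
function selectLinksToFollow(pageUrl, hrefs, depth, { maxDepth = 1, followInternalLinks = false } = {}) {
    // Crawling is opt-in, and pages already at maxDepth queue nothing further.
    if (!followInternalLinks || depth >= maxDepth) return [];

    const base = new URL(pageUrl);
    const self = new URL(pageUrl);
    self.hash = '';

    const selected = new Set(); // de-duplicates links that differ only by fragment
    for (const href of hrefs) {
        let target;
        try {
            target = new URL(href, base); // resolves relative links against the page URL
        } catch {
            continue; // skip hrefs that cannot be parsed
        }
        if (target.protocol !== 'http:' && target.protocol !== 'https:') continue; // mailto:, tel:, ...
        if (target.hostname !== base.hostname) continue; // external link
        target.hash = ''; // "#section" anchors point at the same document
        if (target.href === self.href) continue; // link back to the current page
        selected.add(target.href);
    }
    return [...selected];
}
```

With `followInternalLinks` left at its default of `false`, the function always returns an empty list, which matches the documented default of scraping only the start URLs.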

Installation

  1. Clone this repository
  2. Install dependencies: npm install
  3. Set your OPENAI_API_KEY environment variable
  4. Run the actor: npm start

Development

  • npm start - Run the actor
  • npm run format - Format code with Prettier
  • npm run lint - Run ESLint
  • npm run lint:fix - Fix ESLint issues

Architecture

  • src/main.js - Main entry point and input validation
  • src/routes.js - Request routing
  • src/handlers/websiteScraper.js - Main scraping logic
  • src/services/langchainService.js - LangChain integration and task processing
  • src/puppeteerLauncher.js - Puppeteer browser configuration