
Intelligent Website Scraper
An Apify actor that uses LangChain and an LLM to intelligently scrape and process website content based on high-level goals.
Features
- Universal Website Scraping: Works on any website, not limited to specific platforms
- Intelligent Content Processing: Uses LangChain + OpenAI to extract and summarize content
- Multiple Task Types: Support for summarization, product extraction, service extraction, and FAQ extraction
- Configurable Crawling: Optional internal link following with depth control
- Clean Content Extraction: Removes scripts, styles, and irrelevant content
Input
The actor accepts the following input format:
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskType": "extractServices",
  "maxDepth": 1,
  "followInternalLinks": false
}
Input Parameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
startUrls | Array | Yes | - | Array of objects with url property |
taskType | String | No | summarize | Type of content processing task |
maxDepth | Number | No | 1 | Maximum depth for internal link following |
followInternalLinks | Boolean | No | false | Whether to follow internal links |
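For illustration, a minimal sketch of how these parameters could be read and defaulted inside the actor, assuming the Apify JS SDK (Actor.getInput); the actual validation in src/main.js may differ:

```javascript
import { Actor } from 'apify';

await Actor.init();

// Read the actor input and fall back to the documented defaults.
const {
    startUrls,
    taskType = 'summarize',
    maxDepth = 1,
    followInternalLinks = false,
} = (await Actor.getInput()) ?? {};

if (!startUrls?.length) {
    throw new Error('At least one entry in "startUrls" is required.');
}
```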
Supported Task Types
Task Type | Description |
---|---|
summarize | Summarize entire site content |
extractProducts | Identify and extract product-related sections |
extractServices | Extract service listings or offerings |
extractFAQs | Pull FAQ-like content from the page |
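To make the task types concrete, here is a hypothetical mapping from taskType to an LLM instruction; the real prompts live in src/services/langchainService.js and may be worded differently:

```javascript
// Hypothetical prompt templates, one per supported taskType (for illustration only).
const TASK_PROMPTS = {
    summarize: 'Summarize the main content of this page in a few short paragraphs.',
    extractProducts: 'List the products on this page with names, descriptions, and prices if present.',
    extractServices: 'List the services or offerings described on this page.',
    extractFAQs: 'Extract question/answer pairs that look like FAQs from this page.',
};
```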
Output
The actor outputs structured data for each processed URL:
{
  "url": "https://example.com",
  "title": "Example Website",
  "taskType": "extractServices",
  "processedContent": "AI-processed content based on task type...",
  "rawContent": "First 1000 characters of raw content...",
  "scrapedAt": "2024-01-01T00:00:00.000Z",
  "metadata": {
    "wordCount": 1500,
    "linksFound": 25,
    "imagesFound": 10
  }
}
Environment Variables
Variable | Required | Description |
---|---|---|
OPENAI_API_KEY | Yes | Your OpenAI API key for LangChain integration |
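The key is read from the environment at runtime, so set it in the Apify Console or your shell before starting a run. A minimal sketch of a fail-fast check, assuming the LangChain service reads process.env.OPENAI_API_KEY:

```javascript
// Fail fast if the OpenAI key is missing; LangChain calls would otherwise fail mid-run.
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) {
    throw new Error('OPENAI_API_KEY is not set. Provide it as an environment variable before running the actor.');
}
```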
Example Usage
Basic Summarization
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskType": "summarize"
}
Extract Services with Internal Link Following
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskType": "extractServices",
  "followInternalLinks": true,
  "maxDepth": 2
}
Extract Products from Multiple URLs
{
  "startUrls": [
    { "url": "https://shop1.com" },
    { "url": "https://shop2.com" }
  ],
  "taskType": "extractProducts"
}
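Runs can also be started programmatically with the Apify API client; a minimal sketch using apify-client for Node.js (the actor ID below is a placeholder, substitute the real one from the Store):

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start a run with one of the example inputs and wait for it to finish.
const run = await client.actor('<username>/intelligent-website-scraper').call({
    startUrls: [{ url: 'https://example.com' }],
    taskType: 'extractServices',
});

// Read the structured results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```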
How It Works
- Content Extraction: Uses Puppeteer to load pages and Cheerio to extract clean content
- Intelligent Processing: LangChain processes content based on the specified task type
- Structured Output: Returns processed content with metadata and original URL
- Optional Crawling: Can follow internal links to gather more comprehensive data
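A compressed sketch of this flow, assuming Crawlee's PuppeteerCrawler and Cheerio as described above; processWithLangChain is a hypothetical stand-in for the LangChain service, and the real handler in src/handlers/websiteScraper.js is more elaborate:

```javascript
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';
import * as cheerio from 'cheerio';

await Actor.init();

const taskType = 'summarize';        // normally taken from actor input
const followInternalLinks = false;   // normally taken from actor input

// Hypothetical stand-in for src/services/langchainService.js.
async function processWithLangChain(text, task) {
    return `(${task}) ${text.slice(0, 200)}...`;
}

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        // 1. Content extraction: load the rendered HTML and clean it with Cheerio.
        const $ = cheerio.load(await page.content());
        $('script, style, noscript').remove();
        const text = $('body').text().replace(/\s+/g, ' ').trim();

        // 2. Intelligent processing: hand the cleaned text to the LangChain service.
        const processedContent = await processWithLangChain(text, taskType);

        // 3. Structured output: push one record per processed URL.
        await Actor.pushData({
            url: request.url,
            title: await page.title(),
            taskType,
            processedContent,
            scrapedAt: new Date().toISOString(),
        });

        // 4. Optional crawling: follow internal links when enabled.
        if (followInternalLinks) await enqueueLinks({ strategy: 'same-domain' });
    },
});

await crawler.run([{ url: 'https://example.com' }]);
await Actor.exit();
```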
Installation
- Clone this repository
- Install dependencies: `npm install`
- Set your `OPENAI_API_KEY` environment variable
- Run the actor: `npm start`
Development
- `npm start` - Run the actor
- `npm run format` - Format code with Prettier
- `npm run lint` - Run ESLint
- `npm run lint:fix` - Fix ESLint issues
Architecture
- `src/main.js` - Main entry point and input validation
- `src/routes.js` - Request routing
- `src/handlers/websiteScraper.js` - Main scraping logic
- `src/services/langchainService.js` - LangChain integration and task processing
- `src/puppeteerLauncher.js` - Puppeteer browser configuration