
# Intelligent Website Scraper

An Apify actor that uses LangChain and an LLM to intelligently scrape and process website content based on high-level goals such as summarization and product, service, or FAQ extraction.
## Features
- Universal Website Scraping: Works on any website, not limited to specific platforms
- Intelligent Content Processing: Uses LangChain + OpenAI to extract and summarize content
- Multiple Task Types: Support for summarization, product extraction, service extraction, and FAQ extraction
- Configurable Crawling: Optional internal link following with depth control
- Clean Content Extraction: Removes scripts, styles, and irrelevant content
## Input

The actor accepts the following input format:

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskType": "extractServices",
  "maxDepth": 1,
  "followInternalLinks": false
}
```
### Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `startUrls` | Array | Yes | - | Array of objects with a `url` property |
| `taskType` | String | No | `summarize` | Type of content processing task |
| `maxDepth` | Number | No | `1` | Maximum depth for internal link following |
| `followInternalLinks` | Boolean | No | `false` | Whether to follow internal links |
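For orientation, here is a minimal sketch of how these parameters could be read and defaulted inside the actor, assuming the Apify SDK (`apify` package); the actual validation in `src/main.js` may differ:

```javascript
import { Actor } from 'apify';

await Actor.init();

// Read the actor input and apply the documented defaults.
const input = (await Actor.getInput()) ?? {};
const {
    startUrls,
    taskType = 'summarize',
    maxDepth = 1,
    followInternalLinks = false,
} = input;

// startUrls is the only required parameter.
if (!Array.isArray(startUrls) || startUrls.length === 0) {
    throw new Error('Input must contain a non-empty "startUrls" array.');
}

console.log(`Processing ${startUrls.length} URL(s) with task "${taskType}" (depth ${maxDepth}, follow links: ${followInternalLinks})`);
```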
### Supported Task Types

| Task Type | Description |
|---|---|
| `summarize` | Summarize entire site content |
| `extractProducts` | Identify and extract product-related sections |
| `extractServices` | Extract service listings or offerings |
| `extractFAQs` | Pull FAQ-like content from the page |
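A natural way to implement these task types is to map each one to an instruction that is prepended to the scraped page text before it is sent to the LLM. The mapping below is illustrative, not the actor's exact prompts:

```javascript
// Illustrative instruction per task type; the actor's real prompts may differ.
const TASK_PROMPTS = {
    summarize: 'Summarize the main content of this web page in a few paragraphs.',
    extractProducts: 'List the products on this page with names, descriptions, and prices where present.',
    extractServices: 'List the services or offerings described on this page.',
    extractFAQs: 'Extract FAQ-style question/answer pairs from this page.',
};

// Unknown task types fall back to summarization, matching the default input value.
const buildPrompt = (taskType, pageText) =>
    `${TASK_PROMPTS[taskType] ?? TASK_PROMPTS.summarize}\n\n---\n\n${pageText}`;
```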
## Output

The actor outputs structured data for each processed URL:

```json
{
  "url": "https://example.com",
  "title": "Example Website",
  "taskType": "extractServices",
  "processedContent": "AI-processed content based on task type...",
  "rawContent": "First 1000 characters of raw content...",
  "scrapedAt": "2024-01-01T00:00:00.000Z",
  "metadata": {
    "wordCount": 1500,
    "linksFound": 25,
    "imagesFound": 10
  }
}
```
## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key for LangChain integration |
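The key is read from the environment when the LangChain chat model is created. A minimal sketch, assuming the `@langchain/openai` package; the model name here is an assumption rather than the actor's configured default:

```javascript
import { ChatOpenAI } from '@langchain/openai';

// Fail fast with a clear message if the key is missing.
if (!process.env.OPENAI_API_KEY) {
    throw new Error('OPENAI_API_KEY environment variable is required for LLM processing.');
}

// ChatOpenAI reads OPENAI_API_KEY from the environment when no key is passed explicitly.
const model = new ChatOpenAI({
    model: 'gpt-4o-mini', // assumption: any chat-capable OpenAI model would work here
    temperature: 0,
});
```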
## Example Usage

### Basic Summarization

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskType": "summarize"
}
```

### Extract Services with Internal Link Following

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "taskType": "extractServices",
  "followInternalLinks": true,
  "maxDepth": 2
}
```

### Extract Products from Multiple URLs

```json
{
  "startUrls": [
    { "url": "https://shop1.com" },
    { "url": "https://shop2.com" }
  ],
  "taskType": "extractProducts"
}
```
## How It Works
- Content Extraction: Uses Puppeteer to load pages and Cheerio to extract clean content
- Intelligent Processing: LangChain processes content based on the specified task type
- Structured Output: Returns processed content with metadata and original URL
- Optional Crawling: Can follow internal links to gather more comprehensive data
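A condensed sketch of that pipeline for a single URL, assuming plain `puppeteer` and `cheerio` usage; the real handler in `src/handlers/websiteScraper.js` is more involved:

```javascript
import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';

// Load a page with Puppeteer and extract cleaned text plus simple metadata with Cheerio.
async function extractCleanContent(url) {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle2' });
        const html = await page.content();

        // Strip scripts, styles, and other non-content elements before extracting text.
        const $ = cheerio.load(html);
        $('script, style, noscript, iframe').remove();

        return {
            title: $('title').text().trim(),
            text: $('body').text().replace(/\s+/g, ' ').trim(),
            linksFound: $('a[href]').length,
            imagesFound: $('img').length,
        };
    } finally {
        await browser.close();
    }
}
```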
## Installation

- Clone this repository
- Install dependencies: `npm install`
- Set your `OPENAI_API_KEY` environment variable
- Run the actor: `npm start`
## Development

- `npm start` - Run the actor
- `npm run format` - Format code with Prettier
- `npm run lint` - Run ESLint
- `npm run lint:fix` - Fix ESLint issues
## Architecture

- `src/main.js` - Main entry point and input validation
- `src/routes.js` - Request routing
- `src/handlers/websiteScraper.js` - Main scraping logic
- `src/services/langchainService.js` - LangChain integration and task processing
- `src/puppeteerLauncher.js` - Puppeteer browser configuration
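Of these, `src/services/langchainService.js` is the piece that turns cleaned page text into the `processedContent` field. Under the same assumptions as the earlier sketches (`@langchain/openai`, instructions keyed by task type), its core probably reduces to something like this, not the file's actual contents:

```javascript
import { ChatOpenAI } from '@langchain/openai';

// Uses OPENAI_API_KEY from the environment by default.
const model = new ChatOpenAI({ temperature: 0 });

// Illustrative instructions per task type (see "Supported Task Types" above).
const INSTRUCTIONS = {
    summarize: 'Summarize the main content of this page.',
    extractProducts: 'List the products described on this page.',
    extractServices: 'List the services or offerings described on this page.',
    extractFAQs: 'Extract FAQ-style question/answer pairs from this page.',
};

// Process page text according to the requested task type and return the model's answer.
export async function processContent(taskType, pageText) {
    const instruction = INSTRUCTIONS[taskType] ?? INSTRUCTIONS.summarize;
    const response = await model.invoke(`${instruction}\n\n${pageText.slice(0, 12000)}`);
    return response.content;
}
```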