Pricing

Pay per event

Intelligent Website Scraper

Developed by

HappiTap

An intelligent website scraper that uses LangChain and an LLM to extract and process content based on high-level goals such as summarization, product extraction, service extraction, and FAQ extraction.

Last modified

a month ago

Agents

Developer tools

Intelligent Website Scraper

An Apify actor that uses LangChain and an LLM to scrape and process website content based on high-level goals.

Features

Universal Website Scraping: Works on any website, not limited to specific platforms
Intelligent Content Processing: Uses LangChain + OpenAI to extract and summarize content
Multiple Task Types: Support for summarization, product extraction, service extraction, and FAQ extraction
Configurable Crawling: Optional internal link following with depth control
Clean Content Extraction: Removes scripts, styles, and irrelevant content

Input

The actor accepts the following input format:

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "taskType": "extractServices",
  "maxDepth": 1,
  "followInternalLinks": false
}

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `startUrls` | Array | Yes | - | Array of objects with `url` property |
| `taskType` | String | No | `summarize` | Type of content processing task |
| `maxDepth` | Number | No | `1` | Maximum depth for internal link following |
| `followInternalLinks` | Boolean | No | `false` | Whether to follow internal links |

Supported Task Types

| Task Type | Description |
| --- | --- |
| `summarize` | Summarize entire site content |
| `extractProducts` | Identify and extract product-related sections |
| `extractServices` | Extract service listings or offerings |
| `extractFAQs` | Pull FAQ-like content from the page |
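
The defaults and task types listed above can be applied in a small validation step before any scraping starts. The following is a minimal sketch, not the actor's actual code: `normalizeInput`, `INPUT_DEFAULTS`, and `SUPPORTED_TASK_TYPES` are hypothetical names, and rejecting an empty `startUrls` array is an assumption.

```javascript
// Defaults and task types copied from the tables above.
// All identifiers in this block are hypothetical, for illustration only.
const INPUT_DEFAULTS = {
  taskType: 'summarize',
  maxDepth: 1,
  followInternalLinks: false,
};

const SUPPORTED_TASK_TYPES = [
  'summarize',
  'extractProducts',
  'extractServices',
  'extractFAQs',
];

function normalizeInput(input) {
  const { startUrls } = input ?? {};
  // Assumption: an empty array is treated the same as a missing one.
  if (!Array.isArray(startUrls) || startUrls.length === 0) {
    throw new Error('startUrls is required and must be a non-empty array');
  }
  for (const entry of startUrls) {
    // new URL() throws a TypeError when the value is not an absolute URL.
    new URL(entry?.url);
  }

  // Fields supplied by the caller override the defaults.
  const normalized = { ...INPUT_DEFAULTS, ...input };
  if (!SUPPORTED_TASK_TYPES.includes(normalized.taskType)) {
    throw new Error(`Unsupported taskType: ${normalized.taskType}`);
  }
  return normalized;
}
```

With this sketch, an input that contains only `startUrls` comes back with `taskType` set to `summarize`, `maxDepth` set to `1`, and `followInternalLinks` set to `false`.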

Output

The actor outputs structured data for each processed URL:

{
  "url": "https://example.com",
  "title": "Example Website",
  "taskType": "extractServices",
  "processedContent": "AI-processed content based on task type...",
  "rawContent": "First 1000 characters of raw content...",
  "scrapedAt": "2024-01-01T00:00:00.000Z",
  "metadata": {
    "wordCount": 1500,
    "linksFound": 25,
    "imagesFound": 10
  }
}
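
To make the shape of this record concrete, here is a hedged sketch of how one might be assembled from already-extracted page data. `buildRecord` and `RAW_PREVIEW_LENGTH` are hypothetical names. Counting words over the full text instead of the 1000-character preview is an assumption, as is measuring the preview in JavaScript string units.

```javascript
// Hypothetical helper; the field names follow the output example above.
const RAW_PREVIEW_LENGTH = 1000;

function buildRecord({ url, title, taskType, processedContent, text, linksFound, imagesFound }) {
  // Split on runs of whitespace; filter(Boolean) drops the empty string
  // that split() yields for empty input.
  const words = text.trim().split(/\s+/).filter(Boolean);
  return {
    url,
    title,
    taskType,
    processedContent,
    // Only the first 1000 characters (UTF-16 code units) of raw text are kept.
    rawContent: text.slice(0, RAW_PREVIEW_LENGTH),
    // ISO 8601 timestamp in UTC, e.g. 2024-01-01T00:00:00.000Z
    scrapedAt: new Date().toISOString(),
    metadata: { wordCount: words.length, linksFound, imagesFound },
  };
}
```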

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `OPENAI_API_KEY` | Yes | Your OpenAI API key for LangChain integration |
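
Because the key is required, it can help to fail fast with a clear message when it is missing, before any page is loaded. A minimal sketch, assuming a hypothetical `requireEnv` helper that is not part of the actor's documented code:

```javascript
// Hypothetical helper: returns the variable's value or throws if it is unset or empty.
// The env parameter defaults to process.env and exists only to make the function testable.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage at startup:
// const apiKey = requireEnv('OPENAI_API_KEY');
```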

Example Usage

Basic Summarization

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "taskType": "summarize"
}

Extract Services with Internal Link Following

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "taskType": "extractServices",
  "followInternalLinks": true,
  "maxDepth": 2
}

Extract Products from Multiple URLs

{
  "startUrls": [
    { "url": "https://shop1.com" },
    { "url": "https://shop2.com" }
  ],
  "taskType": "extractProducts"
}

How It Works

Content Extraction: Uses Puppeteer to load pages and Cheerio to extract clean content
Intelligent Processing: LangChain processes content based on the specified task type
Structured Output: Returns processed content with metadata and original URL
Optional Crawling: Can follow internal links to gather more comprehensive data
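
The optional crawling step can be illustrated with the built-in `URL` class alone. This is a sketch under stated assumptions, not the actor's implementation: `internalLinks` and `nextRequests` are hypothetical names, "internal" is taken to mean same origin, and start URLs are assumed to sit at depth 0 so that links are followed only while the current depth is below `maxDepth`.

```javascript
// Resolve hrefs against the page URL and keep only same-origin links.
function internalLinks(pageUrl, hrefs) {
  const origin = new URL(pageUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    let resolved;
    try {
      resolved = new URL(href, pageUrl); // handles relative and absolute hrefs
    } catch {
      continue; // skip malformed hrefs
    }
    resolved.hash = ''; // fragments point at the same document
    if (resolved.origin === origin) seen.add(resolved.href);
  }
  return [...seen]; // de-duplicated, in first-seen order
}

// Decide which links to enqueue from a page scraped at the given depth.
function nextRequests(pageUrl, depth, hrefs, { followInternalLinks, maxDepth }) {
  if (!followInternalLinks || depth >= maxDepth) return [];
  return internalLinks(pageUrl, hrefs).map((url) => ({ url, depth: depth + 1 }));
}
```

Under these assumptions, the default `followInternalLinks: false` yields no extra requests, and `maxDepth: 2` reaches pages up to two links away from a start URL.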

Installation

Clone this repository
Install dependencies: `npm install`
Set your `OPENAI_API_KEY` environment variable
Run the actor: `npm start`

Development

`npm start` - Run the actor
`npm run format` - Format code with Prettier
`npm run lint` - Run ESLint
`npm run lint:fix` - Fix ESLint issues

Architecture

`src/main.js` - Main entry point and input validation
`src/routes.js` - Request routing
`src/handlers/websiteScraper.js` - Main scraping logic
`src/services/langchainService.js` - LangChain integration and task processing
`src/puppeteerLauncher.js` - Puppeteer browser configuration
