Lead Pipeline Scraper
Crawl contacts' websites and extract LLM-ready markdown content for AI-personalized messaging.
A powerful web scraper built with Apify SDK and Crawlee that crawls websites and extracts useful information in LLM-ready format.
Features
- Crawls multiple websites from a list of start URLs
- Organizes data by datasets for easy analysis
- Converts HTML to clean, structured Markdown for LLM consumption
- Removes or preserves links based on your needs
- Cleans up image references for better LLM context
- Type-safe implementation with TypeScript
- Controls crawl depth to prevent excessive requests
- Handles errors gracefully
- Provides detailed logging
LLM-Ready Data Extraction
This scraper is specifically designed to produce data that's optimized for Large Language Models:
- HTML to Markdown Conversion: Uses node-html-markdown to convert HTML content to clean, structured markdown
- Link Handling Options:
  - Removes URLs from links to reduce token usage
  - Converts image links to simple text descriptions
  - Provides both link-free and link-preserved versions
- Noise Reduction: Automatically removes scripts, styles, iframes, and other non-content elements
- Optimized Formatting: Configures markdown output for optimal LLM consumption with consistent formatting
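As a concrete illustration, here is a minimal sketch of this conversion pipeline using node-html-markdown; the specific options and the regex-based link and image stripping are assumptions for illustration, not necessarily the Actor's exact implementation:
```js
import { NodeHtmlMarkdown } from 'node-html-markdown';

// Reusable converter; the option values here are illustrative.
const nhm = new NodeHtmlMarkdown({
    ignore: ['script', 'style', 'iframe', 'noscript'], // drop non-content elements
    keepDataImages: false,                              // skip inline data: images
});

const html = '<h1>Acme Corp</h1><p>We build <a href="https://acme.com/widgets">widgets</a>.</p>';

// Markdown with links preserved (the "markdownWithLinks" style of output)
const markdownWithLinks = nhm.translate(html);

// Link-free variant: turn images into their alt text and strip URLs from links to save tokens
const markdownContent = markdownWithLinks
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, '$1')  // ![alt](src) -> alt
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1');  // [text](url) -> text

console.log(markdownContent); // "# Acme Corp" heading followed by link-free body text
```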
This makes the scraped data ideal for:
- Training or fine-tuning LLMs
- Building knowledge bases for RAG (Retrieval Augmented Generation)
- Creating context for AI customer support agents
- Generating summaries or analyses of web content
Input
The actor accepts the following input parameters:
1{ 2 "startUrls": [ 3 "https://example.com", 4 "https://another-example.com" 5 ], 6 "maxRequestsPerCrawl": 100, 7 "datasetName": "my-dataset" 8}
| Parameter | Type | Description |
|---|---|---|
| `startUrls` | Array | Required. List of URLs to start crawling from. |
| `maxRequestsPerCrawl` | Number | Maximum number of pages to crawl. Default is unlimited. |
| `datasetName` | String | Optional name to identify this dataset of URLs. Default is `"default"`. |
Output
The actor stores the crawled data in the default dataset. Each record contains:
- `url`: The URL of the crawled page
- `title`: The page title
- `metaDescription`: The meta description content
- `h1Text`: The text of the first H1 heading
- `markdownContent`: The page content converted to clean markdown with links simplified
- `markdownWithLinks`: The page content with all links preserved (useful for reference)
- `linkCount`: Number of links found on the page
- `datasetName`: The dataset name provided in the input
- `initialUrl`: The initial URL from which this page was discovered
- `crawledAt`: Timestamp of when the page was crawled
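For reference, a single dataset record might look like this (all field values are illustrative):
```json
{
  "url": "https://example.com/about",
  "title": "About Us | Example",
  "metaDescription": "Learn more about Example and our team.",
  "h1Text": "About Us",
  "markdownContent": "# About Us\n\nWe help companies...",
  "markdownWithLinks": "# About Us\n\nWe help [companies](https://example.com/customers)...",
  "linkCount": 14,
  "datasetName": "my-dataset",
  "initialUrl": "https://example.com",
  "crawledAt": "2025-03-15T12:34:56.789Z"
}
```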
Running Locally
You can run the scraper locally using the provided `local.js` script, which follows Apify's best practices for running actors locally.
Prerequisites
- Node.js 16+
- Apify CLI: `npm install -g apify-cli`
Running with Command Line Arguments
```bash
# Basic usage with URLs
node local.js --urls=https://example.com,https://another.com

# With additional options
node local.js --urls=https://example.com --max=50 --dataset=competitors
```
Running with a JSON Input File
Create a JSON file with your input parameters:
1{ 2 "startUrls": [ 3 "https://example.com", 4 "https://another-example.com" 5 ], 6 "maxRequestsPerCrawl": 100, 7 "datasetName": "my-dataset" 8}
Then run the scraper with:
```bash
node local.js --input=./my-input.json
```
Output Location
The scraped data will be saved to:
`./storage/datasets/{datasetName}/`
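If you want to post-process the results outside the Actor, a minimal sketch for reading a named dataset from disk might look like this (it assumes Crawlee's default local storage layout, where each item is stored as a numbered JSON file):
```js
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Adjust to the datasetName you used; the path follows the layout described above.
const datasetDir = './storage/datasets/my-dataset';

const files = (await readdir(datasetDir)).filter((name) => name.endsWith('.json'));
const items = await Promise.all(
  files.map(async (name) => JSON.parse(await readFile(path.join(datasetDir, name), 'utf8'))),
);

console.log(`Loaded ${items.length} crawled pages`);
```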
Usage Examples
Basic Usage
```js
// Uses the official apify-client package to run the actor programmatically
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const input = {
    startUrls: ['https://example.com'],
    maxRequestsPerCrawl: 50,
    datasetName: 'example',
};

// Run the actor and wait for it to finish
const run = await client.actor('your-username/lead-pipeline-scraper').call(input);

// Get the results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Crawling Multiple Datasets
```js
// First dataset
await client.actor('your-username/lead-pipeline-scraper').call({
    startUrls: ['https://example1.com', 'https://example2.com'],
    maxRequestsPerCrawl: 100,
    datasetName: 'competitors',
});

// Second dataset
await client.actor('your-username/lead-pipeline-scraper').call({
    startUrls: ['https://example3.com', 'https://example4.com'],
    maxRequestsPerCrawl: 100,
    datasetName: 'partners',
});
```
Using the Data with LLMs
```js
// Get the scraped data
const { items } = await client.dataset(run.defaultDatasetId).listItems();

// Use the link-free markdown content as context for an LLM
const pageData = items[0];
const context = `
Title: ${pageData.title}
URL: ${pageData.url}
Content:
${pageData.markdownContent}
`;

// Send to your LLM API of choice (llmClient is a placeholder for your own client)
const response = await llmClient.complete({
    prompt: "Based on the following context, answer the user's question: " + userQuestion,
    context: context,
    max_tokens: 500,
});
```
Limitations
- The actor respects `robots.txt` rules by default
- Crawling is limited to the same domain as the start URL
- Maximum crawl depth is set to 2 levels to prevent excessive crawling
- Only 10 links per page are followed to maintain reasonable crawl sizes
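For context, same-domain crawling, depth limits, and per-page link caps are the kind of constraints Crawlee's `enqueueLinks` is designed to enforce. The sketch below shows that general pattern; the constants and handler body are assumptions for illustration, not this Actor's actual source:
```js
import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 2;       // mirrors the depth limitation above
const LINKS_PER_PAGE = 10; // mirrors the per-page link cap above

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, enqueueLinks }) {
        const depth = request.userData.depth ?? 0;

        // ... extract title, meta description, markdown, etc. here ...

        if (depth < MAX_DEPTH) {
            await enqueueLinks({
                strategy: 'same-domain',        // stay on the start URL's domain
                limit: LINKS_PER_PAGE,          // follow at most 10 links per page
                userData: { depth: depth + 1 }, // track depth for the next level
            });
        }
    },
});

await crawler.run(['https://example.com']);
```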
Development
To run the actor locally:
- Clone the repository
- Install dependencies: `npm install`
- Run the actor: `npm start`
License
This project is licensed under the MIT License - see the LICENSE file for details.