
Lead Pipeline Scraper
Crawl contacts' websites and extract LLM-ready markdown content for AI-personalized messaging.
A powerful web scraper built with the Apify SDK and Crawlee that crawls websites and extracts useful information in an LLM-ready format.
Features
- Crawls multiple websites from a list of start URLs
- Organizes data by datasets for easy analysis
- Converts HTML to clean, structured Markdown for LLM consumption
- Removes or preserves links based on your needs
- Cleans up image references for better LLM context
- Type-safe implementation with TypeScript
- Controls crawl depth to prevent excessive requests
- Handles errors gracefully
- Provides detailed logging
LLM-Ready Data Extraction
This scraper is specifically designed to produce data that's optimized for Large Language Models:
- HTML to Markdown Conversion: Uses node-html-markdown to convert HTML content to clean, structured markdown
- Link Handling Options:
  - Removes URLs from links to reduce token usage
  - Converts image links to simple text descriptions
  - Provides both link-free and link-preserved versions
- Noise Reduction: Automatically removes scripts, styles, iframes, and other non-content elements
- Optimized Formatting: Configures markdown output for optimal LLM consumption with consistent formatting
This makes the scraped data ideal for:
- Training or fine-tuning LLMs
- Building knowledge bases for RAG (Retrieval Augmented Generation)
- Creating context for AI customer support agents
- Generating summaries or analyses of web content
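To illustrate the conversion step, here is a minimal sketch, not the actor's actual source: it uses node-html-markdown (mentioned above) with default options, and the toLlmMarkdown helper name and the link-stripping regexes are purely illustrative.

```javascript
// Sketch only: convert raw HTML to markdown, then derive a link-free variant
// to reduce token usage for LLM prompts.
import { NodeHtmlMarkdown } from 'node-html-markdown';

const nhm = new NodeHtmlMarkdown();

function toLlmMarkdown(html) {
  // Full conversion with links preserved (the "markdownWithLinks" style output)
  const markdownWithLinks = nhm.translate(html);

  // Link-free variant: replace images with a short text description and
  // collapse [text](url) links down to their text.
  const markdownContent = markdownWithLinks
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, (_, alt) => (alt ? `[Image: ${alt}]` : ''))
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1');

  return { markdownContent, markdownWithLinks };
}
```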
Input
The actor accepts the following input parameters:
```json
{
  "startUrls": [
    "https://example.com",
    "https://another-example.com"
  ],
  "maxRequestsPerCrawl": 100,
  "datasetName": "my-dataset"
}
```
| Parameter | Type | Description |
|---|---|---|
| startUrls | Array | Required. List of URLs to start crawling from. |
| maxRequestsPerCrawl | Number | Maximum number of pages to crawl. Default is unlimited. |
| datasetName | String | Optional name to identify this dataset of URLs. Default is "default". |
Output
The actor stores the crawled data in the default dataset. Each record contains:
- url: The URL of the crawled page
- title: The page title
- metaDescription: The meta description content
- h1Text: The text of the first H1 heading
- markdownContent: The page content converted to clean markdown with links simplified
- markdownWithLinks: The page content with all links preserved (useful for reference)
- linkCount: Number of links found on the page
- datasetName: The dataset name provided in the input
- initialUrl: The initial URL from which this page was discovered
- crawledAt: Timestamp of when the page was crawled
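For example, a single record might look like this (illustrative values only):

```json
{
  "url": "https://example.com/about",
  "title": "About Us",
  "metaDescription": "Learn more about Example Inc.",
  "h1Text": "About Us",
  "markdownContent": "# About Us\n\nExample Inc. builds...",
  "markdownWithLinks": "# About Us\n\n[Example Inc.](https://example.com) builds...",
  "linkCount": 12,
  "datasetName": "my-dataset",
  "initialUrl": "https://example.com",
  "crawledAt": "2024-01-01T12:00:00.000Z"
}
```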
Running Locally
You can run the scraper locally using the provided local.js script, which follows Apify's best practices for running actors locally.
Prerequisites
- Node.js 16+
- Apify CLI: npm install -g apify-cli
Running with Command Line Arguments
```bash
# Basic usage with URLs
node local.js --urls=https://example.com,https://another.com

# With additional options
node local.js --urls=https://example.com --max=50 --dataset=competitors
```
Running with a JSON Input File
Create a JSON file with your input parameters:
```json
{
  "startUrls": [
    "https://example.com",
    "https://another-example.com"
  ],
  "maxRequestsPerCrawl": 100,
  "datasetName": "my-dataset"
}
```
Then run the scraper with:
```bash
node local.js --input=./my-input.json
```
Output Location
The scraped data will be saved to:
./storage/datasets/{datasetName}/
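If you want to load those items back into a script, a minimal sketch follows. It assumes the default Crawlee/Apify local storage layout of one JSON file per dataset item and a dataset named my-dataset; adjust the path to match your run.

```javascript
// Sketch: read locally stored dataset items back into memory.
// Run as an ES module (e.g. a .mjs file) since it uses top-level await.
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

const dir = './storage/datasets/my-dataset';
const files = (await readdir(dir)).filter((f) => f.endsWith('.json'));
const items = await Promise.all(
  files.map(async (f) => JSON.parse(await readFile(path.join(dir, f), 'utf8')))
);
console.log(`Loaded ${items.length} scraped pages`);
```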
Usage Examples
Basic Usage
```javascript
const input = {
  "startUrls": ["https://example.com"],
  "maxRequestsPerCrawl": 50,
  "datasetName": "example"
};

// Run the actor
const run = await apify.call('your-username/lead-pipeline-scraper', input);

// Get the results
const dataset = await apify.client.dataset(run.defaultDatasetId).listItems();
console.log(dataset.items);
```
Crawling Multiple Datasets
```javascript
// First dataset
await apify.call('your-username/lead-pipeline-scraper', {
  "startUrls": ["https://example1.com", "https://example2.com"],
  "maxRequestsPerCrawl": 100,
  "datasetName": "competitors"
});

// Second dataset
await apify.call('your-username/lead-pipeline-scraper', {
  "startUrls": ["https://example3.com", "https://example4.com"],
  "maxRequestsPerCrawl": 100,
  "datasetName": "partners"
});
```
Using the Data with LLMs
```javascript
// Get the scraped data
const dataset = await apify.client.dataset(run.defaultDatasetId).listItems();

// Use the link-free markdown content as context for an LLM
const pageData = dataset.items[0];
const context = `
Title: ${pageData.title}
URL: ${pageData.url}
Content:
${pageData.markdownContent}
`;

// Send to your LLM API of choice
const response = await llmClient.complete({
  prompt: "Based on the following context, answer the user's question: " + userQuestion,
  context: context,
  max_tokens: 500
});
```
Limitations
- The actor respects robots.txt rules by default
- Crawling is limited to the same domain as the start URL
- Maximum crawl depth is set to 2 levels to prevent excessive crawling
- Only 10 links per page are followed to maintain reasonable crawl sizes
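For context, a rough sketch of how such limits are commonly enforced with Crawlee is shown below. This is not the actor's actual source; it assumes a CheerioCrawler and the depth/link values listed above.

```javascript
// Sketch: same-domain crawling, depth cap of 2, at most 10 followed links per page.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, enqueueLinks }) {
    const depth = request.userData.depth ?? 0;

    // ... extract page data and push it to the dataset here ...

    if (depth < 2) {
      await enqueueLinks({
        strategy: 'same-domain', // stay on the start URL's domain
        limit: 10,               // follow at most 10 links per page
        userData: { depth: depth + 1 },
      });
    }
  },
});
```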
Development
To run the actor locally:
- Clone the repository
- Install dependencies: npm install
- Run the actor: npm start
License
This project is licensed under the MIT License - see the LICENSE file for details.
Pricing
Pricing model: Pay per usage. This Actor is paid per platform usage: the Actor itself is free to use, and you only pay for the Apify platform usage.