Lead Pipeline Scraper

maxforbang/lead-pipeline-scraper

Crawl contacts' websites and extract LLM-ready markdown content for AI-personalized messaging.

A web scraper built with the Apify SDK and Crawlee that crawls websites and extracts their content in an LLM-ready Markdown format.

Features

  • Crawls multiple websites from a list of start URLs
  • Organizes data by datasets for easy analysis
  • Converts HTML to clean, structured Markdown for LLM consumption
  • Removes or preserves links based on your needs
  • Cleans up image references for better LLM context
  • Type-safe implementation with TypeScript
  • Controls crawl depth to prevent excessive requests
  • Handles errors gracefully
  • Provides detailed logging

LLM-Ready Data Extraction

This scraper is specifically designed to produce data that's optimized for Large Language Models:

  • HTML to Markdown Conversion: Uses node-html-markdown to convert HTML content to clean, structured markdown
  • Link Handling Options:
    • Removes URLs from links to reduce token usage
    • Converts image links to simple text descriptions
    • Provides both link-free and link-preserved versions
  • Noise Reduction: Automatically removes scripts, styles, iframes, and other non-content elements
  • Optimized Formatting: Configures markdown output for optimal LLM consumption with consistent formatting
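
As an illustrative sketch (not the Actor's actual source), the HTML-to-Markdown step could look roughly like the following, assuming Cheerio is used to strip non-content elements before node-html-markdown translates the rest:

import * as cheerio from 'cheerio';
import { NodeHtmlMarkdown } from 'node-html-markdown';

// Strip non-content elements, then convert the remaining HTML to Markdown
function htmlToLlmMarkdown(html) {
  const $ = cheerio.load(html);
  $('script, style, iframe, noscript').remove();
  return NodeHtmlMarkdown.translate($.html());
}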

This makes the scraped data ideal for:

  • Training or fine-tuning LLMs
  • Building knowledge bases for RAG (Retrieval Augmented Generation)
  • Creating context for AI customer support agents
  • Generating summaries or analyses of web content

Input

The actor accepts the following input parameters:

{
  "startUrls": [
    "https://example.com",
    "https://another-example.com"
  ],
  "maxRequestsPerCrawl": 100,
  "datasetName": "my-dataset"
}
Parameter | Type | Description
startUrls | Array | Required. List of URLs to start crawling from.
maxRequestsPerCrawl | Number | Maximum number of pages to crawl. Default is unlimited.
datasetName | String | Optional name to identify this dataset of URLs. Default is "default".

Output

The actor stores the crawled data in the default dataset. Each record contains:

  • url: The URL of the crawled page
  • title: The page title
  • metaDescription: The meta description content
  • h1Text: The text of the first H1 heading
  • markdownContent: The page content converted to clean markdown with links simplified
  • markdownWithLinks: The page content with all links preserved (useful for reference)
  • linkCount: Number of links found on the page
  • datasetName: The dataset name provided in the input
  • initialUrl: The initial URL from which this page was discovered
  • crawledAt: Timestamp of when the page was crawled
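
For illustration only, a single record might look like this (all values below are hypothetical):

{
  "url": "https://example.com/about",
  "title": "About Us | Example",
  "metaDescription": "Learn more about Example Inc.",
  "h1Text": "About Us",
  "markdownContent": "# About Us\n\nExample Inc. builds widgets for ...",
  "markdownWithLinks": "# About Us\n\n[Example Inc.](https://example.com) builds widgets for ...",
  "linkCount": 24,
  "datasetName": "my-dataset",
  "initialUrl": "https://example.com",
  "crawledAt": "2025-03-15T12:34:56.789Z"
}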

Running Locally

You can run the scraper locally with the provided local.js script, which follows Apify's best practices for local Actor runs.

Prerequisites

  • Node.js 16+
  • Apify CLI: npm install -g apify-cli

Running with Command Line Arguments

# Basic usage with URLs
node local.js --urls=https://example.com,https://another.com

# With additional options
node local.js --urls=https://example.com --max=50 --dataset=competitors

Running with a JSON Input File

Create a JSON file with your input parameters:

{
  "startUrls": [
    "https://example.com",
    "https://another-example.com"
  ],
  "maxRequestsPerCrawl": 100,
  "datasetName": "my-dataset"
}

Then run the scraper with:

node local.js --input=./my-input.json

Output Location

The scraped data will be saved to:

./storage/datasets/{datasetName}/
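
If you want to load those results in your own script, a minimal sketch (assuming Apify's default local storage layout, where each dataset item is written as a separate numbered JSON file, and a hypothetical dataset name "my-dataset") could look like this:

import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Directory for the dataset name used in the input
const datasetDir = './storage/datasets/my-dataset';

// Read every stored item and parse it
const files = (await readdir(datasetDir)).filter((f) => f.endsWith('.json'));
const items = await Promise.all(
  files.map(async (f) => JSON.parse(await readFile(path.join(datasetDir, f), 'utf8')))
);

console.log(`Loaded ${items.length} records`);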

Usage Examples

Basic Usage

import { ApifyClient } from 'apify-client';

// Initialize the Apify API client with your token
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const input = {
  startUrls: ['https://example.com'],
  maxRequestsPerCrawl: 50,
  datasetName: 'example'
};

// Run the Actor and wait for it to finish
const run = await client.actor('maxforbang/lead-pipeline-scraper').call(input);

// Fetch the results from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Crawling Multiple Datasets

// Reuses the ApifyClient instance from the example above

// First dataset
await client.actor('maxforbang/lead-pipeline-scraper').call({
  startUrls: ['https://example1.com', 'https://example2.com'],
  maxRequestsPerCrawl: 100,
  datasetName: 'competitors'
});

// Second dataset
await client.actor('maxforbang/lead-pipeline-scraper').call({
  startUrls: ['https://example3.com', 'https://example4.com'],
  maxRequestsPerCrawl: 100,
  datasetName: 'partners'
});

Using the Data with LLMs

// Get the scraped data (using the `client` and `run` from the Basic Usage example)
const { items } = await client.dataset(run.defaultDatasetId).listItems();

// Use the link-free markdown content as context for an LLM
const pageData = items[0];
const context = `
Title: ${pageData.title}
URL: ${pageData.url}
Content:
${pageData.markdownContent}
`;

// Send to your LLM API of choice (llmClient is a placeholder for your own client)
const response = await llmClient.complete({
  prompt: "Based on the following context, answer the user's question: " + userQuestion,
  context: context,
  max_tokens: 500
});

Limitations

  • The actor respects robots.txt rules by default
  • Crawling is limited to the same domain as the start URL
  • Maximum crawl depth is set to 2 levels to prevent excessive crawling
  • Only 10 links per page are followed to maintain reasonable crawl sizes
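
These limits roughly correspond to constraints that Crawlee exposes out of the box. A minimal sketch of how such limits can be expressed (not the Actor's actual source; the depth tracking via userData is an assumption) might be:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 100,
  async requestHandler({ request, enqueueLinks }) {
    const depth = request.userData.depth ?? 0;
    // Follow at most 10 same-domain links per page, up to 2 levels deep
    if (depth < 2) {
      await enqueueLinks({
        strategy: 'same-domain',
        limit: 10,
        userData: { depth: depth + 1 },
      });
    }
  },
});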

Development

To run the actor locally:

  1. Clone the repository
  2. Install dependencies: npm install
  3. Run the actor: npm start

License

This project is licensed under the MIT License - see the LICENSE file for details.