Lead Pipeline Scraper
Crawl contacts' websites and extract LLM-ready markdown content for AI-personalized messaging.
A powerful web scraper built with Apify SDK and Crawlee that crawls websites and extracts useful information in LLM-ready format.
Features
- Crawls multiple websites from a list of start URLs
- Organizes data by datasets for easy analysis
- Converts HTML to clean, structured Markdown for LLM consumption
- Removes or preserves links based on your needs
- Cleans up image references for better LLM context
- Type-safe implementation with TypeScript
- Controls crawl depth to prevent excessive requests
- Handles errors gracefully
- Provides detailed logging
LLM-Ready Data Extraction
This scraper is specifically designed to produce data that's optimized for Large Language Models:
- HTML to Markdown Conversion: Uses node-html-markdown to convert HTML content to clean, structured markdown
- Link Handling Options:
  - Removes URLs from links to reduce token usage
  - Converts image links to simple text descriptions
  - Provides both link-free and link-preserved versions
- Noise Reduction: Automatically removes scripts, styles, iframes, and other non-content elements
- Optimized Formatting: Configures markdown output for optimal LLM consumption with consistent formatting
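As a concrete illustration, here is a minimal sketch of this conversion pipeline using node-html-markdown; the specific options and the regex-based link and image stripping are assumptions for illustration, not necessarily the Actor's exact implementation:
```js
import { NodeHtmlMarkdown } from 'node-html-markdown';

// Reusable converter; the option values here are illustrative.
const nhm = new NodeHtmlMarkdown({
    ignore: ['script', 'style', 'iframe', 'noscript'], // drop non-content elements
    keepDataImages: false,                              // skip inline data: images
});

const html = '<h1>Acme Corp</h1><p>We build <a href="https://acme.com/widgets">widgets</a>.</p>';

// Markdown with links preserved (the "markdownWithLinks" style of output)
const markdownWithLinks = nhm.translate(html);

// Link-free variant: turn images into their alt text and strip URLs from links to save tokens
const markdownContent = markdownWithLinks
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, '$1')  // ![alt](src) -> alt
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1');  // [text](url) -> text

console.log(markdownContent); // "# Acme Corp" heading followed by link-free body text
```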
This makes the scraped data ideal for:
- Training or fine-tuning LLMs
- Building knowledge bases for RAG (Retrieval Augmented Generation)
- Creating context for AI customer support agents
- Generating summaries or analyses of web content
Input
The actor accepts the following input parameters:
1{ 2 "startUrls": [ 3 "https://example.com", 4 "https://another-example.com" 5 ], 6 "maxRequestsPerCrawl": 100, 7 "datasetName": "my-dataset" 8}
| Parameter | Type | Description |
|---|---|---|
| `startUrls` | Array | Required. List of URLs to start crawling from. |
| `maxRequestsPerCrawl` | Number | Maximum number of pages to crawl. Default is unlimited. |
| `datasetName` | String | Optional name to identify this dataset of URLs. Default is `"default"`. |
Output
The actor stores the crawled data in the default dataset. Each record contains:
- `url`: The URL of the crawled page
- `title`: The page title
- `metaDescription`: The meta description content
- `h1Text`: The text of the first H1 heading
- `markdownContent`: The page content converted to clean markdown with links simplified
- `markdownWithLinks`: The page content with all links preserved (useful for reference)
- `linkCount`: Number of links found on the page
- `datasetName`: The dataset name provided in the input
- `initialUrl`: The initial URL from which this page was discovered
- `crawledAt`: Timestamp of when the page was crawled
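For reference, a single dataset record might look like this (all field values are illustrative):
```json
{
  "url": "https://example.com/about",
  "title": "About Us | Example",
  "metaDescription": "Learn more about Example and our team.",
  "h1Text": "About Us",
  "markdownContent": "# About Us\n\nWe help companies...",
  "markdownWithLinks": "# About Us\n\nWe help [companies](https://example.com/customers)...",
  "linkCount": 14,
  "datasetName": "my-dataset",
  "initialUrl": "https://example.com",
  "crawledAt": "2025-03-15T12:34:56.789Z"
}
```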
Running Locally
You can run the scraper locally using the provided `local.js` script, which follows Apify's best practices for running actors locally.
Prerequisites
- Node.js 16+
- Apify CLI: `npm install -g apify-cli`
Running with Command Line Arguments
```bash
# Basic usage with URLs
node local.js --urls=https://example.com,https://another.com

# With additional options
node local.js --urls=https://example.com --max=50 --dataset=competitors
```
Running with a JSON Input File
Create a JSON file with your input parameters:
1{ 2 "startUrls": [ 3 "https://example.com", 4 "https://another-example.com" 5 ], 6 "maxRequestsPerCrawl": 100, 7 "datasetName": "my-dataset" 8}
Then run the scraper with:
```bash
node local.js --input=./my-input.json
```
Output Location
The scraped data will be saved to:
`./storage/datasets/{datasetName}/`
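If you want to post-process the results outside the Actor, a minimal sketch for reading a named dataset from disk might look like this (it assumes Crawlee's default local storage layout, where each item is stored as a numbered JSON file):
```js
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Adjust to the datasetName you used; the path follows the layout described above.
const datasetDir = './storage/datasets/my-dataset';

const files = (await readdir(datasetDir)).filter((name) => name.endsWith('.json'));
const items = await Promise.all(
  files.map(async (name) => JSON.parse(await readFile(path.join(datasetDir, name), 'utf8'))),
);

console.log(`Loaded ${items.length} crawled pages`);
```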
Usage Examples
Basic Usage
```js
// Uses the official apify-client package to run the actor programmatically
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const input = {
    startUrls: ['https://example.com'],
    maxRequestsPerCrawl: 50,
    datasetName: 'example',
};

// Run the actor and wait for it to finish
const run = await client.actor('your-username/lead-pipeline-scraper').call(input);

// Get the results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Crawling Multiple Datasets
```js
// First dataset
await client.actor('your-username/lead-pipeline-scraper').call({
    startUrls: ['https://example1.com', 'https://example2.com'],
    maxRequestsPerCrawl: 100,
    datasetName: 'competitors',
});

// Second dataset
await client.actor('your-username/lead-pipeline-scraper').call({
    startUrls: ['https://example3.com', 'https://example4.com'],
    maxRequestsPerCrawl: 100,
    datasetName: 'partners',
});
```
Using the Data with LLMs
```js
// Get the scraped data
const { items } = await client.dataset(run.defaultDatasetId).listItems();

// Use the link-free markdown content as context for an LLM
const pageData = items[0];
const context = `
Title: ${pageData.title}
URL: ${pageData.url}
Content:
${pageData.markdownContent}
`;

// Send to your LLM API of choice (llmClient is a placeholder for your own client)
const response = await llmClient.complete({
    prompt: "Based on the following context, answer the user's question: " + userQuestion,
    context: context,
    max_tokens: 500,
});
```
Limitations
- The actor respects `robots.txt` rules by default
- Crawling is limited to the same domain as the start URL
- Maximum crawl depth is set to 2 levels to prevent excessive crawling
- Only 10 links per page are followed to maintain reasonable crawl sizes
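For context, same-domain crawling, depth limits, and per-page link caps are the kind of constraints Crawlee's `enqueueLinks` is designed to enforce. The sketch below shows that general pattern; the constants and handler body are assumptions for illustration, not this Actor's actual source:
```js
import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 2;       // mirrors the depth limitation above
const LINKS_PER_PAGE = 10; // mirrors the per-page link cap above

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, enqueueLinks }) {
        const depth = request.userData.depth ?? 0;

        // ... extract title, meta description, markdown, etc. here ...

        if (depth < MAX_DEPTH) {
            await enqueueLinks({
                strategy: 'same-domain',        // stay on the start URL's domain
                limit: LINKS_PER_PAGE,          // follow at most 10 links per page
                userData: { depth: depth + 1 }, // track depth for the next level
            });
        }
    },
});

await crawler.run(['https://example.com']);
```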
Development
To run the actor locally:
- Clone the repository
- Install dependencies: `npm install`
- Run the actor: `npm start`
License
This project is licensed under the MIT License - see the LICENSE file for details.