
Website Title Crawler — Best-in-class web crawler

Crawl any website and extract page titles and URLs with one actor. Three engines (Cheerio · JSDOM · Playwright), robots.txt support, configurable timeouts and retries, crawl depth and scrapedAt in every output row, live status during runs, and optional URL globs plus a Playwright wait-for-selector. Ideal for sitemaps, SEO checks, and link inventories.

What this actor does

  • Crawler type – Cheerio (fast, static), JSDOM (light JS), or Playwright (full browser); see the sketch after this list.
  • Polite crawling – Respect robots.txt, configurable request timeout and retries.
  • Live progress – Status message updates as pages are crawled.
  • Rich output – Each row: title, url, depth (crawl depth from start URL), scrapedAt (ISO timestamp), and error when failed.
  • URL globs – Optionally only follow links matching patterns (e.g. **/blog/**).
  • Playwright – Optional “wait for selector” before extracting (for JS-heavy pages).
  • Crawl presets – Quick (10), Standard (100), Large (1000), or Custom.
  • User charge limit – Stops the crawl when the run’s max result charge limit is reached (pay-per-event).
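
A minimal sketch of how these options can map onto Crawlee (illustrative only, not the actor's actual src/main.ts; the preset number, glob, and start URL are placeholders taken from the list above):

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,        // "Standard" preset (100 pages)
    maxConcurrency: 10,              // input: Max concurrency
    requestHandlerTimeoutSecs: 30,   // input: Request timeout (seconds)
    maxRequestRetries: 2,            // input: Max request retries
    async requestHandler({ request, $, enqueueLinks }) {
        const depth = (request.userData.depth as number | undefined) ?? 0;
        await Dataset.pushData({
            title: $('title').text().trim() || '(no title)',
            url: request.loadedUrl ?? request.url,   // final URL after redirects
            depth,
            scrapedAt: new Date().toISOString(),
        });
        await enqueueLinks({
            strategy: 'same-domain',   // input: Limit to same domain
            globs: ['**/blog/**'],     // input: URL globs (example pattern)
            userData: { depth: depth + 1 },
        });
    },
    async failedRequestHandler({ request }, error) {
        // Failed pages still get a row, with the error field set.
        await Dataset.pushData({
            url: request.url,
            depth: (request.userData.depth as number | undefined) ?? 0,
            scrapedAt: new Date().toISOString(),
            error: error.message,
        });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();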

Input (run configuration)

  • Crawler type – Cheerio = fastest, static HTML only. JSDOM = light JavaScript. Playwright = full browser, any JS-heavy site.
  • Start URLs – One or more URLs where the crawl starts. The crawler follows links from these pages.
  • Crawl size – Quick (10 pages), Standard (100), Large (1000), or Custom (use Max requests below).
  • Max requests (custom only) – Used only when Crawl size is Custom. Maximum number of pages to crawl.
  • Use proxy – Use Apify proxy to rotate IPs and reduce blocking. Turn off for quick local tests.
  • Max concurrency – How many pages to fetch in parallel (1–50).
  • Limit to same domain – Only follow links on the same domain as the start URLs. Recommended for focused crawls.
  • Respect robots.txt – Skip URLs disallowed by the site’s robots.txt. Recommended for polite crawling.
  • Request timeout (seconds) – Max seconds to wait for each page load.
  • Max request retries – How many times to retry a failed request before recording an error.
  • URL globs (optional) – Only follow links matching these patterns (e.g. **/blog/**). Empty = all links (within the same domain if enabled).
  • Playwright: wait for selector – Optional CSS selector to wait for before extracting (Playwright only).
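
A hedged example of starting a run via the Apify API client. The input keys below are assumptions mirroring the form labels above (the authoritative names live in .actor/input_schema.json), and the actor ID is a placeholder:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Hypothetical input: keys mirror the form labels, not a verified schema.
const run = await client.actor('<username>/<actor-name>').call({
    crawlerType: 'cheerio',
    startUrls: [{ url: 'https://example.com' }],
    crawlSize: 'standard',        // Quick | Standard | Large | Custom
    useProxy: true,
    maxConcurrency: 10,
    limitToSameDomain: true,
    respectRobotsTxt: true,
    requestTimeoutSecs: 30,
    maxRequestRetries: 2,
    urlGlobs: ['**/blog/**'],
});
console.log(`Results in dataset: ${run.defaultDatasetId}`);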

Output

The actor writes one row per page to the run dataset:

  • title – Page <title> (or (no title) if missing)
  • url – Final URL (after redirects)
  • depth – Crawl depth from start URL (0 = start page)
  • scrapedAt – ISO timestamp when the page was scraped
  • error – Set only when the request failed (e.g. timeout, block)

In Apify Console you can view, filter, and export the dataset (JSON, CSV, etc.).
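
In TypeScript terms, each dataset row has this shape (illustrative, matching the field list above):

interface TitleRow {
    title: string;      // page <title>, or "(no title)" if missing
    url: string;        // final URL after redirects
    depth: number;      // 0 = start page
    scrapedAt: string;  // ISO timestamp
    error?: string;     // present only when the request failed
}

// Example row:
const row: TitleRow = {
    title: 'Example Domain',
    url: 'https://example.com/',
    depth: 0,
    scrapedAt: '2025-01-01T12:00:00.000Z',
};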

Pricing and cost (per 1,000 results)

For the maximum cost at highest capacity (Playwright + proxy + large crawl) and the recommended price per 1,000 results, see ./COST-ESTIMATE.md.
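
As a rough guide at the listed rate of $40.00 per 1,000 results, a Standard crawl (100 pages) works out to about $4.00 and a Large crawl (1,000 pages) to about $40.00, assuming each crawled page yields one charged result.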

Quick start

pnpm install
apify run

Push changes to your Actor on Apify

  1. Install Apify CLI (if needed): npm install -g apify-cli
  2. Log in: apify login (opens browser; use your Apify account).
  3. Link the project (first time only): from the actor folder run apify init and follow the prompts to link to an existing actor or create one.
  4. Push: apify push — builds the Docker image, pushes it to Apify, and updates the actor.

Your code and .actor/ config (input schema, dataset schema, etc.) are uploaded; the actor on Apify Console will use the new version on the next run. To run locally first: apify run.
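
The full sequence, condensed (the same commands as in the steps above):

npm install -g apify-cli
apify login
apify init
apify push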

Project structure

.actor/
├── actor.json # Actor config: name, version, runtime settings
├── dataset_schema.json # How dataset output is displayed in Console
├── input_schema.json # Input validation & run form (presets, options)
└── output_schema.json # Where the Actor stores its output
src/
└── main.ts # Actor entry point and crawler logic

See Actor definition for details.

Tech stack

  • Apify SDK – storage, input, proxy, lifecycle
  • Crawlee – CheerioCrawler, JSDOMCrawler, PlaywrightCrawler
  • Cheerio – fast HTML parsing (no browser)
  • JSDOM – DOM API, light JS execution
  • Playwright – headless browser for JS-heavy sites
  • Proxy – optional IP rotation

Resources