# Website Content Crawler (`novashieldai/website-content-crawler`) Actor

Universal website crawler that extracts clean text/markdown content, metadata, links, and images from any URL. Features sitemap parsing, robots.txt respect, and multi-page BFS crawling with depth control.

- **URL**: https://apify.com/novashieldai/website-content-crawler.md
- **Developed by:** [Ali haydar Karadaş](https://apify.com/novashieldai) (community)
- **Categories:** Developer tools, AI
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Website Content Crawler

Website Content Crawler extracts clean text and markdown content from any website, along with metadata, links, and images. Whether you need to scrape a single page, crawl an entire site, or parse a sitemap, this Actor handles it with minimal setup.

### What does Website Content Crawler do?

This Actor provides four endpoints that cover different content extraction needs. **Crawl Page** scrapes a single URL and returns the page content, metadata, links, and images. **Crawl Site** follows links from a starting URL and crawls multiple pages up to a configurable depth and page limit. **Get Sitemap** parses a site's sitemap.xml and returns all listed URLs with their last modified dates, change frequencies, and priorities. **Extract Content** pulls just the main content from a page in either plain text or markdown format.

The crawler respects robots.txt by default (configurable), extracts Open Graph and meta tags, identifies internal vs. external links, and captures image alt text and dimensions. Output is clean and structured -- ready for AI training data, content analysis, SEO audits, or database storage.
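As a rough illustration of what respecting robots.txt involves, the sketch below checks URLs against an in-memory rules file with Python's standard `urllib.robotparser`. The rules, the user agent name `example-bot`, and the helper `is_allowed` are assumptions made for this example. How the Actor itself performs the check is not documented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A real crawler would download it from
# https://<host>/robots.txt before requesting any page on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-bot" is a made-up user agent used only for this illustration.
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/blog/post"))
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/report"))
```

Setting `respect_robots` to `false` in the Actor input corresponds to skipping a check of this kind.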

### What data do you get?

**Page content:**
- **url**, **title**, **description**
- **text_content** -- extracted plain text
- **markdown_content** -- content converted to markdown
- **author**, **published_date**, **language**
- **word_count**, **char_count**

**Page metadata:**
- **status_code**, **content_type**, **response_time_ms**
- **canonical_url**, **og_tags**, **meta_tags**

**Links found on page:**
- **url**, **text**, **is_internal**, **is_nofollow**

**Images found on page:**
- **url**, **alt_text**, **width**, **height**

**Sitemap data:**
- **url**, **lastmod**, **changefreq**, **priority**

**Crawl summary:**
- **start_url**, **pages_crawled**, **total_links**
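To make the sitemap fields listed above concrete, here is a minimal sketch that reads a sitemap.xml document with Python's standard library and produces records with the same four keys (`url`, `lastmod`, `changefreq`, `priority`). The sample XML and the helper `parse_sitemap` are assumptions for illustration. The Actor's own parser and the exact value types it returns may differ.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry; optional fields default to None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", NS):
        entries.append({
            "url": url_el.findtext("sm:loc", default=None, namespaces=NS),
            "lastmod": url_el.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url_el.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": url_el.findtext("sm:priority", default=None, namespaces=NS),
        })
    return entries


for entry in parse_sitemap(SAMPLE_SITEMAP):
    print(entry)
```

Only `loc` is mandatory in the sitemap protocol, so downstream code should expect the other three fields to be missing for some URLs.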

### Who is this for?

- **AI and ML engineers** -- collect training data from websites in clean text or markdown format
- **SEO professionals** -- audit site structure, meta tags, internal linking, and content quality
- **Content analysts** -- extract and compare content across competitor websites
- **Researchers** -- build text corpora from web sources for academic or commercial analysis
- **Developers** -- integrate website content extraction into pipelines, chatbots, or knowledge bases

### How to use it

1. Open the Actor in Apify Console and select an endpoint (crawl_page, crawl_site, get_sitemap, or extract_content).
2. Enter the URL you want to crawl or extract content from.
3. For crawl_site, set the crawl depth and page limit.
4. Click "Start" to run the crawler.
5. Export results as JSON from the Dataset tab or use the Apify API.

### Input parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| endpoint | string | crawl_page | API endpoint: crawl_page, crawl_site, get_sitemap, or extract_content |
| url | string | -- | The URL to crawl or extract content from (required) |
| depth | integer | 1 | Maximum crawl depth, 1-5 (crawl_site only) |
| limit | integer | 10 | Maximum number of pages to crawl, 1-100 (crawl_site only) |
| output_format | string | text | Output format for extract_content: text or markdown |
| respect_robots | boolean | true | Whether to respect robots.txt rules |

### Sample output

```json
{
  "url": "https://example.com/blog/intro-to-web-scraping",
  "content": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "title": "Introduction to Web Scraping",
    "description": "A beginner's guide to web scraping with Python",
    "text_content": "Web scraping is the process of extracting data from websites...",
    "markdown_content": "# Introduction to Web Scraping\n\nWeb scraping is the process...",
    "author": "Jane Smith",
    "published_date": "2026-05-10",
    "language": "en",
    "word_count": 1245,
    "char_count": 7830
  },
  "metadata": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "status_code": 200,
    "content_type": "text/html",
    "response_time_ms": 234.5,
    "canonical_url": "https://example.com/blog/intro-to-web-scraping",
    "og_tags": {
      "og:title": "Introduction to Web Scraping",
      "og:type": "article"
    },
    "meta_tags": {
      "description": "A beginner's guide to web scraping with Python"
    }
  },
  "links": [
    {
      "url": "https://example.com/blog/advanced-scraping",
      "text": "Advanced Scraping Techniques",
      "is_internal": true,
      "is_nofollow": false
    }
  ],
  "images": [
    {
      "url": "https://example.com/images/scraping-diagram.png",
      "alt_text": "Web scraping workflow diagram",
      "width": 800,
      "height": 450
    }
  ]
}
````
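The following sketch shows one way to post-process an item shaped like the sample above, for example to separate internal from external links and to flag images without alt text. `summarize_item` and the trimmed `SAMPLE_ITEM` are assumptions for illustration. The sketch only relies on the keys shown in the sample output.

```python
# Trimmed copy of the sample output above, kept small for readability.
SAMPLE_ITEM = {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "content": {"title": "Introduction to Web Scraping", "word_count": 1245},
    "links": [
        {"url": "https://example.com/blog/advanced-scraping",
         "text": "Advanced Scraping Techniques",
         "is_internal": True, "is_nofollow": False},
    ],
    "images": [
        {"url": "https://example.com/images/scraping-diagram.png",
         "alt_text": "Web scraping workflow diagram",
         "width": 800, "height": 450},
    ],
}


def summarize_item(item: dict) -> dict:
    """Collect a few audit-friendly facts from one crawl_page result."""
    content = item.get("content", {})
    links = item.get("links", [])
    images = item.get("images", [])
    return {
        "url": item.get("url"),
        "title": content.get("title"),
        "word_count": content.get("word_count", 0),
        "internal_links": [l["url"] for l in links if l.get("is_internal")],
        "external_links": [l["url"] for l in links if not l.get("is_internal")],
        "images_missing_alt": [i["url"] for i in images if not i.get("alt_text")],
    }


print(summarize_item(SAMPLE_ITEM))
```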

### How much does it cost?

Each result costs **$0.002**. Crawling 1,000 pages costs just $2, and 10,000 pages costs $20.

Apify gives every new user $5 in free monthly credits, so you can crawl about 2,500 pages for free.
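The arithmetic behind these figures can be reproduced with a short sketch. It assumes the per-result price and the free credit quoted above apply to your account, and that each crawled page produces exactly one result. `estimate_cost` and `pages_covered` are hypothetical helper names.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.002")   # USD per result, as quoted above
FREE_MONTHLY_CREDIT = Decimal("5")    # USD, as quoted above


def estimate_cost(pages: int) -> Decimal:
    """Cost in USD for `pages` results at the quoted per-result price."""
    return PRICE_PER_RESULT * pages


def pages_covered(budget: Decimal) -> int:
    """Number of whole results a budget in USD pays for."""
    return int(budget // PRICE_PER_RESULT)


print(estimate_cost(1_000))                # cost of 1,000 pages
print(estimate_cost(10_000))               # cost of 10,000 pages
print(pages_covered(FREE_MONTHLY_CREDIT))  # pages covered by the free credit
```

`Decimal` is used instead of floats so that small per-result prices do not accumulate rounding error.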

### Common questions

**Can I get the content in markdown format?**
Yes. Use the extract\_content endpoint and set output\_format to "markdown". The crawl\_page endpoint also returns markdown\_content alongside plain text by default.

**Does it follow links across different domains?**
The crawl\_site endpoint only follows internal links (same domain). External links are captured in the output but not followed. This prevents the crawl from spiraling across the entire web.
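The behavior described here, a breadth-first crawl that stays on one domain and stops at a depth and page limit, can be sketched as follows. This is not the Actor's source code. The in-memory `LINKS` graph stands in for real HTTP requests, "same domain" is approximated by comparing hostnames, and treating the start page as depth 0 is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph used instead of fetching real pages.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.example.org/x",   # external: recorded by a crawler, not followed
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}


def is_internal(start_url: str, candidate: str) -> bool:
    """Treat a link as internal when its hostname matches the start URL's."""
    return urlparse(start_url).netloc == urlparse(candidate).netloc


def crawl_site(start_url: str, depth: int = 1, limit: int = 10) -> list[str]:
    """Return pages in BFS order, following internal links only."""
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])       # (url, distance from start page)
    while queue and len(visited) < limit:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue                      # depth limit reached: do not expand
        for link in LINKS.get(url, []):
            if link not in seen and is_internal(start_url, link):
                seen.add(link)
                queue.append((link, level + 1))
    return visited


print(crawl_site("https://example.com/", depth=2, limit=10))
```

The `seen` set prevents the same page from being queued twice, and the `limit` check bounds the total work even on sites with many links.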

**Does it handle JavaScript-rendered pages?**
The crawler works with server-rendered HTML. Pages that require JavaScript execution to load content may return incomplete results. For heavy SPA sites, consider using a browser-based crawler instead.

### Contact & Custom Solutions

Need a custom scraper, higher volume, or a specific integration? We're here to help.

If anything isn't working right or you need support, don't hesitate to reach out.

- Telegram: [t.me/novashield\_dev](https://t.me/novashield_dev)
- Email: novashield.dev@gmail.com

# Actor input Schema

## `endpoint` (type: `string`):

API endpoint to call

## `url` (type: `string`):

The URL to crawl or extract content from

## `depth` (type: `integer`):

Maximum crawl depth (1-5). Only used with crawl\_site endpoint.

## `limit` (type: `integer`):

Maximum number of pages to crawl (1-100). Only used with crawl\_site endpoint.

## `output_format` (type: `string`):

Output format for extract\_content endpoint

## `respect_robots` (type: `boolean`):

Whether to respect robots.txt rules

## Actor input object example

```json
{
  "endpoint": "crawl_page",
  "url": "https://example.com",
  "depth": 1,
  "limit": 10,
  "output_format": "text",
  "respect_robots": true
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "https://example.com"
};

// Run the Actor and wait for it to finish
const run = await client.actor("novashieldai/website-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "url": "https://example.com" }

# Run the Actor and wait for it to finish
run = client.actor("novashieldai/website-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": "https://example.com"
}' |
apify call novashieldai/website-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=novashieldai/website-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Crawler",
        "description": "Universal website crawler that extracts clean text/markdown content, metadata, links, and images from any URL. Features sitemap parsing, robots.txt respect, and multi-page BFS crawling with depth control.",
        "version": "1.0",
        "x-build-id": "k9DdopmvVb0OBdUny"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/novashieldai~website-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-novashieldai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/novashieldai~website-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-novashieldai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/novashieldai~website-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-novashieldai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "endpoint": {
                        "title": "Endpoint",
                        "enum": [
                            "crawl_page",
                            "crawl_site",
                            "get_sitemap",
                            "extract_content"
                        ],
                        "type": "string",
                        "description": "API endpoint to call",
                        "default": "crawl_page"
                    },
                    "url": {
                        "title": "URL",
                        "type": "string",
                        "description": "The URL to crawl or extract content from"
                    },
                    "depth": {
                        "title": "Crawl Depth",
                        "minimum": 1,
                        "maximum": 5,
                        "type": "integer",
                        "description": "Maximum crawl depth (1-5). Only used with crawl_site endpoint.",
                        "default": 1
                    },
                    "limit": {
                        "title": "Page Limit",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl (1-100). Only used with crawl_site endpoint.",
                        "default": 10
                    },
                    "output_format": {
                        "title": "Output Format",
                        "enum": [
                            "text",
                            "markdown"
                        ],
                        "type": "string",
                        "description": "Output format for extract_content endpoint",
                        "default": "text"
                    },
                    "respect_robots": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "Whether to respect robots.txt rules",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
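For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be reached with Python's standard library alone. The sketch below is a minimal illustration. `build_request` and `run_sync` are hypothetical helper names, the timeout value is arbitrary, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "novashieldai~website-content-crawler"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build a POST request for the run-sync-get-dataset-items endpoint."""
    # The specification above passes the API token as a query parameter.
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300):
    """Send the request and decode the body (assumed to be JSON)."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without sending it; call run_sync(...) with a real
# token to actually start a run.
req = build_request("<YOUR_API_TOKEN>", {"url": "https://example.com"})
print(req.get_method(), req.full_url)
```

For long crawls, the asynchronous `/runs` path in the specification may be a better fit than waiting on a single synchronous request.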
