Notion Page & Database Scraper

Scrape public Notion pages and databases. Extract text, headings, lists, code blocks, tables, images, and linked sub-pages. Convert output to Markdown, HTML, or structured JSON. Ideal for knowledge base backup, content migration, and RAG pipelines.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Ricardo Akiyoshi (maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 6 hours ago

Scrape any public Notion page or database and extract structured content. Outputs clean Markdown, HTML, or structured JSON — perfect for knowledge base backup, content migration, RAG/LLM pipelines, and documentation archival.

What it does

This actor visits public Notion pages (both notion.so and notion.site domains), parses the rendered HTML, and extracts all content blocks including:

  • Text paragraphs with formatting (bold, italic, code, links)
  • Headings (H1, H2, H3)
  • Bullet lists and numbered lists
  • To-do / checkbox lists
  • Code blocks with language detection
  • Tables and databases (inline and full-page)
  • Images, embeds, and callouts
  • Toggle blocks (collapsible content)
  • Linked sub-pages (automatically followed when enabled)
  • Page metadata (title, icon, cover, last edited)

Use cases

  • Knowledge base backup — Export your team's Notion workspace to Markdown files for version control or offline access
  • Content migration — Move Notion content to another CMS, wiki, or documentation platform
  • RAG / LLM pipelines — Feed Notion content into vector databases (Pinecone, Weaviate, Chroma) for retrieval-augmented generation
  • Documentation archival — Create snapshots of public documentation pages for compliance or reference
  • Competitive intelligence — Monitor public Notion pages from competitors, startups, or industry resources
  • SEO analysis — Extract and analyze content structure from Notion-hosted websites
  • Data extraction — Pull structured data from Notion databases (tables, boards, galleries)
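
For the RAG / LLM use case, the Markdown output usually has to be split into chunks before it is embedded. The following is a minimal sketch of one way to do that, assuming headings in the output use the standard `#` prefix. The function name `chunk_markdown_by_heading` and the `max_chars` limit are illustrative and not part of the actor.

```python
import re


def chunk_markdown_by_heading(markdown: str, max_chars: int = 1000) -> list[dict]:
    """Split Markdown into chunks, starting a new section at each H1-H3 heading."""
    chunks: list[dict] = []
    heading = ""            # heading that the buffered lines belong to
    buffer: list[str] = []  # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            # Oversized sections are cut further on fixed character boundaries.
            for start in range(0, len(text), max_chars):
                chunks.append({"heading": heading, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)$", line)
        if match:
            flush()
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk keeps the heading it came from, which can be stored as metadata next to the embedding. Note that this simple version would also treat a line starting with `#` inside a code block as a heading.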

Input configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pageUrls | array | (required) | List of public Notion page URLs to scrape |
| includeSubpages | boolean | true | Follow links to child pages within the same workspace |
| maxPages | integer | 100 | Maximum number of pages to scrape (including sub-pages) |
| outputFormat | string | "markdown" | Output format: markdown, html, or json |
| proxyConfiguration | object | | Apify proxy settings for avoiding rate limits |
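
The parameters above can be assembled and checked on the client side before a run is started. This is a sketch only: the helper `build_input` is hypothetical, and the checks simply mirror the types and defaults listed in the table rather than any validation the actor itself is known to perform.

```python
ALLOWED_FORMATS = {"markdown", "html", "json"}


def build_input(page_urls, include_subpages=True, max_pages=100, output_format="markdown"):
    """Build an input dict using the parameter names and defaults from the table."""
    if not page_urls:
        raise ValueError("pageUrls is required and must not be empty")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    return {
        "pageUrls": list(page_urls),
        "includeSubpages": include_subpages,
        "maxPages": max_pages,
        "outputFormat": output_format,
    }
```

The returned dict can be passed as the actor input; `proxyConfiguration` is left out here because the table does not specify its shape.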

Output format

Each scraped page produces a dataset item with:

{
  "pageTitle": "My Notion Page",
  "url": "https://www.notion.so/My-Page-abc123",
  "content": "# My Notion Page\n\nThis is the page content in markdown...",
  "contentHtml": "<h1>My Notion Page</h1><p>This is the page content...</p>",
  "blocks": [
    { "type": "heading_1", "text": "My Notion Page" },
    { "type": "paragraph", "text": "This is the page content..." }
  ],
  "blockCount": 42,
  "subPages": [
    { "title": "Child Page", "url": "https://www.notion.so/Child-Page-def456" }
  ],
  "subPageCount": 3,
  "images": ["https://..."],
  "lastEdited": "2026-01-15T10:30:00.000Z",
  "icon": "📄",
  "cover": "https://...",
  "depth": 0,
  "scrapedAt": "2026-03-02T12:00:00.000Z"
}

The content field contains the output in your chosen format (Markdown by default). The blocks array is always included for structured access regardless of output format.
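
As an illustration of structured access through the blocks array, the sketch below builds an indented heading outline from one dataset item. Only the `heading_1` and `paragraph` type names appear in the sample above; `heading_2` and `heading_3` are assumed to follow the same naming, and `heading_outline` is a hypothetical helper.

```python
def heading_outline(item: dict) -> list[str]:
    """Return the headings of a scraped page, indented by heading level."""
    # heading_2 and heading_3 are assumed names; only heading_1 is shown in the sample.
    levels = {"heading_1": 1, "heading_2": 2, "heading_3": 3}
    outline = []
    for block in item.get("blocks", []):
        level = levels.get(block.get("type"))
        if level is not None:
            outline.append("  " * (level - 1) + block.get("text", ""))
    return outline
```

Because it reads only the blocks array, the same function should work whichever output format was chosen for the content field.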

Example usage

Scrape a single page as Markdown

{
  "pageUrls": ["https://www.notion.so/My-Public-Page-abc123def456"],
  "outputFormat": "markdown"
}

Scrape an entire workspace (up to 500 pages)

{
  "pageUrls": ["https://www.notion.so/Workspace-Root-abc123def456"],
  "includeSubpages": true,
  "maxPages": 500,
  "outputFormat": "json"
}

Scrape without following sub-pages

{
  "pageUrls": [
    "https://www.notion.so/Page-One-abc123",
    "https://www.notion.so/Page-Two-def456"
  ],
  "includeSubpages": false,
  "outputFormat": "html"
}

Supported Notion URL formats

  • https://www.notion.so/Page-Title-{id}
  • https://www.notion.so/{workspace}/Page-Title-{id}
  • https://{workspace}.notion.site/Page-Title-{id}
  • https://notion.so/{id}

Tips for best results

  1. Pages must be public — This actor cannot access private Notion pages. Make sure "Share to web" is enabled in Notion.
  2. Use proxies for large scrapes — If scraping 100+ pages, enable Apify proxy to avoid rate limiting.
  3. Start with a small maxPages — Test with 5-10 pages first to verify the output matches your needs.
  4. Markdown for RAG — If building an LLM/RAG pipeline, Markdown output gives the cleanest text with preserved structure.
  5. JSON for databases — When scraping Notion databases (tables), JSON output preserves column types and cell values.

Pricing

This actor uses pay-per-event pricing. You are charged $0.004 per page scraped (approximately $4 per 1,000 pages). Sub-pages count as individual pages.
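
The per-page rate makes it easy to estimate the upper bound of a run's cost from `maxPages`. The sketch below uses the $0.004 figure quoted above; `estimate_cost` is an illustrative helper, and it covers only the per-page charge described in this section.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.004")  # USD per page, as quoted in the pricing text


def estimate_cost(pages: int) -> Decimal:
    """Return the estimated cost in USD for scraping the given number of pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    # Decimal avoids the rounding noise that binary floats would introduce.
    return PRICE_PER_PAGE * pages
```

For example, a run capped at `maxPages` of 500 would cost at most $2.00 at this rate, since sub-pages are counted as individual pages.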

Limitations

  • Only works with public Notion pages (shared to web)
  • Notion's dynamic rendering may delay content loading — the actor handles this with retry logic
  • Very large databases (10,000+ rows) may require multiple runs with pagination
  • Embedded content from third-party services (e.g., Google Docs, Figma) is captured as links, not content

Support

If you encounter issues or have feature requests, please open an issue on the actor's page or contact the developer.

Integration — Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Input fields follow the "Input configuration" section above.
run = client.actor("sovereigntaylor/notion-scraper").call(run_input={
    "pageUrls": ["https://www.notion.so/My-Public-Page-abc123def456"],
    "maxPages": 50,
    "outputFormat": "markdown",
})

# Each dataset item is one scraped page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("pageTitle", "N/A"))

Integration — JavaScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Input fields follow the "Input configuration" section above.
const run = await client.actor('sovereigntaylor/notion-scraper').call({
    pageUrls: ['https://www.notion.so/My-Public-Page-abc123def456'],
    maxPages: 50,
    outputFormat: 'markdown',
});

// Each dataset item is one scraped page.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => console.log(item.pageTitle || 'N/A'));