Website Content Text Extractor

Extract visible text content from websites as structured JSON blocks. Supports multi-URL batch processing, header/footer/cookie exclusion, and optional form extraction. Perfect for content analysis and translation workflows.

Pricing: Pay per usage
Rating: 5.0 (1)
Developer: My Smart Digital
Maintained by: Community
Actor stats: 1 bookmarked, 4 total users, 2 monthly active users
Last modified: 5 days ago
Apify Actor for extracting visible text content from websites as structured JSON blocks.

Description

Extract clean, visible text from websites as structured blocks. Perfect for content migration, translation workflows, and data analysis. This actor extracts only the text that is actually visible to users, with options to exclude headers, footers, and cookie banners, and to extract form content.

Features

Multi-URL Batch Processing: Process multiple URLs in a single run
Viewport Presets: Choose between Desktop (1920x1080), Mobile (375x667), Tablet (768x1024), or custom dimensions
Header/Footer/Cookie Exclusion: Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
Form Content Extraction: Optional extraction of form content (labels, placeholders, values, dropdown options)
DOM Order Preservation: Text blocks extracted in the order they appear on the page
Code Filtering: Automatically filters out JavaScript, CSS, and code snippets
Deduplication: Removes duplicate text blocks
Configurable Selectors: Customize which elements to include/exclude
Clean JSON Output: Structured output perfect for content analysis, translation, and data migration

Input

{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportType": "desktop",
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true,
  "extractForms": false
}
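
The input above can also be assembled programmatically. The following is a minimal sketch, assuming the behaviour described under Parameters (empty lines and duplicate URLs in startUrls are ignored). The helper name build_run_input is hypothetical and not part of the actor; the actor does its own clean-up, so this only mirrors it on the client side.

```python
import json


def build_run_input(urls, **options):
    """Return an input dict with a cleaned startUrls list.

    Hypothetical helper: strips whitespace, drops empty entries and
    duplicates, and keeps the first-seen order of the remaining URLs.
    """
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    run_input = {"startUrls": cleaned}
    run_input.update(options)  # e.g. excludeHeader=True, minTextLength=3
    return run_input


run_input = build_run_input(
    [
        "https://example.com",
        "",                            # empty entry, dropped
        "https://example.com/about",
        "https://example.com ",        # duplicate after stripping, dropped
    ],
    excludeHeader=True,
    excludeFooter=True,
)
print(json.dumps(run_input, indent=2))
```

Any option not passed explicitly is simply left out of the dict, so the actor's documented defaults apply.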

Parameters

  • startUrl (optional): Single URL to extract text from (useful for quick tests). Leave empty if you only use startUrls
  • startUrls (optional): List of additional URLs to process in bulk in a single run. Duplicates and empty lines are ignored. Use this field to process multiple pages in one execution
  • viewportType (optional, default: "desktop"): Choose a predefined viewport size (desktop, mobile, tablet) or use custom dimensions
  • viewportWidth (optional, default: 1920): Custom viewport width in pixels (only used when viewportType is "custom")
  • viewportHeight (optional, default: 1080): Custom viewport height in pixels (only used when viewportType is "custom")
  • excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
  • excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
  • excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
  • extractForms (optional, default: false): Extract form content (labels, placeholders, input values, dropdown options). When disabled, form elements (form, input, textarea, select, label) are excluded from extraction. When enabled, each form field and dropdown option is extracted as a separate block
  • excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if Exclude Header/Footer/Cookies options don't work for your site structure
  • includeSelectors (optional): CSS selectors to specifically include. If empty, all visible text is extracted
  • minTextLength (optional, default: 3): Minimum character length for text blocks
  • deduplicate (optional, default: true): Remove duplicate text blocks
  • waitForSelector (optional): CSS selector to wait for before extraction
  • waitTimeout (optional, default: 30000): Timeout in milliseconds for waiting for selector
  • removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks
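
The interaction between viewportType, viewportWidth, and viewportHeight can be sketched as follows. This is an illustration of the rules stated in the list above, not the actor's actual code; the function name resolve_viewport and the error handling for unknown types are assumptions.

```python
# Preset sizes as listed under Features.
VIEWPORT_PRESETS = {
    "desktop": (1920, 1080),
    "mobile": (375, 667),
    "tablet": (768, 1024),
}


def resolve_viewport(run_input):
    """Return the effective viewport as {"width": ..., "height": ...}.

    viewportWidth / viewportHeight are only honoured when
    viewportType is "custom"; presets ignore them.
    """
    viewport_type = run_input.get("viewportType", "desktop")
    if viewport_type == "custom":
        return {
            "width": run_input.get("viewportWidth", 1920),
            "height": run_input.get("viewportHeight", 1080),
        }
    if viewport_type not in VIEWPORT_PRESETS:
        # Assumption: unknown types are rejected rather than silently defaulted.
        raise ValueError(f"unknown viewportType: {viewport_type!r}")
    width, height = VIEWPORT_PRESETS[viewport_type]
    return {"width": width, "height": height}


# Width/height are ignored here because the type is a preset.
print(resolve_viewport({"viewportType": "mobile", "viewportWidth": 1280}))
# Custom dimensions are used as given.
print(resolve_viewport({"viewportType": "custom", "viewportWidth": 1280, "viewportHeight": 720}))
```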

Output

The actor returns a JSON object with the following structure:

{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}

Output Fields

  • url: The URL that was processed
  • title: Page title
  • viewport: Viewport dimensions used
  • textBlocks: Array of extracted text blocks, each with:
    • id: Unique identifier (block-1, block-2, etc.)
    • text: The extracted text content
    • order: Order of appearance (1, 2, 3, etc.)
    • tagName: HTML tag name (h1, p, li, etc.)
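    • selector: CSS selector that matched, if the block was extracted via a specific selector; null otherwise
  • statistics: Summary statistics (totalBlocks, totalCharacters, uniqueBlocks, excludedElements)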
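
As an illustration of how these fields might be consumed, here is a short sketch that turns one result object into plain text and into an id-to-text mapping (handy for translation workflows). The function names are hypothetical, and the sample dict simply mirrors the example output above.

```python
def blocks_to_text(result, tags=None):
    """Join text blocks in order of appearance, optionally filtered by tag name."""
    blocks = sorted(result["textBlocks"], key=lambda block: block["order"])
    if tags is not None:
        blocks = [block for block in blocks if block["tagName"] in tags]
    return "\n".join(block["text"] for block in blocks)


def translation_table(result):
    """Map each block id to its text so translations can be keyed by id."""
    return {block["id"]: block["text"] for block in result["textBlocks"]}


# Sample data copied from the example output (statistics omitted).
sample = {
    "url": "https://example.com",
    "title": "Page Title",
    "textBlocks": [
        {"id": "block-2", "text": "Second text block", "order": 2, "tagName": "p", "selector": None},
        {"id": "block-1", "text": "First visible text block", "order": 1, "tagName": "h1", "selector": None},
    ],
}

print(blocks_to_text(sample))
print(blocks_to_text(sample, tags={"h1"}))
print(translation_table(sample))
```

Sorting by order makes the result independent of how the blocks happen to be arranged in the array.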

Use Cases

  • Content Translation: Extract clean text blocks for translation workflows
  • Content Analysis: Analyze visible content without navigation/header/footer noise
  • SEO Content Extraction: Get only the main content for SEO analysis
  • Content Migration: Extract content for migration to new platforms
  • Form Data Extraction: Extract form labels, placeholders, and dropdown options for documentation or analysis

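Header and Footer Exclusion

By default, no header, footer, or cookie banner content is excluded: all visible text is extracted, except form elements, which are skipped unless extractForms is enabled. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options.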

When enabled, these options apply a built-in set of broad selectors that cover:

  • WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
  • Shopify: .shopify-section-header, .shopify-section-footer, etc.
  • Webflow: [class*='header'], [id*='header'], etc.
  • Drupal: .region-header, .region-footer, etc.
  • Joomla: .header, .footer, .moduletable-menu, etc.
  • Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.

If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.
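
For example, an input that combines the built-in exclusions with manual ones might look like the sketch below. The three selectors in excludeSelectors are hypothetical placeholders; replace them with selectors that match your own site's markup.

```python
import json

run_input = {
    "startUrls": ["https://example.com"],
    # Built-in exclusions described above.
    "excludeHeader": True,
    "excludeFooter": True,
    "excludeCookies": True,
    # Hypothetical site-specific selectors for elements the
    # built-in options do not catch.
    "excludeSelectors": [
        "#promo-banner",
        ".sidebar-widget",
        "[data-testid='newsletter-popup']",
    ],
}

# The dict serializes directly to the JSON input format shown under Input.
print(json.dumps(run_input, indent=2))
```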

Technical Details

  • Uses Playwright for rendering with configurable viewport sizes
  • Waits for networkidle to ensure all content is loaded
  • Automatically scrolls pages to load lazy-loaded content
  • Checks element visibility using getBoundingClientRect() and computed styles
  • Filters out elements with display: none, visibility: hidden, opacity: 0
  • Filters out JavaScript, CSS, and code snippets automatically
  • Only includes elements within the viewport bounds
  • Deduplicates text blocks by normalized (lowercase) content
  • Maintains DOM order for accurate content structure
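
The visibility and deduplication rules above can be sketched in plain Python. This is an approximation under stated assumptions, not the actor's source: the rect dict stands in for the values returned by getBoundingClientRect(), the style dict for computed styles, "within the viewport bounds" is interpreted as overlapping the viewport rectangle, and normalization is assumed to collapse whitespace in addition to lowercasing. Both function names are hypothetical.

```python
def is_visible(rect, style, viewport):
    """Approximate the visibility check for one element.

    rect: {"x", "y", "width", "height"} relative to the viewport.
    style: computed style values, e.g. {"display": "block", "opacity": "1"}.
    viewport: {"width", "height"}.
    """
    if style.get("display") == "none":
        return False
    if style.get("visibility") == "hidden":
        return False
    if float(style.get("opacity", 1)) == 0:
        return False
    if rect["width"] <= 0 or rect["height"] <= 0:
        return False
    # Assumption: an element counts as "within bounds" if it overlaps the viewport.
    return (
        rect["x"] < viewport["width"]
        and rect["x"] + rect["width"] > 0
        and rect["y"] < viewport["height"]
        and rect["y"] + rect["height"] > 0
    )


def dedupe_blocks(texts, min_text_length=3):
    """Drop short, empty, and duplicate texts while keeping the original order."""
    seen = set()
    kept = []
    for text in texts:
        stripped = text.strip()
        if len(stripped) < min_text_length:
            continue  # covers empty and whitespace-only blocks too
        key = " ".join(stripped.split()).lower()  # normalized comparison key
        if key in seen:
            continue
        seen.add(key)
        kept.append(stripped)
    return kept


viewport = {"width": 1920, "height": 1080}
print(is_visible({"x": 0, "y": 0, "width": 100, "height": 20}, {"opacity": "0"}, viewport))
print(dedupe_blocks(["Hello World", "hello  world", "   ", "ok", "Next"]))
```

Keeping the first occurrence of each duplicate is what preserves DOM order in the result.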