Website text extractor

Pricing

Pay per usage

Rating

5.0

(1)

Developer

My Smart Digital

Maintained by Community

Actor stats

0

Bookmarked

3

Total users

2

Monthly active users

a day ago

Last modified

Desktop Text Extractor

Apify Actor for extracting visible desktop text from web pages. It skips hidden and mobile-only content and can optionally exclude headers, footers, and cookie banners.

Description

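This actor extracts only the text that is actually visible to a desktop user. Depending on the options enabled, it excludes:

  • Headers and footers, including Elementor blocks (excludeHeader, excludeFooter)
  • Navigation menus (excludeHeader)
  • Mobile-only content
  • Hidden elements
  • Duplicate text blocks (deduplicate, on by default)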

Perfect for content translation workflows where you need clean, ordered text blocks.

Features

  • Desktop Viewport Rendering: Renders pages in a real desktop viewport (default 1920x1080)
  • Optional Header/Footer Removal: Removes elements matching common header/footer selectors (including Elementor) when excludeHeader or excludeFooter is enabled
  • Visible Text Only: Extracts only text visible in the desktop viewport
  • Deduplication: Removes duplicate text blocks
  • Ordered Output: Returns text blocks in order of appearance
  • Configurable Selectors: Customize which elements to include or exclude
  • Clean JSON Output: Easy to integrate with n8n, Google Sheets, etc.
  • Batch Mode: Process multiple URLs in bulk in a single run using the additional URL list

Input

{
    "startUrls": [
        "https://example.com",
        "https://example.com/about",
        "https://example.com/contact"
    ],
    "viewportWidth": 1920,
    "viewportHeight": 1080,
    "excludeHeader": false,
    "excludeFooter": false,
    "excludeCookies": false,
    "excludeSelectors": [],
    "includeSelectors": [],
    "minTextLength": 3,
    "deduplicate": true,
    "waitForSelector": "",
    "waitTimeout": 30000,
    "removeEmptyBlocks": true
}

Parameters

  • startUrl (optional): Single URL to extract (handy for testing). Leave empty if you only use the URL list
  • startUrls (optional): List of additional URLs to process in bulk in a single run. If present, the actor processes each URL in order and skips duplicates. Ideal for handling several pages in one run. Provide at least one of the two parameters (startUrl or startUrls)
  • viewportWidth (optional, default: 1920): Desktop viewport width in pixels
  • viewportHeight (optional, default: 1080): Desktop viewport height in pixels
  • excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
  • excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
  • excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
  • excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if Exclude Header/Footer/Cookies options don't work for your site structure
  • includeSelectors (optional): CSS selectors to specifically include. If empty, all visible text is extracted
  • minTextLength (optional, default: 3): Minimum character length for text blocks
  • deduplicate (optional, default: true): Remove duplicate text blocks
  • waitForSelector (optional): CSS selector to wait for before extraction
  • waitTimeout (optional, default: 30000): Timeout in milliseconds for waiting for selector
  • removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks
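The parameter list above can be sketched as a typed input with defaults. This is a minimal illustration rather than the actor's actual source: the names ExtractorInput, ResolvedInput, and resolveInput are hypothetical, and placing startUrl ahead of startUrls when merging is an assumption.

```typescript
// Hypothetical typing of the documented input; not taken from the actor's code.
interface ExtractorInput {
  startUrl?: string;
  startUrls?: string[];
  viewportWidth?: number;
  viewportHeight?: number;
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeCookies?: boolean;
  excludeSelectors?: string[];
  includeSelectors?: string[];
  minTextLength?: number;
  deduplicate?: boolean;
  waitForSelector?: string;
  waitTimeout?: number;
  removeEmptyBlocks?: boolean;
}

type ResolvedInput =
  Required<Omit<ExtractorInput, "startUrl" | "startUrls">> & { urls: string[] };

function resolveInput(input: ExtractorInput): ResolvedInput {
  // Merge the single URL and the list, skipping blanks and duplicates.
  // Assumption: startUrl is processed before the entries of startUrls.
  const candidates: Array<string | undefined> = [input.startUrl, ...(input.startUrls ?? [])];
  const urls: string[] = [];
  for (const candidate of candidates) {
    const url = (candidate ?? "").trim();
    if (url !== "" && urls.indexOf(url) === -1) urls.push(url);
  }
  if (urls.length === 0) {
    throw new Error("Provide at least one of startUrl or startUrls");
  }
  // Default values are the ones documented in the parameter list.
  return {
    urls,
    viewportWidth: input.viewportWidth ?? 1920,
    viewportHeight: input.viewportHeight ?? 1080,
    excludeHeader: input.excludeHeader ?? false,
    excludeFooter: input.excludeFooter ?? false,
    excludeCookies: input.excludeCookies ?? false,
    excludeSelectors: input.excludeSelectors ?? [],
    includeSelectors: input.includeSelectors ?? [],
    minTextLength: input.minTextLength ?? 3,
    deduplicate: input.deduplicate ?? true,
    waitForSelector: input.waitForSelector ?? "",
    waitTimeout: input.waitTimeout ?? 30000,
    removeEmptyBlocks: input.removeEmptyBlocks ?? true,
  };
}
```

Calling resolveInput({ startUrl: "https://example.com" }) would yield a single-entry urls array with every other option at its documented default.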

Output

The actor returns a JSON object with the following structure:

{
    "url": "https://example.com",
    "title": "Page Title",
    "viewport": {
        "width": 1920,
        "height": 1080
    },
    "textBlocks": [
        {
            "id": "block-1",
            "text": "First visible text block",
            "order": 1,
            "tagName": "h1",
            "selector": null
        },
        {
            "id": "block-2",
            "text": "Second text block",
            "order": 2,
            "tagName": "p",
            "selector": null
        }
    ],
    "statistics": {
        "totalBlocks": 25,
        "totalCharacters": 5432,
        "uniqueBlocks": 23,
        "excludedElements": 8
    }
}

Output Fields

  • url: The URL that was processed
  • title: Page title
  • viewport: Viewport dimensions used
  • textBlocks: Array of extracted text blocks, each with:
    • id: Unique identifier (block-1, block-2, etc.)
    • text: The extracted text content
    • order: Order of appearance (1, 2, 3, etc.)
    • tagName: HTML tag name (h1, p, li, etc.)
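    • selector: CSS selector the block was extracted from, when a specific selector was used (null otherwise)
  • statistics: Summary statistics (totalBlocks, totalCharacters, uniqueBlocks, excludedElements)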

Use Cases

  • Content Translation: Extract clean text blocks for translation workflows
  • Content Analysis: Analyze visible content without navigation/header/footer noise
  • SEO Content Extraction: Get only the main content for SEO analysis
  • Content Migration: Extract content for migration to new platforms
  • n8n Workflows: Perfect for sending to Google Sheets (1 row = 1 text block)

Example: n8n Integration

  1. HTTP Request Node → Call this Apify Actor with URL
  2. Function Node → Parse JSON response
  3. Loop Over Items → Iterate through textBlocks
  4. Google Sheets Node → Insert each text block as a row
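Steps 2 and 3 amount to flattening textBlocks into one row per block. The sketch below shows that transformation in TypeScript under a few assumptions: the column layout (url, order, tagName, text) and the names SheetRow and toSheetRows are illustrative choices, and inside an n8n Function node the same logic would be written as plain JavaScript.

```typescript
// Minimal subset of the output structure documented above.
interface TextBlock {
  id: string;
  text: string;
  order: number;
  tagName: string;
  selector: string | null;
}

interface ExtractionResult {
  url: string;
  title: string;
  textBlocks: TextBlock[];
}

// Hypothetical column layout: url, order, tagName, text.
type SheetRow = [string, number, string, string];

function toSheetRows(result: ExtractionResult): SheetRow[] {
  return result.textBlocks
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.order - b.order)
    .map((block): SheetRow => [result.url, block.order, block.tagName, block.text]);
}
```

Sorting by order keeps the rows in order of appearance even if the blocks arrive shuffled, which matters when the sheet is later sent for translation.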

By default, nothing is excluded, so all visible text is extracted. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options.

When enabled, these options apply a broad set of selectors that covers:

  • WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
  • Shopify: .shopify-section-header, .shopify-section-footer, etc.
  • Webflow: [class*='header'], [id*='header'], etc.
  • Drupal: .region-header, .region-footer, etc.
  • Joomla: .header, .footer, .moduletable-menu, etc.
  • Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.

If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.
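One way to picture how the built-in options and excludeSelectors combine is a single merged selector string. The sketch below is an assumption about the approach, not the actor's implementation: the lists contain only selectors quoted above (the real lists are longer), their split into header and footer groups is a guess, and buildExcludeSelector is a hypothetical name.

```typescript
// Subsets taken from the examples above; the actor's own lists include more ("etc.").
const HEADER_SELECTORS: string[] = [
  "header", "nav", "[role='banner']", ".site-header",
  ".elementor-location-header", ".wp-block-navigation",
  ".shopify-section-header", ".region-header",
];
const FOOTER_SELECTORS: string[] = [
  "footer", "[role='contentinfo']", ".shopify-section-footer", ".region-footer",
];

interface ExclusionOptions {
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeSelectors?: string[];
}

// Returns one comma-separated selector, or null when nothing should be excluded.
function buildExcludeSelector(options: ExclusionOptions): string | null {
  const selectors: string[] = [];
  if (options.excludeHeader) selectors.push(...HEADER_SELECTORS);
  if (options.excludeFooter) selectors.push(...FOOTER_SELECTORS);
  selectors.push(...(options.excludeSelectors ?? []));

  const cleaned = selectors.map((s) => s.trim()).filter((s) => s !== "");
  const unique = cleaned.filter((s, i) => cleaned.indexOf(s) === i);
  return unique.length > 0 ? unique.join(", ") : null;
}
```

In a page context, the resulting string could be passed to document.querySelectorAll to find the elements to skip. Returning null for the default input mirrors the documented behavior that nothing is excluded unless requested.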

Technical Details

  • Uses Playwright for desktop rendering
  • Waits for networkidle to ensure all content is loaded
  • Checks element visibility using getBoundingClientRect() and computed styles
  • Filters out elements with display: none, visibility: hidden, opacity: 0
  • Only includes elements within the viewport bounds
  • Deduplicates text blocks by normalized (lowercase) content
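The visibility and deduplication rules listed above can be expressed as two small pure functions. This is a sketch under stated assumptions, not the actor's code: the function names are hypothetical, partial overlap with the viewport is treated as "within bounds", and whitespace collapsing in normalizeText goes beyond the lowercase normalization the list mentions. In a browser, the rect and style arguments would come from el.getBoundingClientRect() and getComputedStyle(el).

```typescript
interface Rect { top: number; left: number; width: number; height: number; }
interface StyleSnapshot { display: string; visibility: string; opacity: string; }
interface Viewport { width: number; height: number; }

function isVisibleInViewport(rect: Rect, style: StyleSnapshot, viewport: Viewport): boolean {
  // Computed-style checks: display: none, visibility: hidden, opacity: 0.
  if (style.display === "none" || style.visibility === "hidden") return false;
  if (parseFloat(style.opacity) === 0) return false;
  // Zero-sized boxes render nothing.
  if (rect.width <= 0 || rect.height <= 0) return false;
  // Assumption: any overlap with the viewport rectangle counts as inside.
  const overlapsX = rect.left < viewport.width && rect.left + rect.width > 0;
  const overlapsY = rect.top < viewport.height && rect.top + rect.height > 0;
  return overlapsX && overlapsY;
}

// Assumption: whitespace is trimmed and collapsed in addition to lowercasing.
function normalizeText(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

// Keeps the first occurrence of each normalized text, preserving order.
function deduplicateBlocks<T extends { text: string }>(blocks: T[]): T[] {
  const seen = new Set<string>();
  const kept: T[] = [];
  for (const block of blocks) {
    const key = normalizeText(block.text);
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(block);
  }
  return kept;
}
```

Keeping the first occurrence preserves the documented order of appearance, so a heading repeated further down the page does not displace the original.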

Development

# Install dependencies
npm install
# Build TypeScript
npm run build
# Run locally
npm start

License

MIT