Website text extractor

Developer: My Smart Digital
Maintained by Community

Desktop Text Extractor

Apify Actor that extracts the text visible on a desktop rendering of a web page. Hidden and mobile-only content is left out, and headers, footers, and cookie banners can optionally be excluded.

Description

This actor extracts only the text that is actually visible to a desktop user, excluding:

Headers and footers (including Elementor blocks), when excludeHeader / excludeFooter are enabled
Navigation menus (removed together with the header)
Mobile-only content
Hidden elements
Duplicate text blocks

Perfect for content translation workflows where you need clean, ordered text blocks.

Features

✅ Desktop Viewport Rendering: Uses real desktop viewport (default 1920x1080)
✅ Optional Header/Footer Removal: When enabled, removes elements matching common header/footer selectors (including Elementor)
✅ Visible Text Only: Extracts only text visible in desktop viewport
✅ Deduplication: Removes duplicate text blocks
✅ Ordered Output: Text blocks in order of appearance
✅ Configurable Selectors: Customize which elements to include/exclude
✅ Clean JSON Output: Perfect for integration with n8n, Google Sheets, etc.
✅ Batch Mode: Process multiple URLs in bulk in a single run using the additional URL list

Input

{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true
}

Parameters

startUrl (optional): Single URL to extract (handy for testing). Leave empty if you only use the URL list
startUrls (optional): List of additional URLs to process in bulk in a single run. If present, the actor visits each URL in order and skips duplicates. Ideal for processing several pages in one execution. Provide at least one of the two parameters (startUrl or startUrls)
viewportWidth (optional, default: 1920): Desktop viewport width in pixels
viewportHeight (optional, default: 1080): Desktop viewport height in pixels
excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if the excludeHeader, excludeFooter, or excludeCookies options don't work for your site structure
includeSelectors (optional): CSS selectors to specifically include. If empty, all visible text is extracted
minTextLength (optional, default: 3): Minimum character length for text blocks
deduplicate (optional, default: true): Remove duplicate text blocks
waitForSelector (optional): CSS selector to wait for before extraction starts
waitTimeout (optional, default: 30000): Maximum time in milliseconds to wait for waitForSelector
removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks
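
The parameter list above can be captured as a typed input object. The following is a minimal sketch, not the actor's own code: the names `ExtractorInput`, `defaultInput`, and `buildInput` are hypothetical, and the defaults for `includeSelectors` and `waitForSelector` are taken from the example input because no default is documented for them.

```typescript
// Input shape transcribed from the parameter list (type name is hypothetical).
interface ExtractorInput {
  startUrl?: string;
  startUrls?: string[];
  viewportWidth: number;
  viewportHeight: number;
  excludeHeader: boolean;
  excludeFooter: boolean;
  excludeCookies: boolean;
  excludeSelectors: string[];
  includeSelectors: string[];
  minTextLength: number;
  deduplicate: boolean;
  waitForSelector: string;
  waitTimeout: number;
  removeEmptyBlocks: boolean;
}

// Returns a fresh copy of the documented defaults on every call.
function defaultInput(): ExtractorInput {
  return {
    viewportWidth: 1920,
    viewportHeight: 1080,
    excludeHeader: false,
    excludeFooter: false,
    excludeCookies: false,
    excludeSelectors: [],
    includeSelectors: [], // assumed from the example input
    minTextLength: 3,
    deduplicate: true,
    waitForSelector: "", // assumed from the example input
    waitTimeout: 30000,
    removeEmptyBlocks: true,
  };
}

// Merges overrides into the defaults and enforces the documented rule:
// at least one of startUrl or startUrls must be provided.
function buildInput(overrides: Partial<ExtractorInput>): ExtractorInput {
  const input: ExtractorInput = { ...defaultInput(), ...overrides };
  const hasSingle =
    typeof input.startUrl === "string" && input.startUrl.trim() !== "";
  const hasList = Array.isArray(input.startUrls) && input.startUrls.length > 0;
  if (!hasSingle && !hasList) {
    throw new Error("Provide at least one of startUrl or startUrls");
  }
  return input;
}

// Usage: batch run over two pages with header and footer removal enabled.
const input = buildInput({
  startUrls: ["https://example.com", "https://example.com/about"],
  excludeHeader: true,
  excludeFooter: true,
});
```

Validating on the client side is a design choice of this sketch; the actor may apply its own checks when the run starts.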

Output

The actor returns a JSON object with the following structure:

{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}

Output Fields

url: The URL that was processed
title: Page title
viewport: Viewport dimensions used
textBlocks: Array of extracted text blocks, each with:
- id: Unique identifier (block-1, block-2, etc.)
- text: The extracted text content
- order: Order of appearance (1, 2, 3, etc.)
- tagName: HTML tag name (h1, p, li, etc.)
- selector: CSS selector if extracted from specific selector
statistics: Summary statistics
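
For consumers written in TypeScript, the output fields above map onto a small set of types. This is a sketch under stated assumptions: the type names and the `toPlainText` helper are hypothetical, and the sample statistics are illustrative values rather than real actor output.

```typescript
// Types transcribed from the documented output structure (names are hypothetical).
interface TextBlock {
  id: string;
  text: string;
  order: number;
  tagName: string;
  selector: string | null;
}

interface ExtractionResult {
  url: string;
  title: string;
  viewport: { width: number; height: number };
  textBlocks: TextBlock[];
  statistics: {
    totalBlocks: number;
    totalCharacters: number;
    uniqueBlocks: number;
    excludedElements: number;
  };
}

// Hypothetical helper: rebuilds the page text in order of appearance.
// Sorting a copy keeps the caller's array untouched.
function toPlainText(result: ExtractionResult): string {
  return [...result.textBlocks]
    .sort((a, b) => a.order - b.order)
    .map((block) => block.text)
    .join("\n\n");
}

// Sample data with blocks deliberately out of order (illustrative values).
const sample: ExtractionResult = {
  url: "https://example.com",
  title: "Page Title",
  viewport: { width: 1920, height: 1080 },
  textBlocks: [
    { id: "block-2", text: "Second text block", order: 2, tagName: "p", selector: null },
    { id: "block-1", text: "First visible text block", order: 1, tagName: "h1", selector: null },
  ],
  statistics: { totalBlocks: 2, totalCharacters: 41, uniqueBlocks: 2, excludedElements: 0 },
};

const text = toPlainText(sample);
```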

Use Cases

Content Translation: Extract clean text blocks for translation workflows
Content Analysis: Analyze visible content without navigation/header/footer noise
SEO Content Extraction: Get only the main content for SEO analysis
Content Migration: Extract content for migration to new platforms
n8n Workflows: Perfect for sending to Google Sheets (1 row = 1 text block)

Example: n8n Integration

HTTP Request Node → Call this Apify Actor with URL
Function Node → Parse JSON response
Loop Over Items → Iterate through textBlocks
Google Sheets Node → Insert each text block as a row
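
The parsing step in the workflow above could look roughly like the following sketch, which turns one result into one item per text block (1 row = 1 text block). The column names (`blockId`, `order`, and so on) and the helper `toSheetItems` are assumptions for illustration; each item wraps its data in a `json` key, which is the item shape n8n nodes pass along.

```typescript
// Minimal subset of the output structure needed for this step.
interface TextBlock {
  id: string;
  text: string;
  order: number;
  tagName: string;
}

interface ExtractionResult {
  url: string;
  textBlocks: TextBlock[];
}

// One item per sheet row; column names are hypothetical.
interface SheetItem {
  json: {
    url: string;
    blockId: string;
    order: number;
    tagName: string;
    text: string;
  };
}

// Flattens a result into one item per text block, preserving order.
function toSheetItems(result: ExtractionResult): SheetItem[] {
  return result.textBlocks.map((block) => ({
    json: {
      url: result.url,
      blockId: block.id,
      order: block.order,
      tagName: block.tagName,
      text: block.text,
    },
  }));
}

// Usage with the two blocks from the example output.
const items = toSheetItems({
  url: "https://example.com",
  textBlocks: [
    { id: "block-1", text: "First visible text block", order: 1, tagName: "h1" },
    { id: "block-2", text: "Second text block", order: 2, tagName: "p" },
  ],
});
```

Keeping the source URL on every row makes the sheet usable when several pages are processed in one batch run.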

Header and Footer Exclusion

By default, nothing is excluded - all visible text is extracted. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options.

When enabled, these options use universal selectors compatible with:

WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
Shopify: .shopify-section-header, .shopify-section-footer, etc.
Webflow: [class*='header'], [id*='header'], etc.
Drupal: .region-header, .region-footer, etc.
Joomla: .header, .footer, .moduletable-menu, etc.
Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.

If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.
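
To illustrate how these options might combine, here is a minimal sketch that assembles a single CSS selector string from the flags and any custom selectors. It is not the actor's implementation: `buildExcludeSelector` is a hypothetical helper, and the selector lists contain only the entries named above (the real lists are longer, as the "etc." indicates).

```typescript
// Only the selectors named in the documentation; the actor's lists are longer.
const HEADER_SELECTORS: string[] = [
  "header", "nav", "[role='banner']",
  ".site-header", ".elementor-location-header", ".wp-block-navigation",
  ".shopify-section-header", "[class*='header']", "[id*='header']",
  ".region-header", ".header", ".moduletable-menu",
];

const FOOTER_SELECTORS: string[] = [
  "footer", "[role='contentinfo']",
  ".shopify-section-footer", ".region-footer", ".footer",
];

interface ExcludeOptions {
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeSelectors?: string[];
}

// Hypothetical helper: merges the enabled selector groups with custom
// selectors, drops duplicates, and joins them into one selector list.
function buildExcludeSelector(options: ExcludeOptions): string {
  const selectors: string[] = [
    ...(options.excludeHeader ? HEADER_SELECTORS : []),
    ...(options.excludeFooter ? FOOTER_SELECTORS : []),
    ...(options.excludeSelectors ?? []),
  ];
  return Array.from(new Set(selectors)).join(", ");
}

// Usage: header removal plus one site-specific selector.
const selector = buildExcludeSelector({
  excludeHeader: true,
  excludeSelectors: [".promo-banner"],
});
```

With no flags set and no custom selectors, the result is an empty string, which matches the documented default of excluding nothing.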

Technical Details

Uses Playwright for desktop rendering
Waits for networkidle to ensure all content is loaded
Checks element visibility using getBoundingClientRect() and computed styles
Filters out elements with display: none, visibility: hidden, opacity: 0
Only includes elements within the viewport bounds
Deduplicates text blocks by normalized (lowercase) content
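
The visibility and deduplication rules listed above can be sketched as two pure functions. This is an illustration under assumptions, not the actor's source: the function and type names are hypothetical, "within the viewport bounds" is interpreted here as the element's rectangle intersecting the viewport, and collapsing whitespace during normalization goes beyond the documented lowercase step.

```typescript
// Plain-data stand-ins for getBoundingClientRect() and computed styles.
interface Rect { top: number; left: number; width: number; height: number }
interface StyleInfo { display: string; visibility: string; opacity: string }
interface Viewport { width: number; height: number }

// True if the element is rendered, non-transparent, has a size,
// and its rectangle overlaps the viewport (assumed interpretation).
function isVisible(rect: Rect, style: StyleInfo, viewport: Viewport): boolean {
  if (style.display === "none" || style.visibility === "hidden") return false;
  if (parseFloat(style.opacity) === 0) return false;
  if (rect.width <= 0 || rect.height <= 0) return false;
  const overlapsX = rect.left < viewport.width && rect.left + rect.width > 0;
  const overlapsY = rect.top < viewport.height && rect.top + rect.height > 0;
  return overlapsX && overlapsY;
}

// Keeps the first occurrence of each text, comparing a normalized key.
// Lowercasing is documented; trimming and whitespace collapsing are assumptions.
function deduplicate(texts: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const text of texts) {
    const key = text.trim().replace(/\s+/g, " ").toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(text);
  }
  return unique;
}

// Usage: the second entry differs only in case and spacing, so it is dropped.
const uniqueTexts = deduplicate(["Hello World", "hello  world", "Other"]);
```

In a browser context, the `Rect` and `StyleInfo` values would come from `element.getBoundingClientRect()` and `getComputedStyle(element)`; they are modeled as plain objects here so the logic can run without a DOM.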

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run locally
npm start

License

MIT
