Website text extractor
Pricing
Pay per usage
Developer
My Smart Digital
Desktop Text Extractor
Apify Actor for extracting visible desktop text from web pages, automatically excluding headers, footers, and mobile-only content.
Description
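This actor extracts only the text that is actually visible to a desktop user. Depending on the options enabled (header, footer, and cookie-banner exclusion are off by default), it can exclude: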
- Headers and footers (including Elementor blocks)
- Navigation menus
- Mobile-only content
- Hidden elements
- Duplicate text blocks
Perfect for content translation workflows where you need clean, ordered text blocks.
Features
✅ Desktop Viewport Rendering: Uses real desktop viewport (default 1920x1080)
✅ Automatic Header/Footer Removal: Removes common header/footer selectors (including Elementor)
✅ Visible Text Only: Extracts only text visible in desktop viewport
✅ Deduplication: Removes duplicate text blocks
✅ Ordered Output: Text blocks in order of appearance
✅ Configurable Selectors: Customize which elements to include/exclude
✅ Clean JSON Output: Perfect for integration with n8n, Google Sheets, etc.
✅ Batch Mode: Process multiple URLs in bulk in a single run using the additional URL list
Input
{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true
}
Parameters
- startUrl (optional): Single URL to extract (handy for testing). Leave empty if you only use the URL list
- startUrls (optional): List of additional URLs to process in bulk in a single run. If present, the actor processes each URL in order and skips duplicates. Ideal for handling several pages in one execution. Provide at least one of the two parameters (startUrl or startUrls)
- viewportWidth (optional, default: 1920): Desktop viewport width in pixels
- viewportHeight (optional, default: 1080): Desktop viewport height in pixels
- excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
- excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
- excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
- excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if Exclude Header/Footer/Cookies options don't work for your site structure
- includeSelectors (optional): CSS selectors to specifically include. If empty, all visible text is extracted
- minTextLength (optional, default: 3): Minimum character length for text blocks
- deduplicate (optional, default: true): Remove duplicate text blocks
- waitForSelector (optional): CSS selector to wait for before extraction
- waitTimeout (optional, default: 30000): Timeout in milliseconds for waiting for selector
- removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks
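As a rough illustration of how the parameters above fit together, the sketch below applies the documented defaults and merges startUrl and startUrls into one ordered, duplicate-free list. The names ActorInput, DEFAULTS, resolveUrls, and normalizeInput are hypothetical, and the merge logic is an assumption rather than the Actor's actual code.

```typescript
// Hypothetical input shape mirroring the documented parameters.
interface ActorInput {
  startUrl?: string;
  startUrls?: string[];
  viewportWidth?: number;
  viewportHeight?: number;
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeCookies?: boolean;
  excludeSelectors?: string[];
  includeSelectors?: string[];
  minTextLength?: number;
  deduplicate?: boolean;
  waitForSelector?: string;
  waitTimeout?: number;
  removeEmptyBlocks?: boolean;
}

// Default values copied from the parameter list above.
const DEFAULTS = {
  viewportWidth: 1920,
  viewportHeight: 1080,
  excludeHeader: false,
  excludeFooter: false,
  excludeCookies: false,
  excludeSelectors: [] as string[],
  includeSelectors: [] as string[],
  minTextLength: 3,
  deduplicate: true,
  waitForSelector: "",
  waitTimeout: 30000,
  removeEmptyBlocks: true,
};

// Assumption: startUrl is processed first, then startUrls in order,
// and a URL that appears more than once is kept only the first time.
function resolveUrls(input: ActorInput): string[] {
  const candidates = [input.startUrl, ...(input.startUrls ?? [])];
  const seen = new Set<string>();
  const urls: string[] = [];
  for (const candidate of candidates) {
    const url = (candidate ?? "").trim();
    if (url !== "" && !seen.has(url)) {
      seen.add(url);
      urls.push(url);
    }
  }
  if (urls.length === 0) {
    throw new Error("Provide at least one of startUrl or startUrls");
  }
  return urls;
}

// Fields present in the input override the defaults; `urls` holds the merged list.
function normalizeInput(input: ActorInput) {
  return { ...DEFAULTS, ...input, urls: resolveUrls(input) };
}
```

Calling normalizeInput({ startUrls: ["https://example.com"] }) would yield the defaults plus a one-element urls list, while an input with neither URL parameter throws.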
Output
The actor returns a JSON object with the following structure:
{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}
Output Fields
- url: The URL that was processed
- title: Page title
- viewport: Viewport dimensions used
- textBlocks: Array of extracted text blocks, each with:
  - id: Unique identifier (block-1, block-2, etc.)
  - text: The extracted text content
  - order: Order of appearance (1, 2, 3, etc.)
  - tagName: HTML tag name (h1, p, li, etc.)
  - selector: CSS selector if extracted from specific selector
- statistics: Summary statistics
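For consumers written in TypeScript, the output fields above could be typed roughly as follows. The interface names and the toPlainText helper are hypothetical; the field names and types are taken from the example output, with selector assumed to be a string when present.

```typescript
// One extracted block, as described in the Output Fields list.
interface TextBlock {
  id: string;
  text: string;
  order: number;
  tagName: string;
  selector: string | null; // assumed to be a string when set
}

// The whole result object for one processed URL.
interface ExtractionResult {
  url: string;
  title: string;
  viewport: { width: number; height: number };
  textBlocks: TextBlock[];
  statistics: {
    totalBlocks: number;
    totalCharacters: number;
    uniqueBlocks: number;
    excludedElements: number;
  };
}

// Hypothetical helper: join all blocks into one string, sorted by `order`.
// The input array is copied first so the original result is not mutated.
function toPlainText(result: ExtractionResult, separator: string = "\n\n"): string {
  return result.textBlocks
    .slice()
    .sort((a, b) => a.order - b.order)
    .map((block) => block.text)
    .join(separator);
}
```

Sorting by order rather than relying on array position keeps the helper safe if blocks are reshuffled by an intermediate step, for example after a translation round trip.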
Use Cases
- Content Translation: Extract clean text blocks for translation workflows
- Content Analysis: Analyze visible content without navigation/header/footer noise
- SEO Content Extraction: Get only the main content for SEO analysis
- Content Migration: Extract content for migration to new platforms
- n8n Workflows: Perfect for sending to Google Sheets (1 row = 1 text block)
Example: n8n Integration
- HTTP Request Node → Call this Apify Actor with URL
- Function Node → Parse JSON response
- Loop Over Items → Iterate through textBlocks
- Google Sheets Node → Insert each text block as a row
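The parsing and looping steps above amount to flattening each result into one row per text block. A minimal sketch of that transformation is shown here; the SheetRow column layout and the toSheetRows name are assumptions to adapt to your own sheet, not something the Actor prescribes.

```typescript
// Minimal structural types: only the fields this sketch reads.
interface Block {
  id: string;
  text: string;
  order: number;
  tagName: string;
}

interface PageResult {
  url: string;
  textBlocks: Block[];
}

// Hypothetical column layout for the spreadsheet (1 row = 1 text block).
interface SheetRow {
  url: string;
  blockId: string;
  order: number;
  tagName: string;
  text: string;
}

// Flatten one or more page results into rows, keeping page and block order.
function toSheetRows(results: PageResult[]): SheetRow[] {
  const rows: SheetRow[] = [];
  for (const result of results) {
    for (const block of result.textBlocks) {
      rows.push({
        url: result.url,
        blockId: block.id,
        order: block.order,
        tagName: block.tagName,
        text: block.text,
      });
    }
  }
  return rows;
}
```

Inside an n8n code step you would typically wrap each row as an item (for example { json: row }) before passing it to the Google Sheets node; the exact wrapping depends on your n8n version and node type.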
Header and Footer Exclusion
By default, nothing is excluded - all visible text is extracted. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options.
When enabled, these options use universal selectors compatible with:
- WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
- Shopify: .shopify-section-header, .shopify-section-footer, etc.
- Webflow: [class*='header'], [id*='header'], etc.
- Drupal: .region-header, .region-footer, etc.
- Joomla: .header, .footer, .moduletable-menu, etc.
- Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.
If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.
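To picture how the exclusion options and excludeSelectors might combine, here is a sketch that builds a single CSS selector string from them. The selectors are only the ones quoted in the list above (the real set is larger, per the "etc."), and the split into header and footer groups, along with the names HEADER_SELECTORS, FOOTER_SELECTORS, and buildExcludeSelector, is an assumption for illustration.

```typescript
// Selectors quoted in this README; grouping into header/footer is assumed.
const HEADER_SELECTORS: string[] = [
  "header",
  "nav",
  "[role='banner']",
  ".site-header",
  ".elementor-location-header",
  ".wp-block-navigation",
  ".shopify-section-header",
  "[class*='header']",
  "[id*='header']",
  ".region-header",
  ".header",
  ".moduletable-menu",
];

const FOOTER_SELECTORS: string[] = [
  "footer",
  "[role='contentinfo']",
  ".shopify-section-footer",
  ".region-footer",
  ".footer",
];

interface ExclusionOptions {
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeSelectors?: string[];
}

// Returns a comma-separated selector list, or "" when nothing is excluded
// (matching the documented default of extracting all visible text).
function buildExcludeSelector(options: ExclusionOptions): string {
  const selectors: string[] = [];
  if (options.excludeHeader) selectors.push(...HEADER_SELECTORS);
  if (options.excludeFooter) selectors.push(...FOOTER_SELECTORS);
  selectors.push(...(options.excludeSelectors ?? []));
  // Drop repeated selectors while keeping first-seen order.
  return Array.from(new Set(selectors)).join(", ");
}
```

A string built this way could be passed to document.querySelectorAll in the page to find the elements to skip. An empty string should be treated as "exclude nothing", since querySelectorAll rejects an empty selector.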
Technical Details
- Uses Playwright for desktop rendering
- Waits for networkidle to ensure all content is loaded
- Checks element visibility using getBoundingClientRect() and computed styles
- Filters out elements with display: none, visibility: hidden, opacity: 0
- Only includes elements within the viewport bounds
- Deduplicates text blocks by normalized (lowercase) content
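The visibility and deduplication rules above can be sketched as two pure functions. Plain objects stand in for the values a browser would return from element.getBoundingClientRect() and getComputedStyle(element), so the logic runs outside a page. Two points are assumptions: "within the viewport bounds" is read as "the box overlaps the viewport", and normalization collapses whitespace in addition to lowercasing. All names here are hypothetical.

```typescript
// Stand-ins for DOMRect and the relevant computed style properties.
interface Rect {
  top: number;
  left: number;
  bottom: number;
  right: number;
  width: number;
  height: number;
}

interface StyleSnapshot {
  display: string;
  visibility: string;
  opacity: string; // computed styles report opacity as a string, e.g. "0.5"
}

interface Viewport {
  width: number;
  height: number;
}

function isVisible(rect: Rect, style: StyleSnapshot, viewport: Viewport): boolean {
  // Style-based filters listed in the README.
  if (style.display === "none" || style.visibility === "hidden") return false;
  if (parseFloat(style.opacity) === 0) return false;
  // Elements with no rendered box cannot show text.
  if (rect.width <= 0 || rect.height <= 0) return false;
  // Assumption: any overlap with the viewport counts as "within bounds".
  return (
    rect.bottom > 0 &&
    rect.right > 0 &&
    rect.top < viewport.height &&
    rect.left < viewport.width
  );
}

// Keep the first occurrence of each text, comparing normalized keys.
function deduplicateTexts(texts: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const text of texts) {
    // Assumption: whitespace is collapsed before lowercasing.
    const key = text.trim().replace(/\s+/g, " ").toLowerCase();
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(text);
    }
  }
  return unique;
}
```

Keeping the first occurrence preserves the order-of-appearance guarantee described under Features, because later duplicates are dropped without moving the earlier blocks.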
Development
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run locally
npm start
License
MIT