Website Content Text Extractor
Pricing
Pay per usage
Extract visible text content from websites as structured JSON blocks. Supports multi-URL batch processing, header/footer/cookie exclusion, and optional form extraction. Perfect for content analysis and translation workflows.
Developer
My Smart Digital
5 days ago
Last modified
Apify Actor for extracting visible text content from websites as structured JSON blocks.
Description
Extract clean, visible text from websites as structured blocks. Perfect for content migration, translation workflows, and data analysis. This actor extracts only the text that is actually visible to users, with options to exclude headers, footers, and cookie banners, and to extract form content.
Features
✅ Multi-URL Batch Processing: Process multiple URLs in a single run
✅ Viewport Presets: Choose between Desktop (1920x1080), Mobile (375x667), Tablet (768x1024), or custom dimensions
✅ Header/Footer/Cookie Exclusion: Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
✅ Form Content Extraction: Optional extraction of form content (labels, placeholders, values, dropdown options)
✅ DOM Order Preservation: Text blocks extracted in the order they appear on the page
✅ Code Filtering: Automatically filters out JavaScript, CSS, and code snippets
✅ Deduplication: Removes duplicate text blocks
✅ Configurable Selectors: Customize which elements to include/exclude
✅ Clean JSON Output: Structured output perfect for content analysis, translation, and data migration
Input
{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportType": "desktop",
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true,
  "extractForms": false
}
Parameters
- startUrl (optional): Single URL to extract text from (useful for quick tests). Leave empty if you only use startUrls
- startUrls (optional): List of additional URLs to process in a single run. Duplicates and empty lines are ignored
- viewportType (optional, default: "desktop"): Choose a predefined viewport size (desktop, mobile, tablet) or use custom dimensions
- viewportWidth (optional, default: 1920): Custom viewport width in pixels (only used when viewportType is "custom")
- viewportHeight (optional, default: 1080): Custom viewport height in pixels (only used when viewportType is "custom")
- excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, list the elements in excludeSelectors instead
- excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, list the elements in excludeSelectors instead
- excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
- extractForms (optional, default: false): Extract form content. When disabled, form elements (form, input, textarea, select, label) are excluded from extraction. When enabled, labels, placeholders, input values, and dropdown options are extracted, with each form field and dropdown option returned as a separate block
- excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if the excludeHeader, excludeFooter, or excludeCookies options don't work for your site structure
- includeSelectors (optional): CSS selectors to specifically include. If empty, all visible text is extracted
- minTextLength (optional, default: 3): Minimum character length for text blocks
- deduplicate (optional, default: true): Remove duplicate text blocks
- waitForSelector (optional): CSS selector to wait for before extraction
- waitTimeout (optional, default: 30000): Timeout in milliseconds for waiting for selector
- removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks
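As a rough illustration of how the URL and viewport parameters above might combine, here is a minimal Python sketch. The helper names `collect_urls` and `resolve_viewport` are hypothetical, and the actor's real implementation may differ. Only the preset dimensions and the rule that duplicates and empty lines are ignored come from the descriptions above; putting startUrl first and falling back to desktop for an unknown viewportType are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Preset sizes as listed under Features: (width, height) in pixels.
VIEWPORT_PRESETS: Dict[str, Tuple[int, int]] = {
    "desktop": (1920, 1080),
    "mobile": (375, 667),
    "tablet": (768, 1024),
}


def collect_urls(start_url: Optional[str], start_urls: Optional[List[str]]) -> List[str]:
    """Merge startUrl and startUrls, skipping empty entries and duplicates.

    Assumption: startUrl, when present, is processed first.
    """
    seen = set()
    result = []
    for raw in [start_url or ""] + list(start_urls or []):
        url = raw.strip()
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def resolve_viewport(run_input: dict) -> Tuple[int, int]:
    """Return (width, height); custom dimensions apply only to viewportType "custom"."""
    viewport_type = run_input.get("viewportType", "desktop")
    if viewport_type == "custom":
        return (
            run_input.get("viewportWidth", 1920),
            run_input.get("viewportHeight", 1080),
        )
    # Assumption: an unrecognized type falls back to the desktop preset.
    return VIEWPORT_PRESETS.get(viewport_type, VIEWPORT_PRESETS["desktop"])
```

With this sketch, an input that sets viewportType to "mobile" resolves to 375x667 even if viewportWidth is also present, which matches the note that the custom dimensions are only used when viewportType is "custom".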
Output
The actor returns a JSON object with the following structure:
{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}
Output Fields
- url: The URL that was processed
- title: Page title
- viewport: Viewport dimensions used
- textBlocks: Array of extracted text blocks, each with:
  - id: Unique identifier (block-1, block-2, etc.)
  - text: The extracted text content
  - order: Order of appearance (1, 2, 3, etc.)
  - tagName: HTML tag name (h1, p, li, etc.)
  - selector: CSS selector the block was extracted from, if any; otherwise null
- statistics: Summary statistics (totalBlocks, totalCharacters, uniqueBlocks, excludedElements)
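To show how the output fields above could be consumed downstream (for example, before sending text to a translation step), here is a small Python sketch. The sample record is a trimmed, made-up instance of the documented structure, and the helper names `blocks_in_order` and `blocks_by_tag` are hypothetical.

```python
import json
from typing import List

# A trimmed sample in the documented output shape (illustrative values only).
SAMPLE_OUTPUT = """
{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {"width": 1920, "height": 1080},
  "textBlocks": [
    {"id": "block-1", "text": "First visible text block", "order": 1, "tagName": "h1", "selector": null},
    {"id": "block-2", "text": "Second text block", "order": 2, "tagName": "p", "selector": null}
  ]
}
"""


def blocks_in_order(result: dict) -> List[str]:
    """Return block texts sorted by their `order` field."""
    blocks = sorted(result.get("textBlocks", []), key=lambda b: b["order"])
    return [b["text"] for b in blocks]


def blocks_by_tag(result: dict, tag_name: str) -> List[str]:
    """Return the texts of blocks whose tagName matches, e.g. all h1 headings."""
    return [
        b["text"]
        for b in result.get("textBlocks", [])
        if b.get("tagName") == tag_name
    ]


if __name__ == "__main__":
    record = json.loads(SAMPLE_OUTPUT)
    print("\n".join(blocks_in_order(record)))
```

Sorting by `order` is defensive here, since the blocks are described as already following DOM order.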
Use Cases
- Content Translation: Extract clean text blocks for translation workflows
- Content Analysis: Analyze visible content without navigation/header/footer noise
- SEO Content Extraction: Get only the main content for SEO analysis
- Content Migration: Extract content for migration to new platforms
- Form Data Extraction: Extract form labels, placeholders, and dropdown options for documentation or analysis
Header and Footer Exclusion
By default, nothing is excluded - all visible text is extracted. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options.
When enabled, these options use universal selectors compatible with:
- WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
- Shopify: .shopify-section-header, .shopify-section-footer, etc.
- Webflow: [class*='header'], [id*='header'], etc.
- Drupal: .region-header, .region-footer, etc.
- Joomla: .header, .footer, .moduletable-menu, etc.
- Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.
If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.
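The way these options and excludeSelectors could combine into a single exclusion selector can be sketched as follows. This is an assumption-laden illustration, not the actor's code: the function name `build_exclusion_selector` is hypothetical, the lists contain only the selectors named above (the real lists are longer, as the "etc." entries indicate), and the split between header and footer lists is a guess.

```python
from typing import List

# Only the selectors named in this section; assumed split into header vs. footer.
HEADER_SELECTORS: List[str] = [
    "header", "nav", "[role='banner']",
    ".site-header", ".elementor-location-header", ".wp-block-navigation",
    ".shopify-section-header", "[class*='header']", "[id*='header']",
    ".region-header", ".header", ".moduletable-menu",
]
FOOTER_SELECTORS: List[str] = [
    "footer", "[role='contentinfo']",
    ".shopify-section-footer", ".region-footer", ".footer",
]


def build_exclusion_selector(run_input: dict) -> str:
    """Combine the enabled built-in lists with user-supplied excludeSelectors."""
    selectors: List[str] = []
    if run_input.get("excludeHeader", False):
        selectors.extend(HEADER_SELECTORS)
    if run_input.get("excludeFooter", False):
        selectors.extend(FOOTER_SELECTORS)
    selectors.extend(run_input.get("excludeSelectors") or [])
    # Drop blanks and duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(s.strip() for s in selectors if s.strip()))
    return ", ".join(unique)
```

The resulting comma-separated string is a valid CSS selector list, so it could be passed to a single query for every element to skip. With both options left at their default of false and no excludeSelectors, the result is an empty string, matching the default of excluding nothing.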
Technical Details
- Uses Playwright for rendering with configurable viewport sizes
- Waits for networkidle to ensure all content is loaded
- Automatically scrolls pages to load lazy-loaded content
- Checks element visibility using getBoundingClientRect() and computed styles
- Filters out elements with display: none, visibility: hidden, opacity: 0
- Filters out JavaScript, CSS, and code snippets automatically
- Only includes elements within the viewport bounds
- Deduplicates text blocks by normalized (lowercase) content
- Maintains DOM order for accurate content structure
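The post-processing steps in the list above (empty-block removal, the minTextLength filter, and deduplication by lowercase content) might look roughly like the following Python sketch. The function name `clean_blocks` is hypothetical, and collapsing runs of whitespace before comparing is an added assumption; the source only states that normalization lowercases the content.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase the text; collapsing whitespace is an assumption of this sketch."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_blocks(
    texts: List[str],
    min_text_length: int = 3,
    deduplicate: bool = True,
    remove_empty_blocks: bool = True,
) -> List[str]:
    """Filter and deduplicate text blocks while preserving their original order."""
    seen = set()
    kept: List[str] = []
    for text in texts:
        stripped = text.strip()
        if remove_empty_blocks and not stripped:
            continue  # empty or whitespace-only block
        if len(stripped) < min_text_length:
            continue  # shorter than minTextLength
        if deduplicate:
            key = normalize(stripped)
            if key in seen:
                continue  # same content already kept earlier in DOM order
            seen.add(key)
        kept.append(stripped)
    return kept
```

Keeping the first occurrence of each duplicate preserves DOM order, so a heading repeated later in the page does not displace the original.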