SEO Data Extractor
Pricing
from $2.00 / 1,000 results
Extract comprehensive SEO metadata, headings, links, images, Open Graph tags, Twitter Cards, and technical data from websites. Perfect for SEO audits, competitor analysis, and content optimization. Runs on Apify platform with structured JSON output.
Developer: No-Code Venture
A comprehensive SEO data extraction tool that runs on the Apify platform.
Features
Extract comprehensive SEO data from any webpage including:
- Meta Information: Title, description, keywords, robots directives, canonical URLs, author, and generator tags with length counts
- Heading Structure: All H1-H6 tags with text content and counts for each level
- Content Analysis: Word count, link analysis (total/internal/external), and image audit (total/without alt text)
- Open Graph Tags: Complete Open Graph metadata (title, description, image, URL, type, site name)
- Twitter Cards: Twitter Card metadata for social sharing
- Technical SEO: Status codes, response time, charset, language, viewport settings
- Structured Data: JSON-LD detection and schema type identification
- Branding Assets: Favicon, Apple touch icon, and theme color detection
- Sitemap Extraction: Optionally fetch and include all URLs from each domain's sitemap.xml
- Error Handling: Graceful handling of HTTP errors (404, 500, etc.) with proper error codes and messages
Use Cases
- SEO Monitoring: Track SEO data for your websites or competitors over time
- Content Analysis: Analyze meta tags to optimize webpage content for search engines
- SEO Audits: Collect data for comprehensive SEO audits across multiple pages
- Competitor Analysis: Track SEO data for your competitors
- Bulk Data Extraction: Process 1 to 100,000+ pages efficiently
Input Configuration
| Field | Type | Description | Default |
|---|---|---|---|
| `startUrls` | Array | List of URLs to extract SEO data from | `https://nocodeventure.com` |
| `extractSitemapUrls` | Boolean | Fetch and include sitemap data for each domain | `false` |
| `sitemapUrl` | String | Custom sitemap path (e.g., `sitemap_index.xml` or `/sitemaps/main.xml`) | `/sitemap.xml` |
| `maxRequestsPerCrawl` | Integer | Maximum pages to scrape (0 = unlimited) | 100 |
| `requestTimeout` | Integer | Request timeout in seconds (3-10) | 5 |
| `maxConcurrency` | Integer | Parallel requests (1-50) | 10 |
| `maxRequestRetries` | Integer | Max retries for failed requests (0-5) | 1 |
| `proxyConfiguration` | Object | Proxy settings for anti-blocking | Apify Proxy disabled |
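For example, a minimal run input might look like the following (values are illustrative; whether `startUrls` entries are plain strings or `{ "url": ... }` objects depends on the Actor's input schema — Apify Actors commonly use the object form):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "extractSitemapUrls": true,
  "sitemapUrl": "/sitemap.xml",
  "maxRequestsPerCrawl": 100,
  "requestTimeout": 5,
  "maxConcurrency": 10,
  "maxRequestRetries": 1
}
```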
Output Schema
The Actor returns structured JSON data with the following fields:
| Field | Type | Description |
|---|---|---|
| `url` | String | The URL that was scraped |
| `scrapedAt` | String | ISO 8601 timestamp of when the page was scraped |
| `error` | String (optional) | Error code if scraping failed (e.g., "404", "500", "REQUEST_FAILED") |
| `errorMessage` | String (optional) | Human-readable error message |
Meta Information (meta)
| Field | Type | Description |
|---|---|---|
| `title` | String | Page title from the `<title>` tag |
| `titleLength` | Number | Character count of the title |
| `description` | String | Meta description content |
| `descriptionLength` | Number | Character count of the description |
| `keywords` | String | Meta keywords content |
| `robots` | String | Robots meta directive (e.g., "index, follow") |
| `canonical` | String | Canonical URL from meta tag |
| `author` | String | Author meta tag content |
| `generator` | String | Generator meta tag content |
Headings (headings)
| Field | Type | Description |
|---|---|---|
| `h1.text` | String | Combined text content of all H1 tags |
| `h1.count` | Number | Number of H1 tags found |
| `h2.text` | String | Combined text content of all H2 tags |
| `h2.count` | Number | Number of H2 tags found |
| `h3.text` | String | Combined text content of all H3 tags |
| `h3.count` | Number | Number of H3 tags found |
| `h4.text` | String | Combined text content of all H4 tags |
| `h4.count` | Number | Number of H4 tags found |
| `h5.text` | String | Combined text content of all H5 tags |
| `h5.count` | Number | Number of H5 tags found |
| `h6.text` | String | Combined text content of all H6 tags |
| `h6.count` | Number | Number of H6 tags found |
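The Actor itself runs on Apify (the repository referenced later is TypeScript), but the heading aggregation above can be sketched in plain Python to show the shape of the data — combined text plus a per-level count:

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect text and counts for h1-h6, mirroring the Actor's headings output."""
    def __init__(self):
        super().__init__()
        self.headings = {f"h{i}": {"text": [], "count": 0} for i in range(1, 7)}
        self._open = None  # heading tag currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self._open = tag
            self.headings[tag]["count"] += 1

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.headings[self._open]["text"].append(data.strip())

collector = HeadingCollector()
collector.feed("<h1>Home</h1><h2>About</h2><h2>Contact</h2>")
result = {tag: {"text": " ".join(v["text"]), "count": v["count"]}
          for tag, v in collector.headings.items()}
# result["h1"] → {"text": "Home", "count": 1}
# result["h2"] → {"text": "About Contact", "count": 2}
```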
Open Graph Tags (openGraph)
| Field | Type | Description |
|---|---|---|
| `title` | String | Open Graph title |
| `description` | String | Open Graph description |
| `image` | String | Open Graph image URL |
| `url` | String | Open Graph URL |
| `type` | String | Open Graph type (e.g., "website", "article") |
| `siteName` | String | Open Graph site name |
Twitter Cards (twitterCard)
| Field | Type | Description |
|---|---|---|
| `card` | String | Twitter card type (e.g., "summary", "summary_large_image") |
| `title` | String | Twitter card title |
| `description` | String | Twitter card description |
| `image` | String | Twitter card image URL |
| `site` | String | Twitter site handle |
Content Analysis (content)
| Field | Type | Description |
|---|---|---|
| `wordCount` | Number | Total word count in page body |
| `links.total` | Number | Total number of links found |
| `links.internal` | Number | Number of internal links (same domain) |
| `links.external` | Number | Number of external links (different domain) |
| `images.total` | Number | Total number of images found |
| `images.withoutAlt` | Number | Number of images missing alt text |
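One plausible definition of the internal/external split above is a hostname comparison against the scraped page's URL (relative links count as internal). A small Python sketch of that classification, not the Actor's actual implementation:

```python
from urllib.parse import urlparse

def classify_links(page_url, hrefs):
    """Split links into internal/external by hostname, matching the
    links.total / links.internal / links.external fields."""
    page_host = urlparse(page_url).hostname
    counts = {"total": 0, "internal": 0, "external": 0}
    for href in hrefs:
        counts["total"] += 1
        host = urlparse(href).hostname
        # Relative URLs have no hostname and resolve to the same domain.
        if host is None or host == page_host:
            counts["internal"] += 1
        else:
            counts["external"] += 1
    return counts

classify_links("https://example.com/page",
               ["/about", "https://example.com/x", "https://other.com/y"])
# → {"total": 3, "internal": 2, "external": 1}
```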
Technical SEO (technical)
| Field | Type | Description |
|---|---|---|
| `statusCode` | Number | HTTP response status code |
| `responseTime` | Number | Response time in milliseconds |
| `charset` | String | Character encoding (e.g., "UTF-8") |
| `language` | String | Page language from HTML lang attribute |
| `viewport` | String | Viewport meta tag content |
| `structuredData.hasStructuredData` | Boolean | Whether JSON-LD structured data was found |
| `structuredData.types` | Array | Array of structured data schema types found |
Branding Assets (branding)
| Field | Type | Description |
|---|---|---|
| `favicon` | String | Favicon URL |
| `appleTouchIcon` | String | Apple touch icon URL |
| `themeColor` | String | Theme color meta tag content |
Sitemap Data (sitemap) - Optional
Note: This field is only included when `extractSitemapUrls` is enabled. If the page scrape fails (HTTP error or request failure), the `sitemap` object will not be included in the output.
| Field | Type | Description |
|---|---|---|
| `found` | Boolean | Whether a sitemap was found and parsed |
| `sitemapUrl` | String | The sitemap URL that was fetched |
| `isKnownPath` | Boolean | Whether a known/custom sitemap path was used (see below) |
| `urlCount` | Number | Total number of URLs found in the sitemap |
| `urls` | Array | List of all URLs from the sitemap |
| `error` | String (optional) | Error message if sitemap fetch failed |
Example output with sitemap enabled:
{"url": "https://example.com","meta": { ... },"sitemap": {"found": true,"sitemapUrl": "https://example.com/sitemap.xml","isKnownPath": false,"urlCount": 156,"urls": ["https://example.com/","https://example.com/about","https://example.com/contact",...]},"scrapedAt": "2025-12-12T10:00:00.000Z"}
Sitemap caching: If you have multiple URLs from the same domain, the sitemap is only fetched once and reused for all pages from that domain.
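The per-domain caching described above can be sketched as a thin wrapper around whatever fetch function actually retrieves the sitemap (illustrative Python, with a fake fetcher standing in for the real HTTP request):

```python
from urllib.parse import urlparse

def make_sitemap_fetcher(fetch):
    """Wrap a sitemap fetch with a per-domain cache: each domain's sitemap
    is fetched once, then reused for every page from that domain."""
    cache = {}
    def get_sitemap(page_url):
        domain = urlparse(page_url).hostname
        if domain not in cache:
            cache[domain] = fetch(domain)
        return cache[domain]
    return get_sitemap

fetch_log = []
def fake_fetch(domain):
    fetch_log.append(domain)  # record how many real fetches happen
    return {"found": True, "sitemapUrl": f"https://{domain}/sitemap.xml"}

get_sitemap = make_sitemap_fetcher(fake_fetch)
get_sitemap("https://example.com/a")
get_sitemap("https://example.com/b")  # cache hit: no second fetch
# fetch_log → ["example.com"]
```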
Known Sitemap Paths
Some websites don't use the standard `/sitemap.xml` location. The Actor includes built-in support for these sites and marks them with `isKnownPath: true` in the output.
| Domain | Sitemap Location |
|---|---|
| amazon.com, www.amazon.com, aws.amazon.com | https://aws.amazon.com/ar/sitemaps/index/ |
When a known path is used, you'll see it in the logs:
```
Using known sitemap path for www.amazon.com: https://aws.amazon.com/ar/sitemaps/index/
```
To add support for more domains, edit `src/utils/sitemap-paths.ts`.
Error Output Example
When a URL returns an HTTP error (like 404), the Actor returns an error item instead of failing:
{"url": "https://example.com/broken-link","meta": {"title": "","titleLength": 0,"description": "","descriptionLength": 0,"keywords": "","robots": "","canonical": "","author": "","generator": ""},"technical": {"statusCode": 404,"responseTime": 150},"error": "404","errorMessage": "Page not found","scrapedAt": "2025-12-11T20:23:04.317Z"}
This allows you to:
- Continue processing other URLs without failing the entire run
- Identify broken links and problematic URLs in your dataset
- Filter error results using the dedicated "Errors" view
Output Views
The Actor provides multiple dataset views for different analysis needs:
- SEO Overview: Quick summary with URL, error status, title, description, canonical, robots, H1 count, and links
- Errors: Dedicated view for URLs that returned HTTP errors (404, 500, etc.) with error codes and messages
- Heading Structure: H1-H6 tags with text content and counts for each level
- Open Graph: Complete Open Graph metadata for social sharing
- Twitter Cards: Twitter Card metadata for social sharing
- Content Analysis: Word count, link breakdown (internal/external), and image audit data
- Technical SEO: HTTP status, response time, charset, language, viewport, and structured data
- Branding Assets: Favicon, Apple touch icon, and theme color information
- Sitemap Data: URLs found in each domain's sitemap (when sitemap extraction is enabled)
How to Export
- Access Results: After running, view collected data in Apify's interface
- Select Export Option: Download as CSV, JSON, Excel, or XML
- Open in Tools: Import into Excel, Google Sheets, or your analysis tool
- API Access: Use the Apify API to integrate with your workflows
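For API access, dataset items are served from Apify's v2 dataset-items endpoint; a small helper can build the export URL for your run's dataset ID (placeholder ID below — non-public datasets may additionally require your API token):

```python
def dataset_export_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the Apify v2 dataset items export URL for a given format."""
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"

# Download with any HTTP client, e.g.:
# requests.get(dataset_export_url("YOUR_DATASET_ID", "csv")).text
print(dataset_export_url("abc123", "csv"))
# → https://api.apify.com/v2/datasets/abc123/items?format=csv
```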
Pricing Model
This Actor uses Pay-Per-Event (PPE) pricing with automatic charging via Apify's synthetic events:
- Actor Start: Charged automatically when the Actor starts
- Dataset Item: Charged automatically for each result pushed to the dataset
Error Handling & Billing
URLs that return HTTP errors (404, 500, etc.) are still charged because:
- The Actor had to make a request to discover the error
- Error items are returned with proper error codes and messages
- This allows you to identify broken links without failing the entire run
You can set a maximum spending limit in the Apify Console to control costs.
What's Included
- Apify SDK - Toolkit for building Actors
- Input Schema - Input validation
- Dataset - Structured data storage
- Proxy Configuration - IP rotation for anti-blocking
Limitations
⚠️ JavaScript-Heavy Sites: This tool primarily extracts data from static HTML. It may not capture content that loads dynamically via JavaScript, potentially resulting in incomplete data extraction.
FAQ
Are duplicate URLs processed multiple times?
Yes. The Actor processes every URL in your input list, including duplicates. If you submit the same URL multiple times, it will be processed and charged each time.
Tip: Remove duplicates from your input list before running to save costs:
```
https://example.com/page1  ← processed, charged
https://example.com/page1  ← processed again, charged again
https://example.com/page2  ← processed, charged
```
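One simple way to deduplicate locally before submitting your input, keeping the original order:

```python
def dedupe_urls(urls):
    """Remove duplicate URLs while preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

dedupe_urls([
    "https://example.com/page1",
    "https://example.com/page1",
    "https://example.com/page2",
])
# → ["https://example.com/page1", "https://example.com/page2"]
```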
Am I charged for failed requests?
Yes. URLs that return HTTP errors (404, 500, etc.) or fail after retries are still charged because the Actor had to make a request to discover the error. However, you receive an error item in your dataset with the error code and message, so you know exactly what happened.
How can I control costs?
- Set a maximum spending limit in the Apify Console before running
- Use the `maxRequestsPerCrawl` input to limit the number of pages processed
- Remove duplicate URLs from your input list before running
- Set `maxRequestRetries` to 0 if you don't want failed requests to be retried
Legal Disclaimer
⚠️ Important Legal Notice
This tool is provided for educational and research purposes only. By using this SEO Data Extractor, you agree to:
- Comply with all applicable laws: You are solely responsible for ensuring your use of this tool complies with local, national, and international laws, including copyright laws, data protection regulations (such as GDPR, CCPA), and terms of service of target websites.
- Respect website terms of service: Many websites prohibit automated scraping in their terms of service. You must review and comply with each website's terms before using this tool.
- Respect robots.txt: This tool does not automatically check or respect robots.txt files. You are responsible for checking and honoring robots.txt directives.
- Rate limiting and ethical use: Use reasonable request rates and respect website operators. Excessive requests may constitute a denial-of-service attack.
- Data privacy compliance: Ensure your data collection and processing activities comply with privacy laws. Do not collect personal data without proper consent and legal basis.
- No warranties: This tool is provided "as is" without warranties of any kind. The authors are not responsible for any damages or legal consequences arising from its use.
- Use at your own risk: You assume all risks associated with using this tool. The authors disclaim all liability for any direct, indirect, incidental, or consequential damages.
Before using this tool, consult with legal counsel to ensure compliance with applicable laws and regulations.