Ai SEO Content Curator

Pay $10.00 for 1,000 results

quaking_pail/ai-seo-content-markdown-scraper
The SEO Actor performs a full SEO audit for each URL, extracting key SEO metrics such as titles, meta descriptions, and keywords. It also retrieves network information and combines it with the SEO audit data, storing the results in a structured dataset for further use.

Developer
Maintained by Community

Actor Metrics

  • 10 monthly users

  • No reviews yet

  • 6 bookmarks

  • >99% runs succeeded

  • Created in Sep 2024

  • Modified a day ago

AI SEO Content Scraper

The Selenium SEO Scraper is an Apify actor that uses Selenium and a headless Chrome browser to scrape websites, extract SEO-related data, and store it in a structured format. Users provide starting URLs and optional parameters via an input schema, and the actor outputs detailed metadata, network information, SEO audits, and page content to the default Apify dataset.

This documentation explains the input you need to provide and the output you’ll receive.

Input

To run the actor, provide input in JSON format through the Apify console’s “Input” tab or via the API. The input defines the URLs to scrape and controls the scraping scope.

Input Schema

{
    "title": "Selenium SEO Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "The URLs where scraping begins. Can be a list of strings or objects with a 'url' field.",
            "prefill": [{"url": "https://example.com"}],
            "editor": "requestListSources"
        },
        "max_depth": {
            "title": "Maximum Depth",
            "type": "integer",
            "description": "How deep to follow links (0 = only start URLs, 1 = one level of links, etc.).",
            "default": 1,
            "minimum": 0
        },
        "max_urls": {
            "title": "Max URLs",
            "type": "integer",
            "description": "The maximum number of URLs to scrape.",
            "default": 10,
            "minimum": 1
        },
        "search_engine": {
            "title": "Search Engine",
            "type": "string",
            "description": "Optional identifier for future features (e.g., search engine-specific scraping).",
            "enum": ["Google", "Bing", "DuckDuckGo"],
            "default": "Google"
        }
    },
    "required": ["start_urls"]
}

Input Fields Explained

start_urls (required):
A list of URLs to start scraping from.

Format: Either ["https://example.com"] or [{"url": "https://example.com"}].

Example: [{"url": "https://www.girlsinparis.com/fr/"}].

max_depth (optional, default: 1):
Controls how many levels of links to follow.

0: Scrape only the start URLs.

1: Scrape start URLs and their direct links.

2: Include links from those links, and so on.

Example: 2.

max_urls (optional, default: 10):
Limits the total number of URLs scraped.

Example: 100.

search_engine (optional, default: "Google"):
Currently informational; reserved for future enhancements (e.g., search engine-specific behavior).

Options: "Google", "Bing", "DuckDuckGo".

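The way max_depth and max_urls bound a crawl can be illustrated with a small breadth-first sketch in Python. This is not the actor's actual code: the traversal order is an assumption, and crawl_order and get_links are hypothetical names introduced only for illustration (get_links stands in for whatever extracts links from a loaded page).

```python
from collections import deque
from typing import Callable, Iterable


def crawl_order(
    start_urls: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 1,
    max_urls: int = 10,
) -> list[tuple[str, int]]:
    """Breadth-first walk returning (url, depth) pairs in visit order.

    Assumption: the real actor may traverse in a different order; this only
    shows how the two limits interact.
    """
    queue = deque((url, 0) for url in start_urls)
    seen: set[str] = set()
    visited: list[tuple[str, int]] = []
    while queue and len(visited) < max_urls:  # max_urls caps the total
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        visited.append((url, depth))
        if depth < max_depth:  # pages at the depth limit are not expanded
            for link in get_links(url):
                if link not in seen:
                    queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # Hypothetical link graph used instead of real pages.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
    print(crawl_order(["a"], lambda u: graph.get(u, []), max_depth=2, max_urls=3))
```

With max_depth set to 0 only the start URLs are returned, and lowering max_urls stops the walk early even when deeper links remain in the queue.
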
Example Inputs

Basic Example
Scrape one URL and its direct links:

{
    "start_urls": ["https://www.girlsinparis.com/fr/"],
    "max_depth": 1,
    "max_urls": 10
}

Advanced Example
Deeper crawl with multiple URLs:

{
    "start_urls": [
        {"url": "https://www.girlsinparis.com/fr/"},
        {"url": "https://example.com"}
    ],
    "max_depth": 2,
    "max_urls": 100,
    "search_engine": "Google"
}

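The two examples use both accepted forms of start_urls (plain strings and objects with a "url" field). If you build input programmatically, a small helper can normalize either form and fill in the documented defaults before submitting. The sketch below is illustrative only: normalize_start_urls and apply_defaults are hypothetical names, and the de-duplication step is an assumption rather than documented actor behavior.

```python
from typing import Any


def normalize_start_urls(start_urls: list[Any]) -> list[str]:
    """Return plain URL strings from a mix of strings and {"url": ...} objects."""
    urls: list[str] = []
    for entry in start_urls:
        if isinstance(entry, str):
            url = entry
        elif isinstance(entry, dict) and isinstance(entry.get("url"), str):
            url = entry["url"]
        else:
            raise ValueError(f"Unsupported start_urls entry: {entry!r}")
        url = url.strip()
        # Assumption: empty and duplicate URLs are dropped, order is kept.
        if url and url not in urls:
            urls.append(url)
    return urls


def apply_defaults(actor_input: dict[str, Any]) -> dict[str, Any]:
    """Fill in the defaults listed in the input schema for optional fields."""
    if "start_urls" not in actor_input:
        raise ValueError("start_urls is required")
    return {
        "start_urls": normalize_start_urls(actor_input["start_urls"]),
        "max_depth": actor_input.get("max_depth", 1),
        "max_urls": actor_input.get("max_urls", 10),
        "search_engine": actor_input.get("search_engine", "Google"),
    }


if __name__ == "__main__":
    print(apply_defaults({"start_urls": [{"url": "https://example.com"}]}))
```
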
How to Provide Input

Apify Console:
Go to your actor in the Apify console.

Open the “Input” tab.

Paste your JSON input or use the form (it matches the schema).

Save and run the actor.

API:
Send a POST request to /v2/acts/<actor-id>/runs with your JSON input as the request body.

Refer to the Apify API Docs for details.

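As a rough illustration of the API route, the Python sketch below builds the POST request using only the standard library. It assumes the API host is https://api.apify.com and that the token is passed as a token query parameter, as in the curl example elsewhere in this document; build_run_request is a hypothetical helper name, and the actor ID shown is a placeholder.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com"  # assumption: public Apify API host


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an actor run."""
    path = f"/v2/acts/{urllib.parse.quote(actor_id, safe='~')}/runs"
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{path}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),  # JSON input as the body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request(
        "<actor-id>",  # placeholder: replace with your actor ID
        "<your-api-token>",  # placeholder: replace with your API token
        {"start_urls": ["https://example.com"], "max_depth": 1, "max_urls": 10},
    )
    print(request.get_method(), request.full_url)
    # To actually start the run, pass the request to urllib.request.urlopen().
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before spending any run credits.
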
Output

The actor stores results in the default Apify dataset, which you can access via the console’s “Dataset” tab or API. Each scraped URL generates a JSON object containing metadata, network stats, SEO audit data, and page content.

Output Structure

{
    "url": "https://www.girlsinparis.com/fr/",
    "info": {
        "status": "complete",
        "title": "Girls in Paris - Lingerie & Swimwear",
        "description": "Explore our collection of lingerie and swimwear designed for comfort and style.",
        "firstH1": "Welcome to Girls in Paris",
        "pageSize": 12345,
        "metaCanonical": "https://www.girlsinparis.com/fr/",
        "metaLang": "",
        "metaLanguage": "",
        "htmlLang": "fr",
        "wordCount": 150,
        "linksCount": 20,
        "linksExternalCount": 5,
        "linksInternalCount": 15
    },
    "network": {
        "Ip": "unavailable",
        "IpReverse": "unavailable",
        "pageSizeCompressed": 12345,
        "fileSize": 12345,
        "connectTime": 0.5,
        "loadTime": 1.2,
        "HttpResponseCode": 200,
        "HttpContentType": "text/html; charset=UTF-8",
        "HttpResponse": "Content-Type: text/html; charset=UTF-8, ...",
        "HttpRequest": "User-Agent: Mozilla/5.0, ..."
    },
    "seoAudit": {
        "structuredDataPresent": "ok",
        "titleLength": 30,
        "titlePresent": "ok",
        "descriptionLength": 50,
        "descriptionPresent": "ok",
        "keywordsPresent": "absent",
        "h1Count": 1,
        "h2Count": 3,
        "headingStructureOk": "ok",
        "inlineCssCount": 2,
        "jsFilesCount": 5,
        "styleFilesCount": 3,
        "iframeCount": 0,
        "canonicalPresent": "ok",
        "htmlLangPresent": "ok",
        "metaViewportPresent": "ok",
        "robotsMetaPresent": "ok",
        "ogTagsPresent": "ok",
        "twitterTagsPresent": "absent"
    },
    "content": "# Welcome to Girls in Paris\nExplore our collection...",
    "timestamp": "2025-03-19T06:04:49Z",
    "search_engine": "Google"
}

Output Fields Explained

url (string):
The URL that was scraped.

info (object):
Metadata and statistics about the page:

status: Page load status (e.g., "complete").

title: The page’s title.

description: Meta description, if present.

firstH1: Text of the first <h1> tag.

pageSize: Size of the HTML source in bytes.

metaCanonical: Canonical URL from <link rel="canonical">.

metaLang, metaLanguage, htmlLang: Language attributes from meta tags or <html>.

wordCount: Total words in the page text.

linksCount: Total number of <a> tags.

linksExternalCount: Number of external links.

linksInternalCount: Number of internal links.

network (object):
HTTP request and response details:

Ip, IpReverse: IP address and reverse DNS (currently "unavailable" due to Apify environment limitations).

pageSizeCompressed, fileSize: Size of the response content in bytes.

connectTime: Time to first byte in seconds.

loadTime: Total request time in seconds.

HttpResponseCode: HTTP status code (e.g., 200 for success).

HttpContentType: MIME type (e.g., "text/html; charset=UTF-8").

HttpResponse: Full response headers as a string.

HttpRequest: Full request headers as a string.

seoAudit (object):
SEO analysis metrics:

structuredDataPresent: "ok" if structured data (e.g., schema.org) is found, else "missing".

titleLength: Character length of the title.

titlePresent: "ok" if a title exists, else "absent".

descriptionLength: Character length of the meta description.

descriptionPresent: "ok" if a description exists, else "absent".

keywordsPresent: "ok" if meta keywords exist, else "absent".

h1Count, h2Count: Number of <h1> and <h2> tags.

headingStructureOk: "ok" if exactly one <h1> is present, else "problematic".

inlineCssCount: Number of elements with inline CSS.

jsFilesCount: Number of external <script> tags.

styleFilesCount: Number of external <link rel="stylesheet"> tags.

iframeCount: Number of <iframe> tags.

canonicalPresent, htmlLangPresent, metaViewportPresent, robotsMetaPresent, ogTagsPresent, twitterTagsPresent: "ok" if present, else "absent".

content (string):
The main page content converted to Markdown, with scripts and unwanted elements removed.

timestamp (string):
UTC timestamp of when the data was scraped (e.g., "2025-03-19T06:04:49Z").

search_engine (string):
The value provided in the input (e.g., "Google"), currently for informational purposes.

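Because the seoAudit checks use a small set of status strings ("ok" versus "absent", "missing", or "problematic"), a dataset record can be reduced to a list of failing checks with a few lines of Python. The sketch below is a post-processing example, not part of the actor: audit_issues and heading_structure are hypothetical helper names, and heading_structure simply restates the documented rule for headingStructureOk.

```python
# Status strings that the field descriptions use to signal a problem.
PROBLEM_VALUES = {"absent", "missing", "problematic"}


def audit_issues(record: dict) -> list[str]:
    """Return the names of seoAudit checks whose value signals a problem."""
    audit = record.get("seoAudit", {})
    # Numeric fields such as h1Count never match a problem string.
    return sorted(name for name, value in audit.items() if value in PROBLEM_VALUES)


def heading_structure(h1_count: int) -> str:
    """Restate the documented rule: exactly one <h1> is "ok"."""
    return "ok" if h1_count == 1 else "problematic"


if __name__ == "__main__":
    # Trimmed copy of the sample record from the output structure.
    sample = {
        "url": "https://www.girlsinparis.com/fr/",
        "seoAudit": {
            "titlePresent": "ok",
            "keywordsPresent": "absent",
            "h1Count": 1,
            "headingStructureOk": "ok",
            "twitterTagsPresent": "absent",
        },
    }
    print(sample["url"], audit_issues(sample))
```
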
Accessing the Output

Apify Console:
After the actor runs, go to the “Dataset” tab in the Apify console.

Preview the data online or download it as JSON or CSV.

API:
Use the Apify API to fetch the dataset with a GET request to /v2/datasets/<dataset-id>/items.

Example:

curl "https://api.apify.com/v2/datasets/<dataset-id>/items?token=<your-api-token>"

Replace <dataset-id> with the ID from the run and <your-api-token> with your Apify API token.

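The same GET request can be made from Python with the standard library. This is a minimal sketch under a few assumptions: dataset_items_url and fetch_items are hypothetical helper names, and the format and limit query parameters are taken from the Apify dataset API as I understand it, so check the API docs before relying on them.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def dataset_items_url(dataset_id: str, token: str, limit: Optional[int] = None) -> str:
    """Build the dataset items URL used in the curl example."""
    params = {"token": token, "format": "json"}  # assumption: format=json is accepted
    if limit is not None:
        params["limit"] = str(limit)  # assumption: limit caps the item count
    dataset = urllib.parse.quote(dataset_id, safe="")
    return f"https://api.apify.com/v2/datasets/{dataset}/items?{urllib.parse.urlencode(params)}"


def fetch_items(dataset_id: str, token: str, limit: Optional[int] = None) -> list:
    """Download and decode the items; requires network access and a valid token."""
    with urllib.request.urlopen(dataset_items_url(dataset_id, token, limit), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholders: replace with the dataset ID from your run and your API token.
    print(dataset_items_url("<dataset-id>", "<your-api-token>", limit=5))
```
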
Notes

IP Information: The Ip and IpReverse fields are marked "unavailable" because direct DNS lookups are restricted in the Apify environment. Other network data (e.g., HttpResponseCode, loadTime) is still provided.

Dynamic Pages: Because pages are loaded in a headless Chrome browser through Selenium, JavaScript-rendered content is included in the scraped data.

Error Handling: If a URL fails to load or data extraction encounters issues, check the “Log” tab for details.
271Error Handling: If a URL fails to load or data extraction encounters issues, check the “Log” tab for details.