Internal Links Scraper

1 day trial then $25.00/month - No credit card required now

mysteriousshadow/internal-links-scraper

When given a sitemap of a website, this scraper will go through every page listed on the sitemap and find all the internal links. Useful for SEO, finding orphaned pages, and visualizing internal linking structure.

Sitemap-Based Web Scraper

This tool crawls every page listed on a website's sitemap and retrieves all internal links from each page. It’s ideal for SEO analysis, identifying orphaned pages, and visualizing the internal linking structure of a site.

Features

  • Crawl Entire Website: Starts with a sitemap and navigates through each page listed for thorough coverage.
  • Internal Link Extraction: Finds and catalogs all internal links on each page.
  • Internal Link Validation: Every internal link is checked before it is recorded. For example, self-referencing links, which point to the page they appear on (e.g., /about linking to /about), carry no useful information and are excluded from the results.
  • SEO Insights: Helps identify orphaned pages and underlinked/overlinked pages.
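
The extraction and validation features above can be illustrated with a minimal sketch. It is not the Actor's actual code: the class name LinkCollector and the function internal_links are hypothetical, and the rules used here (same host counts as internal, trailing slashes are stripped, self-links are dropped) are assumptions that may differ from what the Actor does.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return relative paths of internal links on a page (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html)
    page = urlsplit(page_url)
    own_path = page.path.rstrip("/")
    links = []
    for href in collector.hrefs:
        target = urlsplit(urljoin(page_url, href))
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if target.netloc != page.netloc:
            continue  # assumption: a different host means an external link
        path = target.path.rstrip("/")  # the root becomes "" (empty string)
        if path == own_path:
            continue  # self-referencing link
        links.append(path)
    return links
```

Duplicates are kept on purpose, so a page that links to /contact three times contributes three entries.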

Usage

  1. Provide Sitemap URL: Start by giving the scraper the URL of the sitemap (XML format).
  2. Run the Scraper: The scraper will visit each URL in the sitemap and collect internal links.
  3. Data Analysis: Use the output to spot orphaned, underlinked, or overlinked pages and adjust your site's internal linking accordingly.
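
To make step 1 concrete, here is a minimal sketch of reading page URLs out of an XML sitemap with Python's standard library. The function name sitemap_urls is hypothetical, and the sketch assumes a plain urlset sitemap that uses the standard sitemaps.org namespace; it does not show how the Actor itself parses sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


example = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""
print(sitemap_urls(example))
```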

How to Get a Sitemap

If you're unsure how to find a website's sitemap, follow this guide:
How to Find a Sitemap on Any Website.

Note: For larger websites, more RAM, CPU, and time may be needed to handle the extensive data collection.

Output Format

The scraper produces a structured output showing internal link relationships for each URL in the sitemap. The output includes:

  • linking_structure: The complete internal linking structure of the site, keyed by page URL. Link targets are shown as relative paths for clarity. For example:
    • The root domain is represented as "" (empty string).
    • /about instead of https://example.com/about.
  • incoming_links: The number of internal links pointing to the URL.
    • incoming_links[url] == 0 indicates an orphaned page (a page listed in the sitemap but not linked to from any other page).
  • outgoing_links: The number of internal links the URL contains, pointing to other pages within the site.
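
The relationship between the three fields can be sketched as follows. This is an illustration only: the function names link_counts and orphaned_pages are hypothetical, and the counting rules (duplicates count every time, trailing slashes are ignored) are assumptions rather than a description of the Actor's internals.

```python
from urllib.parse import urlsplit


def link_counts(linking_structure):
    """Derive incoming and outgoing link counts from a linking_structure dict."""
    incoming = {}
    outgoing = {}
    for page_url, targets in linking_structure.items():
        source = urlsplit(page_url).path.rstrip("/")  # root becomes ""
        incoming.setdefault(source, 0)  # every listed page gets an entry
        outgoing[source] = len(targets)
        for target in targets:
            key = target.rstrip("/")
            incoming[key] = incoming.get(key, 0) + 1
    return incoming, outgoing


def orphaned_pages(incoming):
    """Pages that no other page links to, i.e. incoming_links[url] == 0."""
    return sorted(path for path, count in incoming.items() if count == 0)


structure = {
    "https://example.com": ["/about", "/about"],
    "https://example.com/about": [""],
    "https://example.com/lonely": ["", "/about/"],
}
incoming, outgoing = link_counts(structure)
print(orphaned_pages(incoming))  # prints ['/lonely']
```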

Troubleshooting

If there are no results or unexpected results:

  1. Wait: It can take a while for the result to show up after the Actor has exited.
  2. Ensure the sitemap is accessible and in XML format: Double-check that the sitemap is reachable and correctly formatted in XML.
  3. Ensure the pages are accessible: If pages are not being crawled, you might need to adjust the proxy settings.
  4. Contact me: If the issue persists, feel free to reach out, and I’ll address the problem as soon as possible.
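
For step 2, a quick local check can tell whether a downloaded file is well-formed XML and what kind of sitemap it is. The sketch below is a hypothetical helper (sitemap_kind is not part of the Actor). It distinguishes a regular urlset from a sitemapindex, which lists other sitemaps rather than pages; whether the Actor accepts sitemap index files is not stated here, so treat that case as something to verify.

```python
import xml.etree.ElementTree as ET


def sitemap_kind(xml_text):
    """Classify sitemap text as 'urlset', 'sitemapindex', 'unknown', or 'invalid'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "invalid"  # not well-formed XML (e.g. an HTML error page)
    # Strip the "{namespace}" prefix ElementTree puts in front of tag names.
    tag = root.tag.rsplit("}", 1)[-1]
    if tag in ("urlset", "sitemapindex"):
        return tag
    return "unknown"
```

A result of "invalid" or "unknown" usually means the URL returned something other than a sitemap, such as a login page or an error page.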

Sample Output

{
    "linking_structure": {
        "https://pliwriters.com": [
            "/blog",
            "/about",
            "/contact",
            "/contact",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/how-to-find-internal-links-to-a-page": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/internal-links-vs-external-links",
            "/internal-link-visualization-beta",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/how-to-find-internal-links-to-a-page/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/about": [
            "",
            "",
            "/blog",
            "/contact",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/category/uncategorized": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/best-practices-for-website-navigation",
            "/blog/what-are-orphan-pages",
            "/blog/internal-links-vs-external-links",
            "/blog/internal-links-vs-external-links",
            "/blog/how-to-find-internal-links-to-a-page",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/3-key-components-of-seo",
            "/blog/3-key-components-of-seo",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/what-are-orphan-pages": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/what-are-orphan-pages/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/terms-and-conditions": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy"
        ],
        "https://pliwriters.com/privacy-policy": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog": [
            "",
            "",
            "/about",
            "/contact",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/best-practices-for-website-navigation",
            "/blog/best-practices-for-website-navigation",
            "/about",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/internal-link-visualization-beta": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/the-ultimate-guide-to-anchor-text": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/best-practices-for-website-navigation",
            "/blog/the-ultimate-guide-to-anchor-text/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/3-key-components-of-seo": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "",
            "/contact",
            "/blog/3-key-components-of-seo/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/contact": [
            "",
            "",
            "/blog",
            "/about",
            "/about",
            "/blog",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/orphan-page-test": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/best-practices-for-website-navigation": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/internal-links-vs-external-links",
            "/blog/what-are-orphan-pages",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/best-practices-for-website-navigation/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/how-to-find-a-sitemap-on-any-website": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/how-to-find-a-sitemap-on-any-website/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/internal-links-vs-external-links": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/what-are-orphan-pages",
            "/blog/internal-links-vs-external-links/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ]
    },
    "incoming_links": {
        "/orphan-page-test": 0,
        "/internal-link-visualization-beta": 1,
        "/blog/how-to-find-internal-links-to-a-page": 2,
        "/blog/3-key-components-of-seo": 3,
        "/blog/what-are-orphan-pages": 4,
        "/blog/the-ultimate-guide-to-anchor-text": 5,
        "/blog/best-practices-for-website-navigation": 5,
        "/blog/internal-links-vs-external-links": 5,
        "/blog/category/uncategorized": 7,
        "/blog/how-to-find-a-sitemap-on-any-website": 7,
        "/privacy-policy": 15,
        "/terms-and-conditions": 15,
        "/about": 30,
        "/blog": 30,
        "": 31,
        "/contact": 34
    },
    "outgoing_links": {
        "/blog/category/uncategorized": 20,
        "/blog/how-to-find-internal-links-to-a-page": 16,
        "/blog/best-practices-for-website-navigation": 15,
        "/blog/3-key-components-of-seo": 14,
        "/blog/what-are-orphan-pages": 13,
        "/blog/the-ultimate-guide-to-anchor-text": 13,
        "/blog/internal-links-vs-external-links": 13,
        "/blog": 12,
        "/internal-link-visualization-beta": 12,
        "/blog/how-to-find-a-sitemap-on-any-website": 12,
        "": 10,
        "/orphan-page-test": 10,
        "/terms-and-conditions": 9,
        "/privacy-policy": 9,
        "/about": 8,
        "/contact": 8
    }
}
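
For visualizing the internal linking structure, output shaped like the sample above could be converted into a Graphviz DOT graph. The sketch below is one possible approach, not a feature of the Actor: the function name to_dot is hypothetical, and it assumes you want one edge per unique source and target pair, with the root shown as "/" and trailing slashes ignored.

```python
from urllib.parse import urlsplit


def to_dot(linking_structure):
    """Build Graphviz DOT text with one edge per unique internal link."""
    edges = set()
    for page_url, targets in linking_structure.items():
        source = urlsplit(page_url).path.rstrip("/") or "/"
        for target in targets:
            dest = target.rstrip("/") or "/"
            if dest != source:  # drop links back to the same page
                edges.add((source, dest))
    lines = ["digraph internal_links {"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)
```

After loading the Actor's JSON output with json.load, pass its linking_structure value to to_dot and save the returned text to a .dot file for rendering with Graphviz.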
Developer
Maintained by Community

Actor Metrics

  • 10 monthly users

  • 2 stars

  • 91% runs succeeded

  • Created in Nov 2024

  • Modified 17 days ago