Internal Links Scraper

1 day trial then $25.00/month - No credit card required now

mysteriousshadow/internal-links-scraper

When given a sitemap of a website, this scraper will go through every page listed on the sitemap and find all the internal links. Useful for SEO, finding orphaned pages, and visualizing internal linking structure.

Developer
Maintained by Community

Actor Metrics

  • 20 Monthly users

  • No reviews yet

  • 2 bookmarks

  • 90% runs succeeded

  • Created in Nov 2024

  • Modified 2 months ago

Sitemap-Based Web Scraper

This tool crawls every page listed on a website's sitemap and retrieves all internal links from each page. It’s ideal for SEO analysis, identifying orphaned pages, and visualizing the internal linking structure of a site.

Features

  • Crawl Entire Website: Starts with a sitemap and navigates through each page listed for thorough coverage.
  • Internal Link Extraction: Finds and catalogs all internal links on each page.
  • Internal Link Validation: Internal links are validated before they are recorded. For example, self-referencing links, which point to the page they appear on (e.g., /about linking to /about), add no information and are left out of the results.
  • SEO Insights: Helps identify orphaned pages and underlinked/overlinked pages.
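
The extraction and validation steps described above can be illustrated with a minimal sketch. This is not the Actor's actual code, which is not shown in this README; the function name internal_links, the use of Python's standard html.parser, and the choice to strip trailing slashes are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the internal links on a page as relative paths.

    Hypothetical helper: external links, non-HTTP links, and links that
    point back to the page itself are skipped.
    """
    collector = _HrefCollector()
    collector.feed(html)

    page = urlsplit(page_url)
    own_path = page.path.rstrip("/")

    links = []
    for href in collector.hrefs:
        target = urlsplit(urljoin(page_url, href))
        if target.scheme not in ("http", "https"):
            continue  # mailto:, tel:, javascript:, ...
        if target.netloc != page.netloc:
            continue  # external link
        path = target.path.rstrip("/")  # assumption: "/blog/" and "/blog" are the same page
        if path == own_path:
            continue  # self-referencing link
        links.append(path)  # the site root becomes "" (empty string)
    return links
```

Duplicates are kept on purpose in this sketch, so a page that links to /contact from both its header and its footer contributes two entries.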

Usage

  1. Provide Sitemap URL: Start by giving the scraper the URL of the sitemap (XML format).
  2. Run the Scraper: The scraper will visit each URL in the sitemap and collect internal links.
  3. Data Analysis: Use the output to spot orphaned, underlinked, or overlinked pages and to improve your site's internal linking.
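
To make the first two steps concrete, here is a small sketch of how the page URLs can be read from an XML sitemap before each one is visited. It assumes a standard sitemaps.org urlset file; the function name sitemap_urls is hypothetical, and sitemap index files (sitemaps that list other sitemaps) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


example_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
""".strip()

for url in sitemap_urls(example_sitemap):
    print(url)
```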

How to Get a Sitemap

If you're unsure how to find a website's sitemap, follow this guide:
How to Find a Sitemap on Any Website.

Note: Larger websites may need more RAM, CPU, and time, because every page listed in the sitemap is crawled.

Output Format

The scraper produces a structured output showing internal link relationships for each URL in the sitemap. The output includes:

  • linking_structure: The complete internal linking structure of the site. Each key is a page URL from the sitemap, and its value lists the internal links found on that page. Links are shown as relative paths for clarity. For example:
    • The root domain is represented as "" (empty string).
    • /about is shown instead of https://example.com/about.
  • incoming_links: The number of internal links pointing to the URL.
    • incoming_links[url] == 0 indicates an orphaned page (a page listed in the sitemap but not linked to from any other page).
  • outgoing_links: The number of internal links the URL contains, pointing to other pages within the site.
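
As a sketch of how these fields can be used, the snippet below lists orphaned pages and the least-linked pages from an output object that has already been loaded into a Python dict. The function names and the tiny example dict are illustrative assumptions; only the incoming_links key comes from the format described above.

```python
def orphaned_pages(output):
    """Paths with no incoming internal links."""
    return sorted(
        path for path, count in output["incoming_links"].items() if count == 0
    )


def least_linked(output, n=3):
    """The n paths with the fewest incoming links, as (path, count) pairs."""
    ranked = sorted(
        output["incoming_links"].items(),
        key=lambda item: (item[1], item[0]),  # by count, then by path
    )
    return ranked[:n]


# Illustrative data in the documented shape (not real results).
example_output = {
    "incoming_links": {"": 4, "/about": 2, "/pricing": 1, "/old-landing-page": 0},
}

print(orphaned_pages(example_output))
print(least_linked(example_output, n=2))
```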

Troubleshooting

If there are no results or unexpected results:

  1. Wait: It can take a while for the result to show up after the Actor has exited.
  2. Ensure the sitemap is accessible and in XML format: Double-check that the sitemap is reachable and correctly formatted in XML.
  3. Ensure the pages are accessible: If pages are not being crawled, you might need to adjust the proxy settings.
  4. Contact me: If the issue persists, feel free to reach out, and I’ll address the problem as soon as possible.

Sample Output

{
    "linking_structure": {
        "https://pliwriters.com": [
            "/blog",
            "/about",
            "/contact",
            "/contact",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/how-to-find-internal-links-to-a-page": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/internal-links-vs-external-links",
            "/internal-link-visualization-beta",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/how-to-find-internal-links-to-a-page/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/about": [
            "",
            "",
            "/blog",
            "/contact",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/category/uncategorized": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/best-practices-for-website-navigation",
            "/blog/what-are-orphan-pages",
            "/blog/internal-links-vs-external-links",
            "/blog/internal-links-vs-external-links",
            "/blog/how-to-find-internal-links-to-a-page",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/3-key-components-of-seo",
            "/blog/3-key-components-of-seo",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/what-are-orphan-pages": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/what-are-orphan-pages/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/terms-and-conditions": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy"
        ],
        "https://pliwriters.com/privacy-policy": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog": [
            "",
            "",
            "/about",
            "/contact",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/the-ultimate-guide-to-anchor-text",
            "/blog/best-practices-for-website-navigation",
            "/blog/best-practices-for-website-navigation",
            "/about",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/internal-link-visualization-beta": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/the-ultimate-guide-to-anchor-text": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/best-practices-for-website-navigation",
            "/blog/the-ultimate-guide-to-anchor-text/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/3-key-components-of-seo": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "",
            "/contact",
            "/blog/3-key-components-of-seo/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/contact": [
            "",
            "",
            "/blog",
            "/about",
            "/about",
            "/blog",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/orphan-page-test": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/best-practices-for-website-navigation": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/internal-links-vs-external-links",
            "/blog/what-are-orphan-pages",
            "/blog/how-to-find-a-sitemap-on-any-website",
            "/blog/best-practices-for-website-navigation/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/how-to-find-a-sitemap-on-any-website": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/how-to-find-a-sitemap-on-any-website/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ],
        "https://pliwriters.com/blog/internal-links-vs-external-links": [
            "",
            "",
            "/blog",
            "/about",
            "/contact",
            "/blog/category/uncategorized",
            "/blog/what-are-orphan-pages",
            "/blog/internal-links-vs-external-links/",
            "/about",
            "/blog",
            "/contact",
            "/privacy-policy",
            "/terms-and-conditions"
        ]
    },
    "incoming_links": {
        "/orphan-page-test": 0,
        "/internal-link-visualization-beta": 1,
        "/blog/how-to-find-internal-links-to-a-page": 2,
        "/blog/3-key-components-of-seo": 3,
        "/blog/what-are-orphan-pages": 4,
        "/blog/the-ultimate-guide-to-anchor-text": 5,
        "/blog/best-practices-for-website-navigation": 5,
        "/blog/internal-links-vs-external-links": 5,
        "/blog/category/uncategorized": 7,
        "/blog/how-to-find-a-sitemap-on-any-website": 7,
        "/privacy-policy": 15,
        "/terms-and-conditions": 15,
        "/about": 30,
        "/blog": 30,
        "": 31,
        "/contact": 34
    },
    "outgoing_links": {
        "/blog/category/uncategorized": 20,
        "/blog/how-to-find-internal-links-to-a-page": 16,
        "/blog/best-practices-for-website-navigation": 15,
        "/blog/3-key-components-of-seo": 14,
        "/blog/what-are-orphan-pages": 13,
        "/blog/the-ultimate-guide-to-anchor-text": 13,
        "/blog/internal-links-vs-external-links": 13,
        "/blog": 12,
        "/internal-link-visualization-beta": 12,
        "/blog/how-to-find-a-sitemap-on-any-website": 12,
        "": 10,
        "/orphan-page-test": 10,
        "/terms-and-conditions": 9,
        "/privacy-policy": 9,
        "/about": 8,
        "/contact": 8
    }
}
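
The counts in the sample can be cross-checked against linking_structure. In the sample they appear consistent with counting every listed link, duplicates included, after dropping any trailing slash; that is an observation from this one sample rather than documented behaviour, and the recount function below is a hypothetical sketch built on that assumption.

```python
from urllib.parse import urlsplit


def recount(linking_structure):
    """Recompute (incoming, outgoing) link counts from a linking_structure dict.

    Assumption: every listed link counts once, duplicates included, and
    "/page/" is treated as the same path as "/page".
    """
    incoming = {}
    outgoing = {}
    for page_url, links in linking_structure.items():
        page_path = urlsplit(page_url).path.rstrip("/")  # root becomes ""
        outgoing[page_path] = len(links)
        incoming.setdefault(page_path, 0)  # pages nobody links to stay at 0
        for link in links:
            target = link.rstrip("/")
            incoming[target] = incoming.get(target, 0) + 1
    return incoming, outgoing


# Small illustrative structure in the same shape as the sample.
example_structure = {
    "https://example.com": ["/a", "/a/"],
    "https://example.com/a": ["", "/a/"],
    "https://example.com/b": [""],
}

incoming, outgoing = recount(example_structure)
print(incoming)
print(outgoing)
```

If the recomputed numbers differ from the Actor's incoming_links or outgoing_links for your site, treat the Actor's output as authoritative and the assumption above as the likely cause.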