Website Links Graph Generator

Creates an oriented graph visualizing links between webpages. Outputs: graph.png (visual network diagram) and graph.json (structured data) saved to the Key-Value Store, plus a detailed dataset of all crawled pages. Configure depth, boundaries, and layout.

Pricing: from $5.00 / 1,000 results
Rating: 5.0 (5)
Developer: Crawler Bros (Maintained by Community)
Last modified: 3 days ago

Web Link Graph Visualizer

Creates oriented graphs visualizing links between webpages

Crawl a website starting from a URL, extract all links, build a directed graph of the link structure, and export it as a PNG image or JSON file.


📥 What You'll Get

After the actor completes, you'll receive:

๐Ÿ–ผ๏ธ graph.png - Visual Network Diagram

  • Location: Key-Value Store โ†’ graph.png
  • Format: High-resolution PNG (2000x1600px)
  • Content: Visual graph with color-coded nodes and directed edges
  • Download: Click "Actions" โ†’ "Download" in Key-Value Store tab

📊 graph.json - Structured Data

  • Location: Key-Value Store → graph.json
  • Format: JSON file with complete graph structure
  • Content: All nodes, edges, and statistics
  • Use: Import into analysis tools or custom visualizations

📑 Dataset - All Crawled Pages

  • Location: Dataset tab (Storage section)
  • Format: JSON records (one per page)
  • Content: URL, title, depth, all links per page
  • Export: CSV, JSON, or Excel from Dataset tab

๐Ÿ” Where to Find in Apify Console:

  1. After actor finishes, go to "Storage" section
  2. Key-Value Store tab:
    • Download graph.png (your visual graph image)
    • Download graph.json (data for analysis)
  3. Dataset tab:
    • View/export all crawled pages
    • See links extracted from each page

Features

✅ Smart Crawling:

  • Start from any URL
  • Follow links matching a boundary regex
  • Configurable depth and page limits
  • Respects robots.txt (via Playwright)
  • Adjustable request delays
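The crawl behavior above can be sketched as a breadth-first traversal with a boundary regex and depth/page limits. This is an illustrative sketch, not the actor's source code: `fetch_links` is a hypothetical stand-in for the Playwright page fetch, and the parameter names simply mirror the actor's input options.

```python
import re
from collections import deque

def crawl(start_url, fetch_links, boundary_regex=".*", max_depth=3, max_pages=50):
    """Breadth-first crawl: follow links matching the boundary up to the limits."""
    boundary = re.compile(boundary_regex)
    queue = deque([(start_url, 0)])   # (url, depth) pairs still to visit
    visited = set()
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if url in visited or not boundary.match(url):
            continue                  # skip duplicates and out-of-boundary URLs
        visited.add(url)
        links = fetch_links(url)      # stand-in for the Playwright page fetch
        pages.append({"url": url, "depth": depth, "links": links})
        if depth < max_depth:
            for link in links:
                if link not in visited:
                    queue.append((link, depth + 1))
    return pages
```

Swapping in a real fetcher (Playwright, requests, etc.) is all that is needed to run this against a live site.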

✅ Graph Building:

  • Directed graph (oriented edges)
  • Track internal vs external links
  • URL normalization (remove fragments, trailing slashes)
  • Depth tracking for each node
  • Duplicate link detection
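The actor builds its graph with NetworkX, but the same bookkeeping can be sketched with plain dictionaries and a set; the set is what makes duplicate links collapse into a single directed edge. Names here are illustrative, not the actor's internals.

```python
import re

def build_graph(pages, boundary_regex=".*"):
    """Build a directed link graph from crawled pages; edges are deduplicated."""
    boundary = re.compile(boundary_regex)
    nodes = {}      # url -> node attributes (depth, internal/external flag)
    edges = set()   # (source, target) pairs; duplicates collapse automatically
    for page in pages:
        nodes.setdefault(page["url"], {"depth": page["depth"], "is_internal": True})
        for link in page["links"]:
            internal = bool(boundary.match(link))   # internal vs external tracking
            nodes.setdefault(link, {"depth": page["depth"] + 1, "is_internal": internal})
            edges.add((page["url"], link))
    return nodes, edges
```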

✅ Visualization:

  • Multiple layout algorithms (hierarchical, spring, circular, random)
  • Customizable node labels (URL, path, title, or index)
  • Color-coded nodes (internal=blue, external=red)
  • High-resolution PNG export
  • JSON export for programmatic use

✅ Statistics:

  • Total nodes and edges
  • Average outgoing links per page
  • Max depth reached
  • Internal vs external link counts

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrl | String | Required | The URL to start crawling from |
| boundaryRegex | String | .* | Regex to limit which URLs to crawl |
| maxDepth | Integer | 3 | Maximum crawl depth (1-10) |
| maxPages | Integer | 50 | Maximum pages to crawl (1-1000) |
| exportFormat | Select | both | Output format: both, image, or json |
| graphLayout | Select | hierarchical | Layout: hierarchical, spring, circular, random |
| nodeLabels | Select | path | Label type: url, path, title, index |
| includeExternal | Boolean | true | Show external links in graph |
| waitForSelector | String | - | CSS selector to wait for (optional) |
| requestDelay | Integer | 1000 | Delay between requests (ms) |

Example Inputs

Example 1: Small Website

{
  "startUrl": "https://example.com",
  "boundaryRegex": "^https://example\\.com/.*",
  "maxDepth": 2,
  "maxPages": 20,
  "exportFormat": "both",
  "graphLayout": "hierarchical",
  "nodeLabels": "path"
}

Example 2: Documentation Site

{
  "startUrl": "https://docs.python.org/3/",
  "boundaryRegex": "^https://docs\\.python\\.org/3/tutorial/.*",
  "maxDepth": 3,
  "maxPages": 50,
  "exportFormat": "image",
  "graphLayout": "spring",
  "nodeLabels": "title",
  "includeExternal": false,
  "requestDelay": 500
}

Example 3: Blog with Subdomains

{
  "startUrl": "https://blog.example.com",
  "boundaryRegex": "^https://.*\\.example\\.com/.*",
  "maxDepth": 2,
  "maxPages": 30,
  "exportFormat": "both",
  "graphLayout": "circular",
  "nodeLabels": "path"
}

Output

Dataset

Each crawled page is saved to the dataset with:

  • url - Page URL
  • title - Page title
  • depth - Depth from start URL
  • links - All extracted links
  • internal_links - Links matching boundary
  • external_links - Links outside boundary
  • crawled_at - Timestamp
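A record with this shape can be assembled as follows. The field names match the list above; the helper function itself and the boundary-based split are illustrative assumptions.

```python
import re
from datetime import datetime, timezone

def make_record(url, title, depth, links, boundary_regex=".*"):
    """Shape one dataset record, splitting links by the boundary regex."""
    boundary = re.compile(boundary_regex)
    return {
        "url": url,
        "title": title,
        "depth": depth,
        "links": links,
        "internal_links": [l for l in links if boundary.match(l)],
        "external_links": [l for l in links if not boundary.match(l)],
        "crawled_at": datetime.now(timezone.utc).isoformat(),  # ISO-8601 timestamp
    }
```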

Key-Value Store

graph.json (if JSON export enabled):

{
  "graph": {
    "nodes": [
      {
        "id": "https://example.com",
        "url": "https://example.com",
        "title": "Example Domain",
        "depth": 0,
        "is_internal": true,
        "outgoing_links": 3
      }
    ],
    "edges": [
      {
        "source": "https://example.com",
        "target": "https://example.com/page1"
      }
    ],
    "directed": true
  },
  "statistics": {
    "nodes": 15,
    "edges": 42,
    "crawled_pages": 15,
    "external_links": 3,
    "avg_outgoing_links": 2.8,
    "max_depth_reached": 2
  }
}
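Once downloaded, graph.json can be consumed with nothing but the standard library. This sketch indexes nodes by id, lists edges as (source, target) pairs, and flags any edge source that was never recorded as a node:

```python
import json

def summarize_graph(graph_json_text):
    """Parse graph.json: index nodes by id, list edges, report dangling sources."""
    data = json.loads(graph_json_text)
    nodes = {n["id"]: n for n in data["graph"]["nodes"]}
    edges = [(e["source"], e["target"]) for e in data["graph"]["edges"]]
    # every edge source should correspond to a crawled node
    dangling = [src for src, _ in edges if src not in nodes]
    return nodes, edges, dangling
```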

graph.png (if image export enabled):

  • High-resolution PNG image (2000x1600px)
  • Color-coded nodes (blue=internal, red=external)
  • Directed edges with arrows
  • Legend and statistics

OUTPUT (run summary):

{
  "start_url": "https://example.com",
  "statistics": {
    "nodes": 15,
    "edges": 42,
    "crawled_pages": 15
  },
  "exports": {
    "json": true,
    "image": true
  }
}

Boundary Regex Examples

| Pattern | Matches |
|---|---|
| `^https://example\.com/.*` | All pages on example.com |
| `^https://example\.com/blog/.*` | Only blog section |
| `^https://.*\.example\.com/.*` | All subdomains |
| `^https://example\.com/(?!admin).*` | Exclude admin section |
| `.*` | Everything (no boundary) |

Patterns are shown in plain regex form; double the backslashes (e.g. `\\.`) when pasting them into the JSON input.
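These patterns behave as standard Python `re` patterns anchored at the start of the URL. A quick check (the pattern names are illustrative):

```python
import re

patterns = {
    "site": r"^https://example\.com/.*",
    "blog": r"^https://example\.com/blog/.*",
    "subdomains": r"^https://.*\.example\.com/.*",
    "no_admin": r"^https://example\.com/(?!admin).*",  # negative lookahead
}

def in_boundary(url, pattern):
    """True if the URL falls inside the crawl boundary."""
    return re.match(pattern, url) is not None
```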

Use Cases

๐Ÿ” SEO Analysis:

  • Visualize site structure
  • Find orphan pages
  • Identify link depth issues

📊 Content Strategy:

  • Map content relationships
  • Find hub pages
  • Identify external dependencies

🔗 Link Building:

  • Discover internal linking opportunities
  • Find broken link paths
  • Analyze link distribution

๐Ÿ› ๏ธ Site Migration:

  • Document current structure
  • Plan URL redirects
  • Validate link integrity

Graph Layouts

Hierarchical (Default)

Best for: Sites with clear hierarchy (docs, blogs)

  • Top-down structure
  • Shows depth clearly

Spring (Force-Directed)

Best for: Discovering clusters

  • Nodes repel/attract based on connections
  • Reveals natural groupings

Circular

Best for: Small sites

  • Nodes arranged in a circle
  • Shows connections clearly

Random

Best for: Quick visualization

  • Fast to generate
  • Good for dense graphs

Node Label Types

| Type | Example | Best For |
|---|---|---|
| url | https://example.com/page | Small graphs |
| path | /blog/post-title | Medium graphs (default) |
| title | My Blog Post | Readable labels |
| index | 1, 2, 3 | Large graphs |
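A label function matching this table might look like the following sketch; the fallback-to-URL behavior is an assumption, not documented actor behavior.

```python
from urllib.parse import urlparse

def node_label(url, mode, title=None, index=None):
    """Pick a display label for a node, as selected by the nodeLabels option."""
    if mode == "url":
        return url
    if mode == "path":
        return urlparse(url).path or "/"   # root URLs get "/" instead of ""
    if mode == "title" and title:
        return title
    if mode == "index" and index is not None:
        return str(index)
    return url  # assumed fallback when the requested label is unavailable
```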

Performance Tips

  1. Start Small:

    • Use maxPages: 20 for initial runs
    • Increase gradually
  2. Tight Boundaries:

    • Use specific regex patterns
    • Avoid crawling entire domains
  3. Adjust Depth:

    • Depth 2-3 is usually sufficient
    • Depth 4+ can explode exponentially
  4. Request Delays:

    • Use 1000ms+ for courtesy
    • Reduce for fast sites
  5. External Links:

    • Set includeExternal: false for cleaner graphs
    • Enable to see dependencies
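Tip 3's warning about depth is simple arithmetic: if each page links to about b pages within the boundary, a crawl can reach on the order of 1 + b + b² + … + b^d pages by depth d. A quick illustration, assuming uniform branching:

```python
def worst_case_pages(branching, depth):
    """Upper bound on pages reachable by a given depth with uniform branching."""
    return sum(branching ** d for d in range(depth + 1))

# With 5 links per page, depth 2 stays small but depth 4 explodes:
# worst_case_pages(5, 2) -> 31, worst_case_pages(5, 4) -> 781
```

This is why maxPages acts as the real safety limit once depth grows.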

Limitations

  • Max Pages: 1000 (configurable limit)
  • Max Depth: 10 (configurable limit)
  • JavaScript: Rendered via Playwright (may be slow)
  • Image Size: Large graphs (100+ nodes) may have small labels

Technical Details

Built With:

  • Python 3.11
  • Apify SDK
  • Playwright (browser automation)
  • BeautifulSoup4 (HTML parsing)
  • NetworkX (graph algorithms)
  • Matplotlib (visualization)

Graph Type:

  • Directed graph (DiGraph)
  • Nodes = URLs
  • Edges = Links (from → to)

URL Normalization:

  • Removes fragments (#section)
  • Removes trailing slashes
  • Preserves query strings
  • Converts relative to absolute
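These normalization rules map cleanly onto urllib.parse. A sketch of the idea, noting that the actor's exact edge-case handling may differ:

```python
from urllib.parse import urljoin, urldefrag

def normalize_url(href, base_url):
    """Normalize a link: absolute URL, no fragment, no trailing slash, query kept."""
    absolute = urljoin(base_url, href)     # convert relative to absolute
    no_fragment, _ = urldefrag(absolute)   # drop #section (query string survives)
    if no_fragment.endswith("/") and not no_fragment.endswith("://"):
        no_fragment = no_fragment.rstrip("/")
    return no_fragment
```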

Example Output

Small Site (10 pages)

Nodes: 10
Edges: 28
Crawled pages: 10
External links: 3
Avg links per page: 2.8
Max depth reached: 2

Documentation Site (50 pages)

Nodes: 53 (50 internal + 3 external)
Edges: 142
Crawled pages: 50
External links: 3
Avg links per page: 2.7
Max depth reached: 3

Troubleshooting

Issue: No links found

  • Check waitForSelector for dynamic sites
  • Verify boundary regex matches start URL

Issue: Too many nodes

  • Reduce maxPages or maxDepth
  • Tighten boundary regex

Issue: Image labels too small

  • Use nodeLabels: "index" for large graphs
  • Reduce number of nodes

Issue: Slow crawling

  • Reduce requestDelay
  • Decrease maxPages
  • Check site performance

Support

For issues or questions:

  1. Check input parameters
  2. Verify boundary regex
  3. Test with small maxPages first
  4. Review dataset for crawl results

License

MIT License - Free for commercial and personal use


Built with โค๏ธ using Apify SDK