Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

hyperlinks in markdown are broken

Closed

biscayneworks opened this issue

When exporting the results of a scrape, the markdown contains broken hyperlinks such as:

[

Admin & Relations

](/careers_category/admin-relations)

Is it possible to remove the unwanted whitespace after [ and before ] so that the markdown is as expected?

Jan Buchar (janbuchar)

Hello, and thank you for your interest in the Actor!

Website Content Crawler uses the Turndown library to convert HTML to markdown. I have verified that the library behaves as you describe:

import TurndownService from 'turndown';

// HTML of one affected link: the <a> element wraps block-level children (div, h4, h6)
const str = `<a href="/careers/career-fields/admin-relations" class="promo-box tab-focus" style="background-image: url(&quot;https://content.nationalguard.com/sites/default/files/adminrelations2023.jpg&quot;);"><div class="inner"><h4 class="promo-title">Admin &amp; Relations</h4> <h6>
         Learn more.
         <i class="fas fa-chevron-right" aria-hidden="true"></i></h6></div> <div class="overlay"></div></a>`;
// Create the converter with the same options as in the original snippet
const td = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
td.turndown(str); // returns broken markdown link

I'm afraid there is nothing we can do on our end. You have three options: open an issue with turndown (https://github.com/mixmark-io/turndown); remove the offending links completely with the removeElementsCssSelector option; or ignore the markdown produced by Website Content Crawler and run the conversion yourself with a different tool after you download the dataset, using the HTML as input (you need to enable the "Save HTML" option for that).
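If a full re-conversion is more than you need, one further possibility is to post-process the exported markdown yourself. The sketch below is only an illustration and not something Website Content Crawler or Turndown provides: `tidyMarkdownLinks` is a hypothetical helper name, and the regular expression assumes simple links without nested brackets or parentheses in the target. It trims the whitespace after `[` and before `]` and collapses any line breaks inside the link text into single spaces.

```javascript
// Hypothetical helper (not part of the Actor or of Turndown): tidies
// Markdown links of the form "[ text ](url)" in an exported string.
function tidyMarkdownLinks(markdown) {
  // \[\s*           opening bracket plus any leading whitespace/newlines
  // ([^\[\]]*?)     link text without nested brackets (lazy match)
  // \s*\]           trailing whitespace/newlines plus closing bracket
  // \(([^()\s]+)\)  link target without spaces or parentheses
  const linkPattern = /\[\s*([^\[\]]*?)\s*\]\(([^()\s]+)\)/g;
  return markdown.replace(linkPattern, (match, text, url) => {
    // Collapse runs of whitespace inside the link text to one space
    const label = text.replace(/\s+/g, ' ');
    return `[${label}](${url})`;
  });
}

// The broken link reported in this issue
const broken = '[\n\nAdmin & Relations\n\n](/careers_category/admin-relations)';
const fixed = tidyMarkdownLinks(broken);
console.log(fixed); // prints: [Admin & Relations](/careers_category/admin-relations)
```

Note that this only normalizes whitespace. If the link text contains other block syntax (for example heading markers produced from an h4 or h6 inside the anchor), those characters stay in the label and would need separate handling.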

Hope that helps!
