HTML to PDF Converter
Opens a web page in headless Chrome using Puppeteer and prints it to PDF. The input is a JSON object and the output is a PDF file.
PDF to HTML Converter
Converts a PDF document to HTML using the pdf2htmlEX tool.
Broken Links Checker
Crawls a website and finds broken links. Unlike similar SEO analysis tools, the actor also reports broken URL #fragments. The results are stored in a JSON report and an HTML report.
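The fragment check described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the actor's own code; the names `AnchorCollector` and `fragment_exists` are hypothetical, and the sketch assumes a fragment is valid when it matches an element `id` or a legacy `<a name>` anchor.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, unquote


class AnchorCollector(HTMLParser):
    """Collects every value that a URL #fragment could point at."""

    def __init__(self):
        super().__init__()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Any element's id, or the legacy name attribute on <a>, is a target.
            if value and (name == "id" or (name == "name" and tag == "a")):
                self.targets.add(value)


def fragment_exists(url, html):
    """Return True if url has no fragment, or its fragment matches an anchor in html."""
    _, fragment = urldefrag(url)
    if not fragment:
        return True
    collector = AnchorCollector()
    collector.feed(html)
    # Fragments may be percent-encoded in the link, so decode before comparing.
    return unquote(fragment) in collector.targets
```

A link such as `https://example.com/docs#install` would then count as broken when the page loads successfully but contains no element with `id="install"`.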
Crawls a website using one or more sitemaps and imports the data into an Algolia search index. The text content is identified using simple CSS selectors.
Naked domains analyzer
Crawls and downloads web pages running on a list of provided naked domains (e.g. "example.com"). The actor stores an HTML snapshot, screenshot, text body, and HTTP response headers of all the pages. It also extracts email addresses...
Data, what now?
Simple example showing how to scrape a list of posts from a personal blog.
A small, efficient act that loads a web page, parses its HTML using the Cheerio library, and extracts metadata from the <head> element, such as the page title, description, and author.
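The extraction step can be sketched as below. Note that this swaps Cheerio for Python's standard-library `html.parser`, purely for illustration; `HeadMetaParser` and `extract_head_meta` are hypothetical names, and the sketch assumes the metadata of interest lives in `<title>` and `<meta name="..." content="...">` elements inside `<head>`.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Gathers the <title> text and named <meta> values found inside <head>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_head = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif self._in_head and tag == "title":
            self._in_title = True
        elif self._in_head and tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                # Normalize the key so "Author" and "author" are treated alike.
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data


def extract_head_meta(html):
    """Return a dict of metadata taken from the <head> of an HTML document."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    return parser.meta
```

Calling `extract_head_meta(html)` on a downloaded page returns a plain dictionary, which maps naturally onto a JSON output record.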
Downloads a list of heavy-duty construction equipment for sale or rent, such as heavy-duty trucks and trailers.
Send Email On Crawler Finish
Fetches information about a crawler run and sends it to the user by email. For example, this actor can be used to inform the user that the crawler run finished. To do that, simply put the following URL into "Finish webhook URL" se...
Extracts texts from a German automotive discussion portal. For example, such a dataset can be used by a machine learning system for sentiment analysis to figure out how people perceive various car models.
Example Analyze Dom Css
Example showing how to use headless Chromium with Puppeteer to open a web page, fetch the list of DOM nodes on the page, and obtain CSS styling information for each HTML element. The actor uses the Chrome DevTools Protocol to acce...
Download CSS files
Downloads CSS files linked from a webpage.
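Finding the linked CSS files is the core of such a task, and it can be sketched as follows. This is only an illustrative Python sketch using the standard library, not the actor's implementation; `StylesheetLinkParser` and `find_stylesheet_urls` are hypothetical names, and the actual download step is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collects absolute URLs of stylesheets referenced by <link> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        # rel is a space-separated token list, e.g. "alternate stylesheet".
        rel = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "stylesheet" in rel and href:
            # Resolve relative hrefs against the URL of the page itself.
            self.stylesheets.append(urljoin(self.base_url, href))


def find_stylesheet_urls(base_url, html):
    """Return the absolute stylesheet URLs linked from the given HTML."""
    parser = StylesheetLinkParser(base_url)
    parser.feed(html)
    return parser.stylesheets
```

Each returned URL could then be fetched with any HTTP client and saved under a name derived from its path.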
Probe Resources Plus Webhook
Calls jancurn/probe-page-resources and then invokes a hard-coded webhook. The act takes the same input as jancurn/probe-page-resources.
Probe Page Resources
Sequentially loads a list of URLs in headless Chrome and analyzes HTTP resources requested by each page. Source code at https://github.com/jancurn/act-probe-page-resources
Cz President Election
Collects voting data from the Czech Statistical Office about the 2018 Czech presidential election.
Example Sitemap Cheerio
An example actor that first downloads a sitemap in XML format and then crawls each page from the sitemap using the fast CheerioCrawler from the Apify SDK.
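The first step, turning a sitemap into a list of page URLs, can be sketched as below. This is a Python illustration using the standard library rather than the Apify SDK; `urls_from_sitemap` is a hypothetical name, and the sketch assumes a plain `<urlset>` sitemap in the standard sitemaps.org namespace (not a sitemap index).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the page URLs listed in <url><loc> entries of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]
```

The resulting list is what a crawler would use as its initial request queue.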
Downloads a list of all news articles published on novinky.cz in the past week. Note that we're using the mobile version of the website because it has a simpler structure and is faster to load.