Analyze Website Content
Pricing
$2.00/month + usage
This tool analyzes the textual content of a website. It scrapes pages, cleans the HTML, analyzes the text, and extracts the content terminology (keywords, words and n-grams). This is useful for identifying the main topics covered, analyzing competitor content, finding new ideas or trends, and supporting SEO work.
Developer
Leila Khouas
Description
This tool analyzes the textual content of a given website or domain name. It scrapes pages down to a given depth, cleans the HTML pages (removing unimportant text such as navigation links and menus), analyzes the text, and extracts the content terminology (keywords, words and n-grams). Extracting terminology or keywords makes it possible to summarize the content of a website and identify the main topics it covers. The tool can be used for many applications, such as analyzing competitor content, finding new ideas and trends, and supporting SEO.
Main features
- Scrape a given website at a specific depth (extra domain links are ignored)
- Clean and process HTML and plain text pages
- Extract the most frequent words (single-word terms) and the most frequent n-grams (multi-word terms of 2 to 4 words: bigrams, trigrams and quadrigrams)
- Extract keywords from the HTML metadata
- Merge all the extracted data, by language, for a global website analysis
- Extract social media links and emails
- Output results in JSON format and as SVG word cloud images
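The first feature above (depth-limited scraping that ignores links to other domains) can be illustrated with a minimal sketch. It is not the tool's actual crawler: the `same_domain` and `crawl_order` helpers and the in-memory `links` map are hypothetical, and the map stands in for fetching pages over HTTP and reading their links.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def same_domain(url, start_url):
    """Return True when url has the same host as start_url."""
    return urlparse(url).netloc.lower() == urlparse(start_url).netloc.lower()


def crawl_order(start_url, links_by_page, max_depth):
    """Breadth-first walk over a link map, stopping at max_depth.

    links_by_page is a hypothetical stand-in for downloading a page and
    extracting its href values; a real crawler would issue requests here.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        for href in links_by_page.get(url, []):
            target = urljoin(url, href)  # resolve relative links
            if same_domain(target, start_url) and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Assumed example site: one external link, one page reachable only at depth 2.
links = {
    "https://example.com/": ["/about", "https://other.org/x", "/blog"],
    "https://example.com/blog": ["/blog/post-1"],
}
print(crawl_order("https://example.com/", links, max_depth=1))
```

With a depth of 1 the walk covers the start page and the two internal pages it links to; the external link is skipped, and the blog post is only reached when the depth is raised to 2.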
Supported formats
- HTML
- Plain text
Language identification
- The language is identified for each scraped page.
- The identified language is assigned to the terms extracted from that page.
- Language-specific stopwords (the most common words and short function words, such as "the", "is", "at" and "which" in English) are used to filter the final term list.
- Stopwords are discarded from the word list and are not allowed as the first or last word of an n-gram.
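The stopword rules above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the `STOPWORDS` set is a tiny made-up English sample, and the `tokenize` and `extract_terms` helpers are hypothetical names.

```python
import re
from collections import Counter

# Tiny illustrative English stopword list (assumption); the tool's real
# per-language lists are not documented here.
STOPWORDS = {"the", "is", "at", "which", "of", "a", "and", "for", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def extract_terms(text, max_n=4):
    """Count single words and n-grams of 2 to max_n words.

    Stopwords are dropped from the word counts, and any n-gram that
    starts or ends with a stopword is rejected. Stopwords are still
    allowed in the middle of an n-gram.
    """
    tokens = tokenize(text)
    words = Counter(t for t in tokens if t not in STOPWORDS)
    ngrams = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            ngrams[" ".join(gram)] += 1
    return words, ngrams


words, ngrams = extract_terms(
    "The analysis of content is the key to content analysis."
)
print(words.most_common())
print(sorted(ngrams))
```

In this sample, "analysis of content" is kept because the stopword sits in the middle, while "of content" and "the analysis" are rejected because they begin with a stopword.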
Supported languages
French, English, German, Spanish, Italian, Portuguese
Input
The main input of the tool is the starting URL of the website to process.
Output
The analysis produces the following results:
- A dataset of the most frequent extracted terms. The data includes keywords, words and n-grams. For each term, the value, frequency, language and type are given. The dataset can be found in Output and in storage (terms.json)

- The list of scraped pages is provided in JSON format. Each page is described by its url, title, description, author, date, keywords and language. The file can be found in storage (pages.json)
- The emails and social media links are provided in JSON format. The file can be found in storage (contact.json)
- A global file combining all the output (terms, contact and pages) can be found in storage (all.json)

- The extracted terms can be represented as wordcloud SVG images. The images can be found in storage (wordcloud..svg)
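As a sketch of how the terms dataset might be consumed, the snippet below groups the most frequent terms by language. The field names (value, frequency, language, type) come from the description above; everything else is an assumption, including the top-level JSON array layout, the language codes, the type labels ("word", "ngram") and the sample values.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like the described terms dataset. The exact
# layout of terms.json, the language codes and the type labels are assumed.
sample = """[
  {"value": "seo", "frequency": 30, "language": "en", "type": "word"},
  {"value": "content analysis", "frequency": 12, "language": "en", "type": "ngram"},
  {"value": "keyword extraction", "frequency": 18, "language": "en", "type": "ngram"},
  {"value": "analyse de contenu", "frequency": 7, "language": "fr", "type": "ngram"}
]"""


def top_terms_by_language(terms, term_type, limit=3):
    """Return the most frequent term values of one type, per language."""
    grouped = defaultdict(list)
    for term in terms:
        if term["type"] == term_type:
            grouped[term["language"]].append(term)
    return {
        lang: [
            t["value"]
            for t in sorted(items, key=lambda t: t["frequency"], reverse=True)[:limit]
        ]
        for lang, items in grouped.items()
    }


terms = json.loads(sample)
print(top_terms_by_language(terms, "ngram"))
```

To work on real output, the sample string would be replaced with the contents of the downloaded terms.json file, after checking that its structure matches these assumptions.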

Your feedback
If you’ve got any technical feedback, a bug to report, or a suggestion for improving the Actor, please create an issue on the Actor’s Issues tab.