Analyze Website Content

Pricing

$2.00/month + usage

The tool analyzes the textual content of a website. It scrapes pages, cleans the HTML, analyzes the text, and extracts the content terminology (keywords, words, and n-grams). This helps identify the main topics covered, analyze competitor content, find new ideas or trends, and support SEO.

Rating

0.0 (0)

Developer

Leila Khouas

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 3
  • Monthly active users: 2
  • Last modified: 17 hours ago

Description

This tool analyzes the textual content of a given website or domain name. It scrapes pages down to a given depth, cleans the HTML (removing unimportant text such as navigation links and menus), analyzes the text, and extracts the content terminology (keywords, words, and n-grams). Terminology or keyword extraction summarizes the content of a website and identifies the main topics covered. The tool can be used for many applications, such as analyzing competitor content, finding new ideas and trends, and supporting SEO.

Main features

  • Scrape a given website to a specific depth (links to other domains are ignored)
  • Clean and process HTML and plain text pages
  • Extract the most frequent words (single-word terms) and the most frequent n-grams (multi-word terms of 2 to 4 words: bigrams, trigrams, and quadrigrams)
  • Extract keywords from the HTML metadata
  • Merge all the extracted data, by language, for a global website analysis
  • Extract social media links and emails
  • Output results as JSON files and SVG wordcloud images
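
The word and n-gram extraction listed above can be illustrated with a minimal Python sketch. This is not the Actor's own code: the tokenizer, the function names (tokenize, ngram_counts, most_frequent_terms) and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    # [^\W\d_]+ matches Unicode letters only (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, joined by single spaces."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def most_frequent_terms(text, top=5):
    """Return the top terms for n = 1 (words) through n = 4 (quadrigrams)."""
    tokens = tokenize(text)
    return {n: ngram_counts(tokens, n).most_common(top) for n in range(1, 5)}


if __name__ == "__main__":
    sample = "open source web scraping and open source web crawling"
    for size, terms in most_frequent_terms(sample, top=2).items():
        print(size, terms)
```

The sketch counts raw frequencies only; a real pipeline would also apply the language-specific stopword filtering described under Language identification.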

Supported formats

  • HTML
  • Plain text

Language identification

  • The language is identified for each scraped page.
  • The identified language is assigned to the terms extracted from the page.
  • Language-specific stopwords (the most common words and short function words, such as "the", "is", "at", and "which" in English) are used to filter the final term list.
  • Stopwords are discarded as single words and are not allowed as the first or last word of an n-gram.
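
As a rough illustration of the stopword rule above, the following Python sketch rejects a term when its first or last word is a stopword. The STOPWORDS table is a tiny hypothetical sample, and keep_term is an illustrative name, not part of the Actor.

```python
# Hypothetical, deliberately tiny stopword lists keyed by language code.
STOPWORDS = {
    "en": {"the", "is", "at", "which", "of", "and", "a", "for"},
    "fr": {"le", "la", "les", "de", "et", "un", "une"},
}


def keep_term(term, language):
    """Return True if the term passes the stopword filter.

    A single word is rejected when it is a stopword. An n-gram is
    rejected when its first or last word is a stopword; stopwords in
    the middle of an n-gram are allowed.
    """
    words = term.lower().split()
    if not words:
        return False
    stop = STOPWORDS.get(language, set())
    # For a single word, first and last are the same token, so one
    # check covers both the word rule and the n-gram rule.
    return words[0] not in stop and words[-1] not in stop


if __name__ == "__main__":
    for candidate in ["the", "content", "analysis of content", "content of"]:
        print(candidate, keep_term(candidate, "en"))
```

Allowing stopwords inside an n-gram keeps natural phrases such as "analysis of content" while still dropping fragments such as "content of".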

Supported languages

French, English, German, Spanish, Italian, Portuguese

Input

The main input of the tool is the starting URL of the website to process.
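
Since links outside the starting domain are ignored, a crawler needs some test for whether a discovered link stays in scope. The sketch below shows one possible test using Python's standard urllib.parse. It assumes that "same domain" means an exact host match with the starting URL; how the Actor actually treats subdomains is not documented here, and same_site_links is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(start_url, page_url, hrefs):
    """Resolve hrefs found on page_url and keep those on the start host.

    Assumption: a link is in scope only if it uses http(s) and its host
    is identical to the host of start_url.
    """
    host = urlparse(start_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc.lower() == host:
            kept.append(absolute)
    return kept


if __name__ == "__main__":
    links = same_site_links(
        "https://example.com/",
        "https://example.com/blog/",
        ["post-1", "/about", "https://other.org/x", "mailto:me@example.com"],
    )
    print(links)
```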

Output

The analysis produces:

  • A dataset with the most frequent extracted terms, including keywords, words, and n-grams. Each term has a value, frequency, language, and type. The dataset can be found in Output and storage (terms.json)

Terms table view

  • The list of scraped pages is provided in JSON format. Each page is described by: url, title, description, author, date, keywords and language. The file can be found in storage (pages.json)
  • The emails and social media links are provided in JSON format. The file can be found in storage (contact.json)
  • A global file combining all the output (terms, contact and pages) can be found in storage (all.json)

Storage key Store contents

  • The extracted terms are also rendered as SVG wordcloud images. The images can be found in storage (wordcloud..svg)

Wordcloud representation
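
Once downloaded, the terms dataset can be post-processed with a few lines of Python. The sketch below assumes each record is a JSON object with the keys value, frequency, language and type, matching the fields named above; the exact key spelling and the type labels ("word", "ngram", "keyword") are assumptions, and the sample records are invented for illustration.

```python
from collections import defaultdict


def top_terms(terms, language, per_type=3):
    """Group terms of one language by type, most frequent first."""
    grouped = defaultdict(list)
    for term in terms:
        if term["language"] == language:
            grouped[term["type"]].append(term)
    return {
        kind: [
            t["value"]
            for t in sorted(items, key=lambda t: t["frequency"], reverse=True)
        ][:per_type]
        for kind, items in grouped.items()
    }


if __name__ == "__main__":
    # Invented sample records; real data would be loaded from terms.json,
    # e.g. with json.load(open("terms.json", encoding="utf-8")).
    sample = [
        {"value": "scraping", "frequency": 12, "language": "en", "type": "word"},
        {"value": "data", "frequency": 20, "language": "en", "type": "word"},
        {"value": "web scraping", "frequency": 7, "language": "en", "type": "ngram"},
        {"value": "données", "frequency": 9, "language": "fr", "type": "word"},
    ]
    print(top_terms(sample, "en"))
```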

Your feedback

If you’ve got any technical feedback, a bug to report, or a suggestion for improving the Actor, please create an issue on the Actor’s Issues tab.