
Web Page Analyzer
- apify/page-analyzer
Performs an analysis of a web page to figure out the best way to scrape its data. As input, it takes a URL and an array of strings to search for; as output, it returns a definition of a crawler.
The Web Page Analyzer analyzes the input website. It searches the content of the website for each keyword and outputs all the possible ways the keyword data can be scraped.
Intended users
This actor has been developed mainly for two categories of users:
- Analysts and non-developers who would like to get insight into how a web page can be scraped. The output of the analysis gives a good indication of how difficult the website is to scrape and how many resources will be needed.
- Developers building web scrapers. Data from the analysis can be used directly for scraping.
Website URL
URL of the website to be analyzed.
Keywords
Keywords are strings that the analyzer will search for on the given website.
Input
Input can be set using JSON or the visual input UI in the Apify Console.
{
    // URL of the website to be analyzed
    "url": "http://example.com",
    // array of strings to look for on the website; we will refer to these strings as "keywords"
    "keywords": [
        "About us",
        // numbers are also passed as strings
        "125"
    ]
}
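
Besides the visual UI, the actor can also be started programmatically. Below is a minimal sketch using the apify-client package; the token is a placeholder you need to replace with your own:

import { ApifyClient } from 'apify-client';

// Replace with your own API token from the Apify Console.
const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

// Start the actor with the input shown above and wait for the run to finish.
const run = await client.actor('apify/page-analyzer').call({
    url: 'http://example.com',
    keywords: ['About us', '125'],
});

console.log(`Run finished with status ${run.status}`);
console.log(`Results are in key-value store ${run.defaultKeyValueStoreId}`);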
How to use
- Add the analyzer actor to your Apify Console.
- Enter the URL of the website to be analyzed.
- Add keywords to be searched for. It is recommended to copy the text directly from the website.
- Run the actor.
- View the analysis results by opening the DASHBOARD.html file inside the key-value store.
Output
The output of this actor is saved in the Apify key-value store of the particular actor run.
The results of the analysis are saved in the OUTPUT.json file and can be viewed by opening the DASHBOARD.html file.
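
If you started the actor through the API as sketched in the Input section, the results can also be downloaded programmatically. A small sketch continuing that example; the record key 'OUTPUT' is an assumption here, so check the store listing if your run names it differently:

// `client` and `run` come from the earlier apify-client sketch.
const store = client.keyValueStore(run.defaultKeyValueStoreId);

// The record key is assumed; verify it against the files listed below.
const output = await store.getRecord('OUTPUT');
console.log(JSON.stringify(output?.value, null, 2));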
How to run the analyzer locally on your computer
- Download this repository.
- Run npm install
- Download the analyzer-ui repository: https://github.com/vahlunten/analyzer-ui
- Run npm install in the analyzer-ui repository.
- Run npm run dev in the analyzer-ui repository.
- Run npm run start in the analyzer-ts repository. It will use the ./apify_storage/key_value_stores/default/INPUT.json file as its input (a sketch for creating this file follows the list).
- Open the web app started by analyzer-ui.
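
The local input file has the same shape as the input described above. A small sketch that creates it at the expected path before you run npm run start; the values are just the example input from the Input section:

import { mkdirSync, writeFileSync } from 'node:fs';

// Path that npm run start reads its input from, per the step above.
const dir = './apify_storage/key_value_stores/default';
mkdirSync(dir, { recursive: true });

writeFileSync(`${dir}/INPUT.json`, JSON.stringify({
    url: 'http://example.com',
    keywords: ['About us', '125'],
}, null, 2));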
Files stored in key-value store
The actor also saves some additional files with further information, useful mainly for developers (a sketch for downloading them follows the list).
- OUTPUT.json: Most of the analysis results are stored in this file.
- DASHBOARD.html: Visual explorer of the analysis results.
- initial.html: Initial response retrieved by the Chromium browser.
- dom-content.html: HTML of the website rendered inside the Chromium browser.
- initial-cheerio.html: Initial response retrieved by the CheerioCrawler.
- INPUT: The actor's input.
- screenshot.jpeg: Screenshot of the loaded website.
- xhr.json: Additional details about XHR validation.
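
On the platform, these records can be listed and downloaded with apify-client. A sketch continuing the earlier examples; the exact record keys are assumed to match the file names above:

// `store` wraps the run's key-value store, as in the Output section sketch.
const { items } = await store.listKeys();
for (const { key } of items) {
    console.log(`Stored record: ${key}`);
}

// Download the screenshot as binary data; the key is an assumption.
const screenshot = await store.getRecord('screenshot.jpeg', { buffer: true });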
How the analyzer works
The goal of this actor is to find the optimal way to scrape a website. It tries to find a way to scrape the website without using a browser, in order to minimize the resources needed for scraping.
The analyzer uses the Playwright library to control a Chromium browser. It navigates to the website and scans all of its sources for the input keywords.
The analysis steps (a simplified sketch of the interception and replication steps follows the list):
- The analyzer opens the Chromium browser and navigates to the website.
- It searches the initial response of the website and the fully rendered DOM.
- XHR requests are intercepted and searched for the keywords.
- These search results are then validated against the initial response retrieved by the CheerioCrawler.
- XHR requests containing the keywords are then replicated using the got-scraping library with different sets of headers.
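
This is not the actor's actual code, just a minimal illustration of the technique: the URL and keywords reuse the example input, keyword matching is reduced to a plain substring search, and got-scraping is called with its default header generation only:

import { chromium } from 'playwright';
import { gotScraping } from 'got-scraping';

const keywords = ['About us', '125'];
const matched: { url: string }[] = [];

const browser = await chromium.launch();
const page = await browser.newPage();

// Watch XHR/fetch responses and search their bodies for the keywords.
page.on('response', async (response) => {
    const type = response.request().resourceType();
    if (type !== 'xhr' && type !== 'fetch') return;
    const body = await response.text().catch(() => '');
    if (keywords.some((kw) => body.includes(kw))) {
        matched.push({ url: response.url() });
    }
});

await page.goto('http://example.com');
await browser.close();

// Replay each matched request without a browser, using browser-like headers.
for (const { url } of matched) {
    const { body } = await gotScraping({ url });
    console.log(url, keywords.filter((kw) => body.includes(kw)));
}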
Planned features
- flexible formatting of numbers & strings (4,400 <=> 4400)
- testing different proxy configurations (datacenter -> residential)
- generation of scraper/crawler code
- custom clicks