JSDOM Scraper

apify/jsdom-scraper
Parses HTML using the JSDOM library, which provides the same DOM API as browsers do (e.g. `window`). It can process client-side JavaScript without running a real browser. Performance-wise, it sits between Cheerio Scraper and the full browser scrapers.

The code example below shows how to run the Actor and get its results. To run the code, you need an Apify account. Replace <YOUR_API_TOKEN> in the code with your API token, which you can find under Settings > Integrations in Apify Console.

from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://crawlee.dev" }],
    "globs": [{ "glob": "https://crawlee.dev/*/*" }],
    "pseudoUrls": [],
    "excludes": [{ "glob": "/**/*.{png,jpg,jpeg,pdf}" }],
    "linkSelector": "a[href]",
    "pageFunction": """async function pageFunction(context) {
    const { window, request, log } = context;

    // The "window" property contains the JSDOM object, which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = window.document.title;

    // The "request" property contains various information about the loaded web page.
    const url = request.url;

    // Use the "log" object to print information to the Actor log.
    log.info('Page scraped', { url, pageTitle });

    // Return an object with the data extracted from the page.
    // It will be stored in the resulting dataset.
    return {
        url,
        pageTitle
    };
}""",
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "additionalMimeTypes": [],
    "preNavigationHooks": """// We need to return an array of (possibly async) functions here.
// The functions accept two arguments: the "crawlingContext" object
// and "requestAsBrowserOptions", which are passed to the `requestAsBrowser()`
// function the crawler calls to navigate.
[
    async (crawlingContext, requestAsBrowserOptions) => {
        // ...
    }
]""",
    "postNavigationHooks": """// We need to return an array of (possibly async) functions here.
// The functions accept a single argument: the "crawlingContext" object.
[
    async (crawlingContext) => {
        // ...
    },
]""",
    "customData": {},
}

# Run the Actor and wait for it to finish
run = client.actor("apify/jsdom-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start
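The `globs` and `excludes` inputs use minimatch-style patterns, where `*` matches within a single path segment and `**` also crosses segment boundaries. As a rough illustration of that matching behavior, here is a simplified sketch in plain Python — the actual matching happens inside the Actor, and this sketch deliberately omits features such as brace sets like `{png,jpg}`:

```python
import re

def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Hypothetical helper: compile a simplified minimatch-style glob.

    "*" matches any characters except "/", while "**" matches across
    "/" as well. Brace sets and other minimatch features are omitted.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob[i : i + 2] == "**":
            parts.append(".*")   # "**" crosses path segments
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")  # "*" stays within one segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))  # literal character
            i += 1
    return re.compile("^" + "".join(parts) + "$")

include = glob_to_regex("https://crawlee.dev/*/*")
print(bool(include.match("https://crawlee.dev/docs/quick-start")))  # True
print(bool(include.match("https://crawlee.dev/docs")))              # False
```

So with the input above, a discovered link like `https://crawlee.dev/docs/quick-start` would be enqueued, while the bare `https://crawlee.dev/docs` would not, because `*` does not span multiple path segments.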
Developer
Maintained by Apify
Actor metrics
  • 5 monthly users
  • 4 stars
  • 33.3% runs succeeded
  • Created in Dec 2022
  • Modified 3 months ago