JSDOM Scraper
apify/jsdom-scraper
Parses HTML using the JSDOM library, which provides the same DOM API as browsers (e.g. `window`). It can process client-side JavaScript without running a real browser. In terms of performance, it sits between Cheerio Scraper and the full browser-based scrapers.
The code examples below show how to run the Actor and get its results. To run the code, you need an Apify account. Replace <YOUR_API_TOKEN> in the code with your API token, which you can find under Settings > Integrations in Apify Console.
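Rather than hardcoding the token into the script, you may prefer to read it from an environment variable so it never ends up in source control. A minimal sketch (the `APIFY_TOKEN` variable name is an assumption; use whatever name you export):

```python
import os

# Read the API token from the environment, falling back to the placeholder
# ("APIFY_TOKEN" is an assumed variable name, not required by the client).
token = os.environ.get("APIFY_TOKEN", "<YOUR_API_TOKEN>")

# You would then pass it to the client instead of a literal string:
# client = ApifyClient(token)
```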
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://crawlee.dev" }],
    "globs": [{ "glob": "https://crawlee.dev/*/*" }],
    "pseudoUrls": [],
    "excludes": [{ "glob": "/**/*.{png,jpg,jpeg,pdf}" }],
    "linkSelector": "a[href]",
    "pageFunction": """async function pageFunction(context) {
    const { window, request, log } = context;

    // The "window" property contains the JSDOM object, which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = window.document.title;

    // The "request" property contains various information about the loaded web page.
    const url = request.url;

    // Use the "log" object to print information to the Actor log.
    log.info('Page scraped', { url, pageTitle });

    // Return an object with the data extracted from the page.
    // It will be stored in the resulting dataset.
    return {
        url,
        pageTitle
    };
}""",
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "additionalMimeTypes": [],
    "preNavigationHooks": """// We need to return an array of (possibly async) functions here.
// The functions accept two arguments: the "crawlingContext" object
// and "requestAsBrowserOptions", which are passed to the `requestAsBrowser()`
// function the crawler calls to navigate.
[
    async (crawlingContext, requestAsBrowserOptions) => {
        // ...
    }
]""",
    "postNavigationHooks": """// We need to return an array of (possibly async) functions here.
// The functions accept a single argument: the "crawlingContext" object.
[
    async (crawlingContext) => {
        // ...
    },
]""",
    "customData": {},
}

# Run the Actor and wait for it to finish
run = client.actor("apify/jsdom-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start
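If you want to keep the scraped items locally instead of only printing them, you can collect them into a list and write them to a JSON file. A minimal sketch using only the standard library (the sample items and the `results.json` filename are illustrative; the item shape mirrors what the pageFunction above returns):

```python
import json

def save_items(items, path="results.json"):
    """Write dataset items to a JSON file and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    return path

# Illustrative items in the shape returned by the pageFunction above.
# In the script itself you would collect them with:
# items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
items = [
    {"url": "https://crawlee.dev", "pageTitle": "Example page title"},
]
save_items(items)
```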
Developer
Maintained by Apify
Actor metrics
- 5 monthly users
- 4 stars
- 33.3% runs succeeded
- Created in Dec 2022
- Modified 3 months ago
Categories