Web Scraper Task

  • undrtkr984/web-scraper-task
  • Users: 22
  • Runs: 347
  • Created by Matt

To run the code example below, you need an Apify account. Replace <YOUR_API_TOKEN> in the code with your API token. For a more detailed explanation, read about running Actors via the API in the Apify Docs.
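Since "startUrls" accepts a list of { "url": ... } objects, the Actor input can be built for many pages at once. A minimal sketch of such a helper (the function name make_run_input is illustrative, not part of the Apify client):

```python
def make_run_input(urls, link_selector="a[href]"):
    """Build a minimal Web Scraper input dict for the given start URLs."""
    return {
        "runMode": "DEVELOPMENT",
        # Each start URL becomes its own { "url": ... } object.
        "startUrls": [{"url": u} for u in urls],
        "linkSelector": link_selector,
        "globs": [{"glob": "*/*"}],
        "pseudoUrls": [],
    }

run_input = make_run_input(["https://example.com", "https://example.org"])
```

The resulting dict can then be extended with the remaining fields shown in the full example below before being passed to the Actor.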

from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "runMode": "DEVELOPMENT",
    "startUrls": [{ "url": "" }],
    "linkSelector": "a[href]",
    "globs": [{ "glob": "*/*" }],
    "pseudoUrls": [],
    "pageFunction": """// The function accepts a single argument: the \"context\" object.
// For a complete list of its properties and functions,
// see 
async function pageFunction(context) {
    // This statement works as a breakpoint when you're trying to debug your code. Works only with Run mode: DEVELOPMENT!
    // debugger; 

    // jQuery is handy for finding DOM elements and extracting data from them.
    // To use it, make sure to enable the \"Inject jQuery\" option.
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();
    const h1 = $('h1').first().text();
    const first_h2 = $('h2').first().text();
    const random_text_from_the_page = $('p').first().text();

    // Print some information to the Actor log.
    context.log.info(`URL: ${context.request.url}, TITLE: ${pageTitle}`);

    // Manually add a new page to the queue for scraping.
    await context.enqueueRequest({ url: '' });

    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return {
        url: context.request.url,
        pageTitle,
        h1,
        first_h2,
        random_text_from_the_page,
    };
}""",
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "waitUntil": ["networkidle2"],
    "preNavigationHooks": """// We need to return an array of (possibly async) functions here.
// The functions accept two arguments: the \"crawlingContext\" object
// and \"gotoOptions\".
[
    async (crawlingContext, gotoOptions) => {
        // ...
    },
]""",
    "postNavigationHooks": """// We need to return an array of (possibly async) functions here.
// The functions accept a single argument: the \"crawlingContext\" object.
[
    async (crawlingContext) => {
        // ...
    },
]""",
    "breakpointLocation": "NONE",
    "customData": {},
}

# Run the Actor and wait for it to finish
run = client.actor("undrtkr984/web-scraper-task").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
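The items yielded by iterate_items() are plain Python dicts shaped like the object returned from pageFunction, so they can be post-processed with the standard library. A sketch that writes such items to CSV (using sample data in place of a live dataset, since a real run would yield these from client.dataset(...).iterate_items()):

```python
import csv
import io

# Sample items shaped like pageFunction's return value; in practice they
# would come from client.dataset(run["defaultDatasetId"]).iterate_items().
items = [
    {"url": "https://example.com", "pageTitle": "Example Domain"},
    {"url": "https://example.org", "pageTitle": "Example Domain"},
]

# Serialize the dicts to CSV in memory; swap io.StringIO for open(...) to
# write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "pageTitle"])
writer.writeheader()
writer.writerows(items)
print(buffer.getvalue())
```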