Selenium with wait

decorative_quorum/selenium-with-wait

Runs a simple Selenium-based scrape of a site, but waits a given amount of time for the browser to load the page.

.actor/Dockerfile

# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python-selenium:3.11

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "my-actor-2",
    "title": "Getting started with Python and Selenium",
    "description": "Scrapes titles of websites using Selenium.",
    "version": "0.0",
    "meta": {
        "templateId": "python-selenium"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "title": "URLs and their titles",
            "views": {
                "titles": {
                    "title": "URLs and their titles",
                    "transformation": {
                        "fields": [
                            "url",
                            "title"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "url": {
                                "label": "URL",
                                "format": "text"
                            },
                            "title": {
                                "label": "Title",
                                "format": "text"
                            }
                        }
                    }
                }
            }
        }
    }
}

.actor/input_schema.json

{
    "title": "Python Selenium Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources"
        },
        "max_depth": {
            "title": "Maximum depth",
            "type": "integer",
            "description": "Depth to which to scrape",
            "default": 1
        }
    },
    "required": ["start_urls"]
}
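
The schema above declares `start_urls` and `max_depth`, while `src/main.py` also reads an optional `wait_time_in_seconds` value that the schema does not declare. A minimal sketch of how that value could be normalized before use, assuming a hypothetical helper named `resolve_wait_time` (it is not part of the Actor or of the Apify SDK) and the same 5-second fallback that `src/main.py` uses:

```python
DEFAULT_WAIT_SECONDS = 5  # same fallback value that src/main.py uses


def resolve_wait_time(actor_input: dict, default: float = DEFAULT_WAIT_SECONDS) -> float:
    """Return a non-negative wait time in seconds (hypothetical helper).

    Ints and floats are accepted. Booleans (a subclass of int in Python),
    other types, and negative numbers fall back to `default`.
    """
    value = actor_input.get('wait_time_in_seconds', default)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return default
    if value < 0:
        return default
    return value


if __name__ == '__main__':
    print(resolve_wait_time({}))                               # 5
    print(resolve_wait_time({'wait_time_in_seconds': 2.5}))    # 2.5
    print(resolve_wait_time({'wait_time_in_seconds': 'ten'}))  # 5
```

Unlike the `isinstance(wait_time, int)` check in `src/main.py`, this sketch also accepts fractional seconds; whether that is desirable depends on how the input is meant to be used.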

src/__main__.py

"""
This module serves as the entry point for executing the Apify Actor. It handles the configuration of logging
settings. The `main()` coroutine is then executed using `asyncio.run()`.

Feel free to modify this file to suit your specific needs.
"""

import asyncio
import logging

from apify.log import ActorLogFormatter

from .main import main

# Configure loggers
handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.INFO)
apify_client_logger.addHandler(handler)

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)

# Execute the Actor main coroutine
asyncio.run(main())

src/main.py

"""
This module defines the `main()` coroutine for the Apify Actor, executed from the `__main__.py` file.

Feel free to modify this file to suit your specific needs.

To build Apify Actors, utilize the Apify SDK toolkit, read more at the official documentation:
https://docs.apify.com/sdk/python
"""

from urllib.parse import urljoin
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.by import By

from apify import Actor

# To run this Actor locally, you need to have the Selenium Chromedriver installed.
# https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/
# When running on the Apify platform, it is already included in the Actor's Docker image.


async def main() -> None:
    """
    The main coroutine is being executed using `asyncio.run()`, so do not attempt to make a normal function
    out of it, it will not work. Asynchronous execution is required for communication with Apify platform,
    and it also enhances performance in the field of web scraping significantly.
    """

    async with Actor:
        # Read the Actor input
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [
            {'url': 'https://knowledge.alteryx.com/index/s/article/How-to-restart-Alteryx-Service-remotely-when-RDP-is-not-available'}
            ])
        # Note: max_depth is read here but not used further in this file
        max_depth = actor_input.get('max_depth', 1)
        wait_time = actor_input.get('wait_time_in_seconds', 5)
        if not isinstance(wait_time, int):
            Actor.log.info('Wait time is not an integer. Using the default of 5 seconds...')
            wait_time = 5

        if not start_urls:
            Actor.log.info('No start URLs specified in actor input, exiting...')
            await Actor.exit()

        # Enqueue the starting URLs in the default request queue
        default_queue = await Actor.open_request_queue()
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            await default_queue.add_request({'url': url, 'userData': {'depth': 0}})

        # Launch a new Selenium Chrome WebDriver
        Actor.log.info('Launching Chrome WebDriver...')
        chrome_options = ChromeOptions()
        if Actor.config.headless:
            chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        driver = webdriver.Chrome(options=chrome_options)

        # TODO: This sanity check of the WebDriver can probably be removed.
        driver.get('http://www.example.com')
        assert driver.title == 'Example Domain'

        # Process the requests in the queue one by one
        while request := await default_queue.fetch_next_request():
            url = request['url']
            depth = request['userData']['depth']
            Actor.log.info(f'Scraping {url} ...')

            try:
                # Open the URL in the Selenium WebDriver
                driver.get(url)
                # Give the page a fixed amount of time to finish rendering
                Actor.log.info(f'Sleeping for {wait_time} seconds ...')
                sleep(wait_time)

                # Push the title and the HTML source of the page into the default dataset
                title = driver.title
                html_content = driver.page_source
                Actor.log.info('Got title and HTML content')
                await Actor.push_data({'url': url, 'title': title, 'text': html_content})
            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')
            finally:
                await default_queue.mark_request_as_handled(request)

        driver.quit()
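
The loop above pauses for a fixed `wait_time` after every `driver.get(url)`, even when the page is ready sooner. One alternative is to poll a condition and stop as soon as it holds, treating the wait time only as an upper bound. A minimal sketch, assuming a hypothetical helper named `wait_until` (it is not part of the Actor or of Selenium); the CSS selector `'article'` in the comment is likewise only an illustrative assumption:

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float, poll_interval: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition held before the deadline, otherwise False.
    The condition is always checked at least once, even with timeout=0.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Possible use inside the scraping loop (assumption, needs a live WebDriver
# and a selector that matches the content you are waiting for):
#     found = wait_until(
#         lambda: len(driver.find_elements(By.CSS_SELECTOR, 'article')) > 0,
#         timeout=wait_time,
#     )

if __name__ == '__main__':
    calls = {'count': 0}

    def ready_on_third_call() -> bool:
        calls['count'] += 1
        return calls['count'] >= 3

    print(wait_until(ready_on_third_call, timeout=2, poll_interval=0.01))
    print(wait_until(lambda: False, timeout=0.05, poll_interval=0.01))
```

Selenium also ships its own explicit-wait helper, `WebDriverWait`, which may be a better fit than a hand-rolled loop; the sketch only illustrates the idea with the standard library.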

.dockerignore

# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
.venv

# git folder
.git

.editorconfig

root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf

.gitignore

# This file tells Git which files shouldn't be added to source control

.idea
.DS_Store

apify_storage
storage

.venv/
.env/
__pypackages__
dist/
build/
*.egg-info/
*.egg

__pycache__

.mypy_cache
.dmypy.json
dmypy.json
.pytest_cache
.ruff_cache

.scrapy
*.log

requirements.txt

# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify ~= 1.7.0
selenium ~= 4.14.0

Developer

José Eduardo Piña Castro

Actor metrics

2 monthly users
1 star
100.0% runs succeeded
Created in May 2024
Modified 4 months ago

Categories

Other
