Apify Project 01

Pricing

Pay per usage

Developer

Carlos Sanchez

Maintained by Community

Last modified

a year ago

.dockerignore

# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
.venv

# git folder
.git

.editorconfig

root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf

.gitignore

# This file tells Git which files shouldn't be added to source control

.idea
.DS_Store

apify_storage
storage

.venv/
.env/
__pypackages__
dist/
build/
*.egg-info/
*.egg

__pycache__

.mypy_cache
.dmypy.json
dmypy.json
.pytest_cache
.ruff_cache

.scrapy
*.log

# Added by Apify CLI
node_modules
.venv

requirements.txt

# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify ~= 1.7.0
beautifulsoup4 ~= 4.12.2
httpx ~= 0.25.2
types-beautifulsoup4 ~= 4.12.0.7

.actor/Dockerfile

# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.11

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]

.actor/actor.json

{
	"actorSpecification": 1,
	"name": "apify-project-01",
	"title": "Getting started with Python and BeautifulSoup",
	"description": "Scrapes titles of websites using BeautifulSoup.",
	"version": "0.0",
	"meta": {
		"templateId": "python-beautifulsoup"
	},
	"input": "./input_schema.json",
	"dockerfile": "./Dockerfile",
	"storages": {
		"dataset": {
			"actorSpecification": 1,
			"title": "URLs and their titles",
			"views": {
				"titles": {
					"title": "URLs and their titles",
					"transformation": {
						"fields": [
							"url",
							"title"
						]
					},
					"display": {
						"component": "table",
						"properties": {
							"url": {
								"label": "URL",
								"format": "text"
							},
							"title": {
								"label": "Title",
								"format": "text"
							}
						}
					}
				}
			}
		}
	}
}

.actor/input_schema.json

{
    "title": "Python BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources"
        },
        "max_depth": {
            "title": "Maximum depth",
            "type": "integer",
            "description": "Maximum link depth to follow from the start URLs",
            "default": 1
        }
    },
    "required": ["start_urls"]
}
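
The schema above declares `start_urls` as a list of objects carrying a `url` key and gives `max_depth` a default of 1. As a minimal sketch of how input of that shape could be normalized before crawling, the helper below flattens it into plain values. The name `normalize_input` is hypothetical and is not part of the Apify SDK.

```python
from __future__ import annotations

from typing import Any


def normalize_input(actor_input: dict[str, Any] | None) -> tuple[list[str], int]:
    """Return (urls, max_depth) from a raw Actor input dict.

    Hypothetical helper for illustration only; not part of the Apify SDK.
    """
    actor_input = actor_input or {}
    # `start_urls` is a list of objects such as {"url": "https://apify.com"}.
    raw_urls = actor_input.get('start_urls', [])
    # Keep only entries that are dicts with a non-empty `url` value.
    urls = [item['url'] for item in raw_urls if isinstance(item, dict) and item.get('url')]
    # Fall back to the schema default of 1 when `max_depth` is absent.
    max_depth = int(actor_input.get('max_depth', 1))
    return urls, max_depth
```

Entries without a `url` key are dropped here rather than raising an error. That is a design choice of this sketch, not something the schema requires.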

src/__main__.py

"""
This module serves as the entry point for executing the Apify Actor. It handles the configuration of logging
settings. The `main()` coroutine is then executed using `asyncio.run()`.

Feel free to modify this file to suit your specific needs.
"""

import asyncio
import logging

from apify.log import ActorLogFormatter

from .main import main

# Configure loggers
handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.INFO)
apify_client_logger.addHandler(handler)

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)

# Execute the Actor main coroutine
asyncio.run(main())

src/main.py

"""
This module defines the `main()` coroutine for the Apify Actor, executed from the `__main__.py` file.

Feel free to modify this file to suit your specific needs.

To build Apify Actors, use the Apify SDK. Read more in the official documentation:
https://docs.apify.com/sdk/python
"""

from urllib.parse import urljoin

from bs4 import BeautifulSoup
from httpx import AsyncClient

from apify import Actor


async def main() -> None:
    """
    The main coroutine is executed using `asyncio.run()`, so it must remain a coroutine; converting it
    to a regular function will not work. Asynchronous execution is required for communication with
    the Apify platform, and it also improves web scraping performance significantly.
    """
    async with Actor:
        # Read the Actor input
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        max_depth = actor_input.get('max_depth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Enqueue the starting URLs in the default request queue
        default_queue = await Actor.open_request_queue()
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            await default_queue.add_request({'url': url, 'userData': {'depth': 0}})

        # Process the requests in the queue one by one
        while request := await default_queue.fetch_next_request():
            url = request['url']
            depth = request['userData']['depth']
            Actor.log.info(f'Scraping {url} ...')

            try:
                # Fetch the URL using `httpx`
                async with AsyncClient() as client:
                    response = await client.get(url, follow_redirects=True)

                # Parse the response using `BeautifulSoup`
                soup = BeautifulSoup(response.content, 'html.parser')

                # If we haven't reached the max depth,
                # look for nested links and enqueue their targets
                if depth < max_depth:
                    for link in soup.find_all('a'):
                        link_href = link.get('href')
                        # Skip anchors without an `href`; `urljoin` would resolve them to the page itself
                        if not link_href:
                            continue
                        link_url = urljoin(url, link_href)
                        if link_url.startswith(('http://', 'https://')):
                            Actor.log.info(f'Enqueuing {link_url} ...')
                            await default_queue.add_request({
                                'url': link_url,
                                'userData': {'depth': depth + 1},
                            })

                # Push the title of the page into the default dataset
                title = soup.title.string if soup.title else None
                await Actor.push_data({'url': url, 'title': title})
            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')
            finally:
                # Mark the request as handled so it's not processed again
                await default_queue.mark_request_as_handled(request)
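
The crawl loop above resolves each `href` against the page URL with `urljoin` and enqueues only `http://` and `https://` targets. The sketch below isolates that step using only the standard library so it can be tried without network access. The function name `resolve_links` is hypothetical, and the in-function de-duplication is an assumption of this sketch; the Actor itself passes every link to the request queue.

```python
from __future__ import annotations

from urllib.parse import urljoin


def resolve_links(base_url: str, hrefs: list[str | None]) -> list[str]:
    """Resolve raw `href` values against `base_url`, keeping only HTTP(S) links.

    Hypothetical helper for illustration only; not part of the Actor source.
    """
    resolved: list[str] = []
    for href in hrefs:
        # Anchors without an `href` attribute yield None; skip them.
        if not href:
            continue
        absolute = urljoin(base_url, href)
        # Drop non-web schemes such as `mailto:` and skip repeats.
        if absolute.startswith(('http://', 'https://')) and absolute not in resolved:
            resolved.append(absolute)
    return resolved


links = resolve_links(
    'https://apify.com/store',
    ['/pricing', None, 'mailto:hi@example.com', 'https://example.com/a', '/pricing'],
)
print(links)
```

Relative paths are resolved against the page URL, while the `mailto:` link and the missing `href` are filtered out before anything would be enqueued.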
