BCB Correspondentes
Under maintenance
Pricing: Pay per usage
Total users: 1
Monthly users: 1
Runs succeeded: >99%
Last modified: 4 months ago
.actor/Dockerfile
# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.12

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]
.actor/actor.json
{
    "actorSpecification": 1,
    "name": "bcb-correspondentes",
    "title": "Getting started with Python and Scrapy",
    "description": "Scrapes titles of websites using Scrapy.",
    "version": "0.0",
    "buildTag": "latest",
    "meta": {
        "templateId": "python-scrapy"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
.actor/input_schema.json
{
    "title": "Python Scrapy Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [{ "url": "https://apify.com" }],
            "editor": "requestListSources"
        },
        "proxyConfiguration": {
            "sectionCaption": "Proxy and HTTP configuration",
            "title": "Proxy configuration",
            "type": "object",
            "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.",
            "editor": "proxy",
            "prefill": { "useApifyProxy": true },
            "default": { "useApifyProxy": true }
        }
    },
    "required": ["startUrls"]
}
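For reference, a run that targets the BCB endpoint used by the spider (rather than the prefilled https://apify.com) could be given an input like the sketch below. This is only an illustrative example of an object matching the schema above, not a file from the project.

{
    "startUrls": [
        { "url": "https://olinda.bcb.gov.br/olinda/servico/Informes_Correspondentes/versao/v1/odata/Correspondentes?$format=json" }
    ],
    "proxyConfiguration": { "useApifyProxy": true }
}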
src/spiders/__init__.py
"""
Scrapy spiders package

This package contains the spiders for your Scrapy project. Spiders are the classes that define how to scrape
and process data from websites.

For detailed information on creating and utilizing spiders, refer to the official documentation:
https://docs.scrapy.org/en/latest/topics/spiders.html
"""
src/spiders/title.py
from __future__ import annotations

import json
from typing import Generator

from scrapy import Spider
from scrapy.http import Response


class TitleSpider(Spider):
    """
    Scrapes data from the Banco Central API and extracts JSON data.
    """

    name = 'title_spider'
    start_urls = [
        'https://olinda.bcb.gov.br/olinda/servico/Informes_Correspondentes/versao/v1/odata/Correspondentes?$format=json'
    ]

    def parse(self, response: Response) -> Generator[dict, None, None]:
        """
        Parse the API JSON response.

        Args:
            response: The JSON response from the API.

        Yields:
            Extracted data as dictionaries.
        """
        self.logger.info('Parsing API response from %s...', response.url)

        try:
            # Convert the response body into a Python dictionary
            data = json.loads(response.text)

            # Check whether the 'value' key is present
            records = data.get('value', [])
            if not records:
                self.logger.warning('No data found under the "value" key. Response: %s', response.text[:500])
                return

            # Iterate over the records and extract the desired fields
            for record in records:
                yield {
                    'CnpjContratante': record.get('CnpjContratante'),
                    'NomeContratante': record.get('NomeContratante'),
                    'CnpjCorrespondente': record.get('CnpjCorrespondente'),
                    'NomeCorrespondente': record.get('NomeCorrespondente'),
                    'Tipo': record.get('Tipo'),
                    'Ordem': record.get('Ordem'),
                    'MunicipioIBGE': record.get('MunicipioIBGE'),
                    'Municipio': record.get('Municipio'),
                    'UF': record.get('UF'),
                    'ServicosCorrespondentes': record.get('ServicosCorrespondentes'),
                    'Posicao': record.get('Posicao'),
                }
        except json.JSONDecodeError:
            self.logger.error('Failed to decode JSON. Response: %s', response.text[:500])
        except Exception as e:
            self.logger.error('Unexpected error: %s', str(e))
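For context, the Olinda OData endpoint returns its records under a top-level "value" key, which is why parse() reads data.get('value', []). The sketch below shows the rough response shape the spider expects; the field values are invented placeholders, not real records.

{
    "value": [
        {
            "CnpjContratante": "00000000000000",
            "NomeContratante": "Banco Exemplo S.A.",
            "CnpjCorrespondente": "11111111111111",
            "NomeCorrespondente": "Correspondente Exemplo Ltda.",
            "Tipo": "...",
            "Ordem": 1,
            "MunicipioIBGE": "3550308",
            "Municipio": "SAO PAULO",
            "UF": "SP",
            "ServicosCorrespondentes": "...",
            "Posicao": "2024-12-31"
        }
    ]
}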
src/__main__.py
"""Apify Actor integration for Scrapy projects.

This module transforms a Scrapy project into an Apify Actor, handling the configuration of logging, patching Scrapy's
logging system, and establishing the required environment to run the Scrapy spider within the Apify platform.

This file is specifically designed to be executed when the project is run as an Apify Actor using `apify run` locally
or being run on the Apify platform. It is not being executed when running the project as a Scrapy project using
`scrapy crawl title_spider`.

We recommend you do not modify this file unless you really know what you are doing.
"""

# We need to configure the logging first before we import anything else, so that nothing else imports
# `scrapy.utils.log` before we patch it.
from __future__ import annotations

from logging import StreamHandler, getLogger
from typing import Any

from scrapy.utils import log as scrapy_logging
from scrapy.utils.project import get_project_settings

from apify.log import ActorLogFormatter

# Define names of the loggers.
MAIN_LOGGER_NAMES = ['apify', 'apify_client', 'scrapy']
OTHER_LOGGER_NAMES = ['filelock', 'hpack', 'httpcore', 'httpx', 'protego', 'twisted']
ALL_LOGGER_NAMES = MAIN_LOGGER_NAMES + OTHER_LOGGER_NAMES

# To change the logging level, modify the `LOG_LEVEL` field in `settings.py`. If the field is not present in the file,
# Scrapy will default to `DEBUG`. This setting applies to all loggers. If you wish to change the logging level for
# a specific logger, do it in this file.
settings = get_project_settings()
LOGGING_LEVEL = settings['LOG_LEVEL']

# Define a logging handler which will be used for the loggers.
apify_handler = StreamHandler()
apify_handler.setFormatter(ActorLogFormatter(include_logger_name=True))


def configure_logger(logger_name: str | None, log_level: str, *handlers: StreamHandler) -> None:
    """Configure a logger with the specified settings.

    Args:
        logger_name: The name of the logger to be configured.
        log_level: The desired logging level ('DEBUG', 'INFO', 'WARNING', 'ERROR', ...).
        handlers: Optional list of logging handlers.
    """
    logger = getLogger(logger_name)
    logger.setLevel(log_level)
    logger.handlers = []

    for handler in handlers:
        logger.addHandler(handler)


# Apify loggers have to be set up here and in the `new_configure_logging` as well to be able to use them both from
# the `main.py` and Scrapy components.
for logger_name in MAIN_LOGGER_NAMES:
    configure_logger(logger_name, LOGGING_LEVEL, apify_handler)

# We can't attach our log handler to the loggers normally, because Scrapy would remove them in the `configure_logging`
# call here: https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/utils/log.py#L113 (even though
# `disable_existing_loggers` is set to False :facepalm:). We need to monkeypatch Scrapy's `configure_logging` method
# like this, so that our handler is attached right after Scrapy calls the `configure_logging` method, because
# otherwise we would lose some log messages.
old_configure_logging = scrapy_logging.configure_logging


def new_configure_logging(*args: Any, **kwargs: Any) -> None:
    """Configure logging for Scrapy and root loggers to ensure consistent logging behavior.

    We need to manually configure both the root logger and all Scrapy-associated loggers. Configuring only the root
    logger is not sufficient, as Scrapy will override it with its own settings. Scrapy uses these four primary
    loggers - https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/utils/log.py#L60:L77. Therefore, we configure here
    these four loggers and the root logger.
    """
    old_configure_logging(*args, **kwargs)

    # We modify the root (None) logger to ensure proper display of logs from spiders when using the `self.logger`
    # property within spiders. See details in the Spider logger property:
    # https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/spiders/__init__.py#L43:L46.
    configure_logger(None, LOGGING_LEVEL, apify_handler)

    # We modify other loggers only by setting up their log level. A custom log handler is added
    # only to the root logger to avoid duplicate log messages.
    for logger_name in ALL_LOGGER_NAMES:
        configure_logger(logger_name, LOGGING_LEVEL)

    # Set the HTTPX logger explicitly to the WARNING level, because it is too verbose and spams the logs with useless
    # messages, especially when running on the platform.
    configure_logger('httpx', 'WARNING')


scrapy_logging.configure_logging = new_configure_logging

# Now we can do the rest of the setup.
import asyncio
import os

import nest_asyncio
from scrapy.utils.reactor import install_reactor

from .main import main

# For compatibility between Twisted (used by Scrapy) and AsyncIO (used by Apify) asynchronous libraries, it is
# necessary to set the Twisted reactor to `AsyncioSelectorReactor`. This setup allows the two asynchronous libraries
# to work together.
#
# Note: The reactor must be installed before applying `nest_asyncio.apply()`, otherwise, it will not work correctly
# on Windows.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
nest_asyncio.apply()

# Specify the path to the Scrapy project settings module.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'

# Run the Apify main coroutine in the event loop.
asyncio.run(main())
src/items.py
"""Scrapy item models module.

This module defines Scrapy item models for scraped data. Items represent structured data
extracted by spiders.

For detailed information on creating and utilizing items, refer to the official documentation:
https://docs.scrapy.org/en/latest/topics/items.html
"""

from scrapy import Field, Item


class TitleItem(Item):
    """
    Represents a title item scraped from a web page.
    """

    url = Field()
    title = Field()
src/main.py
"""This module defines the main entry point for the Apify Actor.

This module defines the main coroutine for the Apify Scrapy Actor, executed from the __main__.py file. The coroutine
processes the Actor's input and executes the Scrapy spider. Additionally, it updates Scrapy project settings by
applying Apify-related settings, which include adding a custom scheduler, retry middleware, and an item pipeline
for pushing data to the Apify dataset.

Customization:
--------------

Feel free to customize this file to add specific functionality to the Actor, such as incorporating your own Scrapy
components like spiders and handling Actor input. However, make sure you have a clear understanding of your
modifications. For instance, removing `apply_apify_settings` will break the integration between Scrapy and Apify.

Documentation:
--------------

For an in-depth description of the Apify-Scrapy integration process, our Scrapy components, known limitations and
other stuff, please refer to the following documentation page: https://docs.apify.com/cli/docs/integrating-scrapy.
"""

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

# Import your Scrapy spider here.
from .spiders.title import TitleSpider as Spider

# Default input values for local execution using `apify run`.
LOCAL_DEFAULT_START_URLS = [
    {'url': 'https://olinda.bcb.gov.br/olinda/servico/Informes_Correspondentes/versao/v1/odata/Correspondentes?$format=json'}
]


async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Retrieve and process Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', LOCAL_DEFAULT_START_URLS)
        proxy_config = actor_input.get('proxyConfiguration')

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs.
        for start_url in start_urls:
            url = start_url.get('url')
            await request_queue.add_request(url)

        # Apply Apify settings, which will override the Scrapy project settings.
        settings = apply_apify_settings(proxy_config=proxy_config)

        # Execute the spider using Scrapy's `CrawlerProcess`.
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
src/middlewares.py
"""Scrapy middlewares module.

This module defines Scrapy middlewares. Middlewares are processing components that handle requests and
responses, typically used for adding custom headers, retrying requests, and handling exceptions.

There are 2 types of middlewares: spider middlewares and downloader middlewares. For detailed information
on creating and utilizing them, refer to the official documentation:
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
https://docs.scrapy.org/en/latest/topics/spider-middleware.html
"""

from __future__ import annotations

from typing import Generator, Iterable

from scrapy import Request, Spider, signals
from scrapy.crawler import Crawler
from scrapy.http import Response

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class TitleSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> TitleSpiderMiddleware:
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response: Response, spider: Spider) -> None:
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(
        self,
        response: Response,
        result: Iterable,
        spider: Spider,
    ) -> Generator[Iterable[Request] | None, None, None]:
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(
        self,
        response: Response,
        exception: BaseException,
        spider: Spider,
    ) -> Iterable[Request] | None:
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(
        self, start_requests: Iterable[Request], spider: Spider
    ) -> Iterable[Request]:
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider: Spider) -> None:
        pass


class TitleDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> TitleDownloaderMiddleware:
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request: Request, spider: Spider) -> Request | Response | None:
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request: Request, response: Response, spider: Spider) -> Request | Response:
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request: Request, exception: BaseException, spider: Spider) -> Response | None:
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider: Spider) -> None:
        pass
src/pipelines.py
"""Scrapy item pipelines module.

This module defines Scrapy item pipelines for scraped data. Item pipelines are processing components
that handle the scraped items, typically used for cleaning, validating, and persisting data.

For detailed information on creating and utilizing item pipelines, refer to the official documentation:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
"""

from scrapy import Spider

from .items import TitleItem


class TitleItemPipeline:
    """
    This item pipeline defines processing steps for TitleItem objects scraped by spiders.
    """

    def process_item(self, item: TitleItem, spider: Spider) -> TitleItem:
        # Do something with the item here, such as cleaning it or persisting it to a database
        return item
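The template pipeline above is a pass-through. Because the spider yields plain dictionaries containing CNPJ fields rather than TitleItem instances, one possible extension, shown here only as a hypothetical sketch and not registered in settings.py, would be to normalize those fields before they are pushed to the dataset:

import re

from scrapy import Spider


class CorrespondentesCleaningPipeline:
    """Illustrative pipeline that strips non-digit characters from CNPJ fields.

    Hypothetical example only; it is not part of the project and not enabled in settings.py.
    """

    CNPJ_FIELDS = ('CnpjContratante', 'CnpjCorrespondente')

    def process_item(self, item: dict, spider: Spider) -> dict:
        for field in self.CNPJ_FIELDS:
            value = item.get(field)
            if isinstance(value, str):
                # Keep only digits, e.g. '12.345.678/0001-90' -> '12345678000190'
                item[field] = re.sub(r'\D', '', value)
        return item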
src/settings.py
"""Scrapy settings module.

This module contains Scrapy settings for the project, defining various configurations and options.

For more comprehensive details on Scrapy settings, refer to the official documentation:
http://doc.scrapy.org/en/latest/topics/settings.html
"""

# You can update these options and add new ones
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
ITEM_PIPELINES = {
    'src.pipelines.TitleItemPipeline': 123,
}
SPIDER_MIDDLEWARES = {
    'src.middlewares.TitleSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'src.middlewares.TitleDownloaderMiddleware': 543,
}
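The project keeps the template defaults above. If the Actor were extended beyond this single API endpoint, standard Scrapy throttling settings such as AUTOTHROTTLE_ENABLED and DOWNLOAD_DELAY could be appended; the values below are only a suggested sketch and are not part of the original settings.py.

# Optional politeness settings (illustrative; not part of the original project settings)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4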
.dockerignore
.git
.mise.toml
.nvim.lua
storage

# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
.python-version

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file. For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
.gitignore
.mise.toml
.nvim.lua
storage

# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
.python-version

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file. For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
requirements.txt
# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify[scrapy] ~= 2.0.0
nest-asyncio
scrapy
scrapy.cfg
[settings]
default = src.settings

[deploy]
project = src