BCB Correspondentes
Under maintenance
Pricing: Pay per usage
Total users: 1
Monthly users: 1
Runs succeeded: >99%
Last modified: 4 months ago
.actor/Dockerfile
# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.12

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]
.actor/actor.json
{
    "actorSpecification": 1,
    "name": "bcb-correspondentes",
    "title": "Getting started with Python and Scrapy",
    "description": "Scrapes titles of websites using Scrapy.",
    "version": "0.0",
    "buildTag": "latest",
    "meta": {
        "templateId": "python-scrapy"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
.actor/input_schema.json
{
    "title": "Python Scrapy Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [{ "url": "https://apify.com" }],
            "editor": "requestListSources"
        },
        "proxyConfiguration": {
            "sectionCaption": "Proxy and HTTP configuration",
            "title": "Proxy configuration",
            "type": "object",
            "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.",
            "editor": "proxy",
            "prefill": { "useApifyProxy": true },
            "default": { "useApifyProxy": true }
        }
    },
    "required": ["startUrls"]
}
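For reference, a run that targets the BCB endpoint used by the spider (rather than the prefilled https://apify.com) could be given an input like the sketch below. This is only an illustrative example of an object matching the schema above, not a file from the project.

{
    "startUrls": [
        { "url": "https://olinda.bcb.gov.br/olinda/servico/Informes_Correspondentes/versao/v1/odata/Correspondentes?$format=json" }
    ],
    "proxyConfiguration": { "useApifyProxy": true }
}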
src/spiders/__init__.py
"""
Scrapy spiders package

This package contains the spiders for your Scrapy project. Spiders are the classes that define how to scrape
and process data from websites.

For detailed information on creating and utilizing spiders, refer to the official documentation:
https://docs.scrapy.org/en/latest/topics/spiders.html
"""
src/spiders/title.py
from __future__ import annotations

import json
from typing import Generator

from scrapy import Spider
from scrapy.http import Response


class TitleSpider(Spider):
    """
    Scrapes data from the Banco Central API and extracts JSON data.
    """

    name = 'title_spider'
    start_urls = [
        'https://olinda.bcb.gov.br/olinda/servico/Informes_Correspondentes/versao/v1/odata/Correspondentes?$format=json'
    ]

    def parse(self, response: Response) -> Generator[dict, None, None]:
        """
        Parse the API JSON response.

        Args:
            response: The JSON response from the API.

        Yields:
            Extracted data as dictionaries.
        """
        self.logger.info('Parsing API response from %s...', response.url)

        try:
            # Convert the response body into a Python dictionary
            data = json.loads(response.text)

            # Check whether the 'value' key is present
            records = data.get('value', [])
            if not records:
                self.logger.warning('No data found under the "value" key. Response: %s', response.text[:500])
                return

            # Iterate over the records and extract the desired fields
            for record in records:
                yield {
                    'CnpjContratante': record.get('CnpjContratante'),
                    'NomeContratante': record.get('NomeContratante'),
                    'CnpjCorrespondente': record.get('CnpjCorrespondente'),
                    'NomeCorrespondente': record.get('NomeCorrespondente'),
                    'Tipo': record.get('Tipo'),
                    'Ordem': record.get('Ordem'),
                    'MunicipioIBGE': record.get('MunicipioIBGE'),
                    'Municipio': record.get('Municipio'),
                    'UF': record.get('UF'),
                    'ServicosCorrespondentes': record.get('ServicosCorrespondentes'),
                    'Posicao': record.get('Posicao'),
                }
        except json.JSONDecodeError:
            self.logger.error('Failed to decode JSON. Response: %s', response.text[:500])
        except Exception as e:
            self.logger.error('Unexpected error: %s', str(e))
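For context, the Olinda OData endpoint returns its records under a top-level "value" key, which is why parse() reads data.get('value', []). The sketch below shows the rough response shape the spider expects; the field values are invented placeholders, not real records.

{
    "value": [
        {
            "CnpjContratante": "00000000000000",
            "NomeContratante": "Banco Exemplo S.A.",
            "CnpjCorrespondente": "11111111111111",
            "NomeCorrespondente": "Correspondente Exemplo Ltda.",
            "Tipo": "...",
            "Ordem": 1,
            "MunicipioIBGE": "3550308",
            "Municipio": "SAO PAULO",
            "UF": "SP",
            "ServicosCorrespondentes": "...",
            "Posicao": "2024-12-31"
        }
    ]
}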
src/__main__.py
"""Apify Actor integration for Scrapy projects.

This module transforms a Scrapy project into an Apify Actor, handling the configuration of logging, patching Scrapy's
logging system, and establishing the required environment to run the Scrapy spider within the Apify platform.

This file is specifically designed to be executed when the project is run as an Apify Actor using `apify run` locally
or being run on the Apify platform. It is not being executed when running the project as a Scrapy project using
`scrapy crawl title_spider`.

We recommend you do not modify this file unless you really know what you are doing.
"""

# We need to configure the logging first before we import anything else, so that nothing else imports
# `scrapy.utils.log` before we patch it.
from __future__ import annotations

from logging import StreamHandler, getLogger
from typing import Any

from scrapy.utils import log as scrapy_logging
from scrapy.utils.project import get_project_settings

from apify.log import ActorLogFormatter

# Define names of the loggers.
MAIN_LOGGER_NAMES = ['apify', 'apify_client', 'scrapy']
OTHER_LOGGER_NAMES = ['filelock', 'hpack', 'httpcore', 'httpx', 'protego', 'twisted']
ALL_LOGGER_NAMES = MAIN_LOGGER_NAMES + OTHER_LOGGER_NAMES

# To change the logging level, modify the `LOG_LEVEL` field in `settings.py`. If the field is not present in the file,
# Scrapy will default to `DEBUG`. This setting applies to all loggers. If you wish to change the logging level for
# a specific logger, do it in this file.
settings = get_project_settings()
LOGGING_LEVEL = settings['LOG_LEVEL']

# Define a logging handler which will be used for the loggers.
apify_handler = StreamHandler()
apify_handler.setFormatter(ActorLogFormatter(include_logger_name=True))


def configure_logger(logger_name: str | None, log_level: str, *handlers: StreamHandler) -> None:
    """Configure a logger with the specified settings.

    Args:
        logger_name: The name of the logger to be configured.
        log_level: The desired logging level ('DEBUG', 'INFO', 'WARNING', 'ERROR', ...).
        handlers: Optional list of logging handlers.
    """
    logger = getLogger(logger_name)
    logger.setLevel(log_level)
    logger.handlers = []

    for handler in handlers:
        logger.addHandler(handler)


# Apify loggers have to be set up here and in the `new_configure_logging` as well to be able to use them both from
# the `main.py` and Scrapy components.
for logger_name in MAIN_LOGGER_NAMES:
    configure_logger(logger_name, LOGGING_LEVEL, apify_handler)

# We can't attach our log handler to the loggers normally, because Scrapy would remove them in the `configure_logging`
# call here: https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/utils/log.py#L113 (even though
# `disable_existing_loggers` is set to False :facepalm:). We need to monkeypatch Scrapy's `configure_logging` method
# like this, so that our handler is attached right after Scrapy calls the `configure_logging` method, because
# otherwise we would lose some log messages.
old_configure_logging = scrapy_logging.configure_logging


def new_configure_logging(*args: Any, **kwargs: Any) -> None:
    """Configure logging for Scrapy and root loggers to ensure consistent logging behavior.

    We need to manually configure both the root logger and all Scrapy-associated loggers. Configuring only the root
    logger is not sufficient, as Scrapy will override it with its own settings. Scrapy uses these four primary
    loggers - https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/utils/log.py#L60:L77. Therefore, we configure here
    these four loggers and the root logger.
    """
    old_configure_logging(*args, **kwargs)

    # We modify the root (None) logger to ensure proper display of logs from spiders when using the `self.logger`
    # property within spiders. See details in the Spider logger property:
    # https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/spiders/__init__.py#L43:L46.
    configure_logger(None, LOGGING_LEVEL, apify_handler)

    # We modify other loggers only by setting up their log level. A custom log handler is added
    # only to the root logger to avoid duplicate log messages.
    for logger_name in ALL_LOGGER_NAMES:
        configure_logger(logger_name, LOGGING_LEVEL)

    # Set the HTTPX logger explicitly to the WARNING level, because it is too verbose and spams the logs with useless
    # messages, especially when running on the platform.
    configure_logger('httpx', 'WARNING')


scrapy_logging.configure_logging = new_configure_logging

# Now we can do the rest of the setup.
import asyncio
import os

import nest_asyncio
from scrapy.utils.reactor import install_reactor

from .main import main

# For compatibility between Twisted (used by Scrapy) and AsyncIO (used by Apify) asynchronous libraries, it is
# necessary to set the Twisted reactor to `AsyncioSelectorReactor`. This setup allows the two asynchronous libraries
# to work together.
#
# Note: The reactor must be installed before applying `nest_asyncio.apply()`, otherwise, it will not work correctly
# on Windows.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
nest_asyncio.apply()

# Specify the path to the Scrapy project settings module.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'

# Run the Apify main coroutine in the event loop.
asyncio.run(main())
src/items.py
"""Scrapy item models module.

This module defines Scrapy item models for scraped data. Items represent structured data
extracted by spiders.

For detailed information on creating and utilizing items, refer to the official documentation:
https://docs.scrapy.org/en/latest/topics/items.html
"""

from scrapy import Field, Item


class TitleItem(Item):
    """
    Represents a title item scraped from a web page.
    """

    url = Field()
    title = Field()
src/main.py
"""This module defines the main entry point for the Apify Actor.

This module defines the main coroutine for the Apify Scrapy Actor, executed from the __main__.py file. The coroutine
processes the Actor's input and executes the Scrapy spider. Additionally, it updates Scrapy project settings by
applying Apify-related settings, which include adding a custom scheduler, retry middleware, and an item pipeline
for pushing data to the Apify dataset.

Customization:
--------------

Feel free to customize this file to add specific functionality to the Actor, such as incorporating your own Scrapy
components like spiders and handling Actor input. However, make sure you have a clear understanding of your
modifications. For instance, removing `apply_apify_settings` will break the integration between Scrapy and Apify.

Documentation:
--------------

For an in-depth description of the Apify-Scrapy integration process, our Scrapy components, known limitations and
other stuff, please refer to the following documentation page: https://docs.apify.com/cli/docs/integrating-scrapy.
"""

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

# Import your Scrapy spider here.
from .spiders.title import TitleSpider as Spider

# Default input values for local execution using `apify run`.
LOCAL_DEFAULT_START_URLS = [
    {'url': 'https://olinda.bcb.gov.br/olinda/servico/Informes_Correspondentes/versao/v1/odata/Correspondentes?$format=json'}
]


async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Retrieve and process Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', LOCAL_DEFAULT_START_URLS)
        proxy_config = actor_input.get('proxyConfiguration')

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs.
        for start_url in start_urls:
            url = start_url.get('url')
            await request_queue.add_request(url)

        # Apply Apify settings, which will override the Scrapy project settings.
        settings = apply_apify_settings(proxy_config=proxy_config)

        # Execute the spider using Scrapy's `CrawlerProcess`.
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
src/middlewares.py
"""Scrapy middlewares module.

This module defines Scrapy middlewares. Middlewares are processing components that handle requests and
responses, typically used for adding custom headers, retrying requests, and handling exceptions.

There are 2 types of middlewares: spider middlewares and downloader middlewares. For detailed information
on creating and utilizing them, refer to the official documentation:
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
https://docs.scrapy.org/en/latest/topics/spider-middleware.html
"""

from __future__ import annotations

from typing import Generator, Iterable

from scrapy import Request, Spider, signals
from scrapy.crawler import Crawler
from scrapy.http import Response

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class TitleSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> TitleSpiderMiddleware:
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response: Response, spider: Spider) -> None:
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(
        self,
        response: Response,
        result: Iterable,
        spider: Spider,
    ) -> Generator[Iterable[Request] | None, None, None]:
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(
        self,
        response: Response,
        exception: BaseException,
        spider: Spider,
    ) -> Iterable[Request] | None:
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(
        self, start_requests: Iterable[Request], spider: Spider
    ) -> Iterable[Request]:
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider: Spider) -> None:
        pass


class TitleDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> TitleDownloaderMiddleware:
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request: Request, spider: Spider) -> Request | Response | None:
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request: Request, response: Response, spider: Spider) -> Request | Response:
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request: Request, exception: BaseException, spider: Spider) -> Response | None:
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider: Spider) -> None:
        pass
src/pipelines.py
"""Scrapy item pipelines module.

This module defines Scrapy item pipelines for scraped data. Item pipelines are processing components
that handle the scraped items, typically used for cleaning, validating, and persisting data.

For detailed information on creating and utilizing item pipelines, refer to the official documentation:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
"""

from scrapy import Spider

from .items import TitleItem


class TitleItemPipeline:
    """
    This item pipeline defines processing steps for TitleItem objects scraped by spiders.
    """

    def process_item(self, item: TitleItem, spider: Spider) -> TitleItem:
        # Do something with the item here, such as cleaning it or persisting it to a database
        return item
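The template pipeline above is a pass-through. Because the spider yields plain dictionaries containing CNPJ fields rather than TitleItem instances, one possible extension, shown here only as a hypothetical sketch and not registered in settings.py, would be to normalize those fields before they are pushed to the dataset:

import re

from scrapy import Spider


class CorrespondentesCleaningPipeline:
    """Illustrative pipeline that strips non-digit characters from CNPJ fields.

    Hypothetical example only; it is not part of the project and not enabled in settings.py.
    """

    CNPJ_FIELDS = ('CnpjContratante', 'CnpjCorrespondente')

    def process_item(self, item: dict, spider: Spider) -> dict:
        for field in self.CNPJ_FIELDS:
            value = item.get(field)
            if isinstance(value, str):
                # Keep only digits, e.g. '12.345.678/0001-90' -> '12345678000190'
                item[field] = re.sub(r'\D', '', value)
        return item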
src/settings.py
"""Scrapy settings module.

This module contains Scrapy settings for the project, defining various configurations and options.

For more comprehensive details on Scrapy settings, refer to the official documentation:
http://doc.scrapy.org/en/latest/topics/settings.html
"""

# You can update these options and add new ones
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
ITEM_PIPELINES = {
    'src.pipelines.TitleItemPipeline': 123,
}
SPIDER_MIDDLEWARES = {
    'src.middlewares.TitleSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'src.middlewares.TitleDownloaderMiddleware': 543,
}
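The project keeps the template defaults above. If the Actor were extended beyond this single API endpoint, standard Scrapy throttling settings such as AUTOTHROTTLE_ENABLED and DOWNLOAD_DELAY could be appended; the values below are only a suggested sketch and are not part of the original settings.py.

# Optional politeness settings (illustrative; not part of the original project settings)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS_PER_DOMAIN = 4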
.dockerignore
.git
.mise.toml
.nvim.lua
storage

# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
.python-version

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file. For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
.gitignore
.mise.toml
.nvim.lua
storage

# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
.python-version

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file. For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
requirements.txt
# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify[scrapy] ~= 2.0.0
nest-asyncio
scrapy
scrapy.cfg
[settings]
default = src.settings

[deploy]
project = src