GreenTrace-scrapper
Pricing
Pay per usage
Developer
And Sama
8 days ago
Last modified
GreenTrace-scrapper searches Google for one company and collects the link-bearing results. Depending on the input flags, it forwards the crawlable URLs to `6sigmag/fast-website-content-crawler`, fetches their page content with `jina.ai`, or both. It then stores one combined summary item in the default dataset.
Inputs
- `company`: company name to search
- `query_suffix`: ESG-related terms appended to the company name
- `results_per_page`: Google results requested per page
- `max_pages_per_query`: maximum Google pages to fetch
- `keyword_terms`: optional keywords used to annotate crawler results
- `enable_fast_crawler`: disabled by default, enable if you want to use `6sigmag/fast-website-content-crawler`
- `enable_jina_ai`: enabled by default, uses `jina.ai` to fetch page content
- `jina_api_key`: optional user-provided Jina API key
- `jina_engine`: Jina engine, `direct` by default, or `browser`
- `jina_timeout_secs`: timeout passed to Jina requests
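As a rough illustration of how these inputs fit together, the sketch below builds an input dict from the defaults stated above (fast crawler off, Jina on, `direct` engine). `build_actor_input` is a hypothetical helper and not part of the Actor; any field without a documented default is left for the caller to supply.

```python
from typing import Any

ALLOWED_JINA_ENGINES = {"direct", "browser"}


def build_actor_input(company: str, **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: merge caller overrides into the documented defaults."""
    if not company.strip():
        raise ValueError("company must be a non-empty string")

    actor_input: dict[str, Any] = {
        "company": company.strip(),
        "enable_fast_crawler": False,  # documented default
        "enable_jina_ai": True,        # documented default
        "jina_engine": "direct",       # documented default
    }
    actor_input.update(overrides)

    # Reject engines other than the two listed in the input description.
    if actor_input["jina_engine"] not in ALLOWED_JINA_ENGINES:
        raise ValueError(f"unsupported jina_engine: {actor_input['jina_engine']!r}")
    return actor_input
```

Validating on the client side like this fails fast before a paid run is started; the Actor's own input schema remains the authority on what is accepted.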
Pipeline
- Run `apify/google-search-scraper` with the company query.
- Recursively collect link-bearing URLs from the Google output.
- Normalize and deduplicate those URLs for downstream crawling.
- If enabled, run `6sigmag/fast-website-content-crawler` with the normalized `startUrls`.
- If enabled, call `jina.ai` for each forwarded URL, with or without a user-provided API key.
- Store one summary dataset item containing:
  - Google run metadata and raw results
  - extracted link candidates
  - forwarded crawler URLs
  - optional fast crawler output
  - optional Jina content output
  - keyword-match annotations
  - overall status and partial-failure details
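The link collection and normalization steps can be sketched roughly as follows. This is not the Actor's actual code: the function names, the `$`-style path notation, and the specific normalization rules (lowercased scheme and host, fragment and trailing slash removed) are all assumptions chosen for illustration.

```python
from typing import Any, Iterator
from urllib.parse import urlsplit, urlunsplit


def iter_link_candidates(node: Any, path: str = "$") -> Iterator[dict[str, str]]:
    """Walk nested dicts/lists and yield every http(s) string with its path."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_link_candidates(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_link_candidates(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield {"path": path, "url": node}


def normalize_url(url: str) -> str:
    """Assumed rules: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(candidates: list[dict[str, str]]) -> list[str]:
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for candidate in candidates:
        normalized = normalize_url(candidate["url"])
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique
```

Yielding the path next to each URL mirrors the `path` and `url` pair that the summary item exposes for link candidates, which makes it easier to trace a forwarded URL back to the Google result it came from.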
Notes
- Local runs need a valid `APIFY_TOKEN` so this Actor can call Apify Actors.
- The dataset item can become large because it contains output from both stages.
- Local `storage/` data stays local and is not automatically uploaded to Apify Console.
- Fast crawler usage is disabled by default.
- Jina usage is attempted without authentication if no `jina_api_key` is provided.
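To make the "with or without a key" behaviour concrete, here is a minimal sketch of how a per-URL Jina request could be assembled. Everything in it is an assumption rather than the Actor's implementation: the `https://r.jina.ai/` Reader prefix, the `X-Engine` and `X-Timeout` header names, and the helper name `build_jina_request` should all be checked against Jina's current documentation before use.

```python
from typing import Optional

# Assumption: Jina's public Reader endpoint, which takes the target URL as a suffix.
JINA_READER_PREFIX = "https://r.jina.ai/"


def build_jina_request(
    url: str,
    api_key: Optional[str] = None,
    engine: str = "direct",
    timeout_secs: int = 30,
) -> tuple[str, dict[str, str]]:
    """Hypothetical helper: return (request_url, headers) for one page fetch."""
    headers = {
        "X-Engine": engine,               # assumed header name
        "X-Timeout": str(timeout_secs),   # assumed header name
    }
    # Only authenticate when the caller supplied a key; otherwise go anonymous.
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return JINA_READER_PREFIX + url, headers
```

Returning the URL and headers instead of sending the request keeps the sketch independent of any particular HTTP client.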
Integrating with Python apps or servers
This Actor is easiest to integrate through the official Python client.
1. Install the client
$ pip install apify-client
2. Run the Actor from Python
The example below calls `sama4/greentrace-scrapper`. If you deployed your own copy, replace that ID with `YOUR_USERNAME/YOUR_ACTOR_NAME`.
    from apify_client import ApifyClient

    client = ApifyClient("<APIFY_TOKEN>")

    actor_input = {
        "company": "H&M",
        "query_suffix": "ESG sustainability greenwashing 2024 2025",
        "results_per_page": 10,
        "max_pages_per_query": 1,
        "enable_fast_crawler": False,
        "enable_jina_ai": True,
        "jina_engine": "direct",
        "jina_timeout_secs": 200,
        "keyword_terms": [
            "esg",
            "sustainability",
            "greenwashing",
            "climate",
            "emissions",
            "governance",
        ],
    }

    run = client.actor("sama4/greentrace-scrapper").call(run_input=actor_input)
    items = client.dataset(run["defaultDatasetId"]).list_items().items

    for item in items:
        print(item["company"])
        print(item["overall_status"])
        print(item["forwarded_urls"])
        print(item["crawler_results"])
        print(item["jina_results"])
3. What the result contains
This is the most important part of the integration.
Each run currently writes one summary dataset item per company. That means when you call GreenTrace-scrapper for H&M, the dataset usually contains a single top-level object describing the full pipeline for that company.
The returned item is designed to be used as a combined response object for apps, APIs, dashboards, or downstream analysis jobs.
Top-level fields
- `company`
  - The company name you passed in the input.
  - Example: `"H&M"`
- `query`
  - The final Google query that was actually sent.
  - Usually built from `company + query_suffix`.
  - Example: `"H&M ESG sustainability greenwashing 2024 2025"`
- `query_suffix`
  - The ESG-related suffix used when building the query.
  - Useful for debugging and reproducibility.
- `keyword_terms`
  - The keywords used to annotate and score the crawler output.
  - These terms are matched against crawler content to help identify ESG-relevant pages.
- `results_per_page`
  - The requested Google result page size used during the search stage.
- `max_pages_per_query`
  - The maximum number of Google result pages requested.
- `enable_fast_crawler`
  - Whether `6sigmag/fast-website-content-crawler` was enabled for this run.
  - Default is `false`.
- `enable_jina_ai`
  - Whether `jina.ai` enrichment was enabled for this run.
  - Default is `true`.
- `jina_engine`
  - Which Jina engine was selected for the run.
  - Supported values are currently `direct` and `browser`.
- `overall_status`
  - High-level result for the entire pipeline.
  - Expected values:
    - `succeeded`: Google search worked and crawler either worked or was intentionally skipped
    - `partial`: one stage worked and another failed
    - `failed`: the pipeline did not produce a usable result
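One plausible reading of how the three `overall_status` values follow from the per-stage statuses is sketched below. The roll-up rule and the function name `derive_overall_status` are assumptions made for illustration; the Actor's real logic may differ, for example in how it treats a stage that is still `pending`.

```python
from typing import Iterable


def derive_overall_status(google_status: str, downstream_statuses: Iterable[str]) -> str:
    """Hypothetical roll-up of per-stage statuses into one overall status."""
    if google_status != "succeeded":
        # Without Google results, nothing downstream has URLs to work with.
        return "failed"

    # Stages that were switched off or skipped do not count against the run.
    active = [s for s in downstream_statuses if s not in {"skipped", "disabled"}]
    if any(s in {"failed", "partial"} for s in active):
        return "partial"
    return "succeeded"
```

For an integration, the practical consequence is that `partial` items can still carry usable content, so it is usually worth inspecting them rather than discarding them.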
Google stage fields
- `google_stage`
  - Summary metadata about the Google scraper run.
  - Contains:
    - `status`: `pending`, `succeeded`, or `failed`
    - `actor_id`: usually `apify/google-search-scraper`
    - `run_id`: the Apify run ID of the Google stage
    - `run_status`: the actual Apify platform status such as `SUCCEEDED`
    - `status_message`: any run message returned by the stage
    - `result_count`: number of dataset items returned by the Google actor
    - `link_candidate_count`: number of link candidates extracted from the Google output
    - `error`: error text if the Google stage failed
- `google_results`
  - Raw dataset items returned by `apify/google-search-scraper`.
  - This is the full upstream evidence collected before filtering.
  - Use this if you want:
    - the original search output
    - search debugging
    - auditing what links were discovered
    - building your own filtering logic later
- `google_link_candidates`
  - Flattened link candidates extracted from the raw Google output.
  - Each entry is typically shaped like:
    - `path`: where in the Google result object the link was found
    - `url`: the extracted URL string
  - This is useful when you want to inspect exactly which URLs were discovered before normalization.
- `forwarded_urls`
  - Final deduplicated URLs that were actually passed into `6sigmag/fast-website-content-crawler`.
  - This is the most important bridge field between the two actors.
  - If you want to know what the crawler was asked to fetch, use this field.
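For auditing which sites the search stage surfaced, the `path` and `url` entries in `google_link_candidates` can be grouped by host. The sketch below assumes only that entry shape; `hosts_by_frequency` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosts_by_frequency(link_candidates: list[dict]) -> list[tuple[str, int]]:
    """Count link candidates per host, most frequent first."""
    hosts = Counter(
        urlsplit(candidate["url"]).netloc.lower()
        for candidate in link_candidates
        if candidate.get("url")  # ignore entries without a URL string
    )
    return hosts.most_common()
```

A quick look at this list often shows whether a run was dominated by the company's own domain or by third-party coverage.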
Crawler stage fields
- `crawler_stage`
  - Summary metadata about the fast crawler run.
  - Contains:
    - `status`: `pending`, `succeeded`, `failed`, or `skipped`
    - `actor_id`: usually `6sigmag/fast-website-content-crawler`
    - `run_id`: the Apify run ID of the crawler stage
    - `run_status`: actual Apify platform status such as `SUCCEEDED`
    - `status_message`: any crawler-stage run message
    - `result_count`: number of crawler dataset items collected
    - `matching_result_count`: how many crawler results matched at least one ESG keyword
    - `error`: error text if the crawler stage failed
- `crawler_results`
  - Raw content results returned by the fast crawler, with extra annotations added by GreenTrace-scrapper.
  - This is the field you should use if you want the actual crawled website content.
  - Each item comes from the downstream crawler and is then enriched with:
    - `analysis_matched_keywords`
    - `analysis_keyword_match_count`
    - `analysis_keyword_relevance`
  - In practice, this is where your application will usually read page-level content, extracted text, or other crawler metadata.
- `matching_crawler_results`
  - Filtered subset of `crawler_results`.
  - Only includes crawler items where at least one ESG keyword matched.
  - This is often the best field for:
    - ESG review pipelines
    - RAG ingestion
    - scoring workflows
    - analyst-facing dashboards
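The three `analysis_*` annotations could be produced along these lines. This is a sketch under stated assumptions, not the Actor's scoring code: it assumes the page text lives under a `text` key, uses case-insensitive substring matching, and defines relevance as the fraction of keywords that matched.

```python
def annotate_with_keywords(page: dict, keyword_terms: list[str], text_field: str = "text") -> dict:
    """Return a copy of `page` with hypothetical keyword-match annotations added."""
    text = str(page.get(text_field, "")).lower()
    matched = [term for term in keyword_terms if term.lower() in text]

    annotated = dict(page)  # shallow copy so the input page is left untouched
    annotated["analysis_matched_keywords"] = matched
    annotated["analysis_keyword_match_count"] = len(matched)
    # Assumed definition: share of the supplied keywords found in the text.
    annotated["analysis_keyword_relevance"] = (
        len(matched) / len(keyword_terms) if keyword_terms else 0.0
    )
    return annotated


def only_matching(pages: list[dict]) -> list[dict]:
    """Keep annotated pages where at least one keyword matched."""
    return [p for p in pages if p.get("analysis_keyword_match_count", 0) > 0]
```

Substring matching is cheap but will also match inside longer words, so a stricter tokenizer may be preferable if false positives matter.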
Jina stage fields
- `jina_stage`
  - Summary metadata about the `jina.ai` enrichment stage.
  - Contains:
    - `status`: `pending`, `succeeded`, `partial`, `failed`, `skipped`, or `disabled`
    - `provider`: `jina.ai`
    - `engine`: selected engine, for example `direct` or `browser`
    - `attempted_count`: number of URLs sent to Jina
    - `success_count`: number of successful Jina fetches
    - `failure_count`: number of failed Jina fetches
    - `used_api_key`: whether a user-provided API key was used
    - `error`: stage-level error if all requests failed
- `jina_results`
  - Page content returned by `jina.ai` for the forwarded URLs.
  - This is the main field to use if you want extracted article or page text from Jina.
  - Each item usually contains:
    - `url`
    - `engine`
    - `status_code`
    - `content`
    - `used_api_key`
    - `error` when that specific fetch failed
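Because individual Jina fetches can fail while the stage as a whole still reports `partial`, consumers will usually want to skip failed entries. A minimal sketch, relying only on the `url`, `content`, and `error` keys listed above (`successful_jina_pages` is a hypothetical helper name):

```python
def successful_jina_pages(item: dict) -> dict[str, str]:
    """Map URL to page text for Jina results that carry content and no error."""
    pages: dict[str, str] = {}
    for result in item.get("jina_results") or []:
        # Skip entries that record a per-URL error or came back empty.
        if result.get("error") or not result.get("content"):
            continue
        pages[result["url"]] = result["content"]
    return pages
```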
Recommended usage patterns
Depending on your app, you will usually read the result like this:
- For the overall run outcome:
  - use `overall_status`
- For debugging the Google phase:
  - use `google_stage`, `google_results`, and `google_link_candidates`
- For seeing exactly what was sent to the crawler:
  - use `forwarded_urls`
- For all crawled content:
  - use `crawler_results`
- For only ESG-relevant content:
  - use `matching_crawler_results`
- For Jina-fetched page text:
  - use `jina_results`
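When an application just needs "the best available content", these fields can be tried in order of preference. The ordering below (keyword-matched crawler pages, then all crawler pages, then Jina pages) is one reasonable choice rather than something the Actor prescribes, and `select_content` is a hypothetical helper.

```python
def select_content(item: dict) -> tuple[str, list]:
    """Return (field_name, pages) for the first non-empty content field."""
    for field in ("matching_crawler_results", "crawler_results", "jina_results"):
        pages = item.get(field) or []
        if pages:
            return field, pages
    # No stage produced content, for example when Google returned no usable links.
    return "none", []
```

With the default settings (fast crawler off, Jina on) this typically falls through to `jina_results`.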
Important storage note
GreenTrace-scrapper currently stores the crawler output nested inside the final summary object.
So the current model is:
- one dataset row per company, not one dataset row per crawled page
That is why your integration code usually does:
- fetch the first dataset item
- then read `crawler_results` or `matching_crawler_results` from inside it
If you need a flatter format later, the Actor can be extended to push:
- one summary item per company
- plus one additional dataset item per crawled page
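Until such an extension exists, the nested summary can be flattened on the client side. The sketch below is an illustration only: `flatten_summary` and the added `source` column are assumptions, and it relies only on the top-level field names described earlier.

```python
def flatten_summary(item: dict) -> list[dict]:
    """Turn one nested summary item into one row per crawled or fetched page."""
    rows: list[dict] = []
    for source in ("crawler_results", "jina_results"):
        for page in item.get(source) or []:
            # Page keys come first so the company and source columns always win.
            rows.append({**page, "company": item.get("company"), "source": source})
    return rows
```

Each row keeps the page's own keys and gains `company` and `source`, which is usually enough to load the result into a table or a vector store.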
4. Helper function for reuse inside apps
    from apify_client import ApifyClient


    def fetch_company_esg(company: str, token: str, actor_id: str) -> dict:
        client = ApifyClient(token)
        run = client.actor(actor_id).call(
            run_input={
                "company": company,
                "query_suffix": "ESG sustainability greenwashing 2024 2025",
                "results_per_page": 10,
                "max_pages_per_query": 1,
                "enable_fast_crawler": False,
                "enable_jina_ai": True,
                "jina_engine": "direct",
            }
        )
        items = client.dataset(run["defaultDatasetId"]).list_items().items
        return items[0] if items else {}
5. Example FastAPI integration
    import os

    from apify_client import ApifyClient
    from fastapi import FastAPI

    app = FastAPI()
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    ACTOR_ID = "YOUR_USERNAME/YOUR_ACTOR_NAME"


    @app.get("/company-esg/{company}")
    def get_company_esg(company: str):
        run = client.actor(ACTOR_ID).call(
            run_input={
                "company": company,
                "query_suffix": "ESG sustainability greenwashing 2024 2025",
                "results_per_page": 10,
                "max_pages_per_query": 1,
                "enable_fast_crawler": False,
                "enable_jina_ai": True,
                "jina_engine": "direct",
            }
        )
        items = client.dataset(run["defaultDatasetId"]).list_items().items
        return items[0] if items else {"company": company, "overall_status": "empty"}
6. Recommended production usage
- Store `APIFY_TOKEN` in environment variables, not in source code.
- Store `jina_api_key` in environment variables too, if you use one.
- Deploy this Actor first, then call the deployed Actor from your app or API server.
- Add request timeouts and error handling around `client.actor(...).call(...)`.
- Cache completed results if you expect repeated lookups for the same company.
- If you need page-by-page persistence instead of one nested summary object, extend the Actor before integrating.
- If you do not want the downstream crawler, leave `enable_fast_crawler` set to `false`.
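The caching and error-handling advice can be combined in a small wrapper around any fetch function, such as the `fetch_company_esg` helper from step 4. This is a sketch, not a production cache: the class name, the in-memory dict, the one-hour default TTL, and the choice to cache only `succeeded` items are all assumptions.

```python
import time
from typing import Callable


class CachedCompanyLookup:
    """Hypothetical in-memory TTL cache around a company lookup function."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_secs: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_secs = ttl_secs
        self._clock = clock
        self._cache: dict[str, tuple[float, dict]] = {}

    def get(self, company: str) -> dict:
        key = company.strip().lower()
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl_secs:
            return cached[1]

        try:
            item = self._fetch(company)
        except Exception as exc:  # network, auth, or Actor run errors
            return {"company": company, "overall_status": "failed", "error": str(exc)}

        # Only cache fully successful runs so failures are retried next time.
        if item.get("overall_status") == "succeeded":
            self._cache[key] = (now, item)
        return item
```

Usage would look like `lookup = CachedCompanyLookup(lambda c: fetch_company_esg(c, token, actor_id))` followed by `lookup.get("H&M")`. A shared store such as Redis would be needed if several server processes should see the same cache.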
Development
- Main implementation: `my_actor/main.py`
- Actor config: `.actor/actor.json`
- Input schema: `.actor/input_schema.json`
- Dataset view: `.actor/dataset_schema.json`