ETF Holdings Overlap Intelligence Scraper
Pricing
from $1.00 / 1,000 results
Reveal hidden ETF overlap by scraping holdings data, normalizing fund exposures, and showing which stocks your ETFs secretly share.
Developer
Leafy
Maintained by Community
ETF Holdings & Overlap Intelligence Scraper
Collect ETF holdings from public issuer sources, normalize the many different issuer formats into one clean schema, and calculate ETF overlap, shared holdings, and hidden concentration exposure.
You give it a list of ETF tickers. It returns:
- Every holding inside each ETF (one dataset row per holding).
- Which stocks are shared across your ETFs.
- How much overlap exists between each pair of ETFs.
- Your true combined weighted exposure to each underlying company across your whole ETF portfolio.
- Basic ETF metadata when the source exposes it.
⚠️ Disclaimer — This actor provides ETF holdings data processing and overlap calculations for research purposes only. It does not provide financial advice, investment recommendations, or predictions. Overlap and exposure figures are estimates based on each issuer's publicly reported holdings.
Who it's for
- Beginner and intermediate investors who want to see if their ETFs secretly own the same companies.
- ETF investors checking for hidden concentration (e.g. how much NVIDIA you really own across VOO + QQQ + SPY).
- Financial bloggers and personal-finance creators.
- Research analysts and developers building ETF comparison tools.
What it does (the flow)
- You provide ETF tickers.
- For each ticker it picks the best public source (issuer-first).
- It downloads the issuer's public holdings file (CSV / XLSX) or public data endpoint.
- It normalizes every holding into one consistent schema.
- It writes one dataset row per holding.
- It calculates pairwise overlap between every pair of ETFs.
- It calculates combined weighted exposure across all your ETFs, using your portfolio allocation.
- It saves summary objects to the key-value store.
- It records what worked, what failed, and which source was used per ticker.
A single failing ticker never crashes the run — it is reported in FAILED_TICKERS and the rest continue.
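The failure isolation described above can be sketched as a loop that catches errors per ticker. This is a minimal illustration, not the actor's actual code: `fetchAll`, the `fetchHoldings` callback, and the `errorType` / `sourceTried` properties read from the thrown error are hypothetical names chosen to mirror the FAILED_TICKERS fields.

```javascript
// Fetch holdings for each ticker in turn; a failure is recorded, not rethrown.
// `fetchHoldings` is a hypothetical async function: ticker -> array of holdings.
async function fetchAll(tickers, fetchHoldings) {
  const holdingsByEtf = {};
  const failedTickers = [];
  for (const ticker of tickers) {
    try {
      holdingsByEtf[ticker] = await fetchHoldings(ticker);
    } catch (err) {
      // Assumed error shape: adapters may attach errorType / sourceTried.
      failedTickers.push({
        ticker,
        errorType: err?.errorType ?? 'unknown',
        errorMessage: err?.message ?? String(err),
        sourceTried: err?.sourceTried ?? null,
      });
    }
  }
  return { holdingsByEtf, failedTickers };
}
```

Because each ticker has its own `try` / `catch`, the successful tickers still reach the analysis step and the failures end up in a single list.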
Supported sources in this MVP
| Issuer | Method | Status |
|---|---|---|
| Vanguard | Public portfolio-holdings JSON endpoint | ✅ Working (VOO, VTI, VUG, …) — top 500 holdings |
| State Street / SPDR | Public daily-holdings XLSX download | ✅ Working (SPY, XLF, XLK, DIA, …) |
| Invesco (bonus) | Public holdings JSON API (dng-api.invesco.com) | ✅ Working (QQQ, QQQM, RSP) |
| Schwab (bonus) | Public "all holdings" page (server-rendered HTML table) | ✅ Working (SCHD, SCHG, SCHB, …) |
| iShares / BlackRock | Public holdings CSV (product page resolved via the public iShares product screener) | ⚠️ Works with a US residential proxy; the CSV endpoint is geo/consent-gated on datacenter IPs |
| SEC N-PORT | Official monthly filings | 🚧 Planned — returns not_implemented |
Source strategy: issuer-first. The actor first tries the issuer it recognizes for a ticker (see the built-in ticker map), then probes the other adapters as a fallback. It deliberately does not use ETF.com, ETFdb, Yahoo, or Nasdaq as a holdings source — the issuer is the authoritative source.
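As a rough sketch of the issuer-first strategy: put the recognized issuer's adapter first, then the remaining adapters as fallback probes. The ticker map, the adapter names, and `adapterOrderFor` are assumptions for illustration; the real built-in map lives in the actor's router and may differ.

```javascript
// Hypothetical ticker map and adapter list, for illustration only.
const TICKER_TO_ISSUER = { VOO: 'vanguard', SPY: 'spdr', QQQ: 'invesco', SCHD: 'schwab' };
const ADAPTER_ORDER = ['vanguard', 'spdr', 'invesco', 'schwab', 'ishares'];

// Return adapter names in the order they should be tried for a ticker:
// the recognized issuer first, then every other adapter as a fallback probe.
function adapterOrderFor(ticker, tickerMap = TICKER_TO_ISSUER, adapters = ADAPTER_ORDER) {
  const known = tickerMap[String(ticker).trim().toUpperCase()];
  if (!known || !adapters.includes(known)) return [...adapters];
  return [known, ...adapters.filter((name) => name !== known)];
}
```

An unrecognized ticker simply probes every adapter in the default order.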
About the iShares geo/consent gating
iShares serves a location/consent HTML interstitial instead of the holdings
CSV when the request comes from a non-US or datacenter IP. The actor detects
this and reports source_blocked rather than saving garbage — it does not
attempt to bypass the consent gate, CAPTCHAs, logins, or bot protection. Running
on Apify with a US residential proxy (RESIDENTIAL, country US) is the
recommended way to reach iShares. Vanguard, SPDR, Invesco and Schwab all work on
the default proxy.
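One plausible way to detect the interstitial is to check whether a response that should be CSV is actually an HTML page. The sketch below assumes that heuristic; `classifyCsvResponse` is a hypothetical helper, and the actor's real detection may use different signals.

```javascript
// Classify a response that was expected to be a holdings CSV.
// Returns 'source_blocked' when the body looks like an HTML page, else 'ok'.
function classifyCsvResponse(body, contentType = '') {
  // Only the start of the body is inspected; this is a heuristic, not a parser.
  const head = String(body).slice(0, 512).trimStart().toLowerCase();
  const looksLikeHtml =
    contentType.toLowerCase().includes('text/html') ||
    head.startsWith('<!doctype html') ||
    head.startsWith('<html');
  return looksLikeHtml ? 'source_blocked' : 'ok';
}
```

Checking before parsing means an HTML page is reported as a blocked source instead of being read as malformed holdings rows.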
Calculations performed
Pairwise overlap (PAIRWISE_OVERLAP)
For every pair of ETFs (A, B):
- sharedHoldingsCount: holdings appearing in both.
- overlapByCountPercent: sharedHoldingsCount / etfAHoldingsCount * 100.
- overlapWeightInEtfA: sum of A's weights for holdings also in B.
- overlapWeightInEtfB: sum of B's weights for holdings also in A.
- estimatedOverlapWeight: the average of the two weight overlaps. This is an estimate, not a precise financial overlap score.
- topSharedHoldings: the most heavily shared names, sorted by combined weight.
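The pairwise fields above can be sketched in a few lines. This is an illustration under stated assumptions, not the actor's implementation: `pairwiseOverlap` is a hypothetical name, each holding is assumed to carry a unique `normalizedHoldingKey` and a numeric `weight` in percent, and null weights are treated as 0.

```javascript
// Compute overlap between two ETFs from their normalized holdings.
// Assumes keys are unique within each ETF; weights are percentages.
function pairwiseOverlap(holdingsA, holdingsB, topN = 10) {
  const toMap = (rows) => new Map(rows.map((h) => [h.normalizedHoldingKey, h.weight ?? 0]));
  const a = toMap(holdingsA);
  const b = toMap(holdingsB);
  const shared = [...a.keys()].filter((key) => b.has(key));
  const sumShared = (map) => shared.reduce((total, key) => total + map.get(key), 0);
  const overlapWeightInEtfA = sumShared(a);
  const overlapWeightInEtfB = sumShared(b);
  return {
    sharedHoldingsCount: shared.length,
    overlapByCountPercent: a.size ? (shared.length / a.size) * 100 : 0,
    overlapWeightInEtfA,
    overlapWeightInEtfB,
    // Average of the two weight overlaps: an estimate, not a precise score.
    estimatedOverlapWeight: (overlapWeightInEtfA + overlapWeightInEtfB) / 2,
    topSharedHoldings: shared
      .map((key) => ({ key, combinedWeight: a.get(key) + b.get(key) }))
      .sort((x, y) => y.combinedWeight - x.combinedWeight)
      .slice(0, topN),
  };
}
```

Note that overlapByCountPercent is relative to ETF A, so swapping the arguments can change it, while estimatedOverlapWeight is symmetric.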
Combined weighted exposure (COMBINED_EXPOSURE)
Your true exposure to each underlying company across your whole ETF portfolio:
combined exposure(holding) = Σ (holdingWeightInEtf% × portfolioAllocation% / 100), summed over every ETF you hold
Example: a holding at 7.25% of an ETF to which you allocate 40% of your portfolio
contributes 7.25 × 0.40 = 2.90% combined exposure.
Portfolio weights come from the weight column of the portfolio input. If you
don't provide any, all successful ETFs are weighted equally. Weights that
don't sum to 100 are normalized, and failed tickers are ignored (the
remaining weights are re-normalized to 100).
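The formula and the normalization rules can be sketched together. The function names and the input shape (an object mapping each successful ETF ticker to its holdings) are assumptions for illustration. One further assumption is labeled in the code: when only some weights are supplied, blank ones count as 0, which the text above does not specify.

```javascript
// Normalize allocations so they sum to 100; equal weights when none are given.
// Assumption: if only some weights are supplied, missing ones count as 0.
function normalizeAllocations(tickers, weights = {}) {
  const given = tickers.map((t) => Number(weights[t]) || 0);
  const total = given.reduce((sum, w) => sum + w, 0);
  return Object.fromEntries(
    tickers.map((t, i) => [t, total > 0 ? (given[i] / total) * 100 : 100 / tickers.length]),
  );
}

// holdingsByEtf contains only ETFs that were fetched successfully, so
// failed tickers drop out and the remaining weights are re-normalized.
function combinedExposure(holdingsByEtf, weights = {}) {
  const allocations = normalizeAllocations(Object.keys(holdingsByEtf), weights);
  const exposure = new Map();
  for (const [etf, holdings] of Object.entries(holdingsByEtf)) {
    for (const h of holdings) {
      const contribution = ((h.weight ?? 0) * allocations[etf]) / 100;
      const key = h.normalizedHoldingKey;
      exposure.set(key, (exposure.get(key) ?? 0) + contribution);
    }
  }
  return [...exposure.entries()]
    .map(([key, combinedExposurePercent]) => ({ key, combinedExposurePercent }))
    .sort((x, y) => y.combinedExposurePercent - x.combinedExposurePercent);
}
```

Passing weights such as `{ VOO: 2, QQQ: 2 }` gives the same result as `{ VOO: 50, QQQ: 50 }`, since only the ratios matter after normalization.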
Holding matching
Holdings are matched across ETFs by a normalizedHoldingKey, in priority order:
ticker → CUSIP → ISIN → cleaned company name. The cleaner strips share-class
tokens and legal suffixes so NVIDIA CORP and NVIDIA CORPORATION match.
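A minimal sketch of that priority order follows. The suffix list, the share-class pattern, and both function names are illustrative assumptions; the actor's real cleaner likely covers more cases.

```javascript
// Illustrative (incomplete) list of legal suffixes to strip from names.
const LEGAL_SUFFIXES = new Set([
  'CORP', 'CORPORATION', 'INC', 'INCORPORATED', 'CO', 'COMPANY', 'LTD', 'LIMITED', 'PLC',
]);

// Upper-case, drop punctuation, share-class tokens ("CLASS A", "CL B"),
// and legal suffixes, then collapse whitespace.
function cleanCompanyName(name) {
  return String(name)
    .toUpperCase()
    .replace(/[.,]/g, ' ')
    .replace(/\b(CLASS|CL)\s+[A-Z]\b/g, ' ')
    .split(/\s+/)
    .filter((token) => token && !LEGAL_SUFFIXES.has(token))
    .join(' ');
}

// Priority order: ticker, then CUSIP, then ISIN, then cleaned company name.
function normalizedHoldingKey({ ticker, cusip, isin, name }) {
  const id = [ticker, cusip, isin].find((v) => typeof v === 'string' && v.trim() !== '');
  if (id) return id.trim().toUpperCase();
  return name ? cleanCompanyName(name) : null;
}
```

With this sketch, `cleanCompanyName('NVIDIA CORP')` and `cleanCompanyName('NVIDIA CORPORATION')` both reduce to the same string, so the two rows match.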
Input
| Field | Type | Default | Description |
|---|---|---|---|
| portfolio | array | (required) | Your ETFs. In the UI this is a two-column list: ticker on the left, weight (optional) on the right. Stored as [{ "key": "VOO", "value": "40" }, …]. |
| includeHoldings | boolean | true | Fetch and save each ETF's holdings. |
| includeMetadata | boolean | true | Save basic ETF metadata when available. |
| calculateOverlap | boolean | true | Calculate pairwise overlap + combined exposure. |
| maxHoldingsPerEtf | integer | 0 | 0 = all holdings; >0 = top N by weight per ETF. |
| sourcePreference | enum | issuer | One of issuer or auto (both issuer-first), or sec_nport_later (planned; falls back to issuer). |
| proxyConfiguration | object | { useApifyProxy: true } | Proxy settings. A US residential proxy is recommended for iShares. |
| debugMode | boolean | false | Log source URLs, detected file types, and detected header columns. |
The ETF list and portfolio weights are a single input — a two-column list in the UI (ticker | weight). Leave the weight column blank to weight ETFs equally, or fill it to set allocations (they need not sum to 100 — they're normalized).
The parser also accepts a plain string list (["VOO", "QQQ:30"]) and the legacy
separate tickers (array) + portfolioWeights (object) inputs, for backward
compatibility and easy API use. portfolio is the recommended input.
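To illustrate how the two `portfolio` shapes could be reduced to one form, here is a small sketch. `parsePortfolio` is a hypothetical name, the legacy tickers / portfolioWeights inputs are not handled, and a blank or non-numeric weight becomes null.

```javascript
// Accepts [{ key, value }] entries from the UI or plain strings like "QQQ:30".
// Returns [{ ticker, weight }] with weight = null when none was supplied.
function parsePortfolio(portfolio) {
  const entries = [];
  for (const item of portfolio ?? []) {
    let ticker;
    let rawWeight;
    if (typeof item === 'string') {
      [ticker, rawWeight] = item.split(':');
    } else if (item && typeof item === 'object') {
      ({ key: ticker, value: rawWeight } = item);
    }
    ticker = String(ticker ?? '').trim().toUpperCase();
    if (!ticker) continue; // skip blank rows
    const weight = Number.parseFloat(rawWeight);
    entries.push({ ticker, weight: Number.isFinite(weight) ? weight : null });
  }
  return entries;
}
```

For example, `parsePortfolio(["VOO", "QQQ:30"])` yields a null weight for VOO and 30 for QQQ; how mixed blank and filled weights are then allocated is left to the normalization step.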
Example input
    {
      "portfolio": [
        { "key": "VOO", "value": "40" },
        { "key": "QQQ", "value": "30" },
        { "key": "SCHD", "value": "30" }
      ],
      "includeHoldings": true,
      "includeMetadata": true,
      "calculateOverlap": true,
      "maxHoldingsPerEtf": 0,
      "sourcePreference": "issuer",
      "proxyConfiguration": { "useApifyProxy": true },
      "debugMode": false
    }
Output
Dataset — one row per holding
    {
      "etfTicker": "VOO",
      "etfName": "Vanguard S&P 500 ETF",
      "issuer": "Vanguard",
      "holdingTicker": "NVDA",
      "holdingName": "NVIDIA Corp.",
      "holdingIdentifier": null,
      "cusip": "67066G104",
      "isin": "US67066G1040",
      "sector": null,
      "assetClass": null,
      "country": null,
      "weight": 7.89,
      "shares": 636185341,
      "marketValue": 134324172898.74,
      "sourceType": "issuer",
      "sourceName": "Vanguard",
      "sourceUrl": "https://investor.vanguard.com/investment-products/etfs/profile/api/VOO/portfolio-holding/stock",
      "asOfDate": "2026-05-31",
      "rankInEtf": 1,
      "rawHoldingName": "NVIDIA Corp.",
      "rawHoldingTicker": "NVDA",
      "normalizedHoldingKey": "NVDA",
      "scrapedAt": "2026-07-03T00:26:38.807Z"
    }
Unavailable fields are null — the actor never fabricates data.
Key-value store outputs
| Key | Contents |
|---|---|
| SUMMARY | High-level run summary: successes, failures, portfolio weights used, highest-overlap pair, top combined exposures, source notes. |
| ETF_METADATA | Array of per-ETF metadata (fund name, issuer, holdings count, as-of date, source URL). |
| OVERLAP_MATRIX | Compact matrix of estimatedOverlapWeight and sharedHoldingsCount keyed by ticker. |
| PAIRWISE_OVERLAP | Array of per-pair overlap objects (see above). |
| COMBINED_EXPOSURE | Array of holdings sorted by combined weighted exposure descending. |
| FAILED_TICKERS | Array of { ticker, errorType, errorMessage, sourceTried } for anything that couldn't be fetched. |
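To show how OVERLAP_MATRIX might relate to PAIRWISE_OVERLAP, the sketch below folds a list of pair objects into a nested object keyed by ticker. The exact stored layout is not documented here, so the nested shape, the `etfA` / `etfB` field names, and `buildOverlapMatrix` are assumptions.

```javascript
// Build a symmetric lookup: matrix[tickerA][tickerB] -> compact overlap stats.
// Assumes each pair object carries etfA and etfB ticker fields (hypothetical).
function buildOverlapMatrix(pairs) {
  const matrix = {};
  for (const pair of pairs) {
    const cell = {
      estimatedOverlapWeight: pair.estimatedOverlapWeight,
      sharedHoldingsCount: pair.sharedHoldingsCount,
    };
    for (const [row, col] of [[pair.etfA, pair.etfB], [pair.etfB, pair.etfA]]) {
      matrix[row] ??= {};
      matrix[row][col] = { ...cell };
    }
  }
  return matrix;
}
```

Both fields are symmetric in A and B, which is why the same cell can be stored in both directions.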
How to run locally
Requires Node.js 18+.
    npm install

    # Provide input (local Apify storage):
    mkdir -p storage/key_value_stores/default
    cat > storage/key_value_stores/default/INPUT.json <<'JSON'
    {
      "portfolio": ["VOO:60", "SPY:40"],
      "proxyConfiguration": { "useApifyProxy": false },
      "debugMode": true
    }
    JSON

    npm start
Holdings appear in storage/datasets/default/, and the summary files in
storage/key_value_stores/default/.
Run the analysis unit tests (offline, deterministic):
    npm run test:analysis
How to run on Apify
- Push the actor: apify push (or import this repo on the Apify console).
- Open the actor, fill in the input form (tickers, portfolio weights).
- For iShares / Invesco coverage, set Proxy to Apify Proxy with the RESIDENTIAL group and country US.
- Run. Read the Dataset tab for holdings and the Storage / Key-value store tab for SUMMARY, PAIRWISE_OVERLAP, COMBINED_EXPOSURE, etc.
Project structure
    src/
      main.js                         # Orchestrator: input → fetch → normalize → analyze → save
      sources/
        issuerRouter.js               # Ticker → issuer routing + probing
        ishares.js                    # iShares CSV (product resolved via public screener)
        spdr.js                       # SPDR daily-holdings XLSX
        vanguard.js                   # Vanguard public holdings JSON
        invesco.js                    # Invesco holdings JSON API (bonus issuer, e.g. QQQ)
        schwab.js                     # Schwab server-rendered holdings table (bonus, e.g. SCHD)
        secNport.js                   # SEC N-PORT placeholder (not_implemented)
      normalize/
        normalizeHolding.js           # Raw holding → canonical schema + matching key
        normalizeTicker.js            # Ticker cleanup + de-dup
        normalizePrice.js             # Number/weight parsing helpers
      analysis/
        calculateOverlap.js           # Pairwise overlap + overlap matrix
        calculateCombinedExposure.js  # Portfolio-weighted exposure
        calculateSummary.js           # SUMMARY builder
      utils/
        http.js                       # got-scraping client + error classification + HTML detection
        csv.js                        # CSV → rows
        excel.js                      # XLSX → rows
        html.js                       # HTML table → rows (for server-rendered holdings)
        table.js                      # Dynamic header detection + column mapping
        dates.js                      # Loose date → ISO
        logging.js                    # Logger wrapper
    INPUT_SCHEMA.json
    Dockerfile
    test/
      analysis.test.js                # Deterministic overlap/exposure/normalization tests
The source-adapter architecture means adding a new issuer is just a new file in
src/sources/ plus a routing entry — the normalizer and analysis layers are shared.
Known limitations
- Holdings availability depends on each issuer's public data. iShares gates its CSV behind a location/consent interstitial (see above); a US residential proxy is recommended for it.
- Data can be delayed depending on the source — Vanguard's public feed is typically month-end; SPDR is daily; Invesco is monthly; Schwab is a few days lagged.
- Vanguard's public endpoint returns only the top ~500 holdings. For concentrated funds (VOO ≈ 500 holdings) this is effectively complete, but for broad funds (VTI ≈ 3,600 holdings) overlap is measured on the top 500 (~85–90% of fund weight).
- Some ETFs are not supported in this MVP. Covered issuers: Vanguard, SPDR, Invesco, Schwab, and iShares (with proxy). First Trust, VanEck, Global X, WisdomTree, ProShares, etc. are future work.
- Schwab publishes market value abbreviated (e.g. "$4.2B"); those are parsed to approximate numbers, while weight and share counts are exact.
- Some holdings lack a ticker, sector, or weight in the source file; those fields are null rather than guessed.
- Different issuers publish different fields (e.g. SPY's file omits sector and market value; Vanguard's feed omits sector), so normalized output reflects only what the source provides.
- Overlap and combined exposure are estimates based on reported holdings and weights, not a precise financial overlap score.
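The Schwab limitation above, where market value is published in abbreviated form, comes down to expanding a suffix into a multiplier. A minimal sketch, assuming suffixes K, M, B and T and a hypothetical helper name `parseAbbreviatedMoney`:

```javascript
// Multipliers for abbreviated amounts; the result is approximate because the
// source has already rounded the figure (e.g. "$4.2B").
const MAGNITUDES = { K: 1e3, M: 1e6, B: 1e9, T: 1e12 };

// Parse strings like "$4.2B" or "$850M" into a number; null if unparseable.
function parseAbbreviatedMoney(text) {
  const match = /^\s*\$?\s*([\d,]*\.?\d+)\s*([KMBT])?\s*$/i.exec(String(text ?? ''));
  if (!match) return null;
  const value = Number(match[1].replace(/,/g, ''));
  const factor = match[2] ? MAGNITUDES[match[2].toUpperCase()] : 1;
  return Number.isFinite(value) ? value * factor : null;
}
```

Returning null for anything unrecognized keeps the output consistent with the rule that unavailable fields are null rather than guessed.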
Compliance notes
- Uses only public issuer data and public downloadable holdings files.
- Does not scrape private data or bypass paywalls, logins, CAPTCHAs, or bot protection.
- Does not use paid APIs. (Morningstar and similar licensed sources are intentionally avoided.)
- Data sources and as-of dates are labeled on every row (sourceName, sourceUrl, asOfDate).
- Output is data and calculations only, not financial advice.
Future improvements (not in this MVP)
- Full SEC N-PORT historical holdings parser.
- Sector / country / asset-class overlap breakdowns.
- Expense-ratio and dividend-yield comparison.
- More issuers (First Trust, VanEck, Global X, WisdomTree, ProShares).
- Enriching iShares metadata (expense ratio, AUM, inception date) from the product screener even when the holdings CSV is gated.
- Recurring monitoring and change alerts.