ETF Holdings Overlap Intelligence Scraper

Pricing

from $1.00 / 1,000 results

Reveal hidden ETF overlap by scraping holdings data, normalizing fund exposures, and showing which stocks your ETFs secretly share.

Developer

Leafy

Maintained by Community

ETF Holdings & Overlap Intelligence Scraper

Collect ETF holdings from public issuer sources, normalize the many different issuer formats into one clean schema, and calculate ETF overlap, shared holdings, and hidden concentration exposure.

You give it a list of ETF tickers. It returns:

  • Every holding inside each ETF (one dataset row per holding).
  • Which stocks are shared across your ETFs.
  • How much overlap exists between each pair of ETFs.
  • Your true combined weighted exposure to each underlying company across your whole ETF portfolio.
  • Basic ETF metadata when the source exposes it.

⚠️ Disclaimer — This actor provides ETF holdings data processing and overlap calculations for research purposes only. It does not provide financial advice, investment recommendations, or predictions. Overlap and exposure figures are estimates based on each issuer's publicly reported holdings.


Who it's for

  • Beginner and intermediate investors who want to see if their ETFs secretly own the same companies.
  • ETF investors checking for hidden concentration (e.g. how much NVIDIA you really own across VOO + QQQ + SPY).
  • Financial bloggers and personal-finance creators.
  • Research analysts and developers building ETF comparison tools.

What it does (the flow)

  1. You provide ETF tickers.
  2. For each ticker it picks the best public source (issuer-first).
  3. It downloads the issuer's public holdings file (CSV / XLSX) or public data endpoint.
  4. It normalizes every holding into one consistent schema.
  5. It writes one dataset row per holding.
  6. It calculates pairwise overlap between every pair of ETFs.
  7. It calculates combined weighted exposure across all your ETFs, using your portfolio allocation.
  8. It saves summary objects to the key-value store.
  9. It records what worked, what failed, and which source was used per ticker.

A single failing ticker never crashes the run — it is reported in FAILED_TICKERS and the rest continue.
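The failure isolation described above could be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: `collectHoldings` and the injected `fetchHoldings` function are hypothetical names, and the error fields simply mirror the FAILED_TICKERS shape listed under the key-value store outputs.

```javascript
// Hypothetical sketch: fetchHoldings(ticker) stands in for whatever adapter
// call the actor really makes; it resolves to an array of holding rows.
async function collectHoldings(tickers, fetchHoldings) {
  const holdings = [];
  const failedTickers = [];

  for (const ticker of tickers) {
    try {
      const rows = await fetchHoldings(ticker);
      holdings.push(...rows);
    } catch (err) {
      // Record the failure and keep going; one bad ticker must not end the run.
      failedTickers.push({
        ticker,
        errorType: err.errorType ?? 'unknown',
        errorMessage: err.message,
        sourceTried: err.sourceTried ?? null,
      });
    }
  }

  return { holdings, failedTickers };
}
```

Wrapping each ticker in its own try/catch keeps the successful rows even when a later ticker throws.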


Supported sources in this MVP

| Issuer | Method | Status |
| --- | --- | --- |
| Vanguard | Public portfolio-holdings JSON endpoint | ✅ Working (VOO, VTI, VUG, …); top 500 holdings |
| State Street / SPDR | Public daily-holdings XLSX download | ✅ Working (SPY, XLF, XLK, DIA, …) |
| Invesco (bonus) | Public holdings JSON API (dng-api.invesco.com) | ✅ Working (QQQ, QQQM, RSP) |
| Schwab (bonus) | Public "all holdings" page (server-rendered HTML table) | ✅ Working (SCHD, SCHG, SCHB, …) |
| iShares / BlackRock | Public holdings CSV (product page resolved via the public iShares product screener) | ⚠️ Works with a US residential proxy; the CSV endpoint is geo/consent-gated on datacenter IPs |
| SEC N-PORT | Official monthly filings | 🚧 Planned; returns not_implemented |

Source strategy: issuer-first. The actor first tries the issuer it recognizes for a ticker (see the built-in ticker map), then probes the other adapters as a fallback. It deliberately does not use ETF.com, ETFdb, Yahoo, or Nasdaq as a holdings source — the issuer is the authoritative source.
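As a rough sketch of the issuer-first strategy: try the recognized issuer first, then probe the remaining adapters. The ticker map, the adapter shape (an object of async functions keyed by issuer name), and the rule that an empty result counts as a miss are all assumptions made for illustration. The real routing lives in src/sources/issuerRouter.js and may differ.

```javascript
// Hypothetical ticker map; the actor's built-in map is larger.
const TICKER_TO_ISSUER = { VOO: 'vanguard', SPY: 'spdr', QQQ: 'invesco', SCHD: 'schwab' };

// Put the recognized issuer first, then every other adapter as a fallback.
function orderAdapters(ticker, adapters) {
  const names = Object.keys(adapters);
  const preferred = TICKER_TO_ISSUER[ticker.toUpperCase()];
  if (!preferred || !(preferred in adapters)) return names;
  return [preferred, ...names.filter((name) => name !== preferred)];
}

async function fetchIssuerFirst(ticker, adapters) {
  const sourceTried = [];
  for (const name of orderAdapters(ticker, adapters)) {
    sourceTried.push(name);
    try {
      const rows = await adapters[name](ticker);
      if (rows.length > 0) return { source: name, rows, sourceTried };
    } catch {
      // This adapter could not serve the ticker; probe the next one.
    }
  }
  const err = new Error(`No source returned holdings for ${ticker}`);
  err.sourceTried = sourceTried;
  throw err;
}
```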

About the iShares geo/consent gating

iShares serves a location/consent HTML interstitial instead of the holdings CSV when the request comes from a non-US or datacenter IP. The actor detects this and reports source_blocked rather than saving garbage — it does not attempt to bypass the consent gate, CAPTCHAs, logins, or bot protection. Running on Apify with a US residential proxy (RESIDENTIAL, country US) is the recommended way to reach iShares. Vanguard, SPDR, Invesco and Schwab all work on the default proxy.
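One plausible way to detect the interstitial is to check whether the response that should be a CSV is actually HTML. The sketch below assumes the response body and Content-Type header are already available as strings. The function names are hypothetical, and the actor's real check in src/utils/http.js may use different signals.

```javascript
// Returns true when a response that should be CSV looks like an HTML page.
function looksLikeHtml(body, contentType = '') {
  if (/text\/html/i.test(contentType)) return true;
  // Only inspect the start of the body; trimStart() also drops a leading BOM.
  const head = body.slice(0, 512).trimStart().toLowerCase();
  return head.startsWith('<!doctype html') || head.startsWith('<html');
}

// Classify instead of parsing: a blocked source is reported, never saved.
function classifyCsvResponse(body, contentType) {
  return looksLikeHtml(body, contentType)
    ? { ok: false, errorType: 'source_blocked' }
    : { ok: true };
}
```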


Calculations performed

Pairwise overlap (PAIRWISE_OVERLAP)

For every pair of ETFs (A, B):

  • sharedHoldingsCount: the number of holdings that appear in both ETFs.
  • overlapByCountPercent: sharedHoldingsCount / etfAHoldingsCount * 100, so the figure is relative to ETF A.
  • overlapWeightInEtfA: sum of A's weights for holdings also in B.
  • overlapWeightInEtfB: sum of B's weights for holdings also in A.
  • estimatedOverlapWeight: the average of the two weight overlaps. This is an estimate, not a precise financial overlap score.
  • topSharedHoldings: the most heavily shared names, sorted by combined weight.
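The pairwise fields above can be computed in a single pass over ETF A's holdings. The following is a minimal sketch under stated assumptions: each ETF is a hypothetical object of the form `{ ticker, holdings: [{ normalizedHoldingKey, weight }] }`, null weights count as 0, and holdings without a matching key are skipped. It is not the code in src/analysis/calculateOverlap.js.

```javascript
function pairwiseOverlap(etfA, etfB, topN = 10) {
  // Index B's holdings by matching key for constant-time lookups.
  const byKeyB = new Map(etfB.holdings.map((h) => [h.normalizedHoldingKey, h]));

  let overlapWeightInEtfA = 0;
  let overlapWeightInEtfB = 0;
  const shared = [];

  for (const a of etfA.holdings) {
    if (a.normalizedHoldingKey == null) continue; // unmatched rows cannot overlap
    const b = byKeyB.get(a.normalizedHoldingKey);
    if (!b) continue;
    const weightInEtfA = a.weight ?? 0; // assumption: null weight counts as 0
    const weightInEtfB = b.weight ?? 0;
    overlapWeightInEtfA += weightInEtfA;
    overlapWeightInEtfB += weightInEtfB;
    shared.push({ key: a.normalizedHoldingKey, weightInEtfA, weightInEtfB });
  }

  const countA = etfA.holdings.length;
  return {
    etfA: etfA.ticker,
    etfB: etfB.ticker,
    sharedHoldingsCount: shared.length,
    overlapByCountPercent: countA > 0 ? (shared.length / countA) * 100 : 0,
    overlapWeightInEtfA,
    overlapWeightInEtfB,
    estimatedOverlapWeight: (overlapWeightInEtfA + overlapWeightInEtfB) / 2,
    // Most heavily shared names first, ranked by combined weight.
    topSharedHoldings: shared
      .sort((x, y) => (y.weightInEtfA + y.weightInEtfB) - (x.weightInEtfA + x.weightInEtfB))
      .slice(0, topN),
  };
}
```

Because overlapByCountPercent divides by A's holding count, calling the function with the arguments swapped generally gives a different percentage.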

Combined weighted exposure (COMBINED_EXPOSURE)

Your true exposure to each underlying company across your whole ETF portfolio:

combined exposure(holding) = Σ holdingWeightInEtf% × portfolioAllocation% / 100
(over every ETF you hold)

Example: a holding that makes up 7.25% of an ETF to which you allocate 40% of your portfolio contributes 7.25 × 0.40 = 2.90% combined exposure.

Portfolio weights come from the weight column of the portfolio input. If you don't provide any, all successful ETFs are weighted equally. Weights that don't sum to 100 are normalized, and failed tickers are ignored (the remaining weights are re-normalized to 100).
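The weight normalization and the exposure sum could look roughly like this. It is a sketch only: the function names and input shapes are hypothetical, the ticker list is assumed to contain only successfully fetched ETFs, and treating a missing weight as 0 when other weights are present is an assumption the text above does not settle.

```javascript
// Scale allocations so they sum to 100; fall back to equal weights when none are given.
function normalizeAllocations(successfulTickers, weights = {}) {
  const raw = successfulTickers.map((t) => Number(weights[t]) || 0);
  const total = raw.reduce((sum, w) => sum + w, 0);
  return Object.fromEntries(
    successfulTickers.map((t, i) => [
      t,
      total > 0 ? (raw[i] / total) * 100 : 100 / successfulTickers.length,
    ]),
  );
}

// combined exposure(holding) = sum over ETFs of weightInEtf% * allocation% / 100
function combinedExposure(etfs, allocations) {
  const totals = new Map();
  for (const etf of etfs) {
    const allocation = allocations[etf.ticker] ?? 0;
    for (const h of etf.holdings) {
      if (h.normalizedHoldingKey == null) continue;
      const contribution = ((h.weight ?? 0) * allocation) / 100;
      totals.set(h.normalizedHoldingKey, (totals.get(h.normalizedHoldingKey) ?? 0) + contribution);
    }
  }
  return [...totals]
    .map(([normalizedHoldingKey, exposure]) => ({ normalizedHoldingKey, combinedExposure: exposure }))
    .sort((x, y) => y.combinedExposure - x.combinedExposure);
}
```

With the example above, a 7.25% holding in an ETF allocated 40% contributes 7.25 × 40 / 100 = 2.90 percentage points.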

Holding matching

Holdings are matched across ETFs by a normalizedHoldingKey, in priority order: ticker → CUSIP → ISIN → cleaned company name. The cleaner strips share-class tokens and legal suffixes so NVIDIA CORP and NVIDIA CORPORATION match.
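A matching key along these lines could be built as follows. The suffix list, the `CUSIP:` / `ISIN:` / `NAME:` prefixes, and the function names are illustrative assumptions. Only the priority order and the NVIDIA example come from the description above, and the real cleaner in src/normalize/normalizeHolding.js likely handles more cases.

```javascript
// Illustrative, non-exhaustive list of legal suffixes and share-class tokens.
const NOISE_TOKENS = /\b(CORPORATION|CORP|INCORPORATED|INC|LIMITED|LTD|PLC|COMPANY|CO|CLASS [A-Z]|CL [A-Z])\b/g;

function cleanCompanyName(name) {
  return name
    .toUpperCase()
    .replace(/[.,]/g, ' ')       // "Corp." -> "CORP "
    .replace(NOISE_TOKENS, ' ')  // drop suffixes and share classes
    .replace(/\s+/g, ' ')
    .trim();
}

// Priority order: ticker, then CUSIP, then ISIN, then cleaned company name.
function normalizedHoldingKey({ ticker, cusip, isin, name }) {
  if (ticker) return ticker.trim().toUpperCase();
  if (cusip) return `CUSIP:${cusip.trim().toUpperCase()}`;
  if (isin) return `ISIN:${isin.trim().toUpperCase()}`;
  if (name) return `NAME:${cleanCompanyName(name)}`;
  return null; // nothing to match on
}
```

For example, both `NVIDIA CORP` and `NVIDIA CORPORATION` clean to `NVIDIA`, so two ticker-less rows with those names would share the key `NAME:NVIDIA`.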


Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| portfolio | array | (required) | Your ETFs. In the UI this is a two-column list: ticker on the left, weight (optional) on the right. Stored as [{ "key": "VOO", "value": "40" }, …]. |
| includeHoldings | boolean | true | Fetch and save each ETF's holdings. |
| includeMetadata | boolean | true | Save basic ETF metadata when available. |
| calculateOverlap | boolean | true | Calculate pairwise overlap + combined exposure. |
| maxHoldingsPerEtf | integer | 0 | 0 = all holdings; >0 = top N by weight per ETF. |
| sourcePreference | enum | issuer | One of issuer or auto (both issuer-first), or sec_nport_later (planned; falls back to issuer). |
| proxyConfiguration | object | { useApifyProxy: true } | Proxy settings. A US residential proxy is recommended for iShares. |
| debugMode | boolean | false | Log source URLs, detected file types, and detected header columns. |

The ETF list and portfolio weights are a single input — a two-column list in the UI (ticker | weight). Leave the weight column blank to weight ETFs equally, or fill it to set allocations (they need not sum to 100 — they're normalized).

The parser also accepts a plain string list (["VOO", "QQQ:30"]) and the legacy separate tickers (array) + portfolioWeights (object) inputs, for backward compatibility and easy API use. portfolio is the recommended input.
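To make the three accepted shapes concrete, here is a hedged sketch of a parser that folds them into one list of `{ ticker, weight }` entries. `parsePortfolio` is a hypothetical name, a blank or unparsable weight becomes null (meaning "not provided"), and the actor's real parser may behave differently at the edges.

```javascript
function parsePortfolio(input) {
  const entries = [];

  if (Array.isArray(input.portfolio)) {
    for (const item of input.portfolio) {
      if (typeof item === 'string') {
        // Plain string form: "VOO" or "QQQ:30".
        const [ticker, weight] = item.split(':');
        entries.push([ticker, weight]);
      } else if (item && typeof item === 'object') {
        // UI form: { key: "VOO", value: "40" }.
        entries.push([item.key, item.value]);
      }
    }
  } else if (Array.isArray(input.tickers)) {
    // Legacy form: tickers array plus optional portfolioWeights object.
    for (const ticker of input.tickers) {
      entries.push([ticker, input.portfolioWeights?.[ticker]]);
    }
  }

  return entries
    .filter(([ticker]) => typeof ticker === 'string' && ticker.trim() !== '')
    .map(([ticker, weight]) => {
      const parsed = Number.parseFloat(weight);
      return {
        ticker: ticker.trim().toUpperCase(),
        weight: Number.isFinite(parsed) ? parsed : null,
      };
    });
}
```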

Example input

{
  "portfolio": [
    { "key": "VOO", "value": "40" },
    { "key": "QQQ", "value": "30" },
    { "key": "SCHD", "value": "30" }
  ],
  "includeHoldings": true,
  "includeMetadata": true,
  "calculateOverlap": true,
  "maxHoldingsPerEtf": 0,
  "sourcePreference": "issuer",
  "proxyConfiguration": { "useApifyProxy": true },
  "debugMode": false
}

Output

Dataset — one row per holding

{
  "etfTicker": "VOO",
  "etfName": "Vanguard S&P 500 ETF",
  "issuer": "Vanguard",
  "holdingTicker": "NVDA",
  "holdingName": "NVIDIA Corp.",
  "holdingIdentifier": null,
  "cusip": "67066G104",
  "isin": "US67066G1040",
  "sector": null,
  "assetClass": null,
  "country": null,
  "weight": 7.89,
  "shares": 636185341,
  "marketValue": 134324172898.74,
  "sourceType": "issuer",
  "sourceName": "Vanguard",
  "sourceUrl": "https://investor.vanguard.com/investment-products/etfs/profile/api/VOO/portfolio-holding/stock",
  "asOfDate": "2026-05-31",
  "rankInEtf": 1,
  "rawHoldingName": "NVIDIA Corp.",
  "rawHoldingTicker": "NVDA",
  "normalizedHoldingKey": "NVDA",
  "scrapedAt": "2026-07-03T00:26:38.807Z"
}

Unavailable fields are null — the actor never fabricates data.

Key-value store outputs

| Key | Contents |
| --- | --- |
| SUMMARY | High-level run summary: successes, failures, portfolio weights used, highest-overlap pair, top combined exposures, source notes. |
| ETF_METADATA | Array of per-ETF metadata (fund name, issuer, holdings count, as-of date, source URL). |
| OVERLAP_MATRIX | Compact matrix of estimatedOverlapWeight and sharedHoldingsCount keyed by ticker. |
| PAIRWISE_OVERLAP | Array of per-pair overlap objects (see above). |
| COMBINED_EXPOSURE | Array of holdings sorted by combined weighted exposure descending. |
| FAILED_TICKERS | Array of { ticker, errorType, errorMessage, sourceTried } for anything that couldn't be fetched. |
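One way to picture OVERLAP_MATRIX is as a nested object derived from the per-pair results, so that `matrix[a][b]` can be looked up in either direction. The exact stored shape is not documented here. The nesting below, and the `etfA` / `etfB` field names on each pair, are assumptions made for illustration.

```javascript
// Build a symmetric lookup keyed by ticker from an array of pair objects.
function buildOverlapMatrix(pairs) {
  const matrix = {};
  for (const pair of pairs) {
    const cell = {
      estimatedOverlapWeight: pair.estimatedOverlapWeight,
      sharedHoldingsCount: pair.sharedHoldingsCount,
    };
    // Both stored metrics are symmetric, so the same cell serves both directions.
    (matrix[pair.etfA] ??= {})[pair.etfB] = cell;
    (matrix[pair.etfB] ??= {})[pair.etfA] = cell;
  }
  return matrix;
}
```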

How to run locally

Requires Node.js 18+.

npm install
# Provide input (local Apify storage):
mkdir -p storage/key_value_stores/default
cat > storage/key_value_stores/default/INPUT.json <<'JSON'
{
  "portfolio": ["VOO:60", "SPY:40"],
  "proxyConfiguration": { "useApifyProxy": false },
  "debugMode": true
}
JSON
npm start

Holdings appear in storage/datasets/default/, and the summary files in storage/key_value_stores/default/.

Run the analysis unit tests (offline, deterministic):

npm run test:analysis

How to run on Apify

  1. Push the actor: apify push (or import this repo on the Apify console).
  2. Open the actor, fill in the input form (tickers, portfolio weights).
  3. For iShares coverage, set Proxy to Apify Proxy with the RESIDENTIAL group and country US. Vanguard, SPDR, Invesco and Schwab work on the default proxy.
  4. Run. Read the Dataset tab for holdings and the Storage / Key-value store tab for SUMMARY, PAIRWISE_OVERLAP, COMBINED_EXPOSURE, etc.

Project structure

src/
  main.js                         # Orchestrator: input → fetch → normalize → analyze → save
  sources/
    issuerRouter.js               # Ticker → issuer routing + probing
    ishares.js                    # iShares CSV (product resolved via public screener)
    spdr.js                       # SPDR daily-holdings XLSX
    vanguard.js                   # Vanguard public holdings JSON
    invesco.js                    # Invesco holdings JSON API (bonus issuer, e.g. QQQ)
    schwab.js                     # Schwab server-rendered holdings table (bonus, e.g. SCHD)
    secNport.js                   # SEC N-PORT placeholder (not_implemented)
  normalize/
    normalizeHolding.js           # Raw holding → canonical schema + matching key
    normalizeTicker.js            # Ticker cleanup + de-dup
    normalizePrice.js             # Number/weight parsing helpers
  analysis/
    calculateOverlap.js           # Pairwise overlap + overlap matrix
    calculateCombinedExposure.js  # Portfolio-weighted exposure
    calculateSummary.js           # SUMMARY builder
  utils/
    http.js                       # got-scraping client + error classification + HTML detection
    csv.js                        # CSV → rows
    excel.js                      # XLSX → rows
    html.js                       # HTML table → rows (for server-rendered holdings)
    table.js                      # Dynamic header detection + column mapping
    dates.js                      # Loose date → ISO
    logging.js                    # Logger wrapper
INPUT_SCHEMA.json
Dockerfile
test/analysis.test.js             # Deterministic overlap/exposure/normalization tests

The source-adapter architecture means adding a new issuer is just a new file in src/sources/ plus a routing entry — the normalizer and analysis layers are shared.


Known limitations

  • Holdings availability depends on each issuer's public data. iShares gates its CSV behind a location/consent interstitial (see above); a US residential proxy is recommended for it.
  • Data can be delayed depending on the source — Vanguard's public feed is typically month-end; SPDR is daily; Invesco is monthly; Schwab is a few days lagged.
  • Vanguard's public endpoint returns only the top ~500 holdings. For funds with roughly 500 holdings (VOO ≈ 500 holdings) this is effectively complete, but for broad funds (VTI ≈ 3,600 holdings) overlap is measured on the top 500 only (~85–90% of fund weight).
  • Some ETFs are not supported in this MVP. Covered issuers: Vanguard, SPDR, Invesco, Schwab, and iShares (with proxy). First Trust, VanEck, Global X, WisdomTree, ProShares, etc. are future work.
  • Schwab publishes market values in abbreviated form (e.g. "$4.2B"); these are parsed to approximate numbers, while weights and share counts are exact.
  • Some holdings lack a ticker, sector, or weight in the source file; those fields are null rather than guessed.
  • Different issuers publish different fields (e.g. SPY's file omits sector and market value; Vanguard's feed omits sector) — normalized output reflects only what the source provides.
  • Overlap and combined exposure are estimates based on reported holdings and weights, not a precise financial overlap score.
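As an illustration of the abbreviated market values mentioned in the Schwab limitation above, a string such as "$4.2B" could be expanded like this. The function name and the set of accepted suffixes (K, M, B, T) are assumptions. The result is only as precise as the abbreviated source string, and unparsable input yields null rather than a guess.

```javascript
const SUFFIX_MULTIPLIERS = { K: 1e3, M: 1e6, B: 1e9, T: 1e12 };

// "$4.2B" -> about 4,200,000,000; "1,250" -> 1250; anything else -> null.
function parseAbbreviatedMoney(text) {
  if (typeof text !== 'string') return null;
  const compact = text.replace(/[$,\s]/g, '');
  const match = compact.match(/^(-?\d+(?:\.\d+)?)([KMBT])?$/i);
  if (!match) return null;
  const multiplier = match[2] ? SUFFIX_MULTIPLIERS[match[2].toUpperCase()] : 1;
  return Number(match[1]) * multiplier;
}
```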

Compliance notes

  • Uses only public issuer data and public downloadable holdings files.
  • Does not scrape private data or bypass paywalls, logins, CAPTCHAs, or bot protection.
  • Does not use paid APIs. (Morningstar and similar licensed sources are intentionally avoided.)
  • Data sources and as-of dates are labeled on every row (sourceName, sourceUrl, asOfDate).
  • Output is data and calculations only — not financial advice.

Future improvements (not in this MVP)

  • Full SEC N-PORT historical holdings parser.
  • Sector / country / asset-class overlap breakdowns.
  • Expense-ratio and dividend-yield comparison.
  • More issuers (First Trust, VanEck, Global X, WisdomTree, ProShares).
  • Enriching iShares metadata (expense ratio, AUM, inception date) from the product screener even when the holdings CSV is gated.
  • Recurring monitoring and change alerts.