SEC EDGAR Scraper for RAG: 10-K/10-Q/8-K as JSON

Pricing

from $20.00 / 1,000 extracted SEC filings

Extract SEC EDGAR filings (10-K, 10-Q, 8-K). Fixed-token text chunks of primary documents for finance LLMs and compliance RAG. Drop-in for LlamaIndex, LangChain. Skip manual XBRL parsing. $0.03/filing.

Developer

Devansh Tiwari

Maintained by Community

Extract SEC EDGAR filings into RAG-ready text chunks for finance and compliance LLMs. Get 10-K annual reports, 10-Q quarterly reports, and 8-K material events pre-chunked as JSON. Drop-in ready for LlamaIndex, LangChain, Pinecone, and Qdrant. Built for AI training data teams, buy-side research assistants, and M&A intelligence platforms. Skip manual HTML/XBRL parsing and messy SEC entity decoding.

What does SEC EDGAR RAG Extractor do?

This Actor fetches corporate filings directly from the SEC EDGAR database. It pulls the HTML primary document, strips out the noise (such as <ix:header> metadata and table styling), extracts the plain text, and slices it into fixed-token chunks with overlap. You get clean, LLM-ready JSON arrays representing the core text of the filing.
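To make "fixed-token chunks with overlap" concrete, here is a minimal sketch of the idea. It is not the Actor's own code: it assumes whitespace-separated words as a stand-in for model tokens, and the 64-token overlap is an illustrative value, since the actual tokenizer and overlap size are not documented here. The output shape mirrors the idx/text/tokens fields of the chunks array.

```python
from typing import Dict, List


def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> List[Dict]:
    """Split text into fixed-size token windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    # Assumption: whitespace tokens approximate the real tokenizer.
    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[Dict] = []
    for idx, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append({"idx": idx, "text": " ".join(window), "tokens": len(window)})
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

With chunk_size=4 and overlap=1, each chunk starts on the last token of the previous one, which is what keeps sentences that straddle a boundary retrievable from either side.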

Try it with Apple's CIK 0000320193 or search for "artificial intelligence risk factors". Runs reliably on the Apify platform with built-in SEC rate limiting.

Why use SEC EDGAR RAG Extractor?

  • Finance AI: Train models on clean corporate disclosures without writing custom HTML parsers.
  • Compliance RAG: Build chatbots that cite specific regulatory filings accurately.
  • M&A Research: Feed target company 10-Ks into your intelligence pipeline.
  • Sell-side Research: Monitor 8-K events and earnings drift automatically.

How to use SEC EDGAR RAG Extractor

  1. Set your User-Agent. The SEC requires a real name and email.
  2. Provide a list of CIKs (e.g., 0000320193 for Apple) or enter a Search Query (e.g., "AI risks").
  3. Select the Form Types you want (10-K, 10-Q, 8-K).
  4. Set your Date Range and Max Filings cap.
  5. Click "Start" and download the chunked JSON.

Input

Provide standard SEC parameters. Here is a JSON example:

{
  "cikList": ["0000320193"],
  "formTypes": ["10-K"],
  "dateFrom": "2024-01-01",
  "dateTo": "2024-12-31",
  "maxFilings": 5,
  "searchQuery": "",
  "userAgent": "Jane Smith jane@acme.com"
}
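If you build the input programmatically, a small helper can catch mistakes before the run starts. The sketch below is an assumption-laden illustration: build_run_input is a hypothetical helper, not part of the Actor, and the checks it performs (10-digit CIK padding, supported form types, date order) are inferred from the example above rather than from a published input schema.

```python
from datetime import date
from typing import Dict, Sequence

SUPPORTED_FORMS = {"10-K", "10-Q", "8-K"}


def build_run_input(
    ciks: Sequence[str],
    user_agent: str,
    form_types: Sequence[str] = ("10-K",),
    date_from: str = "2024-01-01",
    date_to: str = "2024-12-31",
    max_filings: int = 5,
    search_query: str = "",
) -> Dict:
    """Assemble the Actor input, zero-padding each CIK to 10 digits."""
    unknown = [f for f in form_types if f not in SUPPORTED_FORMS]
    if unknown:
        raise ValueError(f"unsupported form types: {unknown}")
    if date.fromisoformat(date_from) > date.fromisoformat(date_to):
        raise ValueError("dateFrom must not be later than dateTo")
    if not ciks and not search_query:
        raise ValueError("provide at least one CIK or a search query")
    return {
        "cikList": [str(int(c)).zfill(10) for c in ciks],  # "320193" -> "0000320193"
        "formTypes": list(form_types),
        "dateFrom": date_from,
        "dateTo": date_to,
        "maxFilings": max_filings,
        "searchQuery": search_query,
        "userAgent": user_agent,
    }
```

The resulting dict can be pasted into the Apify Console as JSON or, if you use the apify-client Python package, passed as run_input when calling the Actor.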

Output

The Actor outputs one record per filing. You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.

{
  "accession_no": "0000320193-24-000123",
  "cik": "0000320193",
  "company_name": "Apple Inc.",
  "ticker": "AAPL",
  "form_type": "10-K",
  "filing_date": "2024-11-01",
  "period_of_report": "2024-09-28",
  "filing_url": "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
  "source": "full_text",
  "chunks": [
    {
      "idx": 0,
      "text": "Item 1. Business...",
      "tokens": 512
    }
  ]
}
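Because each record nests its chunks, most vector stores need one entry per chunk with the filing metadata copied onto it. The following sketch shows one way to flatten a record. record_to_documents and the "accession_no#idx" ID scheme are hypothetical choices made for illustration, not something the Actor emits.

```python
from typing import Dict, List

# Filing-level fields copied onto every chunk (taken from the sample record).
META_KEYS = (
    "accession_no", "cik", "company_name", "ticker", "form_type",
    "filing_date", "period_of_report", "filing_url", "source",
)


def record_to_documents(record: Dict) -> List[Dict]:
    """Turn one filing record into a list of per-chunk documents."""
    base = {key: record.get(key) for key in META_KEYS}
    docs: List[Dict] = []
    for chunk in record.get("chunks", []):
        docs.append({
            # Hypothetical ID scheme: accession number plus chunk index.
            "id": f"{record['accession_no']}#{chunk['idx']}",
            "text": chunk["text"],
            "metadata": {**base, "chunk_idx": chunk["idx"], "tokens": chunk["tokens"]},
        })
    return docs
```

Each resulting dict maps naturally onto a text-plus-metadata document object such as the ones LangChain and LlamaIndex use, so the citation fields (filing_url, accession_no) travel with every retrieved chunk.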

Data table

Field          Type     Description
accession_no   String   Unique SEC identifier for the filing
cik            String   10-digit Central Index Key
company_name   String   Filer name
ticker         String   Stock ticker (if available)
form_type      String   10-K, 10-Q, or 8-K
filing_date    String   Date submitted to the SEC
source         String   Indicates extraction depth (full_text or exhibits_stripped)
filing_url     String   Link to the SEC Archives primary document
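The cik and accession_no fields are enough to locate a filing's folder in the SEC Archives, which is handy when you want to check a chunk against the original. The sketch below follows the pattern visible in the sample filing_url (CIK without leading zeros, accession number without dashes). filing_folder_url is a hypothetical helper name.

```python
def filing_folder_url(cik: str, accession_no: str) -> str:
    """Build the SEC Archives folder URL for a filing."""
    cik_unpadded = str(int(cik))  # "0000320193" -> "320193"
    accession_compact = accession_no.replace("-", "")  # drop the dashes
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_unpadded}/{accession_compact}/"
    )
```

For the sample record, this yields the folder that contains aapl-20240928.htm.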

Pricing / Cost estimation

How much does it cost to scrape SEC EDGAR? This Actor is priced at $0.03 per filing, so a run that extracts 1,000 10-K and 10-Q filings costs $30.00 (1,000 × $0.03). You only pay for successful extractions.
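For budgeting, the per-filing price turns into a one-line estimate. This sketch assumes the flat $0.03 rate stated above and ignores any plan-based discounts. estimate_cost is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_FILING = Decimal("0.03")  # flat rate quoted in this section


def estimate_cost(filings: int, price_per_filing: Decimal = PRICE_PER_FILING) -> Decimal:
    """Return the cost in USD for a number of successfully extracted filings."""
    if filings < 0:
        raise ValueError("filings must be non-negative")
    # Decimal avoids binary floating-point rounding in money arithmetic.
    return (filings * price_per_filing).quantize(Decimal("0.01"))
```

Setting maxFilings to the number you pass here gives an upper bound on the run's extraction cost, since failed extractions are not billed.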

User-Agent warning

The SEC strictly requires a valid User-Agent header containing your name and email. The default placeholder will be rejected with a 403 Forbidden error, crashing the run. Please override the userAgent input field with your real contact information before starting.
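A quick pre-flight check can catch a forgotten placeholder before the run fails. The heuristic below is only a sketch: it assumes a usable value contains at least one name word plus exactly one email-like token, which matches the "Jane Smith jane@acme.com" example but is not an official SEC validation rule. looks_like_sec_user_agent is a hypothetical helper.

```python
import re

# Loose email shape: something@something.tld with no spaces.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_sec_user_agent(user_agent: str) -> bool:
    """Heuristic check for a 'Name email@domain' style User-Agent."""
    parts = user_agent.split()
    emails = [p for p in parts if EMAIL_RE.fullmatch(p)]
    names = [p for p in parts if not EMAIL_RE.fullmatch(p)]
    return len(emails) == 1 and len(names) >= 1
```

A value that passes this check can still be rejected by the SEC, so treat it as a guard against obvious omissions rather than a guarantee.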

Tips / Advanced options

  • Narrow via search: Use the full-text search query field to build a highly targeted RAG corpus instead of pulling every filing for a CIK.
  • Filing sizes: 10-K annual reports are very long. Expect 20 to 50 text chunks (512 tokens each) per filing.
  • Rate limiting: The Actor automatically paces requests at the SEC ceiling of 10 requests per second for maximum throughput without IP bans.
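If you make your own follow-up requests to sec.gov (for example, to fetch an original document for comparison), the same 10 requests per second ceiling applies to you. Here is a minimal pacing sketch. RequestPacer is a hypothetical class, not the Actor's internal limiter, and the injectable clock and sleep arguments exist only to make it testable.

```python
import time
from typing import Callable


class RequestPacer:
    """Spaces calls so that at most `max_per_second` requests start per second."""

    def __init__(
        self,
        max_per_second: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Call pacer.wait() immediately before each HTTP request. Staying somewhat below the ceiling (for example, max_per_second=5) leaves headroom if several processes share one IP address.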

SEC EDGAR data is public domain. Please respect the SEC fair-access policy.

Limitations:

  • v1 supports 10-K, 10-Q, and 8-K bodies only. No 13F, S-1, or DEF 14A.
  • v1 does not parse inline XBRL tables for numerical extraction.
  • Text-format exhibits are concatenated into the main body text; binary exhibits are skipped.

FAQ

Why do I need a User-Agent?
The SEC blocks automated traffic that doesn't identify itself. A valid name and email allow them to contact you if your traffic causes issues.

Why are some filings marked source: exhibits_stripped?
If a filing contains complex or binary attachments that fail to parse cleanly, the Actor falls back to extracting just the primary document body to ensure you still get data.

Can I get 13F filings? S-1? DEF 14A?
Not in v1. We focused on the core financial disclosures first.

Support

Found a bug or need a feature? Open an issue on our GitHub repository.