SEC EDGAR Scraper for RAG: 10-K/10-Q/8-K as JSON
Pricing
from $20.00 / 1,000 extracted SEC filings
Extract SEC EDGAR filings (10-K, 10-Q, 8-K). Fixed-token text chunks of primary documents for finance LLMs and compliance RAG. Drop-in for LlamaIndex, LangChain. Skip manual XBRL parsing. $0.03/filing.
Developer
Devansh Tiwari
Extract SEC EDGAR filings into RAG-ready text chunks for finance and compliance LLMs. Get 10-K annual reports, 10-Q quarterly reports, and 8-K material events pre-chunked as JSON. Drop-in ready for LlamaIndex, LangChain, Pinecone, and Qdrant. Built for AI training data teams, buy-side research assistants, and M&A intelligence platforms. Skip manual HTML/XBRL parsing and messy SEC entity decoding.
What does SEC EDGAR RAG Extractor do?
This Actor fetches corporate filings directly from the SEC EDGAR database. It pulls the HTML primary document, strips out the noise (such as <ix:header> metadata and table styling), extracts the plain text, and slices it into fixed-token chunks with overlap. You get clean, LLM-ready JSON arrays representing the core text of the filing.
Try it with Apple's CIK 0000320193 or search for "artificial intelligence risk factors". The Actor runs on the Apify platform with built-in SEC rate limiting.
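The chunking step described above can be illustrated with a minimal sketch. It is not the Actor's actual implementation: the tokenizer here is plain whitespace splitting (an assumption, since the real tokenizer is not documented), the 64-token overlap is a made-up default, and `chunk_tokens` is a hypothetical helper name. The output keys mirror the `idx`, `text`, and `tokens` fields shown in the sample output.

```python
from typing import Dict, List


def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> List[Dict]:
    """Split text into fixed-size token windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    # Assumption: whitespace-separated words stand in for model tokens.
    tokens = text.split()
    step = chunk_size - overlap  # how far each new window advances

    chunks: List[Dict] = []
    for idx, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append({"idx": idx, "text": " ".join(window), "tokens": len(window)})
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Each window starts `chunk_size - overlap` tokens after the previous one, so consecutive chunks repeat the last `overlap` tokens. That repetition keeps a sentence that straddles a boundary retrievable from at least one chunk.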
Why use SEC EDGAR RAG Extractor?
- Finance AI: Train models on clean corporate disclosures without writing custom HTML parsers.
- Compliance RAG: Build chatbots that cite specific regulatory filings accurately.
- M&A Research: Feed target company 10-Ks into your intelligence pipeline.
- Sell-side Research: Monitor 8-K events and earnings drift automatically.
How to use SEC EDGAR RAG Extractor
- Set your User-Agent. The SEC requires a real name and email.
- Provide a list of CIKs (e.g., 0000320193 for Apple) or enter a Search Query (e.g., "AI risks").
- Select the Form Types you want (10-K, 10-Q, 8-K).
- Set your Date Range and Max Filings cap.
- Click "Start" and download the chunked JSON.
Input
Provide standard SEC parameters. Here is a JSON example:
{
  "cikList": ["0000320193"],
  "formTypes": ["10-K"],
  "dateFrom": "2024-01-01",
  "dateTo": "2024-12-31",
  "maxFilings": 5,
  "searchQuery": "",
  "userAgent": "Jane Smith jane@acme.com"
}
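If you assemble the input programmatically, a small pre-flight check can catch malformed values before a run starts. The sketch below is illustrative only: `normalize_cik` and `build_input` are hypothetical helpers, not part of the Actor, and the validation rules (10-digit zero-padded CIKs, the three supported form types, ISO dates in order) are inferred from the example input and the limitations listed in this README.

```python
from datetime import date
from typing import Dict, Iterable

# Form types this README says v1 supports.
SUPPORTED_FORMS = {"10-K", "10-Q", "8-K"}


def normalize_cik(cik: str) -> str:
    """Zero-pad a CIK to the 10-digit form used in the example input."""
    digits = str(cik).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"invalid CIK: {cik!r}")
    return digits.zfill(10)


def build_input(ciks: Iterable[str], form_types: Iterable[str],
                date_from: str, date_to: str, user_agent: str,
                max_filings: int = 5, search_query: str = "") -> Dict:
    """Return a dict shaped like the example input, or raise ValueError."""
    forms = list(form_types)
    unknown = set(forms) - SUPPORTED_FORMS
    if unknown:
        raise ValueError(f"unsupported form types: {sorted(unknown)}")

    # date.fromisoformat raises ValueError on anything that is not YYYY-MM-DD.
    if date.fromisoformat(date_from) > date.fromisoformat(date_to):
        raise ValueError("dateFrom must not be later than dateTo")
    if max_filings < 1:
        raise ValueError("maxFilings must be at least 1")

    return {
        "cikList": [normalize_cik(c) for c in ciks],
        "formTypes": forms,
        "dateFrom": date_from,
        "dateTo": date_to,
        "maxFilings": max_filings,
        "searchQuery": search_query,
        "userAgent": user_agent,
    }
```

For example, `build_input(["320193"], ["10-K"], "2024-01-01", "2024-12-31", "Jane Smith jane@acme.com")` reproduces the example input, with the CIK padded to `0000320193`.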
Output
The Actor outputs one record per filing. You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
{
  "accession_no": "0000320193-24-000123",
  "cik": "0000320193",
  "company_name": "Apple Inc.",
  "ticker": "AAPL",
  "form_type": "10-K",
  "filing_date": "2024-11-01",
  "period_of_report": "2024-09-28",
  "filing_url": "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
  "source": "full_text",
  "chunks": [
    {"idx": 0, "text": "Item 1. Business...", "tokens": 512}
  ]
}
Data table
| Field | Type | Description |
|---|---|---|
| accession_no | String | Unique SEC identifier for the filing |
| cik | String | 10-digit Central Index Key |
| company_name | String | Filer name |
| ticker | String | Stock ticker (if available) |
| form_type | String | 10-K, 10-Q, or 8-K |
| filing_date | String | Date submitted to the SEC |
| period_of_report | String | End date of the reporting period the filing covers |
| source | String | Indicates extraction depth (full_text or exhibits_stripped) |
| filing_url | String | Link to the SEC Archives primary document |
| chunks | Array | Text chunks, each with idx (position), text, and tokens (token count) |
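Most vector stores expect one document per chunk, with the filing-level fields attached as metadata. The following sketch shows one way to flatten a record into that shape using plain dictionaries. `iter_chunk_documents` and the `accession_no:idx` ID scheme are illustrative assumptions, not something the Actor or any particular framework prescribes.

```python
from typing import Dict, Iterator

# Filing-level fields copied onto every chunk as metadata.
METADATA_FIELDS = (
    "accession_no", "cik", "company_name", "ticker", "form_type",
    "filing_date", "period_of_report", "filing_url", "source",
)


def iter_chunk_documents(record: Dict) -> Iterator[Dict]:
    """Yield one {id, text, metadata} dict per chunk in a filing record."""
    metadata = {field: record.get(field) for field in METADATA_FIELDS}
    for chunk in record.get("chunks", []):
        yield {
            # Assumed ID scheme: accession number plus chunk index.
            "id": f"{record['accession_no']}:{chunk['idx']}",
            "text": chunk["text"],
            "metadata": {**metadata, "chunk_idx": chunk["idx"], "tokens": chunk["tokens"]},
        }
```

Carrying `filing_url` and `accession_no` on every chunk lets a retrieval pipeline cite the exact filing a passage came from.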
Pricing / Cost estimation
This Actor is priced at $0.03 per filing. For example, a run that extracts 1,000 10-K and 10-Q filings costs $30.00 at that rate. You only pay for successful extractions.
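The arithmetic is simple enough to script when budgeting a run. This sketch assumes the flat $0.03 per-filing rate quoted in this section and counts only successful extractions. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-filing rate quoted in this section.
PRICE_PER_FILING = Decimal("0.03")


def estimate_cost(successful_filings: int,
                  price_per_filing: Decimal = PRICE_PER_FILING) -> Decimal:
    """Return the estimated charge in dollars, rounded to cents."""
    if successful_filings < 0:
        raise ValueError("successful_filings must be non-negative")
    return (price_per_filing * successful_filings).quantize(Decimal("0.01"))


print(estimate_cost(1000))  # 30.00
```

`Decimal` is used instead of `float` so that cent amounts do not pick up binary rounding error.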
User-Agent warning
The SEC strictly requires a valid User-Agent header containing your name and email. The default placeholder will be rejected with a 403 Forbidden error, crashing the run. Please override the userAgent input field with your real contact information before starting.
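A quick local check can flag an obviously incomplete `userAgent` value before you start a run. The heuristic below is an assumption for illustration (some text plus something shaped like an email address). It does not reproduce whatever validation the SEC or the Actor actually performs, and `looks_like_valid_user_agent` is a hypothetical helper.

```python
import re

# Loose email shape: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_valid_user_agent(user_agent: str) -> bool:
    """Return True if the string has an email address plus some name text."""
    user_agent = user_agent or ""
    match = EMAIL_RE.search(user_agent)
    if not match:
        return False
    # Whatever remains after removing the email should be a non-empty name.
    name_part = (user_agent[:match.start()] + user_agent[match.end():]).strip()
    return bool(name_part)
```

With this check, `"Jane Smith jane@acme.com"` passes, while a bare email address or a browser-style string without an email does not.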
Tips / Advanced options
- Narrow via search: Use the full-text search query field to build a highly targeted RAG corpus instead of pulling every filing for a CIK.
- Filing sizes: 10-K annual reports are very long. Expect 20 to 50 text chunks (512 tokens each) per filing.
- Rate limiting: The Actor automatically paces requests at the SEC ceiling of 10 requests per second for maximum throughput without IP bans.
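To illustrate the kind of pacing the rate-limiting tip describes, here is a minimal sketch that spaces calls at least 1/10 of a second apart. It is not the Actor's code: `Pacer` is a hypothetical class, and even spacing is only one of several ways to stay under a per-second ceiling. The clock and sleep functions are injectable so the behaviour can be checked without real delays.

```python
import time
from typing import Callable


class Pacer:
    """Space out calls so consecutive calls are at least 1/max_per_second apart."""

    def __init__(self, max_per_second: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay:
            self.sleep(delay)
        # Schedule the following slot one interval after this call's slot.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay
```

In a fetch loop you would call `pacer.wait()` immediately before each request to sec.gov.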
Legal disclaimer and limitations
SEC EDGAR data is public domain. Please respect the SEC fair-access policy. Limitations:
- v1 supports 10-K, 10-Q, and 8-K bodies only. No 13F, S-1, or DEF 14A.
- v1 does not parse inline XBRL tables for numerical extraction.
- Text-format exhibits are concatenated into the main body text; binary exhibits are skipped.
FAQ
Why do I need a User-Agent?
The SEC blocks automated traffic that doesn't identify itself. A valid name and email allow them to contact you if your traffic causes issues.
Why are some filings marked source: exhibits_stripped?
If a filing contains complex or binary attachments that fail to parse cleanly, the Actor falls back to extracting just the primary document body to ensure you still get data.
Can I get 13F filings? S-1? DEF 14A?
Not in v1. We focused on the core financial disclosures first.
Support
Found a bug or need a feature? Open an issue on our GitHub repository.