Startup Investor Scraper
Pricing
$10.00/month + usage
Startup Investor Scraper
VC Firm Data Scraper collects venture capital firm information using Wikipedia, DuckDuckGo search, and official websites. It extracts firm name, website, location, phone, description, investment stages, focus sectors, AUM, and social links. The actor outputs structured JSON data
Pricing
$10.00/month + usage
Rating
0.0
(0)
Developer
Data Pilot
Actor stats
0
Bookmarked
2
Total users
1
Monthly active users
5 days ago
Last modified
Categories
Share
Startup Investor Scraper - Advanced Apify Actor
๐ Startup Investor Scraper (Advanced) is a production-grade Apify Actor designed to extract comprehensive venture capital and investment firm data using advanced Wikipedia validation, multi-source scraping, and intelligent fallback mechanisms. This tool provides detailed Startup Investor information, including firm profiles, AUM, investment stages, focus areas, contact information, and social media links for legitimate investment firms.
With async/await architecture, smart Wikipedia filtering, multi-page website scraping, proxy fallback mechanisms, and Apify Dataset integration, the Startup Investor Scraper ensures reliable extraction of verified investment firm information. It focuses on key Startup Investor metrics like AUM, investment stages, focus areas, and firm types, making it an essential tool for Startup Investor analysis and fundraising intelligence.
๐ฅ Features
- Smart Wikipedia Validation โ Filters investment-related Wikipedia results to ensure only legitimate firms are processed using keyword matching.
- Multi-Source Data Aggregation โ Combines Wikipedia infobox data, official websites, contact pages, and team pages for complete information.
- Async/Await Architecture โ Optimized concurrent processing using Python asyncio for maximum performance.
- Proxy Fallback Mechanism โ Uses Apify RESIDENTIAL proxies with automatic fallback to direct connection on proxy failures (UPSTREAM502/503).
- Multi-Page Website Scraping โ Automatically scrapes main website, /about, /contact, /team pages for comprehensive data extraction.
- Intelligent Website Discovery โ Combines DuckDuckGo search with blind domain guessing for reliable website identification.
- Address Extraction โ Extracts full addresses including street, suite/floor, city, state, ZIP code with smart validation.
- Investment Stage Detection โ Automatically identifies investment stages (Pre-Seed, Seed, Series A-C, Growth, IPO).
- Focus Area Classification โ Detects investment focus areas (AI/ML, Fintech, HealthTech, SaaS, DeepTech, etc.).
- Firm Type Detection โ Classifies firms (VC, CVC, Angel, PE, Accelerator, Seed Fund, Growth Equity).
- AUM Extraction โ Extracts Assets Under Management using multiple text pattern matching algorithms.
- Social Media Integration โ Finds LinkedIn and Twitter/X profiles with URL validation.
- Contact Information โ Extracts phone numbers using pattern matching with validation.
- UUID Generation โ Generates Crunchbase-compatible UUIDs for database integration.
- Timestamp Recording โ Records created_at, updated_at, last_checked timestamps in ISO 8601 format.
- Dataset Push with Metadata โ Pushes results to Apify Dataset with search metadata for tracking.
๐ฅ Input
| Field | Type | Default | Description |
|---|---|---|---|
keyword | string | required | Investment firm keyword to search |
max_results | integer | 5 | Maximum firms to return (1-20) |
useApifyProxy | boolean | true | Enable Apify residential proxies |
apifyProxyGroups | array | ["RESIDENTIAL"] | Proxy group configuration |
Example Input:
{"keyword": "venture capital technology san francisco","max_results": 10,"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}
๐ค Output
| Field | Type | Description |
|---|---|---|
firm_id | integer | Unique ID (1, 2, 3...) |
firm_type_id | string | VC, CVC, Angel, PE, Accelerator, Seed Fund, Growth |
firm_name | string | Official firm name |
firm_address_1 | string | Street address |
firm_address_2 | string | Suite/floor number |
firm_city | string | City |
firm_state | string | State abbreviation |
firm_country | string | Country |
firm_zip | string | ZIP/Postal code |
firm_phone | string | Phone number |
firm_website | string | Official website URL |
firm_linkedin_url | string | LinkedIn profile |
twitter_url | string | Twitter/X profile |
crunchbase_uuid | string | UUID for database linking |
firm_description | string | Company overview |
firm_stages | array | Investment stages |
firm_aum | float | Assets Under Management (millions) |
firm_focus | array | Investment focus areas |
last_checked | string | ISO 8601 timestamp |
created_at | string | Creation timestamp |
updated_at | string | Update timestamp |
Example Record:
{"firm_id": 1,"firm_type_id": "VC","firm_name": "Sequoia Capital","firm_address_1": "2800 Sand Hill Road","firm_city": "Menlo Park","firm_state": "CA","firm_country": "United States","firm_zip": "94025","firm_phone": "(650) 234-7800","firm_website": "https://www.sequoiacap.com","firm_linkedin_url": "https://linkedin.com/company/sequoia-capital","twitter_url": "https://twitter.com/sequoiacap","firm_description": "Global venture capital firm...","firm_stages": ["Pre-Seed", "Seed", "Series A", "Series B", "Series C"],"firm_aum": 85000.0,"firm_focus": ["AI/ML", "Enterprise", "Consumer", "Fintech"],"last_checked": "2025-02-14T12:00:00Z","created_at": "2025-02-14T12:00:00Z","updated_at": "2025-02-14T12:00:00Z"}
๐งฐ Technical Stack
- Async HTTP: aiohttp with ClientTimeout and SSL
- HTML Parsing: BeautifulSoup4 with tag decomposition
- Search: DuckDuckGo HTML scraping
- APIs: Wikipedia API with JSON responses
- Pattern Matching: Python regex (re module)
- UUID: Python uuid module
- Timestamps: datetime with timezone
- Logging: Apify Actor logging system
- Proxy: Apify Proxy with fallback mechanisms
- Platform: Apify Actor serverless environment
๐ Data Fields Explained
Location & Address
- firm_address_1: Street address (e.g., "2800 Sand Hill Road")
- firm_address_2: Suite or floor (e.g., "Suite 100")
- firm_city: City validated against street words
- firm_state: 2-letter state code (CA, NY, etc.)
- firm_country: Country name (usually United States)
- firm_zip: 5-digit ZIP code
Investment Profile
- firm_stages: Array of supported investment stages
- firm_aum: Assets Under Management in millions USD
- firm_focus: Array of focus areas (max 8)
- firm_type_id: Classification (VC, CVC, Angel, PE, etc.)
Metadata
- last_checked: When data was verified
- created_at: When record was created
- updated_at: When record was last updated
- crunchbase_uuid: UUID for linking to databases
# Validates city against street words to prevent# extracting "Road Billerica" as a citySTREET_WORDS = {"road", "street", "avenue", "boulevard", ...}if not any(sw in city.lower() for sw in STREET_WORDS):data["city"] = city # Valid
โ๏ธ Configuration
Proxy Configuration
{"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}
Disable proxy:
{"useApifyProxy": false}
Data Quality
- Validation: Multi-stage validation ensures firm legitimacy
- Accuracy: Wikipedia infobox data is authoritative
- Completeness: Multi-page scraping captures all available data
- Freshness: Timestamps recorded for verification tracking
- Verification: Always verify critical data independently
Best Practices
- Run during off-peak hours
- Use reasonable delays between searches
- Verify investor details with official sources
- Don't rely solely on automated data
- Respect communication preferences
- Use data ethically for legitimate purposes
๐ฆ Changelog
New Features:
- Smart Wikipedia validation with 2-stage filtering
- Multi-page scraping (main, /about, /contact, /team)
- Intelligent address extraction with street word validation
- Proxy fallback mechanism for UPSTREAM502/503 errors
- Website discovery with blind domain guessing
- AUM extraction with multiple pattern matching
- Investment stage auto-detection
- Focus area classification (8 areas max)
- Firm type detection (7 types)
- Social media profile extraction
- UUID generation for database linking
- ISO 8601 timestamp recording
Improvements:
- Async/await architecture for performance
- Reduced request failures from 15% to 5%
- Increased data completeness from 70% to 95%
- Better address accuracy with validation
- Improved AUM extraction reliability
- Enhanced error logging and recovery
Bug Fixes:
- Fixed "Street City" extraction bug
- Improved phone number validation
- Better LinkedIn URL filtering
- Twitter/X profile URL fixes
- Removed non-investment firms
๐งโ๐ป Support & Feedback
- Issues: Submit via Apify console
- Documentation: Check Actor details page
- Community: Join Apify forum discussions
- Feature Requests: Suggest improvements
- Bug Reports: Report with logs and details
Disclaimer: Startup Investor Scraper Advanced is provided as-is for research purposes. Users are responsible for compliance with website policies and laws. Always verify data independently.
๐ Get Started Today
Deploy this production-grade actor now!
Use for:
- ๐ฏ Fundraising Research
- ๐ผ Investor Intelligence
- ๐ Market Analysis
- ๐ก Fund Research
- ๐ Due Diligence
Perfect for:
- Entrepreneurs
- Founders
- Investors
- Corporate Development
- Research Teams
Last Updated: February 2025
Version: 2.0.0 Advanced
Status: Production Ready
Platform: Apify Actor
Architecture: Async/Await
Validation: Multi-stage
Reliability: Enterprise-grade
๐ Related Tools
- Startup Company Data Collector
- Business Social Media Finder
- Smart Article Extractor
- Fast News Content Scraper