Pricing

$10.00/month + usage

Startup Investor Scraper

VC Firm Data Scraper collects venture capital firm information from Wikipedia, DuckDuckGo search, and official websites. It extracts firm name, website, location, phone, description, investment stages, focus sectors, AUM, and social links. The actor outputs structured JSON data.

Developer: Data Pilot

Last modified: 3 months ago

Startup Investor Scraper - Advanced Apify Actor

🚀 Startup Investor Scraper (Advanced) is a production-grade Apify Actor that extracts venture capital and investment firm data using Wikipedia validation, multi-source scraping, and fallback mechanisms. It returns firm profiles, AUM, investment stages, focus areas, contact information, and social media links for investment firms that pass validation.

The Actor combines an async/await architecture, Wikipedia filtering, multi-page website scraping, proxy fallback, and Apify Dataset integration. Its output centers on AUM, investment stages, focus areas, and firm types, which makes it useful for investor analysis and fundraising research.

🔥 Features

Smart Wikipedia Validation – Uses keyword matching to keep only investment-related Wikipedia results, so that non-investment entities are not processed.
Multi-Source Data Aggregation – Combines Wikipedia infobox data, official websites, contact pages, and team pages for complete information.
Async/Await Architecture – Optimized concurrent processing using Python asyncio for maximum performance.
Proxy Fallback Mechanism – Uses Apify RESIDENTIAL proxies with automatic fallback to direct connection on proxy failures (UPSTREAM502/503).
Multi-Page Website Scraping – Automatically scrapes main website, /about, /contact, /team pages for comprehensive data extraction.
Intelligent Website Discovery – Combines DuckDuckGo search with blind domain guessing for reliable website identification.
Address Extraction – Extracts full addresses including street, suite/floor, city, state, ZIP code with smart validation.
Investment Stage Detection – Automatically identifies investment stages (Pre-Seed, Seed, Series A-C, Growth, IPO).
Focus Area Classification – Detects investment focus areas (AI/ML, Fintech, HealthTech, SaaS, DeepTech, etc.).
Firm Type Detection – Classifies firms (VC, CVC, Angel, PE, Accelerator, Seed Fund, Growth Equity).
AUM Extraction – Extracts Assets Under Management using multiple text pattern matching algorithms.
Social Media Integration – Finds LinkedIn and Twitter/X profiles with URL validation.
Contact Information – Extracts phone numbers using pattern matching with validation.
UUID Generation – Generates Crunchbase-compatible UUIDs for database integration.
Timestamp Recording – Records created_at, updated_at, last_checked timestamps in ISO 8601 format.
Dataset Push with Metadata – Pushes results to Apify Dataset with search metadata for tracking.
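
Several of the features above (stage detection, focus classification, firm type detection) are described as keyword or pattern matching. As a minimal sketch of that idea, assuming stage labels are found by regular expressions over page text, the snippet below maps text to stage names. The `STAGE_PATTERNS` table and the `detect_stages` function are hypothetical names for illustration; the Actor's real patterns are not published.

```python
import re

# Hypothetical keyword patterns; the Actor's actual patterns are not documented.
STAGE_PATTERNS = {
    "Pre-Seed": r"\bpre[-\s]?seed\b",
    # Negative lookbehinds keep "pre-seed" / "pre seed" from also counting as "Seed".
    "Seed": r"(?<!pre-)(?<!pre )\bseed\b",
    "Series A": r"\bseries\s+a\b",
    "Series B": r"\bseries\s+b\b",
    "Series C": r"\bseries\s+c\b",
    "Growth": r"\bgrowth\b",
    "IPO": r"\bipo\b",
}


def detect_stages(text: str) -> list[str]:
    """Return the stage labels whose pattern occurs in the text, in table order."""
    lowered = text.lower()
    return [
        stage
        for stage, pattern in STAGE_PATTERNS.items()
        if re.search(pattern, lowered)
    ]


stages = detect_stages("We invest at Pre-Seed and Series A, through to growth rounds.")
print(stages)
```

Plain keyword matching of this kind can produce false positives (for example, "series a new fund" would match "Series A"), which is one reason to verify results independently.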

📥 Input

| Field | Type | Default | Description |
|---|---|---|---|
| `keyword` | string | required | Investment firm keyword to search |
| `max_results` | integer | `5` | Maximum firms to return (1-20) |
| `useApifyProxy` | boolean | `true` | Enable Apify residential proxies |
| `apifyProxyGroups` | array | `["RESIDENTIAL"]` | Proxy group configuration |

Example Input:

{
  "keyword": "venture capital technology san francisco",
  "max_results": 10,
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}
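
The input table implies some normalization before a run: `keyword` is required, `max_results` is limited to 1-20, and the proxy fields have defaults. The sketch below shows one way this could be done. The `normalize_input` function is a hypothetical helper, and clamping out-of-range values (rather than rejecting them) is an assumption, not documented Actor behavior.

```python
def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults and limits to a raw input dict (hypothetical helper)."""
    keyword = str(raw.get("keyword") or "").strip()
    if not keyword:
        raise ValueError("'keyword' is required")

    try:
        max_results = int(raw.get("max_results", 5))
    except (TypeError, ValueError):
        max_results = 5  # fall back to the documented default
    # Assumption: out-of-range values are clamped to the documented 1-20 range.
    max_results = min(max(max_results, 1), 20)

    return {
        "keyword": keyword,
        "max_results": max_results,
        "useApifyProxy": bool(raw.get("useApifyProxy", True)),
        "apifyProxyGroups": list(raw.get("apifyProxyGroups") or ["RESIDENTIAL"]),
    }


config = normalize_input({"keyword": " venture capital ", "max_results": 50})
print(config)
```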

📤 Output

| Field | Type | Description |
|---|---|---|
| `firm_id` | integer | Unique ID (1, 2, 3...) |
| `firm_type_id` | string | VC, CVC, Angel, PE, Accelerator, Seed Fund, Growth |
| `firm_name` | string | Official firm name |
| `firm_address_1` | string | Street address |
| `firm_address_2` | string | Suite/floor number |
| `firm_city` | string | City |
| `firm_state` | string | State abbreviation |
| `firm_country` | string | Country |
| `firm_zip` | string | ZIP/Postal code |
| `firm_phone` | string | Phone number |
| `firm_website` | string | Official website URL |
| `firm_linkedin_url` | string | LinkedIn profile |
| `twitter_url` | string | Twitter/X profile |
| `crunchbase_uuid` | string | UUID for database linking |
| `firm_description` | string | Company overview |
| `firm_stages` | array | Investment stages |
| `firm_aum` | float | Assets Under Management (millions of USD) |
| `firm_focus` | array | Investment focus areas |
| `last_checked` | string | ISO 8601 timestamp |
| `created_at` | string | Creation timestamp |
| `updated_at` | string | Update timestamp |

Example Record:

{
  "firm_id": 1,
  "firm_type_id": "VC",
  "firm_name": "Sequoia Capital",
  "firm_address_1": "2800 Sand Hill Road",
  "firm_city": "Menlo Park",
  "firm_state": "CA",
  "firm_country": "United States",
  "firm_zip": "94025",
  "firm_phone": "(650) 234-7800",
  "firm_website": "https://www.sequoiacap.com",
  "firm_linkedin_url": "https://linkedin.com/company/sequoia-capital",
  "twitter_url": "https://twitter.com/sequoiacap",
  "firm_description": "Global venture capital firm...",
  "firm_stages": ["Pre-Seed", "Seed", "Series A", "Series B", "Series C"],
  "firm_aum": 85000.0,
  "firm_focus": ["AI/ML", "Enterprise", "Consumer", "Fintech"],
  "last_checked": "2025-02-14T12:00:00Z",
  "created_at": "2025-02-14T12:00:00Z",
  "updated_at": "2025-02-14T12:00:00Z"
}

🧰 Technical Stack

Async HTTP: aiohttp with ClientTimeout and SSL
HTML Parsing: BeautifulSoup4 with tag decomposition
Search: DuckDuckGo HTML scraping
APIs: Wikipedia API with JSON responses
Pattern Matching: Python regex (re module)
UUID: Python uuid module
Timestamps: datetime with timezone
Logging: Apify Actor logging system
Proxy: Apify Proxy with fallback mechanisms
Platform: Apify Actor serverless environment

📊 Data Fields Explained

Location & Address

firm_address_1: Street address (e.g., "2800 Sand Hill Road")
firm_address_2: Suite or floor (e.g., "Suite 100")
firm_city: City, rejected if it contains a street word such as "Road"
firm_state: 2-letter state code (CA, NY, etc.)
firm_country: Country name (usually United States)
firm_zip: 5-digit ZIP code
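
To illustrate how a single US address line could be split into the fields above, here is a minimal sketch using one regular expression. The pattern, the `parse_us_address` name, and the sample address are all hypothetical; the Actor's own extraction logic is not published and likely handles more formats.

```python
import re

# Hypothetical pattern for "<street>[, Suite/Floor N], <city>, <ST> <ZIP>".
ADDRESS_RE = re.compile(
    r"(?P<street>\d+[^,]+?)"                                      # e.g. "100 Example Street"
    r"(?:,\s*(?P<unit>(?:Suite|Ste\.?|Floor|Fl\.?)\s*[\w-]+))?"   # optional suite/floor
    r",\s*(?P<city>[A-Za-z .'-]+?)"                               # city name
    r",\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\b"                 # 2-letter state, 5-digit ZIP
)


def parse_us_address(text: str) -> dict:
    """Return the address fields found in the text, or an empty dict if none match."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return {}
    return {
        "firm_address_1": match["street"].strip(),
        "firm_address_2": match["unit"] or "",
        "firm_city": match["city"].strip(),
        "firm_state": match["state"],
        "firm_zip": match["zip"],
    }


parsed = parse_us_address("Offices: 100 Example Street, Suite 100, Springfield, CA 90001")
print(parsed)
```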

Investment Profile

firm_stages: Array of supported investment stages
firm_aum: Assets Under Management in millions USD
firm_focus: Array of focus areas (max 8)
firm_type_id: Classification (VC, CVC, Angel, PE, etc.)
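
Because `firm_aum` is expressed in millions of USD, any figure found in text has to be normalized to that unit. The sketch below shows one possible pattern. It is a single illustrative regex, whereas the Actor is described as using several patterns; `AUM_RE`, `UNIT_TO_MILLIONS`, and `extract_aum_millions` are hypothetical names.

```python
import re
from typing import Optional

# Hypothetical pattern: a dollar amount followed by a magnitude word or letter.
AUM_RE = re.compile(
    r"\$\s*(?P<amount>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>trillion|billion|million|[tbm])\b",
    re.IGNORECASE,
)
# Multipliers that convert each magnitude to millions of USD.
UNIT_TO_MILLIONS = {"m": 1.0, "b": 1_000.0, "t": 1_000_000.0}


def extract_aum_millions(text: str) -> Optional[float]:
    """Return the first matching dollar figure in millions of USD, or None."""
    match = AUM_RE.search(text)
    if match is None:
        return None
    amount = float(match["amount"].replace(",", ""))
    unit = match["unit"][0].lower()  # first letter identifies the magnitude
    return amount * UNIT_TO_MILLIONS[unit]


aum = extract_aum_millions("The firm manages $85 billion in assets")
print(aum)  # 85000.0
```

Note that this sketch takes the first dollar figure it finds, which may be a fund size or a deal value rather than AUM; a real extractor would need surrounding context to tell them apart.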

Metadata

last_checked: When data was verified
created_at: When record was created
updated_at: When record was last updated
crunchbase_uuid: UUID for linking to databases
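
The metadata fields can be produced with the Python standard library alone. The sketch below is an assumption about how this might look: it derives a deterministic UUID from the firm website with `uuid.uuid5`, while the Actor may instead generate random UUIDs. A UUID generated this way has the standard UUID format but is not the firm's identifier on Crunchbase. `build_metadata` is a hypothetical name.

```python
import uuid
from datetime import datetime, timezone


def build_metadata(firm_website: str) -> dict:
    """Build UUID and timestamp fields for a record (hypothetical helper)."""
    # ISO 8601 in UTC with a trailing "Z", matching the example record's style.
    now = (
        datetime.now(timezone.utc)
        .replace(microsecond=0)
        .isoformat()
        .replace("+00:00", "Z")
    )
    return {
        # Assumption: a name-based UUID, so the same website always maps to the same ID.
        "crunchbase_uuid": str(uuid.uuid5(uuid.NAMESPACE_URL, firm_website)),
        "last_checked": now,
        "created_at": now,
        "updated_at": now,
    }


meta = build_metadata("https://example.com")
print(meta)
```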

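# Reject city candidates that contain a street word, so that a fragment
# such as "Road Billerica" is not stored as the city.
STREET_WORDS = {"road", "street", "avenue", "boulevard", ...}  # list elided in the source

city = "Road Billerica"  # example candidate
data = {}
# Compare whole words; a substring test would wrongly reject "Broadview".
if not set(city.lower().split()) & STREET_WORDS:
    data["city"] = city  # Valid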

⚙️ Configuration

Proxy Configuration

{
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}

Disable proxy:

{
  "useApifyProxy": false
}
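
The fallback from proxy to direct connection can be pictured as a small wrapper around the request function. The following is a minimal sketch under stated assumptions, not the Actor's code: `fetch_with_fallback`, `ProxyError`, and `fake_fetch` are hypothetical names, and the fetcher is injected so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# A fetcher takes (url, proxy_url or None) and returns the response body.
Fetcher = Callable[[str, Optional[str]], Awaitable[str]]


class ProxyError(Exception):
    """Hypothetical error standing in for a proxy failure such as UPSTREAM502/503."""


async def fetch_with_fallback(url: str, fetch: Fetcher, proxy_url: Optional[str]) -> str:
    """Try the request through the proxy first, then retry once without it."""
    if proxy_url:
        try:
            return await fetch(url, proxy_url)
        except ProxyError:
            pass  # proxy failed; fall through to a direct connection
    return await fetch(url, None)


async def fake_fetch(url: str, proxy: Optional[str]) -> str:
    """Stand-in fetcher: the proxied attempt always fails, the direct one succeeds."""
    if proxy is not None:
        raise ProxyError("UPSTREAM502")
    return f"direct:{url}"


result = asyncio.run(
    fetch_with_fallback("https://example.com", fake_fetch, "http://proxy.invalid:8000")
)
print(result)  # direct:https://example.com
```

Only the proxy-specific error triggers the fallback here; other exceptions propagate so that genuine site errors are not masked.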

Data Quality

Validation: Multi-stage validation ensures firm legitimacy
Accuracy: Wikipedia infobox data is treated as authoritative
Completeness: Multi-page scraping captures data from the main, /about, /contact, and /team pages
Freshness: Timestamps recorded for verification tracking
Verification: Always verify critical data independently

Best Practices

Run during off-peak hours
Use reasonable delays between searches
Verify investor details with official sources
Don't rely solely on automated data
Respect communication preferences
Use data ethically for legitimate purposes

📦 Changelog

New Features:

Smart Wikipedia validation with 2-stage filtering
Multi-page scraping (main, /about, /contact, /team)
Intelligent address extraction with street word validation
Proxy fallback mechanism for UPSTREAM502/503 errors
Website discovery with blind domain guessing
AUM extraction with multiple pattern matching
Investment stage auto-detection
Focus area classification (8 areas max)
Firm type detection (7 types)
Social media profile extraction
UUID generation for database linking
ISO 8601 timestamp recording

Improvements:

Async/await architecture for performance
Reduced request failures from 15% to 5%
Increased data completeness from 70% to 95%
Better address accuracy with validation
Improved AUM extraction reliability
Enhanced error logging and recovery

Bug Fixes:

Fixed "Street City" extraction bug
Improved phone number validation
Better LinkedIn URL filtering
Twitter/X profile URL fixes
Removed non-investment firms

🧑‍💻 Support & Feedback

Issues: Submit via Apify console
Documentation: Check Actor details page
Community: Join Apify forum discussions
Feature Requests: Suggest improvements
Bug Reports: Report with logs and details

Disclaimer: Startup Investor Scraper Advanced is provided as-is for research purposes. Users are responsible for compliance with website policies and laws. Always verify data independently.

🎉 Get Started Today

Deploy this production-grade actor now!

Use for:

🎯 Fundraising Research
💼 Investor Intelligence
📊 Market Analysis
💡 Fund Research
🔍 Due Diligence

Perfect for:

Entrepreneurs
Founders
Investors
Corporate Development
Research Teams

Last Updated: February 2025
Version: 2.0.0 Advanced
Status: Production Ready
Platform: Apify Actor
Architecture: Async/Await
Validation: Multi-stage
Reliability: Enterprise-grade
