Changelog
All notable changes to the Crexi Real Estate Scraper will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[1.0.0] - 2025-10-30
Added
- Initial release of Crexi Real Estate Scraper
- Automated scraping of commercial real estate listings from Crexi.com
- Property data extraction including:
- Basic information (name, type, location, address)
- Financial data (price, lease rate, investment metrics)
- Physical details (square footage, lot size, specifications)
- Media (images, documents)
- Agent information
- Property descriptions and highlights
- Pagination handling for comprehensive data collection
- Rate limiting with configurable delays
- HTML debugging feature that saves page content for selector analysis
- Browser automation using Playwright
- Anti-detection measures (user agent, header customization)
- Multiple selector strategies with fallback options
- Detailed property page scraping (optional)
- Data validation and cleaning
- Duplicate detection and prevention
- Configurable input parameters:
- maxProperties: Maximum number of properties to scrape
- scrapeDetails: Toggle for detailed page scraping
- propertyTypes: Filter by property types
- locations: Filter by locations
- minPrice / maxPrice: Price range filters
- rateLimitDelay: Configurable delay between requests
- Comprehensive error handling and logging
- Local testing script (test_local.py)
- Output visualization script (view_output.py)
- Docker support with Dockerfile
- Apify platform integration with:
- Actor configuration (.actor/actor.json)
- Input schema (.actor/input_schema.json)
- Dataset views
- Comprehensive documentation:
- README.md with features and usage
- IMPLEMENTATION_GUIDE.md with technical details
- PROJECT_STRUCTURE.md with file organization
- QUICK_START.md for rapid onboarding
- Python package structure with proper imports
- Git ignore configuration
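
The input parameters listed under "Configurable input parameters" could be loaded and validated roughly as follows. This is a minimal sketch, not the project's actual code: the class name `ScraperInput`, the snake_case attribute names, and every default except the 2-second delay are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ScraperInput:
    """Hypothetical container for the documented input parameters."""

    max_properties: int = 100        # assumed default, not stated in this changelog
    scrape_details: bool = True      # assumed default
    property_types: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)
    min_price: Optional[float] = None
    max_price: Optional[float] = None
    rate_limit_delay: float = 2.0    # this changelog states a 2-second default

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "ScraperInput":
        """Build a config from a raw input dict and reject obvious mistakes."""
        cfg = cls(
            max_properties=int(raw.get("maxProperties", 100)),
            scrape_details=bool(raw.get("scrapeDetails", True)),
            property_types=list(raw.get("propertyTypes") or []),
            locations=list(raw.get("locations") or []),
            min_price=raw.get("minPrice"),
            max_price=raw.get("maxPrice"),
            rate_limit_delay=float(raw.get("rateLimitDelay", 2.0)),
        )
        if cfg.max_properties < 1:
            raise ValueError("maxProperties must be at least 1")
        if cfg.rate_limit_delay < 0:
            raise ValueError("rateLimitDelay must not be negative")
        if (
            cfg.min_price is not None
            and cfg.max_price is not None
            and cfg.min_price > cfg.max_price
        ):
            raise ValueError("minPrice must not exceed maxPrice")
        return cfg


# Example: only a few keys supplied, the rest fall back to defaults.
config = ScraperInput.from_dict({"maxProperties": 25, "locations": ["Austin, TX"]})
```

Validating once at startup keeps the scraping loop free of repeated checks and surfaces bad input before any page is requested.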
Features
- 20+ data fields per property
- Support for both listing and detail page extraction
- Flexible selector strategies to handle various page layouts
- Automatic URL normalization and validation
- Configurable rate limiting (default: 2 seconds)
- Efficient pagination handling
- Option to skip detail pages for faster scraping
- Memory-efficient processing
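
As an illustration of the URL normalization and validation item above, the sketch below resolves relative links and discards unusable ones. The function name `normalize_listing_url`, the base URL, and the choice to drop query strings and fragments are assumptions for this example; the scraper's own rules may differ.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed base URL; this changelog only names the site as Crexi.com.
BASE_URL = "https://www.crexi.com"


def normalize_listing_url(href: str, base: str = BASE_URL) -> Optional[str]:
    """Return an absolute, canonical URL, or None if href is not usable."""
    if not href or not href.strip():
        return None
    # Resolve relative links such as "/properties/123" against the base.
    absolute = urljoin(base + "/", href.strip())
    parts = urlsplit(absolute)
    # Reject non-web schemes such as "javascript:" or "mailto:".
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    path = parts.path.rstrip("/") or "/"
    # Assumption: query strings and fragments do not identify a listing.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


print(normalize_listing_url("/properties/123?utm_source=x#photos"))
```

A canonical form like this also gives deduplication a stable key to compare on.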
Reliability
- Multiple selector fallbacks
- Comprehensive error handling
- Graceful degradation when fields are missing
- Deduplication of repeated listings
- Anti-detection browser configuration
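
The fallback and deduplication ideas above can be sketched without any scraping library. The helper names `first_match` and `dedupe_properties`, and the `url` / `id` record keys, are hypothetical; in the real scraper the extractors would presumably wrap CSS selectors rather than dictionary lookups.

```python
from typing import Any, Callable, Iterable, Iterator


def first_match(
    source: Any,
    extractors: Iterable[Callable[[Any], Any]],
    default: Any = None,
) -> Any:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        try:
            value = extract(source)
        except (KeyError, AttributeError, IndexError, TypeError):
            continue  # this strategy does not apply; try the next one
        if value not in (None, ""):
            return value
    return default  # graceful degradation when every strategy fails


def dedupe_properties(records: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield records whose identifying key has not been seen before."""
    seen: set[str] = set()
    for record in records:
        key = record.get("url") or record.get("id")  # assumed identifying fields
        if key is None:
            yield record  # cannot be identified, so it is kept rather than dropped
            continue
        key = str(key)
        if key in seen:
            continue
        seen.add(key)
        yield record


price = first_match(
    {"price": "$1,200,000"},
    [lambda d: d["askingPrice"], lambda d: d["price"]],
    default="N/A",
)
```

Catching only lookup-style exceptions keeps genuine bugs visible while still letting a missing field fall through to the next strategy.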
Debugging
- Automatic HTML saving for analysis
- Detailed logging at each step
- Debug mode with HTML snapshots
- Selector discovery helpers in logs
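
A debug snapshot helper along the lines described above might look like the following sketch. The function name `save_html_snapshot`, the logger name, the `debug_html` directory, and the file naming scheme are all assumptions chosen for illustration.

```python
import logging
import re
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger("crexi_scraper")  # hypothetical logger name


def save_html_snapshot(html: str, label: str, out_dir: Path = Path("debug_html")) -> Path:
    """Write page HTML to a timestamped file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only filename-safe characters from the label.
    safe_label = re.sub(r"[^A-Za-z0-9_-]+", "_", label).strip("_") or "page"
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = out_dir / f"{stamp}_{safe_label}.html"
    path.write_text(html, encoding="utf-8")
    logger.debug("Saved %d characters of HTML to %s", len(html), path)
    return path
```

Saved snapshots can then be opened in a browser or parsed offline to work out which selectors match the current page layout.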
Testing
- Local test script for development
- Output visualization tool
- Sample input configurations
- HTML analysis workflow
Technical Details
- Python 3.12 support
- Playwright 1.40.0 for browser automation
- BeautifulSoup 4.12.0 for HTML parsing
- Apify SDK 2.1.0+ integration
- Async/await pattern for efficiency
- Type hints for better code quality
Documentation
- Complete README with usage examples
- Implementation guide for developers
- Project structure documentation
- Quick start guide
- Changelog for version tracking
Known Limitations
- Requires internet connection
- Subject to Crexi website structure changes
- Some fields may not be available for all properties
- Rate limiting required to avoid blocking
- Scraping is constrained by robots.txt and the site's terms of service
Future Enhancements (Planned)
- Add support for saved searches
- Implement proxy rotation
- Add CSV export option
- Create scheduled run templates
- Add data validation rules
- Implement webhook notifications
- Add support for custom filters
- Create dashboard for monitoring
- Add data enrichment features
- Implement incremental updates
[Unreleased]
Planned
- Enhanced error recovery mechanisms
- More granular logging levels
- Performance optimizations
- Additional data fields
- API integration options
Version History
Version Numbering
- Major version (X.0.0): Breaking changes or major feature additions
- Minor version (1.X.0): New features, backward compatible
- Patch version (1.0.X): Bug fixes, minor improvements
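
The numbering rules above amount to a simple bump operation. The sketch below uses a hypothetical `bump_version` helper and handles plain `MAJOR.MINOR.PATCH` strings only; pre-release and build suffixes are out of scope.

```python
def bump_version(version: str, part: str) -> str:
    """Return the next version string for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"      # breaking change: reset minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # new feature: reset patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


print(bump_version("1.0.0", "minor"))  # prints 1.1.0
```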
Change Types
- Added: New features
- Changed: Changes to existing functionality
- Deprecated: Features that will be removed
- Removed: Removed features
- Fixed: Bug fixes
- Security: Security improvements
Note: This changelog will be updated with each release. For the latest changes, see the commit history.