# Medium Profile Scraper
A robust asynchronous scraper for Medium profiles and articles, built with Python. This project consists of two main components:
- A base scraper for finding Medium users and their articles
- A detailed profile scraper for gathering comprehensive user information
## Author
Afnan Khan
- GitHub: 2Cloud-S
- LinkedIn: afnankhan-ak
## Features

### Base Scraper (`medium.py`)
- Asynchronous scraping of Medium profiles
- Topic-based user discovery
- Premium content detection
- Website and email extraction
- Progress tracking and resumable scraping
- CSV export with duplicate prevention
### Profile Scraper (`profile_scraper.py`)
- Detailed profile information extraction
- Bio and social links
- Article statistics and history
- User interests and topics
- Batch processing with rate limiting
- Structured data export
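The batch processing mentioned above can be sketched as a simple chunking helper. This is an illustrative sketch, not the actual helper used in `profile_scraper.py`; the function name and batch size are assumptions:

```python
def batches(items, size):
    """Split a sequence into fixed-size chunks so each batch can be
    processed, then paused on, keeping request rates polite."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each yielded chunk can then be fetched concurrently, with a delay between chunks.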
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/2Cloud-S/medium-scraper.git
   cd medium-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

### 1. Base Scraper

Run the base scraper to collect Medium users:

```bash
python medium.py
```

This will:
- Scrape users from specified topics
- Save progress to `data/medium_users_progress.csv`
- Export final results to `data/medium_users_final.csv`
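A minimal sketch of how the progress file might make scraping resumable: on a rerun, load the usernames already saved and skip them. The helper is illustrative, assuming the progress CSV has a `username` column (as the base scraper's output fields suggest):

```python
import csv
import os

def load_scraped_usernames(path):
    """Return the set of usernames already written to the progress CSV,
    so a restarted run can skip them instead of re-scraping."""
    if not os.path.exists(path):
        return set()  # first run: nothing scraped yet
    with open(path, newline="", encoding="utf-8") as f:
        return {row["username"] for row in csv.DictReader(f)}
```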
### 2. Profile Scraper

After collecting users, run the profile scraper:

```bash
python profile_scraper.py
```

This will:
- Read users from `medium_users_final.csv`
- Collect detailed profile information
- Save results to `data/medium_profiles_detailed.csv`
## Data Structure

### Base Scraper Output

- `username`
- `is_premium`
- `has_newsletter`
- `website`
- `website_emails`
- `follower_count`
- `article_count`
- `premium_articles`
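The output CSV can be post-processed with the standard library. For example, a sketch that filters for premium users; it assumes `is_premium` is serialized as the text `True`/`False`, which should be verified against the actual export:

```python
import csv

def premium_users(path):
    """Yield usernames flagged as premium in the base scraper's output CSV.
    Assumes is_premium is stored as the string 'True'/'False'."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("is_premium", "").lower() == "true":
                yield row["username"]
```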
### Profile Scraper Output

- `username`
- `bio`
- `total_claps`
- `total_responses`
- `following_count`
- `top_writer_in`
- `member_since`
- `last_active`
- `social_links`
- `interests`
- `latest_articles`
## Configuration

Modify the following in `medium.py`:

- `topics`: list of topics to scrape
- `headers`: update cookies for authenticated requests
- `search_paths`: customize URL patterns
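As a rough illustration, the three settings might look like the following. All values here are placeholders; check the actual names and formats in `medium.py` before editing:

```python
# Hypothetical configuration values -- adjust to match medium.py.
topics = ["python", "data-science", "machine-learning"]

headers = {
    "User-Agent": "Mozilla/5.0 (...)",
    # Paste your Medium session cookie here for authenticated requests:
    "Cookie": "sid=...; uid=...",
}

search_paths = [
    "https://medium.com/tag/{topic}",
    "https://medium.com/tag/{topic}/recommended",
]
```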
## Rate Limiting
The scrapers implement:
- Random delays between requests
- Batch processing
- Error handling with retries
- Session management
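A simplified sketch of the delay-and-retry pattern described above; `fetch` stands in for whatever aiohttp request coroutine the scrapers actually use, and the delay values are illustrative:

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call an async fetch(url) with a random polite delay and simple
    exponential-backoff retries. `fetch` is a caller-supplied coroutine."""
    for attempt in range(retries):
        # Random jitter between requests keeps the request rate polite.
        await asyncio.sleep(random.uniform(0.0, base_delay))
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)  # back off and retry
```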
## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Built with aiohttp for async operations
- Uses BeautifulSoup4 for HTML parsing
- Implements best practices for web scraping
## Disclaimer
This tool is for educational purposes only. Be sure to comply with Medium's terms of service and implement appropriate delays between requests.