Dcinside Scraper
Pricing
Pay per usage
Scrapes DCInside mgallery boards and outputs one dataset item per post including post metadata, full text, and a structured list of comments + replies (plus a commentsText array for easy viewing).
Developer: Rafaz
DCInside Gallery Scraper
An Apify Actor that scrapes posts and comments from DCInside mgallery boards using CheerioCrawler.
Overview
This Actor scrapes DCInside (디시인사이드) mgallery boards, extracting:
- Post metadata (title, author, date)
- Post content (clean plain-text + optional HTML)
- All comments and replies
- Structured JSON output to dataset
The scraper uses CheerioCrawler for fast HTTP-based scraping (no browser required) and fetches comments via the mobile AJAX API endpoint.
Quick Start
Install dependencies:
$ npm install
Run the Actor locally:
$ apify run
Deploy to Apify Platform:
$ apify login
$ apify push
Input Parameters
The Actor accepts the following input parameters (defined in .actor/input_schema.json):
Basic Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| galleryId | string | "tomoo" | The DCInside gallery ID or full URL to scrape (e.g., "tomoo" or "https://gall.dcinside.com/mgallery/board/lists?id=tomoo") |
| startPage | integer | 1 | Page number to start scraping from |
| endPage | integer | auto-detect | Page number to stop scraping at. Leave empty to auto-detect the last page |
| maxPosts | integer | 0 | Maximum number of posts to scrape (0 = unlimited) |
Date Filtering
| Parameter | Type | Description |
|---|---|---|
| startDate | string | Only scrape posts on or after this date (YYYY-MM-DD format) |
| endDate | string | Only scrape posts on or before this date (YYYY-MM-DD format) |
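The date range check can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the helper name `isWithinDateRange` is hypothetical, it assumes post dates arrive in the `YYYY.MM.DD HH:mm:ss` form used by `postCreatedAt` in the output examples, and it assumes that posts with unparsable dates are excluded.

```typescript
// Hypothetical helper; the Actor's real filtering logic may differ.
function isWithinDateRange(
  postCreatedAt: string,
  startDate?: string, // YYYY-MM-DD, inclusive
  endDate?: string, // YYYY-MM-DD, inclusive
): boolean {
  // Pull the calendar day out of a "YYYY.MM.DD HH:mm:ss" timestamp.
  const match = /^(\d{4})\.(\d{2})\.(\d{2})/.exec(postCreatedAt);
  // Assumption: a post whose date cannot be parsed is filtered out.
  if (!match) return false;

  // Zero-padded ISO days compare correctly as plain strings.
  const day = `${match[1]}-${match[2]}-${match[3]}`;
  if (startDate && day < startDate) return false;
  if (endDate && day > endDate) return false;
  return true;
}
```

Both bounds are inclusive here, matching the "on or after" and "on or before" wording in the table.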
Comment Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| includeComments | boolean | true | Whether to fetch comments for each post. Disable for faster scraping |
| maxCommentsPerPost | integer | 0 | Maximum comments per post (0 = unlimited). Useful for posts with thousands of comments |
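The "0 = unlimited" convention used by maxCommentsPerPost (and maxPosts) can be illustrated with a small sketch. The helper `applyLimit` is hypothetical and only shows one way such a limit might be applied; treating negative values as unlimited is an assumption.

```typescript
// Hypothetical helper illustrating the "0 = unlimited" convention.
function applyLimit<T>(items: T[], max: number): T[] {
  // Assumption: zero or a negative value means "no limit".
  return max > 0 ? items.slice(0, max) : items;
}

// With a limit of 2, only the first two comments are kept.
const limited = applyLimit(["first", "second", "third"], 2);
// With a limit of 0, every comment is kept.
const unlimited = applyLimit(["first", "second", "third"], 0);
```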
Output & Performance
| Parameter | Type | Default | Description |
|---|---|---|---|
| outputFormat | string | "nested" | Output format: nested, flat, minimal, or text-only |
| skipExisting | boolean | false | Skip posts already in the dataset. Useful for resuming failed runs |
| maxRequestsPerCrawl | integer | 10000 | Maximum HTTP requests (safety limit) |
Example Input
{
  "galleryId": "tomoo",
  "startPage": 1,
  "endPage": 10,
  "startDate": "2024-01-01",
  "endDate": "2024-01-31",
  "maxPosts": 100,
  "includeComments": true,
  "maxCommentsPerPost": 500,
  "outputFormat": "nested",
  "skipExisting": false
}
Using Full URLs
You can provide a full gallery URL instead of just the ID:
{"galleryId": "https://gall.dcinside.com/mgallery/board/lists?id=tomoo"}
Or even a specific post URL:
{"galleryId": "https://gall.dcinside.com/mgallery/board/view/?id=tomoo&no=123456"}
The Actor will automatically extract the gallery ID.
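As a rough sketch of how the ID could be pulled out of either form of input, the hypothetical helper below reads the `id` query parameter when the value looks like a URL and returns bare IDs unchanged. The function name and the error behavior are assumptions, not the Actor's actual implementation.

```typescript
// Hypothetical helper: not taken from the Actor's source.
function extractGalleryId(input: string): string {
  const trimmed = input.trim();

  // A bare ID has no URL scheme, so it is returned unchanged.
  if (!/^https?:\/\//i.test(trimmed)) return trimmed;

  // Both list and post URLs carry the gallery in the "id" query parameter.
  const match = /[?&]id=([^&#]+)/.exec(trimmed);
  if (!match) {
    throw new Error(`No "id" query parameter found in URL: ${trimmed}`);
  }
  return decodeURIComponent(match[1]);
}
```

All three input styles shown above (bare ID, list URL, post URL) resolve to "tomoo" under this sketch.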
Output
The Actor outputs structured JSON objects to the dataset. The exact structure depends on the outputFormat setting:
Nested Format (default)
One object per post with nested comments array:
{
  "galleryId": "tomoo",
  "postNo": "123456",
  "url": "https://gall.dcinside.com/mgallery/board/view/?id=tomoo&no=123456",
  "postTitle": "Post Title",
  "postCreatedAt": "2024.01.15 14:30:25",
  "postAuthor": "nickname",
  "postAuthorNick": "nickname",
  "postAuthorUid": "user123",
  "postAuthorIp": "",
  "postText": "Clean post text (readable plain text)...",
  "postHtml": "<div>Raw HTML inside .write_div...</div>",
  "comments": [
    {
      "commentId": "789",
      "parentCommentId": "",
      "commentAuthor": "commenter1",
      "commentCreatedAt": "01.15 15:00",
      "commentText": "Comment text",
      "commentDepth": 0
    }
  ],
  "commentsText": ["commenter1 (01.15 15:00): Comment text"],
  "commentsCount": 1
}
Flat Format
One row per comment (great for CSV export):
{
  "galleryId": "tomoo",
  "postNo": "123456",
  "postTitle": "Post Title",
  "commentId": "789",
  "commentAuthor": "commenter1",
  "commentText": "Comment text",
  "commentDepth": 0
}
Minimal Format
Posts only, no comments:
{
  "galleryId": "tomoo",
  "postNo": "123456",
  "postTitle": "Post Title",
  "postText": "Post body text..."
}
Text-Only Format
Condensed text format:
{
  "galleryId": "tomoo",
  "postNo": "123456",
  "postTitle": "Post Title",
  "postText": "Post body text...",
  "allCommentsText": "commenter1: Comment text\n↳ commenter2: Reply text",
  "commentsCount": 2
}
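To show how the formats relate to each other, here is a sketch that derives flat rows and the text-only comment string from a nested item. The field names come from the examples above; the helper names `toFlatRows` and `toAllCommentsText` are hypothetical, and the rule that any comment with `commentDepth` above 0 gets the `↳` prefix is an assumption based on the sample output.

```typescript
interface Comment {
  commentId: string;
  parentCommentId: string;
  commentAuthor: string;
  commentCreatedAt: string;
  commentText: string;
  commentDepth: number;
}

interface NestedPost {
  galleryId: string;
  postNo: string;
  postTitle: string;
  comments: Comment[];
}

interface FlatRow {
  galleryId: string;
  postNo: string;
  postTitle: string;
  commentId: string;
  commentAuthor: string;
  commentText: string;
  commentDepth: number;
}

// Hypothetical: one flat row per comment, repeating the post fields.
function toFlatRows(post: NestedPost): FlatRow[] {
  return post.comments.map((c) => ({
    galleryId: post.galleryId,
    postNo: post.postNo,
    postTitle: post.postTitle,
    commentId: c.commentId,
    commentAuthor: c.commentAuthor,
    commentText: c.commentText,
    commentDepth: c.commentDepth,
  }));
}

// Hypothetical: one line per comment; replies (depth > 0) get a "↳ " prefix.
function toAllCommentsText(comments: Comment[]): string {
  return comments
    .map((c) => `${c.commentDepth > 0 ? "↳ " : ""}${c.commentAuthor}: ${c.commentText}`)
    .join("\n");
}
```

Note that a post without comments yields no rows in this sketch; whether the Actor emits a placeholder row in that case is not stated here.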
How It Works
- Gallery ID Extraction: The Actor accepts gallery IDs or full URLs and extracts the ID automatically
- List Page Discovery: Fetches list pages from https://gall.dcinside.com/mgallery/board/lists/ and extracts post URLs
- Date Filtering: If date filters are set, posts outside the range are skipped
- Deduplication: If skipExisting is enabled, already-scraped posts are skipped
- Post Extraction: For each post, it extracts metadata plus post content from the desktop mgallery view page (.write_div)
- Comment Fetching: Comments are fetched via the mobile AJAX endpoint (https://m.dcinside.com/ajax/response-comment) with pagination support
- Data Output: Each post (and optionally comments) is pushed to the dataset in the requested format
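The list page discovery step can be sketched as a function that turns a page range into list URLs. The base URL is the one named above; the helper name `buildListPageUrls` is hypothetical, and passing the page number as a `page` query parameter is an assumption about how the list pages are addressed.

```typescript
const LIST_BASE = "https://gall.dcinside.com/mgallery/board/lists/";

// Hypothetical helper; the Actor may build its requests differently.
function buildListPageUrls(
  galleryId: string,
  startPage: number,
  endPage: number,
): string[] {
  const urls: string[] = [];
  for (let page = startPage; page <= endPage; page++) {
    // Assumption: the page number is passed as a `page` query parameter.
    urls.push(`${LIST_BASE}?id=${encodeURIComponent(galleryId)}&page=${page}`);
  }
  return urls;
}
```

Each URL would then be enqueued for the crawler, with post links extracted from the returned HTML.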
Project Structure
.actor/
├── actor.json            # Actor configuration
├── input_schema.json     # Input parameter definitions
├── output_schema.json    # Output schema
└── dataset_schema.json   # Dataset view configuration
src/
└── main.ts               # Main Actor code
storage/                  # Local storage (development only)
├── datasets/             # Output items
├── key_value_stores/     # INPUT.json and other files
└── request_queues/       # Crawl request queue
Features
- ✅ Full URL Support: Accept gallery URLs or IDs
- ✅ Date-Based Filtering: Scrape posts from specific date ranges
- ✅ Smart Deduplication: Skip existing posts for resume/incremental runs
- ✅ Flexible Output Formats: Nested, flat, minimal, or text-only
- ✅ Comment Control: Enable/disable comments, set max per post
- ✅ Fast HTTP-based scraping (no browser overhead)
- ✅ Automatic last page detection
- ✅ Comment pagination support
- ✅ Structured comment hierarchy (top-level + replies)
- ✅ Configurable page ranges and post limits
- ✅ Proxy support via Apify Proxy
- ✅ Graceful abort handling
Limitations
- Image downloading is not included (text and comments only)
- CSV export functionality removed (use dataset export instead)
- Date filtering requires the post date to be parsable from the page
Tips
Resuming a Failed Run
If a run fails partway through, enable skipExisting to avoid re-scraping posts:
{"galleryId": "tomoo", "skipExisting": true}
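One way such deduplication could work is to key each post by gallery and post number and drop candidates whose key has already been seen. The sketch below is an assumption about the approach, not the Actor's code; `postKey` and `filterNewPosts` are hypothetical names, and how the Actor actually loads existing items from the dataset is not shown.

```typescript
interface PostRef {
  galleryId: string;
  postNo: string;
}

// Hypothetical key format; the Actor may identify posts differently.
const postKey = (p: PostRef): string => `${p.galleryId}:${p.postNo}`;

// Keep only posts that are not in `existing` and not repeated in `candidates`.
function filterNewPosts(candidates: PostRef[], existing: PostRef[]): PostRef[] {
  const seen = new Set(existing.map(postKey));
  return candidates.filter((p) => {
    const key = postKey(p);
    if (seen.has(key)) return false;
    seen.add(key); // also drops duplicates within the same batch
    return true;
  });
}
```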
Scraping Only Recent Posts
Use date filtering instead of page numbers:
{"galleryId": "tomoo", "startDate": "2024-01-01", "endDate": "2024-01-31"}
Fast Scraping (Posts Only)
Disable comments for much faster scraping:
{"galleryId": "tomoo", "includeComments": false, "outputFormat": "minimal"}
Handling Posts with Many Comments
Some posts have thousands of comments. Limit them:
{"galleryId": "tomoo", "maxCommentsPerPost": 100}
CSV Export
Use outputFormat: "flat" for easier CSV export (one row per comment).
Changelog
v1.3.0
- New: Extract post content more reliably (clean postText) and include postHtml (raw HTML of the post body)
- Improved: Friendlier input descriptions/tooltips for non-technical users
- Improved: Dataset overview view now includes postText
v1.1.0
- New: Accept full gallery URLs (not just IDs)
- New: Date-based filtering (startDate, endDate)
- New: Skip existing posts (skipExisting)
- New: Output format options (nested, flat, minimal, text-only)
- New: Comment control (includeComments, maxCommentsPerPost)
- Improved input validation and error messages
v1.0.0
- Initial release
License
ISC