Dcinside Scraper
Under maintenance

Pricing

Pay per usage

Scrapes DCInside mgallery boards and outputs one dataset item per post including post metadata, full text, and a structured list of comments + replies (plus a commentsText array for easy viewing).

Rating

0.0

(0)

Developer

Rafaz

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

6 days ago

Last modified

DCInside Gallery Scraper

An Apify Actor that scrapes posts and comments from DCInside mgallery boards using CheerioCrawler.

Overview

This Actor scrapes DCInside (디시인사이드) mgallery boards and writes structured JSON to the dataset. For each post it extracts:

  • Post metadata (title, author, date)
  • Post content (clean plain text + optional HTML)
  • All comments and replies

The scraper uses CheerioCrawler for fast HTTP-based scraping (no browser required) and fetches comments via the mobile AJAX API endpoint.

Quick Start

Install dependencies:

npm install

Run the Actor locally:

apify run

Deploy to Apify Platform:

apify login
apify push

Input Parameters

The Actor accepts the following input parameters (defined in .actor/input_schema.json):

Basic Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| galleryId | string | "tomoo" | The DCInside gallery ID or full URL to scrape (e.g., "tomoo" or "https://gall.dcinside.com/mgallery/board/lists?id=tomoo") |
| startPage | integer | 1 | Page number to start scraping from |
| endPage | integer | auto-detect | Page number to stop scraping at. Leave empty to auto-detect the last page |
| maxPosts | integer | 0 | Maximum number of posts to scrape (0 = unlimited) |

Date Filtering

| Parameter | Type | Description |
|---|---|---|
| startDate | string | Only scrape posts on or after this date (YYYY-MM-DD format) |
| endDate | string | Only scrape posts on or before this date (YYYY-MM-DD format) |
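
As a rough illustration of the inclusive date range, the sketch below checks a post timestamp against startDate and endDate. It assumes timestamps look like the postCreatedAt value in the sample output ("2024.01.15 14:30:25"). The helper names toIsoDay and isWithinRange are hypothetical, and rejecting posts whose date cannot be parsed is an assumption, not confirmed Actor behavior.

```typescript
// Hypothetical helpers; the Actor's real filtering code is not shown in this README.

// Converts "2024.01.15 14:30:25" to "2024-01-15"; returns null if the format differs.
function toIsoDay(postCreatedAt: string): string | null {
    const match = /^(\d{4})\.(\d{2})\.(\d{2})/.exec(postCreatedAt.trim());
    return match ? `${match[1]}-${match[2]}-${match[3]}` : null;
}

// Inclusive on both ends, mirroring "on or after" / "on or before".
function isWithinRange(postCreatedAt: string, startDate?: string, endDate?: string): boolean {
    const day = toIsoDay(postCreatedAt);
    if (day === null) return false; // assumption: unparsable dates are skipped
    // Zero-padded YYYY-MM-DD strings sort chronologically, so string comparison is enough.
    if (startDate && day < startDate) return false;
    if (endDate && day > endDate) return false;
    return true;
}

console.log(isWithinRange("2024.01.15 14:30:25", "2024-01-01", "2024-01-31")); // true
console.log(isWithinRange("2024.02.01 00:00:00", "2024-01-01", "2024-01-31")); // false
```

Either bound can be omitted to leave that side of the range open.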

Comment Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| includeComments | boolean | true | Whether to fetch comments for each post. Disable for faster scraping |
| maxCommentsPerPost | integer | 0 | Maximum comments per post (0 = unlimited). Useful for posts with thousands of comments |

Output & Performance

| Parameter | Type | Default | Description |
|---|---|---|---|
| outputFormat | string | "nested" | Output format: nested, flat, minimal, or text-only |
| skipExisting | boolean | false | Skip posts already in dataset. Useful for resuming failed runs |
| maxRequestsPerCrawl | integer | 10000 | Maximum HTTP requests (safety limit) |
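
For readers who call the Actor from TypeScript, the parameters above could be modeled roughly as follows. This is a sketch derived from the tables in this README, not the Actor's actual source. The names ScraperInput, INPUT_DEFAULTS, and withDefaults are hypothetical.

```typescript
// Hypothetical typing of the documented input; field names and defaults come from the tables above.
type OutputFormat = "nested" | "flat" | "minimal" | "text-only";

interface ScraperInput {
    galleryId: string;
    startPage: number;
    endPage?: number;           // omitted = auto-detect the last page
    maxPosts: number;           // 0 = unlimited
    startDate?: string;         // YYYY-MM-DD
    endDate?: string;           // YYYY-MM-DD
    includeComments: boolean;
    maxCommentsPerPost: number; // 0 = unlimited
    outputFormat: OutputFormat;
    skipExisting: boolean;
    maxRequestsPerCrawl: number;
}

const INPUT_DEFAULTS: ScraperInput = {
    galleryId: "tomoo",
    startPage: 1,
    maxPosts: 0,
    includeComments: true,
    maxCommentsPerPost: 0,
    outputFormat: "nested",
    skipExisting: false,
    maxRequestsPerCrawl: 10000,
};

// Fills in documented defaults for any field the caller left out.
function withDefaults(raw: Partial<ScraperInput>): ScraperInput {
    return { ...INPUT_DEFAULTS, ...raw };
}

const resolved = withDefaults({ endPage: 10, outputFormat: "flat" });
console.log(resolved.startPage, resolved.endPage, resolved.outputFormat); // 1 10 flat
```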

Example Input

{
  "galleryId": "tomoo",
  "startPage": 1,
  "endPage": 10,
  "startDate": "2024-01-01",
  "endDate": "2024-01-31",
  "maxPosts": 100,
  "includeComments": true,
  "maxCommentsPerPost": 500,
  "outputFormat": "nested",
  "skipExisting": false
}

Using Full URLs

You can provide a full gallery URL instead of just the ID:

{
  "galleryId": "https://gall.dcinside.com/mgallery/board/lists?id=tomoo"
}

Or even a specific post URL:

{
  "galleryId": "https://gall.dcinside.com/mgallery/board/view/?id=tomoo&no=123456"
}

The Actor will automatically extract the gallery ID.
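
A minimal sketch of how such extraction could work, assuming the gallery ID is always carried in the id query parameter as in the URLs above. The function name extractGalleryId is hypothetical. The Actor's real parsing code is not shown in this README and may handle more cases.

```typescript
// Hypothetical helper: accepts either a bare gallery ID or a full DCInside URL.
function extractGalleryId(input: string): string {
    const trimmed = input.trim();
    // Anything that does not look like an http(s) URL is treated as a bare ID.
    if (!/^https?:\/\//i.test(trimmed)) {
        return trimmed;
    }
    const id = new URL(trimmed).searchParams.get("id");
    if (!id) {
        throw new Error(`No "id" query parameter found in URL: ${trimmed}`);
    }
    return id;
}

// All three inputs resolve to "tomoo".
console.log(extractGalleryId("tomoo"));
console.log(extractGalleryId("https://gall.dcinside.com/mgallery/board/lists?id=tomoo"));
console.log(extractGalleryId("https://gall.dcinside.com/mgallery/board/view/?id=tomoo&no=123456"));
```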

Output

The Actor outputs structured JSON objects to the dataset. The exact structure depends on the outputFormat setting:

Nested Format (default)

One object per post with nested comments array:

{
  "galleryId": "tomoo",
  "postNo": "123456",
  "url": "https://gall.dcinside.com/mgallery/board/view/?id=tomoo&no=123456",
  "postTitle": "Post Title",
  "postCreatedAt": "2024.01.15 14:30:25",
  "postAuthor": "nickname",
  "postAuthorNick": "nickname",
  "postAuthorUid": "user123",
  "postAuthorIp": "",
  "postText": "Clean post text (readable plain text)...",
  "postHtml": "<div>Raw HTML inside .write_div...</div>",
  "comments": [
    {
      "commentId": "789",
      "parentCommentId": "",
      "commentAuthor": "commenter1",
      "commentCreatedAt": "01.15 15:00",
      "commentText": "Comment text",
      "commentDepth": 0
    }
  ],
  "commentsText": ["commenter1 (01.15 15:00): Comment text"],
  "commentsCount": 1
}
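
The commentsText array in the sample above appears to follow an "author (date): text" pattern. The sketch below reproduces that pattern from the comments array. The PostComment type and the toCommentsText function are hypothetical names, and the format is inferred from the single sample entry only.

```typescript
// Hypothetical type mirroring one entry of the "comments" array shown above.
interface PostComment {
    commentId: string;
    parentCommentId: string;
    commentAuthor: string;
    commentCreatedAt: string;
    commentText: string;
    commentDepth: number;
}

// Builds one display string per comment, in the pattern inferred from the sample output.
function toCommentsText(comments: PostComment[]): string[] {
    return comments.map((c) => `${c.commentAuthor} (${c.commentCreatedAt}): ${c.commentText}`);
}

const sampleComments: PostComment[] = [
    {
        commentId: "789",
        parentCommentId: "",
        commentAuthor: "commenter1",
        commentCreatedAt: "01.15 15:00",
        commentText: "Comment text",
        commentDepth: 0,
    },
];

console.log(toCommentsText(sampleComments)); // [ 'commenter1 (01.15 15:00): Comment text' ]
```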

Flat Format

One row per comment (great for CSV export):

{
  "galleryId": "tomoo",
  "postNo": "123456",
  "postTitle": "Post Title",
  "commentId": "789",
  "commentAuthor": "commenter1",
  "commentText": "Comment text",
  "commentDepth": 0
}
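
To show how the flat rows relate to the nested format, here is a sketch that expands one nested post into one row per comment. The type and function names are hypothetical. Note that this sketch emits no rows for a post without comments, and the Actor may behave differently in that case.

```typescript
// Hypothetical types covering only the fields shown in the flat sample above.
interface NestedComment {
    commentId: string;
    commentAuthor: string;
    commentText: string;
    commentDepth: number;
}

interface NestedPost {
    galleryId: string;
    postNo: string;
    postTitle: string;
    comments: NestedComment[];
}

interface FlatRow extends NestedComment {
    galleryId: string;
    postNo: string;
    postTitle: string;
}

// Repeats the post-level fields on every comment row.
function flattenPost(post: NestedPost): FlatRow[] {
    return post.comments.map((c) => ({
        galleryId: post.galleryId,
        postNo: post.postNo,
        postTitle: post.postTitle,
        commentId: c.commentId,
        commentAuthor: c.commentAuthor,
        commentText: c.commentText,
        commentDepth: c.commentDepth,
    }));
}

const flatRows = flattenPost({
    galleryId: "tomoo",
    postNo: "123456",
    postTitle: "Post Title",
    comments: [
        { commentId: "789", commentAuthor: "commenter1", commentText: "Comment text", commentDepth: 0 },
        { commentId: "790", commentAuthor: "commenter2", commentText: "Reply text", commentDepth: 1 },
    ],
});
console.log(flatRows.length); // 2
```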

Minimal Format

Posts only, no comments:

{
  "galleryId": "tomoo",
  "postNo": "123456",
  "postTitle": "Post Title",
  "postText": "Post body text..."
}

Text-Only Format

Condensed text format:

{
  "galleryId": "tomoo",
  "postNo": "123456",
  "postTitle": "Post Title",
  "postText": "Post body text...",
  "allCommentsText": "commenter1: Comment text\n↳ commenter2: Reply text",
  "commentsCount": 2
}
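
The allCommentsText value above suggests one line per comment, with replies prefixed by "↳ ". A sketch of that joining step follows. It assumes a reply is any comment with commentDepth greater than 0, and the names TextComment and toAllCommentsText are hypothetical.

```typescript
// Hypothetical type with only the fields needed for the condensed text.
interface TextComment {
    commentAuthor: string;
    commentText: string;
    commentDepth: number;
}

// Joins comments into one newline-separated string; replies get a "↳ " prefix.
function toAllCommentsText(comments: TextComment[]): string {
    return comments
        .map((c) => `${c.commentDepth > 0 ? "↳ " : ""}${c.commentAuthor}: ${c.commentText}`)
        .join("\n");
}

const condensed = toAllCommentsText([
    { commentAuthor: "commenter1", commentText: "Comment text", commentDepth: 0 },
    { commentAuthor: "commenter2", commentText: "Reply text", commentDepth: 1 },
]);
console.log(condensed);
```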

How It Works

  1. Gallery ID Extraction: The Actor accepts gallery IDs or full URLs and extracts the ID automatically
  2. List Page Discovery: Fetches list pages from https://gall.dcinside.com/mgallery/board/lists/ and extracts post URLs
  3. Date Filtering: If date filters are set, posts outside the range are skipped
  4. Deduplication: If skipExisting is enabled, already-scraped posts are skipped
  5. Post Extraction: For each post, it extracts metadata plus post content from the desktop mgallery view page (.write_div)
  6. Comment Fetching: Comments are fetched via the mobile AJAX endpoint (https://m.dcinside.com/ajax/response-comment) with pagination support
  7. Data Output: Each post (and optionally comments) is pushed to the dataset in the requested format
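
Step 2 can be sketched as generating one list URL per page in the requested range. The base URL comes from the step above. The page query parameter is an assumption about DCInside's list URLs, and buildListUrls is a hypothetical name, not the Actor's code.

```typescript
const LIST_BASE = "https://gall.dcinside.com/mgallery/board/lists/";

// Hypothetical helper: builds the list-page URLs for an inclusive page range.
function buildListUrls(galleryId: string, startPage: number, endPage: number): string[] {
    const urls: string[] = [];
    for (let page = startPage; page <= endPage; page++) {
        const url = new URL(LIST_BASE);
        url.searchParams.set("id", galleryId);
        url.searchParams.set("page", String(page)); // assumed parameter name
        urls.push(url.toString());
    }
    return urls;
}

console.log(buildListUrls("tomoo", 1, 3));
```

When endPage is left empty, the Actor auto-detects the last page, so a fixed range like this only applies when both bounds are known.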

Project Structure

.actor/
├── actor.json # Actor configuration
├── input_schema.json # Input parameter definitions
├── output_schema.json # Output schema
└── dataset_schema.json # Dataset view configuration
src/
└── main.ts # Main Actor code
storage/ # Local storage (development only)
├── datasets/ # Output items
├── key_value_stores/ # INPUT.json and other files
└── request_queues/ # Crawl request queue

Features

  • Full URL Support: Accept gallery URLs or IDs
  • Date-Based Filtering: Scrape posts from specific date ranges
  • Smart Deduplication: Skip existing posts for resume/incremental runs
  • Flexible Output Formats: Nested, flat, minimal, or text-only
  • Comment Control: Enable/disable comments, set max per post
  • Fast HTTP-based scraping (no browser overhead)
  • Automatic last page detection
  • Comment pagination support
  • Structured comment hierarchy (top-level + replies)
  • Configurable page ranges and post limits
  • Proxy support via Apify Proxy
  • Graceful abort handling

Limitations

  • Image downloading is not included (text and comments only)
  • Built-in CSV export has been removed (use the dataset export instead)
  • Date filtering requires the post date to be parsable from the page

Tips

Resuming a Failed Run

If a run fails partway through, enable skipExisting to avoid re-scraping posts:

{
  "galleryId": "tomoo",
  "skipExisting": true
}
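
Conceptually, skipping existing posts is a set-membership check. The sketch below assumes a post is identified by its galleryId and postNo. How the Actor actually tracks already-scraped posts is not described here, and the names PostRef, postKey, and filterNewPosts are hypothetical.

```typescript
// Hypothetical identifier for a post; assumes galleryId + postNo is unique.
interface PostRef {
    galleryId: string;
    postNo: string;
}

function postKey(ref: PostRef): string {
    return `${ref.galleryId}:${ref.postNo}`;
}

// Returns only the candidates that are not already present, also dropping duplicates within the batch.
function filterNewPosts(candidates: PostRef[], existing: PostRef[]): PostRef[] {
    const seen = new Set<string>(existing.map(postKey));
    return candidates.filter((c) => {
        const key = postKey(c);
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
}

const remaining = filterNewPosts(
    [
        { galleryId: "tomoo", postNo: "123456" },
        { galleryId: "tomoo", postNo: "123457" },
    ],
    [{ galleryId: "tomoo", postNo: "123456" }],
);
console.log(remaining.map(postKey)); // [ 'tomoo:123457' ]
```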

Scraping Only Recent Posts

Use date filtering instead of page numbers:

{
  "galleryId": "tomoo",
  "startDate": "2024-01-01",
  "endDate": "2024-01-31"
}

Fast Scraping (Posts Only)

Disable comments for much faster scraping:

{
  "galleryId": "tomoo",
  "includeComments": false,
  "outputFormat": "minimal"
}

Handling Posts with Many Comments

Some posts have thousands of comments. Limit them:

{
  "galleryId": "tomoo",
  "maxCommentsPerPost": 100
}
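
The "0 = unlimited" convention used by maxCommentsPerPost (and maxPosts) can be expressed as a small helper. This sketch truncates an already-fetched list. The Actor itself may instead stop paginating once the limit is reached, and applyLimit is a hypothetical name.

```typescript
// Hypothetical helper: a limit of 0 (or any non-positive value) means "no limit".
function applyLimit<T>(items: T[], max: number): T[] {
    return max > 0 ? items.slice(0, max) : items;
}

console.log(applyLimit(["c1", "c2", "c3"], 2)); // [ 'c1', 'c2' ]
console.log(applyLimit(["c1", "c2", "c3"], 0)); // [ 'c1', 'c2', 'c3' ]
```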

CSV Export

Use outputFormat: "flat" for easier CSV export (one row per comment).

Resources

Changelog

v1.3.0

  • New: Extract post content more reliably (clean postText) and include postHtml (raw HTML of the post body)
  • Improved: Friendlier input descriptions/tooltips for non-technical users
  • Improved: Dataset overview view now includes postText

v1.1.0

  • New: Accept full gallery URLs (not just IDs)
  • New: Date-based filtering (startDate, endDate)
  • New: Skip existing posts (skipExisting)
  • New: Output format options (nested, flat, minimal, text-only)
  • New: Comment control (includeComments, maxCommentsPerPost)
  • Improved input validation and error messages

v1.0.0

  • Initial release

License

ISC