Fast Instagram Post Scraper 🚀 avatar
Fast Instagram Post Scraper 🚀

Pricing

Pay per usage

Go to Store
Fast Instagram Post Scraper 🚀

Fast Instagram Post Scraper 🚀

Developed by

Instagram Scraper

Instagram Scraper

Maintained by Community

Instagram Post Scraper. Post data: hashtags, comment_count, like_count, usertags, images, videos, shortcode and etc. Scrape Instagram Posts with Ease and Speed

5.0 (1)

Pricing

Pay per usage

3

Total users

108

Monthly users

55

Runs succeeded

98%

Issue response

2.3 days

Last modified

21 days ago

LB

FEATURE REQUEST / View count

Closed

Labsed opened this issue
3 months ago

Hey,

Is it possible to add view_count to the post item?

Best

LB

Labsed

3 months ago

And also the video duration.

IS

view_count has been added, but it looks like it's basically null, and I need to make sure that the video duration is available via the video_dash_manifest info.

LB

Labsed

2 months ago

Thanks for considering the feedback. Attaching my crawler in case it can help you with the manifest manipulation, or any other means.

import {
createHttpRouter,
HttpCrawler,
HttpCrawlerOptions,
HttpCrawlingContext,
ProxyConfiguration,
} from "crawlee";
import jp from "jsonpath";
import * as cheerio from "cheerio";
import Video from "../../entities/video";
import Channel from "../../entities/channel";
import ChannelStats from "../../entities/channel-stats";
import VideoStats from "../../entities/video-stats";
import { CrawlerInterface, Platform } from "../../types";
export class InstagramCrawler
extends HttpCrawler<any>
implements CrawlerInterface
{
constructor(options?: HttpCrawlerOptions) {
const router = createHttpRouter();
router.addDefaultHandler(async ({ request, json }: HttpCrawlingContext) => {
const nodes: Record<string, any>[] = jp.query(
json,
"$.data.user.edge_owner_to_timeline_media.edges[?(@.node.__typename=='GraphVideo')].node",
);
for (const node of nodes) {
const video = Video.create({
channel: request.userData.channel,
videoId: node.shortcode,
title: node.edge_media_to_caption?.edges[0].node.text ?? "",
duration: extractDurationFromManifest(
node.dash_info.video_dash_manifest,
),
publishedAt: new Date(node.taken_at_timestamp * 1000),
});
await Video.upsert(video, ["channel", "videoId"]);
await VideoStats.save({
video,
views: node.video_view_count,
comments: node.edge_media_to_comment.count,
reactions: node.edge_media_preview_like.count,
topReactions: [
{ name: "Like", count: node.edge_media_preview_like.count },
],
});
}
});
router.addHandler(
"stats",
async ({ json, request }: HttpCrawlingContext) => {
const data: Record<string, any> = jp.value(
json,
"$..data.xdt_shortcode_media",
);
await Video.update(
{ id: request.userData.video.id },
{
title: data.edge_media_to_caption?.edges[0].node.text ?? "",
duration: parseInt(data.video_duration),
publishedAt: new Date(data.taken_at_timestamp * 1000),
},
);
await VideoStats.save({
video: request.userData.video,
views: data.video_view_count,
comments: data.edge_media_to_parent_comment.count,
reactions: data.edge_media_preview_like.count,
topReactions: [
{ name: "Like", count: data.edge_media_preview_like.count },
],
});
},
);
router.addHandler("cadd", async ({ log, json }: HttpCrawlingContext) => {
const data: Record<string, any> = jp.value(json, "$..data.user");
log.info("Got channel data", data);
const channel = Channel.create({
platform: Platform.INSTAGRAM,
channelId: data.id,
username: data.username,
name: data.full_name,
});
await Channel.upsert(channel, ["platform", "channelId"]);
log.info("Channel saved", channel);
});
router.addHandler(
"cstats",
async ({ request, json }: HttpCrawlingContext) => {
await ChannelStats.save({
channel: request.userData.channel,
followers: json.data.user.edge_followed_by.count,
});
},
);
let proxyConfiguration: ProxyConfiguration | undefined;
if (process.env.IG_PROXY_URL) {
proxyConfiguration = new ProxyConfiguration({
proxyUrls: [process.env.IG_PROXY_URL],
});
}
super({
...options,
proxyConfiguration,
requestHandler: router,
preNavigationHooks: [
async (_: HttpCrawlingContext, gotOptions: any) => {
gotOptions.headers = {
"X-IG-App-ID": "936619743392459",
};
},
],
});
}
addChannelAddRequest(url: string) {
return this.addRequests([{ url, label: "cadd" }]);
}
addChannelStatsRequests(channels: Channel[]) {
return this.addRequests(
channels.map((channel) => ({
url: `https://www.instagram.com/api/v1/users/web_profile_info/?username=${channel.username}`,
label: "cstats",
userData: { channel },
})),
);
}
addChannelVideoRequests(channels: Channel[]) {
return this.addRequests(
channels.map((channel) => ({
url: `https://www.instagram.com/api/v1/users/web_profile_info/?username=${channel.username}`,
userData: { channel },
})),
);
}
addVideoStatsRequests(videos: Video[]) {
return this.addRequests(
videos.map((video) => ({
url: "https://www.instagram.com/graphql/query",
method: "POST",
headers: { "Content-Type": "application/x-www-form-urlencoded" },
payload: new URLSearchParams({
doc_id: "8845758582119845",
variables: JSON.stringify({ shortcode: video.videoId }),
}).toString(),
useExtendedUniqueKey: true,
label: "stats",
userData: { video },
})),
);
}
}
function extractDurationFromManifest(manifestXml: string): number {
const match = cheerio
.load(manifestXml, { xmlMode: true })("MPD")
.attr("mediaPresentationDuration")
?.match(/^PT([\d.]+)S$/);
return match?.[1] ? parseInt(match[1]) : 0;
}
IS

Thanks a lot, I found the video_duration parameter, I grabbed the post collection and the parameter was not as comprehensive as the single page information