TikTok Scraper

Try for free

4 days trial then $45.00/month - No credit card required now

clockworks/tiktok-scraper

Extract data from TikTok videos, hashtags, and users. Use URLs or search queries to scrape TikTok profiles, hashtags, posts, URLs, shares, followers, hearts, names, video, and music-related data. Export scraped data, run the scraper via API, schedule and monitor runs or integrate with other tools.

listItems() `total` count does not reflect number of posts returned

Open

firebrick_ektara opened this issue
17 days ago

Using the ApifyClient SDK, we are seeing that the value of the total field returned by the dataset.listItems() method does not match the number of posts actually returned across paginated results.

The following actor options:

const actorOptions = {
  profiles: [userId],
  resultsPerPage: maxPosts,
  excludePinnedPosts: true,
  oldestPostDate: undefined,
  proxyCountryCode: "None",
  waitSecs: 60 * 60, // 1 hour
};
const apifyRunResult = await apifyClient
  .actor(apifyActor)
  .call(actorOptions);
const dataset = await apifyClient.dataset(
  apifyRunResult.defaultDatasetId,
);
const rawResponse = await dataset.listItems({
  limit, // page size
  offset, // item offset
});
...
const { items, count, total } = rawResponse;

produces inconsistent results. Sometimes the total value matches the actual item count, and sometimes it reports a number like 30 even though more than 30 items are returned in total.

The total value is important to determine when the pagination has completed.
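
To make the dependence on total concrete, here is a minimal sketch of a pagination loop that trusts the total field of each listItems() response. The names paginateUsingTotal and makeFakeDataset are hypothetical and not part of apify-client; the fake dataset only mimics the { items, count, total } response shape described above so the loop can run without a token.

```javascript
// Naive pagination that trusts `total` from each listItems() response.
// `dataset` is anything with an async listItems({ limit, offset }) method,
// e.g. the object returned by apifyClient.dataset(datasetId).
async function paginateUsingTotal(dataset, limit = 20) {
  const items = [];
  let offset = 0;
  let total = Infinity; // unknown until the first response arrives

  while (offset < total) {
    const page = await dataset.listItems({ limit, offset });
    total = page.total;
    if (page.items.length === 0) break; // guard against looping forever
    items.push(...page.items);
    offset += page.items.length;
  }
  return items;
}

// Hypothetical in-memory stand-in for a dataset client (illustration only).
// `reportedTotal` lets the fake misreport `total`, as described in this issue.
function makeFakeDataset(size, reportedTotal = size) {
  const all = Array.from({ length: size }, (_, i) => ({ id: i }));
  return {
    async listItems({ limit, offset }) {
      const items = all.slice(offset, offset + limit);
      return { items, count: items.length, offset, limit, total: reportedTotal };
    },
  };
}
```

With an accurate total the loop collects every item. If total is under-reported (for example 30 for a dataset of 45 items), the loop exits as soon as offset passes 30 and the remaining items are never fetched, which is why an unstable total breaks this style of pagination.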

svpetrenko

Hi! Thanks for your patience. I tried to reproduce this but couldn't: total always equals the exact number of dataset items. For reference, here is my reproduction:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
  token: 'mytoken',
});

const actorOptions = {
  profiles: ["muslim"],
  resultsPerPage: 1000,
  excludePinnedPosts: true,
  oldestPostDate: undefined,
  proxyCountryCode: "None",
  waitSecs: 60 * 60, // 1 hour
};
const apifyRunResult = await client
  .actor('GdWCkxBtKWOsKjdch')
  .call(actorOptions);
const dataset = client.dataset(
  apifyRunResult.defaultDatasetId,
);

const itemCount = await dataset.get().then((response) => response.itemCount);
for (let i = 0; i < itemCount; i += 30) {
  const rawResponses = await dataset.listItems({
    limit: 30, // page size
    offset: i, // item offset
  });
  if (rawResponses.total !== itemCount) {
    process.exit(1);
  }
  console.log(rawResponses.total);
}

Could you send me your workflow's exact code (with tokens redacted) so I can check further? It may be that the dataset count has not yet been updated right after the run finishes, which would make this a rare race condition.
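
If the race-condition hypothesis holds, one possible mitigation is to re-read the dataset info until itemCount stops changing before paginating. The sketch below is only an assumption about how that could look: waitForStableItemCount, its attempts and delayMs parameters, and makeCountingDataset are hypothetical names, and the only client call relied on is dataset.get() returning an object with itemCount, as in the reproduction above.

```javascript
// Poll dataset.get() until two consecutive reads report the same itemCount.
// `attempts` and `delayMs` are illustrative defaults, not recommended values.
async function waitForStableItemCount(dataset, { attempts = 5, delayMs = 1000 } = {}) {
  let previous = -1;
  for (let i = 0; i < attempts; i++) {
    const info = await dataset.get();
    const current = info?.itemCount ?? 0;
    if (current === previous) return current; // unchanged since last read
    previous = current;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return previous; // never stabilised; return the last value seen
}

// Hypothetical stand-in whose itemCount follows a scripted sequence
// (illustration only); the last value repeats once the script runs out.
function makeCountingDataset(counts) {
  let call = 0;
  return {
    async get() {
      const itemCount = counts[Math.min(call, counts.length - 1)];
      call += 1;
      return { itemCount };
    },
  };
}
```

Two equal reads do not prove the count is final, so this only narrows the window rather than closing it.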

firebrick_ektara

9 days ago

Hi, thanks for looking into this. I wasn't aware of the approach you used to get the itemCount - that could be a useful workaround for us.
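
As a rough sketch of that workaround, the loop below reads itemCount once via dataset.get() and uses it as the upper bound, ignoring the total field of the individual listItems() responses. paginateUsingItemCount and makeFakeDatasetWithInfo are hypothetical names; the fake dataset deliberately misreports total to show that the loop does not depend on it.

```javascript
// Paginate using itemCount from dataset.get() as a fixed upper bound,
// ignoring the `total` field of individual listItems() responses.
async function paginateUsingItemCount(dataset, limit = 20) {
  const info = await dataset.get();
  const itemCount = info?.itemCount ?? 0;
  const items = [];

  for (let offset = 0; offset < itemCount; offset += limit) {
    const page = await dataset.listItems({ limit, offset });
    if (page.items.length === 0) break; // nothing left to read
    items.push(...page.items);
  }
  return items;
}

// Hypothetical in-memory stand-in for a dataset client (illustration only).
function makeFakeDatasetWithInfo(size, reportedTotal = size) {
  const all = Array.from({ length: size }, (_, i) => ({ id: i }));
  return {
    async get() {
      return { itemCount: size };
    },
    async listItems({ limit, offset }) {
      const items = all.slice(offset, offset + limit);
      return { items, count: items.length, offset, limit, total: reportedTotal };
    },
  };
}
```

This still assumes itemCount itself is accurate at the time it is read, which is the open question in this thread.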

Here is the code we use to paginate the results. As you can see, we took another approach to work around the issue we were having with the total property. The problem we were having previously is that the total value would sometimes change across multiple listItems() calls.

export const paginateFeedDataset = async (dataset, maxPosts, log, stats) => {
  let hasMore = true;
  const limit = 20; // apify api page size
  let offset = 0; // the offset of the next page
  let total = 0;
  let totalItemsFetched = 0;
  const items = [];

  while (hasMore) {
    const rawResponse = await dataset.listItems({
      limit, // page size
      offset, // item offset
    });
    const {items: pageItems, count = 0} = rawResponse;

    // not currently in use
    total = rawResponse.total;

    log.info(`Fetched batch of ${pageItems.length} posts from Apify.`);

    totalItemsFetched += count;
    offset = totalItemsFetched;
    items.push(...pageItems);

    hasMore = totalItemsFetched < maxPosts && count > 0;
  }
  // Check if account is private
  if (items.length === 1) {
    if (items[0]?.authorMeta?.privateAccount) {
      const msg = `${items[0].authorMeta.name} is a private account.`;
      // log.error(msg);
      throw new Error(msg);
    }
  }
  stats.count("total", totalItemsFetched);
  return items;
};

firebrick_ektara

9 days ago

The initial call to the Apify client is made using this snippet. I am not able to post the entirety of our implementation, but let me know if there is any other key info you need:

const actorOptions = {
  profiles: [userId],
  resultsPerPage: maxPosts,
  excludePinnedPosts: false,
  oldestPostDate: dateString,
  proxyCountryCode: "None",
  waitSecs: 60 * 60 * 24, // 24 hours
};

// get the user's feed. apify will return the history of maxQueries posts, starting at most recent.
const apifyRunResult = await apifyClient
  .actor(apifyActor)
  .call(actorOptions);
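
One detail that may be worth checking in the snippet above: as far as I know, in apify-client for JavaScript, call() takes the Actor input as its first argument and client-side options such as waitSecs as a second argument, so a waitSecs placed inside the input object is sent to the Actor as an input field and does not control how long the client waits. The sketch below shows the split under that assumption. runActorWithWait and makeRecordingActorClient are hypothetical names, and this is not a confirmed cause of the total mismatch.

```javascript
// Assumption: `actorClient` is what apifyClient.actor(actorId) returns.
// The first argument of call() is the Actor input; the second holds
// client-side call options such as waitSecs.
async function runActorWithWait(actorClient, input, waitSecs) {
  return actorClient.call(input, { waitSecs });
}

// Hypothetical stand-in that records what call() received (illustration only).
function makeRecordingActorClient() {
  const recorded = {};
  return {
    recorded,
    async call(input, options) {
      recorded.input = input;
      recorded.options = options;
      return { defaultDatasetId: "fake-dataset-id" };
    },
  };
}
```

Usage would look like runActorWithWait(apifyClient.actor(apifyActor), { profiles: [userId], resultsPerPage: maxPosts }, 60 * 60), keeping waitSecs out of the input object.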