TikTok Scraper

4 days trial then $45.00/month - No credit card required now

clockworks/tiktok-scraper
Extract data from TikTok videos, hashtags, and users. Use URLs or search queries to scrape TikTok profiles, hashtags, posts, URLs, shares, followers, hearts, names, video, and music-related data. Export scraped data, run the scraper via API, schedule and monitor runs or integrate with other tools.

listItems() `total` count does not reflect number of posts returned

Closed

firebrick_ektara opened this issue
2 months ago

Using the ApifyClient SDK, we are seeing an issue where the value of the total field returned by the dataset.listItems() method does not match the number of posts actually returned for paginated results.

The following actor options:

const actorOptions = {
  profiles: [userId],
  resultsPerPage: maxPosts,
  excludePinnedPosts: true,
  oldestPostDate: undefined,
  proxyCountryCode: "None",
  waitSecs: 60 * 60, // 1 hour
};
const apifyRunResult = await apifyClient
  .actor(apifyActor)
  .call(actorOptions);
const dataset = await apifyClient.dataset(
  apifyRunResult.defaultDatasetId,
);
const rawResponse = await dataset.listItems({
  limit, // page size
  offset, // item offset
});
...
const { items, count, total } = rawResponse;

produces inconsistent results. Sometimes the total value reflects the actual count, and sometimes it reports a number like 30 even though more than 30 items are returned in total.

The total value is important to determine when the pagination has completed.
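
To illustrate why a lagging total matters here, the following is a minimal sketch of a loop that trusts total to decide when to stop. The makeFakeDataset helper and its reportedTotal parameter are hypothetical stand-ins, not part of apify-client; they only mimic the { items, count, total } response shape used above and simulate a total that is lower than the real item count.

```javascript
// Hypothetical in-memory stand-in for a dataset client (not part of apify-client).
// `reportedTotal` simulates a `total` value that lags behind the real item count.
function makeFakeDataset(allItems, reportedTotal) {
  return {
    async listItems({ limit, offset }) {
      const items = allItems.slice(offset, offset + limit);
      return { items, count: items.length, total: reportedTotal };
    },
  };
}

// Pagination that stops as soon as `offset` reaches the reported `total`.
async function paginateUsingTotal(dataset, limit) {
  const collected = [];
  let offset = 0;
  let total = Infinity; // unknown until the first page arrives
  while (offset < total) {
    const page = await dataset.listItems({ limit, offset });
    if (page.count === 0) break; // guard against looping forever
    collected.push(...page.items);
    offset += page.count;
    total = page.total;
  }
  return collected;
}
```

With 45 stored items, a reported total of 30, and a page size of 30, this loop stops after the first page and returns only 30 items, which matches the symptom described above.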

svpetrenko

Hi! Thanks for your patience. I tried to reproduce it but couldn't: total always equals the exact number of dataset items. For reference, here is my reproduction:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
  token: 'mytoken',
});

const actorOptions = {
  profiles: ["muslim"],
  resultsPerPage: 1000,
  excludePinnedPosts: true,
  oldestPostDate: undefined,
  proxyCountryCode: "None",
  waitSecs: 60 * 60, // 1 hour
};
const apifyRunResult = await client
  .actor('GdWCkxBtKWOsKjdch')
  .call(actorOptions);
const dataset = client.dataset(
  apifyRunResult.defaultDatasetId,
);

const itemCount = await dataset.get().then((response) => response.itemCount);
for (let i = 0; i < itemCount; i += 30) {
  const rawResponses = await dataset.listItems({
    limit: 30, // page size
    offset: i, // item offset
  });
  if (rawResponses.total !== itemCount) {
    process.exit(1);
  }
  console.log(rawResponses.total);
}

Could you send me your workflow's exact code (with tokens redacted) so I can check further? It could be that the dataset count hasn't had time to update right after the run finishes, which would make this a rare race condition.

firebrick_ektara

2 months ago

Hi, thanks for looking into this. I wasn't aware of the approach you used to get the itemCount - that could be a useful workaround for us.

Here is the code we use to paginate the results. As you can see, we took another approach to work around the issue we are having with the total property. However, the problem we were having previously is that the total value would sometimes change across multiple listItems() calls.

export const paginateFeedDataset = async (dataset, maxPosts, log, stats) => {
  let hasMore = true;
  const limit = 20; // apify api page size
  let offset = 0; // the offset of the next page
  let total = 0;
  let totalItemsFetched = 0;
  const items = [];

  while (hasMore) {
    const rawResponse = await dataset.listItems({
      limit, // page size
      offset, // item offset
    });
    const {items: pageItems, count = 0} = rawResponse;

    // not currently in use
    total = rawResponse.total;

    log.info(`Fetched batch of ${pageItems.length} posts from Apify.`);

    totalItemsFetched += count;
    offset = totalItemsFetched;
    items.push(...pageItems);

    hasMore = totalItemsFetched < maxPosts && count > 0;
  }
  // Check if account is private
  if (items.length === 1) {
    if (items[0]?.authorMeta?.privateAccount) {
      const msg = `${items[0].authorMeta.name} is a private account.`;
      // log.error(msg);
      throw new Error(msg);
    }
  }
  stats.count("total", totalItemsFetched);
  return items;
};

firebrick_ektara

2 months ago

The initial call to the Apify client is made using this snippet. I am not able to post the entirety of our implementation, but let me know if there is any other key info you need:

const actorOptions = {
  profiles: [userId],
  resultsPerPage: maxPosts,
  excludePinnedPosts: false,
  oldestPostDate: dateString,
  proxyCountryCode: "None",
  waitSecs: 60 * 60 * 24, // 24 hours
};

// get the user's feed. apify will return the history of maxQueries posts, starting at most recent.
const apifyRunResult = await apifyClient
  .actor(apifyActor)
  .call(actorOptions);

svpetrenko

Hi! Again, thanks for the patience!

I've managed to reproduce it and forwarded it to our tooling team. In the meantime, I'd recommend using the endpoint you said you could use as a workaround and, if you can, adding a delay before first querying it (like sleep(5000) to wait for 5 seconds), so that the servers have time to update the count.

UPD: The tooling team said there is a known lag in updating this count and recommends waiting 10 seconds before querying it for now.
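
As a rough sketch of that suggestion: plain JavaScript has no built-in sleep(), so the helper below defines one with setTimeout and then reads itemCount through dataset.get(), as in the reproduction earlier in this thread. The name getItemCountAfterDelay and its delayMs parameter are made up for illustration, and the 10-second default is simply the figure quoted above, not a guaranteed bound.

```javascript
// Promise-based delay helper; JavaScript has no built-in sleep().
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: waits `delayMs`, then reads itemCount from the
// dataset metadata. `dataset` only needs a get() method that resolves
// to an object with an itemCount field.
async function getItemCountAfterDelay(dataset, delayMs = 10000) {
  await sleep(delayMs);
  const info = await dataset.get();
  // get() may resolve to undefined if the dataset no longer exists.
  return info?.itemCount;
}
```

A call such as await getItemCountAfterDelay(client.dataset(apifyRunResult.defaultDatasetId)) would then replace the direct dataset.get() call.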

svpetrenko

UPD2: The platform team said that "Updates to dataset object are throttled in API. So dataset stats may change even after actor has finished its run." and apparently they won't fix it. So I'd recommend not relying on the total count too much or, again, waiting ~10 seconds before querying it.

If it's still a problem for you, let me know by reopening the issue.
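
For readers who want to avoid depending on total altogether, one possible approach is to keep requesting pages until an empty page comes back. This is a minimal sketch under that assumption, not an official recommendation; paginateWithoutTotal and its limit and maxItems options are hypothetical names, and dataset is anything exposing listItems({ limit, offset }) that resolves to an object with an items array.

```javascript
// Collects items page by page without consulting `total`.
// `limit` and `maxItems` are illustrative option names, not apify-client API.
async function paginateWithoutTotal(dataset, { limit = 100, maxItems = Infinity } = {}) {
  const collected = [];
  let offset = 0;
  while (collected.length < maxItems) {
    // Never request more than is still needed to reach maxItems.
    const pageSize = Math.min(limit, maxItems - collected.length);
    const { items } = await dataset.listItems({ limit: pageSize, offset });
    if (items.length === 0) break; // an empty page means nothing is left
    collected.push(...items);
    offset += items.length;
  }
  return collected;
}
```

The trade-off is one extra request at the end to observe the empty page, in exchange for not caring whether total has caught up yet.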
