Linkedin Posts Informations Scraper

Pricing

$30.00/month + usage

saswave/linkedin-posts-informations-parser

Developed by

SASWAVE

Maintained by Community

Scrape LinkedIn posts from LinkedIn post search results, a post URL, or a LinkedIn member profile. Supports advanced LinkedIn search filters. Extract post data at scale.

0.0 (0)

27

Monthly users

52

Runs succeeded

94%

Response time

1.7 days

Last modified

a day ago

OP

Broken UTF-8 encoding

Closed
openjoy opened this issue
4 months ago

Hello. There's another issue we found, and it's a bit weird. We see problems with text encoding in post contents. Some Unicode characters are represented incorrectly: mostly emojis, but also some punctuation marks and even non-breaking spaces. Not all Unicode characters are broken, though; some emojis, for example, come through correctly. We tried different tools and libraries to fix the encoding, without success. The broken characters are already present in the Apify datasets, so our guess is that the issue is somewhere in the actor (or the surrounding libraries or infrastructure). Could you please take a look?

Example input:

{
  "cookies": [...],
  "days_since_post": 14,
  "max_posts": 0,
  "url_search": "https://www.linkedin.com/in/danielmoka/"
}

Example post: https://www.linkedin.com/posts/danielmoka_50-off-black-friday-deal-on-learning-activity-7267795633724407808-epJX

What was scraped (copied from APIFY web UI but we see the same picture from other tools): What you’ll get: • 4+ hours of hands-on 𝐯𝐢𝐝𝐞𝐨 𝐭𝐮𝐭𝐨𝐫𝐢𝐚𝐥𝐬 on TDD • A 𝐓𝐃𝐃 𝐞-𝐛𝐨𝐨𝐤 packed with 10+ years of experience • Pro tips on mastering 𝐭𝐞𝐬𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐫𝐞𝐟𝐚𝐜𝐭𝐨𝐫𝐢𝐧𝐠 • 3 𝐫𝐞𝐚𝐥-𝐰𝐨𝐫𝐥𝐝 projects written in C#/.NET

SASWAVE (saswave)

4 months ago

Added to the to-do list for tomorrow morning. We did find that text saved in the Apify dataset isn't always in the same encoding as what we print to the logs before saving (if that makes sense to you).

We will check whether it's code-related or Apify-related when we push to the storage.

OP

openjoy

4 months ago

Thank you for the quick response, as always. I understand that the encoding can get broken in various places. Admittedly, we haven't checked the run logs to investigate. In the end, of course, we just want to grab a dataset file, so I hope this can be fixed. I don't think we had this issue with other actors, but that could be because the results there were within the ASCII charset.

If it's any help, here is one low-level example of broken encoding: the character "♻️" was encoded as C3 A2 C2 99 C2 BB C3 AF C2 B8 C2 8F instead of E2 99 BB EF B8 8F. My not very educated guess is that this could be double encoding, but I don't think this explains all the symptoms.
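The double-encoding guess can be checked against the reported bytes. The following is a minimal sketch, assuming the correct UTF-8 bytes were decoded as Latin-1 (ISO-8859-1) somewhere along the way and then encoded as UTF-8 a second time; the helper name `hex_dump` is made up for illustration and nothing here is taken from the actor's code.

```python
# Assumption: the text was valid UTF-8, was wrongly decoded as Latin-1,
# and was then encoded as UTF-8 again before being stored.
original = "\u267b\ufe0f"  # recycling symbol followed by a variation selector

correct_bytes = original.encode("utf-8")
double_encoded = correct_bytes.decode("latin-1").encode("utf-8")


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_dump(correct_bytes))   # E2 99 BB EF B8 8F
print(hex_dump(double_encoded))  # C3 A2 C2 99 C2 BB C3 AF C2 B8 C2 8F
```

Both lines match the byte sequences quoted above. The presence of C2 99 and C2 8F suggests the intermediate decoding was Latin-1 rather than Windows-1252, since Windows-1252 maps 0x99 to a different character, though this is an inference from a single sample.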

OP

openjoy

4 months ago

I just checked and it's indeed double encoding. We can probably do some post-processing on our side, but it's better to fix the root cause, of course.
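For reference, post-processing of this kind could look roughly like the sketch below. It assumes the intermediate encoding was Latin-1, and it repairs only runs of characters that look like a wrongly decoded UTF-8 sequence, because the thread reports that some characters arrive intact and must be left alone. The names `repair_double_encoding`, `_fix_run` and `_MOJIBAKE_RUN` are hypothetical.

```python
import re

# Characters in U+0080..U+00FF that, read as bytes, form one UTF-8 sequence:
# a lead byte followed by the number of continuation bytes that lead requires.
_MOJIBAKE_RUN = re.compile(
    r"[\u00c2-\u00df][\u0080-\u00bf]"        # 2-byte sequence
    r"|[\u00e0-\u00ef][\u0080-\u00bf]{2}"    # 3-byte sequence
    r"|[\u00f0-\u00f4][\u0080-\u00bf]{3}"    # 4-byte sequence
)


def _fix_run(match: "re.Match[str]") -> str:
    run = match.group(0)
    try:
        # Undo the second encoding: back to raw bytes, then decode as UTF-8.
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not actually valid UTF-8 (e.g. an overlong form); leave it untouched.
        return run


def repair_double_encoding(text: str) -> str:
    """Repair double-encoded runs and keep correctly encoded text as it is."""
    return _MOJIBAKE_RUN.sub(_fix_run, text)


broken = bytes.fromhex("C3A2C299C2BBC3AFC2B8C28F").decode("utf-8")
print(repair_double_encoding(broken) == "\u267b\ufe0f")  # True
```

The trade-off is a small risk of false positives: a legitimate string that happens to contain such a character run would be altered, so it may be worth spot-checking the output on real data before relying on it.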

SASWAVE (saswave)

4 months ago

We have updated the actor; please give it a try.

The problem was probably related to the way we were decoding LinkedIn text content. We removed the decoding step and now return what LinkedIn returns.

OP

openjoy

4 months ago

Thank you for looking into it. I tried re-running one of the tasks with the latest build (0.0.119) and, unfortunately, the issue persists. It seems to be at about the same level as before.

SASWAVE (saswave)

4 months ago

If that didn't help, there isn't much more we can do.

At this point we return the content exactly as LinkedIn returns it to us.

I know the encoding in the dataset is not always the same as that of the data we initially want to save (we faced this kind of issue with another actor).

Do you want us to handle Unicode cleaning (ignoring out-of-scope characters and emojis)?

OP

openjoy

4 months ago

My assumption was that this could be a configuration issue in the HTTP client, headless browser (if that's how this works), crawling library, etc. The content LinkedIn returns to the browser seems to be correct (I double-checked the binary representation), so if you're saying the response was incorrect, it could be worth investigating differences in how the data is requested. Or, like you said, if the issue is with the dataset, then maybe the Apify developers can suggest something.
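One way such a client-side misconfiguration is commonly avoided is to decode the raw response bytes explicitly instead of trusting a guessed or defaulted charset. The sketch below is purely illustrative: `decode_body` and `declared_charset` are hypothetical names, and nothing in this thread confirms how the actor actually fetches or decodes its data.

```python
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode response bytes, preferring strict UTF-8 over a guessed charset.

    Hypothetical helper: try UTF-8 first, and fall back to the declared
    charset (or Latin-1) only when the bytes are not valid UTF-8.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared_charset or "latin-1", errors="replace")


# UTF-8 bytes are decoded correctly even if the declared charset is wrong.
raw = "\u267b\ufe0f deal".encode("utf-8")
print(decode_body(raw, "iso-8859-1") == "\u267b\ufe0f deal")  # True
```

Trying strict UTF-8 first works because text in a legacy single-byte encoding that contains non-ASCII characters rarely happens to be valid UTF-8, so the fallback path is taken only when it is needed.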

In any case, while this is not ideal, we can implement an ugly workaround on our side. Please don't filter out the content, as the data is still there. Also, I was wrong to call the encoding "broken". It's perfectly valid UTF-8, just with some double encoding here and there. Filtering it would be as difficult as fixing the data.

Pricing

Pricing model

Rental 

To use this Actor, you have to pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for the Apify platform usage.

Free trial

3 days

Price

$30.00