
✨Mass Linkedin Profile Scraper with Email 📧 (No Cookies)
Scrape Linkedin profiles and get full information of the lead.
4.7 (39)
Pricing
$10.00 / 1,000 results
Total users: 8.6K
Monthly users: 4K
Runs succeeded: >99%
Issues response: 11 hours
Last modified: 2 days ago
Comprehensive Issues in LinkedIn Lead Processing Workflow: LinkedIn URL Normalization and Deduplication Failures, Apify Scraper Duplicates & Limits, and Intermittent Email Validation Errors (n8n v1.97.1)
Closed
Bug Description
I'm encountering a series of interconnected and persistent issues with my n8n workflow designed for LinkedIn lead processing, specifically concerning LinkedIn URL normalization, deduplication, Apify scraping, and email validation. These problems have been present since I started building this complex workflow.
My overall workflow aims to:
1. Get leads from Google Sheets.
2. Normalize LinkedIn profile URLs for consistent identification.
3. Deduplicate leads based on normalized LinkedIn URLs and email addresses.
4. Scrape LinkedIn profile data using Apify.
5. Validate email addresses using EmailGuard.io.
6. Generate personalized outreach materials and save results to Outlook.

Here's a detailed breakdown of the problems I've faced, chronologically where possible:
Phase 1: Initial Setup and Deduplication Challenges
Problem 1: Difficulty with Accurate Deduplication (Initial State)
Initial Goal: My primary goal from the beginning was to prevent processing duplicate leads. Leads often come from Google Sheets, where a single person might appear multiple times with slightly different data, or multiple times with the same core LinkedIn URL but varied parameters (e.g., ?trk=).

Challenge: I needed a reliable way to identify unique individuals, and the LinkedIn profile URL seemed the most robust identifier. However, direct comparison of raw LinkedIn URLs was failing due to these variations.

Problem 2: "normalizedLinkedinUrl" Field Missing in "Remove Duplicates" Node (First Major Hurdle)
Introduction of Normalization: To address the deduplication challenge, I implemented a Code node (named Normalize LinkedIn URL) early in my workflow, directly after fetching leads from Google Sheets. Its purpose was to clean up LinkedIn URLs by removing the ? and # components, producing a consistent, normalized URL (e.g., https://www.linkedin.com/in/pamelajgoodwin/) stored in a new field called normalizedLinkedinUrl. (A simplified sketch of this logic follows below.)

First Error: When I then tried to use a Remove Duplicates node, configured to compare on this new normalizedLinkedinUrl field, it consistently failed with the error: "normalizedLinkedinUrl" field is missing from some input items.

Debugging Attempts: I inspected the output of my Normalize LinkedIn URL node, and for most items the normalizedLinkedinUrl field seemed to be correctly generated and present. I tried adding an IF node (IF normalized linkedin Exists) before Remove Duplicates to filter out items without this field, but the error persisted, suggesting that either the IF condition wasn't catching all cases or the data flow was more complex than expected. I was confused about why Remove Duplicates was still complaining if the IF node was supposed to ensure the field existed.
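For concreteness, the node's logic is roughly the following (a simplified sketch, minus my real error handling; the input field name personLinkedin matches what my Apify request body references):

```javascript
// n8n Code node ("Run Once for All Items"): Normalize LinkedIn URL sketch.
const items = $input.all();
for (const item of items) {
  const raw = (item.json.personLinkedin || '').trim();
  try {
    const url = new URL(raw);
    // Keep only the path; drop the ?query and #fragment parts and any
    // trailing slash, so ".../in/pamelajgoodwin?trk=..." and
    // ".../in/pamelajgoodwin/" collapse to the same key.
    const path = url.pathname.replace(/\/+$/, '').toLowerCase();
    item.json.normalizedLinkedinUrl = `https://www.linkedin.com${path}/`;
  } catch (err) {
    // Unparseable or empty URL: set an empty string rather than omitting
    // the field, so Remove Duplicates never sees a missing field.
    item.json.normalizedLinkedinUrl = '';
  }
}
return items;
```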
Problem 3: Data Stream Mismatch and Field Nesting Issues (Root Cause of Deduplication Failure)

Discovery: Through detailed debugging, I realized the core issue was that my workflow had branched: one path was normalizing LinkedIn URLs, and another path (my email validation branch, involving EmailGuard and mails.so Outlook) was processing emails.

Key Insight: The normalizedLinkedinUrl was being added at the top level of item.json in one branch, while the email validation data (especially email after processing by EmailGuard) was often nested under item.json.data.

Consequence: When the paths reconverged, the Remove Duplicates node (and the IF normalized linkedin Exists node before it) received items where either normalizedLinkedinUrl was missing or email was nested incorrectly, producing the "field missing" error.

Solution Implemented: I added a Set node (prepare LinkedIn Data) after Normalize LinkedIn URL to explicitly ensure normalizedLinkedinUrl and email sit at the root level of item.json. I added another Set node (prepare Email Data) after the True branch of my IF email = deliverable1 node (from the email validation path); it brings the email field to the top level ($json.email) and also explicitly sets normalizedLinkedinUrl to an empty string ("") for items coming only through the email path, guaranteeing the field exists for all items. Crucially, I then inserted a Merge node (in Append mode) to combine the outputs of prepare LinkedIn Data and prepare Email Data, so that all items reaching downstream nodes (like IF normalized linkedin Exists and Remove Duplicates) consistently have both normalizedLinkedinUrl and email at the top level. I updated the IF normalized linkedin Exists condition to {{ $json.normalizedLinkedinUrl }} is Exists, and the Remove Duplicates comparison fields to normalizedLinkedinUrl,email (removing any data. prefixes).

Current Status: This structural fix significantly improved the deduplication process and resolved the "missing field" errors.
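The same shape guarantee could also be expressed as a single Code node on each branch instead of the Set nodes; a minimal sketch, assuming the EmailGuard branch nests its payload under item.json.data as described above:

```javascript
// n8n Code node: give both branches an identical item shape before the
// Merge (Append) node by hoisting nested fields to the top of item.json.
const items = $input.all();
for (const item of items) {
  const j = item.json;
  // EmailGuard responses often nest their payload under j.data.
  const email = j.email ?? j.data?.email ?? '';
  // Items from the email-only path never had this field; default it so
  // Remove Duplicates always sees it present.
  const normalizedLinkedinUrl = j.normalizedLinkedinUrl ?? '';
  item.json = { ...j, email, normalizedLinkedinUrl };
}
return items;
```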
Phase 2: Apify Scraper Problems
Problem 4: Apify Scraper Returns Duplicate Leads
Observation: Despite implementing the LinkedIn URL normalization and subsequent deduplication, I noticed that the apify - person LinkedIn Scrape node (my HTTP Request node calling the Apify LinkedIn Profile Scraper actor) was still consistently returning duplicate scraped data. For instance, with 10 unique LinkedIn URLs as input, the output would include 6 identical scraped profiles, even though they originated from distinct input URLs. My Apify console shows multiple successful runs that each returned "1 result" in the dataset, but these results often lead to duplicates in my workflow.

Details: I confirmed that the Loop Over Items node was correctly passing unique personLinkedin URLs to the Apify scraper on each iteration. The Apify node's JSON body is configured to send {"profileUrls": ["{{ $json.personLinkedin }}"]}, i.e., a single URL per request. I explicitly verified that the "Batching" setting on the Apify HTTP Request node was OFF. I tried adding Wait nodes (e.g., a 22 sec wait! node after Apify) to mitigate potential rate limits, but the duplicates persisted. When checking the Apify Console, I could see unique run IDs for each n8n execution, but the datasets retrieved from those runs contained the same duplicate data. (A defensive deduplication sketch follows below.)
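As a stopgap until the root cause is found, a Code node placed after the scraping loop could collapse the duplicates. A minimal sketch, assuming the scraped items expose publicIdentifier (the later add groupKey1 node expects that field), with the input URL as a fallback key:

```javascript
// n8n Code node (after the scraping loop): drop scraped profiles that
// have already been seen in this execution.
const seen = new Set();
const unique = [];
for (const item of $input.all()) {
  const key = item.json.publicIdentifier || item.json.personLinkedin || '';
  if (key && seen.has(key)) continue; // duplicate profile: skip it
  seen.add(key);
  unique.push(item);
}
return unique;
```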
Problem 5: Apify Scraper Exhibiting "Limits" Errors with Webhook Triggers

Context: In earlier iterations of my workflow (specifically "Workflow A" and "Workflow B"), which were triggered by webhooks, the Apify scraper would frequently return "limits"-related errors.

Observation: I have encountered "Payment required - perhaps check your payment details?" and "Problem in node 'apify - person LinkedIn Scrape' Payment required - perhaps check your payment details?" errors. This suggests the workflow hits Apify's API rate or usage limits more aggressively when triggered externally via webhooks than during manual execution.

Interplay with other problems: Strangely, when I simplified the workflow (by temporarily removing the entire "normalized LinkedIn path" branch), the Apify "limits" errors seemed to subside, but the simplification then caused the EmailGuard problem (Problem 6) to surface or become more prominent. This implies a complex, perhaps resource-intensive interaction between the different parts of my workflow. (A retry sketch for ruling out transient rate limiting follows below.)
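To rule out transient rate limiting as the trigger, one option is a retry wrapper around the Apify call. A rough sketch (ACTOR_ID and APIFY_TOKEN are placeholders; whether the Code node exposes global fetch depends on the n8n version). It retries on 429 only, because a 402 "Payment required" is a plan limit and retrying cannot help there:

```javascript
// n8n Code node sketch: call the Apify run-sync endpoint with backoff.
const endpoint =
  'https://api.apify.com/v2/acts/ACTOR_ID/run-sync-get-dataset-items?token=APIFY_TOKEN';

async function scrapeProfile(profileUrl, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ profileUrls: [profileUrl] }),
    });
    if (res.status === 429) {
      // Rate limited: exponential backoff, then retry.
      await new Promise((r) => setTimeout(r, 2 ** attempt * 5000));
      continue;
    }
    if (res.status === 402) throw new Error('Apify plan limit reached (402)');
    if (!res.ok) throw new Error(`Apify returned HTTP ${res.status}`);
    return res.json(); // the dataset items for this run
  }
  throw new Error('Still rate-limited after retries');
}

const profiles = await scrapeProfile($input.first().json.personLinkedin);
return profiles.map((p) => ({ json: p }));
```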
Phase 3: EmailGuard Validation Problems
Problem 6: Intermittent "Email field is required" Error on EmailGuard Node (Prominent in Simplified Workflow)
Context: This error became particularly noticeable and problematic when I simplified the workflow (removing the LinkedIn normalization path) to troubleshoot the Apify limits issue.

Observation: The EmailGuard1 node (an HTTP Request node) would intermittently fail for specific items (like itemIndex: 2) with the error: "The email field is required.".

Debugging Efforts: I verified that the input to EmailGuard1 for the failing item clearly showed a valid email field with a populated string value. I confirmed that the node's JSON body correctly references the email dynamically: {"email": "{{ $json.email }}"}. (I had initially hardcoded this to {"email": ""} by accident; that was identified and corrected, but the intermittent error persists even after the fix.) The error claims the field is "required" while the input data shows it present, which is highly perplexing and suggests an underlying issue in how n8n serializes the request, or in how the EmailGuard API interprets it for certain specific email values. (A pre-flight check sketch follows below.)
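One way to narrow this down is a small Code node directly before EmailGuard1 that trims the value (a stray space or newline can make an API treat a field as empty) and logs the exact body about to be sent; a sketch:

```javascript
// n8n Code node placed directly before EmailGuard1: sanity-check the
// email payload and log it, to compare what n8n believes it is sending
// against the "email field is required" response.
const items = $input.all();
items.forEach((item, index) => {
  const email = String(item.json.email ?? '').trim();
  if (!email) {
    // Fail loudly here instead of letting the API reject it opaquely.
    throw new Error(`Item ${index}: email is empty after trimming`);
  }
  item.json.email = email;
  console.log(`item ${index} payload:`, JSON.stringify({ email }));
});
return items;
```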
Phase 4: Other Observed Errors
Problem 7: "Cannot read properties of undefined (reading 'publicIdentifier')" Error in the add groupKey1 Node

Observation: I've also encountered an error in an add groupKey1 node, stating "Cannot read properties of undefined (reading 'publicIdentifier')". This seems to indicate a data flow or data structure issue where the publicIdentifier field is expected but not present or defined at that point in the workflow. This specific error is visible in one of my comprehensive workflow screenshots, suggesting it occurs at a later stage. (A guarded sketch of the assignment follows below.)
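A guarded version of the assignment would avoid the hard crash and flag the bad items instead; a sketch that checks both a top-level and a nested location, since the exact nesting of publicIdentifier in the scraped data is the open question:

```javascript
// Guarded sketch of the 'add groupKey1' assignment. Optional chaining
// yields undefined instead of throwing when the parent object is absent.
const items = $input.all();
for (const item of items) {
  const id = item.json.publicIdentifier ?? item.json.data?.publicIdentifier;
  if (!id) {
    item.json.groupKey = '';
    item.json.groupKeyError = 'publicIdentifier missing at this node';
    continue;
  }
  item.json.groupKey = String(id).toLowerCase();
}
return items;
```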
Current Status and Impact
These combined issues are severely hindering the reliability and efficiency of my lead processing workflow. The deduplication struggles, the Apify duplicates and limits, the intermittent EmailGuard failures, and the other data-related errors mean I cannot trust the integrity or completeness of the processed leads.
Problems 1, 2, 3, and 6: These problems are related to your workflow itself. You will need to make changes in your workflow to solve them.
Problem 4: Please share the run IDs where the actor returned duplicate data.
Problem 5: You need a paid Apify plan for higher limits.
Problem 7: Please share the run ID where publicIdentifier was missing from the response.
voguish_graph
lol did you use gpt to make this bug desc?