
Linkedin Url Scrape
Deprecated
Scrape Unlimited LinkedIn Profile URLs
Pricing
Pay per usage
Last modified
a year ago
.actor/Dockerfile
# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.11

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]
.actor/actor.json
{ "actorSpecification": 1, "name": "my-actor-15", "title": "Scrape single page in Python", "description": "Scrape data from single page with provided URL.", "version": "0.0", "meta": { "templateId": "python-start" }, "input": "./input_schema.json", "dockerfile": "./Dockerfile"}
.actor/input_schema.json
{ "title": "Scrape LinkedIn profiles based on keywords", "type": "object", "schemaVersion": 1, "properties": { "keywords": { "title": "Search Keywords", "type": "array", "description": "Enter the keywords to search for LinkedIn profiles, e.g., job titles, industries, locations.", "editor": "stringList", "items": { "type": "string" }, "prefill": ["chief product officer", "united states", "insurance"] }, "numPages": { "title": "Number of Pages", "type": "integer", "description": "The number of pages to scrape (each page corresponds to a set of search results).", "editor": "number", "minimum": 1, "default": 1 } }, "required": ["keywords", "numPages"]}
src/__main__.py
1"""2This module serves as the entry point for executing the Apify Actor. It handles the configuration of logging3settings. The `main()` coroutine is then executed using `asyncio.run()`.4
5Feel free to modify this file to suit your specific needs.6"""7
8import asyncio9import logging10
11from apify.log import ActorLogFormatter12
13from .main import main14
15# Configure loggers16handler = logging.StreamHandler()17handler.setFormatter(ActorLogFormatter())18
19apify_client_logger = logging.getLogger('apify_client')20apify_client_logger.setLevel(logging.INFO)21apify_client_logger.addHandler(handler)22
23apify_logger = logging.getLogger('apify')24apify_logger.setLevel(logging.DEBUG)25apify_logger.addHandler(handler)26
27# Execute the Actor main coroutine28asyncio.run(main())
src/main.py
import asyncio
from bs4 import BeautifulSoup
import requests
from apify import Actor
import re


async def main() -> None:
    async with Actor() as actor:
        actor_input = await actor.get_input() or {}
        keywords = actor_input.get('keywords', ["chief product officer", "united states", "insurance"])
        num_pages = actor_input.get('numPages', 1)  # Get the number of pages from input

        base_url = 'https://www.google.com/search?q=site%3Alinkedin.com%2Fin%2F+'
        formatted_keywords = '+'.join(f'(%22{keyword.replace(" ", "+")}%22)' for keyword in keywords)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

        linkedin_urls = []

        # Loop through pages based on `num_pages`, incrementing `start` by 10 for each page
        for page_start in range(0, num_pages * 10, 10):
            url = f"{base_url}{formatted_keywords}&start={page_start}"
            print(url)
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            links = soup.find_all('a', href=True)

            # Extract LinkedIn URLs
            for link in links:
                match = re.search(r'(https?://www\.linkedin\.com/in/[^&]+)', link['href'])
                if match:
                    linkedin_url = match.group(1)
                    if linkedin_url not in linkedin_urls:  # Avoid duplicates
                        linkedin_urls.append(linkedin_url)

        # Output the LinkedIn URLs
        for url in linkedin_urls:
            await actor.push_data({"LinkedIn URL": url})

        actor.log.info(f"Found and saved {len(linkedin_urls)} LinkedIn URLs based on the keywords across {num_pages} pages.")


if __name__ == '__main__':
    asyncio.run(main())
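To make the query construction and URL extraction in src/main.py easier to follow, here is a small standalone sketch that reuses the same logic outside the Actor; the sample href value is made up purely for illustration:

import re

keywords = ["chief product officer", "united states", "insurance"]

# Build the Google query the same way main.py does: each keyword becomes a quoted, URL-encoded phrase
base_url = 'https://www.google.com/search?q=site%3Alinkedin.com%2Fin%2F+'
formatted_keywords = '+'.join(f'(%22{keyword.replace(" ", "+")}%22)' for keyword in keywords)
print(f"{base_url}{formatted_keywords}&start=0")
# -> https://www.google.com/search?q=site%3Alinkedin.com%2Fin%2F+(%22chief+product+officer%22)+(%22united+states%22)+(%22insurance%22)&start=0

# Pull the profile URL out of a Google result link (hypothetical href, for demonstration only)
href = '/url?q=https://www.linkedin.com/in/some-profile&sa=U'
match = re.search(r'(https?://www\.linkedin\.com/in/[^&]+)', href)
if match:
    print(match.group(1))  # -> https://www.linkedin.com/in/some-profile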
.dockerignore
# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
.venv

# git folder
.git
.editorconfig
root = true
[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf
.gitignore
# This file tells Git which files shouldn't be added to source control
.idea
.DS_Store

apify_storage
storage/*
!storage/key_value_stores
storage/key_value_stores/*
!storage/key_value_stores/default
storage/key_value_stores/default/*
!storage/key_value_stores/default/INPUT.json

.venv/
.env/
__pypackages__
dist/
build/
*.egg-info/
*.egg

__pycache__

.mypy_cache
.dmypy.json
dmypy.json
.pytest_cache
.ruff_cache

.scrapy
*.log
requirements.txt
# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify ~= 1.6.0
beautifulsoup4 ~= 4.12.2
httpx ~= 0.25.2
types-beautifulsoup4 ~= 4.12.0.7
requests ~= 2.28.1