
Youtube User / Channel Scraper

gentle_cloud/youtube-user-channel-scraper


Use this Actor to automatically collect information about YouTube channels and their owners: nickname, avatar, country, description, join date, number of videos, number of views, number of subscribers, and more. You only need to provide the channel URLs, or the handles (IDs containing "@") or channel IDs.

.actor/Dockerfile

# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.11

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "youtube-kol-crawler",
    "title": "Youtube Kol Info Crawler",
    "description": "Scrape data from youtube page with channel id.",
    "version": "0.0",
    "meta": {
        "templateId": "python-start"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": [
                            "channelId",
                            "avatar",
                            "banner",
                            "title",
                            "verified",
                            "hasbusinessEmail",
                            "joinDate",
                            "country",
                            "viewCount",
                            "videoCount",
                            "subscriberCount",
                            "description",
                            "links",
                            "indexUrl",
                            "channelUrl"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "channelId": {
                                "label": "Text",
                                "format": "text"
                            },
                            "avatar": {
                                "label": "Image",
                                "format": "image"
                            },
                            "banner": {
                                "label": "Image",
                                "format": "image"
                            },
                            "title": {
                                "label": "Text",
                                "format": "text"
                            },
                            "verified": {
                                "label": "Boolean",
                                "format": "boolean"
                            },
                            "hasbusinessEmail": {
                                "label": "Boolean",
                                "format": "boolean"
                            },
                            "indexUrl": {
                                "label": "Link",
                                "format": "link"
                            },
                            "channelUrl": {
                                "label": "Link",
                                "format": "link"
                            },
                            "description": {
                                "label": "Text",
                                "format": "text"
                            },
                            "joinDate": {
                                "label": "Text",
                                "format": "text"
                            },
                            "country": {
                                "label": "Text",
                                "format": "text"
                            },
                            "links": {
                                "label": "Array",
                                "format": "array"
                            },
                            "viewCount": {
                                "label": "Number",
                                "format": "number"
                            },
                            "videoCount": {
                                "label": "Number",
                                "format": "number"
                            },
                            "subscriberCount": {
                                "label": "Number",
                                "format": "number"
                            }
                        }
                    }
                }
            }
        }
    }
}
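
For orientation, each record pushed to the dataset is expected to carry the fields listed in the "overview" view above. The following is a minimal sketch that checks a record against that list. The helper name `missing_fields` and the sample values are illustrative assumptions and are not part of the Actor.

```python
# Field names copied from the "overview" view in .actor/actor.json.
OVERVIEW_FIELDS = [
    "channelId", "avatar", "banner", "title", "verified",
    "hasbusinessEmail", "joinDate", "country", "viewCount",
    "videoCount", "subscriberCount", "description", "links",
    "indexUrl", "channelUrl",
]


def missing_fields(item: dict) -> list:
    """Return the overview fields that are absent from a dataset item."""
    return [field for field in OVERVIEW_FIELDS if field not in item]


# Hypothetical record: the values are placeholders, not real scraped data.
sample_item = {field: None for field in OVERVIEW_FIELDS}
sample_item.update({"channelId": "UCyBD3P9YOFWNIMTuDzqeObg", "verified": 1})

print(missing_fields(sample_item))
print(missing_fields({"channelId": "only-an-id"}))
```

A check like this could run before `Actor.push_data()` so that a change in the page structure shows up as a missing field rather than as an empty table column.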

.actor/input_schema.json

{
    "title": "Scrape data from a web page",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "Paste the YouTube home page URLs; channel URLs are also supported",
            "editor": "requestListSources",
            "prefill": [{"url": "https://www.youtube.com/channel/UCyBD3P9YOFWNIMTuDzqeObg"}]
        },
        "ids": {
            "title": "IDs",
            "type": "string",
            "description": "Paste the YouTube ids; channel ids are also supported. Separate multiple values with ','",
            "editor": "textfield",
            "prefill": "UCyBD3P9YOFWNIMTuDzqeObg"
        }
    }
}
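
To illustrate how the two inputs combine, here is a minimal sketch of turning the comma-separated `ids` string into request dicts and merging them with `start_urls`. It follows the rule used in src/main.py (entries containing "@" are treated as handles, everything else as a channel ID). The function name `ids_to_start_urls`, the whitespace stripping, and the `@example` handle are assumptions made for this example.

```python
def ids_to_start_urls(ids: str) -> list:
    """Turn a comma-separated `ids` string into a list of {"url": ...} dicts."""
    urls = []
    for sid in ids.split(","):
        sid = sid.strip()  # assumption: tolerate spaces around the commas
        if not sid:
            continue
        if "@" in sid:
            # Handles live directly under the site root.
            urls.append({"url": "https://www.youtube.com/" + sid})
        else:
            # Anything else is treated as a channel ID.
            urls.append({"url": "https://www.youtube.com/channel/" + sid})
    return urls


# Example input matching input_schema.json; "@example" is a made-up handle.
actor_input = {
    "start_urls": [{"url": "https://www.youtube.com/channel/UCyBD3P9YOFWNIMTuDzqeObg"}],
    "ids": "UCyBD3P9YOFWNIMTuDzqeObg, @example",
}

all_urls = actor_input["start_urls"] + ids_to_start_urls(actor_input["ids"])
for entry in all_urls:
    print(entry["url"])
```

Note that a channel supplied through both inputs is requested twice in this sketch; deduplicating on the URL would be a natural extension.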

src/__main__.py

"""
This module serves as the entry point for executing the Apify Actor. It handles the configuration of logging
settings. The `main()` coroutine is then executed using `asyncio.run()`.

Feel free to modify this file to suit your specific needs.
"""

import asyncio
import logging

from apify.log import ActorLogFormatter

from .main import main

# Configure loggers
handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.INFO)
apify_client_logger.addHandler(handler)

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)

# Execute the Actor main coroutine
asyncio.run(main())

src/main.py

"""
This module defines the `main()` coroutine for the Apify Actor, executed from the `__main__.py` file.

Feel free to modify this file to suit your specific needs.

To build Apify Actors, utilize the Apify SDK toolkit, read more at the official documentation:
https://docs.apify.com/sdk/python
"""

import json
import re

# Requests - library for making HTTP requests, read more at https://requests.readthedocs.io/
import requests
# lxml - library for parsing HTML and XML, read more at https://lxml.de/
from lxml import etree

# Apify SDK - toolkit for building Apify Actors, read more at https://docs.apify.com/sdk/python
from apify import Actor


def get_count(text):
    """
    Extract a number from a count string such as "1.2K subscribers".

    :param text: string to extract the count from
    :return: the count as an int, or None if the text is empty
    """
    if not text:
        return None
    count = text.replace(",", "").split(" ")[0]
    # Expand abbreviated suffixes; round() avoids float truncation (2.3 * 1000 == 2299.99...)
    for suffix, factor in (("K", 1_000), ("M", 1_000_000), ("B", 1_000_000_000)):
        if suffix in count:
            return round(float(count.split(suffix)[0]) * factor)
    return int(count)


def get_link_dict(links):
    """
    Extract the details of the external links.

    :param links: list of link objects to extract from
    :return: list of dicts with the title and target of each link
    """
    link_list = []
    for link in links:
        item = dict()
        item["title"] = link["channelExternalLinkViewModel"]["title"]["content"]
        item["link"] = link["channelExternalLinkViewModel"]["link"]["content"]
        link_list.append(item)
    return link_list


async def main() -> None:
    """
    The main coroutine is being executed using `asyncio.run()`, so do not attempt to make a normal function
    out of it, it will not work. Asynchronous execution is required for communication with Apify platform,
    and it also enhances performance in the field of web scraping significantly.
    """
    async with Actor:
        # Structure of input is defined in input_schema.json
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls') or []
        ids = actor_input.get('ids')
        Actor.log.info(f'ids: {ids}')
        # Entries containing "@" are handles; everything else is treated as a channel ID
        id_urls = []
        if ids:
            for sid in ids.split(','):
                sid = sid.strip()
                if not sid:
                    continue
                prefix = "https://www.youtube.com/" if "@" in sid else "https://www.youtube.com/channel/"
                id_urls.append({"url": prefix + sid})
        start_urls.extend(id_urls)
        items = []
        # Request headers that imitate a desktop browser
        ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
        headers = {
            "authority": "www.youtube.com",
            'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                      "*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "accept-language": "en-US,en;q=0.5",
            "cache-control": "no-cache",
            "pragma": "no-cache",
            "user-agent": ua
        }
        for start_url in start_urls:
            Actor.log.info(f'crawling: {start_url}')
            try:
                response = requests.get(start_url.get("url"), headers=headers, timeout=30)
                if "This account has been terminated" in response.text:
                    Actor.log.info(f"The account has been terminated: {response.url}")
                else:
                    html = etree.HTML(response.text)
                    # The banner is optional, so a missing match is not an error
                    banner_match = re.search(r'"url":"(https://[^"]+)","width":2560,"', response.text)
                    banner = banner_match.group(1) if banner_match else None
                    name = html.xpath("//title/text()")[0].split(" -")[0]
                    verified = 1 if 'CHECK_CIRCLE_THICK' in response.text else 0
                    avatar = re.search(r'"avatar":\{"thumbnails":\[\{"url":"(.*?)"', response.text).group(1)
                    # The last continuation token on the page is used to request the "about" panel
                    kol_token = re.findall(r'"token":"(.*?)"', response.text)[-1]
                    payload = json.dumps({"context": {
                        "client": {"gl": "US", "deviceMake": "Apple", "deviceModel": "",
                                    "userAgent": ua,
                                    "clientName": "WEB", "clientVersion": "2.20240224.11.00", "osName": "Macintosh",
                                    }, "user": {"lockedSafetyMode": False},
                        "request": {"useSsl": True, "internalExperimentFlags": [], "consistencyTokenJars": []},
                    }, "continuation": kol_token})
                    browse_url = "https://www.youtube.com/youtubei/v1/browse?prettyPrint=false"
                    response = requests.post(browse_url, headers=headers, data=payload, timeout=30)
                    info_dict = response.json()["onResponseReceivedEndpoints"][0]["appendContinuationItemsAction"][
                        "continuationItems"][0]["aboutChannelRenderer"]["metadata"]["aboutChannelViewModel"]
                    # Build the output record from the "about" panel data
                    item = {}
                    item["channelId"] = info_dict.get("channelId")  # channel ID
                    item["avatar"] = avatar  # avatar image URL
                    item["banner"] = banner  # banner image URL
                    item["title"] = name  # channel nickname
                    item["verified"] = verified  # whether the channel is verified
                    item["hasbusinessEmail"] = 1 if info_dict.get("signInForBusinessEmail") else 0  # whether a business email exists
                    item["indexUrl"] = info_dict.get("canonicalChannelUrl")  # home page URL
                    item["channelUrl"] = "https://www.youtube.com/channel/" + info_dict.get("channelId")
                    item["description"] = info_dict.get("description")  # channel description
                    item["joinDate"] = info_dict.get("joinedDateText").get("content").split("Joined ")[-1]  # join date
                    item["country"] = info_dict.get("country")  # country
                    links = get_link_dict(info_dict.get("links", []))
                    item["links"] = links if links else None  # external link details
                    item["viewCount"] = get_count(info_dict.get("viewCountText"))  # total views
                    item["videoCount"] = get_count(info_dict.get("videoCountText"))  # number of videos
                    item["subscriberCount"] = get_count(info_dict.get("subscriberCountText"))  # number of subscribers
                    Actor.log.info(f'Extracted channel: {item}')
                    items.append(item)
            except Exception:
                Actor.log.exception(f"There was a problem with the request: {start_url}")

        # Save the records to Dataset - a table-like storage
        await Actor.push_data(items)
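
The count fields arrive as display strings rather than numbers, which is why `get_count()` exists. Below is a standalone sketch of that conversion, assuming strings shaped like "1.2K subscribers" or "12,345 videos"; the sample strings, the name `parse_count`, and the handling of a "B" suffix are assumptions for illustration. It rounds instead of truncating, because `float("2.3") * 1000` evaluates to slightly less than 2300.

```python
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert a display string such as '1.2K subscribers' to an int."""
    if not text:
        return None
    # Drop thousands separators and keep only the leading token.
    token = text.replace(",", "").split(" ")[0]
    if not token:
        return None
    suffix = token[-1].upper()
    if suffix in MULTIPLIERS:
        # round() rather than int() so 2.3K becomes 2300, not 2299.
        return round(float(token[:-1]) * MULTIPLIERS[suffix])
    return int(token)


# Made-up sample strings in the assumed format.
for sample in ["2.3K subscribers", "1.23M subscribers", "12,345 videos", None]:
    print(repr(sample), "->", parse_count(sample))
```

Strings that do not start with a number (for example a localized "No videos") raise `ValueError` in this sketch, so callers would still need the surrounding error handling.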

.dockerignore

# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
.venv

# git folder
.git

.editorconfig

root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf

.gitignore

# This file tells Git which files shouldn't be added to source control

.idea
.DS_Store

apify_storage
storage/*
!storage/key_value_stores
storage/key_value_stores/*
!storage/key_value_stores/default
storage/key_value_stores/default/*
!storage/key_value_stores/default/INPUT.json

.venv/
.env/
__pypackages__
dist/
build/
*.egg-info/
*.egg

__pycache__

.mypy_cache
.dmypy.json
dmypy.json
.pytest_cache
.ruff_cache

.scrapy
*.log

requirements.txt

# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify ~= 1.6.0
requests
lxml

Developer

Monkey Coder

Actor Metrics

29 monthly users
16 bookmarks
>99% runs succeeded
Created in Feb 2024
Modified 5 months ago
