# Pdf To Text Scraper (`getdataforme/pdf-to-text-scraper`) Actor

The Pdf To Text Scraper is an Apify Actor that efficiently extracts text from PDFs, preserving structure and supporting batch processing....

- **URL**: https://apify.com/getdataforme/pdf-to-text-scraper.md
- **Developed by:** [GetDataForMe](https://apify.com/getdataforme) (community)
- **Categories:** AI, Automation, E-commerce
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $9.00 / 1,000 results

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are a software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

````bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```bash

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Python Scrapy template

<!-- This is an Apify template readme -->

A template example built with Scrapy to scrape page titles from URLs defined in the input parameter. It shows how to use Apify SDK for Python and Scrapy pipelines to save results.

### Included features

- **[Apify SDK](https://docs.apify.com/sdk/python/)** for Python - a toolkit for building Apify [Actors](https://apify.com/actors) and scrapers in Python
- **[Input schema](https://docs.apify.com/platform/actors/development/input-schema)** - define and easily validate a schema for your Actor's input
- **[Request queue](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-request-queues)** - queues into which you can put the URLs you want to scrape
- **[Dataset](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-datasets)** - store structured data where each object stored has the same attributes
- **[Scrapy](https://scrapy.org/)** - a fast high-level web scraping framework

### How it works

This code is a Python script that uses Scrapy to scrape web pages and extract data from them. Here's a brief overview of how it works:

- The script reads the input data from the Actor instance, which is expected to contain a `start_urls` key with a list of URLs to scrape.
- The script then creates a Scrapy spider that will scrape the URLs. This Spider (class `TitleSpider`) is storing URLs and titles.
- Scrapy pipeline is used to save the results to the default dataset associated with the Actor run using the `push_data` method of the Actor instance.
- The script catches any exceptions that occur during the [web scraping](https://apify.com/web-scraping) process and logs an error message using the `Actor.log.exception` method.

### Resources

- [Web scraping with Scrapy](https://blog.apify.com/web-scraping-with-scrapy/)
- [Python tutorials in Academy](https://docs.apify.com/academy/python)
- [Alternatives to Scrapy for web scraping in 2023](https://blog.apify.com/alternatives-scrapy-web-scraping/)
- [Beautiful Soup vs. Scrapy for web scraping](https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/)
- [Integration with Zapier](https://apify.com/integrations), Make, Google Drive, and others
- [Video guide on getting scraped data using Apify API](https://www.youtube.com/watch?v=ViYYDHSBAKM)
- A short guide on how to build web scrapers using code templates:

[web scraper template](https://www.youtube.com/watch?v=u-i-Korzf8w)


### Getting started

For complete information [see this article](https://docs.apify.com/platform/actors/development#build-actor-locally). To run the Actor use the following command:

```bash
apify run
````

### Deploy to Apify

#### Connect Git repository to Apify

If you've created a Git repository for the project, you can easily connect to Apify:

1. Go to [Actor creation page](https://console.apify.com/actors/new)
2. Click on **Link Git Repository** button

#### Push project on your local machine to Apify

You can also deploy the project on your local machine to Apify without the need for the Git repository.

1. Log in to Apify. You will need to provide your [Apify API Token](https://console.apify.com/account/integrations) to complete this action.

   ```bash
   apify login
   ```

2. Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my).

   ```bash
   apify push
   ```

### Documentation reference

To learn more about Apify and Actors, take a look at the following resources:

- [Apify SDK for JavaScript documentation](https://docs.apify.com/sdk/js)
- [Apify SDK for Python documentation](https://docs.apify.com/sdk/python)
- [Apify Platform documentation](https://docs.apify.com/platform)
- [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU)

# Actor input Schema

## `startUrls` (type: `array`):

URLs to start with (for online PDFs)

## `image` (type: `string`):

Upload an image file

## `storage_provider` (type: `string`):

Select the cloud storage provider (S3 or GCP)

## `gcp_auth_method` (type: `string`):

Select authentication method for GCP (only applies when GCP is selected)

## `gcp_service_account_json` (type: `string`):

Paste the entire service account JSON content here (only for GCP with Service Account auth)

## `gcp_project_id` (type: `string`):

Google Cloud Project ID (optional, only for GCP with Service Account auth)

## `bucket_name` (type: `string`):

Name of the S3 or GCP bucket (required if storage provider is selected)

## `access_key` (type: `string`):

Access key for S3 or HMAC Access ID for GCP (not needed for GCP Service Account)

## `secret_key` (type: `string`):

Secret key for S3 or HMAC Secret for GCP (not needed for GCP Service Account)

## `endpoint` (type: `string`):

Endpoint URL - For S3: your MinIO/S3 endpoint. For GCP HMAC: storage.googleapis.com (not needed for Service Account)

## `file_path` (type: `array`):

Specific file path(s) in the bucket to process

## `allowedDomains` (type: `array`):

Domains that the scraper is allowed to crawl.

## `proxyConfiguration` (type: `object`):

Specifies proxy servers that will be used by the scraper in order to hide its origin.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://ballasalla.sch.im/site/uploads/pages/3/_media/20171107_019647f4/Online_gaming_an_introduction_for_parents_and_carers_2017.pdf"
    }
  ],
  "storage_provider": "none",
  "gcp_auth_method": "hmac",
  "endpoint": "storage.googleapis.com",
  "file_path": [],
  "allowedDomains": [
    "apify.com"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://ballasalla.sch.im/site/uploads/pages/3/_media/20171107_019647f4/Online_gaming_an_introduction_for_parents_and_carers_2017.pdf"
        }
    ],
    "storage_provider": "none",
    "gcp_auth_method": "hmac",
    "gcp_service_account_json": "",
    "gcp_project_id": "",
    "bucket_name": "",
    "access_key": "",
    "secret_key": "",
    "endpoint": "storage.googleapis.com",
    "file_path": [],
    "allowedDomains": [
        "apify.com"
    ],
    "proxyConfiguration": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("getdataforme/pdf-to-text-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://ballasalla.sch.im/site/uploads/pages/3/_media/20171107_019647f4/Online_gaming_an_introduction_for_parents_and_carers_2017.pdf" }],
    "storage_provider": "none",
    "gcp_auth_method": "hmac",
    "gcp_service_account_json": "",
    "gcp_project_id": "",
    "bucket_name": "",
    "access_key": "",
    "secret_key": "",
    "endpoint": "storage.googleapis.com",
    "file_path": [],
    "allowedDomains": ["apify.com"],
    "proxyConfiguration": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("getdataforme/pdf-to-text-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://ballasalla.sch.im/site/uploads/pages/3/_media/20171107_019647f4/Online_gaming_an_introduction_for_parents_and_carers_2017.pdf"
    }
  ],
  "storage_provider": "none",
  "gcp_auth_method": "hmac",
  "gcp_service_account_json": "",
  "gcp_project_id": "",
  "bucket_name": "",
  "access_key": "",
  "secret_key": "",
  "endpoint": "storage.googleapis.com",
  "file_path": [],
  "allowedDomains": [
    "apify.com"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}' |
apify call getdataforme/pdf-to-text-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=getdataforme/pdf-to-text-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Pdf To Text Scraper",
        "description": "The Pdf To Text Scraper is an Apify Actor that efficiently extracts text from PDFs, preserving structure and supporting batch processing....",
        "version": "1.1",
        "x-build-id": "zGKnabawdaiIFIVUD"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/getdataforme~pdf-to-text-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-getdataforme-pdf-to-text-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/getdataforme~pdf-to-text-scraper/runs": {
            "post": {
                "operationId": "runs-sync-getdataforme-pdf-to-text-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/getdataforme~pdf-to-text-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-getdataforme-pdf-to-text-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "URLs to start with (for online PDFs)",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "image": {
                        "title": "Image Upload",
                        "type": "string",
                        "description": "Upload an image file"
                    },
                    "storage_provider": {
                        "title": "Cloud Storage Provider",
                        "enum": [
                            "none",
                            "s3",
                            "gcp"
                        ],
                        "type": "string",
                        "description": "Select the cloud storage provider (S3 or GCP)",
                        "default": "none"
                    },
                    "gcp_auth_method": {
                        "title": "GCP Authentication Method",
                        "enum": [
                            "hmac",
                            "service_account"
                        ],
                        "type": "string",
                        "description": "Select authentication method for GCP (only applies when GCP is selected)",
                        "default": "hmac"
                    },
                    "gcp_service_account_json": {
                        "title": "GCP Service Account JSON",
                        "type": "string",
                        "description": "Paste the entire service account JSON content here (only for GCP with Service Account auth)"
                    },
                    "gcp_project_id": {
                        "title": "GCP Project ID",
                        "type": "string",
                        "description": "Google Cloud Project ID (optional, only for GCP with Service Account auth)"
                    },
                    "bucket_name": {
                        "title": "Bucket Name",
                        "type": "string",
                        "description": "Name of the S3 or GCP bucket (required if storage provider is selected)"
                    },
                    "access_key": {
                        "title": "Access Key / HMAC Access ID",
                        "type": "string",
                        "description": "Access key for S3 or HMAC Access ID for GCP (not needed for GCP Service Account)"
                    },
                    "secret_key": {
                        "title": "Secret Key / HMAC Secret",
                        "type": "string",
                        "description": "Secret key for S3 or HMAC Secret for GCP (not needed for GCP Service Account)"
                    },
                    "endpoint": {
                        "title": "Endpoint URL",
                        "type": "string",
                        "description": "Endpoint URL - For S3: your MinIO/S3 endpoint. For GCP HMAC: storage.googleapis.com (not needed for Service Account)"
                    },
                    "file_path": {
                        "title": "File Path in Bucket",
                        "type": "array",
                        "description": "Specific file path(s) in the bucket to process",
                        "items": {
                            "type": "string"
                        }
                    },
                    "allowedDomains": {
                        "title": "Allowed domains",
                        "type": "array",
                        "description": "Domains that the scraper is allowed to crawl.",
                        "default": [
                            "apify.com"
                        ]
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.",
                        "default": {
                            "useApifyProxy": true
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
