# Internet Archive Scraper (`dami_studio/internet-archive-scraper`) Actor

Searches the Internet Archive (archive.org) by keyword and returns structured items (title, creator, year, downloads, subjects, item URL); filter by media type and sort by downloads or upload date.

- **URL**: https://apify.com/dami_studio/internet-archive-scraper.md
- **Developed by:** [Dami's Studio](https://apify.com/dami_studio) (community)
- **Categories:** Integrations, News, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$2.00 / 1,000 items returned

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
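
As a rough illustration of the per-event pricing, the arithmetic can be sketched as follows. The function name `estimate_cost_usd` is hypothetical, and the constant is simply the rate listed above; actual billing is determined by the Apify platform.

```python
from decimal import Decimal

# Listed rate: $2.00 per 1,000 returned items (taken from the pricing line above).
PRICE_PER_1000_ITEMS_USD = Decimal("2.00")


def estimate_cost_usd(item_count: int) -> Decimal:
    """Estimate the charge for a run that returns `item_count` genuine items."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    return PRICE_PER_1000_ITEMS_USD * item_count / 1000


# A 50-item run at the listed rate comes to ten cents.
print(estimate_cost_usd(50))
```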

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Internet Archive Scraper

Search the **Internet Archive** (archive.org) by keyword and get back clean, structured items — title, creator, year, downloads, subjects, description and the item URL. No API key, no login.

Built on the public `advancedsearch.php` JSON API. Filter by media type (texts, audio, movies, software, image, …), sort by downloads, date, or relevance, and paginate transparently up to your item limit.

### What you get per item

`identifier`, `title`, `creator`, `year`, `date`, `mediaType`, `downloads`, `subjects` (array), `description` (first ~500 chars), `publicdate`, and `url` (`https://archive.org/details/{identifier}`).

#### Fields that can be null or empty

- `title`, `creator`, `year`, `date`, `description`, `publicdate` — null when archive.org's metadata doesn't include that field for an item.
- `subjects` — empty array when the item has no subject tags.
- `downloads` — `0` when not reported.
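
Downstream code should tolerate these missing values. The sketch below shows one way to do that; the helper name `summarize_item`, the fallback strings, and the sample row are hypothetical, and it assumes `creator` arrives as a plain string.

```python
from typing import Any


def summarize_item(item: dict[str, Any]) -> str:
    """Build a one-line summary that tolerates null metadata fields."""
    title = item.get("title") or "(untitled)"
    creator = item.get("creator") or "unknown creator"
    year = item.get("year") or "n.d."
    # `subjects` is an empty list when the item has no subject tags.
    subjects = ", ".join(item.get("subjects") or []) or "no subjects"
    downloads = item.get("downloads") or 0
    return f"{title} by {creator} ({year}) [{subjects}] - {downloads} downloads"


# Hypothetical row with missing metadata, shaped like the field list above.
sparse = {
    "identifier": "example-item",
    "title": None,
    "creator": None,
    "year": None,
    "subjects": [],
    "downloads": 0,
    "url": "https://archive.org/details/example-item",
}
print(summarize_item(sparse))
```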

### Input

| Field | Notes |
|---|---|
| `query` | **Required.** Keywords, e.g. `nasa apollo`, `jazz`. Supports archive.org Lucene operators, e.g. `title:(grateful dead) AND year:[1977 TO 1980]`. |
| `mediaType` | Restrict to one type: `texts`, `audio`, `movies`, `software`, `image`, `web`, `data`, `collection`. Empty = any. |
| `sort` | `downloads` (default), `date`, `publicdate`, or `relevance`. |
| `maxItems` | Max unique items to return (default 100). Paginates 100 per request until reached or exhausted. |
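
Checking input locally before starting a run can avoid a `BAD_INPUT` diagnostic row. The following is a minimal sketch, not part of the Actor: `build_input` is a hypothetical helper, the allowed values are copied from the table above, and the 1 to 10000 bounds for `maxItems` come from the input schema in the OpenAPI specification further down.

```python
ALLOWED_MEDIA_TYPES = {"", "texts", "audio", "movies", "software",
                       "image", "web", "data", "collection"}
ALLOWED_SORTS = {"downloads", "date", "publicdate", "relevance"}


def build_input(query: str, media_type: str = "", sort: str = "downloads",
                max_items: int = 100) -> dict:
    """Validate arguments locally and return an Actor input object."""
    if not query.strip():
        raise ValueError("query must be a non-empty string")
    if media_type not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unknown mediaType: {media_type!r}")
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unknown sort: {sort!r}")
    if not 1 <= max_items <= 10000:
        raise ValueError("maxItems must be between 1 and 10000")
    return {
        "query": query.strip(),
        "mediaType": media_type,
        "sort": sort,
        "maxItems": max_items,
    }


print(build_input("jazz", media_type="audio", max_items=50))
```

The returned dictionary can be passed as `run_input` to the Python client call shown in the API section.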

### Output

One dataset row per item. Pricing is pay-per-result: you are only charged for genuine item rows (`ok: true`). Diagnostic rows are **never** charged — this includes:

- empty/invalid input (`errorCode: "BAD_INPUT"` — empty query or an unknown `mediaType`),
- no results for the query (`NO_RESULTS`),
- rate limits or network errors (`RATE_LIMITED` / `NETWORK` / `SERVER_ERROR`).

Results are de-duplicated by `identifier`.
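
When reading the dataset, it can help to separate genuine rows from diagnostic ones. This sketch assumes genuine rows carry `ok: true` and diagnostic rows carry an `errorCode`, as described above; the helper name `split_rows`, the sample rows, and the `"UNKNOWN"` fallback label are hypothetical.

```python
from typing import Any, Iterable


def split_rows(rows: Iterable[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[str]]:
    """Separate genuine item rows from diagnostic rows.

    Returns (items, error_codes).
    """
    items: list[dict[str, Any]] = []
    error_codes: list[str] = []
    for row in rows:
        if row.get("ok") is True:
            items.append(row)
        else:
            # Diagnostic rows are not charged; keep only their code for logging.
            error_codes.append(row.get("errorCode", "UNKNOWN"))
    return items, error_codes


rows = [
    {"ok": True, "identifier": "item-a", "title": "First"},
    {"ok": False, "errorCode": "RATE_LIMITED"},
]
items, error_codes = split_rows(rows)
print(len(items), error_codes)
```

Only the rows in `items` count toward the per-item charge.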

#### Proxy

The archive.org advancedsearch API is a public, no-auth JSON endpoint with no anti-bot protection, so **no proxy is required** and the default configuration runs without one (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.

#### Troubleshooting

- Getting a `BAD_INPUT` row? Provide a non-empty `query`, and if you set `mediaType` make sure it's one of the allowed values.
- `NO_RESULTS`? The query matched nothing on archive.org — broaden the keywords or remove the media-type filter.
- Want fewer/more results? Adjust `maxItems`. The archive can return very large result sets for broad queries.

### Example

```json
{ "query": "jazz", "mediaType": "audio", "sort": "downloads", "maxItems": 50 }
````

### Notes

The Actor calls `advancedsearch.php` with `output=json`, requesting `identifier`, `title`, `creator`, `year`, `date`, `mediatype`, `downloads`, `description`, `subject`, and `publicdate`, then maps each doc to a clean row. Pagination uses `page` with 100 rows per request until your `maxItems` is reached or the `numFound` total is exhausted.
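
The request and pagination logic described above might look roughly like the sketch below. It is not the Actor's source code: the helper names are hypothetical, the `q`, `fl[]`, `rows`, `page`, and `output` parameter names follow the public advancedsearch API, and expressing the media-type filter as a `mediatype:` clause is an assumption. Sorting is left out of the sketch.

```python
import math
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"
FIELDS = ["identifier", "title", "creator", "year", "date", "mediatype",
          "downloads", "description", "subject", "publicdate"]
ROWS_PER_PAGE = 100


def build_search_url(query: str, page: int, media_type: str = "") -> str:
    """Build one advancedsearch.php request URL for a given 1-based page."""
    # Assumption: the media-type filter is added as a Lucene clause.
    q = f"({query}) AND mediatype:{media_type}" if media_type else query
    params = [("q", q)]
    params += [("fl[]", field) for field in FIELDS]
    params += [("rows", ROWS_PER_PAGE), ("page", page), ("output", "json")]
    return f"{SEARCH_URL}?{urlencode(params)}"


def pages_needed(max_items: int, num_found: int) -> int:
    """Number of 100-row pages to fetch before maxItems or numFound runs out."""
    return math.ceil(min(max_items, num_found) / ROWS_PER_PAGE)


print(build_search_url("jazz", page=1, media_type="audio"))
print(pages_needed(max_items=250, num_found=1000))
```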

# Actor input Schema

## `query` (type: `string`):

Keywords to search the Internet Archive for (e.g. "nasa apollo", "jazz"). Supports Lucene operators used by archive.org, e.g. `title:(grateful dead) AND year:[1977 TO 1980]`. Required.

## `mediaType` (type: `string`):

Restrict results to one media type, or leave empty for any. texts = books/documents, audio = music/recordings, movies = video/film, software, image, web (archived sites), data, collection.

## `sort` (type: `string`):

Order of results. downloads = most-downloaded first, date = newest item date first, publicdate = most recently added to archive.org first, relevance = the archive's default relevance ranking.

## `maxItems` (type: `integer`):

Maximum number of unique items to return. The Actor paginates 100 per request until this many items are collected or the result set is exhausted.

## `notionConnector` (type: `string`):

Optional. Write each item as a page into your Notion when the run finishes. Authorize a Notion connector once in Settings → API & Integrations → MCP connectors, then pick it here. Leave empty to skip (default) — results are always saved to the dataset regardless.

## `notionParentId` (type: `string`):

Optional. The Notion data source ID of the database to write into (only used if a Notion connector is set). Leave empty to create the pages privately in your workspace instead.

## `proxyConfiguration` (type: `object`):

OPTIONAL. The archive.org advancedsearch API is a public, no-auth, no-anti-bot JSON endpoint, so no proxy is needed and the default routes traffic directly (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.

## Actor input object example

```json
{
  "query": "nasa apollo",
  "mediaType": "",
  "sort": "downloads",
  "maxItems": 100,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# Actor output Schema

## `results` (type: `string`):

Scraped rows are stored in the default dataset (one row per result). Blocked/empty/error runs return a single uncharged diagnostic row instead.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "query": "nasa apollo"
};

// Run the Actor and wait for it to finish
const run = await client.actor("dami_studio/internet-archive-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "query": "nasa apollo" }

# Run the Actor and wait for it to finish
run = client.actor("dami_studio/internet-archive-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "query": "nasa apollo"
}' |
apify call dami_studio/internet-archive-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=dami_studio/internet-archive-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Internet Archive Scraper",
        "description": "Searches the Internet Archive (archive.org) by keyword and returns structured items (title, creator, year, downloads, subjects, item URL); filter by media type and sort by downloads or upload date.",
        "version": "0.1",
        "x-build-id": "dZbgRXvIaL0QllzYn"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/dami_studio~internet-archive-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-dami_studio-internet-archive-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/dami_studio~internet-archive-scraper/runs": {
            "post": {
                "operationId": "runs-sync-dami_studio-internet-archive-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/dami_studio~internet-archive-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-dami_studio-internet-archive-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "query": {
                        "title": "Search query",
                        "type": "string",
                        "description": "Keywords to search the Internet Archive for (e.g. \"nasa apollo\", \"jazz\"). Supports Lucene operators used by archive.org, e.g. \"title:(grateful dead) AND year:[1977 TO 1980]\". Required."
                    },
                    "mediaType": {
                        "title": "Media type",
                        "enum": [
                            "",
                            "texts",
                            "audio",
                            "movies",
                            "software",
                            "image",
                            "web",
                            "data",
                            "collection"
                        ],
                        "type": "string",
                        "description": "Restrict results to one media type, or leave empty for any. texts = books/documents, audio = music/recordings, movies = video/film, software, image, web (archived sites), data, collection.",
                        "default": ""
                    },
                    "sort": {
                        "title": "Sort by",
                        "enum": [
                            "downloads",
                            "date",
                            "publicdate",
                            "relevance"
                        ],
                        "type": "string",
                        "description": "Order of results. downloads = most-downloaded first, date = newest item date first, publicdate = most recently added to archive.org first, relevance = the archive's default relevance ranking.",
                        "default": "downloads"
                    },
                    "maxItems": {
                        "title": "Max items",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of unique items to return. The actor paginates 100 per request until this many items are collected or the result set is exhausted.",
                        "default": 100
                    },
                    "notionConnector": {
                        "title": "Notion connector (optional)",
                        "type": "string",
                        "description": "Optional. Write each item as a page into your Notion when the run finishes. Authorize a Notion connector once in Settings → API & Integrations → MCP connectors, then pick it here. Leave empty to skip (default) — results are always saved to the dataset regardless."
                    },
                    "notionParentId": {
                        "title": "Notion target data source ID",
                        "type": "string",
                        "description": "Optional. The Notion data source ID of the database to write into (only used if a Notion connector is set). Leave empty to create the pages privately in your workspace instead."
                    },
                    "proxyConfiguration": {
                        "title": "Proxy (optional)",
                        "type": "object",
                        "description": "OPTIONAL. The archive.org advancedsearch API is a public, no-auth, no-anti-bot JSON endpoint, so no proxy is needed and the default routes traffic directly (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
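
For projects that call the REST API directly, the `run-sync-get-dataset-items` operation from the specification above can be prepared with the Python standard library alone. This is a minimal sketch: `build_run_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/dami_studio~internet-archive-scraper/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request described in the spec."""
    # The spec passes the API token as a required `token` query parameter.
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_run_request("<YOUR_API_TOKEN>", {"query": "nasa apollo"})
print(request.get_method(), request.full_url.split("?")[0])
# To execute the run, pass the request to urllib.request.urlopen(request)
# and parse the response body with json.load().
```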
