Under maintenance

Pricing

$2.00 / 1,000 results

Obsidian MCP Actor

A lightweight Obsidian MCP Actor built for fast, local note automation. It parses, indexes, and transforms vault content, and suits workflows that need speed, clean structure, and reliable processing across Markdown files, tags, metadata, and linked notes.

Developer

Antony mwangi

Last modified

4 months ago

🎯 What's New in v2.0

🛡️ Security Hardening

Path traversal protection: All file operations validated against vault directory
Content size limits: A 10 MB cap on content size prevents out-of-memory (OOM) failures
Input validation: Sanitized URLs and filenames with clear error categories
Robots.txt compliance: Automatic respect for robots.txt rules
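
As an illustration of the path traversal check described above, here is a minimal sketch using only Node's built-in `path` module. The function name `resolveInsideVault` is hypothetical and this is not the Actor's actual implementation; it only shows one common way to confirm that a resolved path stays inside the vault directory.

```javascript
const path = require('node:path');

// Hypothetical helper: resolve a note path and refuse anything outside the vault.
function resolveInsideVault(vaultPath, relativePath) {
  const vaultRoot = path.resolve(vaultPath);
  const target = path.resolve(vaultRoot, relativePath);
  // path.relative() starts with ".." when the target lies outside vaultRoot.
  const rel = path.relative(vaultRoot, target);
  const escapes = rel === '..' || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`Path escapes vault: ${relativePath}`);
  }
  return target;
}
```

Resolving first and comparing afterwards means that inputs such as `../notes` or an absolute path are judged by where they end up, not by how they are spelled.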

⚡ Performance & Scalability

Parallel processing: Concurrent image downloads (3-10x faster)
Intelligent caching: Three cache strategies (memory, disk, Apify KV)
Exponential backoff: Smart retry logic with jitter for rate-limited sites
Stealth mode: Enhanced Playwright evasion for anti-bot protection
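
The retry behaviour listed above can be sketched roughly as follows. This is an assumption about the general technique (exponential backoff with "full jitter"), not the Actor's own `retry.js`; the names `backoffDelay` and `withRetry`, the 30-second cap, and the use of `rateLimitDelay`'s default (2000 ms) as the base are all illustrative choices.

```javascript
// Delay before retry number `attempt` (0-based): exponential growth, capped,
// then scaled by a random factor ("full jitter"). `random` is injectable for tests.
function backoffDelay(attempt, baseMs = 2000, maxMs = 30000, random = Math.random) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Run an async operation, retrying up to `maxRetries` times after the first failure.
async function withRetry(
  operation,
  maxRetries = 3,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted: surface the last error
      await sleep(backoffDelay(attempt));
    }
  }
}
```

Jitter spreads retries out so that several workers hitting the same rate-limited site do not all retry at the same instant.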

🏗️ Modern Architecture

Service-oriented design: Modular, testable, maintainable
Strategy pattern: Pluggable scraping engines (Cheerio → Playwright fallback)
Real-time monitoring: Live WebSocket progress viewer
TypeScript support: Full type definitions for core interfaces
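
To make the strategy pattern with fallback more concrete, the sketch below tries each scraping strategy in order and returns the first success. The strategy objects are stand-ins written for this example; the real `CheerioStrategy` and `PlaywrightStrategy` classes may expose a different interface.

```javascript
// Try each strategy in order; return the first successful result.
// Assumes every strategy has a `name` and an async `scrape(url)` method.
async function scrapeWithFallback(url, strategies) {
  const errors = [];
  for (const strategy of strategies) {
    try {
      return await strategy.scrape(url);
    } catch (err) {
      errors.push(`${strategy.name}: ${err.message}`);
    }
  }
  throw new Error(`All strategies failed for ${url}: ${errors.join('; ')}`);
}

// Stand-in strategies for illustration only (no real scraping happens here).
const fastStrategy = {
  name: 'cheerio',
  async scrape() { throw new Error('content requires JavaScript'); },
};
const browserStrategy = {
  name: 'playwright',
  async scrape(url) { return { url, engine: 'playwright', html: '<h1>rendered</h1>' }; },
};
```

Calling `scrapeWithFallback(url, [fastStrategy, browserStrategy])` keeps the cheap engine first and only pays for a browser when the cheap one fails.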

🔥 Key Features

Feature	Description
🤖 Dual Scraping Engines	Cheerio for speed, Playwright for JavaScript-heavy sites
💾 Persistent Caching	Avoid re-scraping with disk-backed cache between runs
🏷️ Intelligent Tagging	Extract tags from content, metadata, JSON-LD, and domains
🔗 Auto Internal Linking	Automatically link related notes by shared tags
📸 Image Handling	Download and reference images with parallel processing
📝 Template Support	Configure scraping via Obsidian template files
📊 Live Progress	WebSocket viewer shows real-time scraping status
🔐 Security First	Path traversal protection, input validation, size limits
🎯 MCP Integration	Expose 5 tools to Claude/LLMs for AI-driven workflows
📈 Performance Metrics	Track cache hit rates, processing times, and throughput
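
The "Auto Internal Linking" feature links notes that share tags. How the Actor scores or selects related notes is not documented here, so the following is only a sketch of the basic idea; `buildRelatedLinks` and the `minShared` threshold are hypothetical.

```javascript
// For each note, list the other notes that share at least `minShared` tags,
// rendered as Obsidian wikilinks.
function buildRelatedLinks(notes, minShared = 1) {
  const links = {};
  for (const note of notes) {
    const tags = new Set(note.tags);
    links[note.name] = notes
      .filter((other) => other !== note
        && other.tags.filter((tag) => tags.has(tag)).length >= minShared)
      .map((other) => `[[${other.name}]]`);
  }
  return links;
}
```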

🚀 Quick Start

Single URL Scrape

{
  "url": "https://example.com/article",
  "vaultPath": "/Users/yourname/Documents/Obsidian",
  "folderPath": "research/articles",
  "tags": ["ai", "research"],
  "autoTag": true,
  "autoLink": true
}

Bulk Import with Caching

{
  "urls": [
    "https://site1.com/post",
    "https://site2.com/guide",
    "https://site3.com/tutorial"
  ],
  "vaultPath": "/Users/yourname/Documents/Obsidian",
  "bulkMode": true,
  "usePlaywright": false,
  "cache": "disk",
  "rateLimitDelay": 2000
}

JavaScript-Heavy Site

{
  "url": "https://react-app.example.com",
  "vaultPath": "/Users/yourname/Documents/Obsidian",
  "usePlaywright": true,
  "enableStealth": true,
  "playwrightTimeout": 45
}

📋 Configuration Reference

Core Settings

Parameter	Type	Required	Default	Description
`url`	string	No*	-	Single URL to scrape (use `urls` for bulk)
`urls`	array	No*	[]	Array of URLs for bulk import
`vaultPath`	string	Yes	-	Absolute path to your Obsidian vault
`folderPath`	string	No	`scraped`	Subfolder path within vault
`noteName`	string	No	Auto	Custom note filename (auto-sanitized)
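
The table notes that `noteName` is auto-sanitized but does not say how. A plausible sketch, under the assumption that the goal is a filename safe for common filesystems and for Obsidian links, might look like this (`sanitizeNoteName`, the 100-character limit, and the `untitled` fallback are all hypothetical):

```javascript
// Hypothetical sanitizer; the Actor's actual rules may differ.
function sanitizeNoteName(name, fallback = 'untitled') {
  const cleaned = String(name)
    .replace(/[\\\/:*?"<>|#^\[\]]/g, ' ') // characters unsafe in filenames or Obsidian links
    .replace(/\s+/g, ' ')                 // collapse runs of whitespace
    .replace(/^[.\s]+/, '')               // drop leading dots and spaces
    .slice(0, 100)
    .trim();
  return cleaned || fallback;
}
```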

Processing Options

Parameter	Type	Default	Description
`addMetadata`	boolean	`true`	Include YAML front-matter
`tags`	array	`[]`	Manual tags to apply
`autoTag`	boolean	`true`	Enable intelligent auto-tagging
`autoLink`	boolean	`true`	Create internal links between notes
`updateExisting`	boolean	`false`	Allow overwriting existing notes
`templatePath`	string	-	Obsidian template for config

Performance & Reliability

Parameter	Type	Default	Description
`usePlaywright`	boolean	`false`	Use Chrome browser automation
`playwrightTimeout`	number	`30`	Page load timeout (seconds)
`enableStealth`	boolean	`true`	Apply anti-bot evasion
`maxRetries`	number	`3`	Retry attempts per URL
`rateLimitDelay`	number	`2000`	Delay between requests (ms)
`cache`	string	`memory`	Cache type: `memory`, `disk`, `apify`
`downloadImages`	boolean	`false`	Download images to vault
`concurrency`	number	`3`	Parallel download workers

* Applies to Core Settings: either `url` or `urls` must be provided.
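
The defaults and the `url`/`urls` rule from the tables above can be combined into a small normalization step. This is a sketch only: the default values are copied from the tables, while the function name `normalizeInput`, the error messages, and the decision to merge `url` into `urls` are assumptions.

```javascript
// Defaults copied from the configuration tables; validation rules are assumed.
const DEFAULTS = {
  folderPath: 'scraped', addMetadata: true, tags: [], autoTag: true, autoLink: true,
  updateExisting: false, usePlaywright: false, playwrightTimeout: 30, enableStealth: true,
  maxRetries: 3, rateLimitDelay: 2000, cache: 'memory', downloadImages: false, concurrency: 3,
};
const CACHE_TYPES = ['memory', 'disk', 'apify'];

function normalizeInput(input) {
  const config = { ...DEFAULTS, ...input };
  // Merge `url` and `urls` into one list so later stages handle a single shape.
  const urls = [...(input.url ? [input.url] : []), ...(input.urls ?? [])];
  if (urls.length === 0) throw new Error('Either url or urls must be provided');
  if (!config.vaultPath) throw new Error('vaultPath is required');
  if (!CACHE_TYPES.includes(config.cache)) {
    throw new Error(`Unknown cache type: ${config.cache}`);
  }
  delete config.url;
  return { ...config, urls };
}
```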

📁 Generated Note Format

---
title: "Understanding Machine Learning"
url: https://example.com/ml-guide
scraped: 2024-01-15T10:30:00.000Z
tags: ["machine-learning", "ai", "research", "technology", "example"]
description: "A comprehensive guide to ML fundamentals"
author: "Jane Smith"
---

# Understanding Machine Learning

> 🔗 Source: [https://example.com/ml-guide](https://example.com/ml-guide)

> 📅 Scraped: January 15, 2024

---

## Article Content

Full content converted to Markdown...

---

## Metadata

- **Author:** Jane Smith
- **Description:** A comprehensive guide to ML fundamentals
- **Canonical:** https://example.com/ml-guide
- **Robots:** index,follow
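
For readers who want to reproduce the front-matter block of the note format above, here is a minimal sketch. It is not the Actor's `NoteManager`; the function name `renderFrontMatter` is hypothetical. It relies on the fact that JSON strings, booleans, and arrays are also valid YAML flow syntax, so the output quotes every string (including `url` and `scraped`), which differs slightly from the sample above.

```javascript
// Render an object as a YAML front-matter block, skipping empty fields.
function renderFrontMatter(meta) {
  const lines = ['---'];
  for (const [key, value] of Object.entries(meta)) {
    if (value === undefined || value === null) continue;
    // JSON.stringify yields double-quoted strings and flow-style arrays.
    lines.push(`${key}: ${JSON.stringify(value)}`);
  }
  lines.push('---');
  return lines.join('\n');
}
```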

🎓 Advanced Usage

Template-Based Configuration

Create templates/scraper-config.md in your vault:

---
folderPath: "research/ai-papers"
autoTag: true
autoLink: true
tags: ["ai", "paper"]
usePlaywright: false
cache: "disk"
---

# AI Paper Scraper Template

This template automatically applies settings when referenced.

Usage:

{
  "url": "https://arxiv.org/abs/2401.12345",
  "vaultPath": "/path/to/vault",
  "templatePath": "templates/scraper-config"
}
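
How the template's front matter is merged with the run input is not spelled out above. The sketch below shows one plausible approach under two stated assumptions: the front matter only contains simple `key: value` lines whose values are valid JSON (as in the example template), and explicit run input overrides template values. Both function names are hypothetical, and a real YAML parser would be more robust.

```javascript
// Simplified front-matter reader for `key: value` lines with JSON-compatible values.
function parseTemplateConfig(templateText) {
  const match = templateText.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const config = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    config[line.slice(0, idx).trim()] = JSON.parse(line.slice(idx + 1).trim());
  }
  return config;
}

// Assumption: values given directly in the run input win over template values.
function applyTemplate(input, templateText) {
  return { ...parseTemplateConfig(templateText), ...input };
}
```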

Caching Strategies

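// Import paths follow the project structure; named exports are an assumption.
import { MemoryCache } from './lib/cache/MemoryCache.js';
import { PersistentCache } from './lib/cache/PersistentCache.js';
import { PersistentScrapeCache } from './lib/cache/PersistentScrapeCache.js';

// Memory cache (fast, ephemeral)
const memoryCache = new MemoryCache({ maxSize: 100 });

// Disk cache (persistent across runs)
const diskCache = new PersistentCache({ cacheDir: './storage' });

// Apify KV store (cloud, for scheduled actors)
const kvCache = new PersistentScrapeCache('my-scrape-cache');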
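
To show what options like `maxSize` (and the `ttl` option that appears in the migration notes) typically control, here is an illustrative stand-in. It is not the Actor's `MemoryCache`: the class name `SketchMemoryCache`, the least-recently-used eviction policy, and the millisecond `ttl` unit are all assumptions.

```javascript
// Map-based cache: evicts the least recently used entry beyond maxSize and
// expires entries older than ttl milliseconds. `now` is injectable for tests.
class SketchMemoryCache {
  constructor({ maxSize = 100, ttl = 3600000, now = Date.now } = {}) {
    this.maxSize = maxSize;
    this.ttl = ttl;
    this.now = now;
    this.entries = new Map(); // Map preserves insertion order, oldest first
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttl) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.maxSize) {
      this.entries.delete(this.entries.keys().next().value); // evict oldest
    }
  }
}
```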

Real-Time Progress Viewer

Local Development:

npm install  # Install dependencies
npm run dev  # Start MCP server with live viewer

Apify Platform:

{
  "startResultsServer": true,
  "resultsServerPort": 8080
}

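Then visit http://localhost:8080 in your browser. This address applies to local runs; on the Apify platform, open the run's container URL instead, since localhost there refers to the container itself.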

🤖 MCP Server Integration

The Actor exposes 5 tools to Claude and other MCP-capable LLM clients.

Install globally:

npm install -g obsidian-mcp-actor

Add to your Claude config:

{
  "mcpServers": {
    "obsidian": {
      "command": "obsidian-mcp-actor",
      "args": ["mcp-server"]
    }
  }
}

Available Tools:

scrape_website - Scrape any URL
extract_tags - Analyze content for tags
validate_content - Check scrape quality
convert_html_to_markdown - Transform content
save_note - Save to Obsidian vault
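
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for `scrape_website`. The request envelope follows the MCP specification, but the argument name `url` is an assumption; the tool's real input schema is what the server reports via `tools/list`.

```javascript
// Build a JSON-RPC 2.0 request for MCP's `tools/call` method.
// Argument names are hypothetical; consult the tool's published input schema.
function buildToolCall(id, toolName, args) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name: toolName, arguments: args },
  };
}

const request = buildToolCall(1, 'scrape_website', { url: 'https://example.com/article' });
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library assembles this message for you; the sketch is only meant to show what travels over the wire.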

AI Workflow Example:

"Claude, scrape the latest 5 articles from Hacker News, tag them by topic, and save to my trending folder with internal links."

🔧 Development Setup

# Clone repository
git clone https://github.com/yourusername/obsidian-mcp-actor.git
cd obsidian-mcp-actor

# Install dependencies
npm install

# Run TypeScript compilation
npm run build

# Run tests
npm test

# Start MCP server locally
npm run mcp-server

Project Structure

obsidian-mcp-actor/
├── src/
│   ├── main.js                      # Apify Actor entry point
│   ├── mcp-server.js               # MCP server entry point
│   └── lib/
│       ├── processor/              # Core business logic
│       │   ├── UnifiedScraper.js
│       │   ├── MarkdownConverter.js
│       │   ├── TagExtractor.js
│       │   └── ActorService.js
│       ├── scraper/                # Scraping strategies
│       │   ├── CheerioStrategy.js
│       │   └── PlaywrightStrategy.js
│       ├── vault/                  # Obsidian operations
│       │   ├── NoteManager.js
│       │   └── LinkManager.js
│       ├── cache/                  # Caching implementations
│       │   ├── MemoryCache.js
│       │   ├── PersistentCache.js
│       │   └── PersistentScrapeCache.js
│       ├── utils/                  # Utilities
│       │   ├── url.js
│       │   ├── errors.js
│       │   ├── retry.js
│       │   └── stealth.js
│       └── server/                 # WebSocket server
│           └── ResultsServer.js
├── test/                           # Unit and integration tests
├── input_schema.json               # Apify input schema
├── output_schema.json              # Apify output schema
└── package.json

🧪 Testing

# Run all tests
npm test

# Run with coverage
npm run test:coverage

# Run specific test file
npm test test/UnifiedScraper.test.js

Test Coverage Goals:

Core scraping logic: >90%
Security validation: 100%
Vault operations: >85%

📦 Deployment

Apify Platform

Push to Apify:

apify push

Configure Environment Variables:

APIFY_MEMORY_MBYTES=4096
APIFY_BUILD_TIMEOUT_SECS=300

Schedule Runs:

apify schedule create my-schedule \
  --actor-id your-actor-id \
  --cron "0 9 * * *" \
  --input-json '{"urls": [...], "vaultPath": "/data"}'

Self-Hosted

# Docker
docker build -t obsidian-mcp-actor .
docker run -v /path/to/vault:/data -p 8080:8080 obsidian-mcp-actor

🔄 Migration from v1.x

Breaking Changes

For most users: No changes needed. The public API remains identical.

If you extended internals:

Legacy functions in helpers.js are deprecated but functional

Import from specific modules for new features:

// Old (still works)
import { scrapeWebsite } from './helpers.js';

// New (recommended)
import { UnifiedScraper } from './lib/processor/UnifiedScraper.js';
const scraper = new UnifiedScraper({ usePlaywright: true });

New Cache API

// Old (v1.x)
const legacyCache = new ScrapeCache();

// New (v2.0); ttl is presumably in milliseconds (3600000 ms = 1 hour)
const cache = new MemoryCache({ maxSize: 100, ttl: 3600000 });

Updated File Structure

Move custom code from main.js to lib/processor/ActorService.js for modularity.

📚 Use Cases

Use Case	Configuration
Research Paper Collection	`usePlaywright: false`, `cache: "disk"`, `folderPath: "papers/{year}"`
News Monitoring	`bulkMode: true`, `rateLimitDelay: 5000`, `updateExisting: true`
Competitive Intelligence	`enableStealth: true`, `downloadImages: true`, `autoTag: true`
Course Materials	`templatePath: "templates/course"`, `addMetadata: true`, `autoLink: true`
AI-Powered Curation	Enable MCP server, use Claude to orchestrate complex scraping tasks

📊 Performance Benchmarks

Scenario	v1.x	v2.0	Improvement
Single static page	2.1s	0.8s	2.6x faster
Bulk 10 URLs	45s	18s	2.5x faster
JS-heavy SPA	15s	12s	1.25x faster
Image downloads (20)	25s	3s	8.3x faster
Cache hit rate	0%	78%	78% reuse

Benchmarks were run on an M1 Mac with 10 concurrent workers.
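
The image-download figures depend on how many downloads run at once. As a rough sketch of how a fixed number of concurrent workers can be enforced, the helper below processes a list with at most `concurrency` tasks in flight. The name `mapWithConcurrency` is hypothetical and the Actor may use a different mechanism (for example, a queue from Crawlee).

```javascript
// Run `task` over `items` with at most `concurrency` tasks in flight at once.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, concurrency, task) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next; // claim the next index before awaiting anything
      next += 1;
      results[index] = await task(items[index], index);
    }
  }
  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Each worker pulls the next unclaimed item as soon as it finishes its current one, so slow downloads do not hold up the rest of the batch.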

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Guidelines

Write tests for new features
Follow existing code style (ESLint configured)
Update TypeScript types
Document public APIs with JSDoc
Security-first: validate all inputs

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with Crawlee and Playwright
Inspired by the Obsidian community's automation needs
Model Context Protocol (MCP) by Anthropic

Made with ❤️ for researchers, knowledge workers, and automation enthusiasts

Transform your Obsidian vault into a self-updating knowledge base.
