skill-seekers/Skill_Seekers

MCP Developer Tools

The universal preprocessing layer that converts documentation, code repos, PDFs, videos, and many other sources into structured knowledge assets ready for AI systems such as Claude, Gemini, LangChain, LlamaIndex, and Cursor.


Python · Latest v3.8.0 · 1mo ago

Features

Converts more than ten source types, including docs, repos, PDFs, videos, and notebooks, into structured knowledge assets
One‑command creation (`skill-seekers create …`) for any input URL or local path
Exports to multiple AI targets (Claude skills, Gemini, OpenAI/GPT, LangChain, LlamaIndex, Haystack, Pinecone, IBM Bob, Cursor, etc.)
Reduces data preparation time from days to 15‑45 minutes while preserving high‑quality skill output

Recent releases

View all 21 releases →

No immediate action

v3.8.0 · Breaking risk · 1mo ago

Unification, CI fixes, Windows fix


No immediate action

v3.7.0 · Breaking risk · 1mo ago

scan command + opt-in submission


v3.6.0 · Breaking risk · 2mo ago

Notable features

IBM Bob packaging target via `--target bob`
GitHub scraper filters: issue state, labels, and since date
Per-issue Markdown files for GitHub issues

Full changelog

[3.6.0] - 2026-05-03

Theme: Quality-of-life release — packaging targets, GitHub issue workflow, codebase analysis fixes, and source detection hardening.

Added

IBM Bob packaging target — new --target bob adaptor and agent install support for IBM's Bob agent platform (#366)
GitHub issue filtering — --github-issue-state, --github-issue-labels, and --github-issue-since filters in the GitHub scraper for narrowing which issues are pulled (#367)
Per-issue files — GitHub scraper now writes one Markdown file per issue instead of a single bundle, improving navigation and downstream chunking (#367)
Pinecone frontmatter — Pinecone vector exports now include consistent YAML frontmatter for metadata round-tripping (#367)

Fixed

Unified scraper now generates codebase_analysis/ index — local sources were producing C3.x outputs with broken SKILL.md links; the unified skill builder now wires up the index and resolves links correctly (#362, #376)
Guides fallback fires correctly — unified_skill_builder was emitting a truthy placeholder for empty guides which suppressed the fallback content; placeholder removed (#364, #375)
HTML URLs no longer treated as local files — source_detector now checks for http(s):// before falling through to the local-path branch, fixing false-positive routing (#373)
PDF extracted images appear in markdown — pdf_scraper now inserts ![](…) references for images extracted from PDFs so they render in the generated SKILL.md (#369)
C3.x output for local sources — unified command was skipping the C3.x analysis pipeline for local codebase sources; now emits the full pattern/test/guide/config/router output (#363, #372)
Language filter passed to C3.x clone analysis — repos cloned for analysis now respect --languages instead of analyzing every file (fixes #361, #370)
Unity vs Unreal detection — Unity projects with C# imports were being misidentified as Unreal; detection now keys on C# import patterns (fixes #365, #368)
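
The URL-routing fix above comes down to the order of the checks. A minimal sketch, assuming a hypothetical `classify_source` helper (the real `source_detector` is not shown in the changelog and may be structured differently):

```python
import re
from pathlib import Path

# Hypothetical pattern and helper names, used only for illustration.
_URL_SCHEME = re.compile(r"^https?://", re.IGNORECASE)


def classify_source(source: str) -> str:
    """Return "url", "local", or "unknown" for a source string."""
    # Test the scheme first, so "https://example.com/page.html" is never
    # routed to the local-file branch just because it ends in ".html".
    if _URL_SCHEME.match(source):
        return "url"
    # Only strings without an http(s) scheme reach the filesystem check.
    if Path(source).exists():
        return "local"
    return "unknown"
```

With this ordering, `classify_source("https://example.com/docs/index.html")` returns `"url"` regardless of what exists on disk.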


v3.5.1 · Breaking risk · 3mo ago

Breaking changes

`max_pages` default changed from 500 to -1 (unlimited)
Hardcoded magic numbers removed from `constants.py`, which now reads its values from `defaults.json`

Notable features

Centralized `defaults.json` config as single source of truth for all default values
Low‑signal code snippet filtering via `_is_low_signal_code_snippet()`
Pattern description normalization with `_normalize_pattern_description()`
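
To illustrate the kind of filtering described above, here is a rough sketch of a low-signal check. The function name `is_low_signal` and its rules are assumptions made for this example. They are not the body of `_is_low_signal_code_snippet()`, which the changelog does not show.

```python
import re

# Assumed rule: a snippet consisting of one bare identifier or literal
# (for example "True" or "options") carries no usage information.
_SINGLE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_low_signal(snippet: str) -> bool:
    """Return True for snippets too trivial to serve as a quick reference."""
    text = snippet.strip()
    if not text:
        return True  # empty or whitespace-only
    return bool(_SINGLE_IDENTIFIER.match(text))
```

A call such as `is_low_signal("True")` is filtered out, while `is_low_signal("client.get('/users')")` is kept.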

Full changelog

[3.5.1] - 2026-04-12

Added

Centralized defaults.json config — single source of truth for all default values (rate_limit, max_pages, workers, async_mode, enhancement, analysis, RAG settings). New defaults.py loader module. All 15+ files that previously hardcoded defaults now read from this file (#356)
Low-signal code snippet filtering — _is_low_signal_code_snippet() filters junk snippets such as a bare True, options, or a single identifier out of quick references (#360)
Pattern description normalization — _normalize_pattern_description() cleans boilerplate prefixes and truncates to first meaningful sentence (#360)
Example language priority ranking — _example_language_priority() ranks Python > Bash > JSON > etc. for SKILL.md examples (#360)
checkpoint_exists() method on DocToSkillConverter — was called but never defined (#360)
Unified config source normalization — DocToSkillConverter.__init__ merges fields from sources[0] into flat config for compatibility (#360)
display_name support in SKILL.md generation — produces cleaner titles and slugs (#360)
New tests: test_doc_scraper_entrypoint.py (regression for _run_scraping), quick-reference quality tests, docs-only compatibility tests, nested reference coverage tests (#360)
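
A single-source-of-truth loader like the one described for defaults.json can be sketched as follows. The function names, the explicit `path` argument, and the cache are assumptions for illustration. The actual defaults.py module is not shown in the changelog.

```python
import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_defaults(path: str) -> dict:
    """Parse the defaults file once and reuse the result on later calls."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def get_default(path: str, key: str):
    """Look up one default value.

    A missing key raises KeyError rather than returning None, so a typo in
    a key name fails loudly instead of leaking None into later code.
    """
    return load_defaults(path)[key]
```

Callers would then read `get_default("defaults.json", "max_pages")` instead of repeating a literal such as 500 or -1 in each module.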

Changed

max_pages default is now unlimited (-1) — the scraper fetches all pages unless the user explicitly sets --max-pages. Previously defaulted to 500 (#356)
--no-rate-limit flag now works — was defined in CLI arguments but never consumed by ExecutionContext (#356)
constants.py reads from defaults.json — no longer contains hardcoded magic numbers (#356)
ExecutionContext.ScrapingSettings — rate_limit and max_pages now use real defaults instead of None, preventing None-poisoning downstream (#356)
SKILL.md frontmatter cleanup — empty doc_version: and version: fields are now omitted; placeholder sections removed (#360)
Enhancement now routes through platform adaptors instead of importing the nonexistent enhance_skill_md helper (#360)
quality_metrics.py uses rglob for nested reference directories in unified skills (#360)

Fixed

TypeError: '>' not supported between instances of 'NoneType' and 'int' — rate_limit defaulted to None in ExecutionContext, and config.get("rate_limit", DEFAULT) passed that None through: dict.get applies its fallback only when the key is missing, not when the key exists with value None. Fixed in doc_scraper.py (sync + async paths), estimate_pages.py, and sync_config.py (#356, #359)
discover_urls() loop never executed with unlimited max_pages — with max_pages set to -1 the loop condition evaluates to len(discovered) < -1, which is always False. Added an unlimited-mode guard (#356)
_run_scraping() called the nonexistent converter.scrape() method — changed to converter.scrape_all() (#360)
None-safety for BeautifulSoup attributes — link["href"], sitemap.text, meta_desc["content"] guarded against None XML text nodes (#360)
Python 3.10 compatibility — quality_metrics.py used a backslash inside an f-string expression, which is a syntax error before Python 3.12 (#360)
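
Two of the fixes above are easy to reproduce in isolation. The sketch below shows the `dict.get` behaviour behind the TypeError and one way to guard a page-limit comparison against the -1 sentinel. The helper names and the value of `DEFAULT_RATE_LIMIT` are illustrative assumptions, not the project's code.

```python
DEFAULT_RATE_LIMIT = 0.5  # illustrative value, not taken from defaults.json

config = {"rate_limit": None}  # key is present, value is None

# dict.get falls back only when the key is absent, so None comes through.
assert config.get("rate_limit", DEFAULT_RATE_LIMIT) is None


def rate_limit_from(cfg: dict) -> float:
    """Treat both a missing key and an explicit None as 'use the default'."""
    value = cfg.get("rate_limit")
    return DEFAULT_RATE_LIMIT if value is None else value


def under_page_limit(discovered_count: int, max_pages: int) -> bool:
    """Return True while discovery may continue; -1 means unlimited."""
    if max_pages == -1:
        return True  # without this guard, count < -1 is never True
    return discovered_count < max_pages
```

Comparing the None from the first case against an integer is what raises the `'>' not supported` error, and the unguarded comparison in the second case is why the loop body never ran.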


v3.5.0 · Breaking risk · 3mo ago

⚠ Upgrade required

All content extraction features (pattern detection, test examples, how‑to guides, config extraction, router generation) are now enabled by default; no opt‑in required
Dynamic routing via `_build_argv()` replaces manual argument forwarding and adds 7 previously missing CLI flags
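
Dynamic argument forwarding of the kind described above can be sketched as a function that rebuilds an argv list from parsed arguments, so new flags are forwarded without a hand-maintained mapping. This is an assumed design for illustration. The name `build_argv`, its `skip` parameter, and its rules are hypothetical and do not reproduce the project's `_build_argv()`.

```python
import argparse


def build_argv(args: argparse.Namespace, skip: frozenset = frozenset()) -> list:
    """Turn parsed arguments back into a flat list of CLI tokens."""
    argv = []
    for name, value in sorted(vars(args).items()):
        # Unset options and disabled switches are simply not forwarded.
        if name in skip or value is None or value is False:
            continue
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch, no value token
        else:
            argv.extend([flag, str(value)])
    return argv
```

For example, `build_argv(argparse.Namespace(max_pages=50, browser=True, name=None))` yields `["--browser", "--max-pages", "50"]`.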

Breaking changes

Renamed `claude-enhanced` merge mode to `ai-enhanced` (backward‑compatible alias retained)
Removed hardcoded Claude references across the codebase
Removed GitHub API analysis limit of 50 files and config extraction limit of 100 files

Security fixes

Removed command injection vulnerability from cloned repo script execution
Replaced `git add -A` with targeted staging in marketplace publisher
Cleared auth tokens from cached `.git/config` after clone

Notable features

Grand Unification: single `create` command for 18 source types with auto‑detection and direct converters
Agent‑agnostic `AgentClient` abstraction supporting Claude, Kimi, Codex, Copilot, OpenCode, and custom agents via API‑key detection
Headless browser rendering (`--browser` flag) using Playwright to handle JavaScript SPAs

Full changelog

[3.5.0] - 2026-04-09

Theme: Grand Unification — one command, one interface, direct converters. Agent-agnostic architecture, marketplace pipeline, smart SPA discovery, all content extraction enabled by default. 80+ files changed across the codebase.

Added

Grand Unification — unified create command as single entry point for all 18 source types with auto-detection, direct converter invocation, and centralized enhancement (#346)
Agent-agnostic AgentClient abstraction — all 5 enhancers now support Claude, Kimi, Codex, Copilot, OpenCode, and custom agents via a unified interface. Auto-detects agent from API keys instead of hardcoding (#336)
Kimi CLI integration with stdin piping and output parsing (#336)
MarketplacePublisher — publish skills to Claude Code plugin marketplace repos (#336)
MarketplaceManager — register and manage marketplace repositories (#336)
ConfigPublisher — push configs to registered config source repos (#336)
push_config MCP tool for automated config publishing (#336)
Smart SPA discovery engine — three-layer discovery: sitemap.xml, llms.txt, SPA nav rendering (#336)
"browser": true config support for JavaScript SPA sites with browser renderer timeout defaults (60s, domcontentloaded) (#336)
Dynamic routing via _build_argv() — replaced manual arg forwarding with dynamic forwarder, added 7 missing CLI flags (#336)
Kotlin language support for codebase analysis — Full C3.x pipeline support: AST parsing (classes, objects, functions, data/sealed classes, extension functions, coroutines), dependency extraction, design pattern recognition (object declaration→Singleton, companion object→Factory, sealed class→Strategy), test example extraction (JUnit, Kotest, MockK, Spek), language detection patterns, config detection (build.gradle.kts), and extension maps across all analyzers (#287)
Headless browser rendering (--browser flag) — uses Playwright to render JavaScript SPA sites (React, Vue, etc.) that return empty HTML shells. Auto-installs Chromium on first use. Optional dep: pip install "skill-seekers[browser]" (#321)
skill-seekers doctor command — 8 diagnostic checks (Python version, package install, git, core/optional deps, API keys, MCP server, output dir) with pass/warn/fail status and --verbose flag (#316)
Prompt injection check workflow — bundled prompt-injection-check workflow scans scraped content for injection patterns (role assumption, instruction overrides, delimiter injection, hidden instructions). Added as first stage in default and security-focus workflows. Flags suspicious content without removing it (#324)
Codex CLI plugin manifest (.codex-plugin/plugin.json) for OpenAI Codex integration (#350)
6 behavioral UML diagrams — 3 sequence (create pipeline, GitHub+C3.x flow, MCP invocation), 2 activity (source detection, enhancement pipeline), 1 component (runtime dependencies with interface contracts)
134 new tests — test_agent_client.py, test_config_publisher.py, _build_argv tests. Total: 3194 passed, 39 expected skips (#336)
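
The flag-without-removing behaviour of the prompt injection check can be illustrated with a small scanner. The changelog does not say how the bundled workflow detects patterns, so the regular expressions and the `flag_injections` name below are stand-in assumptions covering the four listed categories.

```python
import re

# Illustrative patterns only; the bundled workflow's real rules are not
# shown in the changelog.
PATTERNS = {
    "instruction_override": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
    ),
    "role_assumption": re.compile(r"\byou are now\b", re.IGNORECASE),
    "delimiter_injection": re.compile(r"</?(system|assistant)>", re.IGNORECASE),
    "hidden_instruction": re.compile(
        r"<!--.*?(instruction|prompt).*?-->", re.IGNORECASE | re.DOTALL
    ),
}


def flag_injections(text: str) -> list:
    """Return the matched category names; the text itself is not modified."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

Because the function only reports categories, the scraped content stays intact and a later stage can decide what to do with flagged pages.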

Changed

All content extraction features enabled by default — pattern detection, test examples, how-to guides, config extraction, and router generation no longer require explicit opt-in
Renamed claude-enhanced merge mode to ai-enhanced — backward compatibility alias kept (#336)
Removed 118+ hardcoded Claude references across 60+ files (#336)
Refactored 5 enhancers to use AgentClient abstraction (#336)
Removed 50-file GitHub API analysis limit (#336)
Removed 100-file config extraction limit (#336)
Changed the unified scraper's default max_pages from 100 to 500 (#336)
Centralized enhancement timeouts to 45min default with unlimited support (#336)
Excluded slow MCP/e2e tests from CI coverage step to prevent timeout

Fixed

glob('*.md') replaced with rglob('*.md') in all adaptors — fixes packaging when skills are in nested directories (#349)
scraped_data list-vs-dict bug in conflict detection (#336)
base_url passthrough to doc scraper subprocess (#336)
URL filtering now uses base directory correctly (#336)
C3.x analysis data loss (#336)
--enhance-level flag not passed correctly (#336)
guide_enhancer method rename — _call_claude_api renamed to _call_ai (#336)
11 pre-existing test failures fixed (#336)
Per-file language detection in GitHub scraper (#336)
GitHub language detection crashes with TypeError when API response contains non-integer metadata keys (e.g., "url") — now filters to integer values only (#322)
C3.x codebase analysis crashes with TypeError — _run_c3_analysis() and _analyze_c3x() passed removed enhance_with_ai/ai_mode kwargs to analyze_codebase() instead of enhance_level (#323)
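
The glob-to-rglob fix in the list above is standard `pathlib` behaviour and can be checked directly. The directory layout below is made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SKILL.md").write_text("# skill\n", encoding="utf-8")

    # A nested reference file, as found in unified skills.
    nested = root / "references" / "api"
    nested.mkdir(parents=True)
    (nested / "endpoints.md").write_text("# endpoints\n", encoding="utf-8")

    # glob("*.md") matches only direct children of root.
    top_level = sorted(p.name for p in root.glob("*.md"))
    # rglob("*.md") also walks subdirectories.
    recursive = sorted(p.name for p in root.rglob("*.md"))

print(top_level)
print(recursive)
```

Here `top_level` contains only `SKILL.md`, while `recursive` also picks up `endpoints.md`, which is why packaging with `glob` dropped nested files.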