Two changes to prevent unnecessary Anthropic prompt cache misses in the
gateway, where a fresh AIAgent is created per user message:
1. Reuse stored system prompt for continuing sessions:
When conversation_history is non-empty, load the system prompt from
the session DB instead of rebuilding from disk. The model already has
updated memory in its conversation history (it wrote it!), so
re-reading memory from disk produces a different system prompt that
breaks the cache prefix.
2. Stabilize Honcho context per session:
- Only prefetch Honcho context on the first turn (empty history)
- Bake Honcho context into the cached system prompt and store to DB
- Remove the per-turn Honcho injection from the API call loop
This ensures the system message is identical across all turns in a
session. Previously, re-fetching Honcho could return different context
on each turn, changing the system message and invalidating the cache.
Both changes preserve the existing behavior for compression (which
invalidates the prompt and rebuilds from scratch) and for the CLI
(where the same AIAgent persists and the cached prompt is already
stable across turns).
Tests: 2556 passed (6 new)
Terminal output was already redacted via redact_sensitive_text() but
read_file and search_files returned raw content. Now both tools
redact secrets before returning results to the LLM.
Based on PR #372 by @teyrebaz33 (closes#363) — applied manually
due to branch conflicts with the current codebase.
The summary message was always injected as 'user' role, which causes
consecutive user messages when the last preserved head message is also
'user'. Some APIs reject this (400 error), and it produces malformed
training data.
Fix: check the role of the last head message and pick the opposite role
for the summary — 'user' after assistant/tool, 'assistant' after user.
Based on PR #328 by johnh4098. Closes#328.
Split fallback provider handling into two clean registries:
_FALLBACK_API_KEY_PROVIDERS — env-var-based (openrouter, zai, kimi, minimax)
_FALLBACK_OAUTH_PROVIDERS — OAuth-based (openai-codex, nous)
New _resolve_fallback_credentials() method handles all three cases
(OAuth, API key, custom endpoint) and returns a uniform (key, url, mode)
tuple. _try_activate_fallback() is now just validation + client build.
Adds Nous Portal as a fallback provider — uses the same OAuth flow
as the primary provider (hermes login), returns chat_completions mode.
OAuth providers get credential refresh for free: the existing 401
retry handlers (_try_refresh_codex/nous_client_credentials) check
self.provider, which is set correctly after fallback activation.
4 new tests (nous activation, nous no-login, codex retained).
27 total fallback tests passing, 2548 full suite.
Codex OAuth uses a different auth flow (OAuth tokens, not env vars)
and a different API mode (codex_responses, not chat_completions).
The fallback now handles this specially:
- Resolves credentials via resolve_codex_runtime_credentials()
- Sets api_mode to codex_responses
- Fails gracefully if no Codex OAuth session exists
Also added to the commented-out config.yaml example.
2 new tests (codex activation + graceful failure).
The gateway had a SEPARATE compression system ('session hygiene')
with hardcoded thresholds (100k tokens / 200 messages) that were
completely disconnected from the model's context length and the
user's compression config in config.yaml. This caused premature
auto-compression on Telegram/Discord — triggering at ~60k tokens
(from the 200-message threshold) or inconsistent token counts.
Changes:
- Gateway hygiene now reads model name from config.yaml and uses
get_model_context_length() to derive the actual context limit
- Compression threshold comes from compression.threshold in
config.yaml (default 0.85), same as the agent's ContextCompressor
- Removed the message-count-based trigger (was redundant and caused
false positives in tool-heavy sessions)
- Removed the undocumented session_hygiene config section — the
standard compression.* config now controls everything
- Env var overrides (CONTEXT_COMPRESSION_THRESHOLD,
CONTEXT_COMPRESSION_ENABLED) are respected
- Warn threshold is now 95% of model context (was hardcoded 200k)
- Updated tests to verify model-aware thresholds, scaling across
models, and that message count alone no longer triggers compression
For claude-opus-4.6 (200k context) at 85% threshold: gateway
hygiene now triggers at 170k tokens instead of the old 100k.
New browser capabilities and a built-in skill for agent-driven web QA.
## New tool: browser_console
Returns console messages (log/warn/error/info) AND uncaught JavaScript
exceptions in a single call. Uses agent-browser's 'console' and 'errors'
commands through the existing session plumbing. Supports --clear to reset
buffers. Verified working in both local and Browserbase cloud modes.
## Enhanced tool: browser_vision(annotate=True)
New boolean parameter on browser_vision. When true, agent-browser overlays
numbered [N] labels on interactive elements — each [N] maps to ref @eN.
Annotation data (element name, role, bounding box) returned alongside the
vision analysis. Useful for QA reports and spatial reasoning.
## Config: browser.record_sessions
Auto-record browser sessions as WebM video files when enabled:
- Starts recording on first browser_navigate
- Stops and saves on browser_close
- Saves to ~/.hermes/browser_recordings/
- Works in both local and cloud modes (verified)
- Disabled by default
## Built-in skill: dogfood
Systematic exploratory QA testing for web applications. Teaches the agent
a 5-phase workflow:
1. Plan — accept URL, create output dirs, set scope
2. Explore — systematic crawl with annotated screenshots
3. Collect Evidence — screenshots, console errors, JS exceptions
4. Categorize — severity (Critical/High/Medium/Low) and category
(Functional/Visual/Accessibility/Console/UX/Content)
5. Report — structured markdown with per-issue evidence
Includes:
- skills/dogfood/SKILL.md — full workflow instructions
- skills/dogfood/references/issue-taxonomy.md — severity/category defs
- skills/dogfood/templates/dogfood-report-template.md — report template
## Tests
21 new tests covering:
- browser_console message/error parsing, clear flag, empty/failed states
- browser_console schema registration
- browser_vision annotate schema and flag passing
- record_sessions config defaults and recording lifecycle
- Dogfood skill file existence and content validation
Addresses #315.
Remove hallucinated providers (openai, deepseek, together, groq,
fireworks, mistral, gemini, nous) from the fallback provider map.
These don't exist in hermes-agent's provider system.
The real supported providers for fallback are:
openrouter (OPENROUTER_API_KEY)
zai (ZAI_API_KEY)
kimi-coding (KIMI_API_KEY)
minimax (MINIMAX_API_KEY)
minimax-cn (MINIMAX_CN_API_KEY)
For any other OpenAI-compatible endpoint, users can use the
base_url + api_key_env overrides in the config.
Also adds Kimi User-Agent header for kimi fallback (matching
the main provider system).
When the primary model/provider fails after retries (rate limit, overload,
auth errors, connection failures), Hermes automatically switches to a
configured fallback model for the remainder of the session.
Config (in ~/.hermes/config.yaml):
fallback_model:
provider: openrouter
model: anthropic/claude-sonnet-4
Supports all major providers: OpenRouter, OpenAI, Nous, DeepSeek, Together,
Groq, Fireworks, Mistral, Gemini — plus custom endpoints via base_url and
api_key_env overrides.
Design principles:
- Dead simple: one fallback model, not a chain
- One-shot: switches once, doesn't ping-pong back
- Zero new dependencies: uses existing OpenAI client
- Minimal code: ~100 lines in run_agent.py, ~5 lines in cli.py/gateway
- Three trigger points: max retries exhausted, non-retryable client errors,
and invalid response exhaustion
Does NOT trigger on context overflow or payload-too-large errors (those
are handled by the existing compression system).
Addresses #737.
25 new tests, 2492 total passing.
Complete Signal adapter using signal-cli daemon HTTP API.
Based on PR #268 by ibhagwan, rebuilt on current main with bug fixes.
Architecture:
- SSE streaming for inbound messages with exponential backoff (2s→60s)
- JSON-RPC 2.0 for outbound (send, typing, attachments, contacts)
- Health monitor detects stale SSE connections (120s threshold)
- Phone number redaction in all logs and global redact.py
Features:
- DM and group message support with separate access policies
- DM policies: pairing (default), allowlist, open
- Group policies: disabled (default), allowlist, open
- Attachment download with magic-byte type detection
- Typing indicators (8s refresh interval)
- 100MB attachment size limit, 8000 char message limit
- E.164 phone + UUID allowlist support
Integration:
- Platform.SIGNAL enum in gateway/config.py
- Signal in _is_user_authorized() allowlist maps (gateway/run.py)
- Adapter factory in _create_adapter() (gateway/run.py)
- user_id_alt/chat_id_alt fields in SessionSource for UUIDs
- send_message tool support via httpx JSON-RPC (not aiohttp)
- Interactive setup wizard in 'hermes gateway setup'
- Connectivity testing during setup (pings /api/v1/check)
- signal-cli detection and install guidance
Bug fixes from PR #268:
- Timestamp reads from envelope_data (not outer wrapper)
- Uses httpx consistently (not aiohttp in send_message tool)
- SIGNAL_DEBUG scoped to signal logger (not root)
- extract_images regex NOT modified (preserves group numbering)
- pairing.py NOT modified (no cross-platform side effects)
- No dual authorization (adapter defers to run.py for user auth)
- Wildcard uses set membership ('*' in set, not list equality)
- .zip default for PK magic bytes (not .docx)
No new Python dependencies — uses httpx (already core).
External requirement: signal-cli daemon (user-installed).
Tests: 30 new tests covering config, init, helpers, session source,
phone redaction, authorization, and send_message integration.
Co-authored-by: ibhagwan <ibhagwan@users.noreply.github.com>
The gateway had a SEPARATE compression system ('session hygiene')
with hardcoded thresholds (100k tokens / 200 messages) that were
completely disconnected from the model's context length and the
user's compression config in config.yaml. This caused premature
auto-compression on Telegram/Discord — triggering at ~60k tokens
(from the 200-message threshold) or inconsistent token counts.
Changes:
- Gateway hygiene now reads model name from config.yaml and uses
get_model_context_length() to derive the actual context limit
- Compression threshold comes from compression.threshold in
config.yaml (default 0.85), same as the agent's ContextCompressor
- Removed the message-count-based trigger (was redundant and caused
false positives in tool-heavy sessions)
- Removed the undocumented session_hygiene config section — the
standard compression.* config now controls everything
- Env var overrides (CONTEXT_COMPRESSION_THRESHOLD,
CONTEXT_COMPRESSION_ENABLED) are respected
- Warn threshold is now 95% of model context (was hardcoded 200k)
- Updated tests to verify model-aware thresholds, scaling across
models, and that message count alone no longer triggers compression
For claude-opus-4.6 (200k context) at 85% threshold: gateway
hygiene now triggers at 170k tokens instead of the old 100k.
The 'openai' provider was redundant — using OPENAI_BASE_URL +
OPENAI_API_KEY with provider: 'main' already covers direct OpenAI API.
Provider options are now: auto, openrouter, nous, codex, main.
- Removed _try_openai(), _OPENAI_AUX_MODEL, _OPENAI_BASE_URL
- Replaced openai tests with codex provider tests
- Updated all docs to remove 'openai' option and clarify 'main'
- 'main' description now explicitly mentions it works with OpenAI API,
local models, and any OpenAI-compatible endpoint
Tests: 2467 passed.
The Codex Responses API (chatgpt.com/backend-api/codex) supports
vision via gpt-5.3-codex. This was verified with real API calls
using image analysis.
Changes to _CodexCompletionsAdapter:
- Added _convert_content_for_responses() to translate chat.completions
multimodal format to Responses API format:
- {type: 'text'} → {type: 'input_text'}
- {type: 'image_url', image_url: {url: '...'}} → {type: 'input_image', image_url: '...'}
- Fixed: removed 'stream' from resp_kwargs (responses.stream() handles it)
- Fixed: removed max_output_tokens and temperature (Codex endpoint rejects them)
Provider changes:
- Added 'codex' as explicit auxiliary provider option
- Vision auto-fallback now includes Codex (OpenRouter → Nous → Codex)
since gpt-5.3-codex supports multimodal input
- Updated docs with Codex OAuth examples
Tested with real Codex OAuth token + ~/.hermes/image2.png — confirmed
working end-to-end through the full adapter pipeline.
Tests: 2459 passed.
The Codex model normalization was rejecting any model without 'codex'
in its name, forcing a fallback to gpt-5.3-codex. This blocked models
like gpt-5.4 that the Codex API actually supports.
The fix simplifies _normalize_model_for_provider() to two operations:
1. Strip provider prefixes (API needs bare slugs)
2. Replace the *untouched default* model with a Codex-compatible one
If the user explicitly chose a model — any model — we trust them and
let the API be the judge. No allowlists, no slug checks.
Also removes the 'codex not in slug' filter from _read_cache_models()
so the local cache preserves all API-available models.
Inspired by OpenClaw's approach which explicitly lists non-codex models
(gpt-5.4, gpt-5.2) as valid Codex models.
Users can now set provider: "openai" for auxiliary tasks (vision, web
extract, compression) to use OpenAI's API directly with their
OPENAI_API_KEY. This hits api.openai.com/v1 with gpt-4o-mini as the
default model — supports vision since GPT-4o handles image input.
Provider options are now: auto, openrouter, nous, openai, main.
Changes:
- agent/auxiliary_client.py: added _try_openai(), "openai" case in
_resolve_forced_provider(), updated auxiliary_max_tokens_param()
to use max_completion_tokens for OpenAI
- Updated docs: cli-config.yaml.example, AGENTS.md, and user-facing
configuration.md with Common Setups section showing OpenAI,
OpenRouter, and local model examples
- 3 new tests for OpenAI provider resolution
Tests: 2459 passed (was 2429).
Improvements on top of PR #606 (auxiliary model configuration):
1. Gateway bridge: Added auxiliary.* and compression.summary_provider
config bridging to gateway/run.py so config.yaml settings work from
messaging platforms (not just CLI). Matches the pattern in cli.py.
2. Vision auto-fallback safety: In auto mode, vision now only tries
OpenRouter + Nous Portal (known multimodal-capable providers).
Custom endpoints, Codex, and API-key providers are skipped to avoid
confusing errors from providers that don't support vision input.
Explicit provider override (AUXILIARY_VISION_PROVIDER=main) still
allows using any provider.
3. Comprehensive tests (46 new):
- _get_auxiliary_provider env var resolution (8 tests)
- _resolve_forced_provider with all provider types (8 tests)
- Per-task provider routing integration (4 tests)
- Vision auto-fallback safety (7 tests)
- Config bridging logic (11 tests)
- Gateway/CLI bridge parity (2 tests)
- Vision model override via env var (2 tests)
- DEFAULT_CONFIG shape validation (4 tests)
4. Docs: Added auxiliary_client.py to AGENTS.md project structure.
Updated module docstring with separate text/vision resolution chains.
Tests: 2429 passed (was 2383).
- Added support for auxiliary model overrides in the configuration, allowing users to specify providers and models for vision and web extraction tasks.
- Updated the CLI configuration example to include new auxiliary model settings.
- Enhanced the environment variable mapping in the CLI to accommodate auxiliary model configurations.
- Improved the resolution logic for auxiliary clients to support task-specific provider overrides.
- Updated relevant documentation and comments for clarity on the new features and their usage.
Add contextual [Hint: ...] suffixes to tool results where they save
real iterations:
- patch (no match): suggests read_file/search_files to verify content
before retrying — addresses the common pattern where the agent retries
with stale old_string instead of re-reading the file.
- search_files (truncated): provides explicit next offset and suggests
narrowing the search — clearer than relying on total_count inference.
Other hints proposed in #722 (terminal, web_search, web_extract,
browser_snapshot, search zero-results, search content-matches) were
evaluated and found to be low-value: either already covered by existing
mechanisms (read_file pagination, similar-files, schema descriptions)
or guidance the agent already follows from its own reasoning.
5 new tests covering hint presence/absence for both tools.
When resuming a session via --continue or --resume, show a compact recap
of the previous conversation inside a Rich panel before the input prompt.
This gives users immediate visual context about what was discussed.
Changes:
- Add _preload_resumed_session() to load session history early (in run(),
before banner) so _init_agent() doesn't need a separate DB round-trip
- Add _display_resumed_history() that renders a formatted recap panel:
* User messages shown with gold bullet (truncated at 300 chars)
* Assistant responses shown with green diamond (truncated at 200 chars / 3 lines)
* Tool calls collapsed to count + tool names
* System messages and tool results hidden
* <REASONING_SCRATCHPAD> blocks stripped from display
* Pure-reasoning messages (no visible output) skipped entirely
* Capped at last 10 exchanges with 'N earlier messages' indicator
* Dim/muted styling distinguishes recap from active conversation
- Add display.resume_display config option: 'full' (default) or 'minimal'
- Store resume_display as instance variable (like compact) for testability
- 27 new tests covering all display scenarios, config, and edge cases
Closes#719
NOUS_API_KEY is unused — vision tools use OPENROUTER_API_KEY or Nous
Portal OAuth (auth.json), and MoA tools use OPENROUTER_API_KEY.
Removed from:
- hermes_cli/config.py: api_keys allowlist for config set routing
- .env.example: example env file entry and comment
- tests/hermes_cli/test_set_config_value.py: parametrize test data
- tests/integration/test_web_tools.py: updated comments and log
messages to reference 'auxiliary LLM provider' instead of NOUS_API_KEY
No HECATE references found in codebase (already cleaned up).
Add `hermes sessions browse` — a curses-based interactive session picker
with live type-to-search filtering, arrow key navigation, and seamless
session resume via Enter.
Features:
- Arrow keys to navigate, Enter to select and resume, Esc/q to quit
- Type characters to live-filter sessions by title, preview, source, or ID
- Backspace to edit filter, first Esc clears filter, second Esc exits
- Adaptive column layout (title/preview, last active, source, ID)
- Scrolling support for long session lists
- --source flag to filter by platform (cli, telegram, discord, etc.)
- --limit flag to control how many sessions to load (default: 50)
- Windows fallback: numbered list with input prompt
- After selection, seamlessly execs into `hermes --resume <id>`
Design decisions:
- Separate subcommand (not a flag on -c) — preserves `hermes -c` as-is
for instant most-recent-session resume
- Uses curses (not simple_term_menu) per Known Pitfalls to avoid the
arrow-key ghost-duplication rendering bug in tmux/iTerm
- Follows existing curses pattern from hermes_cli/tools_config.py
Also fixes: removed redundant `import os` inside cmd_sessions stats
block that shadowed the module-level import (would cause UnboundLocalError
if browse action was taken in the same function).
Tests: 33 new tests covering curses picker, fallback mode, filtering,
navigation, edge cases, and argument parser registration.
Verifies that setup.py imports the correct function name
(get_codex_model_ids) from codex_models.py. This would have caught
the ImportError bug before it reached users.
Source code (hermes_cli/clipboard.py):
- _convert_to_png() lost the file when both Pillow and ImageMagick were
unavailable: path.rename(tmp) moved the file to .bmp, then subprocess.run
raised FileNotFoundError, but the file was never renamed back. The final
fallback 'return path.exists()' returned False.
- Fix: restore the original file in both except handlers by renaming tmp
back to path when the original is missing.
Test (tests/tools/test_clipboard.py):
- test_file_still_usable_when_no_converter expected 'from PIL import Image'
to raise an Exception, but Pillow is installed so pytest.raises fired
'DID NOT RAISE'. The test also never called _convert_to_png().
- Fix: properly mock PIL unavailability via patch.dict(sys.modules),
actually call _convert_to_png(), and assert the correct result.
Messaging users can now switch back to previously-named sessions:
- /resume My Project — resolves the title (with auto-lineage) and
restores that session's conversation history
- /resume (no args) — lists recent titled sessions to choose from
Adds SessionStore.switch_session() which ends the current session and
points the session entry at the target session ID so the old transcript
is loaded on the next message. Running agents are cleared on switch.
Completes the session naming feature from PR #720 for gateway users.
8 new tests covering: name resolution, lineage auto-latest, already-on-
session check, nonexistent names, agent cleanup, no-DB fallback, and
listing titled sessions.
When _ensure_runtime_credentials() resolves the provider to openai-codex,
check if the active model is Codex-compatible. If not (e.g. the default
anthropic/claude-opus-4.6), swap it for the best available Codex model.
Also strips provider prefixes the Codex API rejects (openai/gpt-5.3-codex
→ gpt-5.3-codex).
Adds _model_is_default flag so warnings are only shown when the user
explicitly chose an incompatible model (not when it's the config default).
Fixes#651.
Co-inspired-by: stablegenius49 (PR #661)
Co-inspired-by: teyrebaz33 (PR #696)
Previously, search_files would silently return 0 results when the
search path didn't exist (e.g., /root/.hermes/... when HOME is
/home/user). The path was passed to rg/grep/find which would fail
silently, and the empty stdout was parsed as 'no matches found'.
Changes:
- Add path existence check at the top of search() using test -e.
Returns SearchResult with a clear error message when path doesn't exist.
- Add exit code 2 checks in _search_with_rg() and _search_with_grep()
as secondary safety net for other error types (bad regex, permissions).
- Add 4 new tests covering: nonexistent path (content mode), nonexistent
path (files mode), existing path proceeds normally, rg error exit code.
Tests: 37 → 41 in test_file_operations.py, full suite 2330 passed.
Adds a fun alias for skipping all dangerous command approval prompts.
When passed, sets HERMES_YOLO_MODE=1 which causes check_dangerous_command()
to auto-approve everything.
Available on both top-level and chat subcommand:
hermes --fuck-it-ship-it
hermes chat --fuck-it-ship-it
Includes 5 tests covering normal blocking, yolo bypass, all patterns,
and edge cases (empty string env var).
- Empty string titles normalized to None (prevents uncaught IntegrityError
when two sessions both get empty-string titles via the unique index)
- Escape SQL LIKE wildcards (%, _) in resolve_session_by_title and
get_next_title_in_lineage to prevent false matches on titles like
'test_project' matching 'testXproject #2'
- Optimize list_sessions_rich from N+2 queries to a single query with
correlated subqueries (preview + last_active computed in SQL)
- Add /title slash command to gateway (Telegram, Discord, Slack, WhatsApp)
with set and show modes, uniqueness conflict handling
- Add /title to gateway /help text and _known_commands
- 12 new tests: empty string normalization, multi-empty-title safety,
SQL wildcard edge cases, gateway /title set/show/conflict/cross-platform
Two 'except Exception: pass' blocks silently hide failures:
- mirror_to_session failure: user's message never gets mirrored, no trace
- config.yaml parse failure: wrong model used silently
Replace with logger.warning so failures are visible in logs.
Completed/cancelled items are now filtered from format_for_injection()
output. Update the existing test to verify active items appear and
completed items are excluded.
- Block file reads after 3+ re-reads of same region (no content returned)
- Track search_files calls and block repeated identical searches
- Filter completed/cancelled todos from post-compression injection
to prevent agent from re-doing finished work
- Add 10 new tests covering all three fixes
When context compression summarizes conversation history, the agent
loses track of which files it already read and re-reads them in a loop.
Users report the agent reading the same files endlessly without writing.
Root cause: context compression is lossy — file contents and read history
are lost in the summary. After compression, the model thinks it hasn't
examined the files yet and reads them again.
Fix (two-part):
1. Track file reads per task in file_tools.py. When the same file region
is read again, include a _warning in the response telling the model
to stop re-reading and use existing information.
2. After context compression, inject a structured message listing all
files already read in the session with explicit "do NOT re-read"
instruction, preserving read history across compression boundaries.
Adds 16 tests covering warning detection, task isolation, summary
accuracy, tracker cleanup, and compression history injection.
Call ensure_hub_dirs() at the start of hermes skills list so the\nSkills Hub directory structure is created before reading hub\nmetadata.\n\nAdd a regression test covering the empty-home path where\ndoctor recommends running the list command.\n\nRefs: #703
Images pasted in the CLI were embedded as raw base64 image_url content
parts in the conversation history, which only works with vision-capable
models. If the main model (e.g. Nous API) doesn't support vision, this
breaks the request and poisons all subsequent messages.
Now the CLI uses the same approach as the messaging gateway: images are
pre-processed through the auxiliary vision model (Gemini Flash via
OpenRouter or Nous Portal) and converted to text descriptions. The
local file path is included so the agent can re-examine via
vision_analyze if needed. Works with any model.
Fixes#638.
User messaging improvements:
- Rejection: '(>_<) Error: not a valid model' instead of '(^_^) Warning: Error:'
- Rejection: shows 'Model unchanged' + tip about /model and /provider
- Session-only: explains 'this session only' with reason and 'will revert on restart'
- Saved: clear '(saved to config)' confirmation
Docs updated:
- cli-commands.md, cli.md, messaging/index.md: /model now shows
provider:model syntax, /provider command added to tables
Test fixes: deduplicated test names, assertions match new messages.
/provider command (CLI + gateway):
Shows all providers with auth status (✓/✗), aliases, and active marker.
Users can now discover what provider names work with provider:model syntax.
Gateway bugs fixed:
- Config was saved even when validation.persist=False (told user 'session
only' but actually persisted the unvalidated model)
- HERMES_INFERENCE_PROVIDER env var not set on provider switch, causing
the switch to be silently overridden if that env var was already set
parse_model_input hardened:
- Colon only treated as provider delimiter if left side is a recognized
provider name or alias. 'anthropic/claude-3.5-sonnet:beta' now passes
through as a model name instead of trying provider='anthropic/claude-3.5-sonnet'.
- HTTP URLs, random colons no longer misinterpreted.
56 tests passing across model validation, CLI commands, and integration.
Add provider:model syntax to /model command for runtime provider switching:
/model zai:glm-5 → switch to Z.AI provider with glm-5
/model nous:hermes-3 → switch to Nous Portal with hermes-3
/model openrouter:anthropic/claude-sonnet-4.5 → explicit OpenRouter
When switching providers, credentials are resolved via resolve_runtime_provider
and validated before committing. Both model and provider are saved to config.
Provider aliases work (glm: → zai, kimi: → kimi-coding, etc.).
Enhanced /model (no args) display now shows:
- Current model and provider
- Curated model list for the current provider with ← marker
- Usage examples including provider:model syntax
39 tests covering parse_model_input, curated_models_for_provider,
provider switching (success + credential failure), and display output.
The 200 lines of prompt_toolkit/rich/fire stubs added in PR #650 were
guarded by 'if module in sys.modules: return' and never activated since
those dependencies are always installed. Removed to keep the test file
lean. Also removed unused MagicMock and pytest imports.
Not all providers require 'provider/model' format. Removing the rigid
format check lets the live API probe handle all validation uniformly.
If someone types 'gpt-5.4' on OpenRouter, the probe won't find it and
will suggest 'openai/gpt-5.4' — better UX than a format rejection.
Replace the static catalog-based model validation with a live API probe.
The /model command now hits the provider's /models endpoint to check if
the requested model actually exists:
- Model found in API → accepted + saved to config
- Model NOT found in API → rejected with 'Error: not a valid model'
and fuzzy-match suggestions from the live model list
- API unreachable → graceful fallback to hardcoded catalog (session-only
for unrecognized models)
- Format errors (empty, spaces, missing '/') still caught instantly
without a network call
The API probe takes ~0.2s for OpenRouter (346 models) and works with any
OpenAI-compatible endpoint (Ollama, vLLM, custom, etc.).
32 tests covering all paths: format checks, API found, API not found,
API unreachable fallback, CLI integration.
- Wrap validate_requested_model in try/except so /model doesn't crash
if validation itself fails (falls back to old accept+save behavior)
- Remove unnecessary sys.path.insert from both test files
- Expand test_model_validation.py: 4 → 23 tests covering normalize_provider,
provider_model_ids, empty/whitespace/spaces rejection, OpenRouter format
validation, custom endpoints, nous provider, provider aliases, unknown
providers, fuzzy suggestions
- Expand test_cli_model_command.py: 2 → 5 tests adding known-model save,
validation crash fallback, and /model with no argument
Skills can now declare runtime prerequisites (env vars, CLI binaries) via
YAML frontmatter. Skills with unmet prerequisites are excluded from the
system prompt so the agent never claims capabilities it can't deliver, and
skill_view() warns the agent about what's missing.
Three layers of defense:
- build_skills_system_prompt() filters out unavailable skills
- _find_all_skills() flags unmet prerequisites in metadata
- skill_view() returns prerequisites_warning with actionable details
Tagged 12 bundled skills that have hard runtime dependencies:
gif-search (TENOR_API_KEY), notion (NOTION_API_KEY), himalaya, imessage,
apple-notes, apple-reminders, openhue, duckduckgo-search, codebase-inspection,
blogwatcher, songsee, mcporter.
Closes#658Fixes#630
Removed the hard block on base_url containing 'api.anthropic.com'.
Anthropic now offers an OpenAI-compatible /chat/completions endpoint,
so blocking their URL prevents legitimate use. If the endpoint isn't
compatible, the API call will fail with a proper error anyway.
Removed from: run_agent.py, mini_swe_runner.py
Updated test to verify Anthropic URLs are accepted.
browser_vision now saves screenshots persistently to ~/.hermes/browser_screenshots/
and returns the screenshot_path in its JSON response. The model can include
MEDIA:<path> in its response to share screenshots as native photos.
Changes:
- browser_tool.py: Save screenshots persistently, return screenshot_path,
auto-cleanup files older than 24 hours, mkdir moved inside try/except
- telegram.py: Add send_image_file() — sends local images via bot.send_photo()
- discord.py: Add send_image_file() — sends local images via discord.File
- slack.py: Add send_image_file() — sends local images via files_upload_v2()
(WhatsApp already had send_image_file — no changes needed)
- prompt_builder.py: Updated Telegram hint to list image extensions,
added Discord and Slack MEDIA: platform hints
- browser.md: Document screenshot sharing and 24h cleanup
- send_file_integration_map.md: Updated to reflect send_image_file is now
implemented on Telegram/Discord/Slack
- test_send_image_file.py: 19 tests covering MEDIA: .png extraction,
send_image_file on all platforms, and screenshot cleanup
Partially addresses #466 (Phase 0: platform adapter gaps for send_image_file).
Authored by christomitov. Auto-detects sk-kimi- key prefix and routes
to api.kimi.com/coding/v1. Adds User-Agent header for Kimi Code API
compatibility. Legacy Moonshot keys continue to work unchanged.
Critical fixes:
- Add --worktree/-w to hermes_cli/main.py argparse (both chat
subcommand and top-level parser) so 'hermes -w' works via the
actual CLI entry point, not just 'python cli.py -w'
- Pass worktree flag through cmd_chat() kwargs to cli_main()
- Handle worktree attr in bare 'hermes' and --resume/--continue paths
Bug fixes in cli.py:
- Skip worktree creation for --list-tools/--list-toolsets (wasteful)
- Wrap git worktree subprocess.run in try/except (crash on timeout)
- Add stale worktree pruning on startup (_prune_stale_worktrees):
removes clean worktrees older than 24h left by crashed/killed sessions
Documentation updates:
- AGENTS.md: add --worktree to CLI commands table
- cli-config.yaml.example: add worktree config section
- website/docs/reference/cli-commands.md: add to core commands
- website/docs/user-guide/cli.md: add usage examples
- website/docs/user-guide/configuration.md: add config docs
Test improvements (17 → 31 tests):
- Stale worktree pruning (prune old clean, keep recent, keep dirty)
- Directory symlink via .worktreeinclude
- Edge cases (no commits, not a repo, pre-existing .worktrees/)
- CLI flag/config OR logic
- TERMINAL_CWD integration
- System prompt injection format
Add a --worktree (-w) flag to the hermes CLI that creates an isolated
git worktree for the session. This allows running multiple hermes-agent
instances concurrently on the same repo without file collisions.
How it works:
- On startup with -w: detects git repo, creates .worktrees/<session>/
with its own branch (hermes/<session-id>), sets TERMINAL_CWD to it
- Each agent works in complete isolation — independent HEAD, index,
and working tree, shared git object store
- On exit: auto-removes worktree and branch if clean, warns and
keeps if there are uncommitted changes
- .worktreeinclude file support: list gitignored files (.env, .venv/)
to auto-copy/symlink into new worktrees
- .worktrees/ is auto-added to .gitignore
- Agent gets a system prompt note about the worktree context
- Config support: set worktree: true in config.yaml to always enable
Usage:
hermes -w # Interactive mode in worktree
hermes -w -q "Fix issue #123" # Single query in worktree
# Or in config.yaml:
worktree: true
Includes 17 tests covering: repo detection, worktree creation,
independence verification, cleanup (clean/dirty), .worktreeinclude,
.gitignore management, and 10 concurrent worktrees.
Closes#652
Long-lived gateway sessions can accumulate enough history that every new
message rehydrates an oversized transcript, causing repeated truncation
failures (finish_reason=length).
Add a session hygiene check in _handle_message that runs right after
loading the transcript and before invoking the agent:
1. Estimate message count and rough token count of the transcript
2. If above configurable thresholds (default: 200 msgs or 100K tokens),
auto-compress the transcript proactively
3. Notify the user about the compression with before/after stats
4. If still above warn threshold (default: 200K tokens) after
compression, suggest /reset
5. If compression fails on a dangerously large session, warn the user
to use /compress or /reset manually
Thresholds are configurable via config.yaml:
session_hygiene:
auto_compress_tokens: 100000
auto_compress_messages: 200
warn_tokens: 200000
This complements the agent's existing preflight compression (which
runs inside run_conversation) by catching pathological sessions at
the gateway layer before the agent is even created.
Includes 12 tests for threshold detection and token estimation.
Kimi Code (platform.kimi.ai) issues API keys prefixed sk-kimi- that require:
1. A different base URL: api.kimi.com/coding/v1 (not api.moonshot.ai/v1)
2. A User-Agent header identifying a recognized coding agent
Without this fix, sk-kimi- keys fail with 401 (wrong endpoint) or 403
('only available for Coding Agents') errors.
Changes:
- Auto-detect sk-kimi- key prefix and route to api.kimi.com/coding/v1
- Send User-Agent: KimiCLI/1.0 header for Kimi Code endpoints
- Legacy Moonshot keys (api.moonshot.ai) continue to work unchanged
- KIMI_BASE_URL env var override still takes priority over auto-detection
- Updated .env.example with correct docs and all endpoint options
- Fixed doctor.py health check for Kimi Code keys
Reference: https://github.com/MoonshotAI/kimi-cli (platforms.py)
Previously, when a session expired (idle/daily reset), the memory flush
ran synchronously inside get_or_create_session — blocking the user's
message for 10-60s while an LLM call saved memories.
Now a background watcher task (_session_expiry_watcher) runs every 5 min,
detects expired sessions, and flushes memories proactively in a thread
pool. By the time the user sends their next message, memories are
already saved and the response is immediate.
Changes:
- Add _is_session_expired(entry) to SessionStore — works from entry
alone without needing a SessionSource
- Add _pre_flushed_sessions set to track already-flushed sessions
- Remove sync _on_auto_reset callback from get_or_create_session
- Refactor flush into _flush_memories_for_session (sync worker) +
_async_flush_memories (thread pool wrapper)
- Add _session_expiry_watcher background task, started in start()
- Simplify /reset command to use shared fire-and-forget flush
- Add 10 tests for expiry detection, callback removal, tracking
Reduces token usage and latency for most tasks by defaulting to
medium reasoning effort instead of xhigh. Users can still override
via config or CLI flag. Updates code, tests, example config, and docs.
_convert_to_png() renamed the original file to .bmp before calling
ImageMagick convert, then unconditionally deleted the .bmp regardless
of whether convert succeeded. If convert failed, both files were gone.
- Only delete .bmp after confirmed successful conversion
- Restore original file on convert failure, timeout, or missing binary
- Add 3 tests covering failure, not-installed, and timeout scenarios
_make_cli() did not clear HERMES_MAX_ITERATIONS env var, so tests
failed in CI where the var was set externally. Also, default max_turns
changed from 60 to 90 in 0a82396 but tests were not updated.
- Clear HERMES_MAX_ITERATIONS in _make_cli() for proper isolation
- Add env_overrides parameter for tests that need specific env values
- Update hardcoded 60 assertions to 90 to match new default
- Simplify test_env_var_max_turns using env_overrides
When MarkdownV2 parsing fails, _strip_mdv2() removes escape backslashes
and bold markers (*text*) but missed italic markers (_text_). Users saw
raw underscores around italic text in the plaintext fallback.
- Add regex to strip _text_ italic markers in _strip_mdv2()
- Use word boundary lookaround to preserve snake_case identifiers
- Add tests for _strip_mdv2 covering italic, bold, snake_case, and edge cases
Add a 'platforms' field to SKILL.md frontmatter that restricts skills
to specific operating systems. Skills with platforms: [macos] only
appear in the system prompt, skills_list(), and slash commands on macOS.
Skills without the field load everywhere (backward compatible).
Implementation:
- skill_matches_platform() in tools/skills_tool.py — core filter
- Wired into all 3 discovery paths: prompt_builder.py, skills_tool.py,
skill_commands.py
- 28 new tests across 3 test files
New bundled Apple/macOS skills (all platforms: [macos]):
- imessage — Send/receive iMessages via imsg CLI
- apple-reminders — Manage Reminders via remindctl CLI
- apple-notes — Manage Notes via memo CLI
- findmy — Track devices/AirTags via AppleScript + screen capture
Docs updated: CONTRIBUTING.md, AGENTS.md, creating-skills.md,
skills.md (user guide)
Authored by areu01or00. Adds timezone support via hermes_time.now() helper
with IANA timezone resolution (HERMES_TIMEZONE env → config.yaml → server-local).
Updates system prompt timestamp, cron scheduling, and execute_code sandbox TZ
injection. Includes config migration (v4→v5) and comprehensive test coverage.
Introduces a new OpenClaw-to-Hermes migration skill with a Python
helper script that handles importing SOUL.md, memories, user profiles,
messaging settings, command allowlists, skills, TTS assets, and
workspace instructions.
Supports two migration presets (user-data / full), three skill conflict
modes (skip / overwrite / rename), overflow file export for entries that
exceed character limits, and granular include/exclude option filtering.
Includes detailed SKILL.md agent instructions covering the clarify-tool
interaction protocol, decision-to-command mapping, post-run reporting
rules, and path resolution guidance.
Adds dynamic panel width calculation to CLI clarify/approval widgets so
panels adapt to content and terminal size.
Includes 7 new tests covering presets, include/exclude, conflict modes,
overflow exports, and skills_guard integration.
Adds 4 new direct API-key providers (zai, kimi-coding, minimax, minimax-cn)
to the inference provider system. All use standard OpenAI-compatible
chat/completions endpoints with Bearer token auth.
Core changes:
- auth.py: Extended ProviderConfig with api_key_env_vars and base_url_env_var
fields. Added providers to PROVIDER_REGISTRY. Added provider aliases
(glm, z-ai, zhipu, kimi, moonshot). Added auto-detection of API-key
providers in resolve_provider(). Added resolve_api_key_provider_credentials()
and get_api_key_provider_status() helpers.
- runtime_provider.py: Added generic API-key provider branch in
resolve_runtime_provider() — any provider with auth_type='api_key'
is automatically handled.
- main.py: Added providers to hermes model menu with generic
_model_flow_api_key_provider() flow. Updated _has_any_provider_configured()
to check all provider env vars. Updated argparse --provider choices.
- setup.py: Added providers to setup wizard with API key prompts and
curated model lists.
- config.py: Added env vars (GLM_API_KEY, KIMI_API_KEY, MINIMAX_API_KEY,
etc.) to OPTIONAL_ENV_VARS.
- status.py: Added API key display and provider status section.
- doctor.py: Added connectivity checks for each provider endpoint.
- cli.py: Updated provider docstrings.
Docs: Updated README.md, .env.example, cli-config.yaml.example,
cli-commands.md, environment-variables.md, configuration.md.
Tests: 50 new tests covering registry, aliases, resolution, auto-detection,
credential resolution, and runtime provider dispatch.
Inspired by PR #33 (numman-ali) which proposed a provider registry approach.
Credit to tars90percent (PR #473) and manuelschipper (PR #420) for related
provider improvements merged earlier in this changeset.
Two bugs fixed:
1. search_messages() crashes with OperationalError when user queries
contain FTS5 special characters (+, ", (, {, dangling AND/OR, etc).
Added _sanitize_fts5_query() to strip dangerous operators and a
fallback try-except for edge cases.
2. _append_to_sqlite() in mirror.py creates a new SessionDB per call
but never closes it, leaking SQLite connections. Added finally block
to ensure db.close() is always called.
API key selection is now base_url-aware: when the resolved base_url
targets OpenRouter, OPENROUTER_API_KEY takes priority (preserving the
#289 fix). When hitting any other endpoint (Z.ai, vLLM, custom, etc.),
OPENAI_API_KEY takes priority so the OpenRouter key doesn't leak.
Applied in both the runtime provider resolver (the real code path) and
the CLI initial default (for consistency).
Fixes#560.
_make_cli() now patches CLI_CONFIG with clean defaults so
test_cli_init tests don't depend on the developer's local config.yaml.
test_empty_dir_returns_empty now mocks Path.home() so it doesn't pick
up a global SOUL.md.
Credit to teyrebaz33 for identifying and fixing these in PR #557.
Fixes#555.
tool_call_count was inaccurate in two ways:
1. Under-counting: an assistant message with N parallel tool calls
(e.g. "kill the light and shut off the fan" = 2 ha_call_service)
only incremented tool_call_count by 1 instead of N.
2. Over-counting: tool response messages (role=tool) also incremented
tool_call_count, double-counting every tool interaction.
Combined: 2 parallel tool calls produced tool_call_count=3 (1 from
assistant + 2 from tool responses) instead of the correct value of 2.
Fix: only count from assistant messages with tool_calls, incrementing
by len(tool_calls) to handle parallel calls correctly. Tool response
messages no longer affect tool_call_count.
This impacts /insights and /usage accuracy for sessions with tool use.
Two bugs in sync_skills():
1. Failed copytree poisons manifest: when shutil.copytree fails (disk
full, permission error), the skill is still recorded in the manifest.
On the next sync, the skill appears as "in manifest but not on disk"
which is interpreted as "user deliberately deleted it" — the skill
is never retried. Fix: only write to manifest on successful copy.
2. Failed update destroys user copy: rmtree deletes the existing skill
directory before copytree runs. If copytree then fails, the user's
skill is gone with no way to recover. Fix: move to .bak before
copying, restore from backup if copytree fails.
Both bugs are proven by new regression tests that fail on the old code
and pass on the fix.
Upgrade skills_sync manifest to v2 format (name:origin_hash). The origin
hash records the MD5 of the bundled skill at the time it was last synced.
On update, the user's copy is compared against the origin hash:
- User copy == origin hash → unmodified → safe to update from bundled
- User copy != origin hash → user customized → skip (preserve changes)
v1 manifests (plain names) are auto-migrated: the user's current hash
becomes the baseline, so future syncs can detect modifications.
Output now shows user-modified skills:
~ whisper (user-modified, skipping)
27 tests covering all scenarios including v1→v2 migration, user
modification detection, update after migration, and origin hash tracking.
2009 tests pass.
- Restored 21 skills removed in commits 757d012 and 740dd92:
accelerate, audiocraft, code-review, faiss, flash-attention, gguf,
grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft,
pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion,
tensorrt-llm, torchtitan, trl-fine-tuning, whisper
- Rewrote sync_skills() with proper update semantics:
* New skills (not in manifest): copied to user dir
* Existing skills (in manifest + on disk): updated via hash comparison
* User-deleted skills (in manifest, not on disk): respected, not re-added
* Stale manifest entries (removed from bundled): cleaned from manifest
- Added sync_skills() to CLI startup (cmd_chat) and gateway startup
(start_gateway) — previously only ran during 'hermes update'
- Updated cmd_update output to show new/updated/cleaned counts
- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh
install, user deletion respect, update detection, stale cleanup, and
name collision handling
75 bundled skills total. 2002 tests pass.
Issues found and fixed during deep code path review:
1. CRITICAL: Prefix matching returned wrong prices for dated model names
- 'gpt-4o-mini-2024-07-18' matched gpt-4o ($2.50) instead of gpt-4o-mini ($0.15)
- Same for o3-mini→o3 (9x), gpt-4.1-mini→gpt-4.1 (5x), gpt-4.1-nano→gpt-4.1 (20x)
- Fix: use longest-match-wins strategy instead of first-match
- Removed dangerous key.startswith(bare) reverse matching
2. CRITICAL: Top Tools section was empty for CLI sessions
- run_agent.py doesn't set tool_name on tool response messages (pre-existing)
- Insights now also extracts tool names from tool_calls JSON on assistant
messages, which IS populated for all sessions
- Uses max() merge strategy to avoid double-counting between sources
3. SELECT * replaced with explicit column list
- Skips system_prompt and model_config blobs (can be thousands of chars)
- Reduces memory and I/O for large session counts
4. Sets in overview dict converted to sorted lists
- models_with_pricing / models_without_pricing were Python sets
- Sets aren't JSON-serializable — would crash json.dumps()
5. Negative duration guard
- end > start check prevents negative durations from clock drift
6. Model breakdown sort fallback
- When all tokens are 0, now sorts by session count instead of arbitrary order
7. Removed unused timedelta import
Added 6 new tests: dated model pricing (4), tool_calls JSON extraction,
JSON serialization safety. Total: 69 tests.
Custom OAI endpoints, self-hosted models, and local inference should NOT
show fabricated cost estimates. Changed default pricing from $3/$12 per
million tokens to $0/$0 for unrecognized models.
- Added _has_known_pricing() to distinguish commercial vs custom models
- Models with known pricing show $ amounts; unknown models show 'N/A'
- Overview shows asterisk + note when some models lack pricing data
- Gateway format adds '(excludes custom/self-hosted models)' note
- Added 7 new tests for custom model cost handling
Inspired by Claude Code's /insights, adapted for Hermes Agent's multi-platform
architecture. Analyzes session history from state.db to produce comprehensive
usage insights.
Features:
- Overview stats: sessions, messages, tokens, estimated cost, active time
- Model breakdown: per-model sessions, tokens, and cost estimation
- Platform breakdown: CLI vs Telegram vs Discord etc. (unique to Hermes)
- Tool usage ranking: most-used tools with percentages
- Activity patterns: day-of-week chart, peak hours, streaks
- Notable sessions: longest, most messages, most tokens, most tool calls
- Cost estimation: real pricing data for 25+ models (OpenAI, Anthropic,
DeepSeek, Google, Meta) with fuzzy model name matching
- Configurable time window: --days flag (default 30)
- Source filtering: --source flag to filter by platform
Three entry points:
- /insights slash command in CLI (supports --days and --source flags)
- /insights slash command in gateway (compact markdown format)
- hermes insights CLI subcommand (standalone)
Includes 56 tests covering pricing helpers, format helpers, empty DB,
populated DB with multi-platform data, filtering, formatting, and edge cases.
Authored by Farukest. Fixes#432. Extracts _kill_port_process() helper
that uses netstat+taskkill on Windows and fuser on Linux. Previously,
fuser calls were inline with bare except-pass, so on Windows orphaned
bridge processes were never cleaned up — causing 'address already in use'
errors on reconnect. Includes 5 tests covering both platforms, port
matching edge cases, and exception suppression.
Authored by Farukest. Fixes#435. The retry summary in
_handle_max_iterations() hardcoded max_tokens instead of using
_max_tokens_param(), which returns max_completion_tokens for direct
OpenAI API (required by gpt-4o, o-series). The first attempt already
used _max_tokens_param correctly — only the retry path was wrong.
Includes 4 tests for _max_tokens_param provider detection.
Verifies explicit allowlist keys, catch-all _API_KEY/_TOKEN patterns,
case insensitivity, TERMINAL_SSH prefix, and config.yaml routing for
non-secret keys. Covers the fix from PR #469.
The mock handler checked for function_name == 'search' but the RPC
sends 'search_files'. Any test exercising search_files through the
mock would get 'Unknown tool' instead of the canned response.
The _TOOL_STUBS dict in code_execution_tool.py was out of sync with the
actual tool schemas, causing TypeErrors when the LLM used parameters it
sees in its system prompt but the sandbox stubs didn't accept:
search_files:
- Added missing params: context, offset, output_mode
- Fixed target default: 'grep' → 'content' (old value was obsolete)
patch:
- Added missing params: mode, patch (V4A multi-file patch support)
Also added 4 drift-detection tests (TestStubSchemaDrift) that will
catch future divergence between stubs and real schemas:
- test_stubs_cover_all_schema_params: every schema param in stub
- test_stubs_pass_all_params_to_rpc: every stub param sent over RPC
- test_search_files_target_uses_current_values: no obsolete values
- test_generated_module_accepts_all_params: generated code compiles
All 28 tests pass.
Authored by rovle. Adds Daytona as the sixth terminal execution backend
with cloud sandboxes, persistent workspaces, and full CLI/gateway integration.
Includes 24 unit tests and 8 integration tests.
The execute_code sandbox generates a hermes_tools.py stub module for LLM
scripts. Three common failure modes keep tripping up scripts:
1. json.loads(strict=True) rejects control chars in terminal() output
(e.g., GitHub issue bodies with literal tabs/newlines)
2. Shell backtick/quote interpretation when interpolating dynamic content
into terminal() commands (markdown with backticks gets eaten by bash)
3. No retry logic for transient network failures (API timeouts, rate limits)
Adds three convenience helpers to the generated hermes_tools module:
- json_parse(text) — json.loads with strict=False for tolerant parsing
- shell_quote(s) — shlex.quote() for safe shell interpolation
- retry(fn, max_attempts=3, delay=2) — exponential backoff wrapper
Also updates the EXECUTE_CODE_SCHEMA description to document these helpers
so LLMs know they're available without importing anything extra.
Includes 7 new tests (unit + integration) covering all three helpers.
The original implementation only supported xclip (X11), which silently
fails on WSL2 (can't access Windows clipboard for images), Wayland
desktops (xclip is X11-only), and VSCode terminal on WSL2.
Clipboard backend changes (hermes_cli/clipboard.py):
- WSL2: detect via /proc/version, use powershell.exe with .NET
System.Windows.Forms.Clipboard to extract images as base64 PNG
- Wayland: use wl-paste with MIME type detection, auto-convert BMP
to PNG for WSLg environments (via Pillow or ImageMagick)
- Dispatch order: WSL → Wayland → X11 (xclip), with fallthrough
- New has_clipboard_image() for lightweight clipboard checks
- Cache WSL detection result per-process
CLI changes (cli.py):
- /paste command: explicit clipboard image check for terminals where
BracketedPaste doesn't fire (image-only clipboard in VSCode/WinTerm)
- Ctrl+V keybinding: fallback for Linux terminals where Ctrl+V sends
raw byte instead of triggering bracketed paste
Tests: 80 tests (up from 37) covering WSL, Wayland, X11 dispatch,
BMP conversion, has_clipboard_image, and /paste command.
Copy an image to clipboard (screenshot, browser, etc.) and paste into
the Hermes CLI. The image is saved to ~/.hermes/images/, shown as a
badge above the input ([📎 Image #1]), and sent to the model as a
base64-encoded OpenAI vision multimodal content block.
Implementation:
- hermes_cli/clipboard.py: clean module with platform-specific extraction
- macOS: pngpaste (if installed) → osascript fallback (always available)
- Linux: xclip (apt install xclip)
- cli.py: BracketedPaste key handler checks clipboard on every paste,
image bar widget shows attached images, chat() converts to multimodal
content format, Ctrl+C clears attachments
Inspired by @m0at's fork (https://github.com/m0at/hermes-agent) which
implemented image paste support for local vision models. Reimplemented
cleanly as a separate module with tests.
On top of PR #460: self-hosted Firecrawl instances don't require an API
key (USE_DB_AUTHENTICATION=false), so don't force users to set a dummy
FIRECRAWL_API_KEY when FIRECRAWL_API_URL is set. Also adds a proper
self-hosting section to the configuration docs explaining what you get,
what you lose, and how to set it up (Docker stack, tradeoffs vs cloud).
Added 2 more tests (URL-only without key, neither-set raises).
Replaces the unsafe 128K fallback for unknown models with a descending
probe strategy (2M → 1M → 512K → 200K → 128K → 64K → 32K). When a
context-length error occurs, the agent steps down tiers and retries.
The discovered limit is cached per model+provider combo in
~/.hermes/context_length_cache.yaml so subsequent sessions skip probing.
Also parses API error messages to extract the actual context limit
(e.g. 'maximum context length is 32768 tokens') for instant resolution.
The CLI banner now displays the context window size next to the model
name (e.g. 'claude-opus-4 · 200K context · Nous Research').
Changes:
- agent/model_metadata.py: CONTEXT_PROBE_TIERS, persistent cache
(save/load/get), parse_context_limit_from_error(), get_next_probe_tier()
- agent/context_compressor.py: accepts base_url, passes to metadata
- run_agent.py: step-down logic in context error handler, caches on success
- cli.py + hermes_cli/banner.py: context length in welcome banner
- tests: 22 new tests for probing, parsing, and caching
Addresses #132. PR #319's approach (8K default) rejected — too conservative.
Adds optional FIRECRAWL_API_URL environment variable to support
self-hosted Firecrawl deployments alongside the cloud service.
- Add FIRECRAWL_API_URL to optional env vars in hermes_cli/config.py
- Update _get_firecrawl_client() in tools/web_tools.py to accept custom API URL
- Add tests for client initialization with/without URL
- Document new env var in installation and config guides
The Daytona SDK's process.exec(timeout=N) parameter is not enforced —
the server-side timeout never fires and the SDK has no client-side
fallback, causing commands to hang indefinitely.
Fix: wrap commands with timeout N sh -c '...' (coreutils) which
reliably kills the process and returns exit code 124. Added
shlex.quote for proper shell escaping and a secondary deadline (timeout + 10s) that force-stops the sandbox if the shell timeout somehow fails.
Signed-off-by: rovle <lovre.pesut@gmail.com>
state
- Replace logger.warning with warnings.warn for the disk cap so users
actually see it (logger was suppressed by CLI's log level config)
- Use SandboxState enum instead of string literals in
_ensure_sandbox_ready
Signed-off-by: rovle <lovre.pesut@gmail.com>
Authored by 0xbyt4. Wraps commands with unique fence markers to isolate real output
from shell init/exit noise (oh-my-zsh, macOS session restore, etc.). Falls back to
expanded pattern-based cleaning. Also fixes BSD find fallback and test module shadowing.
The retry summary in _handle_max_iterations hardcodes max_tokens instead
of calling _max_tokens_param(). For direct OpenAI API users (gpt-4o,
o-series), the correct parameter name is max_completion_tokens. The first
attempt at line 2697 already uses _max_tokens_param correctly but the
retry path at line 2743 was missed.
fuser command does not exist on Windows, causing orphaned bridge processes
to never be cleaned up. On crash recovery, the port stays occupied and the
next connect() fails with address-already-in-use.
Add _kill_port_process() helper that uses netstat+taskkill on Windows and
fuser on Linux/macOS. Replace both call sites in connect() and disconnect().
Adds a /update command to Telegram, Discord, and other gateway platforms
that runs `hermes update` to pull the latest code, update dependencies,
sync skills, and restart the gateway.
Implementation:
- Spawns `hermes update` in a separate systemd scope (systemd-run --user
--scope) so the process survives the gateway restart that hermes update
triggers at the end. Falls back to nohup if systemd-run is unavailable.
- Writes a marker file (.update_pending.json) with the originating
platform and chat_id before spawning the update.
- On gateway startup, _send_update_notification() checks for the marker,
reads the captured update output, sends the results back to the user,
and cleans up.
Also:
- Registers /update as a Discord slash command
- Updates README.md, docs/messaging.md, docs/slash-commands.md
- Adds 18 tests covering handler, notification, and edge cases
Authored by FarukEst. Fixes#392.
1. Initialize data={} before health-check loop to prevent NameError when
resp.json() raises after http_ready is set to True.
2. Extract _close_bridge_log() helper and call on all return False paths
to prevent file descriptor leaks on failed connection attempts.
Refactors disconnect() to reuse the same helper.
Authored by 0xbyt4.
The italic regex [^*]+ matched across newlines, corrupting bullet lists
using * markers (e.g. '* Item one\n* Item two' became italic garbage).
Fixed by adding \n to the negated character class: [^*\n]+.
Authored by 0xbyt4.
The dedup logic in GitHubSource.search() and unified_search() used
'r.trust_level == "trusted"' which let trusted results overwrite builtin
ones. Now uses ranked comparison: builtin (2) > trusted (1) > community (0).
Authored by 0xbyt4.
Two fixes:
- extract_images(): only remove extracted image tags, not all markdown image
tags. Previously  was silently dropped when real images
were also present.
- truncate_message(): walk chunk_body not full_chunk when tracking code block
state, so the reopened fence prefix doesn't toggle in_code off and leave
continuation chunks with unclosed code blocks.
Authored by 0xbyt4.
144 new tests covering gateway/pairing.py, tools/skill_manager_tool.py,
tools/skills_tool.py, honcho_integration/session.py, and
agent/auxiliary_client.py.
Authored by Farukest. Fixes#389.
Replaces hardcoded forward-slash string checks ('/.git/', '/.hub/') with
Path.parts membership test in _find_all_skills() and scan_skill_commands().
On Windows, str(Path) uses backslashes so the old filter never matched,
causing quarantined skills to appear as installed.
Authored by Farukest. Fixes#387.
Removes 'and not force' from the dangerous verdict check so --force
can never install skills with critical security findings (reverse shells,
data exfiltration, etc). The docstring already documented this behavior
but the code didn't enforce it.
Authored by Farukest. Fixes#385.
Replaces startswith() with Path.is_relative_to() in _check_structure()
symlink escape check — same fix pattern as skill_view() (PR #352).
Prevents symlinks escaping to sibling directories with shared name prefixes.
Fixes a bug where the refresh token was not persisted when the API key
mint failed (e.g., 402 insufficient credits, timeout). The rotated
refresh token was lost, causing subsequent auth attempts to fail with
a stale token.
Changes:
- Persist auth state immediately after each successful token refresh,
before attempting the mint
- Use latest in-memory refresh token on mint-retry paths (was using
the stale original)
- Atomic durable writes for auth.json (temp file + fsync + replace)
- Opt-in OAuth trace logging (HERMES_OAUTH_TRACE=1, fingerprint-only)
- 3 regression tests covering refresh+402, refresh+timeout, and
invalid-token retry behavior
Author: Robin Fernandes <rewbs>
Authored by PercyDikec. Fixes#394.
The transcript extraction used len(history) to find new messages, but
history includes session_meta entries stripped before reaching the agent.
This caused 1 message lost per turn from turn 2 onwards. Fix returns
history_offset (filtered length) from _run_agent and uses it for the slice.
Two fixes for the case where a user switches to a model with a smaller
context window while having a large existing session:
1. Preflight compression in run_conversation(): Before the main loop,
estimate tokens of loaded history + system prompt. If it exceeds the
model's compression threshold (85% of context), compress proactively
with up to 3 passes. This naturally handles model switches because
the gateway creates a fresh AIAgent per message with the current
model's context length.
2. Error handler reordering: Context-length errors (400 with 'maximum
context length' etc.) are now checked BEFORE the generic 4xx handler.
Previously, OpenRouter's 400-status context-length errors were caught
as non-retryable client errors and aborted immediately, never reaching
the compression+retry logic.
Reported by Sonicrida on Discord: 840-message session (2MB+) crashed
after switching from a large-context model to minimax via OpenRouter.
get_definitions() already wrapped check_fn() calls in try/except,
but is_toolset_available() did not. A failing check (network error,
missing import, bad config) would propagate uncaught and crash the
CLI banner, agent startup, and tools-info display.
Now is_toolset_available() catches all exceptions and returns False,
matching the existing pattern in get_definitions().
Added 4 tests covering exception handling in is_toolset_available(),
check_toolset_requirements(), get_definitions(), and
check_tool_availability().
Closes#402
The transcript extraction used len(history) to find new messages, but
history includes session_meta entries that are stripped before passing
to the agent. This mismatch caused 1 message to be lost from the
transcript on every turn after the first, because the slice offset
was too high. Use the filtered history length (history_offset) returned
by _run_agent instead.
Also changed the else branch from returning all agent_messages to
returning an empty list, so compressed/shorter agent output does not
duplicate the entire history into the transcript.
The hidden directory filter used hardcoded forward-slash strings like
'/.git/' and '/.hub/' to exclude internal directories. On Windows,
Path returns backslash-separated strings, so the filter never matched.
This caused quarantined skills in .hub/quarantine/ to appear as
installed skills and available slash commands on Windows.
Replaced string-based checks with Path.parts membership test which
works on both Windows and Unix.
The docstring states --force should never override dangerous verdicts,
but the condition `if result.verdict == "dangerous" and not force`
allowed force=True to skip the early return. Execution then fell
through to `if force: return True`, bypassing the policy block.
Removed `and not force` so dangerous skills are always blocked
regardless of the --force flag.
The symlink escape check in _check_structure() used startswith()
without a trailing separator. A symlink resolving to a sibling
directory with a shared prefix (e.g. 'axolotl-backdoor') would pass
the check for 'axolotl' since the string prefix matched.
Replaced with Path.is_relative_to() which correctly handles directory
boundaries and is consistent with the skill_view path check.
session_search was returning the current session if it matched the
query, which is redundant — the agent already has the current
conversation context. This wasted an LLM summarization call and a
result slot.
Added current_session_id parameter to session_search(). The agent
passes self.session_id and the search filters out any results where
either the raw or parent-resolved session ID matches. Both the raw
match and the parent-resolved match are checked to handle child
sessions from delegation.
Two tests added verifying the exclusion works and that other
sessions are still returned.
Replace the string-based startswith + os.sep approach with
Path.is_relative_to() (Python 3.9+, we require 3.10+). This is
the idiomatic pathlib way to check path containment — it handles
separators, case sensitivity, and the equal-path case natively
without string manipulation.
Simplified tests to match: removed the now-unnecessary
test_separator_is_os_native test since is_relative_to doesn't
depend on separator choice.
The session key construction logic was duplicated in 4 places
(session.py + 3 inline copies in run.py), which is exactly the
kind of drift that caused issue #349 in the first place.
Extracted build_session_key() as a public function in session.py.
SessionStore._generate_session_key() now delegates to it, and all
inline key construction in run.py has been replaced with calls to
the shared function. Tests updated to test the function directly.
The previous implementation used `len(self._entries) > 1` to check if any
sessions had ever been created. This failed for single-platform users because
when sessions reset (via /reset, auto-reset, or gateway restart), the entry
for the same session_key is replaced in _entries, not added. So len(_entries)
stays at 1 for users who only use one platform.
Fix: Query the SQLite database's session count instead. The database preserves
historical session records (marked as ended), so session_count() correctly
returns > 1 for returning users even after resets.
This prevents the agent from reintroducing itself to returning users after
every session reset.
Fixes#351
Improvements to the HA integration merged from PR #184:
- Add ha_list_services tool: discovers available services (actions) per
domain with descriptions and parameter fields. Tells the model what
it can do with each device type (e.g. light.turn_on accepts brightness,
color_name, transition). Closes the gap where the model had to guess
available actions.
- Add HA to hermes tools config: users can enable/disable the homeassistant
toolset and configure HASS_TOKEN + HASS_URL through 'hermes tools' setup
flow instead of manually editing .env.
- Fix should-fix items from code review:
- Remove sys.path.insert hack from gateway adapter
- Replace all print() calls with proper logger (info/warning/error)
- Move env var reads from import-time to handler-time via _get_config()
- Add dedicated REST session reuse in gateway send()
- Update ha_call_service description to reference ha_list_services for
action discovery.
- Update tests for new ha_list_services tool in toolset resolution.
Three section header comments in tests/test_run_agent.py used
'Grup' instead of 'Group':
- Line 124: # Grup 1: Pure Functions
- Line 276: # Grup 2: State / Structure Methods
- Line 572: # Grup 3: Conversation Loop Pieces (OpenAI mock)
Banner integration:
- MCP Servers section in CLI startup banner between Tools and Skills
- Shows each server with transport type, tool count, connection status
- Failed servers shown in red; section hidden when no MCP configured
- Summary line includes MCP server count
- Removed raw print() calls from discovery (banner handles display)
/reload-mcp command:
- New slash command in both CLI and gateway
- Disconnects all MCP servers, re-reads config.yaml, reconnects
- Reports what changed (added/removed/reconnected servers)
- Allows adding/removing MCP servers without restarting
Resources & Prompts support:
- 4 utility tools registered per server: list_resources, read_resource,
list_prompts, get_prompt
- Exposes MCP Resources (data sources) and Prompts (templates) as tools
- Proper parameter schemas (uri for read_resource, name for get_prompt)
- Handles text and binary resource content
- 23 new tests covering schemas, handlers, and registration
Test coverage: 74 MCP tests total, 1186 tests pass overall.
- Discovery is now parallel (asyncio.gather) instead of sequential,
fixing the 60s shared timeout issue with multiple servers
- Startup messages use print() so users see connection status even
with default log levels (the 'tools' logger is set to ERROR)
- Summary line shows total tools and failed servers count
- Validate conflicting config: warn if both 'url' and 'command' are
present (HTTP takes precedence)
- Update TODO.md: mark MCP as implemented, list remaining work
- Add test for conflicting config detection (51 tests total)
All 1163 tests pass.
When both OPENROUTER_API_KEY and OPENAI_API_KEY are set (e.g. OPENAI_API_KEY
in .bashrc), the wrong key was sent to OpenRouter causing auth failures.
Fixed key resolution order in cli.py and runtime_provider.py.
Fixes#289
- Wrap commands with unique fence markers (printf FENCE; cmd; printf FENCE)
to isolate real output from shell init/exit noise (oh-my-zsh, macOS
session restore/save, docker plugin errors, etc.)
- Expand _clean_shell_noise to cover zsh/macOS patterns and strip from
both beginning and end (fallback when fences are missing)
- Fix BSD find compatibility: fallback to simple find when -printf
produces empty output (macOS)
- Fix test_terminal_disk_usage: use sys.modules to get the real module
instead of the shadowed function from tools/__init__.py
- Add 13 new unit tests for fence extraction and zsh noise patterns
- Add threading.Lock protecting all shared state (_servers, _mcp_loop, _mcp_thread)
- Fix deadlock in shutdown_mcp_servers: _stop_mcp_loop was called inside
a _lock block but also acquires _lock (non-reentrant)
- Fix race condition in _ensure_mcp_loop with concurrent callers
- Change idempotency to per-server (retry failed servers, skip connected)
- Dynamic toolset injection via startswith("hermes-") instead of hardcoded list
- Parallel shutdown via asyncio.gather instead of sequential loop
- Add tests for partial failure retry, parallel shutdown, dynamic injection
Patch _servers to empty dict in tests that call discover_mcp_tools()
with mocked config, preventing interference from real MCP connections
that may exist when running within the full test suite.
Refactor MCP connections from AsyncExitStack to task-per-server
architecture. Each server now runs as a long-lived asyncio Task
with `async with stdio_client(...)`, ensuring anyio cancel-scope
cleanup happens in the same Task that opened the connection.
Connect to external MCP servers via stdio transport, discover their tools
at startup, and register them into the hermes-agent tool registry.
- New tools/mcp_tool.py: config loading, server connection via background
event loop, tool handler factories, discovery, and graceful shutdown
- model_tools.py: trigger MCP discovery after built-in tool imports
- cli.py: call shutdown_mcp_servers in _run_cleanup
- pyproject.toml: add mcp>=1.2.0 as optional dependency
- 27 unit tests covering config, schema conversion, handlers, registration,
SDK interaction, toolset injection, graceful fallback, and shutdown
Config format (in ~/.hermes/config.yaml):
mcp_servers:
filesystem:
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
Added a fixture to redirect HERMES_HOME to a temporary directory during tests, preventing writes to the user's home directory. Updated the test for DebugSession to create a dedicated log directory for saving logs, ensuring test isolation and accuracy in assertions.
The TestFlushSentinelNotLeaked test from PR #227 had two issues:
1. flush_memories() uses get_text_auxiliary_client() which could bypass
agent.client entirely — mock it to return (None, None)
2. No assertion that the API was actually called — added guard assert
Without these fixes the test passed vacuously (API never called).
The TestRetryExhaustion tests from PR #223 didn't mock time.sleep/time.time,
causing the retry backoff loops (275s+ total) to run in real time. Tests would
time out instead of running quickly.
Added _make_fast_time_mock() helper that creates a mock time module where
time.time() advances 500s per call (so sleep_end is always in the past) and
time.sleep() is a no-op. Both tests now complete in <1s.
skill_view accepted arbitrary file_path values like '../../.env' and
would read files outside the skill directory, exposing API keys and
other sensitive data.
Added two layers of defense:
1. Reject paths with '..' components (fast, catches obvious traversal)
2. resolve() containment check with trailing '/' to prevent prefix
collisions (catches symlinks and edge cases)
Fix approach from PR #242 (@Bartok9). Vulnerability reported by
@Farukest (#220, PR #221). Tests rewritten to properly mock SKILLS_DIR.
Closes#220
The OpenAI API returns content: null on assistant messages that only
contain tool calls. msg.get('content', '') returns None (not '') when
the key exists with value None, causing TypeError on len() and string
concatenation in _generate_summary and compress.
Fix: msg.get('content') or '' — handles both missing keys and None.
Tests from PR #216 (@Farukest). Fix also in PR #215 (@cutepawss).
Both PRs had stale branches and couldn't be merged directly.
Closes#211
Priority is: CLI arg > config file > env var > default
(not env var > config file as the old comment stated)
The test failed because config.yaml had max_turns at both root level
and inside agent section. The test cleared agent.max_turns but the
root-level value still took precedence over the env var. Fixed the
test to clear both, and corrected the comment to match the intended
priority order.
Tests added:
- Roundtrip serialization of chat_topic via to_dict/from_dict
- chat_topic defaults to None when missing from dict
- Channel Topic line appears in session context prompt when set
- Channel Topic line is omitted when chat_topic is None
Follow-up to PR #248 (feat: Discord channel topic in session context).
Enhanced the AIAgent class to capture and normalize summary information for reasoning items. Implemented logic to handle summaries as lists, ensuring proper formatting for API interactions. Updated tests to validate the inclusion of summaries in reasoning items, both for existing and default cases.
Updated the authentication mechanism to store Codex OAuth tokens in the Hermes auth store located at ~/.hermes/auth.json instead of the previous ~/.codex/auth.json. This change includes refactoring related functions for reading and saving tokens, ensuring better management of authentication states and preventing conflicts between different applications. Adjusted tests to reflect the new storage structure and improved error handling for missing or malformed tokens.
Introduced a new `provider_routing` section in the CLI configuration to control how requests are routed across providers when using OpenRouter. This includes options for sorting providers by throughput, latency, or price, as well as allowing or ignoring specific providers, setting the order of provider attempts, and managing data collection policies. Updated relevant classes and documentation to support these features, enhancing flexibility in provider selection.
Remove loop_scope="function" parameter from async test decorators in
test_hooks.py. This matches the existing convention in the repo
(test_telegram_documents.py) and avoids requiring pytest-asyncio 0.23+.
All 144 new tests from PR #191 now pass.
Block dangerous HA service domains (shell_command, command_line,
python_script, pyscript, hassio, rest_command) that allow arbitrary
code execution or SSRF. Add regex validation for entity_id to prevent
path traversal attacks. 17 new tests covering both security features.
Fixes#241
When users set HONCHO_API_KEY via `hermes config set` or environment
variable, they expect the integration to activate. Previously, the
`enabled` flag defaulted to `false` when reading from global config,
requiring users to also explicitly enable Honcho.
This change auto-enables Honcho when:
- An API key is present (from config file or env var)
- AND `enabled` is not explicitly set to `false` in the config
Users who want to disable Honcho while keeping the API key can still
set `enabled: false` in their config.
Also adds unit tests for the auto-enable behavior.
Two fixes to the subagent progress display from PR #186:
1. Task index prefix: show 1-indexed prefix ([1], [2], ...) for ALL
tasks in batch mode (task_count > 1). Single tasks get no prefix.
Previously task 0 had no prefix while others did, making batch
output confusing.
2. Completion indicator: use spinner.print_above() instead of raw
print() for per-task completion lines (✓ [1/2] ...). Raw print
collided with the active spinner, mushing the completion text
onto the spinner line. Now prints cleanly above.
Added task_count parameter to _build_child_progress_callback and
_run_single_child. Updated tests accordingly.
print_above() used \033[K (erase-to-end-of-line) to clear the spinner
line before printing text above it. This causes garbled escape codes when
prompt_toolkit's patch_stdout is active in CLI mode.
Switched to the same spaces-based clearing approach used by stop() —
overwrite with blanks, then carriage return back to start of line.
Updated test assertion to match the new clearing method.
When subagents run via delegate_task, the user now sees real-time
progress instead of silence:
CLI: tree-view activity lines print above the delegation spinner
🔀 Delegating: research quantum computing
├─ 💭 "I'll search for papers first..."
├─ 🔍 web_search "quantum computing"
├─ 📖 read_file "paper.pdf"
└─ ⠹ working... (18.2s)
Gateway (Telegram/Discord): batched progress summaries sent every
5 tool calls to avoid message spam. Remaining tools flushed on
subagent completion.
Changes:
- agent/display.py: add KawaiiSpinner.print_above() to print
status lines above an active spinner without disrupting animation.
Uses captured stdout (self._out) so it works inside the child's
redirect_stdout(devnull).
- tools/delegate_tool.py: add _build_child_progress_callback()
that creates a per-child callback relaying tool calls and
thinking events to the parent's spinner (CLI) or progress
queue (gateway). Each child gets its own callback instance,
so parallel subagents don't share state. Includes _flush()
for gateway batch completion.
- run_agent.py: fire tool_progress_callback with '_thinking'
event when the model produces text content. Guarded by
_delegate_depth > 0 so only subagents fire this (prevents
gateway spam from main agent). REASONING_SCRATCHPAD/think/
reasoning XML tags are stripped before display.
Tests: 21 new tests covering print_above, callback builder,
thinking relay, SCRATCHPAD filtering, batching, flush, thread
isolation, delegate_depth guard, and prefix handling.
- Introduce a new test suite in `test_file_tools_live.py` to validate file operations and ensure accurate command execution in a real environment.
- Implement assertions to check for shell noise contamination in outputs, enhancing the reliability of command results.
- Create fixtures for setting up a local environment and populating directories with known file contents for comprehensive testing.
- Refactor shell noise handling in `process_registry.py` and `local.py` to support multiple noise patterns, improving output cleanliness.
- Introduce a new test suite for the `redact_sensitive_text` function, covering various sensitive data formats including API keys, tokens, and environment variables.
- Ensure that sensitive information is properly masked in logs and outputs while non-sensitive data remains unchanged.
- Add tests for different scenarios including JSON fields, authorization headers, and environment variable assignments.
- Implement a redacting formatter for logging to enhance security during log output.
- Enhanced Codex model discovery by fetching available models from the API, with fallback to local cache and defaults.
- Updated the context compressor's summary target tokens to 2500 for improved performance.
- Added external credential detection for Codex CLI to streamline authentication.
- Refactored various components to ensure consistent handling of authentication and model selection across the application.
The retry exhaustion checks used > instead of >= to compare
retry_count against max_retries. Since the while loop condition is
retry_count < max_retries, the check retry_count > max_retries can
never be true inside the loop. When retries are exhausted, the loop
exits and falls through to response.choices[0] on an invalid response,
crashing with IndexError instead of returning a proper error.
os.setsid, os.killpg, and os.getpgid do not exist on Windows and raise
AttributeError on import or first call. This breaks the terminal tool,
code execution sandbox, process registry, and WhatsApp bridge on Windows.
Added _IS_WINDOWS platform guard in all four affected files, following
the pattern documented in CONTRIBUTING.md. On Windows, preexec_fn is
set to None and process termination falls back to proc.terminate() /
proc.kill() instead of process group signals.
Files changed:
- tools/environments/local.py (3 call sites)
- tools/process_registry.py (2 call sites)
- tools/code_execution_tool.py (3 call sites)
- gateway/platforms/whatsapp.py (3 call sites)
/retry and /undo set session_entry.conversation_history which does not
exist on SessionEntry. The truncated history was never written to disk,
so the next message reload picked up the full unmodified transcript.
Added SessionStore.rewrite_transcript() that persists changes to both
the JSONL file and SQLite database, and updated both commands to use it.
/reset accessed self.session_store._sessions which does not exist on
SessionStore (the correct attribute is _entries). Also replaced the
hand-coded session key with _generate_session_key() to fix WhatsApp DM
sessions using the wrong key format.
Closes#210
The italic regex \*([^*]+)\* used [^*] which matches newlines, causing
bullet lists with * markers to be incorrectly converted to italic text.
Changed to [^*\n]+ to prevent cross-line matching.
Adds 43 tests for _escape_mdv2 and format_message covering code blocks,
bold/italic, headers, links, mixed formatting, and the regression case.
- unified_search and GitHubSource.search dedup: replace naive
`trust_level == "trusted"` check with ranked comparison so
"builtin" results are never overwritten by "trusted" or "community"
- Add 43 unit tests covering _parse_frontmatter_quick, trust_level_for,
HubLockFile CRUD, TapsManager ops, LobeHub _convert_to_skill_md,
unified_search dedup (with regression test), and append_audit_log
- extract_images: only remove extracted image tags from content, preserve
non-image markdown links (e.g. PDFs) that were previously silently lost
- truncate_message: walk only chunk_body (not prepended prefix) so the
reopened code fence does not toggle in_code off, leaving continuation
chunks with unclosed code blocks
- Add 49 unit tests covering MessageEvent command parsing, extract_images,
extract_media, truncate_message code block handling, and _get_human_delay
Gemini 3 thinking models attach extra_content with thought_signature
to function call responses. This must be echoed back on subsequent
API calls or the server rejects with a 400 error. The assistant
message builder was dropping this field, causing all Gemini 3 Flash/Pro
tool-calling flows to fail after the first function call.
- Auto-authorize HA events in gateway (system-generated, not user messages)
- Guard _read_events against None/closed WebSocket after failed reconnect
- Use UUID for send() message_id instead of polluting WS sequence counter
- entity_id parameter now takes precedence over data["entity_id"]
- Add ha_list_entities, ha_get_state, ha_call_service tools via REST API
- Add WebSocket gateway adapter for real-time state_changed event monitoring
- Support domain/entity filtering, cooldown, and auto-reconnect with backoff
- Use REST API for outbound notifications to avoid WS race condition
- Gate tool availability on HASS_TOKEN env var
- Add 82 unit tests covering real logic (filtering, payload building, event pipeline)
Fixes#160
The issue was that MEDIA tags were being extracted from ALL messages
in the conversation history, not just messages from the current turn.
This caused TTS voice messages generated in earlier turns to be
re-attached to every subsequent reply.
The fix:
- Track history_len before calling run_conversation
- Only scan messages AFTER history_len for MEDIA tags
- Add comprehensive tests to prevent regression
This ensures each voice message is sent exactly once, when it's
generated, not on every subsequent message in the session.
The 413 "Request Entity Too Large" error from the LLM API was caught by the
generic 4xx handler which aborts immediately. This is wrong for 413 — it's a
payload-size issue that can be resolved by compressing conversation history.
- Intercept 413 before the generic 4xx block and route to _compress_context
- Exclude 413 from generic is_client_error detection
- Add 'request entity too large' to context-length phrases as safety net
- Add tests for 413 compression behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sanitize filenames in cache_document_from_bytes to prevent path traversal (strip directory components, null bytes, resolve check)
- Reject documents with None file_size instead of silently allowing download
- Cap text file injection at 100 KB to prevent oversized prompt payloads
- Sanitize display_name in run.py context notes to block prompt injection via filenames
- Add 35 unit tests covering document cache utilities and Telegram document handling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 15 new tests in two classes:
- TestRmFalsePositiveFix (8 tests): verify filenames starting with 'r'
(readme.txt, requirements.txt, report.csv, etc.) are NOT falsely
flagged as 'recursive delete'
- TestRmRecursiveFlagVariants (7 tests): verify all recursive delete
flag styles (-r, -rf, -rfv, -fr, -irf, --recursive, sudo rm -rf)
are still correctly caught
All 29 tests pass (14 existing + 15 new).
The regex `ignore\s+(previous|all|above|prior)\s+instructions` only
allowed ONE word between "ignore" and "instructions". Multi-word
variants like "Ignore ALL prior instructions" bypassed the scanner
because "ALL" matched the alternation but then `\s+instructions`
failed to match "prior".
Fix: use `(?:\w+\s+)*` groups to allow optional extra words before
and after the keyword alternation.
Cover model_tools, toolset_distributions, context_compressor,
prompt_caching, cronjob_tools, session_search, process_registry,
and cron/scheduler with 127 new test cases.
These tests documented the macOS symlink bypass bug with
platform-conditional assertions. The fix and proper regression
tests are in PR #61 (tests/tools/test_write_deny.py), so remove
them here to avoid ordering conflicts between the two PRs.
On macOS, /etc is a symlink to /private/etc. The _is_write_denied()
function resolves the input path with os.path.realpath() but the deny
list entries were stored as literal strings ("/etc/shadow"). This meant
the resolved path "/private/etc/shadow" never matched, allowing writes
to sensitive system files on macOS.
Fix: Apply os.path.realpath() to deny list entries at module load time
so both sides of the comparison use resolved paths.
Adds 19 regression tests in tests/tools/test_write_deny.py.
- Renamed test method for clarity and added comprehensive tests for `SessionSource` including handling of numeric `chat_id`, missing optional fields, and invalid platforms.
- Introduced tests for session source descriptions based on chat types and names, ensuring accurate representation in prompts.
- Improved file tools tests by validating schema structures, ensuring no duplicate model IDs, and enhancing error handling in file operations.
- Introduced a new `_cleanup_test_artifacts` function to remove test-generated files and directories after test execution.
- Integrated the cleanup function into the `test_current_implementation` and `test_interruption_and_resume` tests to ensure proper resource management and prevent clutter from leftover files.
- Replaced the Nous API key check with the Auxiliary Model check in the WebToolsTester class.
- Updated the environment configuration to reflect the change in API key validation, ensuring accurate reporting of available keys.
- Introduced a shared interrupt signaling mechanism to allow tools to check for user interrupts during long-running operations.
- Updated the AIAgent to handle interrupts more effectively, ensuring in-progress tool calls are canceled and multiple interrupt messages are combined into one prompt.
- Enhanced the CLI configuration to include container resource limits (CPU, memory, disk) and persistence options for Docker, Singularity, and Modal environments.
- Improved documentation to clarify interrupt behaviors and container resource settings, providing users with better guidance on configuration and usage.
- Deleted test scripts for Nous API limits, patterns, and temperature checks to streamline the testing suite.
- These scripts were no longer necessary and their removal helps maintain a cleaner codebase.
- Introduced the `delegate_task` tool, allowing the main agent to spawn child AIAgent instances with isolated context for complex tasks.
- Supported both single-task and batch processing (up to 3 concurrent tasks) to enhance task management capabilities.
- Updated configuration options for delegation, including maximum iterations and default toolsets for subagents.
- Enhanced documentation to provide clear guidance on using the delegation feature and its configuration.
- Added comprehensive tests to ensure the functionality and reliability of the delegation logic.
- Introduced a new `execute_code` tool that allows the agent to run Python scripts that call Hermes tools via RPC, reducing the number of round trips required for tool interactions.
- Added configuration options for timeout and maximum tool calls in the sandbox environment.
- Updated the toolset definitions to include the new code execution capabilities, ensuring integration across platforms.
- Implemented comprehensive tests for the code execution sandbox, covering various scenarios including tool call limits and error handling.
- Enhanced the CLI and documentation to reflect the new functionality, providing users with clear guidance on using the code execution tool.
- Introduced new browser automation tools in `browser_tool.py` for navigating, interacting with, and extracting content from web pages using the agent-browser CLI and Browserbase cloud execution.
- Updated `.env.example` to include new configuration options for Browserbase API keys and session settings.
- Enhanced `model_tools.py` and `toolsets.py` to integrate browser tools into the existing tool framework, ensuring consistent access across toolsets.
- Updated `README.md` with setup instructions for browser tools and their usage examples.
- Added new test script `test_modal_terminal.py` to validate Modal terminal backend functionality.
- Improved `run_agent.py` to support browser tool integration and logging enhancements for better tracking of API responses.