LlamaMan 0.8.6 - What's New

👉 Introductory article

LlamaMan is a self-hosted web UI for managing llama.cpp instances. Point it at a directory of GGUF files, launch models with full control over GPU layer offload and context size, and get an Ollama-compatible API on port 42069 (nice) that Open WebUI or any other client can talk to. Think Ollama but without the mystery decisions.

Five changes in this release. The big one is proxy sampling overrides - the rest fill in gaps that were annoying to work around.

Proxy Sampling Parameter Overrides

The Ollama-compatible proxy now lets you lock down the sampling parameters it forwards to llama.cpp. When you enable overrides on an instance, the proxy ignores whatever temperature, top_k, top_p, and presence_penalty the client sends and substitutes your configured values instead.

Useful when you have a client (Open WebUI, a custom agent, whatever) that hardcodes its own sampling params and you want the model to behave consistently regardless. Set it once in the instance config, forget about it.

Supported parameters:

Temperature - capped at 2.0
Top-K - integer, must be >= 0
Top-P - must be > 0 and <= 1
Presence Penalty - range -2.0 to 2.0

All bounds are validated and return a clear error if you're out of range.
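To make the bounds above concrete, here is a minimal sketch of how validation and substitution could look. It is not LlamaMan's actual code: the function names are hypothetical, the 0.0 lower bound for temperature is an assumption (the list above only states the 2.0 cap), and so is passing a client value through when no override is configured for that key.

```python
from typing import Any, Dict


def validate_overrides(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of the overrides, or raise ValueError."""
    clean: Dict[str, Any] = {}
    if "temperature" in overrides:
        t = float(overrides["temperature"])
        # Lower bound of 0.0 is an assumption; only the 2.0 cap is documented.
        if not 0.0 <= t <= 2.0:
            raise ValueError("temperature must be between 0.0 and 2.0")
        clean["temperature"] = t
    if "top_k" in overrides:
        k = overrides["top_k"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(k, bool) or not isinstance(k, int) or k < 0:
            raise ValueError("top_k must be an integer >= 0")
        clean["top_k"] = k
    if "top_p" in overrides:
        p = float(overrides["top_p"])
        if not 0.0 < p <= 1.0:
            raise ValueError("top_p must be > 0 and <= 1")
        clean["top_p"] = p
    if "presence_penalty" in overrides:
        pp = float(overrides["presence_penalty"])
        if not -2.0 <= pp <= 2.0:
            raise ValueError("presence_penalty must be between -2.0 and 2.0")
        clean["presence_penalty"] = pp
    return clean


def apply_overrides(client_options: Dict[str, Any],
                    overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Replace client-sent sampling values with the configured overrides.

    Keys without a configured override keep the client's value
    (an assumption of this sketch).
    """
    merged = dict(client_options)
    merged.update(validate_overrides(overrides))
    return merged
```

With this sketch, a client that sends temperature 1.5 to an instance configured with temperature 0.2 ends up with 0.2 forwarded to llama.cpp, while unrelated options such as a seed pass through untouched.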

Download Source Tracking

LlamaMan now records which HuggingFace repo a model was downloaded from. When you pull a GGUF from HuggingFace, the repo ID gets stored in settings against the download path. This is groundwork for features that need to know where a file came from - re-download, update checks, that kind of thing.

No UI change yet. It all happens in the background.
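As a rough illustration of the idea, the sketch below stores a repo ID against a download path in a settings dictionary. The `download_sources` key, the function names, and the example repo ID are all hypothetical. The real settings layout may look nothing like this.

```python
from typing import Any, Dict, Optional


def record_download_source(settings: Dict[str, Any],
                           download_path: str,
                           repo_id: str) -> None:
    """Remember which HuggingFace repo a downloaded file came from."""
    # "download_sources" is a made-up key for this sketch.
    settings.setdefault("download_sources", {})[download_path] = repo_id


def lookup_download_source(settings: Dict[str, Any],
                           download_path: str) -> Optional[str]:
    """Return the recorded repo ID, or None if the file was never tracked."""
    return settings.get("download_sources", {}).get(download_path)


settings: Dict[str, Any] = {}
record_download_source(settings, "/models/example.gguf",
                       "example-org/example-model-GGUF")
```

A later feature such as re-download or an update check would only need the lookup half: given a path on disk, find the repo it came from.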

Failed Download Auto-Retry

Downloads that fail mid-way now have an auto-retry mechanism. Toggle it on in Settings >> Downloads. Default retry count is 3, configurable to whatever you want (minimum 1).

When a download fails, the background monitor checks if auto-retry is enabled and whether the attempt limit has been reached. If there's headroom, it restarts the download and increments the attempt counter. If it hits the limit, it marks the download as failed as before.

Manual retry via the UI still works the same - it resets the counter and starts fresh.
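The retry decision described above can be sketched in a few lines. The `Download` class, its fields, and the function names are hypothetical stand-ins, and the sketch only flips a status flag where a real monitor would restart the transfer.

```python
from dataclasses import dataclass


@dataclass
class Download:
    path: str
    status: str = "downloading"  # "downloading" or "failed"
    attempts: int = 0            # automatic retries used so far


def handle_failure(download: Download,
                   auto_retry: bool,
                   max_retries: int = 3) -> bool:
    """Decide what to do with a failed download.

    Returns True if the download was restarted, False if it was
    marked as failed.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    if auto_retry and download.attempts < max_retries:
        download.attempts += 1
        download.status = "downloading"  # a real monitor restarts the transfer here
        return True
    download.status = "failed"
    return False


def manual_retry(download: Download) -> None:
    """Manual retry resets the counter and starts fresh."""
    download.attempts = 0
    download.status = "downloading"
```

With the default limit of 3, a download that keeps failing is restarted three times and marked as failed on the fourth failure. A manual retry after that puts the counter back at zero.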

Models JSON Export Includes Presets

The "Download Stored Models" button in the system panel now fetches preset configs alongside model metadata and bundles them into the export. If a model has a saved preset, it's included inline in the JSON. Useful if you're migrating setups or just want a snapshot of how everything is configured.
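For a sense of what "included inline" could mean, here is a small sketch that merges presets into model entries before serializing. The field names (`path`, `preset`, `n_gpu_layers`, `ctx_size`) and the top-level `models` key are assumptions for illustration, not the actual export schema.

```python
import json
from typing import Any, Dict, List


def build_export(models: List[Dict[str, Any]],
                 presets: Dict[str, Dict[str, Any]]) -> str:
    """Bundle model metadata and any saved presets into one JSON document."""
    entries = []
    for model in models:
        entry = dict(model)  # copy so the caller's data is not mutated
        preset = presets.get(model["path"])
        if preset is not None:
            entry["preset"] = preset  # inline the preset next to the metadata
        entries.append(entry)
    return json.dumps({"models": entries}, indent=2)


models = [{"path": "/models/a.gguf"}, {"path": "/models/b.gguf"}]
presets = {"/models/a.gguf": {"n_gpu_layers": 35, "ctx_size": 8192}}
export_json = build_export(models, presets)
```

Models without a saved preset simply have no `preset` key in this sketch, which keeps the export easy to diff between snapshots.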

GPU Polling Every 10 Seconds

GPU VRAM stats were refreshing every 15 seconds. They now refresh every 10 seconds, matching the system info polling interval. It's a small change, but if you're watching VRAM fill up while loading a model, it's noticeably more responsive.

No breaking changes, no migrations. Pull the latest image and you're done.

GitHub: github.com/nullata/llamaman
Docker Hub: hub.docker.com/r/nullata/llamaman

Live long and prosper. 🖖👽
