LlamaMan

LlamaMan

A web UI for launching, monitoring, and managing multiple llama.cpp servers

Docker JavaScript llama.cpp Python

LlamaMan is a browser-based UI for launching, monitoring, and managing multiple llama.cpp server instances from inside a Docker container. Includes an Ollama-compatible API proxy so it works as a drop-in replacement for Ollama with Open WebUI.

Setup now via DockerHub

Features

Launch and manage multiple `llama.cpp` server instances from a clean browser dashboard.


Run models in isolated Docker containers with configurable GPU layers, context size, CPU threads, memory limits, parallel slots, GPU selection, and extra server flags.


Use LlamaMan as a drop-in Ollama-compatible backend for Open WebUI and other Ollama clients.


Serve OpenAI-compatible chat completions, model listings, and embeddings from local GGUF models.


Auto-start models on demand when requests arrive, using saved presets or sensible defaults.


Keep GPU memory under control with idle sleep, wake-on-request, and configurable model eviction limits.


Monitor every running model with live status, logs, load time, tokens per second, time to first token, request counts, CPU/RAM usage, and GPU assignment.


View native GPU telemetry for NVIDIA, AMD, and Intel Arc, including VRAM, utilization, and temperature.


Download GGUF models directly from Hugging Face with progress tracking, pause/resume, retry, cancellation, saved tokens, and speed limits.


Save per-model launch presets, notes, favorites, and proxy-side sampling overrides.


Protect the dashboard and model APIs with first-run admin setup, session login, API keys, and optional bearer-token enforcement for all serving endpoints.