LlamaMan - Somebody Had to Do It

Aside from the business projects that run on commercial VPS instances with GPUs (inference via vLLM and LiteLLM), I also run a home lab on Unraid with a 4060 Ti.

Since it's Unraid, I used a dockerized Ollama instance for quite a while to run experiments and develop my own agents. Partly that was for the sake of understanding how AI agents actually work, and partly it was the major side-quest of getting open-weight models to run reliable, quality AI agents without degradation. More on that in the future.

One day, my ISP decided to have a seizure and cut off access to my home network. Spoilers - I don't have a backup connection at home (yet). Coincidentally, that very same day I had to do a live presentation for one of my upcoming projects that uses a lightweight LLM. So, well, WTF do we do now? The ISP said they'd turn the connection back on by the end of the day, but that's WAAAY too long for me to wait. So I powered on my desktop PC, which runs a 5950X and an RTX 2070 mind you, quickly downloaded LM Studio, grabbed the model I needed in GGUF, and tested it with a single 'hi' prompt.

I hadn't run an LLM on that PC before. Once I saw that it worked, I quickly switched out the API settings in my demo app, shut the PC off, grabbed it and headed out for the presentation. Ignoring the awkwardness of me entering the room with a PC under my arm and looking for the right HDMI cable to plug in, when the demo started up I had a moment where I was like "HUH. This thing runs almost as fast as the 4060. HUH."

So a few days after that I decided, "you know what? I'm going to run llama.cpp in a container on the Unraid server and compare performance." It was immediately obvious that the models I ran on plain llama.cpp were nearly 2x faster than on Ollama. How is that even possible? Now, don't get me wrong - I'm not writing this to intentionally sh💩t on Ollama - it's a great entry-level tool for self-hosted LLMs. But it's not great by a long shot.

So I was like:

  1. WTF is Ollama doing exactly to slow things down so much?
  2. How is there no other project to handle llama.cpp instances in a simple intuitive way?

I genuinely spent a week searching for someone who'd already built a clean web UI to manage llama.cpp instances in Docker. Nobody had. So I did.

However, I do also want to give a shoutout to llama-swap. It's a cool project, but it's not the kind of straightforward I like.

Roughly four Sundays and about 75% vibe coding later - LlamaMan was born (stands for Llama Manager - but let's embrace the cringe meme here). The gigachad llama logo was non-negotiable from day one, and yes - the Ollama-compatible proxy runs on port 42069 (nice). It's a serious project, I promise.


Why llama.cpp and Not vLLM?

I originally considered building this around vLLM. It's fast, it's well-maintained, but then I just couldn't let go of one feature that llama.cpp-based projects have: system memory offload.

llama.cpp lets you partially offload models to your system RAM. That means you can run a 70B model on a GPU that has no business running a 70B model - you offload what fits to the GPU and let the rest spill into RAM. Is it as fast as full GPU offload? Nope. But it works, and for a self-hosted setup where you're juggling multiple models on consumer hardware, it's the single best feature in the ecosystem.
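To make the trade-off a bit more concrete, here's a back-of-the-envelope sketch in Python for guessing how many layers might fit on the GPU. It assumes, purely for illustration, that every layer is the same size and that you hold back a fixed chunk of VRAM for the KV cache and runtime overhead. The function name and the numbers are made up; real models will deviate, so treat the result as a starting point for experimenting.

```python
def estimate_gpu_layers(model_bytes: int, n_layers: int,
                        free_vram_bytes: int, reserve_bytes: int) -> int:
    """Rough guess at how many layers fit in VRAM.

    Assumes equally sized layers, which is only approximately true.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    per_layer = model_bytes / n_layers
    usable = max(free_vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))


GIB = 1024 ** 3
# Hypothetical numbers: a 40 GiB quantized file with 80 layers,
# 16 GiB of free VRAM, 2 GiB held back for KV cache and overhead.
layers = estimate_gpu_layers(40 * GIB, 80, 16 * GIB, 2 * GIB)
print(layers)  # everything beyond this many layers would stay in system RAM
```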

vLLM doesn't do that. vLLM wants your model to fit in VRAM or it wants nothing to do with you. And sure - for a production deployment serving thousands of requests, that makes sense. But for a homelab? For someone running models on a single GPU and wanting to experiment with different sizes? System memory offload is a lifesaver.

That said - I'd be interested in making a vLLM branch of this project at some point. vLLM is usually for production deployments where people have highly automated workflows, so it's not on my priority list. But if anyone is interested, hit me up on the site, YouTube, or X and I'll consider bumping it up.


The Features

Here's what LlamaMan actually does. I'll keep it brief because the README covers the details - this is the highlight reel.

Model Library

Point it at a directory full of GGUF files and it scans everything automatically. Shows you the quant type, file size, and reads the GGUF metadata to auto-detect the layer count. So when you're deciding how many layers to offload to your GPU, you actually know what you're working with. Having control over your own hardware - what a novel concept in 2026.
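For the curious, here's roughly what "reading the GGUF metadata" can look like. This is a minimal sketch based on the public GGUF layout (magic bytes, version, tensor count, then key/value pairs), not LlamaMan's actual code. It assumes GGUF v2 or later and that the layer count lives under a key ending in `.block_count`, which is the convention llama.cpp uses (for example `llama.block_count`). The helper names are hypothetical.

```python
import struct
from typing import BinaryIO, Optional

# Scalar value type ids from the GGUF layout, mapped to struct format codes.
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, fmt: str):
    """Read one little-endian value described by a struct format code."""
    size = struct.calcsize("<" + fmt)
    return struct.unpack("<" + fmt, f.read(size))[0]


def _read_string(f: BinaryIO) -> str:
    length = _read(f, "Q")  # strings are prefixed with a uint64 length
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f: BinaryIO, vtype: int):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_layer_count(f: BinaryIO) -> Optional[int]:
    """Return the '<arch>.block_count' metadata value, or None if absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "I")
    if version < 2:
        raise ValueError("this sketch only handles GGUF v2 and later")
    _read(f, "Q")  # tensor count, not needed here
    kv_count = _read(f, "Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "I"))
        if key.endswith(".block_count"):
            return int(value)
    return None
```

You'd call it with an open binary file, e.g. `with open(path, "rb") as f: read_layer_count(f)`. It stops at the first matching key, so it usually never has to walk the big tokenizer arrays.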

One-Click Launch with Full Control

Select a model, configure your GPU layers, context size, threads, parallel sequences, and hit launch. You get a running llama-server instance on its own port with an OpenAI-compatible API. You know - the stuff you'd expect from a management UI.
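Under the hood, a launch like that boils down to starting `llama-server` with the right flags. Here's a hedged sketch of how the settings above could map onto the command line. The flags are standard llama-server options; the function name, the model path, and the choice to bind to 0.0.0.0 are my own illustration, not necessarily what LlamaMan does internally.

```python
def build_llama_server_cmd(model_path: str, port: int, gpu_layers: int,
                           ctx_size: int, threads: int, parallel: int) -> list:
    """Translate UI-style launch settings into a llama-server argument list."""
    return [
        "llama-server",
        "--model", model_path,
        "--host", "0.0.0.0",                 # assumption: reachable from other containers
        "--port", str(port),
        "--n-gpu-layers", str(gpu_layers),   # layers offloaded to the GPU
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
        "--parallel", str(parallel),         # number of parallel sequences
    ]


# Hypothetical settings for a model stored under /models.
cmd = build_llama_server_cmd("/models/example-7b.Q4_K_M.gguf", 8000,
                             gpu_layers=33, ctx_size=8192, threads=8, parallel=2)
print(" ".join(cmd))
```

Passing the list to `subprocess.Popen(cmd)` would start the instance; building it as a list rather than a string avoids shell-quoting surprises with odd file names.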

The key difference from a certain other tool 😉: you control the GPU layer offload. All of it. You decide how many layers go to GPU and how many stay in RAM. No mysterious decisions made on your behalf where a model that easily fits in your VRAM gets half-offloaded to CPU for absolutely no discernible reason.

Preset Configs

Save your launch settings per model. Next time you want to spin up that model, everything is pre-filled. These presets are also what the proxy uses when auto-starting models, so it's not just convenience - it's how the system remembers your preferences.
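A preset is really just a small record keyed by model name. Here's a minimal sketch of the idea using a JSON file; the field names and file layout are assumptions for illustration, not LlamaMan's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LaunchPreset:
    gpu_layers: int
    ctx_size: int
    threads: int
    parallel: int


def save_preset(path: Path, model_name: str, preset: LaunchPreset) -> None:
    """Store (or overwrite) the preset for one model in a shared JSON file."""
    data = json.loads(path.read_text()) if path.exists() else {}
    data[model_name] = asdict(preset)
    path.write_text(json.dumps(data, indent=2))


def load_preset(path: Path, model_name: str) -> Optional[LaunchPreset]:
    """Return the saved preset for a model, or None if there isn't one."""
    if not path.exists():
        return None
    entry = json.loads(path.read_text()).get(model_name)
    return LaunchPreset(**entry) if entry else None
```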

Download Manager

Pull models straight from HuggingFace without leaving the UI. It supports global and per-download speed throttling too.

GPU VRAM Monitoring

Real-time per-GPU VRAM and core usage bars. Works with NVIDIA (via nvidia-smi) and AMD (via rocm-smi, though the AMD side is experimental and untested). You can see exactly how much VRAM each GPU has free before you launch another instance. Useful information to have when you're about to load a 13B model and you're not sure if it'll fit alongside the 7B that's already running.
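If you want to poke at the same numbers yourself, the NVIDIA side can be sketched in a few lines of Python. This uses nvidia-smi's CSV query mode; the function names and the dict layout are hypothetical, and the sketch assumes every field comes back as a plain number (some GPUs report `[N/A]` for certain fields, which this doesn't handle).

```python
import subprocess

QUERY_FIELDS = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text: str) -> list:
    """Parse 'csv,noheader,nounits' output into one dict per GPU (MiB and %)."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = (field.strip() for field in line.split(","))
        stats.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
            "core_percent": int(util),
        })
    return stats


def query_gpus() -> list:
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it easy to test without a GPU in the machine.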

Idle Timeout & Auto-Sleep

Set an idle timeout per instance. After N minutes of no requests, the server stops and frees your VRAM. Next request that comes in? It auto-relaunches on the same port with the same config. Your client doesn't know anything happened - it just sees a slightly longer response time on that first request. It's like the model took a nap and woke up fresh.
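The bookkeeping behind an idle timeout is pleasantly boring. Here's a minimal sketch of the idea, assuming a per-instance tracker that gets "touched" on every request and is polled by some background loop; the class name and the injectable clock are my own illustration.

```python
import time


class IdleTracker:
    """Tracks the time since the last request for one server instance."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock              # injectable so it can be tested
        self._last_request = clock()

    def touch(self) -> None:
        """Call this whenever a request hits the instance."""
        self._last_request = self._clock()

    def is_idle(self) -> bool:
        """True once no request has arrived for the configured timeout."""
        return self._clock() - self._last_request >= self.timeout_seconds
```

A supervisor loop would stop the server when `is_idle()` turns true, and the proxy would relaunch it (and call `touch()`) when the next request shows up.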

Ollama-Compatible Proxy

This is the big one. LlamaMan exposes an Ollama-compatible API on port 42069 (nice). Point Open WebUI at it and everything just works - model discovery, auto-start on demand, chat completions, the whole deal.

It also does LRU model eviction - set a max number of concurrent models and when you request a new one, the least recently used one gets stopped to free up resources. Embedding models are excluded from this limit, so your embedding model stays loaded permanently.
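LRU eviction with an exemption for embedding models can be sketched with an `OrderedDict`. This is an illustration of the policy described above, not the project's real code; the class and method names are hypothetical.

```python
from collections import OrderedDict


class ModelPool:
    """Keeps at most `max_models` non-embedding models loaded, evicting LRU."""

    def __init__(self, max_models: int):
        if max_models < 1:
            raise ValueError("max_models must be at least 1")
        self.max_models = max_models
        self._models = OrderedDict()   # least recently used first
        self._embedding = set()        # never counted, never evicted

    def request(self, name: str, is_embedding: bool = False) -> list:
        """Record a request for `name`; return the models that should be stopped."""
        if is_embedding:
            self._embedding.add(name)
            return []
        if name in self._models:
            self._models.move_to_end(name)   # now the most recently used
            return []
        evicted = []
        while len(self._models) >= self.max_models:
            oldest, _ = self._models.popitem(last=False)
            evicted.append(oldest)
        self._models[name] = True
        return evicted

    def loaded(self) -> list:
        return list(self._models)
```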

Concurrency Gating & Shared Queues

Set a max concurrent request limit per instance. Excess requests queue up in a FIFO queue instead of hammering your server. If multiple instances of the same model are running, they can share a queue for basic load balancing. Requests beyond the queue depth get a clean HTTP 429 instead of a mysterious timeout.
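The gating logic is easier to see in code than in prose. Below is a deliberately simplified, single-threaded sketch: a slot counter plus a bounded FIFO queue, with "rejected" standing in for the HTTP 429 response. The names and the string return values are assumptions for illustration; a real server would wrap this in locks or async primitives.

```python
from collections import deque
from typing import Optional


class ConcurrencyGate:
    """Admits up to `max_concurrent` requests and queues up to `max_queue` more."""

    def __init__(self, max_concurrent: int, max_queue: int):
        self.max_concurrent = max_concurrent
        self.max_queue = max_queue
        self.active = 0
        self.waiting = deque()   # FIFO: oldest queued request first

    def submit(self, request_id: str) -> str:
        if self.active < self.max_concurrent:
            self.active += 1
            return "running"
        if len(self.waiting) < self.max_queue:
            self.waiting.append(request_id)
            return "queued"
        return "rejected"        # the caller would answer with HTTP 429

    def finish(self) -> Optional[str]:
        """Mark one running request done; return the queued request that takes its slot."""
        if self.active == 0:
            raise RuntimeError("finish() called with no running request")
        if self.waiting:
            return self.waiting.popleft()   # slot is handed over, active unchanged
        self.active -= 1
        return None
```

Sharing one `ConcurrencyGate` between several instances of the same model, with `max_concurrent` set to their combined capacity, is one way to get the shared-queue behavior.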

Authentication

Two-layer auth: session-based login for the browser UI, and bearer token API keys for external tools. There's a "require auth" toggle that controls whether model endpoints need authentication or stay open. When it's on, all three port surfaces are protected - the management UI, the Ollama proxy, and every individual instance port.
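The API-key half of that can be sketched as a small check on the `Authorization` header. This is a generic bearer-token check, assuming keys are kept as plain strings in memory and that the "require auth" toggle simply short-circuits the check; the function name is hypothetical and the real project may store and compare keys differently.

```python
import hmac
from typing import Iterable, Optional


def is_authorized(authorization_header: Optional[str],
                  valid_keys: Iterable[str],
                  require_auth: bool = True) -> bool:
    """Return True if the request may proceed."""
    if not require_auth:
        return True                      # toggle off: endpoints stay open
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(hmac.compare_digest(token.encode(), key.encode())
               for key in valid_keys)
```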

"But WHY does this thing need authentication as well?" 🤓 - I have a cat. 🐈

Storage Backends

JSON files by default - zero config, just works (just remember to attach a volume for that). Or MariaDB/MySQL if you want a proper database. Tables auto-create on first connection.


Getting It Running

You need Docker and an NVIDIA GPU with the Container Toolkit installed. That's it.

docker compose up --build

  • Management UI: http://localhost:5000
  • Ollama-compatible proxy: http://localhost:42069 # nice
  • Instance ports: 8000-8020

On first launch, hit the UI and create an admin account at /setup. From there, you're off to the races - select a model, configure your launch settings, and start chatting.

For Open WebUI integration:

open-webui:
  environment:
    - OLLAMA_BASE_URL=http://llamaman:42069 # nice

If you have require_auth enabled (it is by default), create an API key in the LlamaMan UI and pass it to Open WebUI:

    - OPENAI_API_KEYS=llm-your-api-key-here
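If you'd rather sanity-check the proxy from a script before wiring up Open WebUI, something like the sketch below should do. It assumes the proxy answers Ollama's model-listing endpoint (`GET /api/tags`) and accepts the API key as a bearer token, both of which follow from the description above but aren't spelled out in this article; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_tags_request(base_url: str,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the Ollama-style model list."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/tags")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


def list_models(base_url: str, api_key: Optional[str] = None) -> list:
    """Return the model names the proxy reports (needs the proxy to be running)."""
    request = build_tags_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [model["name"] for model in payload.get("models", [])]
```

With the stack up, `list_models("http://localhost:42069", "llm-your-api-key-here")` would be the call to try.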

What's Next

Despite the whole "four Sundays and vibe coding" origin story, this project has legs. Further down the line, I'd like to add some cool built-in features that most other readily available self-hosted LLM applications don't have. But that's for another time - when we are just slightly older and wiser.


Source and Image

Licensed under the Elastic License 2.0.

If you want to support this project, subscribing to the YouTube channel goes a long way.


Live long and prosper. 🖖👽
