Local-first AI memory.
Hybrid when you need it.
UPtrim is a reverse proxy that sits on your machine, between your chat app and every LLM you use. It makes local LLMs actually useful by giving them a proper memory — and then teaches them to collaborate with cloud models like Claude and GPT without ever letting your context leave your box.
Any LLM. Any frontend.
Your memory in the middle.
UPtrim speaks OpenAI-compatible on both sides. Plug in Ollama or llama.cpp on one end, Open WebUI or SillyTavern on the other — or skip the whole stack and just use our bundled chat client. Memory, identity, and routing follow you everywhere.
- Ollama
- llama.cpp
- LM Studio
- vLLM · TGI
- Anthropic (Claude)
- OpenAI (GPT-5)
- OpenRouter
- Groq · Together · Fireworks
If it speaks /v1/chat/completions, it works. Just add it to the config file.
- UPtrim Chat
- Open WebUI
- SillyTavern
- LibreChat · LobeChat
- Continue.dev
- Cline · Aider
- BoltAI · Msty
- Cursor (custom endpoint)
Point it at localhost:9099 — and go.
No frontend? No problem.
Bundled chat UI lives at localhost:9099. Clean, touch-friendly, searchable. Use your local Llama or Qwen like your own private ChatGPT — with memory, file upload, and multi-user accounts. Zero other apps required.
One memory across every app.
Start a chat in Open WebUI. Continue it in SillyTavern. Keep coding in Cline. Same identity, same facts, same files — because UPtrim holds the state, not the frontend. Your AI finally feels like yours.
Swap models mid-sentence.
Llama today, Qwen tomorrow, rent GPT-5 for an hour? Aliases let you rename any backend. Your frontend keeps calling gpt-4. UPtrim silently routes to whichever brain is best — local when it can, cloud when it matters.
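A minimal sketch of how an alias table like this could behave. The model names and the routing dict here are illustrative, not UPtrim's actual configuration:

```python
# Hypothetical alias table: the model name a frontend asks for is mapped
# to whichever (backend, real model) pair should actually serve it.
ALIASES = {
    "gpt-4": ("ollama", "llama3:70b"),            # local by default
    "gpt-4-turbo": ("anthropic", "claude-opus"),  # cloud when it matters
}

def resolve(requested_model: str) -> tuple[str, str]:
    """Return (backend, real_model) for the name the frontend sent.
    Unknown names fall through to the local backend unchanged."""
    return ALIASES.get(requested_model, ("ollama", requested_model))
```

The frontend never learns the difference: it keeps calling `gpt-4` while the proxy rewrites the model field before forwarding the request.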
Code with your local
and cloud, hand-in-hand.
Local models are fast and free. Cloud models are expensive and sharp. UPtrim's hybrid router sends every request to the right brain, automatically — while both share the same memory.
You draft.
Your local Llama or Qwen takes the first pass — scaffolds the function, sketches the test, writes the commit message. Free, instant, fully offline.
It decides.
Hit a hard problem? The router spots the complexity and quietly escalates to Claude or GPT — passing your full project memory so the cloud model picks up mid-thought.
Claude reviews.
Opus does the heavy lifting — refactors, catches edge cases, explains the tricky parts. Every decision gets written back to your local memory so next time your local model remembers too.
Both sides see the same memory. Your preferences, project history, past decisions, code style — injected into whichever model handles the request. Local and cloud stop being two separate tools. They become one AI that knows you.
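One way such a router could decide when to escalate — a toy heuristic for illustration, not UPtrim's actual scoring:

```python
def needs_cloud(prompt: str, local_ctx_limit: int = 8192) -> bool:
    """Toy escalation heuristic: very long prompts, or prompts containing
    'hard-problem' markers, get routed to a cloud model."""
    hard_markers = ("refactor", "race condition", "prove", "security audit")
    too_long = len(prompt) // 4 > local_ctx_limit  # rough token estimate
    looks_hard = any(m in prompt.lower() for m in hard_markers)
    return too_long or looks_hard
```

A real router would weigh more signals (recent failures, user preference, cost budget), but the shape is the same: classify, then pick a backend.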
What you get for $0.
No credit card. No sign-up. Drop UPtrim on your machine, point it at your local LLM — and every feature below just works, fully offline.
Persistent memory
Every turn, UPtrim pulls names, preferences, projects, and relationships out of your chat — stored in local SQLite with FTS5 keyword search. Up to 5,000 facts, all editable from the dashboard.
- Smart fact extraction: spaCy NLP (TRF / FULL / LITE) with regex fallback.
- Intent-aware injection: memories ranked by relevance and staleness, per message.
- Dedup & consolidation: merges duplicates, resolves contradictions automatically.
- Basic knowledge graph: entity + relationship extraction, linked nodes in SQLite.
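The storage layer described above — local SQLite with FTS5 keyword search — can be sketched in a few lines. The schema here is illustrative, not UPtrim's actual tables:

```python
import sqlite3

# Illustrative fact store: one FTS5 virtual table, searched by keyword.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE facts USING fts5(user, fact)")
db.execute("INSERT INTO facts VALUES ('sarah', 'prefers dark mode')")
db.execute("INSERT INTO facts VALUES ('sarah', 'codes in Python')")

# FTS5 MATCH is case-insensitive with the default tokenizer; ORDER BY rank
# returns the best keyword matches first.
rows = db.execute(
    "SELECT fact FROM facts WHERE facts MATCH ? ORDER BY rank", ("python",)
).fetchall()
```

Because it's plain SQLite, the whole memory vault is one file on disk — easy to back up, inspect, or edit from the dashboard.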
5 isolated user vaults
Up to five people share one proxy, each with their own memory, files, and conversations. Zero leakage between them. Identity is resolved from Open WebUI headers, custom headers, or HMAC tokens.
- Secret Shield: API keys, passwords, AWS/OAuth tokens redacted pre-storage.
- Prompt-injection scan: heuristic filter blocks poisoned inputs from memory writes.
- Rate limiting: per-user caps, burst protection, stale-request filtering.
- HMAC-SHA256 tokens: labelled API keys, optional PBKDF2 passwords.
OpenAI-compat drop-in
Point any chat app at localhost:9099 and swap backends on the fly. Ollama, llama.cpp, vLLM, LM Studio, Claude, GPT, OpenRouter — if it speaks /v1/chat/completions, UPtrim routes to it.
- Multi-backend: several backends at once, swap mid-session without reconnecting.
- Streaming + SSE: full SSE with think-block filter for reasoning models.
- Auto-discovery: detects available models and context windows on startup.
- Read-only cloud OAuth: one cloud provider included on Free — no raw API keys.
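Drop-in means the request looks exactly like any OpenAI-compatible call — only the base URL changes. A sketch using only the standard library (the model name is whatever alias you configured; nothing is sent here):

```python
import json
from urllib import request

# The same payload any OpenAI-compatible client would POST.
body = json.dumps({
    "model": "gpt-4",  # an alias; UPtrim routes it to a real backend
    "messages": [
        {"role": "user", "content": "What was I working on yesterday?"}
    ],
    "stream": True,
}).encode()

req = request.Request(
    "http://localhost:9099/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req) would stream the response; not executed here.
```

Any client that lets you set a base URL — official SDKs included — can be pointed at the proxy the same way.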
Full web dashboard
Every memory, user, file, setting, and log — live at :9099/dashboard. Edit or delete anything. Each user also gets their own personal memory page to browse, pin, and prune their facts.
- Bundled chat UI: use your local LLM like a private ChatGPT — zero other apps.
- Agent mode: ReAct tool-use loop with memory search, URL fetch, file read — live.
- Terminal TUI: Textual live-monitor with stats, memory pressure, token gauges.
- Native desktop app: themes, slash commands, streaming CLI client.
50+ file formats
PDF, DOCX, XLSX, Markdown, code, JSON, YAML, logs — uploaded, auto-chunked, and injected as context when relevant. Per-user vault with 50 files each, fully isolated and searchable.
- Smart chunk injection: relevance-ranked, budget-aware, dynamic sizing.
- Optional embeddings: FAISS + bundled BGE-base for semantic file search.
- Local image gen: auto-routes image intents to your sd.cpp backend.
- Upload security: MIME allowlist, size caps, content-injection heuristics.
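Relevance-ranked, budget-aware chunk injection can be sketched as a score-then-pack loop. The scoring here is a toy keyword overlap, standing in for whatever ranking the proxy actually uses:

```python
def pick_chunks(chunks: list[str], query_terms: list[str],
                budget_tokens: int) -> list[str]:
    """Toy chunk selector: score each chunk by query-term overlap,
    then pack the best-scoring chunks into a token budget."""
    def score(chunk: str) -> int:
        words = chunk.lower().split()
        return sum(words.count(t) for t in query_terms)

    picked, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = len(chunk) // 4  # rough token estimate
        if score(chunk) > 0 and used + cost <= budget_tokens:
            picked.append(chunk)
            used += cost
    return picked
```

Embedding-based search (the FAISS option above) would replace the keyword score with vector similarity, but the packing step stays the same.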
Multi-mode NLP + GPU
spaCy TRF / FULL / LITE / regex with graceful auto-fallback. GPU auto-detected on CUDA, MPS, or ROCm. CPU-only still works. Custom entity patterns for names, dates, health, and diet — no external models required.
- Full audit trail: every memory op logged with user + timestamp.
- Error ring buffer: daily crash logs with full tracebacks, post-mortem ready.
- Perf metrics: tokens in/out, cache hits, latency, NLP timing.
- Debug endpoints: intent DNA inspection, memory provenance graphs.
v1.0 is the forever-free baseline. Paid tiers stack ghost agents, hybrid cloud+local routing, sub-agent swarms, and production features on top — but the foundation below is yours, offline, today.
See It in Action
Click a scenario to see what happens.
Persistent memory
You mentioned weeks ago that you code in Python and prefer dark mode. UPtrim extracted those facts and stored them. Next session, they're injected into context automatically.
Regex and spaCy NLP extract facts from conversations. FTS5 indexes them. Intent classification decides which memories are relevant to inject per message.
Per-user isolation
Sarah asks about her meeting notes. Mike asks about his Python script. Their memories, files, and conversations are completely separate.
Identity resolution pulls user info from chat app headers. Each user gets isolated memory, file storage, and context — configurable trust modes control what happens with unknown users.
File-backed context
Upload PDFs, text files, or notes. Ask questions and UPtrim pulls relevant sections into the LLM's context window.
Files are stored locally per user. Content is chunked, indexed, and matched against incoming messages. Relevant chunks get injected alongside memory.
Agent mode
The LLM can search the web, fetch URLs, and query stored memories on its own. No browser extensions or plugins.
Agent mode exposes tool-use endpoints to the LLM. It decides when to call them based on the conversation. Results are injected into the response context.
Full Visibility
View, edit, or delete any stored memory. Manage users and settings from your browser.
Web Dashboard
Live stats, stored memories, user list, and every setting. All at localhost:9099.
My Memory Page
Every user can see what the AI remembers about them, upload files, and fix mistakes.
AI Tools
Your AI can search the web, read files, and dig through memories on its own.
The Production Stack
Everything in Pro, plus features built for teams and production workflows — visual knowledge graph explorer, n8n workflow integration, and multi-agent collaboration.
Visual Knowledge Graph
Memories as interactive connected nodes. Zoom the whole network, click an entity to see every edge, trace how a conversation 3 weeks ago led to a decision yesterday. Pro tier gets the graph; Premium gets the explorer.
n8n + MCP
Expose UPtrim memory as MCP tools. Your n8n workflows can read, write, and query per-user memory — AI agents that remember across automations.
Ghost Mesh
Multi-agent collaboration: analyst + predictor + planner running in parallel, sharing the scratchpad, arguing before they commit. Sub-agent swarm on steroids.
Claude Code Offload
Hand hard multi-step tasks to a full Claude Code subprocess — preserves your session, context cache, and gives you code edits, shell access, and tool use without duplicating the harness.
Staleness Transparency
See why facts made it into this prompt. Which are fresh, which are stale, which got boosted, which got demoted. Every memory decision is audit-logged.
Unlimited TrimScript
Standard capped at 10 plugins, Pro at 50. Premium removes the cap. Plus hot-reload, visual blueprint builder, and priority access to the plugin registry.
Ambient Task Tracker
Picks up commitments and deadlines from normal conversation — "I'll finish the report by Friday", "remind me to call Sarah next week" — and surfaces them back when the time comes. No explicit to-do list required.
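Picking commitments out of free-form chat can be illustrated with a small pattern — a deliberately naive sketch, nowhere near as robust as a real tracker would need to be:

```python
import re

# Toy commitment pattern: a promise phrase, a task, and a deadline phrase.
COMMIT_RE = re.compile(
    r"\b(?:I'll|I will|remind me to)\s+"
    r"(?P<task>.+?)\s+"
    r"(?P<when>by \w+|next \w+|tomorrow)\b",
    re.IGNORECASE,
)

def find_commitments(text: str) -> list[tuple[str, str]]:
    """Return (task, deadline) pairs spotted in the text."""
    return [(m["task"], m["when"]) for m in COMMIT_RE.finditer(text)]
```

A production version would use the NLP pipeline's date entities rather than fixed phrases, but the idea is the same: extract, store, and surface when the deadline nears.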
SLA + Early Access
48-hour priority support response, custom deployment help, on-prem licensing available. Plus first-look access to v2.1 features while they're still in development.
Runs on Your Hardware
UPtrim works with every major GPU backend. Your LLM handles inference, UPtrim handles memory.
NVIDIA CUDA
Full CUDA acceleration via llama.cpp, Ollama, and vLLM
AMD ROCm
ROCm support through compatible backends for AMD GPUs
Apple MLX
Native Apple Silicon acceleration via MLX and Metal