Local-first AI memory.
Hybrid when you need it.
UPtrim is a reverse proxy that sits on your machine, between your chat app and every LLM you use. It makes local LLMs actually useful by giving them a proper memory — and then teaches them to collaborate with cloud models like Claude and GPT without ever letting your context leave your box.
Any LLM. Any frontend.
Your memory in the middle.
UPtrim speaks OpenAI-compatible on both sides. Plug in Ollama or llama.cpp on one end, Open WebUI or SillyTavern on the other — or skip the whole stack and just use our bundled chat client. Memory, identity, and routing follow you everywhere.
- Ollama
- llama.cpp
- LM Studio
- vLLM · TGI
- Anthropic (Claude)
- OpenAI (GPT-5)
- OpenRouter
- Groq · Together · Fireworks
If it speaks /v1/chat/completions, it works. Just add it to the config file.
- UPtrim Chat
- Open WebUI
- SillyTavern
- LibreChat · LobeChat
- Continue.dev
- Cline · Aider
- BoltAI · Msty
- Cursor (custom endpoint)
Point it at localhost:9099 — and go.
No frontend? No problem.
Bundled chat UI lives at localhost:9099. Clean, touch-friendly, searchable. Use your local Llama or Qwen like your own private ChatGPT — with memory, file upload, and multi-user accounts. Zero other apps required.
One memory across every app.
Start a chat in Open WebUI. Continue it in SillyTavern. Keep coding in Cline. Same identity, same facts, same files — because UPtrim holds the state, not the frontend. Your AI finally feels like yours.
Swap models mid-sentence.
Llama today, Qwen tomorrow, rent GPT-5 for an hour? Aliases let you rename any backend. Your frontend keeps calling gpt-4. UPtrim silently routes to whichever brain is best — local when it can, cloud when it matters.
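A minimal sketch of how an alias table like this could behave. The model names and the routing dict here are illustrative, not UPtrim's actual configuration:

```python
# Hypothetical alias table: the model name a frontend asks for is mapped
# to whichever (backend, real model) pair should actually serve it.
ALIASES = {
    "gpt-4": ("ollama", "llama3:70b"),            # local by default
    "gpt-4-turbo": ("anthropic", "claude-opus"),  # cloud when it matters
}

def resolve(requested_model: str) -> tuple[str, str]:
    """Return (backend, real_model) for the name the frontend sent.
    Unknown names fall through to the local backend unchanged."""
    return ALIASES.get(requested_model, ("ollama", requested_model))
```

The frontend never learns the difference: it keeps calling `gpt-4` while the proxy rewrites the model field before forwarding the request.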
Code with your local
and cloud, hand-in-hand.
Local models are fast and free. Cloud models are expensive and sharp. UPtrim's hybrid router sends every request to the right brain, automatically — while both share the same memory.
You draft.
Your local Llama or Qwen takes the first pass — scaffolds the function, sketches the test, writes the commit message. Free, instant, fully offline.
It decides.
Hit a hard problem? The router spots the complexity and quietly escalates to Claude or GPT — passing your full project memory so the cloud model picks up mid-thought.
Claude reviews.
Opus does the heavy lifting — refactors, catches edge cases, explains the tricky parts. Every decision gets written back to your local memory so next time your local model remembers too.
Both sides see the same memory. Your preferences, project history, past decisions, code style — injected into whichever model handles the request. Local and cloud stop being two separate tools. They become one AI that knows you.
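One way such a router could decide when to escalate — a toy heuristic for illustration, not UPtrim's actual scoring:

```python
def needs_cloud(prompt: str, local_ctx_limit: int = 8192) -> bool:
    """Toy escalation heuristic: very long prompts, or prompts containing
    'hard-problem' markers, get routed to a cloud model."""
    hard_markers = ("refactor", "race condition", "prove", "security audit")
    too_long = len(prompt) // 4 > local_ctx_limit  # rough token estimate
    looks_hard = any(m in prompt.lower() for m in hard_markers)
    return too_long or looks_hard
```

A real router would weigh more signals (recent failures, user preference, cost budget), but the shape is the same: classify, then pick a backend.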
What you get for $0.
No credit card. No sign-up. Drop UPtrim on your machine, point it at your local LLM — and every feature below just works, fully offline.
Persistent memory
Every turn, UPtrim pulls names, preferences, projects, and relationships out of your chat — stored in local SQLite with FTS5 keyword search. Up to 5,000 facts, all editable from the dashboard.
- Smart fact extraction: spaCy NLP (TRF / FULL / LITE) with regex fallback.
- Intent-aware injection: memories ranked by relevance and staleness, per message.
- Dedup & consolidation: merges duplicates, resolves contradictions automatically.
- Basic knowledge graph: entity + relationship extraction, linked nodes in SQLite.
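The storage layer described above — local SQLite with FTS5 keyword search — can be sketched in a few lines. The schema here is illustrative, not UPtrim's actual tables:

```python
import sqlite3

# Illustrative fact store: one FTS5 virtual table, searched by keyword.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE facts USING fts5(user, fact)")
db.execute("INSERT INTO facts VALUES ('sarah', 'prefers dark mode')")
db.execute("INSERT INTO facts VALUES ('sarah', 'codes in Python')")

# FTS5 MATCH is case-insensitive with the default tokenizer; ORDER BY rank
# returns the best keyword matches first.
rows = db.execute(
    "SELECT fact FROM facts WHERE facts MATCH ? ORDER BY rank", ("python",)
).fetchall()
```

Because it's plain SQLite, the whole memory vault is one file on disk — easy to back up, inspect, or edit from the dashboard.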
5 isolated user vaults
Up to five people share one proxy, each with their own memory, files, and conversations. Zero leakage between them. Identity is resolved from Open WebUI headers, custom headers, or HMAC tokens.
- Secret Shield: API keys, passwords, AWS/OAuth tokens redacted pre-storage.
- Prompt-injection scan: heuristic filter blocks poisoned inputs from memory writes.
- Rate limiting: per-user caps, burst protection, stale-request filtering.
- HMAC-SHA256 tokens: labelled API keys, optional PBKDF2 passwords.
OpenAI-compat drop-in
Point any chat app at localhost:9099 and swap backends on the fly. Ollama, llama.cpp, vLLM, LM Studio, Claude, GPT, OpenRouter — if it speaks /v1/chat/completions, UPtrim routes to it.
- Multi-backend: several backends at once, swap mid-session without reconnecting.
- Streaming + SSE: full SSE with think-block filter for reasoning models.
- Auto-discovery: detects available models and context windows on startup.
- Read-only cloud OAuth: one cloud provider included on Free — no raw API keys.
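Drop-in means the request looks exactly like any OpenAI-compatible call — only the base URL changes. A sketch using only the standard library (the model name is whatever alias you configured; nothing is sent here):

```python
import json
from urllib import request

# The same payload any OpenAI-compatible client would POST.
body = json.dumps({
    "model": "gpt-4",  # an alias; UPtrim routes it to a real backend
    "messages": [
        {"role": "user", "content": "What was I working on yesterday?"}
    ],
    "stream": True,
}).encode()

req = request.Request(
    "http://localhost:9099/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req) would stream the response; not executed here.
```

Any client that lets you set a base URL — official SDKs included — can be pointed at the proxy the same way.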
Full web dashboard
Every memory, user, file, setting, and log — live at :9099/dashboard. Edit or delete anything. Each user also gets their own personal memory page to browse, pin, and prune their facts.
- Bundled chat UI: use your local LLM like a private ChatGPT — zero other apps.
- Agent mode: ReAct tool-use loop with memory search, URL fetch, file read — live.
- Terminal TUI: Textual live-monitor with stats, memory pressure, token gauges.
- Native desktop app: themes, slash commands, streaming CLI client.
50+ file formats
PDF, DOCX, XLSX, Markdown, code, JSON, YAML, logs — uploaded, auto-chunked, and injected as context when relevant. Per-user vault with 50 files each, fully isolated and searchable.
- Smart chunk injection: relevance-ranked, budget-aware, dynamic sizing.
- Optional embeddings: FAISS + bundled BGE-base for semantic file search.
- Local image gen: auto-routes image intents to your sd.cpp backend.
- Upload security: MIME allowlist, size caps, content-injection heuristics.
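Relevance-ranked, budget-aware chunk injection can be sketched as a score-then-pack loop. The scoring here is a toy keyword overlap, standing in for whatever ranking the proxy actually uses:

```python
def pick_chunks(chunks: list[str], query_terms: list[str],
                budget_tokens: int) -> list[str]:
    """Toy chunk selector: score each chunk by query-term overlap,
    then pack the best-scoring chunks into a token budget."""
    def score(chunk: str) -> int:
        words = chunk.lower().split()
        return sum(words.count(t) for t in query_terms)

    picked, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = len(chunk) // 4  # rough token estimate
        if score(chunk) > 0 and used + cost <= budget_tokens:
            picked.append(chunk)
            used += cost
    return picked
```

Embedding-based search (the FAISS option above) would replace the keyword score with vector similarity, but the packing step stays the same.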
Multi-mode NLP + GPU
spaCy TRF / FULL / LITE / regex with graceful auto-fallback. GPU auto-detected on CUDA, MPS, or ROCm. CPU-only still works. Custom entity patterns for names, dates, health, and diet — no external models required.
- Full audit trail: every memory op logged with user + timestamp.
- Error ring buffer: daily crash logs with full tracebacks, post-mortem ready.
- Perf metrics: tokens in/out, cache hits, latency, NLP timing.
- Debug endpoints: intent DNA inspection, memory provenance graphs.
v1.0 is the forever-free baseline. Paid tiers stack ghost agents, hybrid cloud+local routing, sub-agent swarms, and production features on top — but the foundation below is yours, offline, today.
See It in Action
Click a scenario to see what happens.
Persistent memory
You mentioned weeks ago that you code in Python and prefer dark mode. UPtrim extracted those facts and stored them. Next session, they're injected into context automatically.
Regex and spaCy NLP extract facts from conversations. FTS5 indexes them. Intent classification decides which memories are relevant to inject per message.
Per-user isolation
Sarah asks about her meeting notes. Mike asks about his Python script. Their memories, files, and conversations are completely separate.
Identity resolution pulls user info from chat app headers. Each user gets isolated memory, file storage, and context — configurable trust modes control what happens with unknown users.
File-backed context
Upload PDFs, text files, or notes. Ask questions and UPtrim pulls relevant sections into the LLM's context window.
Files are stored locally per user. Content is chunked, indexed, and matched against incoming messages. Relevant chunks get injected alongside memory.
Agent mode
The LLM can search the web, fetch URLs, and query stored memories on its own. No browser extensions or plugins.
Agent mode exposes tool-use endpoints to the LLM. It decides when to call them based on the conversation. Results are injected into the response context.
Full Visibility
View, edit, or delete any stored memory. Manage users and settings from your browser.
Web Dashboard
Live stats, stored memories, user list, and every setting. All at localhost:9099.
My Memory Page
Every user can see what the AI remembers about them, upload files, and fix mistakes.
AI Tools
Your AI can search the web, read files, and dig through memories on its own.
The Production Stack
Everything in Pro, plus features built for teams and production workflows — visual knowledge graph explorer, n8n workflow integration, and multi-agent collaboration.
Visual Knowledge Graph
Memories as interactive connected nodes. Zoom the whole network, click an entity to see every edge, trace how a conversation 3 weeks ago led to a decision yesterday. Pro tier gets the graph; Premium gets the explorer.
n8n + MCP
Expose UPtrim memory as MCP tools. Your n8n workflows can read, write, and query per-user memory — AI agents that remember across automations.
Ghost Mesh
Multi-agent collaboration: analyst + predictor + planner running in parallel, sharing the scratchpad, arguing before they commit. Sub-agent swarm on steroids.
Claude Code Offload
Hand hard multi-step tasks to a full Claude Code subprocess — preserves your session, context cache, and gives you code edits, shell access, and tool use without duplicating the harness.
Staleness Transparency
See why facts made it into this prompt. Which are fresh, which are stale, which got boosted, which got demoted. Every memory decision is audit-logged.
Unlimited TrimScript
Standard capped at 10 plugins, Pro at 50. Premium removes the cap. Plus hot-reload, visual blueprint builder, and priority access to the plugin registry.
Ambient Task Tracker
Picks up commitments and deadlines from normal conversation — "I'll finish the report by Friday", "remind me to call Sarah next week" — and surfaces them back when the time comes. No explicit to-do list required.
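Picking commitments out of free-form chat can be illustrated with a small pattern — a deliberately naive sketch, nowhere near as robust as a real tracker would need to be:

```python
import re

# Toy commitment pattern: a promise phrase, a task, and a deadline phrase.
COMMIT_RE = re.compile(
    r"\b(?:I'll|I will|remind me to)\s+"
    r"(?P<task>.+?)\s+"
    r"(?P<when>by \w+|next \w+|tomorrow)\b",
    re.IGNORECASE,
)

def find_commitments(text: str) -> list[tuple[str, str]]:
    """Return (task, deadline) pairs spotted in the text."""
    return [(m["task"], m["when"]) for m in COMMIT_RE.finditer(text)]
```

A production version would use the NLP pipeline's date entities rather than fixed phrases, but the idea is the same: extract, store, and surface when the deadline nears.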
SLA + Early Access
48-hour priority support response, custom deployment help, on-prem licensing available. Plus first-look access to v2.1 features while they're still in development.
Runs on Your Hardware
UPtrim works with every major GPU backend. Your LLM handles inference, UPtrim handles memory.
NVIDIA CUDA
Full CUDA acceleration via llama.cpp, Ollama, and vLLM
AMD ROCm
ROCm support through compatible backends for AMD GPUs
Apple MLX
Native Apple Silicon acceleration via MLX and Metal