A few days ago I shared a project I’ve been working on called “LatentGate” — a local-first pipeline that reduces LLM API token usage by processing inputs before sending them to the model.
After some great feedback, I’ve now turned it into:
A pip-installable Python package
A VS Code extension (runs as a local proxy)
MCP server support for tools like Claude Code, Cursor, Cline, Continue
PyPI → pip install latent-gate
VS Code → LatentGate — Local-First AI Compression
What it does
- Images (~1000–1300 tokens) → compressed to ~150 tokens using local vision models (Ollama + LLaVA)
- Long prompts / conversations → compressed locally before hitting cloud APIs
- Works with OpenAI / Claude / Gemini APIs
- Fully local preprocessing (no data leaves your machine before compression)
The idea is inspired by VL-JEPA — predicting in embedding space, then decoding selectively.
Why I built this
While experimenting with GPT-4o / vision APIs, I noticed most costs come from raw input size (especially images and long prompts).
So instead of optimizing prompts endlessly, I tried:
→ “What if we reduce what we send in the first place?”
What I’m looking for
I’d love feedback from this community, especially:
- Edge cases where compression breaks context
- Cases where output quality drops noticeably
- Prompt / API compatibility issues (OpenAI especially)
- Performance bottlenecks
- Better approaches to selective decoding or compression
If you try it and something fails — that’s honestly the most valuable thing for me right now.
If you’re exploring similar ideas (local-first processing, token optimization, MCP workflows), I’d love to hear your thoughts