A few days ago I shared a project I’ve been working on called “LatentGate” — a local-first pipeline that reduces LLM API token usage by processing inputs before sending them to the model.

After some great feedback, I’ve now turned it into:

  • A pip-installable Python package
  • A VS Code extension (runs as a local proxy)
  • MCP server support for tools like Claude Code, Cursor, Cline, Continue

PyPI → pip install latent-gate
VS Code → LatentGate — Local-First AI Compression


What it does

  • Images (~1000–1300 tokens) → compressed to ~150 tokens using local vision models (Ollama + LLaVA)
  • Long prompts / conversations → compressed locally before hitting cloud APIs
  • Works with OpenAI / Claude / Gemini APIs
  • Fully local preprocessing (raw inputs never leave your machine; only the compressed version is sent to the API)

The idea is inspired by VL-JEPA — predicting in embedding space, then decoding selectively.
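
To make the image bullet above concrete, here is a minimal sketch of the general pattern. It assumes Ollama is running locally on its default port with the `llava` model pulled, and that "compression" means replacing the image with a short text description. This is not LatentGate's actual code: `build_payload`, `describe_image`, `token_savings`, the prompt wording, and the word limit are all hypothetical names and values chosen for illustration.

```python
import base64
import json
import urllib.request

# Ollama's default local endpoint (assumes a stock install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, max_words: int = 100) -> dict:
    """Build a non-streaming /api/generate request for a LLaVA model."""
    return {
        "model": "llava",
        "prompt": (
            f"Describe this image in at most {max_words} words. "
            "Keep any visible text, numbers, and labels."
        ),
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(image_bytes: bytes) -> str:
    """Send the image to the local model and return its text description."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]


def token_savings(before: int, after: int) -> float:
    """Fraction of input tokens saved by sending the description instead."""
    return 1 - after / before
```

With the figures quoted above, `token_savings(1000, 150)` and `token_savings(1300, 150)` work out to roughly 85% and 88% fewer input tokens per image. The description returned by `describe_image` would then go to the cloud API as plain text in place of the image, which is also where details can get lost.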


Why I built this

While experimenting with GPT-4o / vision APIs, I noticed that most of the cost comes from raw input size (especially images and long prompts).

So instead of optimizing prompts endlessly, I asked:
→ “What if we reduce what we send in the first place?”


What I’m looking for

I’d love feedback from this community, especially:

  • Edge cases where compression breaks context
  • Cases where output quality drops noticeably
  • Prompt / API compatibility issues (OpenAI especially)
  • Performance bottlenecks
  • Better approaches to selective decoding or compression

If you try it and something fails — that’s honestly the most valuable thing for me right now.


If you’re exploring similar ideas (local-first processing, token optimization, MCP workflows), I’d love to hear your thoughts.