AI/ML News & Innovations Hub

A few days ago I shared a project I’ve been working on called “LatentGate” — a local-first pipeline that reduces LLM API token usage by processing inputs before sending them to the model.

After some great feedback, I’ve now turned it into:

A pip-installable Python package
A VS Code extension (runs as a local proxy)
MCP server support for tools like Claude Code, Cursor, Cline, Continue

PyPI → pip install latent-gate
VS Code → LatentGate — Local-First AI Compression

What it does

Images (~1000–1300 tokens) → compressed to ~150 tokens using local vision models (Ollama + LLaVA)
Long prompts / conversations → compressed locally before hitting cloud APIs
Works with OpenAI / Claude / Gemini APIs
Fully local preprocessing (no data leaves your machine before compression)
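
To make the image step concrete, here is a minimal sketch of how an image could be turned into a short text description by a local LLaVA model before anything goes to a cloud API. This is not LatentGate's actual code. It assumes an Ollama server on its default port (11434) with the `llava` model already pulled, and the function names `build_llava_request` and `describe_image_locally` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_llava_request(image_bytes: bytes, max_words: int = 100) -> dict:
    """Build the JSON body asking a local LLaVA model for a short description."""
    return {
        "model": "llava",
        "prompt": (
            f"Describe this image in at most {max_words} words. "
            "Keep any visible text, numbers, and labels."
        ),
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image_locally(image_bytes: bytes, url: str = OLLAMA_URL) -> str:
    """Send the image to the local Ollama server and return the text description."""
    body = json.dumps(build_llava_request(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

The returned description would then be sent to the cloud model in place of the raw image. The word limit in the prompt is only a rough lever on the output size, so the final token count will vary by image.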

The idea is inspired by VL-JEPA — predicting in embedding space, then decoding selectively.

Why I built this

While experimenting with GPT-4o / vision APIs, I noticed most costs come from raw input size (especially images and long prompts).

So instead of optimizing prompts endlessly, I tried:
→ “What if we reduce what we send in the first place?”
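
As a rough illustration of what that reduction means, the snippet below works out the input-token savings for the image figures quoted above (~1000–1300 tokens down to ~150). The helper name `input_token_savings` is made up for this example, and the calculation ignores output tokens and the cost of running the local model.

```python
def input_token_savings(tokens_before: int, tokens_after: int) -> float:
    """Return the fraction of input tokens saved by compression."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 1 - tokens_after / tokens_before


for before in (1000, 1300):
    saved = input_token_savings(before, 150)
    print(f"{before} -> 150 tokens: {saved:.0%} fewer input tokens")
# prints:
# 1000 -> 150 tokens: 85% fewer input tokens
# 1300 -> 150 tokens: 88% fewer input tokens
```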

What I’m looking for

I’d love feedback from this community, especially:

Edge cases where compression breaks context
Cases where output quality drops noticeably
Prompt / API compatibility issues (OpenAI especially)
Performance bottlenecks
Better approaches to selective decoding or compression
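
If you want a quick way to hunt for the first kind of edge case, one crude check is to compare the original and compressed text and list the numbers and identifier-like tokens that went missing. The sketch below is only an illustration of that idea, not part of LatentGate; `missing_details` is a hypothetical helper, and the plain substring test can miss cases (for example, "3" counts as present if "30" survives).

```python
import re

# Numbers (with optional decimals) and identifier-like tokens
# such as snake_case names or dotted paths.
_DETAIL_PATTERN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]+(?:[._][A-Za-z0-9_]+)+")


def missing_details(original: str, compressed: str) -> list[str]:
    """List details found in `original` that no longer appear in `compressed`."""
    seen = set()
    missing = []
    for token in _DETAIL_PATTERN.findall(original):
        # Report each distinct token once, in order of first appearance.
        if token not in seen and token not in compressed:
            missing.append(token)
        seen.add(token)
    return missing


original = "Call user_id.lookup with timeout 30 and retry 3 times"
compressed = "Call user_id.lookup with a timeout"
print(missing_details(original, compressed))  # ['30', '3']
```

A non-empty result does not prove the answer will be worse, but it is a cheap signal for which prompts are worth inspecting by hand.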

If you try it and something fails — that’s honestly the most valuable thing for me right now.

If you’re exploring similar ideas (local-first processing, token optimization, MCP workflows), I’d love to hear your thoughts.
