Can local preprocessing cut LLM API costs?
A few days ago I shared a project I’ve been working on called “LatentGate” — a local-first pipeline that reduces LLM API token usage by processing inputs before sending them to the model. After some great feedback, I’ve now turned it into: A pip-installable Python package A VS Code extension (runs as a local proxy) MCP server support for tools like Claude Code, Cursor, Cline, Continue PyPI → pip install latent-gate VS Code → LatentGate — Local-First AI Compression What it does Images (~1000–1300 tokens) → compressed to ~150 tokens using local vision models (Ollama + LLaVA) Long prompts / conversations → compressed locally before hitting cloud APIs Works with OpenAI / Claude / Gemini APIs Fully local preprocessing (no data leaves your machine before compression) The idea is inspired by VL-JEPA — predicting in embedding space, then decoding selectively. Why I built this While experimenting with GPT-4o / vision APIs, I noticed most costs come from raw input size (especially images and long prompts). So instead of optimizing prompts endlessly, I tried: → “What if we reduce what we send in the first place?” What I’m looking for I’d love feedback from this community, especially: Edge cases where compression breaks context Cases where output quality drops noticeably Prompt / API compatibility issues (OpenAI especially) Performance bottlenecks Better approaches to selective decoding or compression If you try it and something fails — that’s honestly the most valuable thing for me rig…