AI/ML News & Innovations Hub

OpenAI Community 2026-06-29 13:51 UTC Score 63.0 AI-116-20260629-social-media-d0056176

Can local preprocessing cut LLM API costs?

A few days ago I shared a project I’ve been working on called “LatentGate”, a local-first pipeline that reduces LLM API token usage by processing inputs before sending them to the model. After some great feedback, I’ve now turned it into:

- A pip-installable Python package
- A VS Code extension (runs as a local proxy)
- MCP server support for tools like Claude Code, Cursor, Cline, Continue

PyPI → pip install latent-gate
VS Code → LatentGate — Local-First AI Compression

What it does

- Images (~1000–1300 tokens) → compressed to ~150 tokens using local vision models (Ollama + LLaVA)
- Long prompts / conversations → compressed locally before hitting cloud APIs
- Works with OpenAI / Claude / Gemini APIs
- Fully local preprocessing (no data leaves your machine before compression)

The idea is inspired by VL-JEPA: predicting in embedding space, then decoding selectively.

Why I built this

While experimenting with GPT-4o / vision APIs, I noticed most costs come from raw input size (especially images and long prompts). So instead of optimizing prompts endlessly, I tried: → “What if we reduce what we send in the first place?”

What I’m looking for

I’d love feedback from this community, especially:

- Edge cases where compression breaks context
- Cases where output quality drops noticeably
- Prompt / API compatibility issues (OpenAI especially)
- Performance bottlenecks
- Better approaches to selective decoding or compression

If you try it and something fails, that’s honestly the most valuable thing for me rig…
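As a rough sanity check on the figures quoted in the post, the per-image saving can be worked out directly. This is only a sketch of the arithmetic: the function names are hypothetical, the price of $2.50 per million input tokens is an assumed placeholder (the post gives no price), and the cost of running the local vision model is ignored.

```python
def image_token_savings(raw_tokens: int, compressed_tokens: int = 150) -> float:
    """Fraction of input tokens saved when an image is replaced by a compressed summary."""
    return 1 - compressed_tokens / raw_tokens


def input_cost_usd(images: int, tokens_per_image: int, usd_per_million: float) -> float:
    """Input-token cost of sending `images` images at a given token size and price."""
    return images * tokens_per_image * usd_per_million / 1_000_000


ASSUMED_PRICE = 2.50  # hypothetical USD per million input tokens, not from the post

# Token counts from the post: ~1000-1300 tokens per raw image, ~150 after compression.
savings_low = image_token_savings(1000)
savings_high = image_token_savings(1300)

# Cost of 10,000 images, raw versus compressed, under the assumed price.
raw_cost = input_cost_usd(10_000, 1000, ASSUMED_PRICE)
compressed_cost = input_cost_usd(10_000, 150, ASSUMED_PRICE)

print(f"token savings: {savings_low:.0%} to {savings_high:.1%}")
print(f"10k images: ${raw_cost:.2f} raw vs ${compressed_cost:.2f} compressed")
```

Under these assumptions the input-token saving is roughly 85 to 88 percent per image; whether output quality survives that compression is exactly the open question the author asks about.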

Read article →

MarkTechPost 2026-06-28 04:58 UTC Score 78.0 AI-032-20260628-ai-specialis-4f84a0b2

Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

Liquid AI released LFM2.5-230M, its smallest model yet. The 230M-parameter, open-weight model runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5. Built on the LFM2 architecture, it targets tool use and data extraction, beating larger models like Qwen3.5-0.8B and Gemma 3 1B on instruction following.

Read article →

Towards Data Science 2026-06-26 16:30 UTC Score 61.0 AI-036-20260626-ai-specialis-044daf0b

From Local LLM to Tool-Using Agent

Using Gemma 4, Ollama, the OpenAI Agents SDK, and Tavily MCP to build a lightweight research agent.

Read article →

CIO AI 2026-06-25 14:40 UTC Score 28.0 USR-0125-20260625-global-ai-ne-c4257a67

The metric that tripped up Duolingo’s ‘AI-first’ strategy

In April 2026, Duolingo CEO Luis von Ahn acknowledged that the company had withdrawn one of the most sensitive elements of its artificial intelligence strategy: AI use would no longer count in employee performance reviews. What is striking is that, a year earlier, a full-blown public crisis had failed to shift its strategy by an inch. The first debate opened in spring 2025, when Duolingo declared itself ‘AI-first’. That set off the usual argument about AI versus people. It caught on quickly: users deleting the app and the brand’s social channels flooded with criticism. Von Ahn handled the reputational crisis skillfully: clarifications, nuance, and a softer tone. It worked. The fire died down, the strategy stayed intact, and the company kept growing. But a second debate had opened, less visible but equally important: how employees were evaluated. That one could not be calmed with a press release. Outside the company it barely registered: what a company does with its internal reviews does not trigger mass cancellations or set TikTok ablaze. Inside, it was a different story. There was no outcry, but there was a fundamental objection. And this time the CEO gave in. The communication was almost the reverse of the previous year’s: there was no big public retraction and no image campaign. Von Ahn mentioned it almost in passing on a podcast: that metric had been withdrawn. A public crisis did not move the strategy. An internal objection did. What is interesting is not so much the differ…

Read article →

InfoWorld AI 2026-06-24 09:00 UTC Score 42.0 USR-0126-20260624-global-ai-ne-35d2d2c5

Using Visual Studio Code’s ‘air-gapped’ AI model mode

Microsoft has been pushing hard to make Visual Studio Code a major way to consume its AI services, mostly in the form of GitHub Copilot. GitHub Copilot’s deep integration with VS Code brings many conveniences (inline autocomplete, for instance), but it’s frustrating for those, like me, who would rather use another model provider, or even a locally hosted LLM, for those functions. Visual Studio Code 1.122 introduced a new feature, “Use BYOK [Bring Your Own Key] without a GitHub sign-in,” that allows you to “use chat, tools, and MCP servers in air-gapped or restricted environments where GitHub sign-in isn’t possible.” More importantly, it “enables fully offline workflows with local models like Ollama.” In other words, you can now use locally hosted LLMs for chat, tools, and Model Context Protocol servers inside Visual Studio Code. The one thing you still can’t do is use a local LLM for inline and next-edit suggestions, at least not without additional tooling.

Choosing a model for BYOK mode

If you want to use a local LLM with VS Code’s bring-your-own-model system, the first thing you need is a way to host the model. VS Code lacks a model-hosting mechanism of its own, although it’s conceivable that a VS Code extension may offer something like that in the future. That said, hosting models is complicated enough that a dedicated app is really needed for the job. One easy way to host models is via a product like LM Studio, a convenient GUI for standing up, serving, and managi…

Read article →

NVIDIA Developer YouTube 2026-06-15 21:55 UTC Score 59.0 AI-144-20260615-podcasts-and-176b0d7c

Local GenAI on Jetson: OSS models using different inferencing frameworks: Ollama, llama.cpp, & vLLM

This opening session builds the foundation for running popular OSS models such as Gemma and Qwen directly on Jetson, with no cloud required. We cover when to use Ollama for rapid local prototyping versus vLLM for higher-throughput serving, show how the same workflow applies to both across different OSS models, and walk through the real decisions behind model choice, containers, quantization, and performance tuning on edge hardware. We close with a teaser of OpenClaw and a bonus take-home challenge to kick off community building.

You will learn how to deploy open-source AI models on NVIDIA Jetson, from first launch to production-ready serving. We'll cover:

- Getting models running on NVIDIA Jetson: spin up popular open-source LLMs and VLMs such as Gemma and Qwen using Ollama or vLLM on Jetson hardware and verify they're working end-to-end.
- Choosing the right inference engine: understand the practical tradeoffs between Ollama for rapid local prototyping, vLLM for higher-throughput serving, and llama.cpp, so you can pick the right tool for your use case.
- NVIDIA Jetson-specific serving strategies: walk through the real decisions behind model choice, containers, and performance tuning tailored for Orin and Thor, including what works, what doesn't, and why.
- Performance fundamentals: get introduced to quantization and speculative decoding: what they are, how they work, and when to reach for them on edge hardware.
- Real-world appl…

Read article →

Data Science Stack Exchange 2026-06-12 10:02 UTC Score 24.0 AI-111-20260612-social-media-024a8446

Matching first names, full names and pronouns

I am working on a graph store of entities and relationships extracted from a factual test document of around 500 words. The first pass (NER) extracts named entities, the second extracts relationships (RE). A given person appears under different references in the text (Maria, Maria Gotthard, Dr. Maria Gotthard) and can also be referred to as 'she', for example 'she was rewarded by the company'. The goal is to merge all these references into one entity so that the relationship graph is not fragmented into different contexts. I have seen a few posts on different forums saying this is a very difficult problem, but hopefully someone out there has some insights or experience to share 🙂 To make things interesting, references to the same entity can occur in different chunks of text, making it impossible for the LLM (currently Ollama/Mistral) to process the cross-chunk context in one call. To address this, I have added a pass across all extracted entities, including exact text matching and a Levenshtein similarity check, but this does not handle first name vs. full name and comes with a host of other issues. It has a high risk of over-merging: for example, if a set of entities consists of incrementally numbered items, they will all be merged into one entity. I am wondering if there is a particular architecture for this problem, for example pre-processing a document to link related entities before extracting. Doesn't have to be LLM-based, heuristics and algorithms sometimes do the trick as…
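One heuristic that fits the first-name versus full-name case described above is to compare mentions as sets of name tokens rather than by edit distance. The sketch below is an illustration, not the asker's pipeline: the function names and the small honorific list are assumptions, and pronouns such as 'she' are deliberately left unmerged because they need a separate coreference step.

```python
import re

# Illustrative honorifics only; a real list would be longer.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def name_tokens(mention: str) -> frozenset[str]:
    """Lowercase a mention, split it into word tokens, and drop honorifics."""
    words = re.findall(r"\w+", mention.lower())
    return frozenset(w for w in words if w not in TITLES)


def merge_mentions(mentions: list[str]) -> dict[str, str]:
    """Map each mention to a canonical name.

    A mention is merged only when its tokens are a subset of exactly one
    known canonical name, so an ambiguous first name is left alone.
    """
    # Deduplicate in input order, then process the longest names first so
    # that full names become the canonical entries.
    unique = list(dict.fromkeys(mentions))
    ordered = sorted(unique, key=lambda m: len(name_tokens(m)), reverse=True)

    canonical: dict[frozenset[str], str] = {}
    result: dict[str, str] = {}
    for mention in ordered:
        tokens = name_tokens(mention)
        matches = [name for toks, name in canonical.items() if tokens and tokens <= toks]
        if len(matches) == 1:
            result[mention] = matches[0]      # unambiguous: merge
        else:
            result[mention] = mention         # new or ambiguous: keep separate
            if not matches and tokens:
                canonical[tokens] = mention
    return result


merged = merge_mentions(
    ["Maria", "Maria Gotthard", "Dr. Maria Gotthard", "Item 1", "Item 2"]
)
for mention, entity in merged.items():
    print(f"{mention!r} -> {entity!r}")
```

Because "Item 1" and "Item 2" differ by a whole token, neither is a subset of the other and they stay separate, which avoids the over-merging that a character-level Levenshtein threshold produces on numbered items. The trade-off is that misspellings are no longer caught, so in practice the two checks would likely need to be combined.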

Read article →

Practical AI Podcast 2026-05-07 09:00 UTC Score 34.0 AI-143-20260507-podcasts-and-db3298dd

The Myth of Model Wars: Open vs Closed AI in 2026

In this fully connected episode, Dan and Chris break down one of the biggest questions in AI today: does the open vs. closed model distinction still matter? From the rise of physical AI and edge devices to the shifting landscape of open-source models like LLaMA, they explore whether the “model wars” are becoming irrelevant. The conversation then dives into a bigger transformation: the rise of agentic systems, workflows, and AI-driven infrastructure. Featuring: Chris Benson and Daniel Whitenack. Upcoming events: Midwest AI Summit 2026.

Read article →

Qdrant Blog 2024-04-10 00:04 UTC Score 46.0 USR-0074-20240410-ai-specialis-09812eb6

New RAG Horizons with Qdrant Hybrid Cloud and LlamaIndex

We’re happy to announce the collaboration between LlamaIndex and Qdrant’s new Hybrid Cloud launch, aimed at empowering engineers and scientists worldwide to swiftly and securely develop and scale their GenAI applications. By leveraging LlamaIndex’s robust framework, users can maximize the potential of vector search and create stable and effective AI products. Qdrant Hybrid Cloud offers the same Qdrant functionality on a Kubernetes-based architecture, which further expands the ability of LlamaIndex to support any user in any environment.

Read article →

Anyscale Blog 2023-09-06 00:00 UTC Score 31.0 USR-0085-20230906-ai-specialis-75ca9dbd

codellama A large language model that can use text prompts to generate and discuss code. 7b 13b 34b 70b 5.7M Pulls 199 Tags Updated 1 year ago

Read article →

Ollama Library — Score 33.0 AI-099-nodate-model-datase-feb8a4a4

llama2 Llama 2 is a collection of foundation language models ranging from 7B to 70B parameters. 7b 13b 70b 7.2M Pulls 102 Tags Updated 2 years ago

Read article →

Ollama Library — Score 33.0 AI-099-nodate-model-datase-4d541c0e

llama3 Meta Llama 3: The most capable openly available LLM to date 8b 70b 24.5M Pulls 68 Tags Updated 2 years ago

Read article →

Ollama Library — Score 38.0 AI-099-nodate-model-datase-acf2da58

llama3.2 Meta's Llama 3.2 goes small with 1B and 3B models. tools 1b 3b 74.5M Pulls 63 Tags Updated 1 year ago

Read article →

Ollama Library — Score 43.0 AI-099-nodate-model-datase-27db67ec

llama3.1 Llama 3.1 is a new state-of-the-art model from Meta available in 8B, 70B and 405B parameter sizes. tools 8b 70b 405b 116.6M Pulls 93 Tags Updated 1 year ago

Read article →

LlamaIndex GitHub — Score 10.0 AI-118-nodate-social-media-070cda80

Llama

Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2

Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications

Meta Llama Llama 4 Scout, Llama 3.3 70B, Llama 3.1 8B

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

llama-index-utils

llama-index-integrations

llama-index-instrumentation

llama-index-core

Multi-agent patterns in LlamaIndex

Discover LlamaIndex Video Series

LlamaIndex Framework