AI/ML News & Innovations Hub

AI/ML news, top picks, and generated innovation digests.

422 Sources
5100 News Items
8 Top Picks
43 Blogs
Last Run: running

Multimodal AI

66 articles tagged with this keyword, sorted by most recent first.

AWS Machine Learning Blog 2026-06-29 17:52 UTC Score 66.0 AI-057-20260629-official-ai--a55a80cd

Pair Nova 2 Lite with Claude for cost-optimized document processing

In this post, we show how pairing Amazon Nova 2 Lite with Anthropic’s Claude Sonnet 4.6 delivers an efficient solution for digitizing scanned documents at scale. We built a two-model pipeline on Amazon Bedrock for digitizing scanned yearbook pages. Amazon Nova 2 Lite handles native multimodal extraction in a single call: detecting photos, extracting visible names with coordinates, and returning page-level metadata. Claude Sonnet 4.6 then performs spatial reasoning to match names to faces based on page layout.

Synced 2026-06-27 23:13 UTC Score 46.0 AI-041-20260627-ai-specialis-fe605136

Comment on Unveiling Sora: OpenAI’s Breakthrough in Text-to-Video Generation by Seedream 5.0 pro

The part that stands out in current text-to-video workflows is how much time still goes into preparing visual references before a clip is generated. Seedream 5.0 pro looks relevant for teams comparing prompt-to-image and previsualization steps because it gives creators a faster way to draft images before moving into video or campaign production. I would usually test it on the product site first, then compare the output with the assets needed for a short launch page or storyboard: https://seedream50pro.com/

KDnuggets 2026-06-24 10:00 UTC Score 48.0 AI-033-20260624-ai-specialis-15fbad34

Top 7 Coding Models You Can Run Locally in 2026

Explore the best local coding models for private AI coding, fast GGUF inference, agentic workflows, multimodal development, and running powerful open models on your own GPU.

NVIDIA Developer YouTube 2026-06-24 07:02 UTC Score 77.0 AI-144-20260624-podcasts-and-1a7a6306

Nemotron Office Hours: The Nemotron 3 Model Family | Nemotron Labs

NVIDIA has released the full Nemotron 3 open model family: Ultra, Super, Nano, and Nano Omni. This office hours session covers each model in the series, and any questions you have about Nemotron 3 in general: what it's built for, when to use it, and what's available in open weights, training datasets, and fine-tuning recipes.

What we'll cover:
- Nemotron 3 Ultra: 550B MoE frontier reasoning model for long-running autonomous agents, with 5x faster inference, up to 30% lower cost, hybrid Mamba-Transformer architecture, and MOPD training for consistent performance across agent harnesses
- Nemotron 3 Super: mid-range 120B model targeting enterprise applications that need strong reasoning for multi-agent applications
- Nemotron 3 Nano: 30B MoE with 3B active parameters, built for high-volume execution, highly accurate sub-agent accomplishing targeted tasks
- Nemotron 3 Nano Omni: multimodal (text, image, audio, video) model purpose-built for targeted specialized agentic tasks
- Open weights, training datasets, and fine-tuning recipes: what's available across the family and how to customize for your domain

Building with or evaluating the Nemotron 3 family? Bring your questions. Whether you're choosing between models, fine-tuning for your domain, or deploying at scale, the team will answer them live.

Artificial Intelligence News 2026-06-23 16:32 UTC Score 47.0 AI-029-20260623-ai-specialis-426cbb80

Omio scales travel product development using OpenAI models

Omio integrates OpenAI models across its engineering operations to accelerate travel product development and launch booking interfaces. The multimodal travel platform coordinates operations with over 3,000 transportation providers across 47 countries. Omio explicitly rejects the superficial addition of technology to outdated internal processes. The company’s CTO, Tomas Vocetka, requires all internal functions to completely redesign […]

AWS Machine Learning Blog 2026-06-22 16:32 UTC Score 56.0 AI-057-20260622-official-ai--ffd939d5

Embed the world: Multimodal AI for searchable aerial imagery at scale

In this post, we walk through the problem space, our architecture on Amazon Bedrock and Amazon OpenSearch Serverless, the evaluation methodology we built on OpenStreetMap ground truth, four experiments that compared embedding models, fusion strategies, captioning, and search methods, and the practical guidance you can apply when building a similar system. You’ll learn which design choices move the needle for geospatial semantic search, including why Amazon Nova Multimodal Embeddings delivered the highest F1 scores across both benchmark queries in our evaluation. The work described here evolved into Vexcel Intelligence, a searchable imagery product.

Stack Overflow Machine Learning Tag 2026-06-19 13:42 UTC Score 23.0 AI-112-20260619-social-media-9edf6f48

How to update dynamic user embeddings with negative ratings in 768-d space without causing vector drift?

I am building a production-grade recommendation system for a short-video platform (processing around 50k videos). The architecture utilizes a vector database (Qdrant) to store and query 768-dimensional video embeddings generated by a VideoCLIP model. To track user preferences in real-time, I implement an online learning mechanism that updates a single user_vector iteratively after each interaction based on a computed rating (bounded between [-1.0, 1.0] via tanh).

The Goal & The Problem

I want my system to actively update the user vector on both positive and negative signals. Initially, I tried a standard linear combination:

updated_vector = (alpha * u) + (beta * v * rating)

Where u is the current user vector, v is the video vector, alpha is the decay, and beta is the learning rate. However, when a user gives a negative rating (e.g., -0.8), multiplying the VideoCLIP embedding by a negative scalar flips its direction entirely. In a 768-d multimodal space, adding this inverted vector creates massive noise across unrelated dimensions, causing aggressive vector drift instead of just moving away from that specific topic. On the other hand, simply clamping negative ratings to 0 fixes the geometry but creates a severe feedback loop/frozen vector issue where the profile stops evolving during consecutive negative interactions.

What I Want to Achieve

I need a mathematically sound way to update the vector during negative interactions so that the user profile actively flees from the…
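To make the geometry in the question concrete, here is a minimal pure-Python sketch. `linear_update` is the formula quoted in the question. `rejection_update` is a hypothetical alternative, not something the question proposes and not a recommended answer: for negative ratings it only shrinks the part of `u` that already points along `v`, so a disliked video that is unrelated to the profile adds nothing. The function names, the default `alpha` and `beta`, and the 3-d toy vectors are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    """Return a unit-length copy of a (or a plain copy if a is all zeros)."""
    n = math.sqrt(dot(a, a))
    return [x / n for x in a] if n > 0 else list(a)


def linear_update(u, v, rating, alpha=0.9, beta=0.1):
    """The update from the question: alpha * u + beta * rating * v."""
    return [alpha * ui + beta * rating * vi for ui, vi in zip(u, v)]


def rejection_update(u, v, rating, alpha=0.9, beta=0.1):
    """Hypothetical variant: on a negative rating, remove a fraction of
    u's positive component along v instead of adding the flipped v."""
    if rating >= 0:
        return linear_update(u, v, rating, alpha, beta)
    v_hat = normalize(v)
    along = dot(u, v_hat)                      # how much u already points at v
    shrink = min(1.0, beta * abs(rating)) * max(along, 0.0)
    return [alpha * (ui - shrink * vhi) for ui, vhi in zip(u, v_hat)]


# Toy case: the disliked video is orthogonal to the user's profile.
u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]

naive = linear_update(u, v, rating=-0.8)       # gains a negative 2nd component
bounded = rejection_update(u, v, rating=-0.8)  # only the decay is applied
```

In the toy case the linear rule pushes the profile into a dimension the user never expressed a preference in, which is the drift described above. The variant leaves that dimension at zero, but it also stops reacting once the profile is no longer aligned with the disliked item, so it does not by itself make the profile flee from the topic.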

DeepLearning.AI YouTube 2026-06-17 15:00 UTC Score 51.0 AI-138-20260617-podcasts-and-73e9c00c

Voice for AI Agents and Applications

Learn more: https://bit.ly/4vPQ3HE Voice is one of the most natural human interfaces, but adding it to AI applications has historically forced a tradeoff: fast voice-to-voice models that sacrifice reliability, or accurate speech-to-text-to-LLM-to-speech pipelines that add latency. This course teaches you how to get both, using Vocal Bridge's architecture that pairs a real-time foreground agent with a reasoning background agent. Taught by Ashwyn Sharma, CEO and Co-Founder of Vocal Bridge (an AI Fund portfolio company), this course covers three practical integration patterns that meet you where you are: voice embedded in an application, voice layered onto an existing agent without touching its logic, and voice as a tool your LLM can call when it decides a conversation is the right modality. In detail, you'll survey the traditional voice stack and its tradeoffs, then explore three live integration patterns to understand when each one applies. Build a voice-interactive tic-tac-toe game where voice commands and mouse clicks work together over a single synchronized channel, then add a voice layer to an existing agent with minimal code, leaving your prompts, RAG pipeline, and tools untouched. Give your agent a make_phone_call tool so it can dial a real number, hold a conversation with a demo agent, and stream the transcript back live. Set up evaluation-driven development using Vocal Bridge's multimodal evaluator to score calls, catch regressions, and refine prompts before issues re…

Roboflow Blog 2026-06-16 18:38 UTC Score 39.0 USR-0088-20260616-ai-specialis-d1799a2f

Automated Tire Sidewall OCR

Automate tire sidewall OCR to extract DOT codes, sizes, and brands. Learn to combine RF-DETR and multimodal LLMs into a Roboflow Vision Agent.

Analytics Vidhya 2026-06-12 07:30 UTC Score 35.0 AI-034-20260612-ai-specialis-c15b6022

Gemini Omni: AI Video Generation Inside Gemini

Gemini models have always kept up with AI advancements. From text-based chatbots in 2023, Gemini has evolved into a multimodal system capable of understanding and generating text, audio, images… and now videos. AI video generation is no longer a standalone tool. With Gemini Omni, video creation becomes mainstream. Gemini Omni isn’t important because it generates […]

Stack Overflow Machine Learning Tag 2026-06-10 06:41 UTC Score 26.0 AI-112-20260610-social-media-cffb11ce

Will an 80 GB GPU and a 48 GB GPU give identical results on an open source text-to-video model for the same quantization and seed?

I am considering buying GPUs for a project that runs open source text-to-video models like ltx-2-19b (Lightricks) or wan-v2.2-a14b. I read online that the same configuration/quantization and seed will give similar results in quality, and that the only difference is in the speed/latency of generation. Is this true, or will there be a difference?

Amazon Science AI 2026-06-05 15:58 UTC Score 62.0 AI-058-20260605-official-ai--c8931f7d

Replication as learning: Scalable knowledge distillation for multimodal enterprise agents

Enterprise environments differ fundamentally from the clean settings assumed in LLM research: knowledge is distributed across heterogeneous sources, often incomplete or inconsistent, and key procedural logic is implicitly encoded in artifacts rather than explicitly documented. In such settings, retrieval-based approaches are insufficient, as no single source contains the full workflow. We propose a replication-driven knowledge distillation framework for scalable learning in multimodal agents. The agent learns by reverse-engineering validated artifacts (e.g., Excel workbooks), reconstructing the underlying data pipeline, and distilling the inferred logic into structured knowledge (claims, procedures, and domain patterns). This enables synthesis and validation across noisy sources and supports reuse in future tasks. We evaluate on 120 simulated enterprise environments with multimodal inputs (SQL, spreadsheets, documentation, messaging app, emails, images, PDFs, CSV) and controlled noise. Our method consistently outperforms retrieval-based baselines on both task execution and conceptual understanding, and remains robust under environmental drift.

Gradient Flow 2026-06-03 12:59 UTC Score 42.0 USR-0119-20260603-ai-specialis-1499c60f

Your Enterprise Data Deserves Better Than a Chatbot

Large language models and their multimodal variants remain the foundation models most people encounter first. That makes sense. Text, images, audio, and video cover a huge range of knowledge-work tasks, and today’s chatbots are far more capable than the text-only systems many people first tried. But enterprise AI does not run on chat alone. It […]

Gradient Flow 2026-06-02 13:00 UTC Score 42.0 USR-0119-20260602-ai-specialis-4c47e97e

The smartest AI teams are moving past chatbots

Your Enterprise Data Deserves Better Than a Chatbot
Large language models and their multimodal variants remain the foundation models most people encounter first. That makes sense. Text, images, audio, and video cover a huge range of knowledge-work tasks, and today’s chatbots are far more capable than the text-only systems many people first tried. […]

DeepLearning.AI YouTube 2026-05-22 19:12 UTC Score 20.0 AI-138-20260522-podcasts-and-a65d0753

Semantic Search Starts With Embeddings

“Budget” and “financials” are different words, but embeddings understand they’re related. That’s the foundation behind semantic search and one of the core building blocks of modern multimodal systems. Learn how embeddings power retrieval across text, audio, images, and video in Building Multimodal Data Pipelines: https://hubs.la/Q04hJ9w10

Two Minute Papers 2026-05-13 16:07 UTC Score 47.0 AI-139-20260513-podcasts-and-156232e5

NVIDIA New AI Is An Efficiency Monster

📝 The paper is available here: https://arxiv.org/abs/2604.24954
https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/
https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

Apple Machine Learning Research 2026-05-11 00:00 UTC Score 58.0 AI-059-20260511-official-ai--81099b76

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that…

TWIML AI Podcast 2026-04-30 20:21 UTC Score 56.0 AI-148-20260430-podcasts-and-779fdbb8

How to Engineer AI Inference Systems with Philip Kiely - #766

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference—batching, quantization, speculation, and KV cache reuse—lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most. The complete show notes for this episode can be found at https://twimlai.com/go/766.

LanceDB Blog 2026-04-10 07:25 UTC Score 35.0 USR-0078-20260410-ai-specialis-d9f761e7

What is the LanceDB Multimodal Lakehouse?

Introducing the Multimodal Lakehouse - a unified platform for managing AI data from raw files to production-ready features, now part of LanceDB Enterprise.

Weaviate Blog 2026-04-01 00:00 UTC Score 36.0 USR-0073-20260401-ai-specialis-1ac34032

Multimodal Embeddings and RAG: A Practical Guide

Multimodal embeddings allow AI systems to search and reason across text, images, audio, and video in their native formats. This blog covers the key intuitions behind how this all works and walks through three practical implementations using Weaviate and Gemini.

TWIML AI Podcast 2026-03-26 22:35 UTC Score 51.0 AI-148-20260326-podcasts-and-02c16b3f

The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

Today, we're joined by Stefano Ermon, associate professor at Stanford University and CEO of Inception Labs to discuss diffusion language models. We dig into how diffusion approaches—traditionally used for images—are being adapted for text and code generation, the technical challenges of applying continuous methods to discrete token spaces, and how diffusion models compare to traditional autoregressive LLMs. Stefano introduces Mercury 2, a commercial-scale diffusion LLM that can generate multiple tokens simultaneously and achieve inference speeds 5-10x faster than small frontier models, paving the way for latency-sensitive applications like voice interactions and fast agentic loops. We also cover the open research challenges in diffusion LLM training, serving infrastructure requirements, and post-training for diffusion-based systems. Finally, Stefano shares his perspective on whether diffusion models can rival or surpass autoregressive LLMs at scale, the advantages for highly controllable generation, and what the future of multimodal diffusion models might look like. The complete show notes for this episode can be found at https://twimlai.com/go/764.

MongoDB AI Blog 2026-01-15 20:15 UTC Score 82.0 USR-0070-20260115-ai-specialis-0045c0cd Top pick

MongoDB.local San Francisco 2026: Ship Production AI, Faster

Today at MongoDB.local San Francisco, we announced capabilities that collapse the distance between AI prototype and production. Building AI applications means solving real problems: keeping conversational context clean and queryable, retrieving the right information from thousands of past interactions, connecting AI agents to your data without custom plumbing. These aren't theoretical challenges, they're the friction points that slow teams down every day. The AI era demands more from your data platform. MongoDB gives you everything you need to build quickly.

Voyage AI: the best gets better

Embedding models can make or break AI search experiences. We're proud that voyage-3-large has been the world's top-performing embedding model on Hugging Face's RTEB benchmark since its inception. But we didn’t rest on our laurels. There’s a new model at the top of the charts. Today, we're pleased to announce that the Voyage 4 model family is now generally available. The best just got better.

The voyage-4 series models operate in a shared embedding space, allowing for cross-model compatibility and unprecedented flexibility to optimize for accuracy, speed, or cost. This release also includes voyage-4-nano, our first open-weight model available on HuggingFace, perfect for local development. Additionally, we're launching the new voyage-multimodal-3.5 model, which has been specifically trained to support video content alongside text and images. For developers building multimodal AI applications…

MongoDB AI Blog 2026-01-12 16:00 UTC Score 52.0 USR-0070-20260112-ai-specialis-c3dd5859

Vision RAG: Enabling Search on Any Documents

Information comes in many shapes and forms. While retrieval-augmented generation (RAG) primarily focuses on plain text, it overlooks vast amounts of data along the way. Most enterprise knowledge resides in complex documents, slides, graphics, and other multimodal sources. Yet, extracting useful information from these formats using optical character recognition (OCR) or other parsing techniques is often low-fidelity, brittle, and expensive. Vision RAG makes complex documents, including their figures and tables, searchable by using multimodal embeddings, eliminating the need for complex and costly text extraction. This guide explores how Voyage AI’s latest model powers this capability and provides a step-by-step implementation walkthrough.

Vision RAG: Building upon text RAG

Vision RAG is an evolution of traditional RAG built on the same two components: retrieval and generation. In traditional RAG, unstructured text data is indexed for semantic search. At query time, the system retrieves relevant documents or chunks and appends them to the user’s prompt so the large language model (LLM) can produce more grounded, context-aware answers.

Figure 1. Text RAG with Voyage AI and MongoDB.

Enterprise data, however, is rarely just clean plain text. Critical information often lives in PDFs, slides, diagrams, dashboards, and other visual formats. Today, this is typically handled by parsing tools and OCR services. Those approaches create several problems:…
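The retrieve-then-append loop described for traditional RAG can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the guide's implementation: the 3-d embeddings are hand-written toy vectors standing in for the output of an embedding model, the in-memory `INDEX` list stands in for a vector store, and `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names. In Vision RAG the indexed items would be document images embedded with a multimodal model rather than text chunks, but the ranking step has the same shape.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Append the retrieved chunks to the user's question as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy index of (chunk, embedding) pairs; the vectors are made up for illustration.
INDEX = [
    ("Q3 revenue grew 12%.", [0.9, 0.1, 0.0]),
    ("The office closes at 6 pm.", [0.0, 0.2, 0.9]),
    ("Operating costs fell 4%.", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # assumed embedding of the question below
chunks = retrieve(query_vec, INDEX, k=2)
prompt = build_prompt("How did finances change?", chunks)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the unrelated chunk about office hours is ranked last and left out.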

Practical AI Podcast 2026-01-09 20:08 UTC Score 42.0 AI-143-20260109-podcasts-and-59f43d07

2025 was the year of agents, what's coming in 2026?

In this start-of-year FC episode, Chris and Daniel break down what really mattered in AI in 2025, and what to expect in 2026. They explore the rise of AI agents, the practical reality of multimodal AI, and how reasoning models are reshaping workflows. The conversation dives into infrastructure and energy constraints, the continued value of predictive models, and why orchestration (not just better models) is becoming the defining skill for AI teams. The episode wraps with grounded 2026 predictions on where AI systems, tooling, and builders are headed next. Featuring: Chris Benson and Daniel Whitenack.

TWIML AI Podcast 2025-12-09 19:46 UTC Score 51.0 AI-148-20251209-podcasts-and-5b69421e

Why Vision Language Models Ignore What They See with Munawar Hayat - #758

In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment. The complete show notes for this episode can be found at https://twimlai.com/go/758.

TWIML AI Podcast 2025-10-28 20:26 UTC Score 56.0 AI-148-20251028-podcasts-and-240f74bd

High-Efficiency Diffusion Models for On-Device Image Generation and Editing with Hung Bui - #753

In this episode, Hung Bui, Technology Vice President at Qualcomm, joins us to explore the latest high-efficiency techniques for running generative AI, particularly diffusion models, on-device. We dive deep into the technical challenges of deploying these models, which are powerful but computationally expensive due to their iterative sampling process. Hung details his team's work on SwiftBrush and SwiftEdit, which enable high-quality text-to-image generation and editing in a single inference step. He explains their novel distillation framework, where a multi-step teacher model guides the training of an efficient, single-step student model. We explore the architecture and training, including the use of a secondary 'coach' network that aligns the student's denoising function with the teacher's, allowing the model to bypass the iterative process entirely. Finally, we discuss how these efficiency breakthroughs pave the way for personalized on-device agents and the challenges of running reasoning models with techniques like inference-time scaling under a fixed compute budget. The complete show notes for this episode can be found at https://twimlai.com/go/753.

Vector Institute News 2025-08-08 18:08 UTC Score 39.0 USR-0017-20250808-research-aca-d70d8490

When AI Meets Human Matters: Evaluating Multimodal Models Through a Human-Centred Lens – Introducing HumaniBench

By Shaina Raza and Veronica Chatrath. AI models are rapidly becoming bigger, faster, and more capable at understanding images and text together. However, while accuracy and speed are often celebrated, […]

The Gradient 2025-06-04 14:00 UTC Score 25.0 AI-037-20250604-ai-specialis-6895a2b0

AGI Is Not Multimodal

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human

Berkeley AI Research Blog 2025-04-08 10:30 UTC Score 39.0 USR-0004-20250408-research-aca-ec075507

Repurposing Protein Folding Models for Generation with Latent Diffusion

PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure, by learning the latent space of protein folding models. The awarding of the 2024 Nobel Prize to AlphaFold2 marks an important moment of recognition for the role of AI in biology. What comes next after protein folding? In PLAID, we develop a method that learns to sample from the latent space of protein folding models to generate new proteins. It can accept compositional function and organism prompts, and can be trained on sequence databases, which are 2-4 orders of magnitude larger than structure databases. Unlike many previous protein structure generative models, PLAID addresses the multimodal co-generation problem setting: simultaneously generating both discrete sequence and continuous all-atom structural coordinates.

From structure prediction to real-world drug design

Though recent works demonstrate promise for the ability of diffusion models to generate proteins, there still exist limitations of previous models that make them impractical for real-world applications, such as:
- All-atom generation: Many existing generative models only produce the backbone atoms. To produce the all-atom structure and place the sidechain atoms, we need to know the sequence. This creates a multimodal generation problem that requires simultaneous generation of discrete and continuous modalities.
- Organism specificity: Protein biologics intended for human use need to be humanized, to a…

TOPBOTS 2024-11-25 14:05 UTC Score 37.0 AI-043-20241125-ai-specialis-2c2ac547

Advancing AI in 2024: Highlights from 10 Groundbreaking Research Papers

In this article, we delve into ten groundbreaking research papers that expand the frontiers of AI across diverse domains, including large language models, multimodal processing, video generation and editing, and the creation of interactive environments.

Chip Huyen Blog 2023-10-10 00:00 UTC Score 53.0 USR-0111-20231010-ai-specialis-f4a68771

Multimodality and Large Multimodal Models (LMMs)

For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and see. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world. OpenAI noted in their GPT-4V system card that “incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development.” Incorporating additional modalities to LLMs (Large Language Models) creates LMMs (Large Multimodal Models). Not all multimodal systems are LMMs. For example, text-to-image models like Midjourney, Stable Diffusion, and Dall-E are multimodal but don’t have a language model component.

Multimodal can mean one or more of the following:
- Input and output are of different modalities (e.g. text-to-image, image-to-text)
- Inputs are multimodal (e.g. a system that can process both text and images)
- Outputs are multimodal (e.g. a system that can generate both text and images)

This post covers multimodal systems in general, including LMMs. It consists of 3 parts. Part 1 covers the context for multimodality, including why multimodal, different data modalities, and types of multimodal tasks. Part 2 discusses the fundamentals of a multimodal system, using the…

Chip Huyen Blog 2023-08-16 00:00 UTC Score 50.0 USR-0111-20230816-ai-specialis-06d67c0f

Open challenges in LLM research

Never before in my life had I seen so many smart people working on the same goal: making LLMs better. After talking to many people working in both industry and academia, I noticed the 10 major research directions that emerged. The first two directions, hallucinations and context learning, are probably the most talked about today. I’m the most excited about numbers 3 (multimodality), 5 (new architecture), and 6 (GPU alternatives).

1. Reduce and measure hallucinations

Hallucination is a heavily discussed topic already so I’ll be quick. Hallucination happens when an AI model makes stuff up. For many creative use cases, hallucination is a feature. However, for most other use cases, hallucination is a bug. I was at a panel on LLM with Dropbox, Langchain, Elastics, and Anthropic recently, and the #1 roadblock they see for companies to adopt LLMs in production is hallucination. Mitigating hallucination and developing metrics to measure hallucination is a blossoming research topic, and I’ve seen many startups focus on this problem. There are also ad-hoc tips to reduce hallucination, such as adding more context to the prompt, chain-of-thought, self-consistency, or asking your model to be concise in its response.

To learn more about hallucination:
- Survey of Hallucination in Natural Language Generation (Ji et al., 2022)
- How Language Model Hallucinations Can Snowball (Zhang et al., 2023)
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT…

Lilian Weng Blog 2023-03-15 00:00 UTC Score 37.0 USR-0112-20230315-ai-specialis-c01a9c77

Prompt Engineering

Prompt Engineering, also known as In-Context Prompting, refers to methods for communicating with an LLM to steer its behavior toward desired outcomes without updating the model weights. It is an empirical science, and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics. This post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models. At its core, the goal of prompt engineering is about alignment and model steerability. Check my previous post on controllable text generation.

Lilian Weng Blog 2022-06-09 22:10 UTC Score 31.0 USR-0112-20220609-ai-specialis-2cce1820

Generalized Visual Language Models

Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given the large amount of existing literature, in this post I would like to focus on only one approach to solving vision language tasks, which is to extend pre-trained generalized language models to be capable of consuming visual signals.

Stanford AI Lab Blog 2021-10-08 07:00 UTC Score 41.0 USR-0006-20211008-research-aca-b4d49fa6

Stanford AI Lab Papers at ICCV 2021

The International Conference on Computer Vision (ICCV 2021) will be hosted virtually next week. We’re excited to share all the work from SAIL that will be presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition
Authors: Mars Huang
Contact: mschuang@stanford.edu
Keywords: medical image, self-supervised learning, multimodal fusion

3D Shape Generation and Completion Through Point-Voxel Diffusion
Authors: Linqi Zhou, Yilun Du, Jiajun Wu
Contact: linqizhou@stanford.edu
Links: Paper | Video | Website
Keywords: diffusion, shape generation

CAPTRA: CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds
Authors: Yijia Weng*, He Wang*, Qiang Zhou, Yuzhe Qin, Yueqi Duan, Qingnan Fan, Baoquan Chen, Hao Su, Leonidas J. Guibas
Contact: yijiaw@stanford.edu
Award nominations: Oral Presentation
Links: Paper | Video | Website
Keywords: category-level object pose tracking, articulated objects

Detecting Human-Object Relationships in Videos
Authors: Jingwei Ji, Rishi Desai, Juan Carlos Niebles
Contact: jingweij@cs.stanford.edu
Links: Paper
Keywords: human-object relationships, video, detection, transformer, spatio-temporal reasoning

Geography-Aware Self-Supervised Learning
Authors: Kumar Ayush, Bura…