AI/ML News & Innovations Hub

AI/ML news, top picks, and generated innovation digests.

Computer Vision

22 articles tagged with this keyword, sorted by most recent first.

Transactions on Machine Learning Research 2026-06-30 00:00 UTC Score 49.0 AI-084-20260630-research-pap-960a167b

Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
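The mirror descent update mentioned in the abstract can be sketched in a few lines. This is only an illustration, assuming the potential is psi(w) = (1/p) * sum_i |w_i|^p with p > 1 (the 1/p scaling is an assumption made here for convenience; the abstract only says the potential is the p-th power of the l_p-norm) and a generic gradient vector rather than the paper's softmax attention loss. The function names are hypothetical.

```python
import math


def mirror_map(w, p):
    """Gradient of the assumed potential psi(w) = (1/p) * sum(|w_i|^p), coordinate-wise."""
    return [math.copysign(abs(x) ** (p - 1), x) for x in w]


def inverse_mirror_map(z, p):
    """Inverse of mirror_map: maps dual coordinates back to the primal space."""
    return [math.copysign(abs(v) ** (1.0 / (p - 1)), v) for v in z]


def md_step(w, grad, lr, p):
    """One mirror descent step: take a gradient step in the dual space, then map back."""
    if p <= 1:
        raise ValueError("this sketch assumes p > 1")
    z = mirror_map(w, p)
    z_next = [zi - lr * gi for zi, gi in zip(z, grad)]
    return inverse_mirror_map(z_next, p)


if __name__ == "__main__":
    # With p = 2 the mirror map is the identity, so this is a plain gradient step.
    print(md_step([1.0, -2.0], [0.5, 0.5], lr=0.1, p=2))
    # With p = 3 the same gradient produces a different step.
    print(md_step([1.0, -2.0], [0.5, 0.5], lr=0.1, p=3))
```

With p = 2 the step reduces to ordinary gradient descent, which is one way to see MD with this potential as a generalization of GD.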

Stack Overflow Machine Learning Tag 2026-06-19 06:38 UTC Score 32.0 AI-112-20260619-social-media-360de22b

Machine learning model test on new dataset [closed]

I built a computer vision model and trained it on a known dataset. Now I want to test the model's performance on my own dataset. How do I do that? I am open to suggestions and best practices, so please give me a detailed workflow. I'd really appreciate your help. Also, feel free to share references, links, YouTube videos, etc. I'd like to mention that my model is a mix of 5 pre-existing models trained on high-standard data. I hope it has learned well enough. Are there other ways of knowing that, apart from standard metric scores?

Artificial Intelligence News 2026-06-18 15:57 UTC Score 26.0 AI-029-20260618-ai-specialis-32d55aee

Computer vision deployments drive retail productivity gains

Computer vision deployments are driving retail productivity gains as operators automate physical shelf tracking to protect eroding margins. This hardware deployment directly addresses the persistent in-store execution failures currently costing the industry billions. A study authored by Coresight Research – in partnership with technology providers Simbe and RELEX Solutions – calculates the exact cost of […]

Roboflow Blog 2026-06-16 16:39 UTC Score 31.0 USR-0088-20260616-ai-specialis-09dd376e

Contact Lens Defect Inspection

Train a Roboflow object detection model, detect defects on each contact lens, and sort results into pass, review, and fail with a Custom Python Block.
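The pass/review/fail sorting step could look roughly like the sketch below. It is plain Python and does not reproduce Roboflow's Custom Python Block interface; the function name, the `confidence` field, and both thresholds are hypothetical choices made for illustration.

```python
def sort_lens(detections, fail_conf=0.7, review_conf=0.3):
    """Return 'fail', 'review', or 'pass' for one lens.

    detections: list of dicts, one per predicted defect, each with a
    'confidence' float in [0, 1]. Thresholds are illustrative assumptions.
    """
    # The most confident defect prediction decides the outcome.
    top = max((d["confidence"] for d in detections), default=0.0)
    if top >= fail_conf:
        return "fail"
    if top >= review_conf:
        return "review"
    return "pass"


if __name__ == "__main__":
    print(sort_lens([]))
    print(sort_lens([{"confidence": 0.5}]))
    print(sort_lens([{"confidence": 0.9}, {"confidence": 0.2}]))
```

Routing on the single most confident defect keeps the rule easy to audit; a real inspection line would likely tune the thresholds per defect type.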

NVIDIA Developer YouTube 2026-06-15 22:00 UTC Score 56.0 AI-144-20260615-podcasts-and-35153e06

Local Agents on Jetson: OpenClaw, NemoClaw, and AI You Can Build Into Daily Life

This session moves from running a local model to running a local autonomous agent. OpenClaw is a fully local AI assistant that runs on Jetson and connects to chat workflows, browser-based tools, and multi-step tasks. NemoClaw extends this with sandboxing, onboarding, inference routing, and policy controls for safer and more structured agent deployments. We'll show what changes when an AI system can take actions, use tools, and run privately on your own hardware: 24/7, at home, on the edge. Use cases include building dynamic browser-based games, prototyping smart computer vision apps, and running long research tasks without a cloud dependency.

You will learn how to move from running a local model to running a fully local autonomous agent on NVIDIA Jetson. We'll cover:
- Building a local assistant with OpenClaw: extend the Episode 1 baseline into a full local assistant architecture that connects to chat workflows, browser-based tools, and multi-step tasks, running privately on your own hardware, 24/7.
- NVIDIA Orin Nano vs. AGX Orin vs. Thor: compare hardware paths side by side so you can make the right choice for your deployment constraints and performance needs.
- Why tool-calling models matter: see what changes when an AI system can take actions, use tools, and run autonomously, and what breaks when your model can't do it reliably.
- Safer local agents with NemoClaw: go further with sandboxing, onboarding, inference routing, and policy controls that make local agent deploymen…

NVIDIA Developer YouTube 2026-06-12 07:06 UTC Score 67.0 AI-144-20260612-podcasts-and-0509f277

Generate Synthetic Data for Physical AI With NVIDIA Brev Launchables and Agent Skills

Join NVIDIA for a live demonstration of how developers can generate synthetic data for physical AI using NVIDIA Brev Launchables and agent skills. Building synthetic data pipelines for robotics, digital twins, and autonomous systems often requires configuring GPU infrastructure, simulation environments, notebooks, and orchestration tools before meaningful work can begin. In this livestream, we'll show how NVIDIA Brev Launchables and agent skills simplify that process by packaging these components into ready-to-run workflows that help developers move from setup to data generation faster.

In this livestream, you'll learn how to:
- Launch preconfigured Physical AI development environments
- Generate synthetic data using AI-powered workflows
- Accelerate robotics, simulation, and digital twin development
- Scale from individual tasks to larger synthetic data pipelines
- Integrate data generation workflows into broader Physical AI ecosystems

Through live, hands-on demonstrations, we'll show how developers can streamline synthetic data creation and reduce the complexity of building Physical AI workflows. Whether you're building robots, training computer vision models, creating digital twins, developing autonomous systems, or exploring Physical AI applications, this session provides a practical introduction to synthetic data generation with NVIDIA Brev Launchables and agent skills.

📓 Resources
Launchable:
- Nurec: https://brev.nvidia.com/launchable/…

Apple Machine Learning Research 2026-05-28 00:00 UTC Score 34.0 AI-059-20260528-official-ai--6b440667

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Apple is presenting new research at the annual IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which takes place in person in Denver at the Colorado Convention Center from June 3 to June 7. We are proud to sponsor the conference, which brings together the scientific and industrial research communities in computer vision and pattern recognition. Below is an overview of Apple’s participation at CVPR 2026.

Apple Machine Learning Research 2026-05-11 00:00 UTC Score 58.0 AI-059-20260511-official-ai--81099b76

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that…

AI Stack Exchange 2024-05-30 20:45 UTC Score 23.0 AI-110-20240530-social-media-e1bfd7d7

Is reinforcement learning suitable for application automation?

I have automated the use of an app through OCR and computer vision: when a word or an image is detected, it performs a certain action. When that action completes successfully, it moves to the next state. Now I want to try a more "heuristic" approach, and I thought about reinforcement learning. Why? Because I am aiming to build a tool that automatically understands what actions to perform in a given state. But I have a doubt. Even though I don't need to declare an association like this (it would defeat the purpose of deep reinforcement learning, or deep learning in general): if(state.MENU_VIEW) clickManager.clickOnFolder(); ... I still need to define the states, the actions and the reward. That means I would need to tell my app that when the OCR result is "Open Folder", the state I am in is MENU_VIEW. I simply wouldn't tell my app what action to perform in that state. Am I correct? What I am trying to ask is: how exactly could I make it so that the states (and maybe also the actions?) are generated automatically? The reward in this scenario would be the folder being opened successfully.
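The setup described in the question can be made concrete with a toy tabular Q-learning sketch. It assumes the states are still hand-defined from OCR text (as the asker suspects) and only the state-to-action mapping is learned. Apart from MENU_VIEW, "Open Folder", and the folder-opened reward, every name here (FOLDER_VIEW, the action strings, the hyperparameters) is hypothetical, and the environment is a stand-in for the real app.

```python
import random

# Hand-written mapping from OCR output to a state label.
OCR_TO_STATE = {"Open Folder": "MENU_VIEW"}
STATES = ["MENU_VIEW", "FOLDER_VIEW"]
ACTIONS = ["click_back", "click_on_folder"]


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    if state == "MENU_VIEW" and action == "click_on_folder":
        return "FOLDER_VIEW", 1.0, True  # folder opened successfully
    return state, 0.0, False  # nothing useful happened


def train(episodes=200, max_steps=10, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q-values with epsilon-greedy tabular Q-learning."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = OCR_TO_STATE["Open Folder"]
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_action(q, state):
    """Action with the highest learned Q-value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


if __name__ == "__main__":
    q_table = train()
    print(greedy_action(q_table, "MENU_VIEW"))
```

No rule like `if(state.MENU_VIEW) clickManager.clickOnFolder()` is written anywhere; the preference for clicking the folder emerges from the reward. The state labels themselves are still supplied by hand in this sketch.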

Chip Huyen Blog 2023-10-10 00:00 UTC Score 53.0 USR-0111-20231010-ai-specialis-f4a68771

Multimodality and Large Multimodal Models (LMMs)

For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and see. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world.

OpenAI noted in their GPT-4V system card that “incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development.” Incorporating additional modalities to LLMs (Large Language Models) creates LMMs (Large Multimodal Models). Not all multimodal systems are LMMs. For example, text-to-image models like Midjourney, Stable Diffusion, and Dall-E are multimodal but don’t have a language model component. Multimodal can mean one or more of the following:
- Input and output are of different modalities (e.g. text-to-image, image-to-text)
- Inputs are multimodal (e.g. a system that can process both text and images)
- Outputs are multimodal (e.g. a system that can generate both text and images)

This post covers multimodal systems in general, including LMMs. It consists of 3 parts. Part 1 covers the context for multimodality, including why multimodal, different data modalities, and types of multimodal tasks. Part 2 discusses the fundamentals of a multimodal system, using the…

Data Science Stack Exchange 2023-05-08 11:11 UTC Score 9.0 AI-111-20230508-social-media-1e8a095f

How does the background class work in object detection?

I am using YOLOv5 for object detection. I understand that any labelled classes that are not predicted, that is, false negatives (FN), show up as background. But how are the false positives (FP) calculated? If the background is not explicitly labelled in the data, how are we calculating the false positives? Please see the following confusion matrix for reference. The last row is "background FN". The last column is "background FP". Image source: https://github.com/ultralytics/yolov5/issues/6738
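One common way to fill the background row and column of a detection confusion matrix is sketched below. It is a simplified illustration, not the Ultralytics implementation: it assumes greedy one-to-one matching by IoU, rows indexed by predicted class and columns by true class (matching the layout described in the question), and it skips confidence filtering. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_confusion_matrix(gt, preds, num_classes, iou_thres=0.5):
    """Build matrix[predicted][true]; the extra last index is 'background'.

    gt and preds are lists of (class_id, box) for a single image.
    """
    bg = num_classes
    matrix = [[0] * (num_classes + 1) for _ in range(num_classes + 1)]
    # All ground-truth/prediction pairs, best overlap first.
    pairs = sorted(
        ((iou(g[1], p[1]), gi, pi) for gi, g in enumerate(gt) for pi, p in enumerate(preds)),
        reverse=True,
    )
    matched_gt, matched_pred = set(), set()
    for score, gi, pi in pairs:
        if score < iou_thres:
            break
        if gi in matched_gt or pi in matched_pred:
            continue
        matched_gt.add(gi)
        matched_pred.add(pi)
        matrix[preds[pi][0]][gt[gi][0]] += 1  # matched: predicted class vs. true class
    for gi, g in enumerate(gt):
        if gi not in matched_gt:
            matrix[bg][g[0]] += 1  # missed object: last row ("background FN")
    for pi, p in enumerate(preds):
        if pi not in matched_pred:
            matrix[p[0]][bg] += 1  # detection with no ground truth: last column ("background FP")
    return matrix


if __name__ == "__main__":
    gt = [(0, (0, 0, 10, 10)), (1, (20, 20, 30, 30))]
    preds = [(0, (1, 1, 10, 10)), (1, (50, 50, 60, 60))]
    for row in detection_confusion_matrix(gt, preds, num_classes=2):
        print(row)
```

Under these assumptions, background never needs a label: a prediction counts as a background FP simply because no ground-truth box overlaps it above the IoU threshold.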

Lilian Weng Blog 2022-06-09 22:10 UTC Score 31.0 USR-0112-20220609-ai-specialis-2cce1820

Generalized Visual Language Models

Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given a large amount of existing literature, in this post, I would like to only focus on one approach for solving vision language tasks, which is to extend pre-trained generalized language models to be capable of consuming visual signals.

Stanford AI Lab Blog 2021-10-08 07:00 UTC Score 41.0 USR-0006-20211008-research-aca-b4d49fa6

Stanford AI Lab Papers at ICCV 2021

The International Conference on Computer Vision (ICCV 2021) will be hosted virtually next week. We’re excited to share all the work from SAIL that will be presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition
Authors: Mars Huang
Contact: mschuang@stanford.edu
Keywords: medical image, self-supervised learning, multimodal fusion

3D Shape Generation and Completion Through Point-Voxel Diffusion
Authors: Linqi Zhou, Yilun Du, Jiajun Wu
Contact: linqizhou@stanford.edu
Links: Paper | Video | Website
Keywords: diffusion, shape generation

CAPTRA: CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds
Authors: Yijia Weng*, He Wang*, Qiang Zhou, Yuzhe Qin, Yueqi Duan, Qingnan Fan, Baoquan Chen, Hao Su, Leonidas J. Guibas
Contact: yijiaw@stanford.edu
Award nominations: Oral Presentation
Links: Paper | Video | Website
Keywords: category-level object pose tracking, articulated objects

Detecting Human-Object Relationships in Videos
Authors: Jingwei Ji, Rishi Desai, Juan Carlos Niebles
Contact: jingweij@cs.stanford.edu
Links: Paper
Keywords: human-object relationships, video, detection, transformer, spatio-temporal reasoning

Geography-Aware Self-Supervised Learning
Authors: Kumar Ayush, Bura…

Jay Alammar Blog 2020-12-17 00:00 UTC Score 39.0 USR-0113-20201217-ai-specialis-fb351fb3

Interfaces for Explaining Transformer Language Models

Interfaces for exploring transformer language models by looking at input saliency and neuron activation.

Explorable #1: Input saliency of a list of countries generated by a language model.
Explorable #2: Neuron activation analysis reveals four groups of neurons, each associated with generating a certain type of token.

The Transformer architecture has been powering a number of the recent advances in NLP. A breakdown of this architecture is provided here. Pre-trained language models based on the architecture, in both its auto-regressive (models that use their own output as input to next time-steps and that process tokens from left-to-right, like GPT2) and denoising (models trained by corrupting/masking the input and that process tokens bidirectionally, like BERT) variants, continue to push the envelope in various tasks in NLP and, more recently, in computer vision. Our understanding of why these models work so well, however, still lags behind these developments.

This exposition series continues the pursuit to interpret and visualize the inner workings of transformer-based language models. We illustrate how some key interpretability methods apply to transformer-based language models. This article focuses on auto-regressive models, but these methods are applicable to other architectures and tasks as well. This is the first article in the series. In it, we present explo…

Andrej Karpathy Blog 2016-09-07 11:00 UTC Score 36.0 USR-0115-20160907-ai-specialis-85602144

A Survival Guide to a PhD

This guide is patterned after my “Doing well in your courses”, a post I wrote a long time ago on some of the tips/tricks I’ve developed during my undergrad. I’ve received nice comments about that guide, so in the same spirit, now that my PhD has come to an end I wanted to compile a similar retrospective document in hopes that it might be helpful to some. Unlike the undergraduate guide, this one was much more difficult to write because there is significantly more variation in how one can traverse the PhD experience. Therefore, many things are likely contentious and a good fraction will be specific to what I’m familiar with (Computer Science / Machine Learning / Computer Vision research). But disclaimers are boring, let’s get to it!

Preliminaries

First, should you want to get a PhD? I was in a fortunate position of knowing since a young age that I really wanted a PhD. Unfortunately it wasn’t for any very well-thought-through considerations: First, I really liked school and learning things and I wanted to learn as much as possible, and second, I really wanted to be like Gordon Freeman from the game Half-Life (who has a PhD from MIT in theoretical physics). I loved that game. But what if you’re more sensible in making your life’s decisions? Should you want to do a PhD? There’s a very nice Quora thread and in the summary of considerations that follows I’ll borrow/restate several from Justin/Ben/others there. I’ll assume that the second option you are considering is joining a medium…