AI/ML News & Innovations Hub

AI/ML news, top picks, and generated innovation digests.


NLP

17 articles tagged with this keyword, sorted by most recent first.

Transactions on Machine Learning Research 2026-06-30 00:00 UTC Score 49.0 AI-084-20260630-research-pap-960a167b

Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
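The mirror descent update this abstract refers to can be sketched for a plain parameter vector. This is not the paper's code: it assumes the potential is scaled as $\psi(w) = \frac{1}{p}\sum_i |w_i|^p$ with $p > 1$, and the function name `lp_mirror_descent_step` is hypothetical. The step maps the weights to the dual space through the gradient of the potential, takes an ordinary gradient step there, and maps back.

```python
import math

def lp_mirror_descent_step(w, grad, lr, p):
    """One mirror descent step with the assumed potential
    psi(w) = (1/p) * sum(|w_i|**p), for p > 1 (hypothetical helper)."""
    new_w = []
    for w_i, g_i in zip(w, grad):
        # Mirror map: z = d psi / d w_i = sign(w_i) * |w_i|**(p - 1)
        z = math.copysign(abs(w_i) ** (p - 1), w_i)
        # Ordinary gradient step, taken in the dual space
        z -= lr * g_i
        # Inverse mirror map: w_i = sign(z) * |z|**(1 / (p - 1))
        new_w.append(math.copysign(abs(z) ** (1.0 / (p - 1)), z))
    return new_w

# With p = 2 the mirror map is the identity, so the step is plain gradient descent.
gd_like = lp_mirror_descent_step([1.0, -2.0], [0.5, 0.5], lr=0.1, p=2)
```

In the paper's setting the vector would hold attention parameters and the gradient would come from the classification loss; here both are left as generic inputs.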

Transactions on Machine Learning Research 2026-06-30 00:00 UTC Score 40.0 AI-084-20260630-research-pap-bedebbba

Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the Hölder continuous function class $\mathcal{H}_{Q}^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality. These Transformers consist of one self-attention layer with one head and the softmax function as the activation function, along with several feedforward layers. For example, to achieve an approximation accuracy of $\epsilon$, if the activation functions of the feedforward layers in the Transformer are ReLU and floor, only $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ feedforward layers are needed, with widths of these layers not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$. If other activation functions are allowed in the feedforward layers, the width of the feedforward layers can be further reduced to a constant. These results demonstrate that Transformers have a strong expressive capability. The construction in this paper is based on the Kolmogorov-Arnold Superposition Theorem and does not require the concept of contextual mapping; hence our proof is more intuitive than those in previous Transformer approximation works. Additionally, the translation technique proposed in this paper helps to apply the previous approximation results of fe…

Machine Learning Mastery 2026-06-16 12:00 UTC Score 27.0 AI-039-20260616-ai-specialis-e9483392

Building an End-to-End Sentiment Analysis Pipeline with Scikit-LLM

Traditional machine learning pipelines for predictive tasks like text classification usually rely on extracting structured, numerical features from raw text — for instance, TF-IDF frequencies or token embeddings — to feed into classical models such as logistic regression, ensembles, or support vector machines.
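As a rough illustration of the feature-extraction step mentioned in this excerpt, the sketch below computes TF-IDF weights in plain Python. It uses the unsmoothed textbook form, term frequency times $\log(N / \mathrm{df})$; library implementations typically add smoothing and normalization, so their numbers will differ. The function name `tfidf_features` and the two toy reviews are made up for the example.

```python
import math
from collections import Counter

def tfidf_features(docs):
    """Return (vocabulary, rows); each row holds unsmoothed TF-IDF weights
    for one document (hypothetical helper, whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        rows.append([
            (counts[term] / total) * math.log(n_docs / df[term]) if total else 0.0
            for term in vocab
        ])
    return vocab, rows

# Toy, made-up reviews
reviews = ["great phone great battery", "terrible battery"]
vocab, X = tfidf_features(reviews)
```

Each row of `X` is the kind of numerical feature vector that a classical model such as logistic regression would then be trained on. Terms that appear in every document, like "battery" here, receive zero weight.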

Machine Learning Mastery 2026-06-11 12:00 UTC Score 18.0 AI-039-20260611-ai-specialis-824e0fa0

Multi-Label Text Classification with Scikit-LLM

Text classification typically boils down to scenarios where a product review is "positive" or "negative", or a customer inquiry belongs to one category or another.

Stanford AI Lab Blog 2022-05-31 07:00 UTC Score 47.0 USR-0006-20220531-research-aca-a57ebba7

LinkBERT: Improving Language Model Training with Document Link

Language Model Pretraining. Language models (LMs), like BERT [1] and the GPT series [2], achieve remarkable performance on many natural language processing (NLP) tasks. They are now the foundation of today’s NLP systems [3]. These models serve important roles in products and tools that we use every day, such as search engines like Google [4] and personal assistants like Alexa [5]. These LMs are powerful because they can be pretrained via self-supervised learning on massive amounts of text data on the web without the need for labels, after which the pretrained models can be quickly adapted to a wide range of new tasks without much task-specific finetuning. For instance, BERT is pretrained to predict randomly masked words in original text (masked language modeling), e.g. predicting the masked word “dog” from “My __ is fetching the ball”. GPTs are pretrained to predict the next word given a previous sequence of text (causal language modeling), e.g. predicting the next word “ball” from “My dog is fetching the”. In either case, through pretraining, LMs learn to encode various knowledge from a text corpus that helps to perform downstream applications involving language understanding or generation. In particular, LMs can learn world knowledge (associations between concepts like “dog”, “fetch”, “ball”) from training text where the concepts appear together, which helps with knowledge-intensive applications like question answering [6]. Challenges. A challenge with most common LM pretraining strateg…
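The two pretraining objectives in this excerpt differ only in how an (input, target) pair is cut from the same sentence. The sketch below builds both pairs for the post's own example sentence. The helper names and the `[MASK]` placeholder string are assumptions for illustration; real pretraining works on subword tokens and masks positions at random.

```python
def masked_lm_example(tokens, mask_index, mask_token="[MASK]"):
    """Build a masked-LM (input, target) pair by hiding one token."""
    inputs = list(tokens)
    target = inputs[mask_index]
    inputs[mask_index] = mask_token
    return inputs, target

def causal_lm_example(tokens, prefix_len):
    """Build a causal-LM (prefix, next-token) pair."""
    return list(tokens[:prefix_len]), tokens[prefix_len]

sentence = "My dog is fetching the ball".split()

# Masked language modeling: predict the hidden word from both sides of context.
mlm_inputs, mlm_target = masked_lm_example(sentence, mask_index=1)

# Causal language modeling: predict the next word from the left context only.
clm_prefix, clm_target = causal_lm_example(sentence, prefix_len=5)
```

Here `mlm_target` is "dog" and `clm_target` is "ball", matching the two examples in the text.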

Stanford AI Lab Blog 2022-05-25 07:00 UTC Score 47.0 USR-0006-20220525-research-aca-2eecb290

Stanford AI Lab Papers and Talks at ACL 2022

The 60th Annual Meeting of the Association for Computational Linguistics (ACL) 2022 is taking place May 22nd - May 27th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

LinkBERT: Pretraining Language Models with Document Links
Authors: Michihiro Yasunaga, Jure Leskovec*, Percy Liang*
Contact: myasu@cs.stanford.edu
Links: Paper | Website
Keywords: language model, pretraining, knowledge, hyperlink, bionlp

When classifying grammatical role, BERT doesn’t care about word order… except when it matters
Authors: Isabel Papadimitriou, Richard Futrell, Kyle Mahowald
Contact: isabelvp@stanford.edu
Links: Paper
Keywords: large language models, analysis, word order, order invariance, grammatical role, syntax, semantics

Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
Authors: Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, Dan Jurafsky
Contact: katezhou@stanford.edu
Keywords: cosine similarity, training data frequency, model analysis

Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization
Authors: Faisal Ladhak, Esin Durmus, He He, Claire Cardie, Kathleen McKeown
Contact: esdurmus@stanford.edu
Links: Paper
Keywords: text summarization, text generation, evaluation, faithfulness

Sp…

Stanford AI Lab Blog 2022-01-21 08:00 UTC Score 58.0 USR-0006-20220121-research-aca-1e3c1829

Reward Isn't Free: Supervising Robot Learning with Language and Video from the Web

This work was conducted as part of SAIL and CRFM. Deep learning has enabled improvements in the capabilities of robots on a range of problems such as grasping [1] and locomotion [2] in recent years. However, building the quintessential home robot that can perform a range of interactive tasks, from cooking to cleaning, in novel environments has remained elusive. While a number of hardware and software challenges remain, a necessary component is robots that can generalize their prior knowledge to new environments, tasks, and objects in a zero- or few-shot manner. For example, a home robot tasked with setting the dining table cannot afford lengthy re-training for every new dish, piece of cutlery, or dining room it may need to interact with. A natural way to enable such generalization in our robots is to train them on rich data sources that contain a wide range of different environments, tasks, and objects. Indeed, this recipe of massive, diverse datasets combined with scalable offline learning algorithms (e.g. self-supervised or cheaply supervised learning) has been the backbone of the many recent successes of foundation models [3] in NLP [4, 5, 6, 7, 8, 9] and vision [10, 11, 12]. Replicating these impressive generalization and adaptation capabilities in robot learning algorithms would certainly be a step toward robots that can be used in unstructured, real-world environments. However, directly extending this recipe to robotics is nontrivial, as we neither have sufficiently large and diverse…

Stanford AI Lab Blog 2021-11-05 07:00 UTC Score 52.0 USR-0006-20211105-research-aca-02e23852

Stanford AI Lab Papers at EMNLP/CoNLL 2021

The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) will take place next week, colocated with CoNLL 2021. We’re excited to share all the work from SAIL that will be presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Calibrate your listeners! Robust communication-based training for pragmatic speakers
Authors: Rose E. Wang, Julia White, Jesse Mu, Noah D. Goodman
Contact: rewang@stanford.edu
Links: Paper | Video
Keywords: language generation, pragmatics, communication-based training, calibration, uncertainty

Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text
Authors: Maya Varma, Laurel Orr, Sen Wu, Megan Leszczynski, Xiao Ling, Christopher Ré
Contact: mvarma2@stanford.edu
Links: Paper | Video
Keywords: named entity disambiguation, biomedical text, rare entities, data integration

ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
Authors: Yuta Koreeda, Christopher D. Manning
Contact: koreeda@stanford.edu
Links: Paper | Website
Keywords: natural language inference, contract, law, legal, dataset
Venue: The Findings of EMNLP 2021

The Emergence of the Shape Bias Results from Communicative Efficiency
Authors: Eva Portelance, Michael C. Frank, Dan Jurafsky, Alessandro Sordoni, Romain Laroche
Contact: portelan@stanford.edu
Links…

Stanford AI Lab Blog 2021-10-13 07:00 UTC Score 52.0 USR-0006-20211013-research-aca-a7eb5ef1

Selective Classification Can Magnify Disparities Across Groups

Selective classification, where models are allowed to “abstain” when they are uncertain about a prediction, is a useful approach for deploying models in settings where errors are costly. For example, in medicine, model errors can have life-or-death ramifications, but abstentions can be easily handled by backing off to a doctor, who then makes a diagnosis. Across a range of applications in vision [1, 2, 3] and NLP [4, 5], even simple selective classifiers, relying only on model logits, routinely and often dramatically improve accuracy by abstaining. This makes selective classification a compelling tool for ML practitioners [6, 7]. However, in our recent ICLR paper, we find that despite reliably improving average accuracy, selective classification can fail to improve and even hurt the accuracy over certain subpopulations of the data. As a motivating example, consider the task of diagnosing pleural effusion, or fluid in the lungs, from chest X-rays. Pleural effusion is often treated with a chest drain, so many pleural effusion cases also have chest drains, while most cases without pleural effusion do not have chest drains [8]. While selective classification improves average accuracy for this task, we find that it does not appreciably improve accuracy on the most clinically relevant subgroup, or subpopulation, of the data: those that have pleural effusion but don’t yet have a chest drain, i.e. those that have pleural effusion but have not yet been treated for it. Practitioners, thus,…
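The effect described in this excerpt can be reproduced on toy numbers. The sketch below abstains whenever a prediction's confidence falls under a threshold, then reports coverage, average accuracy, and per-group accuracy on the predictions that remain. The function name `selective_accuracy`, the record layout, and all the records are invented for illustration; they are not data from the paper.

```python
def selective_accuracy(records, threshold):
    """Coverage, average accuracy, and per-group accuracy when the model
    abstains below a confidence threshold.

    Each record is (confidence, is_correct, group); hypothetical layout.
    """
    kept = [r for r in records if r[0] >= threshold]
    coverage = len(kept) / len(records) if records else 0.0
    accuracy = sum(r[1] for r in kept) / len(kept) if kept else None
    by_group = {}
    for _conf, correct, group in kept:
        hits, total = by_group.get(group, (0, 0))
        by_group[group] = (hits + int(correct), total + 1)
    group_acc = {g: hits / total for g, (hits, total) in by_group.items()}
    return coverage, accuracy, group_acc

# Toy, made-up predictions: group A is larger and confidently correct,
# while group B's single confident prediction is wrong.
records = [
    (0.95, True, "A"), (0.90, True, "A"), (0.88, True, "A"),
    (0.85, True, "A"), (0.60, False, "A"), (0.55, False, "A"),
    (0.90, False, "B"), (0.58, True, "B"), (0.56, True, "B"), (0.52, False, "B"),
]

full = selective_accuracy(records, threshold=0.0)       # no abstention
selective = selective_accuracy(records, threshold=0.8)  # abstain below 0.8
```

On these records, abstaining raises average accuracy from 0.6 to 0.8 while accuracy on group B falls from 0.5 to 0.0, which is why per-group reporting matters alongside the average.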

Lilian Weng Blog 2021-03-21 00:00 UTC Score 39.0 USR-0112-20210321-ai-specialis-3e60cc8a

Reducing Toxicity in Language Models

Large pretrained language models are trained over a sizable collection of online data. They unavoidably acquire certain toxic behavior and biases from the Internet. Pretrained language models are very powerful and have shown great success in many NLP tasks. However, safely deploying them in practical real-world applications demands strong safety control over the model generation process.

AI Stack Exchange 2021-01-13 13:34 UTC Score 23.0 AI-110-20210113-social-media-8c81b4b2

How to extract parameters from a text using AI/NLP

Let's say I have three texts: "make a heading that says hello world", "make a heading of hello world", and "create heading consist of hello world". How can I use AI to extract the group of words that the heading refers to, i.e. "hello world" in this case? Which AI frameworks or libraries can do that? In all three examples, the heading points to "hello world" (which I am referring to as the group of words). So basically, I want the words that will be part of the heading; in other words, there is a relationship between them. Another example is "I am watching Breaking Bad": there is a relationship between "watching" and "Breaking Bad", and I want to extract what is being watched. What's the best approach? Do I have to train a model for that, or are there other techniques that can get it done?

Jay Alammar Blog 2020-12-17 00:00 UTC Score 39.0 USR-0113-20201217-ai-specialis-fb351fb3

Interfaces for Explaining Transformer Language Models

Interfaces for exploring transformer language models by looking at input saliency and neuron activation. Explorable #1: Input saliency of a list of countries generated by a language model. Explorable #2: Neuron activation analysis reveals four groups of neurons, each associated with generating a certain type of token. The Transformer architecture has been powering a number of the recent advances in NLP. A breakdown of this architecture is provided here. Pre-trained language models based on the architecture, in both its auto-regressive (models that use their own output as input to next time-steps and that process tokens from left-to-right, like GPT2) and denoising (models trained by corrupting/masking the input and that process tokens bidirectionally, like BERT) variants, continue to push the envelope in various tasks in NLP and, more recently, in computer vision. Our understanding of why these models work so well, however, still lags behind these developments. This exposition series continues the pursuit to interpret and visualize the inner workings of transformer-based language models. We illustrate how some key interpretability methods apply to transformer-based language models. This article focuses on auto-regressive models, but these methods are applicable to other architectures and tasks as well. This is the first article in the series. In it, we present explo…