AI/ML News & Innovations Hub

Chip Huyen Blog 2023-10-10 00:00 UTC Score 53.0 USR-0111-20231010-ai-specialis-f4a68771

Multimodality and Large Multimodal Models (LMMs)

For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and see. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world. OpenAI noted in their GPT-4V system card that “ incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development .” Incorporating additional modalities to LLMs (Large Language Models) creates LMMs (Large Multimodal Models). Not all multimodal systems are LMMs. For example, text-to-image models like Midjourney, Stable Diffusion, and Dall-E are multimodal but don’t have a language model component. Multimodal can mean one or more of the following: Input and output are of different modalities (e.g. text-to-image, image-to-text) Inputs are multimodal (e.g. a system that can process both text and images) Outputs are multimodal (e.g. a system that can generate both text and images) This post covers multimodal systems in general, including LMMs. It consists of 3 parts. Part 1 covers the context for multimodality, including why multimodal, different data modalities, and types of multimodal tasks. Part 2 discusses the fundamentals of a multimodal system, using the…

Read article →

Jay Alammar Blog 2023-01-01 00:00 UTC Score 41.0 USR-0113-20230101-ai-specialis-60429c7c

Remaking Old Computer Graphics With AI Image Generation

Can AI Image generation tools make re-imagined, higher-resolution versions of old video game graphics? Over the last few days, I used AI image generation to reproduce one of my childhood nightmares. I wrestled with Stable Diffusion, Dall-E and Midjourney to see how these commercial AI generation tools can help retell an old visual story - the intro cinematic to an old video game (Nemesis 2 on the MSX). This post describes the process and my experience in using these models/services to retell a story in higher fidelity graphics. Meet Dr. Venom This fine-looking gentleman is the villain in a video game. Dr. Venom appears in the intro cinematic of Nemesis 2, a 1987 video game. This image, in particular, comes at a dramatic reveal in the cinematic. Let’s update these graphics with visual generative AI tools and see how they compare and where each succeeds and fails. Remaking Old Computer graphics with AI Image Generation Here’s a side-by-side look at the panels from the original cinematic (left column) and the final ones generated by the AI tools (right column): This figure does not show the final Dr. Venom graphic because I want you to witness it as I had, in the proper context and alongside the appropriate music. You can watch that here:

Read article →

DALL·E

Multimodality and Large Multimodal Models (LMMs)

Remaking Old Computer Graphics With AI Image Generation