multimodal AI - Adjacent

May 24

Google's Gemini learns to process any type of input at once

Google's latest multimodal architecture processes text, image, video, and audio natively instead of converting everything into text tokens first. The approach is materially faster and more efficient than current methods. The competitive pressure, though, falls on reasoning. If Gemini maintains coherence across disparate data types (a video, a text prompt, and image context together), it redefines what "understanding" means in an AI product. That would force OpenAI and Anthropic to either match the throughput or demonstrate that narrower pipelines deliver better reasoning on the tasks that matter.