Google debuts "Gemini Omni" world model unifying reasoning and video creation

Marijan Hassan - Tech Journalist
May 25
3 min read

Google has unified its creative artificial intelligence ecosystem, unveiling Gemini Omni, a new family of "any-to-any" multimodal models, at its annual Google I/O 2026 developer conference. The launch, introduced by Google DeepMind CEO Demis Hassabis on May 19, 2026, marks the first time a top-tier AI company has collapsed separate text, image, audio, and video pipelines into a single architecture.

The first model in the lineup, Gemini Omni Flash, went live immediately across consumer platforms, effectively replacing legacy tools like Veo as the default engine inside the Gemini app, Google Flow, and YouTube Shorts.

Understanding the "any-to-any" world model

Historically, AI companies ran a fragmented stack: one model for processing language, another for rendering images, and a separate system for generating video. Gemini Omni breaks with that approach by processing and generating multiple media formats natively within a single forward pass.

Google DeepMind achieves this by fusing three previously distinct technologies into a single framework: the core Gemini reasoning engine, the Veo video rendering backbone, and the Genie world simulation layer.

By integrating a world simulation layer, Gemini Omni behaves less like a standard text-to-video generator and more like a physics engine. The model exhibits an intuitive, built-in grasp of physical constraints, such as gravity, fluid dynamics, kinetic momentum, and light reflection, while simultaneously drawing on Gemini’s massive knowledge base of history, science, and cultural context to ensure high-fidelity outputs.

The killer feature: Multi-turn conversational video editing

While Omni can generate highly realistic 10-second clips from scratch, its most disruptive capability is conversational video editing. Internally described by Google engineers as "Nano Banana, but for video," the interface allows creators to edit existing footage or generated clips using natural language rather than complex post-production software.

Because the model retains persistent context across an entire chat session, creators can apply adjustments incrementally over multiple turns without resetting the scene:

Iterative changes: A user can upload a video and type, "Change the background to a rainy neon Tokyo alley," followed by, "Now, make the character walk faster and dim the streetlights."
Character and scene continuity: Omni allows users to stack up to five reference photos from the start, anchoring specific visual identities, props, and locations so they remain consistent across different shots.
Granular object swapping: Users can target specific elements within a frame, issuing commands like, "Replace the coffee cup on the desk with a glass vase," which the model executes while maintaining the surrounding lighting and shadows.
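
To make the multi-turn idea concrete, here is a minimal Python sketch of how an editing session might carry context from one turn to the next. Everything in it is an assumption for illustration: the EditSession class, its field names, and the shape of the request it builds are hypothetical and do not describe Google's actual interface. Only the five-photo cap and the example prompts come from the description above.

```python
from dataclasses import dataclass, field

MAX_REFERENCES = 5  # cap on reference photos, as described in the article


@dataclass
class EditSession:
    """Hypothetical container for one conversational editing session."""
    source_clip: str
    references: list[str] = field(default_factory=list)
    turns: list[str] = field(default_factory=list)

    def add_reference(self, photo: str) -> None:
        # Reject any photo beyond the stated limit.
        if len(self.references) >= MAX_REFERENCES:
            raise ValueError(f"at most {MAX_REFERENCES} reference photos are allowed")
        self.references.append(photo)

    def edit(self, instruction: str) -> dict:
        """Record an instruction and return the full context for this turn."""
        self.turns.append(instruction)
        return {
            "clip": self.source_clip,
            "references": list(self.references),
            "history": list(self.turns[:-1]),  # earlier turns stay in scope
            "instruction": instruction,
        }


session = EditSession("clip.mp4")
session.add_reference("hero.jpg")
session.edit("Change the background to a rainy neon Tokyo alley.")
request = session.edit("Now, make the character walk faster and dim the streetlights.")
```

The second request still carries the first instruction in its history, which is the property that lets a follow-up like "Now, make the character walk faster" build on the earlier change instead of starting from the original footage.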

However, Google researchers have urged caution during early adoption. DeepMind engineers noted that because creators are not working with explicit Photoshop-style selection layers, text prompts currently need to be highly specific to prevent the model from over-editing or altering parts of the video a creator intended to keep.

Aggressive distribution and built-in provenance

Google’s rollout strategy is heavily focused on immediate consumer scale. While competitors like OpenAI have restricted public access to their advanced video tools, Google is pushing Omni Flash directly into the wild. The model is available to paid Google AI Plus subscribers ($7.99/month) inside Gemini and Google Flow, and it is launching as a free, native tool inside YouTube Shorts and the YouTube Create app.

This gives millions of short-form creators instant access to portrait-optimized generative video and personalized digital avatars. To address the inevitable deepfake and safety concerns that come with frictionless video manipulation, Google is embedding mandatory, dual-layer tracking into every Omni file:

SynthID watermarking: Developed by DeepMind, this invisible pixel-level watermark is embedded directly at the moment of generation. It is imperceptible to viewers but designed to survive heavy editing, cropping, filters, and file compression.
C2PA content credentials: A cryptographically signed manifest attached to the file metadata that provides a clear, verifiable audit trail of the video's origin.
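
The signed-manifest idea can be illustrated with a short Python sketch. This is a simplified stand-in, not the C2PA format: real content credentials use certificate-based signatures and a standardized container, whereas this example assumes a shared demo key and HMAC purely to show how a signature can bind an origin claim to a specific file. The function names, the key, and the manifest fields are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical demo key. Real content credentials are signed with a
# private key tied to a certificate, not a shared secret like this one.
DEMO_KEY = b"demo-signing-key"


def make_manifest(video_bytes: bytes, generator: str) -> dict:
    """Build a claim about the file's origin and sign it."""
    claim = {
        "generator": generator,
        "content_sha256": hashlib.sha256(video_bytes).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(DEMO_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(video_bytes: bytes, manifest: dict) -> bool:
    """Check that the claim is untampered and matches these exact bytes."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(DEMO_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, manifest["signature"])
    digest = hashlib.sha256(video_bytes).hexdigest()
    content_ok = manifest["claim"]["content_sha256"] == digest
    return signature_ok and content_ok


video = b"example video bytes"
manifest = make_manifest(video, "example-generator")
original_ok = verify_manifest(video, manifest)
tampered_ok = verify_manifest(video + b"!", manifest)
```

Here original_ok is True and tampered_ok is False, because the hash recorded in the claim no longer matches the altered bytes. A metadata manifest of this kind can also be stripped from a file entirely, which is a reason to pair it with a watermark embedded in the pixels themselves.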

By reducing video creation to a simple conversation and deploying it directly into the world’s largest short-form video platform, Google isn't just releasing a new model. It is establishing a massive, consumer-facing infrastructure designed to lock creators into the Gemini ecosystem for the foreseeable future.