Tom's Guide
Ryan Morrison

WALT is a new AI video tool that creates photorealistic clips from a single image — you have to see it to believe it

WALT can create fluid movement.

A new artificial intelligence model called WALT can take a simple image or text input and convert it into a photorealistic video. Preview clips include dragons breathing fire, asteroids hitting the Earth and horses walking on a beach.

One of the more notable advances from the Stanford University team behind WALT is the ability to create consistent 3D motion on an object, and to do so from a natural language prompt.

Creating video from images or text is the next big frontier. It is a complex problem to solve, requiring more than just stitching a sequence of images together: each frame has to follow logically from the previous one to create fluid motion.

What makes WALT stand out?

WALT was specifically trained to create fluid 3D motion (Image credit: Stanford AI Lab)

Companies like Pika Labs, Runway, Meta and Stability AI all have generative video models with varying degrees of fluidity, coherence and quality. Agrim Gupta, the lead researcher behind WALT, says it can generate video from text or images and be used for 3D motion.

Gupta says WALT was trained on both photographs and video clips encoded into the same latent space. This allowed the model to train on both at once, giving it a deeper understanding of motion from the start.
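The core idea can be sketched simply: if a photograph is treated as a one-frame video, the same encoder can map both images and clips into one latent space. Here is a minimal illustration in Python; the `encode` function is a hypothetical stand-in (simple average pooling), not WALT's actual autoencoder.

```python
# Minimal sketch of the shared-latent-space idea: treat a photograph as a
# one-frame video so images and video clips pass through the same encoder.
# encode() is a hypothetical stand-in, not WALT's real autoencoder.
import numpy as np

def encode(frames: np.ndarray) -> np.ndarray:
    """Map (T, H, W, 3) RGB frames to (T, H//8, W//8, 3) latents by pooling."""
    t, h, w, c = frames.shape
    return frames.reshape(t, h // 8, 8, w // 8, 8, c).mean(axis=(2, 4))

video = np.random.rand(8, 128, 128, 3)      # an 8-frame clip
image = np.random.rand(128, 128, 3)[None]   # a photo as a 1-frame "clip"

print(encode(video).shape)  # (8, 16, 16, 3)
print(encode(image).shape)  # (1, 16, 16, 3) -- same latent space as video
```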

WALT is designed to be scalable and efficient, achieving state-of-the-art results with a cascade of three models covering both image and video generation. The cascade allows for higher-resolution output with consistent motion.

"While generative modeling has seen tremendous recent advances for image," wrote Gupta and colleagues, "progress on video generation has lagged." He believes that a unified image and video framework will close the gap between image and video generation.

How does WALT compare to Runway and Pika Labs?

The AI model can also create fluid motion within a section of a video (Image credit: Stanford AI Lab)

The quality of motion in WALT seems to be a step up from other recent video models, particularly around 3D movement such as a burger turning on a table or horses walking. However, the resolution of the output is a fraction of that produced by Runway or Pika Labs.

That said, this is a research model and the team built it to scale. The base model produces small 128 x 128 pixel clips, which are then upsampled twice to reach 512 x 896 resolution at eight frames per second.
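As a rough illustration of that cascade, here is a hedged Python sketch: a stand-in base model produces low-resolution frames, and two stand-in super-resolution stages upscale them. The functions and scale factors are illustrative placeholders, not WALT's actual models (the real stages also change the aspect ratio to reach 512 x 896).

```python
# Illustrative cascade: a base generator plus two upsampling stages.
# Both functions are hypothetical stand-ins for learned models.
import numpy as np

def base_model(num_frames: int = 8, size: int = 128) -> np.ndarray:
    """Stand-in base generator: random (T, 128, 128, 3) frames."""
    return np.random.rand(num_frames, size, size, 3)

def super_resolution(frames: np.ndarray, scale: int = 2) -> np.ndarray:
    """Stand-in SR stage: nearest-neighbor upsampling instead of a model."""
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

clip = base_model()              # (8, 128, 128, 3)   base output
clip = super_resolution(clip)    # (8, 256, 256, 3)   first upsample
clip = super_resolution(clip)    # (8, 512, 512, 3)   second upsample;
print(clip.shape)                # WALT's real stages reach 512 x 896
```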

In contrast, Runway's Gen-2 can create video clips up to 1536 x 896, although that requires a paid subscription. The default, free version generates video at up to 768 x 448, not as high a resolution as WALT can produce.

Pika Labs works at similar resolutions, but both Runway and Pika Labs can generate up to 24 frames per second, closer to production-quality video than WALT's eight frames per second.
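A quick per-frame pixel count, using the figures quoted above, shows how the resolutions stack up:

```python
# Pixels per frame for the resolutions quoted above.
walt_max    = 512 * 896    # 458,752   -- WALT after two upsampling stages
runway_free = 768 * 448    # 344,064   -- Runway Gen-2, free tier
runway_paid = 1536 * 896   # 1,376,256 -- Runway Gen-2, paid tier
print(walt_max > runway_free)   # True: WALT tops the free Gen-2 output
```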
