TechRadar
Yiannis Andreopoulos

A new hybrid approach to photorealistic neural avatars

A video conference screenshot showing the Microsoft Teams avatars, now in public preview.

Neural avatars have emerged as a new technology for interactive remote presence. Amongst other things, they are expected to influence video conferencing, mixed reality frameworks (e.g., remote appearances at physical meetings), and 2D or 3D gaming and metaverse applications. At the moment, they are limited to either cartoon representations of the speaker (e.g., Mesh avatars for Microsoft Teams) or experimental prototypes of photorealistic neural rendering of speakers, such as NVIDIA's Maxine video compression and Meta's pixel codec avatars.

In both categories, shortcomings in rendering a speaker's intricate expressions, gestures and body movements severely limit the value of remote presence and visual communication.

Faced with rendering this poor, which does not reflect reality, many users may simply prefer not to use video at all. This is because we all evolved to be astute observers of human faces, gestures and body movements. We use delicate expressions, hand gestures and body movements to convey trust and meaning, to interpret human emotion, to gauge the speaker's command of the topic under discussion, and much else. It is widely accepted that the majority of human communication is non-verbal, so all of these details matter a great deal.

The iSIZE team has been working on this problem for almost two years and has identified a new way to introduce advanced AI tools that facilitate such remote presence applications. The key observation is that no 2D or 3D photorealistic neural rendering framework for human speakers that allows for substantial compression (i.e., 2x or more versus what conventional video codecs achieve today) can guarantee faithful representation of key features such as the eyes, the mouth and certain types of expressions and gestures.

Due to the enormous diversity of human faces and expressions, iSIZE estimates it would take 1,000x more neural rendering complexity to get such intricate details right, and even then there is no guarantee that the result would be free of uncanny valley artifacts when deployed at scale for millions of users.

At the recent GTC conference, the company presented its latest live photorealistic generative AI avatar. We have focused on delivering a photorealistic experience while maintaining a far lower bitrate than standard video codecs. Such a substantial bitrate reduction enables:

  • 5x lower video latency
  • 5x lower transceiver power
  • Uninterrupted remote presence under poor wireless signal conditions
  • Significantly better Quality of Experience and increased user engagement time for 2D/3D gaming and metaverse applications.
(Image credit: iSize)

A custom pretrained neural warp engine is still used, but it is made even simpler and faster than comparable approaches. At the encoder side, this neural engine generates a coarse representation of the target video frame of the speaker (starting from a reference frame) using warp information that can be extremely compact, i.e., around 200 bits/frame. Naturally, uncanny valley issues can creep in, as seen in the mouth and eye regions of Fig. 1.

Fig. 1. Neural warping that generates a baseline for the target video frame, with uncanny valley artifacts in the eyes and mouth. (Image credit: iSize)
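
The article does not disclose iSIZE's warp parameterization, but a first-order-motion-style keypoint scheme illustrates how a motion payload of roughly 200 bits can drive a dense warp. The sketch below is a hedged illustration under that assumption: the keypoint count, quantization depth and Gaussian influence maps are hypothetical choices, not iSIZE's engine. One illustrative budget: 10 keypoints × 2 coordinates × 10 bits = 200 bits/frame, i.e., only about 6 kbit/s at 30 fps.

```python
# Hedged sketch of a compact warp payload, NOT iSIZE's actual engine: sparse,
# coarsely quantized keypoint offsets drive a dense warp of a reference frame.
import torch
import torch.nn.functional as F

NUM_KEYPOINTS = 10   # hypothetical budget: 10 kp x 2 coords x 10 bits
BITS_PER_COORD = 10  # = 200 bits/frame, i.e. ~6 kbit/s at 30 fps

def quantize(coords: torch.Tensor, bits: int = BITS_PER_COORD) -> torch.Tensor:
    """Snap coordinates in [-1, 1] to a 2**bits-level grid (the sent payload)."""
    levels = 2 ** bits - 1
    return torch.round((coords.clamp(-1, 1) + 1) / 2 * levels) / levels * 2 - 1

def warp_reference(reference: torch.Tensor, kp_src: torch.Tensor,
                   kp_dst: torch.Tensor) -> torch.Tensor:
    """Warp a (1, C, H, W) reference so content at each source keypoint is
    pulled toward its destination, via a crude Gaussian-weighted dense flow."""
    _, _, H, W = reference.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)            # (H, W, 2) sampling grid
    flow = torch.zeros_like(grid)
    for src, dst in zip(kp_src, kp_dst):
        weight = torch.exp(-((grid - dst) ** 2).sum(-1) / 0.02)  # influence map
        flow += weight.unsqueeze(-1) * (src - dst)  # near dst, sample from src
    return F.grid_sample(reference, (grid + flow).unsqueeze(0),
                         align_corners=True)

# Toy usage: animate a random "reference" frame with a 200-bit motion payload.
ref = torch.rand(1, 3, 64, 64)
kp_src = torch.rand(NUM_KEYPOINTS, 2) * 2 - 1
kp_dst = quantize(kp_src + 0.05 * torch.randn(NUM_KEYPOINTS, 2))
target_baseline = warp_reference(ref, kp_src, kp_dst)
```

A baseline produced this way is precisely where artifacts like those in the eyes and mouth of Fig. 1 would appear, which motivates the next stage of the pipeline.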

However, what happens next is that a quality scoring neural proxy assesses the expected warp quality before the warp is actually performed, and makes one of the following decisions:

  • Either it determines that the warp will fail in a major way, in which case the whole frame is sent to a ‘sandwiched’ encoding engine such as an HEVC or AV1 encoder.
  • Or it extracts only select region-of-interest (ROI) components, such as the eyes and mouth, and sends just these components to the sandwiched encoder (a sketch of this routing follows the list).
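
To make the decision flow concrete, here is a minimal sketch of that a-priori routing, assuming a scalar proxy score in [0, 1] (1 = warp expected to be faithful); the threshold value and the encode()/extract_rois() helpers are hypothetical stand-ins, not iSIZE's API.

```python
# Hedged sketch of the a-priori routing step; the threshold and helper
# functions are hypothetical stand-ins, not iSIZE's API.

def encode(pixels) -> bytes:
    """Stand-in for a real sandwiched HEVC/AV1 encode call."""
    return b"<bitstream>"

def extract_rois(frame, regions):
    """Stand-in crop; a real system would localize eyes/mouth via landmarks."""
    return {name: frame for name in regions}

FAILURE_THRESHOLD = 0.4  # assumed cut-off below which the warp is deemed doomed

def route_frame(frame, warp_params, proxy_score: float) -> dict:
    """Decide, before any warping happens, what the sandwiched codec carries."""
    if proxy_score < FAILURE_THRESHOLD:
        # Decision 1: major warp failure expected, so the whole frame goes to
        # the sandwiched encoder and the neural path is bypassed.
        return {"mode": "full_frame", "payload": encode(frame)}
    # Decision 2: warp the bulk of the frame, but route the hard regions
    # (eyes, mouth) through the sandwiched encoder.
    rois = extract_rois(frame, regions=("eyes", "mouth"))
    return {"mode": "warp_plus_roi",
            "payload": (warp_params, {k: encode(v) for k, v in rois.items()})}

# Toy usage: a 25-byte (200-bit) warp payload and a mid-range proxy score.
decision = route_frame(frame=None, warp_params=b"\x00" * 25, proxy_score=0.7)
```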

Then, an ROI blending engine seamlessly blends the ROI components with the warp outcome, and the stitched result is displayed.
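
The article does not describe the blending engine's internals; one plausible minimal form is feathered alpha compositing, where the decoded ROI patch fades smoothly into the warped frame so no seam is visible. The sketch below is an assumption in that spirit, with illustrative patch placement and feather width.

```python
# Hedged sketch of ROI blending as feathered alpha compositing; shapes and
# feather width are illustrative, not iSIZE's actual blending engine.
import numpy as np

def feathered_alpha(h: int, w: int, feather: int = 4) -> np.ndarray:
    """Mask that is 1 in the patch interior and ramps to 0 at its edges."""
    ry = np.minimum(np.arange(h), np.arange(h)[::-1])
    rx = np.minimum(np.arange(w), np.arange(w)[::-1])
    dist = np.minimum(ry[:, None], rx[None, :]).astype(np.float32)
    return np.clip(dist / feather, 0.0, 1.0)

def blend_roi(warped: np.ndarray, roi_patch: np.ndarray,
              top: int, left: int, feather: int = 4) -> np.ndarray:
    """Composite a codec-decoded ROI patch onto the warped frame, fading the
    patch edges so the stitched result shows no visible seam."""
    out = warped.astype(np.float32).copy()
    h, w = roi_patch.shape[:2]
    a = feathered_alpha(h, w, feather)[..., None]        # (h, w, 1) alpha
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = a * roi_patch + (1 - a) * region
    return out

# Toy usage: paste a decoded 16x16 "mouth" patch into a 64x64 warped frame.
frame = np.random.rand(64, 64, 3).astype(np.float32)
mouth = np.random.rand(16, 16, 3).astype(np.float32)
stitched = blend_roi(frame, mouth, top=40, left=24)
```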

Fig. 2. Full framework of BitGen2D proposed by iSIZE. (Image credit: iSize)

As Fig. 2 shows, this avoids all uncanny valley problems. This framework is termed BitGen2D, as it bridges the gap between a generative avatar engine and a conventional video encoder. It is a hybrid approach that offloads all the difficult parts to the external encoder, while leaving the bulk of the frame to be warped by the neural engine.

The crucial components of this framework are the quality scoring proxy and the ROI blending. Notice that BitGen2D deliberately does not score quality after the blending: doing so a-posteriori would increase delay, because whenever a failure was detected the system would have to discard the blended result and send the frame to the video encoder instead. Getting the quality scoring to work well in conjunction with the warp and blending is therefore key. Furthermore, deep fakes cannot be created, as the framework operates directly on elements of the actual speaker's face.
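
The article does not reveal the proxy's architecture; one plausible shape, sketched below under that assumption, is a very small network that regresses an expected-quality score from the reference frame and the compact warp payload alone, so it can run before any warping or encoding takes place.

```python
# Hedged sketch of a quality-scoring proxy; the architecture and inputs are
# illustrative assumptions, not iSIZE's actual model.
import torch
import torch.nn as nn

class WarpQualityProxy(nn.Module):
    """Predicts expected warp fidelity in [0, 1] from the reference frame and
    the compact keypoint payload, before the warp is ever computed."""
    def __init__(self, num_keypoints: int = 10):
        super().__init__()
        self.features = nn.Sequential(            # cheap convolutional features
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(                # fuse features + warp params
            nn.Linear(32 + num_keypoints * 2, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())       # scalar score in [0, 1]

    def forward(self, reference: torch.Tensor, warp_params: torch.Tensor):
        f = self.features(reference)
        return self.head(torch.cat([f, warp_params.flatten(1)], dim=1))

# Toy usage: score a 64x64 reference against a 10-keypoint motion payload.
proxy = WarpQualityProxy()
score = proxy(torch.rand(1, 3, 64, 64), torch.rand(1, 10, 2))
```

A score below a tuned threshold would then trigger the routing shown earlier, keeping the whole decision ahead of the warp and thus off the latency-critical path.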

BitGen2D provides high reliability, as it can accurately render the various events that can happen during a video call, such as hand movements or foreign objects entering the frame. It is also user-agnostic, as it does not require any fine-tuning on a specific identity. This makes BitGen2D plug-and-play for any user, and it can also withstand situations such as a change in appearance (clothes, accessories, haircut…) or a switch of speaker during a live call. The method is actively being extended to 3D data using NeRFs and octrees.

“Generative AI’s incredible potential is inspiring virtually every industry to reimagine its business strategies and the technology required to achieve them,” NVIDIA CEO Jensen Huang said during the NVIDIA GTC Developer conference keynote address. 

AI solutions are here to stay and are already transforming how we live, work and entertain ourselves.

