A group of researchers at Samsung Labs has developed improved neural head avatar technology that operates at megapixel resolution. The team proposes a new set of neural architectures and training methods to handle 'the particularly challenging task of cross-driving synthesis,' in which an avatar is animated using a driving video of a different person.

The team has developed convincing neural avatars of historical figures and even some modern celebrities. The neural architecture takes a supplied driving video, footage of a person making different facial expressions and head movements, and applies those motions to a static image, such as a painted portrait or photograph. The system thereby turns the static image into a motion graphic in which the subject's head and face follow the movements of the driving video.

Figure 2: Overview of our base model. To encode the appearance of the source frame, we predict volumetric features v_s and a global descriptor e_s from the source image via an appearance encoder E_app. In parallel, we predict the motion representations from both the source and driving images using a motion encoder E_mtn. These representations consist of the explicit head rotations R_{s/d}, translations t_{s/d}, and the latent expression descriptors z_{s/d}. They are used to predict the 3D warpings w_{s→} and w_{→d} via the separate warping generators W_{s→} and W_{→d}. The first warping removes the source motion from the appearance features v_s by mapping them into a canonical coordinate space, and the second one imposes the driver motion. The canonical volume is processed by a 3D convolutional network G_3D, and the driving volume v_{s→d} is orthographically projected into 2D features and processed by a 2D convolutional network G_2D, which predicts an output image x̂_{s→d}.

Credit: Samsung AI Center
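The base-model pipeline described in the caption can be sketched at the level of data flow. The sketch below is an illustration only: the encoders and warping operations are stubbed out with placeholder tensors, and all dimensions (feature channels, volume size, descriptor lengths) are hypothetical, not the paper's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes, chosen for illustration (not the paper's real dimensions).
C, D, H, W = 96, 16, 64, 64   # channels and spatial size of the feature volume
E = 512                        # length of the global appearance descriptor

def appearance_encoder(image):
    """Stand-in for E_app: image -> volumetric features v_s, global descriptor e_s."""
    v_s = rng.standard_normal((C, D, H, W))
    e_s = rng.standard_normal(E)
    return v_s, e_s

def motion_encoder(image):
    """Stand-in for E_mtn: image -> head rotation R, translation t, expression z."""
    R = np.eye(3)                 # explicit head rotation
    t = np.zeros(3)               # explicit head translation
    z = rng.standard_normal(128)  # latent expression descriptor
    return R, t, z

def warp(volume, motion):
    """Placeholder for applying a predicted 3D warping to the feature volume.
    A real implementation would resample the volume; here it is the identity."""
    return volume

def base_model(source_img, driving_img):
    v_s, e_s = appearance_encoder(source_img)
    motion_s = motion_encoder(source_img)
    motion_d = motion_encoder(driving_img)

    # w_{s->} removes the source motion, mapping v_s into canonical space;
    # w_{->d} then imposes the driver motion on the canonical volume.
    v_canonical = warp(v_s, motion_s)       # G_3D would refine this volume
    v_sd = warp(v_canonical, motion_d)      # driving volume v_{s->d}

    # Orthographic projection of the volume into 2D features (collapse depth);
    # G_2D would decode these features into the output image x̂_{s->d}.
    features_2d = v_sd.mean(axis=1)
    return features_2d

out = base_model(None, None)   # placeholder inputs; encoders ignore them here
```

The point of the two-warping design is separation of concerns: once the source motion is factored out into a shared canonical space, any driver's motion can be imposed on any source's appearance.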

What makes the system distinct is its impressive resolution and the fact that an animated avatar can be created in 'one shot,' from a single source image. The megapixel portraits, MegaPortraits for short, rely on two-stage training. The team describes its training setup as 'relatively standard': at each step, two random frames are sampled from the dataset, a source frame and a driver frame. The model then 'imposes the motion of the driving frame (i.e., the head pose and the facial expression) onto the appearance of the source frame to produce an output image.' The learning signal is built from training episodes in which the source and driver frames come from the same video, so the driver frame itself serves as ground truth for the output.
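The episode-sampling scheme above can be sketched in a few lines. The function name and the toy dataset layout are assumptions for illustration; only the sampling logic (two distinct frames drawn from one video) reflects the described setup.

```python
import random

def sample_training_episode(dataset):
    """Sample a source and a driver frame from the same video.

    Because both frames show the same person in the same recording, the
    driver frame doubles as the ground-truth target for the model's output.
    """
    video = random.choice(dataset)            # pick one video's frame list
    source, driver = random.sample(video, 2)  # two distinct random frames
    return source, driver

# Toy dataset: each "video" is just a list of frame identifiers.
dataset = [[f"vid{v}_frame{f}" for f in range(10)] for v in range(3)]
src, drv = sample_training_episode(dataset)
```

At inference time this constraint is dropped: the driver can be a video of an entirely different person, which is the cross-driving setting the paper targets.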

The team believes its approach is the first to achieve megapixel resolution for this task. The system has two primary limitations. First, the VoxCeleb2 and FFHQ datasets used for training mostly comprise frontal or near-frontal views, so quality degrades when rendering non-frontal head poses. The second limitation is some temporal flicker, as seen in the video above; this arises because the high-resolution training images are static, leaving the high-resolution stage without temporal supervision.

You can read the full research paper, 'MegaPortraits: One-shot Megapixel Neural Head Avatars,' online. The research team includes Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov.