Disney Research Studios and ETH Zurich have published a study detailing a new algorithm that is able to swap faces from one subject to another in high-resolution photos and videos. Of note, this system is able to fully automate the face-swapping process, presenting the first instance of megapixel-resolution machine-generated imagery that is 'temporally coherent' and photo-realistic.

The new algorithm involves taking the face of a subject and modifying it using the face of another person, blending the two so that the face from one person is presented with the expressions and movements of another.

The system involves a multi-way comb network trained with images of multiple people, as well as a blending method that preserves contrast and lighting. 'We also show that while progressive training enables generation of high-resolution images,' the researchers say, 'extending the architecture and training data beyond two people allows us to achieve higher fidelity in generated expressions.'
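
The article does not spell out the architecture, so the following is only a toy sketch. It assumes 'comb' refers to one shared encoder feeding a separate decoder per person (the teeth of the comb). The functions `shared_encoder` and `make_decoder`, the class `CombSketch`, and the 'base tone' numbers are all hypothetical stand-ins for trained neural networks, not anything taken from the study.

```python
from typing import Callable, Dict, List

Vector = List[float]


def shared_encoder(face: Vector) -> Vector:
    """Toy encoder: keep only deviations from the mean.

    Stand-in for a learned network that extracts identity-independent
    information such as expression and pose.
    """
    mean = sum(face) / len(face)
    return [value - mean for value in face]


def make_decoder(base_tone: float) -> Callable[[Vector], Vector]:
    """Toy decoder: re-render a latent code with one person's appearance."""
    def decode(latent: Vector) -> Vector:
        return [base_tone + value for value in latent]
    return decode


class CombSketch:
    """One shared encoder, one decoder per identity (assumed layout)."""

    def __init__(self, decoders: Dict[str, Callable[[Vector], Vector]]) -> None:
        self.decoders = decoders

    def swap(self, face: Vector, target: str) -> Vector:
        # Encode the source performance once, decode with the target's decoder.
        return self.decoders[target](shared_encoder(face))


# Three identities share the same encoder; adding a person adds a decoder.
model = CombSketch({
    "A": make_decoder(0.4),
    "B": make_decoder(0.8),
    "C": make_decoder(0.1),
})

performance_by_a = [0.2, 0.4, 0.6]
swapped = model.swap(performance_by_a, "B")
print(swapped)
```

Under this assumed layout, any source performance can be decoded as any trained identity, which is one way to read the 'multi-way' claim.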

Key to the high level of quality is the 'landmark stabilization algorithm,' which Disney researchers describe as a 'crucial' aspect of dealing with high-resolution content. Though this isn't the first instance of face-swapping in footage, the study points out that existing methods used to generate characters like the young Carrie Fisher in Rogue One are both time-intensive and quite expensive.
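
The article does not describe how the landmark stabilization works. Purely as a generic illustration of the problem it addresses, and not as Disney's algorithm, the sketch below smooths jittery facial landmark positions with a centered moving average across neighboring frames. The function name `smooth_landmarks` and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def smooth_landmarks(frames: List[List[Point]], window: int = 3) -> List[List[Point]]:
    """Average each landmark over a window of neighboring frames.

    frames[t][k] is the (x, y) position of landmark k in frame t.
    Windows are truncated at the start and end of the sequence.
    """
    half = window // 2
    smoothed: List[List[Point]] = []
    for t in range(len(frames)):
        lo = max(0, t - half)
        hi = min(len(frames), t + half + 1)
        neighbors = frames[lo:hi]
        frame_out: List[Point] = []
        for k in range(len(frames[t])):
            xs = [frame[k][0] for frame in neighbors]
            ys = [frame[k][1] for frame in neighbors]
            frame_out.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        smoothed.append(frame_out)
    return smoothed


# One landmark that jumps 3 pixels to the right for a single frame.
jittery = [[(10.0, 10.0)], [(13.0, 10.0)], [(10.0, 10.0)]]
stable = smooth_landmarks(jittery)
print(stable)
```

At high resolutions even a shift of a few pixels in the detected landmarks can show up as visible jitter in the aligned face, which may be why the researchers call stabilization 'crucial'.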

Artificial intelligence has the potential to change this, ultimately enabling creators to rapidly generate computer characters using live-action footage and input images of the target. Generating realistic faces remains a big problem, however: results that fall short land in what is referred to as the 'uncanny valley,' which limits the use of this tech.

This makes Disney's new technology particularly exciting, teasing a future in which creators will be able to generate photo-realistic, high-resolution, temporally-stable face swaps between two people. The researchers explain:

'As our system is also capable of multi-way swaps -- allowing any pair of performances and appearances in our data to be swapped -- the possible benefits to visual effects are extensive, all at a fraction of the time and expense required using more traditional methods.'

The study compares the face-swapping results from this new method to the results from existing algorithms, including DeepFaceLab and DeepFakes. Though the other algorithms were able to produce casually convincing results, they were unable to pass scrutiny and, in some cases, were either excessively blended or outright bizarre and uncanny.

[Image: instances of failed face swapping]

In comparison, the face swaps generated using the new method were realistic and maintained a high level of sharpness and detail at a 1024 x 1024 resolution, avoiding the soft, blurry results often seen when using DeepFakes. The researchers also note that DeepFakes has such heavy processing requirements that it was only able to generate output at a resolution of 128 x 128 pixels using an 11GB GPU.
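
To put the two resolutions in perspective, a quick calculation compares the pixel counts. The helper name `pixel_count` is just for illustration.

```python
def pixel_count(width: int, height: int) -> int:
    """Total number of pixels in an image of the given size."""
    return width * height


new_method = pixel_count(1024, 1024)  # resolution reported for the new method
deepfakes = pixel_count(128, 128)     # resolution reported for DeepFakes
ratio = new_method // deepfakes

print(new_method)  # 1048576, just over one megapixel
print(deepfakes)   # 16384
print(ratio)       # 64
```

In other words, each frame from the new method carries 64 times as many pixels as the DeepFakes output it was compared against.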

When using morphable models, the researchers were able to increase the resolution to 500 x 500 pixels, but the results were typically unrealistic. Beyond that, the researchers had to train the conventional models separately for each pair of faces being swapped, whereas the new algorithm could be trained simultaneously on all of the people used for the various face swaps.
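
A short count shows why per-pair training scales poorly. The sketch assumes that one conventional model covers one unordered pair of people (both swap directions). The article does not state this, and the names of the people are placeholders.

```python
from itertools import combinations, permutations
from typing import List


def pairwise_models_needed(people: List[str]) -> int:
    """Models to train if every unordered pair needs its own model (assumption)."""
    return len(list(combinations(people, 2)))


def swap_directions(people: List[str]) -> int:
    """Distinct source-to-target swaps available among the same people."""
    return len(list(permutations(people, 2)))


cast = ["A", "B", "C", "D", "E"]
per_pair = pairwise_models_needed(cast)
directions = swap_directions(cast)

print(per_pair)    # 10 separately trained models under the per-pair assumption
print(directions)  # 20 swaps that a single multi-way model would have to cover
```

With five people, the per-pair approach would need ten training runs under this assumption, while a single jointly trained model would serve all twenty swap directions.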

However, the study points out that the new algorithm presents one big limitation also experienced by other, more conventional methods: the original head shape is maintained. Though the face swap may be very realistic, the face itself may not match the head shape properly, resulting in a generated character that looks a bit 'off' from what is expected.

Future research may result in a method for transferring the subject's head shape in addition to their face, producing not only photo-realistic results, but also the correct overall appearance for a digitally-recreated actor. The most obvious use for this technology is in film and television, enabling studios to quickly and cheaply (relatively speaking) create digital likenesses of aging or deceased actors.

This technology joins a growing body of research on face-swapping and model-generating algorithms that focus on still images rather than videos. NVIDIA, for example, published a study in late 2018 that demonstrated the generation of photo-realistic portraits of AI-generated people from source and target images of real people.

Around a year later, the same company published new research that performed a similar face swap, but one involving dogs instead of humans. We've already seen the use of these various AI technologies reach the consumer level -- Let's Enhance 2.0, for example, recently introduced a new feature that utilizes machine learning to reconstruct the faces of subjects in low-resolution images.

As for the new study from Disney Research Studios and ETH Zurich, the full paper (PDF) can be found on Disney's website.