'A blue jay standing on a large basket of rainbow macarons.' Credit: Google

About a month after OpenAI announced DALL-E 2, its latest AI system to create images from text, Google has continued the AI "space race" with its own text-to-image diffusion model, Imagen. Google's results are extremely, perhaps even scarily, impressive.

On FID, a standard measure of image quality where lower scores are better, Google's Imagen outpaces OpenAI's DALL-E 2 with a score of 7.27 on the COCO dataset, even though Imagen was not trained on COCO. Imagen also bests DALL-E 2 and other competing text-to-image methods among human raters. You can read the full testing results in Google's research paper.
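
For readers unfamiliar with the metric, FID (Fréchet Inception Distance) is commonly defined as below. This is the standard textbook definition, not something taken from Google's paper: it compares the mean and covariance of image features from real images (subscript r) against those from generated images (subscript g), and a smaller value means the two sets look more alike.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```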

'The Toronto skyline with Google brain logo written in fireworks.'

Imagen takes a natural language text input, like 'A Golden Retriever dog wearing a blue checkered beret and red dotted turtleneck,' and uses a frozen T5-XXL encoder to turn that text into embeddings. A 'conditional diffusion model' then maps the text embedding into a small 64x64 image. Finally, text-conditional super-resolution diffusion models upsample the 64x64 image to 256x256 and then to 1024x1024.
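
To make the three stages easier to picture, here is a minimal Python sketch that only traces how the data changes shape from text to a 1024x1024 image. It is not Google's code, which has not been released. Every function name (`encode_text`, `base_model`, `super_resolve`, `generate`) and the embedding width are hypothetical stand-ins, and random noise plus simple pixel repetition take the place of the real encoder and diffusion models.

```python
import numpy as np

EMBED_DIM = 16  # illustrative value only, not the real T5-XXL embedding width


def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for the frozen text encoder: one vector per whitespace token."""
    tokens = prompt.split()
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(tokens), EMBED_DIM))


def base_model(text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the text-conditional base diffusion model.

    Returns a 64x64 RGB array of random values; a real model would
    iteratively denoise an image while conditioning on text_emb.
    """
    rng = np.random.default_rng(1)
    return rng.random((64, 64, 3))


def super_resolve(image: np.ndarray, text_emb: np.ndarray, size: int) -> np.ndarray:
    """Stand-in for a text-conditional super-resolution stage.

    Uses nearest-neighbour pixel repetition instead of a diffusion model,
    so text_emb is accepted but unused here.
    """
    factor = size // image.shape[0]
    return image.repeat(factor, axis=0).repeat(factor, axis=1)


def generate(prompt: str):
    """Run the cascade: text -> embeddings -> 64x64 -> 256x256 -> 1024x1024."""
    emb = encode_text(prompt)
    img64 = base_model(emb)
    img256 = super_resolve(img64, emb, 256)
    img1024 = super_resolve(img256, emb, 1024)
    return img64, img256, img1024


if __name__ == "__main__":
    for img in generate("A Golden Retriever dog wearing a blue checkered beret"):
        print(img.shape)
```

The point of the cascade is that the hardest step, turning text into a coherent picture, happens at low resolution, and the later stages only have to add detail.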

Compared to NVIDIA's GauGAN2 method from last fall, Imagen is significantly more flexible and produces better results. AI is progressing rapidly. Consider the image below, generated from 'a cute corgi lives in a house made out of sushi.' It looks believable, like someone really built a dog house from sushi that the corgi, perhaps unsurprisingly, loves.

'A cute corgi lives in a house made out of sushi.'

It's a cute creation. Seemingly all of what we've seen so far from Imagen is cute. Funny outfits on furry animals, cactuses with sunglasses, swimming teddy bears, royal raccoons, etc. Where are the people?

Whether innocent or ill-intentioned, we know that some users would immediately start typing in all sorts of phrases about people as soon as they had access to Imagen. I'm sure there'd be a lot of text inputs about adorable animals in humorous situations, but there'd also be input text about chefs, athletes, doctors, men, women, children, and much more. What would these people look like? Would doctors mostly be men, would flight attendants mostly be women, and would most people have light skin?

'A robot couple fine dining with Eiffel Tower in the background.' What would this couple look like if the text didn't include the word 'robot'?

We don't know how Imagen handles these text strings because Google has elected not to show any people. There are ethical challenges with text-to-image research. If a model can conceivably create just about any image from text, how good is that model at presenting unbiased results? AI models like Imagen are largely trained using datasets scraped from the web. Content on the internet is skewed and biased in ways that we are still trying to understand fully. These biases have negative societal impacts worth considering and, ideally, rectifying. Not just that, but Google used the LAION-400M dataset for Imagen, which is known to 'contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.' A subset of the training data was filtered to remove noise and 'undesirable' content, but there remains a 'risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.'

The text strings can become quite complicated. 'A marble statue of a koala DJ in front of a marble statue of a turntable. The koala is wearing large marble headphones.'

So no, you can't access Imagen for yourself. On its website, Google lets you click on specific words from a selected group to see results, like 'a photo of a fuzzy panda wearing a cowboy hat and a black leather jacket playing a guitar on top of a mountain,' but you can't search for anything to do with people or potentially problematic actions or items. If you could, you'd find that the model tends to generate images of people with lighter skin tones and reinforce traditional gender roles. Early research also indicates that Imagen reflects cultural biases through its depiction of certain items and events.

'A Pomeranian is sitting on the Kings throne wearing a crown. Two tiger soldiers are standing next to the throne.'

We know Google is aware of representation issues across its wide range of products and is working on improving realistic skin tone representation and reducing inherent biases. However, AI is still a 'Wild West' of sorts. While there are many talented, thoughtful people behind the scenes generating AI models, a model is basically on its own once unleashed. Depending upon the dataset used to train the model, it's difficult to predict what will happen when users can type in anything they want.

'A dragon fruit wearing karate belt in the snow.'

It's not Imagen's fault, or the fault of any other AI models that have struggled with the same problem. Models are being trained using massive datasets that contain visible and hidden biases, and these problems scale with the model. Even beyond marginalizing specific groups of people, AI models can generate very harmful content. If you asked an illustrator to draw or paint something horrific, many would turn you away in disgust. Text-to-image AI models don't have moral qualms and will produce anything. It's a problem, and it's unclear how it can be addressed.

'Teddy bears swimming at the Olympics 400m Butterfly event.'

In the meantime, as AI research teams grapple with the societal and moral implications of their extremely impressive work, you can look at eerily realistic photos of skateboarding pandas, but you can't input your own text. Imagen is not available to the public, and neither is its code. However, you can learn a lot about the project in a new research paper.


All images courtesy of Google