
Elegant Image Domain Translation: A Diffusion-based Mechanism with Label Guidance

This post introduces our research on image domain translation for semantic segmentation.

Imagine teaching a computer to recognize and understand every tiny detail in a photo. This is what we call semantic segmentation – a crucial task in computer vision. In the last decade, researchers have made impressive strides in this field, but there’s a catch. To teach the computer, we need loads of detailed annotations for each pixel in an image, and getting those annotations is a real headache – it’s time-consuming and requires a lot of effort.

To tackle this issue, some smart folks came up with an idea. Instead of sweating over manual annotations, why not use computer-generated datasets? Think of rendering scenes in a video game like GTA5, where images and pixel-level labels can be produced far more cheaply and quickly than by hand. But here’s the twist – models trained on these synthetic datasets struggle when faced with real photos. It’s like learning to play a game really well but being clueless when you step into the real world. As shown in Figure 1, an AI model trained on gaming images (the source domain) fails to function properly in a real-world environment (the target domain).

Figure 1. Performance comparison of the source-domain model in the source domain and the target domain, respectively.

The Problem with Existing Image Domain Translation Methods

People have tried different tricks to solve this. Some focus on aligning the features of gaming and real images so the model learns representations that work in both domains, with good results. Others automatically create pseudo labels for the real-world images, which also does pretty well. Then there’s this cool idea called image translation: turning gaming images into realistic-looking ones and training the model on these translated images together with the original gaming labels.

But here’s the hiccup – most of these methods rely on generative adversarial networks (GANs) for image translation. GANs are notoriously hard to train and can deliver unstable results. In particular, they struggle to preserve the tiny details of the source image, which causes mismatches between the translated images and their labels. So, despite the efforts with GANs, there’s a roadblock in making image translation work smoothly for domain-adaptive semantic segmentation (DASS). Surprisingly, few people have explored alternatives beyond GANs.

Inspiration from DDPMs

Denoising Diffusion Probabilistic Models (DDPMs), also called diffusion models, have recently emerged as a promising alternative to GANs. They’re becoming popular for all sorts of tasks – generating images, restoring degraded pictures, and editing how images look. Why are they getting all the attention? Their training boils down to a simple denoising objective, which makes the learning process smoother, more stable, and more reliable than adversarial training. Inspired by these strengths, we came up with a plan: use a diffusion model to drive the translation process while keeping all the tiny details intact – like an artist who pays attention to every brushstroke.
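To make the "simple denoising objective" point concrete, here is a minimal sketch of the standard DDPM training loss in PyTorch. It is not our actual code; the tiny convolutional noise predictor, the 32x32 toy images, and the linear noise schedule are illustrative assumptions (a real model is a U-Net that is also conditioned on the timestep).

```python
# Minimal sketch of the standard DDPM training objective (illustrative only).
import torch
import torch.nn as nn

T = 1000                                        # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

# Toy noise predictor; a real DDPM uses a U-Net that also takes the timestep t.
eps_theta = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
                          nn.Conv2d(64, 3, 3, padding=1))

def ddpm_loss(x0):
    """Sample a random timestep, diffuse x0 to x_t, and regress the injected noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)
    return nn.functional.mse_loss(eps_theta(x_t), noise)  # simple denoising loss

loss = ddpm_loss(torch.randn(4, 3, 32, 32))  # e.g. a batch of toy 32x32 images
loss.backward()
```

Because every update is an ordinary regression step, there is no adversarial min-max game to balance, which is the main reason diffusion training tends to be more stable than GAN training.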

Figure 2. Strategy comparison between previous methods and ours.

As shown in Figure 2 (left), many researchers before us trained translation models using only images. However, source images and target images are unpaired, which makes such translation models tricky to train. We realized that the pixel-level semantic labels from the source domain (where we’re learning) are a goldmine for preserving source details during translation. So we came up with a simple idea – train a translation model that listens to these semantic labels and translates images accordingly, as shown in Figure 2 (right); a generic sketch of what "listening to labels" can mean follows below.
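One generic way a denoiser can "listen" to semantic labels is to concatenate a one-hot label map with the noisy image as input. The sketch below is only an illustrative baseline under assumed names and shapes (e.g. a 19-class, Cityscapes-style label set), not the SGG/PTL design described in the next section.

```python
# Generic sketch of label-conditioned denoising: the one-hot label map is
# concatenated with the noisy image so the network can "listen" to the labels.
# Illustrative baseline only, not the paper's SGG/PTL modules.
import torch
import torch.nn as nn

NUM_CLASSES = 19  # assumed label set size (e.g. Cityscapes-style classes)

class LabelConditionedDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # Input channels: 3 (noisy RGB image) + NUM_CLASSES (one-hot label map).
        self.net = nn.Sequential(
            nn.Conv2d(3 + NUM_CLASSES, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, x_t, labels):
        onehot = nn.functional.one_hot(labels, NUM_CLASSES)  # (B, H, W, C)
        onehot = onehot.permute(0, 3, 1, 2).float()          # (B, C, H, W)
        return self.net(torch.cat([x_t, onehot], dim=1))     # predict the noise

model = LabelConditionedDenoiser()
x_t = torch.randn(2, 3, 64, 64)                      # noisy images
labels = torch.randint(0, NUM_CLASSES, (2, 64, 64))  # per-pixel class ids
eps_pred = model(x_t, labels)                        # (2, 3, 64, 64)
```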

Our Diffusion-based Method

Based on this idea, our paper proposes two diffusion-based modules – Semantic Gradient Guidance (SGG) and Progressive Translation Learning (PTL). SGG is like a GPS that steers the diffusion translation process using the pixel-level source-domain labels, and PTL is the coach that helps SGG work smoothly across domains.
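For intuition, below is a rough sketch of what label-driven gradient guidance can look like during reverse diffusion, in the spirit of classifier guidance: at each step, the current sample is nudged by the gradient of a segmentation loss computed against the source labels. The helpers `denoise_step`, `segmenter`, and `guidance_scale` are hypothetical placeholders; the precise SGG and PTL formulations are in the paper.

```python
# Rough sketch of label-driven gradient guidance during sampling, in the spirit
# of classifier guidance; not the paper's exact SGG/PTL formulation.
import torch
import torch.nn.functional as F

def guided_step(x_t, t, labels, denoise_step, segmenter, guidance_scale=1.0):
    """One reverse-diffusion step, nudged toward agreement with the source labels."""
    x_t = x_t.detach().requires_grad_(True)
    logits = segmenter(x_t)                    # segment the current noisy sample
    loss = F.cross_entropy(logits, labels)     # mismatch with the source labels
    grad = torch.autograd.grad(loss, x_t)[0]   # d(loss) / d(x_t)
    x_prev = denoise_step(x_t, t)              # ordinary reverse-diffusion update
    return x_prev - guidance_scale * grad      # pull toward label-consistent content

# Dummy usage with stand-in components (purely to show the call signature).
x = torch.randn(1, 3, 32, 32)
y = torch.randint(0, 19, (1, 32, 32))
seg = torch.nn.Conv2d(3, 19, 1)                # stand-in "segmenter"
step = lambda x_t, t: x_t * 0.99               # stand-in denoising update
x_next = guided_step(x, torch.tensor([10]), y, step, seg)
```

The appeal of this kind of guidance is that the label information steers every denoising step, so fine structures named by the labels are much less likely to drift during translation.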

Figure 3. Results comparison between previous methods and ours.

With these designs on our side, as shown in Figure 3, our diffusion framework handles image translation for DASS with a much finer touch than GAN-based methods, preserving details that GANs tend to lose. For more details, please refer to our ICCV paper.

Paper Link: Here

Author