
Rethinking training of 3D GANs

Summary

We are witnessing a surge of works on building and improving 3D-aware generators. To induce a 3D bias, such models rely on volumetric rendering, which is expensive to employ at high resolutions. The dominant strategy for addressing the scaling issue is to train a separate 2D decoder to upsample a low-resolution, volumetrically rendered representation. But this solution comes at a cost: not only does it break multi-view consistency (e.g., shape and texture change when the camera moves), but it also learns the geometry at low fidelity. In this work, we take a different route to 3D synthesis and develop a generator that does not rely on an upsampler, achieves state-of-the-art image quality with high-resolution geometry, and trains 2.5× faster. For this, we revisit and improve patch-based optimization in two ways. First, we design a location- and scale-aware discriminator by modulating its filters with a hypernetwork. Second, we modify the patch sampling strategy, basing it on an annealed beta distribution, to stabilize training and accelerate convergence. We train on four datasets (two introduced in this work) at 256² and 512² resolution directly, without a 2D upsampler, and our model attains better or comparable FID and higher-fidelity geometry than the current SotA.
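To make the two patch-based ingredients above concrete, here are minimal PyTorch/NumPy sketches. They illustrate the ideas under stated assumptions rather than reproduce the paper's exact architecture or hyperparameters; all names and constants (s_min, a_max, hidden sizes) are placeholders.

First, annealed beta sampling of patch scales. The assumption here is that a patch scale s in [s_min, 1] is drawn from a Beta distribution that starts concentrated near the full-image scale s = 1 and is annealed toward uniform over [s_min, 1], so the discriminator first sees global structure and gradually more high-resolution detail patches:

    import numpy as np

    def sample_patch_scale(t, s_min=0.125, a_max=12.0):
        """Sample a patch scale s in [s_min, 1] from an annealed beta distribution.

        t in [0, 1] is the training progress. Beta(a, 1) has density
        a * u**(a - 1), i.e. it concentrates near u = 1 for large a, so
        annealing a from a_max down to 1 moves the distribution from
        "almost always the full-image scale" toward uniform. The schedule
        and constants are illustrative, not the paper's exact values.
        """
        a = a_max * (1.0 - t) + t                 # anneal Beta(a_max, 1) -> Beta(1, 1)
        u = np.random.beta(a, 1.0)
        return s_min + (1.0 - s_min) * u          # map to [s_min, 1]

Second, a location- and scale-aware discriminator layer: a small hypernetwork maps the patch parameters (scale plus a 2D offset) to a per-channel factor that modulates shared convolution weights, StyleGAN2-style, so one discriminator can specialize to patches at different scales and locations:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchModulatedConv(nn.Module):
        """3x3 conv whose filters are modulated per sample by patch parameters.

        A hypernetwork turns (scale, offset_x, offset_y) into a per-input-channel
        multiplier on the shared weights. Sketch only; layer sizes are assumptions.
        """
        def __init__(self, in_ch, out_ch, patch_dim=3, hidden=64):
            super().__init__()
            self.out_ch = out_ch
            self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.02)
            self.hyper = nn.Sequential(
                nn.Linear(patch_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, in_ch),
            )

        def forward(self, x, patch_params):           # x: (b, in_ch, h, w)
            b, in_ch, h, w = x.shape
            mod = 1.0 + self.hyper(patch_params)       # (b, in_ch)
            w_mod = self.weight[None] * mod[:, None, :, None, None]
            # grouped-conv trick: apply a different kernel to each sample
            x = x.reshape(1, b * in_ch, h, w)
            w_mod = w_mod.reshape(b * self.out_ch, in_ch, 3, 3)
            out = F.conv2d(x, w_mod, padding=1, groups=b)
            return out.reshape(b, self.out_ch, h, w)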

Note: please use the latest version of Chrome/Chromium or Safari to watch the videos (alternatively, you can download a video and watch it offline). Some videos may display incorrectly in other web browsers (e.g., Firefox).


Random samples on FFHQ


Random samples on Cats


Random samples on Megascans Plants


Random samples on Megascans Food


Latent interpolations on Megascans Plants


Latent interpolations on Megascans Food


Background separation

In contrast to upsampler-based models, our generator is purely NeRF-based, so it can directly incorporate advancements from the NeRF literature. In this example, we simply copy-pasted the code from NeRF++ for background separation via the inverse sphere parametrization. For this experiment, we did not use pose conditioning in the discriminator (which we use for FFHQ and Cats to avoid flat surfaces; otherwise we run into the same issues as EG3D and GRAM) and found that when background separation is enabled, the model learns to produce non-flat surfaces on its own, i.e. without direct guidance from the discriminator.
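For reference, the inverse sphere parametrization itself is tiny. In NeRF++, a background point outside the unit sphere at radius r > 1 is represented by its unit direction together with the inverse radius, so all four coordinates stay bounded and a separate background network can be queried on a compact domain. Below is a minimal sketch of that mapping; the wiring into our generator follows the NeRF++ code mentioned above:

    import torch

    def inverted_sphere_param(points):
        """Map background points (radius r > 1) to NeRF++'s (x/r, y/r, z/r, 1/r).

        The unit direction plus the inverse radius 1/r in (0, 1) gives a
        bounded 4D input for the background NeRF. points: (..., 3) tensor.
        """
        r = points.norm(dim=-1, keepdim=True)             # (..., 1), assumed > 1
        return torch.cat([points / r, 1.0 / r], dim=-1)   # (..., 4)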