MatFuse: Controllable Material Generation with Diffusion Models

Architectural details.

Multi-Encoder. The multi-encoder architecture processes each SVBRDF map independently, learning a separate latent representation for each map and thereby decoupling the different information the maps contain (reflectance properties and geometry). In particular, we split the original encoder, with a latent codebook size of 16,384, into four smaller encoders, each with a codebook of 16,384/4 = 4,096 entries. This improves reflectance map reconstruction, as highlighted in the ablation study. The slightly increased computation cost only affects training, as the encoder is not used during inference. Each encoder in this setup has a smaller architecture than the single encoder. The multi-encoder compression model has a total of 156M parameters.
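
The codebook split described above is simple arithmetic, and the following minimal sketch makes it explicit. Only the two constants come from the text; the helper names split_codebook and index_bits are hypothetical and are not taken from the MatFuse code.

```python
# Illustrative sketch only: helper names are hypothetical, not from MatFuse.
FULL_CODEBOOK = 16_384  # codebook size of the original single encoder (from the text)
NUM_MAPS = 4            # one encoder per SVBRDF map (from the text)


def split_codebook(total_entries: int, num_encoders: int) -> int:
    """Return the per-encoder codebook size when the budget is split evenly."""
    if total_entries % num_encoders != 0:
        raise ValueError("codebook size must be divisible by the number of encoders")
    return total_entries // num_encoders


def index_bits(num_entries: int) -> int:
    """Number of bits needed to address one entry of a codebook."""
    return (num_entries - 1).bit_length()


per_map = split_codebook(FULL_CODEBOOK, NUM_MAPS)
print(per_map)                    # entries per map-specific codebook
print(index_bits(FULL_CODEBOOK))  # bits per index, single codebook
print(index_bits(per_map))        # bits per index, per-map codebook
```

Under this even split, the total number of codebook entries across the four encoders equals that of the single-encoder baseline.
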
Condition Encoder. The condition encoder is composed of five heads, one for each condition, and produces two embeddings: a 2D embedding that represents the spatial conditions and a 1D vector for the global conditions. For the text and image global embeddings, we use CLIP with a ViT-B-16 backbone, with ∼150M parameters. The palette encoder is an MLP with two linear layers and a GELU activation, with ∼160K parameters. Spatial conditions are encoded using a series of convolutional layers with the same compression factor as the autoencoder, f = 8; each spatial encoder adds ∼395K parameters. In total, the condition encoder has about 150M parameters, of which fewer than 1M are trainable.
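
As a rough consistency check on the parameter budget above, the sketch below tallies the approximate counts quoted in the text. It rests on several assumptions that the text does not state: that CLIP is kept frozen (suggested by the trainable total), that two of the five heads are spatial, that the convolutional downsampling uses stride-2 stages, and that the input is 256 pixels wide. All function names are hypothetical.

```python
# Rough tally using the approximate counts quoted in the text.
CLIP_PARAMS = 150_000_000  # ~150M, assumed frozen (not stated explicitly)
PALETTE_PARAMS = 160_000   # ~160K, two-layer MLP
SPATIAL_PARAMS = 395_000   # ~395K per spatial encoder


def trainable_params(num_spatial_heads: int) -> int:
    """Approximate trainable parameters if only the small heads are trained."""
    return PALETTE_PARAMS + num_spatial_heads * SPATIAL_PARAMS


def total_params(num_spatial_heads: int) -> int:
    """Approximate total, including the (assumed frozen) CLIP backbone."""
    return CLIP_PARAMS + trainable_params(num_spatial_heads)


def num_stride2_stages(f: int) -> int:
    """Stride-2 stages needed to reach compression factor f (assumes stride 2)."""
    stages = f.bit_length() - 1
    if f < 1 or 2 ** stages != f:
        raise ValueError("f must be a power of two")
    return stages


def latent_side(input_side: int, f: int) -> int:
    """Side length of the 2D spatial embedding for a square input."""
    return input_side // f


# Hypothetical configuration: two spatial heads, 256-pixel input.
print(trainable_params(2))    # approximate trainable parameters
print(total_params(2))        # approximate total parameters
print(num_stride2_stages(8))  # stages for f = 8
print(latent_side(256, 8))    # spatial embedding side length
```

With two spatial heads, the trainable total stays below 1M and the overall count remains about 150M, which agrees with the figures quoted above; a different number of spatial heads would change the tally.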