Multi-Encoder.
The multi-encoder architecture processes each SVBRDF map independently, learning a dedicated latent representation per map and thereby decoupling the different kinds of information the maps carry (reflectance properties and geometry). In particular, we split the original encoder, which has a latent codebook of size 16,384, into four smaller encoders, each with a codebook of 16,384/4 = 4,096 entries. This improves reflectance map reconstruction, as highlighted in the ablation study. The slightly increased computation cost affects only training, since the encoder is not used during inference. Each encoder in this setup is smaller than the original single encoder; the multi-encoder compression model has a total of 156M parameters.
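The codebook split can be sketched as a standard vector-quantization lookup with one per-map codebook. This is a minimal illustration, not the paper's implementation: the map names, latent dimensionality, and random codebooks are placeholder assumptions.

```python
# Sketch: splitting one 16,384-entry codebook into four per-map codebooks
# of 16,384 / 4 = 4,096 entries each. Map names and DIM are illustrative.
import numpy as np

MAPS = ["diffuse", "normal", "roughness", "specular"]  # assumed SVBRDF maps
TOTAL_CODES, DIM = 16_384, 4  # DIM is a toy latent size for this sketch

rng = np.random.default_rng(0)
codebooks = {m: rng.normal(size=(TOTAL_CODES // len(MAPS), DIM)) for m in MAPS}

def quantize(latent, codebook):
    """Return the index of the nearest codebook entry (standard VQ lookup)."""
    dists = np.linalg.norm(codebook - latent, axis=1)
    return int(np.argmin(dists))

# Each map is quantized against its own codebook, so the four latent
# representations stay decoupled.
latents = {m: rng.normal(size=DIM) for m in MAPS}
indices = {m: quantize(latents[m], codebooks[m]) for m in MAPS}
```

Because each map owns its codebook, codes never mix reflectance and geometry information across maps, which is the decoupling the ablation study measures.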
Condition Encoder.
The condition encoder is composed of five heads, one per condition, and produces two embeddings: a 2D embedding representing the spatial conditions and a 1D vector for the global conditions. For the global text and image embeddings, we use CLIP with a ViT-B-16 backbone (∼150M parameters). The palette encoder is an MLP with two linear layers and a GELU activation, totaling ∼160K parameters. Spatial conditions are encoded by a series of convolutional layers with the same compression factor as the autoencoder, f = 8, each head adding ∼395K parameters. In total, the condition encoder has about 150M parameters, of which fewer than 1M are trainable.