Publications
2024
- Explicitly Disentangled Representations in Object-Centric Learning. Riccardo Majellaro, Jonathan Collu, Aske Plaat, and Thomas M. Moerland. arXiv preprint, 2024.
Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have attracted growing interest. In this context, enhancing the robustness of the latent features can improve the efficiency and effectiveness of training on downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are fixed a priori, i.e., before training begins. Experiments on a range of object-centric benchmarks reveal that our approach achieves the desired disentanglement while also improving on baseline performance in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes.
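The core idea of a fixed, a-priori partition of each slot's latent dimensions can be illustrated with a minimal sketch. This is not the paper's code: the slot dimensionality, the split point, and the `swap_texture` helper are all hypothetical, chosen only to show how texture transfer between objects follows directly from non-overlapping subsets.

```python
import numpy as np

# Hypothetical illustration (not the authors' implementation): each object
# slot's latent vector is partitioned, before training, into two
# non-overlapping subsets, one biased toward shape, one toward texture.
SLOT_DIM = 64
SHAPE_DIMS = slice(0, 32)     # assumed split point, fixed a priori
TEXTURE_DIMS = slice(32, 64)

def swap_texture(slot_a, slot_b):
    """Keep slot_a's shape subset, but copy slot_b's texture subset."""
    out = slot_a.copy()
    out[TEXTURE_DIMS] = slot_b[TEXTURE_DIMS]
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=SLOT_DIM)  # e.g. a red cube's slot
b = rng.normal(size=SLOT_DIM)  # e.g. a striped sphere's slot
c = swap_texture(a, b)         # cube shape with the sphere's texture
assert np.allclose(c[SHAPE_DIMS], a[SHAPE_DIMS])
assert np.allclose(c[TEXTURE_DIMS], b[TEXTURE_DIMS])
```

Because the subsets are disjoint and known in advance, texture editing reduces to overwriting a fixed slice of the latent vector, with no learned disentanglement probe required.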
- Slot Structured World Models. Jonathan Collu, Riccardo Majellaro, Aske Plaat, and Thomas M. Moerland. arXiv preprint, 2024.
The ability to perceive and reason about individual objects enables humans to build a robust understanding of the environment and its dynamics. Replicating such abilities in artificial systems would represent a significant milestone toward building intelligent agents. Contrastive Learning of Structured World Models (C-SWMs) took a step in this direction, proposing an unsupervised approach that embeds images as compositions of individual object representations and models their pairwise relationships. However, the proposed architecture relies on an encoder that cannot disambiguate distinct objects with the same visual features, and the method has only been tested in settings where encoding just the object position and velocity was sufficient to learn the dynamics of the environment. To address these limitations, we introduce Slot Structured World Models (SSWMs), a class of world models that augments C-SWMs with a pretrained object-centric encoder. We further propose a version of the Spriteworld environment that includes simple physics to challenge these models. Quantitative and qualitative measures show that the proposed method outperforms the baseline in this environment, although it still presents severe limitations in multi-step prediction.
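The C-SWM-style transition model that SSWMs build on can be sketched abstractly: each object slot's next state is predicted from its own state, the action, and aggregated pairwise messages from the other slots. The sketch below is a hypothetical, simplified numpy version (the real models use learned neural networks for `edge_fn` and `node_fn`; the linear maps here are placeholders).

```python
import numpy as np

# Hypothetical sketch (not the papers' code): a graph-style transition
# model over K object slots, in the spirit of C-SWM. Each slot receives
# summed pairwise messages from all other slots, then a node function
# maps (state, messages, action) to a predicted state delta.
def transition(slots, action, node_fn, edge_fn):
    """slots: (K, D) array of object latents; returns predicted next slots."""
    K, D = slots.shape
    messages = np.zeros((K, D))
    for i in range(K):
        for j in range(K):
            if i != j:
                messages[i] += edge_fn(slots[i], slots[j])
    deltas = np.stack([node_fn(slots[k], messages[k], action) for k in range(K)])
    return slots + deltas  # residual update of each slot

# Toy usage with random linear edge/node functions (placeholders for MLPs).
rng = np.random.default_rng(0)
K, D, A = 3, 4, 2
slots = rng.normal(size=(K, D))
action = rng.normal(size=A)
W_edge = rng.normal(size=(2 * D, D)) * 0.1
W_node = rng.normal(size=(2 * D + A, D)) * 0.1
edge_fn = lambda s_i, s_j: np.concatenate([s_i, s_j]) @ W_edge
node_fn = lambda s, m, a: np.concatenate([s, m, a]) @ W_node
next_slots = transition(slots, action, node_fn, edge_fn)
assert next_slots.shape == (K, D)
```

The swap SSWMs make, per the abstract, is at the encoder side: the slots fed into such a transition model come from a pretrained object-centric encoder rather than C-SWM's original CNN, so that visually identical objects can still be assigned to distinct slots.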