Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Teaser Image

Figure 1. From prediction to correspondence: rethinking temporal consistency in video object-centric learning. Top: The predictive paradigm uses content-blind initialization without spatial priors. Temporal consistency relies on learned dynamics modules that forecast future slot states through high-capacity Transformers or recurrent networks. Bottom: Grounded Correspondence eliminates learned temporal parameters entirely. Slots initialize from saliency peaks in frozen backbone features. Frame-to-frame identity is maintained through parameter-free Hungarian matching on slot representations. This paradigm shift achieves competitive performance with zero learnable parameters for temporal modeling.

Abstract

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS.

Rethinking Slot Initialization

Slot Attention partitions visual tokens through iterative competitive grouping. In video object-centric learning, slots for frame t are refined through multiple attention iterations conditioned on the state from frame t-1. Current methods typically apply three iterations for the initial frame and two iterations for subsequent frames to ensure stable decompositions. This iterative overhead stems from architectural choices rather than fundamental requirements of the binding mechanism. When slot initialization exploits the structural priors already present in vision backbone features, the grouping process converges substantially faster.
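The grounded initialization described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the saliency proxy (mean feature dissimilarity), the suppression threshold of 0.7, and the function name `grounded_slot_init` are all assumptions introduced for the example.

```python
import numpy as np

def grounded_slot_init(features, num_slots):
    """Initialize slots from saliency peaks in frozen backbone features.

    features: (N, D) array of patch embeddings from a frozen backbone.
    Saliency here is a simple proxy: a patch's (negated) mean cosine
    similarity to all patches, so distinctive patches score high.
    """
    # L2-normalize patch features for cosine similarity.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                       # (N, N) cosine similarities
    saliency = -sim.mean(axis=1)        # distinctive patches = salient

    # Greedy peak picking with suppression of patches similar to each peak.
    slots = []
    taken = np.zeros(len(f), dtype=bool)
    for i in np.argsort(-saliency):
        if taken[i]:
            continue
        slots.append(features[i])       # peak feature seeds one slot
        taken |= sim[i] > 0.7           # suppress near-duplicate patches
        if len(slots) == num_slots:
            break
    return np.stack(slots)
```

Because each slot starts at a distinct object-centered feature rather than random noise, a single competitive-attention iteration is enough to carve out stable masks, which is the behavior Figure 2 illustrates.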

Slot Initialization

Figure 2. Slot Attention convergence with different initialization strategies. Four examples from YouTube-VIS. For each example, left shows content-blind initialization across 3 iterations, right shows grounded initialization with 1 iteration. Content-blind initialization requires multiple iterations to stabilize object boundaries, while grounded initialization achieves stable segmentation in a single iteration.

DINOv2 Features

Figure 3. Emergent object-binding signals in DINOv2 features. Left: Principal component analysis (PCA) visualization of patch embeddings shows coherent spatial clustering by object instance. Right: Saliency map derived from feature similarities (brighter regions indicate higher saliency). Peaks correspond to object centers.
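The PCA visualization in Figure 3 (left) can be reproduced with a few lines of linear algebra. A minimal sketch, assuming patch embeddings are given as a flat (H*W, D) array; the function name `pca_visualize` and the rescale-to-RGB step are illustrative choices, not part of the paper.

```python
import numpy as np

def pca_visualize(features, n_components=3):
    """Project patch embeddings onto their top principal components.

    features: (H*W, D) patch embeddings from a frozen backbone (e.g. DINOv2).
    Returns an (H*W, n_components) array in [0, 1]; mapping three
    components to RGB gives a Figure 3-style visualization.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T
    # Rescale each component to [0, 1] for display as colors.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)
```

Patches belonging to the same instance land close together in this projection, which is why the resulting color map clusters coherently by object.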

Index Permutation

Figure 4. Index permutation in independent discovery. Three YouTube-VIS video sequences. Each row shows consecutive frames from one sequence. Objects are segmented correctly in each frame, but slot assignments (indicated by colors) change randomly across time, demonstrating inconsistent identity tracking.

Prediction or Correspondence?

Existing methods assume that maintaining object identity across frames requires learned predictors that model dynamics. In this section, we test whether temporal consistency can instead be solved through correspondence: matching slot representations between consecutive frames without predicting future states.
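The correspondence alternative amounts to a single bipartite matching per frame pair. A minimal sketch using SciPy's standard Hungarian solver; negative cosine similarity as the matching cost and the function name `match_slots` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots(prev_slots, curr_slots):
    """Assign current-frame slots to previous-frame identities.

    Both inputs are (K, D) slot representations. The cost is negative
    cosine similarity, so the optimal assignment maximizes total
    similarity. Returns a permutation p such that curr_slots[p[k]]
    continues identity k from the previous frame.
    """
    a = prev_slots / (np.linalg.norm(prev_slots, axis=1, keepdims=True) + 1e-8)
    b = curr_slots / (np.linalg.norm(curr_slots, axis=1, keepdims=True) + 1e-8)
    cost = -(a @ b.T)                   # (K, K) negative cosine similarity
    rows, cols = linear_sum_assignment(cost)
    perm = np.empty(len(rows), dtype=int)
    perm[rows] = cols
    return perm
```

This step is deterministic and parameter-free: no dynamics module forecasts the next slot state, identity is simply re-established by matching.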

Table 1. SlotContrast with and without learned temporal prediction on YouTube-VIS. The identity baseline Qt = St-1 matches the performance of the full model. Mean ± standard deviation over 3 seeds.

Temporal Module         ARI ↑      FG-ARI ↑   mBO ↑
Learned predictor       32.1±0.8   36.3±0.2   29.9±0.3
Identity (Qt = St-1)    33.0±1.1   36.6±0.7   30.5±0.6

Hungarian Identity Ratio

Figure 5. Hungarian Identity Ratio on YouTube-VIS equals 1.0 when slots are propagated as identity between frames, indicating that discrete matching suffices for temporal consistency.
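One way to read the metric in Figure 5: the Hungarian Identity Ratio is the fraction of consecutive frame pairs whose optimal matching is the identity permutation. The sketch below is our reconstruction of that reading; the exact cost used in the paper is not specified here, and negative cosine similarity is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_identity_ratio(slot_sequence):
    """Fraction of consecutive frame pairs for which the optimal
    Hungarian matching between slot sets is the identity permutation.

    slot_sequence: list of (K, D) slot arrays, one per frame.
    """
    hits = 0
    for prev, curr in zip(slot_sequence, slot_sequence[1:]):
        a = prev / (np.linalg.norm(prev, axis=1, keepdims=True) + 1e-8)
        b = curr / (np.linalg.norm(curr, axis=1, keepdims=True) + 1e-8)
        # Maximize total cosine similarity via negated cost.
        _, cols = linear_sum_assignment(-(a @ b.T))
        hits += (cols == np.arange(len(cols))).all()
    return hits / (len(slot_sequence) - 1)
```

A ratio of 1.0, as reported in Figure 5, means the matcher never needs to permute slots: propagating them as-is already preserves identity.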

Results

We evaluate our framework on MOVi-D and MOVi-E from the Kubric synthetic dataset, and on the real-world YouTube-VIS 2021 dataset. MOVi-D features high object density with both static and dynamic objects. MOVi-E introduces linear camera movement. YouTube-VIS 2021 consists of unconstrained video sequences.

Table 2. Performance on synthetic MOVi datasets and real-world YouTube-VIS. Grounded Correspondence uses zero learnable parameters for temporal modeling yet demonstrates strong improvements on synthetic benchmarks and competitive results on unconstrained sequences.
Method                    MOVi-D                            MOVi-E                            YouTube-VIS
                          ARI        FG-ARI     mBO        ARI        FG-ARI     mBO        ARI        FG-ARI     mBO
SlotContrast              58.0±0.8   58.0±0.8   30.0±0.4   68.9±0.9   68.9±0.9   26.6±0.6   32.1±0.8   36.3±0.2   29.9±0.3
Grounded Correspondence   73.7±1.7   73.7±1.7   28.4±0.3   75.7±1.4   75.7±1.4   23.4±0.5   30.1±4.4   33.1±1.6   29.3±1.9

MOVi-D Qualitative Results

Figure 6. Qualitative comparison on MOVi-D. Each row shows consecutive frames from one sequence. Top: Grounded Correspondence. Bottom: SlotContrast. Different colors indicate different slot assignments. Grounded Correspondence produces more compact object masks, assigning each object to a single slot. SlotContrast, in contrast, tends to split individual objects into multiple parts: in several sequences, objects that are unified under our method appear as fragmented, multi-colored regions in the baseline. This over-segmentation is consistent with SlotContrast's lower ARI scores on this benchmark.

MOVi-E Qualitative Results

Figure 7. Qualitative comparison on MOVi-E. Each row shows consecutive frames from one sequence. Top: Grounded Correspondence. Bottom: SlotContrast. Different colors indicate different slot assignments. Grounded Correspondence generates compact masks that unify objects and background regions into coherent segments. SlotContrast exhibits fragmentation across both foreground and background, splitting continuous surfaces into multiple disconnected parts. The compact representation achieved by our method contributes to the substantial ARI improvement over the baseline on this benchmark.

YouTube-VIS Qualitative Results

Figure 8. Qualitative comparison on YouTube-VIS. Each row shows consecutive frames from one sequence. Top: Grounded Correspondence. Bottom: SlotContrast. Both methods achieve competitive performance on unconstrained real-world sequences.