Self-Supervised Any-Point Tracking by Contrastive Random Walks

University of Michigan
ECCV 2024

Global Matching Random Walks (GMRW): a simple, self-supervised method for tracking any point in a video.

Abstract

We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem.

We train a global matching transformer to find cycle-consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform “all pairs” comparisons between points allows the model to obtain high spatial precision and a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To make this possible, we introduce a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them.

Our method achieves strong performance on the TAP-Vid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.

Method

We propose a simple and effective self-supervised approach to the Tracking Any Point problem. We adapt the global matching transformer architecture to learn through cycle consistency: tracking forward in time, then backward, should take us back to where we started. In lieu of labeled data, we supervise the model via the contrastive random walk, using the self-attention from global matching to define the transition matrix for a random walk that moves between points in adjacent frames. This “all pairs” matching mechanism lets us define transition matrices over large numbers of points at once, increasing spatial precision and yielding a richer learning signal, since the loss considers many paths through the space-time graph on which the random walk is performed. Additionally, we identify that global matching architectures are susceptible to shortcut solutions (e.g., due to their use of positional encodings), and that previously proposed methods for addressing these shortcuts are insufficient. We therefore propose a type of data augmentation that removes these shortcuts.
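To make the objective concrete, below is a minimal PyTorch sketch of the contrastive random walk loss described above. It is a sketch under simplifying assumptions: the dot-product similarity, the feature shapes, and the temperature stand in for the transformer's attention-based global matching, and none of the names come from the paper's code.

import torch
import torch.nn.functional as F

def transition_matrix(feats_a, feats_b, temperature=0.07):
    # Row-stochastic transition matrix between two frames: a softmax over
    # "all pairs" similarities, standing in for attention-based global matching.
    # feats_a, feats_b: (N, D) features for N points in adjacent frames.
    sim = feats_a @ feats_b.t() / temperature   # (N, N) affinities
    return F.softmax(sim, dim=1)                # each row sums to 1

def cycle_consistency_loss(frame_feats, eps=1e-8):
    # frame_feats: list of (N, D) point features, one tensor per frame.
    # Chain transitions forward through the clip and then back again
    # (a palindrome walk), so the round-trip matrix should be the identity.
    walk = transition_matrix(frame_feats[0], frame_feats[1])
    for t in range(1, len(frame_feats) - 1):
        walk = walk @ transition_matrix(frame_feats[t], frame_feats[t + 1])
    for t in range(len(frame_feats) - 1, 0, -1):
        walk = walk @ transition_matrix(frame_feats[t], frame_feats[t - 1])
    # Cross-entropy against the identity target: every point that walks
    # forward and then backward should land back on itself.
    targets = torch.arange(walk.shape[0], device=walk.device)
    return F.nll_loss(torch.log(walk + eps), targets)

The palindrome structure means the only supervision needed is the identity: each point should return to where it started. As a hedged illustration of the shortcut issue mentioned above, if every frame in a training clip is spatially jittered independently, absolute position no longer predicts the correct match, which closes the positional-encoding shortcut. The transform below is an illustrative assumption, not the paper's exact augmentation scheme.

import torchvision.transforms as T

# Hypothetical per-frame jitter: an independent random resized crop for each
# frame, so the model cannot match points by absolute position alone.
per_frame_jitter = T.RandomResizedCrop(size=256, scale=(0.8, 1.0), antialias=True)

def augment_clip(frames):
    # frames: list of (C, H, W) tensors; each gets its own random crop.
    return [per_frame_jitter(f) for f in frames]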

BibTeX

@InProceedings{shrivastava2024gmrw,
  title     = {Self-Supervised Any-Point Tracking by Contrastive Random Walks},
  author    = {Shrivastava, Ayush and Owens, Andrew},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2409.16288},
}