Method
We propose a simple and effective self-supervised approach to the Tracking Any Point problem. We adapt the global matching transformer architecture to learn through cycle consistency: tracking forward in time and then backward should return each point to where it started. In lieu of labeled data, we supervise the model via the contrastive random walk, using the self-attention from global matching to define the transition matrix for a random walk that moves between points in adjacent frames. This “all pairs” matching mechanism lets us build transition matrices over large numbers of points at once, which increases spatial precision and yields a richer learning signal, since the walk can traverse many paths through the space-time graph. Additionally, we identify that global matching architectures are susceptible to shortcut solutions (e.g., due to their use of positional encodings), and that previously proposed remedies for these shortcuts are insufficient. We therefore propose a type of data augmentation that removes them.
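To make the objective concrete, the sketch below (PyTorch) shows one way a cycle-consistency loss over chained transition matrices can be computed for a contrastive random walk: all-pairs similarities between adjacent frames are turned into row-stochastic transition matrices, composed forward and then backward in time, and the resulting round-trip distribution is penalized for not returning to the starting point. The tensor shapes, the temperature `tau`, and the helper names are illustrative assumptions for exposition, not the exact components of our architecture.

```python
# Minimal sketch of a contrastive-random-walk cycle-consistency loss.
# Assumes per-frame point features are already extracted and L2-normalized;
# the shapes and names below are hypothetical, not the paper's exact API.
import torch
import torch.nn.functional as F


def transition_matrix(src, dst, tau=0.07):
    """Row-stochastic affinity between points in two adjacent frames.

    src, dst: (N, D) L2-normalized point features.
    Returns an (N, N) matrix whose rows are softmax distributions over
    all-pairs similarities, i.e. the random-walk transition probabilities.
    """
    sim = src @ dst.t() / tau          # all-pairs similarity
    return F.softmax(sim, dim=-1)      # each row sums to 1


def cycle_consistency_loss(feats, tau=0.07):
    """Chain transitions forward through time, then backward.

    feats: (T, N, D) per-frame point features (T frames, N points).
    The composed forward-backward walk should map each point back to
    itself, so the target is the identity assignment.
    """
    T, N, _ = feats.shape
    walk = torch.eye(N, device=feats.device)
    # Forward: frame 0 -> 1 -> ... -> T-1
    for t in range(T - 1):
        walk = walk @ transition_matrix(feats[t], feats[t + 1], tau)
    # Backward: frame T-1 -> T-2 -> ... -> 0
    for t in range(T - 1, 0, -1):
        walk = walk @ transition_matrix(feats[t], feats[t - 1], tau)
    # Negative log-likelihood of returning to the starting point.
    target = torch.arange(N, device=feats.device)
    return F.nll_loss(torch.log(walk + 1e-8), target)
```

Because the loss is a cross-entropy over the composed walk, gradients flow through every path in the space-time graph that contributes probability mass to the round trip, which is what makes the all-pairs formulation a richer learning signal than tracking a single point in isolation.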