3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits the modeling of structural relations among embeddings. As a result, they struggle to learn the structured embeddings that are essential for robust spatial perception in robotic applications. To address this limitation, we propose HyperMVP, a self-supervised framework for Hyperbolic MultiView Pretraining. Hyperbolic space offers geometric properties that are well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then fine-tuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.
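For intuition, a common way to obtain hyperbolic embeddings from Euclidean encoder features is to map them onto the Poincaré ball via the exponential map at the origin and compare them with the geodesic distance. The sketch below illustrates this standard construction only; it is not the GeoLink encoder itself, and the function names (`exp_map_origin`, `poincare_dist`) are ours, not the paper's.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Möbius addition on the Poincaré ball of curvature -c."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def exp_map_origin(v, c=1.0, eps=1e-9):
    """Exponential map at the origin: lifts a Euclidean (tangent)
    feature vector onto the Poincaré ball."""
    norm = np.linalg.norm(v) + eps
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_dist(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between two points on the Poincaré ball."""
    diff = mobius_add(-x, y, c)
    # Clip so the arctanh argument stays strictly below 1.
    norm = np.clip(np.linalg.norm(diff), 0.0, 1.0 / np.sqrt(c) - eps)
    return (2.0 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * norm)

# Toy usage: two (hypothetical) Euclidean encoder features mapped
# into hyperbolic space and compared by geodesic distance.
f1 = np.array([0.3, -0.1, 0.5])
f2 = np.array([0.6, 0.2, -0.4])
z1, z2 = exp_map_origin(f1), exp_map_origin(f2)
print(poincare_dist(z1, z2))
```

Because geodesic distance grows exponentially toward the ball's boundary, hierarchical or tree-like relations among embeddings can be represented with low distortion, which is the geometric property the abstract appeals to.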
In real-world experiments, we study two manipulation tasks: a common 'pick and place bear' task and a high-precision 'plug in the charging cable' task. Real-world videos play at 2x speed by default.
Pick and place bear: The teddy bear is placed facing an arbitrary direction, and the robot must pick it up and place it in a cardboard box to complete the task. In this task, we introduce various perturbations, including Light Change, MO Texture, Distractor Objects, and the combination of all of these disturbances.
Plug in the charging cable: The robot needs to pick up a USB Type-A charging cable, accurately align it with the 14 mm × 6 mm receptacle on the charger, and precisely insert it.
Coming soon...