Hyperbolic Multiview Pretraining for Robotic Manipulation

Abstract

3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits the modeling of structural relations among embeddings. As a result, they struggle to learn structured embeddings that are essential for robust spatial perception in robotic applications. To this end, we propose HyperMVP, a self-supervised framework for Hyperbolic MultiView Pretraining. The hyperbolic space offers geometric properties that are well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then finetuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.

Introduction

Overview of HyperMVP. (a) Illustration of the HyperMVP framework, including the 3D-MOV pretraining dataset, embedding spaces, and downstream applications. (b) Comparison of generalization performance (%) on COLOSSEUM under various perturbation settings.

  • We propose HyperMVP, the first framework to explore 3D multiview pretraining in hyperbolic space for robotic manipulation.
  • We introduce 3D-MOV, a large-scale dataset comprising four types of 3D point clouds, with each instance paired with five orthographic images (see the rendering sketch after this list). It provides a foundation for analyzing how different types of 3D data affect manipulation performance.
  • We present comprehensive evaluation results and analytical insights regarding the model’s performance across both simulated and real-world scenarios.
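As a concrete illustration of how a single point cloud can be turned into five orthographic images of the kind paired with each 3D-MOV instance, the sketch below renders simple orthographic depth maps with a z-buffer. The chosen viewpoints, the resolution, and the use of depth maps rather than RGB renders are assumptions made for illustration, not the dataset's actual rendering pipeline.

```python
# Illustrative sketch only (assumed viewpoints and depth rendering, not the 3D-MOV pipeline).
import numpy as np

def render_orthographic_views(points, resolution=224):
    """Project an (N, 3) point cloud into five orthographic depth maps."""
    # Normalize the cloud into the unit cube so pixel coordinates are well defined.
    points = points - points.mean(axis=0)
    points = points / (np.abs(points).max() + 1e-8)

    # Each view: (in-plane axis u, in-plane axis v, depth axis, camera side).
    views = {
        "top":   (0, 1, 2, +1),   # camera on the +z side
        "front": (0, 2, 1, -1),   # camera on the -y side
        "back":  (0, 2, 1, +1),   # camera on the +y side
        "left":  (1, 2, 0, -1),   # camera on the -x side
        "right": (1, 2, 0, +1),   # camera on the +x side
    }
    images = {}
    for name, (u, v, d, side) in views.items():
        img = np.zeros((resolution, resolution), dtype=np.float32)
        # Map in-plane coordinates from [-1, 1] to pixel indices.
        px = ((points[:, u] + 1) / 2 * (resolution - 1)).astype(int)
        py = ((points[:, v] + 1) / 2 * (resolution - 1)).astype(int)
        # Larger "closeness" means nearer to the camera for this view.
        closeness = side * points[:, d]
        # Simple z-buffer: write far points first so near points overwrite them.
        order = np.argsort(closeness)
        img[py[order], px[order]] = closeness[order] + 2.0  # shift so background stays 0
        images[name] = img
    return images

# Usage: views = render_orthographic_views(np.random.rand(4096, 3)); views["front"].shape == (224, 224)
```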

Pipeline

Methodology of HyperMVP. HyperMVP follows a pretraining–finetuning paradigm. During pretraining, each point cloud is rendered into five orthographic images. These images are masked and fed into the GeoLink encoder to learn multiview representations in Euclidean and hyperbolic spaces. To support pretraining, we introduce intra-view and inter-view reconstruction pretext tasks and build the large-scale 3D-MOV dataset. During finetuning, the pretrained GeoLink encoder is trained jointly with the Robotic View Transformer (RVT) to learn visuomotor policies.
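To make the two pretext tasks concrete, the sketch below shows one possible form of a pretraining step: Euclidean tokens from the masked views are mapped onto the Poincaré ball with the exponential map, an intra-view loss reconstructs the masked views, and an inter-view loss ties the hyperbolic embeddings of views of the same instance together. The encoder/decoder interfaces, the curvature, the masking ratio, and the exact inter-view objective are assumptions; this is a minimal sketch, not the released GeoLink/HyperMVP implementation.

```python
# Minimal sketch under stated assumptions, not the released HyperMVP/GeoLink implementation.
import torch
import torch.nn.functional as F

def expmap0(x, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * x / (c ** 0.5 * norm)

def poincare_distance(x, y, c=1.0, eps=1e-6):
    """Geodesic distance on the Poincare ball, used here as the inter-view metric."""
    sq = lambda t: t.pow(2).sum(dim=-1)
    num = 2.0 * c * sq(x - y)
    den = (1.0 - c * sq(x)).clamp_min(eps) * (1.0 - c * sq(y)).clamp_min(eps)
    return torch.acosh(1.0 + num / den + eps) / c ** 0.5

def pretraining_step(encoder, decoder, views, mask_ratio=0.75):
    """One masked multiview step over a batch of five orthographic views.

    views: (B, V=5, C, H, W) images rendered from the same point cloud.
    encoder/decoder: placeholder modules standing in for GeoLink and its decoder.
    """
    B, V = views.shape[:2]
    flat_views = views.flatten(0, 1)                      # (B*V, C, H, W)
    # Encode each masked view; the encoder is assumed to return Euclidean tokens.
    tokens = encoder(flat_views, mask_ratio=mask_ratio)   # (B*V, N, D)
    # Map the tokens onto the Poincare ball to model cross-view structure.
    hyp_tokens = expmap0(tokens)

    # Intra-view pretext task: reconstruct the masked content of each view.
    recon = decoder(tokens)                               # (B*V, C, H, W)
    intra_loss = F.mse_loss(recon, flat_views)

    # Inter-view pretext task (assumed form): pull hyperbolic embeddings of
    # views of the same instance toward each other.
    view_emb = hyp_tokens.mean(dim=1).view(B, V, -1)
    anchor, others = view_emb[:, :1], view_emb[:, 1:]
    inter_loss = poincare_distance(anchor, others).mean()

    return intra_loss + inter_loss
```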

Reconstruction Results

HyperMVP successfully restores fine details in images, such as the bird's claws in the left image and the car's hubcaps and windshield in the right image.


HyperMVP also successfully restores scene images, such as the refrigerator interior scene (on the left) and the desktop scene (on the right).


Furthermore, HyperMVP can even restore out-of-domain (OOD) images. For instance, the robotic manipulation scene in the left image was not present in the training data distribution. Similarly, the figurine in the right image did not appear in the training set.


Simulation Experiments

Multi-task performance on RLBench. We report success rates (%) for 18 RLBench tasks, along with the mean success rate (%).

Execution examples — selected tasks

basketball in hoop
close laptop lid
get ice from fridge
move hanger
scoop with spatula
straighten rope
turn oven on

basketball in hoop under different perturbations

No perturbation
All perturbations
MO Color
RO Color
MO Texture
MO Size
RO Size
Light Color
Table Color
Table Texture
Distractor Object
Background Texture
Camera Pose

Real-world Experiments

In the real world, we study two manipulation tasks: a common 'pick and place bear' task and a high-precision 'plug in the charging cable' task. Real-world videos play at 2x speed by default.

Pick and place bear: The teddy bear is placed facing an arbitrary direction, and the robot needs to pick it up and place it in a cardboard box to complete the task. In this task, we introduce various perturbations, including Light Color, MO Texture, Distractor Object, and the combination of all of these perturbations.

Plug in the charging cable: The robot needs to pick up a USB Type-A charging cable, accurately align it with the 14 mm × 6 mm receptacle on the charger, and precisely insert it.

Real world scene
Real-world scenes for the two manipulation tasks.
HyperMVP: pick and place bear
HyperMVP: pick and place bear under perturbations
MO Texture: the teddy bear's clothes were changed.
Light Color: the color of the light turned red.
Distractor Object: task-irrelevant objects were added to the environment.
All of the above perturbations combined.
HyperMVP: plug in the charging cable

BibTeX

Coming soon...