Hyperbolic Multiview Pretraining for Robotic Manipulation

Abstract

3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits the modeling of structural relations among embeddings. As a result, they struggle to learn structured embeddings that are essential for robust spatial perception in robotic applications. To this end, we propose HyperMVP, a self-supervised framework for Hyperbolic MultiView Pretraining. The hyperbolic space offers geometric properties that are well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then finetuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.

Introduction

Overview of HyperMVP. (a) Illustration of the HyperMVP framework, including the 3D-MOV pretraining dataset, embedding spaces, and downstream applications. (b) Comparison of generalization performance (%) on COLOSSEUM under various perturbation settings.

  • We propose HyperMVP, the first framework to explore 3D multiview pretraining in hyperbolic space for robotic manipulation.
  • We introduce 3D-MOV, a large-scale dataset comprising four types of 3D point clouds, with each instance paired with five orthographic images (see the rendering sketch after this list). It provides a foundation for analyzing how different types of 3D data affect manipulation performance.
  • We present comprehensive evaluation results and analytical insights regarding the model’s performance across both simulated and real-world scenarios.
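As a concrete illustration of how a single point cloud can be turned into five orthographic images of the kind paired with each 3D-MOV instance, the sketch below renders simple orthographic depth maps with a z-buffer. The chosen viewpoints, the resolution, and the use of depth maps rather than RGB renders are assumptions made for illustration, not the dataset's actual rendering pipeline.

```python
# Illustrative sketch only (assumed viewpoints and depth rendering, not the 3D-MOV pipeline).
import numpy as np

def render_orthographic_views(points, resolution=224):
    """Project an (N, 3) point cloud into five orthographic depth maps."""
    # Normalize the cloud into the unit cube so pixel coordinates are well defined.
    points = points - points.mean(axis=0)
    points = points / (np.abs(points).max() + 1e-8)

    # Each view: (in-plane axis u, in-plane axis v, depth axis, camera side).
    views = {
        "top":   (0, 1, 2, +1),   # camera on the +z side
        "front": (0, 2, 1, -1),   # camera on the -y side
        "back":  (0, 2, 1, +1),   # camera on the +y side
        "left":  (1, 2, 0, -1),   # camera on the -x side
        "right": (1, 2, 0, +1),   # camera on the +x side
    }
    images = {}
    for name, (u, v, d, side) in views.items():
        img = np.zeros((resolution, resolution), dtype=np.float32)
        # Map in-plane coordinates from [-1, 1] to pixel indices.
        px = ((points[:, u] + 1) / 2 * (resolution - 1)).astype(int)
        py = ((points[:, v] + 1) / 2 * (resolution - 1)).astype(int)
        # Larger "closeness" means nearer to the camera for this view.
        closeness = side * points[:, d]
        # Simple z-buffer: write far points first so near points overwrite them.
        order = np.argsort(closeness)
        img[py[order], px[order]] = closeness[order] + 2.0  # shift so background stays 0
        images[name] = img
    return images

# Usage: views = render_orthographic_views(np.random.rand(4096, 3)); views["front"].shape == (224, 224)
```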

Pipeline

Methodology of HyperMVP. HyperMVP follows a pretraining–finetuning paradigm. During pretraining, each point cloud is rendered into five orthographic images. These images are masked and fed into the GeoLink encoder to learn multiview representations in Euclidean and hyperbolic spaces. To support pretraining, we introduce intra-view and inter-view reconstruction pretext tasks and build the large-scale 3D-MOV dataset. During finetuning, the pretrained GeoLink encoder is trained jointly with the Robotic View Transformer (RVT) to learn visuomotor policies.
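To make the two pretext tasks concrete, the sketch below shows one possible form of a pretraining step: Euclidean tokens from the masked views are mapped onto the Poincaré ball with the exponential map, an intra-view loss reconstructs the masked views, and an inter-view loss ties the hyperbolic embeddings of views of the same instance together. The encoder/decoder interfaces, the curvature, the masking ratio, and the exact inter-view objective are assumptions; this is a minimal sketch, not the released GeoLink/HyperMVP implementation.

```python
# Minimal sketch under stated assumptions, not the released HyperMVP/GeoLink implementation.
import torch
import torch.nn.functional as F

def expmap0(x, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * x / (c ** 0.5 * norm)

def poincare_distance(x, y, c=1.0, eps=1e-6):
    """Geodesic distance on the Poincare ball, used here as the inter-view metric."""
    sq = lambda t: t.pow(2).sum(dim=-1)
    num = 2.0 * c * sq(x - y)
    den = (1.0 - c * sq(x)).clamp_min(eps) * (1.0 - c * sq(y)).clamp_min(eps)
    return torch.acosh(1.0 + num / den + eps) / c ** 0.5

def pretraining_step(encoder, decoder, views, mask_ratio=0.75):
    """One masked multiview step over a batch of five orthographic views.

    views: (B, V=5, C, H, W) images rendered from the same point cloud.
    encoder/decoder: placeholder modules standing in for GeoLink and its decoder.
    """
    B, V = views.shape[:2]
    flat_views = views.flatten(0, 1)                      # (B*V, C, H, W)
    # Encode each masked view; the encoder is assumed to return Euclidean tokens.
    tokens = encoder(flat_views, mask_ratio=mask_ratio)   # (B*V, N, D)
    # Map the tokens onto the Poincare ball to model cross-view structure.
    hyp_tokens = expmap0(tokens)

    # Intra-view pretext task: reconstruct the masked content of each view.
    recon = decoder(tokens)                               # (B*V, C, H, W)
    intra_loss = F.mse_loss(recon, flat_views)

    # Inter-view pretext task (assumed form): pull hyperbolic embeddings of
    # views of the same instance toward each other.
    view_emb = hyp_tokens.mean(dim=1).view(B, V, -1)
    anchor, others = view_emb[:, :1], view_emb[:, 1:]
    inter_loss = poincare_distance(anchor, others).mean()

    return intra_loss + inter_loss
```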

Reconstruction Results

HyperMVP successfully restores fine details in images, such as the bird's claws in the left image and the car's hubcaps and windshield in the right image.


HyperMVP also successfully restores scene images, such as the refrigerator interior scene (on the left) and the desktop scene (on the right).


Furthermore, HyperMVP can even restore out-of-domain (OOD) images. For instance, the robotic manipulation scene in the left image was not present in the training data distribution. Similarly, the figurine in the right image did not appear in the training set.


Simulation Experiments

Multi-task performance on RLBench. We report success rates (%) for 18 RLBench tasks, along with the mean success rate (%).

Execution examples — selected tasks

basketball in hoop
close laptop lid
get ice from fridge
move hanger
scoop with spatula
straighten rope
turn oven on

basketball in hoop under different perturbations

No perturbation
All perturbations
MO Color
RO Color
MO Texture
MO Size
RO Size
Light Color
Table Color
Table Texture
Distractor Object
Background Texture
Camera Pose

Real-world Experiments

In the real world, we study two manipulation tasks: a common 'pick and place bear' task and a high-precision 'plug in the charging cable' task. Real-world videos play at 2x speed by default.

Pick and place bear: The teddy bear is placed facing an arbitrary direction, and the robot needs to pick it up and place it in a cardboard box to complete the task. In this task, we introduce various perturbations, including Light Color, MO Texture, Distractor Object, and the combination of all of these perturbations.

Plug in the charging cable: The robot needs to pick up a USB Type-A charging cable, accurately align it with the 14 mm × 6 mm receptacle on the charger, and precisely insert it.

Real world scene
Real-world scenes for the two manipulation tasks.
HyperMVP: pick and place bear
HyperMVP: pick and place bear under perturbations
MO Texture: the teddy bear's clothes were changed.
Light Color: the color of the light turned red.
Distractor Object: task-irrelevant objects were added to the environment.
All of the above perturbations combined.
HyperMVP: plug in the charging cable

BibTeX

Coming soon...