Vision-Free Object 6D Pose Estimation
for In-Hand Manipulation
via Multi-Modal Haptic Attention

CoRL 2025 Workshop
Dexterous Manipulation: Learning and Control with Diverse Modalities
Spotlight

1KIST, 2Korea University

TL;DR:

We propose a vision-free approach that integrates multiple haptic sensing modalities,
capturing correlations among kinesthetic, contact, and proprioceptive signals, as well as their temporal dynamics.

Abstract

Humans are capable of in-hand manipulation without visual feedback, inferring an object's pose through haptic sensing. For multi-fingered robotic hands, however, in-hand manipulation remains highly challenging due to severe self-occlusion and limited visual accessibility.

To address this problem, we propose a vision-free approach that integrates multiple haptic sensing modalities. Specifically, we develop a haptic attention-based pose estimator that captures correlations among kinesthetic, contact, and proprioceptive signals, as well as their temporal dynamics.

Experimental results demonstrate that haptic feedback alone enables reliable pose estimation and that contact-rich sensing substantially improves reorientation performance. Our pose estimator achieves average errors of only 4.94 mm in position and 11.6 degrees in orientation during 300 iterations (10 seconds), underscoring the effectiveness of haptic-driven pose estimation for dexterous manipulation.


Vision-Free Object Pose Estimator

We propose a BiLSTM attention-based estimator
that integrates kinesthetic, cutaneous, and proprioceptive signals,
enabling vision-free object pose estimation.



Our BiLSTM-based haptic pose estimator.
(A) Data collection: joint angles, forces/torques at finger bases, and binary fingertip contacts.
(B) Multimodal haptic histories and the previous estimate are used to predict the current object pose.
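The estimator described above can be sketched as follows. This is a minimal, illustrative PyTorch model, not the paper's implementation: the input dimensions (16 joint angles, 6-axis force/torque at each of 4 finger bases, 4 binary contacts, a 7-D previous pose estimate), hidden size, and head count are all assumptions.

```python
import torch
import torch.nn as nn

class HapticPoseEstimator(nn.Module):
    """Sketch of a BiLSTM + attention pose estimator over haptic histories.

    Per-step inputs (dimensions are illustrative assumptions):
    joint angles (16), base forces/torques (4 fingers x 6 = 24),
    binary fingertip contacts (4), previous pose estimate (7: xyz + quat).
    """

    def __init__(self, in_dim=16 + 24 + 4 + 7, hidden=128, n_heads=4):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * hidden, 7)  # position (3) + quaternion (4)

    def forward(self, haptic_seq):
        # haptic_seq: (batch, T, in_dim) history of haptic features
        h, _ = self.bilstm(haptic_seq)   # (batch, T, 2*hidden)
        a, _ = self.attn(h, h, h)        # temporal self-attention over history
        return self.head(a[:, -1])       # pose estimate at the current step

est = HapticPoseEstimator()
pose = est(torch.randn(2, 30, 51))  # batch of 2, 30-step haptic history
```

The self-attention over the BiLSTM states is one way to realize the "correlations among modalities and their temporal dynamics" described above; the actual fusion scheme may differ.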



Experimental Setup


Simulation: Isaac Sim with 8196 parallel environments
Training: PPO, 100K steps, and three random seeds
Evaluation: 500 rollouts per instance
H/W: Single RTX 4090 GPU
Observation Modality:
1. Ground Truth (GT)
2. Random noise in place of the object pose
3. Pose Estimator (PE) with haptic feedback
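The three observation conditions above amount to swapping the source of the object-pose term in the policy observation. A minimal sketch, with hypothetical function names and an assumed noise scale:

```python
import numpy as np

def object_pose_obs(mode, gt_pose, estimator=None, haptics=None,
                    noise_scale=0.05, rng=np.random.default_rng(0)):
    """Return the object-pose component of the policy observation.

    mode: "gt" (privileged ground truth), "noise" (random stand-in),
    or "pe" (haptic pose estimator). Names and scales are illustrative.
    """
    if mode == "gt":
        return gt_pose
    if mode == "noise":
        return rng.normal(0.0, noise_scale, size=gt_pose.shape)
    if mode == "pe":
        return estimator(haptics)  # hypothetical estimator callable
    raise ValueError(f"unknown observation mode: {mode}")
```

Comparing "gt" against "pe" isolates the cost of estimation error, while "noise" serves as a lower bound showing what the policy achieves with no useful pose signal.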


Experiments



Performance on In-Hand Reorientation (Simulation)

We compare two conditions: (1) a ground-truth oracle baseline using privileged object pose, and (2) our proposed haptic-only estimator. Each condition is tested with three random seeds. As summarized in the evaluation table, the oracle baseline achieves an average stability of 88.7 seconds and 77.3 successful reorientations, while our estimator maintains stability for 27.1 seconds with 3.3 successful reorientations. These results demonstrate the feasibility of vision-free haptic-based pose estimation, while also highlighting the performance gap relative to ground-truth sensing. Importantly, they underline the critical role of accurate real-time pose estimation: stable grasping can be achieved without precise feedback, but successful reorientation requires fine-grained and continuous pose tracking.



Performance on In-Hand Rotation (Real-World)

We evaluate the feasibility of our method on a physical platform.
Using a canned tomato object with an AprilTag attached to its bottom surface,
we collect 20K haptic samples and ground-truth poses while executing an in-hand rotation policy.
Haptic feedback is recorded directly from the robotic hand platform.


We then train the model solely on this real-world dataset, which is nearly 200 times smaller than the simulated dataset.
Over 10 seconds (100 steps) of continuous inference,
the estimator achieves an average error of 38.2 mm in position and 3.67 degrees in orientation.
These results emphasize the need for larger and more diverse real-world datasets to close the sim-to-real gap.


Results






From top to bottom, the plots show the ground-truth (GT) and predicted values for the x, y, and z positions, respectively.
The bottom plot illustrates the orientation error.
Predictions are generated in an autoregressive manner,
where each estimated pose is recursively fed back into the model as input for the next prediction.
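The autoregressive rollout described above can be sketched as a simple closed loop. The `estimator(haptics, prev_pose)` interface and the dummy stream below are assumptions for illustration, not the paper's API:

```python
import numpy as np

def autoregressive_rollout(estimator, haptic_stream, init_pose, horizon=100):
    """Closed-loop pose tracking: each estimate is fed back as input.

    estimator(haptics, prev_pose) -> pose is a hypothetical interface;
    haptic_stream yields one haptic feature vector per control step.
    """
    pose = np.asarray(init_pose, dtype=float)
    trajectory = [pose]
    for _, haptics in zip(range(horizon), haptic_stream):
        pose = estimator(haptics, pose)  # recursive feedback of the estimate
        trajectory.append(np.asarray(pose, dtype=float))
    return np.stack(trajectory)
```

Because each prediction conditions on the previous one, small errors can compound over the horizon; the reported 100-step (10-second) error already reflects this accumulation.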

BibTeX

@inproceedings{vision2025ahn,
  title={Vision-Free Pose Estimation for In-Hand Manipulation via Multi-Modal Haptic Attention},
  author={Chanyoung Ahn and Sungwoo Park and Donghyun Hwang},
  booktitle={CoRL Workshop},
  year={2025},
}
