Dexterous in-hand manipulation requires continuous 6D pose tracking, yet the manipulating fingers inevitably occlude the object from the camera. We study how to structure the sparse haptic signals already available on multi-fingered hands — proprioception, proximal force/torque, and binary contact — to complement a pretrained visual pose tracker under occlusion.
We propose a kinematic-aware finger-level encoder and systematically compare it against four alternative designs through three levels of evaluation: per-frame refinement, sequential open-loop tracking, and closed-loop manipulation.
Per-frame evaluation barely separates the encoder designs (differences of only 1.12×), but sequential tracking amplifies the gap to 2× in tracking error and 15× in downstream manipulation. Without explicit supervision, the structured encoder learns to rely on vision alone for translation and dedicates one attention head to haptic tokens for rotation. The benefit persists even when haptic input is zeroed at inference (2.3× better than vision-only), which indicates that structured haptic encoding acts as a training-time inductive bias.
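One way to check whether a head specializes on haptic tokens is to measure how much of its attention mass lands on them. The following is a minimal sketch, not the authors' analysis code: the function name `haptic_attention_mass`, the `[head][query][key]` layout, and the toy weights are all assumptions made for illustration.

```python
def haptic_attention_mass(attn, haptic_idx):
    """Per-head fraction of attention placed on haptic tokens.

    attn: nested list indexed [head][query][key]; each query row sums to 1.
    haptic_idx: key positions assumed (hypothetically) to hold haptic tokens.
    """
    haptic = set(haptic_idx)
    masses = []
    for head in attn:
        # Attention mass each query row assigns to haptic keys.
        per_query = [
            sum(w for k, w in enumerate(row) if k in haptic) for row in head
        ]
        # Average over queries to get one number per head.
        masses.append(sum(per_query) / len(per_query))
    return masses


# Toy weights: 2 heads, 1 query, 4 keys (keys 0-1 visual, keys 2-3 haptic).
attn = [
    [[0.45, 0.45, 0.05, 0.05]],  # head 0 attends mostly to visual tokens
    [[0.10, 0.10, 0.40, 0.40]],  # head 1 attends mostly to haptic tokens
]
masses = haptic_attention_mass(attn, haptic_idx=[2, 3])
for h, m in enumerate(masses):
    print(f"head {h}: {m:.2f} of attention on haptic tokens")
```

A head whose mass is close to 1 would be a candidate "haptic head"; in practice the weights would come from the trained fusion transformer rather than from hand-written lists.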
In sequential open-loop tracking, the pose estimated at frame t initializes frame t+1, so errors accumulate over time.
| Model | Pos. Error (cm) ↓ | Ang. Error (°) ↓, across 0%–90% occlusion | Occlusion Robustness |
|---|---|---|---|
| V-only (FP) | 8.11 | 54.9 – 79.4 | ✗ Fluctuates |
| Naive V+H | 8.11 | 47.3 – 53.7 | △ Partial |
| FingerMLP | 5.54 | 47.0 – 67.6 | △ Partial |
| 16-token | 5.81 | 69.2 – 69.3 | ✗ Collapsed |
| KineFuse (Ours) | 4.04 | 47.6 ± 0.1 | ✓ Constant |
KineFuse halves the position error of vision-only tracking (4.04 cm vs. 8.11 cm) and is the only model whose angular error stays constant (47.6°) across occlusion levels from 0% to 90%.
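The open-loop protocol and the two error metrics in the table can be illustrated with a short sketch. It assumes positions in metres, rotations as 3×3 matrices, and a purely hypothetical `biased_refine` step that adds a small constant drift per frame; none of these names or values come from the paper, and the real tracker is a learned model.

```python
import math


def position_error_cm(p_est, p_gt):
    """Euclidean distance between two 3D positions in metres, reported in cm."""
    return 100.0 * math.dist(p_est, p_gt)


def angular_error_deg(R_est, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_est^T R_gt) equals the sum of elementwise products.
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))


def rot_z(deg):
    """Rotation about the z axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rollout(refine, init_pose, frames):
    """Open-loop tracking: the estimate at frame t seeds frame t+1."""
    pose, trajectory = init_pose, []
    for frame in frames:
        pose = refine(pose, frame)
        trajectory.append(pose)
    return trajectory


def biased_refine(pose, frame):
    """Hypothetical tracker step: drifts 1 mm in x and 0.5 deg in yaw per frame."""
    (x, y, z), yaw = pose
    return (x + 0.001, y, z), yaw + 0.5


# The object is static, so the ground-truth pose never changes.
gt_pos, gt_yaw = (0.0, 0.0, 0.0), 0.0
trajectory = rollout(biased_refine, (gt_pos, gt_yaw), range(10))
errors = [
    (position_error_cm(p, gt_pos), angular_error_deg(rot_z(yaw), rot_z(gt_yaw)))
    for p, yaw in trajectory
]
for t, (pos_err, ang_err) in enumerate(errors, start=1):
    print(f"frame {t:2d}: {pos_err:.2f} cm, {ang_err:.2f} deg")
```

Because nothing resets the estimate, the per-frame bias compounds into a steadily growing error, which is why sequential evaluation separates trackers that look similar under per-frame refinement.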
Task: in-hand tool reorientation that aligns the tool tip with a target, executed by an RL policy. Entries are successes per episode at each occlusion level.
| Model | 0% Occ. | 30% Occ. | 50% Occ. | 70% Occ. | 90% Occ. |
|---|---|---|---|---|---|
| V-only | 0.46 | 0.83 | 1.58 | 1.97 | 1.52 |
| KineFuse (Ours) | 4.61 | 4.72 | 4.47 | 5.10 | 3.81 |
| GT (upper bound) | 21.25 | — | — | — | — |
KineFuse achieves higher manipulation success than vision-only at every occlusion level; from the table above, the ratio ranges from about 2.5× at 90% occlusion to about 10× at 0% occlusion.
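The per-level improvement can be recomputed directly from the two table rows. The sketch below is plain arithmetic on the reported values; the variable names are illustrative only.

```python
# Successes per episode, copied from the table above.
occlusion = ["0%", "30%", "50%", "70%", "90%"]
v_only = [0.46, 0.83, 1.58, 1.97, 1.52]
kinefuse = [4.61, 4.72, 4.47, 5.10, 3.81]
gt_upper_bound = 21.25  # ground-truth pose, reported at 0% occlusion only

# Ratio of KineFuse to vision-only at each occlusion level.
ratios = {occ: k / v for occ, k, v in zip(occlusion, kinefuse, v_only)}

# Share of the ground-truth upper bound reached by KineFuse at 0% occlusion.
fraction_of_gt = kinefuse[0] / gt_upper_bound

for occ, r in ratios.items():
    print(f"{occ:>3} occlusion: {r:.1f}x vision-only")
print(f"KineFuse reaches {fraction_of_gt:.0%} of the GT upper bound at 0% occlusion")
```

The comparison against the GT row is a reminder that a substantial gap to perfect pose information remains even for the best tracker.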
A key finding: zeroing all haptic channels at inference yields tracking nearly identical to the full model and still 2.3× better than V-only. This suggests that haptic signals act as a structural regularizer during training: their kinematic structure reshapes how the fusion transformer attends to visual tokens. Once trained, the model can therefore be deployed even when haptic sensors are temporarily unavailable.
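The zero-haptic ablation amounts to replacing every haptic channel with zeros of the same shape while leaving the visual input untouched. A minimal sketch follows, assuming a dictionary-style observation; the key names (`proprioception`, `force_torque`, `contact`) and the toy per-finger values are hypothetical and not taken from the released code.

```python
import copy

# Hypothetical key names for the three haptic modalities named in the text.
HAPTIC_KEYS = ("proprioception", "force_torque", "contact")


def zero_haptics(obs):
    """Return a copy of `obs` with each haptic channel zeroed, shapes preserved."""
    out = copy.deepcopy(obs)
    for key in HAPTIC_KEYS:
        out[key] = [[0.0] * len(row) for row in obs[key]]
    return out


# Toy observation for a two-finger hand (one row per finger).
obs = {
    "vision": [0.2, 0.7, 0.1],
    "proprioception": [[0.3, -0.1], [0.5, 0.2]],
    "force_torque": [[1.2, 0.0], [0.4, 0.9]],
    "contact": [[1.0], [0.0]],
}
ablated = zero_haptics(obs)
print(ablated)
```

Keeping the tensor shapes unchanged means the trained model can be run on `ablated` without any architectural change, which is what makes the comparison against the full model meaningful.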
@inproceedings{anonymous2026kinefuse,
  title={KineFuse: Kinematic-Aware Haptic Fusion for In-Hand Occluded-Object Pose Tracking},
  author={Anonymous Authors},
  year={2026},
  note={Under Review}
}