KineFuse: Kinematic-Aware Haptic Fusion
for In-Hand Occluded-Object Pose Tracking

[Figure: KineFuse title figure]

TL;DR:

We propose a kinematic-aware finger-level haptic encoder that complements
a pretrained visual pose tracker under finger-induced occlusion.
KineFuse achieves 2× lower position error and 3–15× higher manipulation success than vision-only.

Abstract

Dexterous in-hand manipulation requires continuous 6D pose tracking, yet the manipulating fingers inevitably occlude the object from the camera. We study how to structure the sparse haptic signals already available on multi-fingered hands — proprioception, proximal force/torque, and binary contact — to complement a pretrained visual pose tracker under occlusion.

We propose a kinematic-aware finger-level encoder and systematically compare it against four alternative designs through three levels of evaluation: per-frame refinement, sequential open-loop tracking, and closed-loop manipulation.

Per-frame evaluation cannot distinguish encoder quality (1.12×), but sequential tracking amplifies differences to 2× in tracking and 15× in downstream manipulation. The structured encoder learns to use vision exclusively for translation and dedicates one attention head to haptic tokens for rotation — without explicit supervision. The benefit persists even when haptic input is zeroed at inference (2.3× better than vision-only), revealing that structured haptic encoding serves as a training-time inductive bias.


Method Overview

KineFuse augments a frozen FoundationPose visual backbone with a URDF-aware finger-level haptic encoder.


[Figure: KineFuse framework]

Pipeline

  1. Joint Tokenization: Each joint's τ-step history → shared MLP → joint tokens with finger-identity and position embeddings

  2. Intra-Finger Attention: Block-diagonal masked self-attention propagates sparse F/T cues within each finger

  3. Finger-Level Pooling: Learned cross-attention compresses 16 joint tokens → 4 finger tokens

  4. Inter-Finger Graph Attention: Graphormer layers with URDF-derived spatial bias capture inter-finger relationships

  5. Visuo-Haptic Fusion: Visual (400 tokens) + haptic (4 tokens) concatenated → separate translation and rotation heads
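Step 2 can be illustrated with a minimal sketch of a block-diagonal attention mask. It assumes the 16 joints are ordered finger by finger (4 joints × 4 fingers, as in the hardware description). The function names and the uniform attention scores are illustrative only and are not taken from the KineFuse implementation.

```python
import math

NUM_FINGERS = 4          # from the hardware description
JOINTS_PER_FINGER = 4    # 16 DoF in total
NUM_JOINTS = NUM_FINGERS * JOINTS_PER_FINGER


def finger_of(joint_idx):
    """Finger index of a joint, assuming finger-major joint ordering."""
    return joint_idx // JOINTS_PER_FINGER


def intra_finger_mask():
    """16x16 boolean mask: True where joint i may attend to joint j."""
    return [[finger_of(i) == finger_of(j) for j in range(NUM_JOINTS)]
            for i in range(NUM_JOINTS)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out entries."""
    peak = max(s for s, m in zip(scores, mask_row) if m)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = intra_finger_mask()
# Joint 5 belongs to finger 1 under the assumed ordering.
weights = masked_softmax([0.0] * NUM_JOINTS, mask[5])
```

With uniform scores, joint 5 places weight 0.25 on each of joints 4 to 7 and zero on every joint of the other fingers, which is the sense in which attention stays within a finger at this stage.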

Encoder Comparison

We systematically compare five encoder designs with progressive kinematic structure.


[Figure: Encoder ablation]

| Encoder | Tokens | Kinematic Bias | Key Characteristic |
|---|---|---|---|
| (a) Naive | 1 | None | Flat concatenation → single token |
| (b) FingerMLP | 4 | None | Per-finger MLP + cross-finger attention |
| (c) 16-token | 16 | Partial | Joint-level attention, no pooling |
| (d) KineFuse (Ours) | 4 | URDF-aware | Intra-finger → pooling → inter-finger graph attention |

Results

Sequential Open-Loop Tracking

Pose at frame t initializes frame t+1, so errors accumulate over time.
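The accumulation effect can be seen in a toy 1-D sketch. The `refine` callable is a hypothetical stand-in for any per-frame refiner, and the fixed per-step bias of 0.05 is an arbitrary illustrative value, not a measured quantity.

```python
def track_open_loop(initial_pose, observations, refine):
    """Run a per-frame refiner sequentially, feeding each estimate forward."""
    estimates = []
    pose = initial_pose
    for obs in observations:
        pose = refine(pose, obs)   # the estimate at t becomes the prior at t+1
        estimates.append(pose)
    return estimates


def make_biased_refiner(bias):
    """Toy 1-D refiner: applies the observed motion plus a fixed per-step error."""
    def refine(prev_pose, observed_delta):
        return prev_pose + observed_delta + bias
    return refine


true_deltas = [1.0] * 10
truth = [sum(true_deltas[:k + 1]) for k in range(len(true_deltas))]
estimates = track_open_loop(0.0, true_deltas, make_biased_refiner(0.05))
errors = [abs(e - t) for e, t in zip(estimates, truth)]
```

Because nothing re-anchors the estimate, the error in `errors` grows with every step: a small per-frame difference between encoders can turn into a large gap over a long sequence.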

| Model | Pos. Error (cm) ↓ | Ang. Error (°) ↓ | Occlusion Robust |
|---|---|---|---|
| V-only (FP) | 8.11 | 54.9 – 79.4 | ✗ Fluctuates |
| Naive V+H | 8.11 | 47.3 – 53.7 | △ Partial |
| FingerMLP | 5.54 | 47.0 – 67.6 | △ Partial |
| 16-token | 5.81 | 69.2 – 69.3 | ✗ Collapsed |
| KineFuse (Ours) | 4.04 | 47.6 ± 0.1 | ✓ Constant |

KineFuse achieves 2× lower position error than vision-only and is the only model that maintains constant angular error (47.6°) regardless of occlusion level (0%–90%).
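As a quick check of the 2× figure, the improvement factors can be recomputed from the position errors in the table. The dictionary and variable names below are illustrative only.

```python
# Position errors (cm) copied from the sequential tracking table.
pos_error_cm = {
    "V-only (FP)": 8.11,
    "Naive V+H": 8.11,
    "FingerMLP": 5.54,
    "16-token": 5.81,
    "KineFuse (Ours)": 4.04,
}

baseline = pos_error_cm["V-only (FP)"]
# Improvement factor: how many times smaller each error is than the baseline.
improvement = {name: baseline / err for name, err in pos_error_cm.items()}

for name, factor in improvement.items():
    print(f"{name:16s} {factor:.2f}x")
```

The KineFuse entry works out to 8.11 / 4.04 ≈ 2.01, while the naive fusion baseline stays at 1.00, i.e. no position improvement over vision-only.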


Downstream Manipulation Success

Task: in-hand tool reorientation that aligns the tool tip with a target. Metric: successes per episode under an RL policy.

| Model | 0% Occ. | 30% Occ. | 50% Occ. | 70% Occ. | 90% Occ. |
|---|---|---|---|---|---|
| V-only | 0.46 | 0.83 | 1.58 | 1.97 | 1.52 |
| KineFuse (Ours) | 4.61 | 4.72 | 4.47 | 5.10 | 3.81 |

GT (upper bound): 21.25

KineFuse achieves 3–15× higher manipulation success than vision-only across all occlusion levels.

Tracking Drift Comparison

At 30% occlusion over 149 tracking steps:
· KineFuse (V+H): Stabilizes after step 20, maintains accurate pose overlay through step 35+
· V-only (FP): Diverges continuously, loses tracking by step 20


[Figure: Tracking drift comparison]

[Figure: Distribution shift analysis]

Policy Result

[Figure: Policy result]

Real-World Demonstration

We validate on a physical setup with a 16-DoF dexterous hand and a RealSense D435i camera, comparing open-loop tracking against AprilTag-based ground truth.


[Figure: Real-world demonstration]

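| Step | Trans. Error (mm) ↓, V-only | Trans. Error (mm) ↓, KineFuse | Rot. Error (°) ↓, V-only | Rot. Error (°) ↓, KineFuse |
|---|---|---|---|---|
| 0 | 11.9 | 1.4 | 6.7 | 2.5 |
| 67 | 78.2 | 17.4 | 57.7 | 30.8 |
| 135 | 108.2 | 35.9 | 130.3 | 63.5 |
| 202 | 95.3 | 58.1 | 177.4 | 86.6 |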

Structural vs. Informational Contribution

A key finding: zeroing all haptic channels at inference produces tracking nearly identical to the full model, yet 2.3× better than V-only. This reveals that haptic signals serve as a structural regularizer during training — their kinematic structure reshapes how the fusion transformer attends to visual tokens. Once trained, the model can be deployed even when haptic sensors are temporarily unavailable.
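The inference-time ablation described here can be sketched as follows. This is a minimal illustration under assumptions: `predict_pose`, `zero_haptics`, and `dummy_model` are hypothetical names, and the stand-in model only reports what it was fed rather than estimating a pose.

```python
def zero_haptics(haptic_channels):
    """Return a same-shaped copy of the haptic input with every value set to 0.0."""
    return [[0.0] * len(row) for row in haptic_channels]


def predict_pose(model, visual_tokens, haptic_channels, ablate_haptics=False):
    """Call the tracker, optionally zeroing haptic channels at inference time."""
    if ablate_haptics:
        haptic_channels = zero_haptics(haptic_channels)
    return model(visual_tokens, haptic_channels)


# Hypothetical stand-in for a trained tracker: it only summarizes its inputs.
def dummy_model(visual_tokens, haptic_channels):
    return {"num_visual": len(visual_tokens),
            "haptic_sum": sum(sum(row) for row in haptic_channels)}


visual = [[0.1, 0.2]] * 3
haptic = [[1.0, 2.0], [3.0, 4.0]]
full = predict_pose(dummy_model, visual, haptic)
ablated = predict_pose(dummy_model, visual, haptic, ablate_haptics=True)
```

The trained weights and the input shapes are identical in both calls; only the haptic values differ, so any remaining gain over V-only in the ablated run must come from how the model was trained rather than from haptic information at test time.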


Hardware Platform

Hand: 4-fingered dexterous hand, 16 DoF (4 joints × 4 fingers)
Sensors: Joint encoders (pos/vel), 3-axis proximal F/T sensors at fingertips, binary contact indicators
Camera: Intel RealSense D435i (wrist-mounted)
Simulation: IsaacLab (Isaac Sim)

[Figure: Hardware platform]

BibTeX

@inproceedings{anonymous2026kinefuse,
  title={KineFuse: Kinematic-Aware Haptic Fusion for In-Hand Occluded-Object Pose Tracking},
  author={Anonymous Authors},
  year={2026},
  note={Under Review}
}