KineFuse: Kinematic-Aware Haptic Fusion for In-Hand Occluded-Object Pose Tracking

Abstract

Dexterous in-hand manipulation requires continuous 6D pose tracking, yet the manipulating fingers inevitably occlude the object from the camera. We study how to structure the sparse haptic signals already available on multi-fingered hands — proprioception, proximal force/torque, and binary contact — to complement a pretrained visual pose tracker under occlusion.

We propose a kinematic-aware finger-level encoder and systematically compare it against four alternative designs through three levels of evaluation: per-frame refinement, sequential open-loop tracking, and closed-loop manipulation.

Per-frame evaluation barely separates the encoders (a 1.12× gap), but sequential evaluation amplifies the differences to 2× in tracking error and up to 15× in downstream manipulation success. Without explicit supervision, the structured encoder learns to rely exclusively on vision for translation and to dedicate one attention head to haptic tokens for rotation. The benefit persists even when haptic input is zeroed at inference (2.3× better than vision-only), which suggests that structured haptic encoding acts as a training-time inductive bias.

Pipeline

Joint Tokenization: Each joint's τ-step history → shared MLP → joint tokens with finger-identity and position embeddings

Intra-Finger Attention: Block-diagonal masked self-attention propagates sparse F/T cues within each finger

Finger-Level Pooling: Learned cross-attention compresses 16 joint tokens → 4 finger tokens

Inter-Finger Graph Attention: Graphormer layers with URDF-derived spatial bias capture inter-finger relationships

Visuo-Haptic Fusion: Visual (400 tokens) + haptic (4 tokens) concatenated → separate translation and rotation heads
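The intra-finger attention, finger-level pooling, and token concatenation steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. Only the token counts (16 joint tokens, 4 finger tokens, 400 visual tokens) come from the text. The token width `D`, the random stand-ins for learned queries, the omission of learned projections, and the restriction of each pooling query to its own finger's joints are all assumptions.

```python
import numpy as np

N_FINGERS, JOINTS_PER_FINGER = 4, 4
N_JOINTS = N_FINGERS * JOINTS_PER_FINGER
D = 32  # assumed token width, not given in the text

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_diagonal_mask(n_fingers, joints_per_finger):
    """True where joint i may attend to joint j, i.e. both lie on the same finger."""
    finger_id = np.repeat(np.arange(n_fingers), joints_per_finger)
    return finger_id[:, None] == finger_id[None, :]

def masked_self_attention(tokens, mask):
    """Single-head attention without learned projections (illustration only)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # block cross-finger attention
    weights = softmax(scores)
    return weights @ tokens, weights

def pool_to_fingers(tokens, queries, pool_mask):
    """Cross-attention pooling: each finger query attends over its own joints."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = np.where(pool_mask, scores, -np.inf)
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)
joint_tokens = rng.normal(size=(N_JOINTS, D))      # stand-in for joint tokens
mask = block_diagonal_mask(N_FINGERS, JOINTS_PER_FINGER)
attended, weights = masked_self_attention(joint_tokens, mask)

finger_queries = rng.normal(size=(N_FINGERS, D))   # stand-in for learned queries
finger_of_joint = np.repeat(np.arange(N_FINGERS), JOINTS_PER_FINGER)
pool_mask = np.arange(N_FINGERS)[:, None] == finger_of_joint[None, :]
finger_tokens = pool_to_fingers(attended, finger_queries, pool_mask)

visual_tokens = rng.normal(size=(400, D))          # stand-in for visual tokens
fused = np.concatenate([visual_tokens, finger_tokens], axis=0)
print(fused.shape)  # (404, 32)
```

The mask guarantees that a sparse F/T cue on one finger only influences tokens of that same finger before pooling. Cross-finger interaction is left to the later graph-attention stage, which this sketch does not cover.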

Encoder	Tokens	Kinematic Bias	Key Characteristic
(a) Naive	1	None	Flat concatenation → single token
(b) FingerMLP	4	None	Per-finger MLP + cross-finger attention
(c) 16-token	16	Partial	Joint-level attention, no pooling
(d) KineFuse (Ours)	4	URDF-aware	Intra-finger → pooling → inter-finger graph attention

Sequential Open-Loop Tracking

Pose at frame t initializes frame t+1, so errors accumulate over time.
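A toy sketch of why this protocol separates models more sharply than per-frame evaluation. Everything here is hypothetical: `make_toy_refiner` is a stand-in for a pose refiner that closes a fixed fraction (`gain`) of the gap between its initialization and the observed position, and the constant-velocity trajectory is synthetic. It illustrates error accumulation only and says nothing about the actual models.

```python
import numpy as np

def track_open_loop(refine, observations, init_pose):
    """Run a per-frame refiner sequentially; each estimate seeds the next frame."""
    pose = np.asarray(init_pose, dtype=float)
    trajectory = []
    for obs in observations:
        pose = refine(pose, obs)  # estimate at frame t initializes frame t+1
        trajectory.append(pose.copy())
    return np.stack(trajectory)

def make_toy_refiner(gain):
    """Hypothetical refiner: closes a fraction `gain` of the gap to the observation."""
    def refine(pose_init, obs):
        return pose_init + gain * (obs - pose_init)
    return refine

# Synthetic ground truth: an object translating 0.5 m along x over 50 frames.
true_positions = np.linspace([0.0, 0.0, 0.0], [0.5, 0.0, 0.0], 50)

def final_errors(gain):
    refine = make_toy_refiner(gain)
    # Per-frame: the last frame is initialized from the previous ground-truth pose.
    per_frame_est = refine(true_positions[-2], true_positions[-1])
    per_frame = np.linalg.norm(per_frame_est - true_positions[-1])
    # Sequential: only the first frame receives a ground-truth initialization.
    traj = track_open_loop(refine, true_positions, true_positions[0])
    sequential = np.linalg.norm(traj[-1] - true_positions[-1])
    return per_frame, sequential

strong_pf, strong_seq = final_errors(gain=0.9)
weak_pf, weak_seq = final_errors(gain=0.3)
per_frame_gap = weak_pf / strong_pf
sequential_gap = weak_seq / strong_seq
print(f"per-frame gap: {per_frame_gap:.1f}x, sequential gap: {sequential_gap:.1f}x")
```

In this toy setting the weaker refiner lags further behind at every step because it inherits its own residual error, so the gap between the two refiners is larger under sequential tracking than under per-frame evaluation.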

Model	Pos. Error (cm) ↓	Ang. Error (°) ↓	Occlusion Robust
V-only (FP)	8.11	54.9 – 79.4	✗ Fluctuates
Naive V+H	8.11	47.3 – 53.7	△ Partial
FingerMLP	5.54	47.0 – 67.6	△ Partial
16-token	5.81	69.2 – 69.3	✗ Collapsed
KineFuse (Ours)	4.04	47.6 ± 0.1	✓ Constant

KineFuse achieves 2× lower position error than vision-only and is the only model that maintains constant angular error (47.6°) regardless of occlusion level (0%–90%).

Downstream Manipulation Success

Task: in-hand tool reorientation to align the tool tip with a target. Values are successes per episode under an RL policy.

Model	0% Occ.	30% Occ.	50% Occ.	70% Occ.	90% Occ.
V-only	0.46	0.83	1.58	1.97	1.52
KineFuse (Ours)	4.61	4.72	4.47	5.10	3.81
GT (upper bound)	21.25	—	—	—	—

KineFuse achieves 3–15× higher manipulation success than vision-only across all occlusion levels.
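As a quick arithmetic check, the per-level improvement can be recomputed from the table above. The snippet uses only the tabulated values; the variable names are illustrative. On these values the ratio runs from about 2.5× at 90% occlusion to about 10× at 0% occlusion, so the 3–15× range quoted above presumably rests on numbers not shown in this table.

```python
occlusion = [0, 30, 50, 70, 90]              # occlusion level in percent
v_only = [0.46, 0.83, 1.58, 1.97, 1.52]      # successes per episode
kinefuse = [4.61, 4.72, 4.47, 5.10, 3.81]    # successes per episode
gt_upper_bound = 21.25                       # ground-truth pose, 0% occlusion

# Ratio of KineFuse to vision-only success at each occlusion level.
ratios = {occ: k / v for occ, v, k in zip(occlusion, v_only, kinefuse)}
for occ, ratio in ratios.items():
    print(f"{occ:>2}% occlusion: {ratio:.2f}x")
# prints 10.02x, 5.69x, 2.83x, 2.59x, 2.51x

# Share of the ground-truth upper bound reached at 0% occlusion.
share_of_gt = kinefuse[0] / gt_upper_bound
print(f"KineFuse reaches {share_of_gt:.1%} of the GT upper bound")
```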

Step	Trans. Error (mm) ↓, V-only	Trans. Error (mm) ↓, KineFuse	Rot. Error (°) ↓, V-only	Rot. Error (°) ↓, KineFuse
0	11.9	1.4	6.7	2.5
67	78.2	17.4	57.7	30.8
135	108.2	35.9	130.3	63.5
202	95.3	58.1	177.4	86.6

Structural vs. Informational Contribution

A key finding: zeroing all haptic channels at inference produces tracking nearly identical to the full model, yet 2.3× better than V-only. This reveals that haptic signals serve as a structural regularizer during training — their kinematic structure reshapes how the fusion transformer attends to visual tokens. Once trained, the model can be deployed even when haptic sensors are temporarily unavailable.
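The zeroing ablation described above can be expressed as a small evaluation harness. This is a sketch under assumptions: `toy_model` is a hypothetical stand-in for a trained visuo-haptic tracker, the array shapes are chosen for illustration, and the resulting numbers carry no meaning. Only the procedure (same model, haptic input replaced by zeros at inference) reflects the text.

```python
import numpy as np

def mean_position_error(model, visual, haptic, target):
    """Mean Euclidean error of the model's predicted positions."""
    pred = model(visual, haptic)
    return float(np.mean(np.linalg.norm(pred - target, axis=-1)))

def zero_haptics_ablation(model, visual, haptic, target):
    """Evaluate the same model with real haptic input and with it zeroed."""
    zeroed = np.zeros_like(haptic)  # a fresh array; the caller's data is untouched
    return {
        "full": mean_position_error(model, visual, haptic, target),
        "haptic_zeroed": mean_position_error(model, visual, zeroed, target),
    }

def toy_model(visual, haptic):
    """Hypothetical stand-in: averages tokens and maps them to a 3D position."""
    return visual.mean(axis=1) + 0.01 * haptic.mean(axis=1)

rng = np.random.default_rng(0)
batch = 8
visual = rng.normal(size=(batch, 400, 3))  # 400 visual tokens (width 3 assumed)
haptic = rng.normal(size=(batch, 4, 3))    # 4 finger tokens (width 3 assumed)
target = rng.normal(size=(batch, 3))

results = zero_haptics_ablation(toy_model, visual, haptic, target)
print(results)
```

Comparing the two entries against a separately trained vision-only model is what distinguishes a structural contribution (the zeroed variant still beats vision-only) from a purely informational one.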

BibTeX

@inproceedings{anonymous2026kinefuse,
  title={KineFuse: Kinematic-Aware Haptic Fusion for In-Hand Occluded-Object Pose Tracking},
  author={Anonymous Authors},
  year={2026},
  note={Under Review}
}

TL;DR:

We propose a kinematic-aware finger-level haptic encoder that complements a pretrained visual pose tracker under finger-induced occlusion. KineFuse achieves 2× lower position error and 3–15× higher manipulation success than vision-only.

Method Overview

KineFuse augments a frozen FoundationPose visual backbone with a URDF-aware finger-level haptic encoder.

Tracking Drift Comparison

At 30% occlusion over 149 tracking steps:
· KineFuse (V+H): Stabilizes after step 20, maintains accurate pose overlay through step 35+
· V-only (FP): Diverges continuously, loses tracking by step 20

Hardware Platform

Hand: 4-fingered dexterous hand, 16 DoF (4 joints × 4 fingers)
Sensors: Joint encoders (pos/vel), 3-axis proximal F/T sensors at fingertips, binary contact indicators
Camera: Intel RealSense D435i (wrist-mounted)
Simulation: IsaacLab (Isaac Sim)
