New Advances in Human-Machine Interaction: The ViTaM System for Visual-Tactile Multimodal Fusion Captures the Complete State of Deformable Object Manipulation

Our team developed ViTaM, a system featuring a flexible, strain-insensitive MEMS tactile glove with 1152 force-sensing channels and a deep learning framework for visual-tactile data. It captures dynamic hand-object interactions, supporting applications in virtual reality and robotics.

Tactile perception is one of the key channels for acquiring environmental information, holding significant research value in areas such as human-machine interaction, virtual reality, telemedicine, and humanoid robotics. Particularly for humanoid robots, the ability to perceive the objects they manipulate is crucial for dexterous, precise, and safe operation, bringing them closer to practical application in real-world scenarios. In daily life, objects can generally be categorized as rigid or deformable, and humans effortlessly perceive and manipulate both. However, because deformable objects such as clay and sponges exhibit complex deformability, effectively infinite degrees of freedom, and nonlinear mechanical behavior, their robotic manipulation remains a major technical challenge in the field. Only by effectively addressing these limitations can robots achieve comprehensive and versatile manipulation capabilities, paving the way for better human-machine integration and enabling humanoid robots to serve households in the near future.

To enhance the manipulation capabilities of humanoid robots, it is first necessary to comprehensively capture dynamic human manipulation through a human-machine interface (HMI), so that robots can learn from human experience. With advances in flexible electronics and artificial intelligence, HMI recognition has progressed from basic semantic results, such as recognizing gestures for specific letters, to identifying object types and locations. However, this remains at a superficial cognitive level compared with how humans understand the objects they manipulate. One factor hindering progress toward more intuitive and natural human-machine interaction is the difficulty of precisely capturing the force humans apply to objects, especially when those objects are deformable.

To address this challenge, our team has collaboratively developed a novel system: ViTaM (Visual-Tactile recording and tracking system for Manipulation). This system is equipped with a flexible, strain-insensitive MEMS tactile glove containing 1152 force-sensing channels, alongside a joint deep learning framework based on visual-tactile sequences for estimating dynamic changes in the hand-object state during manipulation. It not only records the interaction between the human hand and objects, but also leverages deep learning to accurately analyze these interactions, laying the groundwork for future applications in virtual reality, robotic manipulation, and large-scale AI models. Next, let’s delve into how this system operates and explore some of the innovative technologies behind it.
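To make the data flow concrete, here is a minimal, hypothetical Python sketch of what one synchronized recording step could look like: a 1152-channel force frame from the glove paired with a point cloud from the depth camera. The `ViTaMFrame` class, its fields, and the `contact_mask` helper are our own illustrative naming, not the system's actual data format.

```python
# A minimal, illustrative sketch (not the authors' actual data format) of how one
# synchronized ViTaM recording step might be represented: a 1152-channel tactile
# frame from the glove paired with a point cloud from the depth camera.
from dataclasses import dataclass
import numpy as np

@dataclass
class ViTaMFrame:                      # hypothetical container, for illustration only
    timestamp: float                   # seconds since the start of the recording
    tactile: np.ndarray                # (1152,) normal-force readings from the glove
    strain: np.ndarray                 # (1152,) per-channel strain estimates
    point_cloud: np.ndarray            # (N, 3) 3-D points from the depth camera

    def contact_mask(self, force_threshold: float = 0.05) -> np.ndarray:
        """Boolean mask of taxels whose force exceeds a small threshold."""
        return self.tactile > force_threshold

# Example: one fake frame with random data, just to show the shapes involved.
frame = ViTaMFrame(
    timestamp=0.0,
    tactile=np.random.rand(1152) * 0.1,
    strain=np.random.rand(1152) * 0.02,
    point_cloud=np.random.rand(2048, 3),
)
print(frame.contact_mask().sum(), "taxels in contact")
```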

Fig. 1. Overview of the proposed ViTaM system

  • Two modes of human-machine interaction: non-forceful interaction and forceful interaction

In human-machine interaction, non-forceful interaction refers to methods where humans interact without applying any external force, such as gesture control, facial recognition, or voice commands. These interactions are already well-developed and can be implemented using technologies like inertial measurement units (IMUs), electromyography (EMG) sensors, or video.

In contrast to non-forceful interactions, forceful interactions involve humans manipulating objects by applying force, such as gripping, pinching, and pressing deformable objects. This type of interaction is essential for simulating real object manipulation, as it requires accurate sensing of the object's shape, state, and even deformation. Currently, most research remains focused on predicting an object's position and applied force through vision or simple pressure sensors, lacking in-depth exploration of interface forces with deformable objects.

  • The challenge of capturing interface forces on deformable objects

So, why is it so difficult to capture the interface forces on deformable objects? When the hand contacts a deformable object, force is not the only thing transmitted: as the object deforms, the stretchable sensing interface itself is strained. This strain is sometimes small and easy to overlook, yet it can interfere with the force-detection signals. Tactile sensors intended for deformable-object interfaces therefore need to detect both the pressure and the deformation state in the contact area, and to adapt to the effects caused by the object's surface deformation.

Traditional approaches to strain-insensitive tactile sensing rely on strain isolation or stress-transfer strategies. However, these approaches can neither quantitatively assess how well strain is suppressed nor adapt to mechanical changes during actual measurements. Inspired by the closed-loop adaptive nature of human tactile perception, this research introduces a stretchable, flexible tactile glove with up to 1152 tactile acquisition channels. The glove is based on a composite film with negative/positive stretching-resistive effects, enabling it to measure normal forces while simultaneously monitoring strain levels. Through a closed-loop adaptive feedback scheme, the glove dynamically adjusts its sensing curve, ensuring higher accuracy of interface force perception when manipulating deformable objects and thereby greatly improving the accuracy of deformable object reconstruction.
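To illustrate the closed-loop idea, the following Python sketch shows one plausible way such adaptation could work: the measured strain level is used to interpolate a strain-dependent force-calibration curve for each channel, so that strain no longer biases the force estimate. The calibration numbers and the function `strain_compensated_force` are invented for illustration and are not the glove's actual characterization or algorithm.

```python
# A simplified sketch of closed-loop strain compensation (not the authors' exact
# algorithm): the measured strain selects an adjusted force-calibration curve for
# each sensing channel before converting the resistance change to a force value.
import numpy as np

# Hypothetical calibration: force = a(strain) * resistance_change + b(strain).
# In practice these coefficients would come from characterizing the composite film
# at several strain levels; here they are made-up numbers for illustration.
CALIB_STRAINS = np.array([0.0, 0.1, 0.2, 0.3])   # calibration strain levels
CALIB_A = np.array([2.00, 1.80, 1.65, 1.55])     # gain at each strain level
CALIB_B = np.array([0.00, 0.05, 0.09, 0.12])     # offset at each strain level

def strain_compensated_force(resistance_change: np.ndarray,
                             strain: np.ndarray) -> np.ndarray:
    """Estimate normal force per channel, adapting the calibration to measured strain."""
    a = np.interp(strain, CALIB_STRAINS, CALIB_A)   # interpolate gain vs. strain
    b = np.interp(strain, CALIB_STRAINS, CALIB_B)   # interpolate offset vs. strain
    return a * resistance_change + b

# Example: the same resistance change read at two different strain levels is mapped
# through two different calibration curves, keeping the force estimate consistent.
dr = np.array([0.5, 0.5])
eps = np.array([0.0, 0.2])
print(strain_compensated_force(dr, eps))
```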

Fig. 2. Design, fabrication, and testing of the tactile glove with the capability of strain interference suppression

  • Multimodal integration of vision and tactile sensing

While force perception is essential, tactile data alone cannot fully characterize the interaction between the hand and the object. For example, when you pinch modeling clay, your fingers contact only part of it; in those contact regions, where vision is occluded, the tactile sensors capture the clay's softness and deformation, while the rest of the object can only be perceived visually. Visual perception is therefore equally indispensable for capturing the object's overall shape and dynamic changes.

Fig. 3. The pipeline of the visual-tactile joint learning framework

The ViTaM system captures 3D point cloud data through a depth camera, integrating it with tactile information from the fingers to reconstruct the overall geometry of an object. Our deep learning framework encodes visual and tactile information separately and then combines these inputs to reconstruct the object’s fine-grained surface deformations and overall morphology. This integrated perception not only enhances the accuracy of sensing but also simulates the dynamic changes of the object, enabling machines to "perceive" the manipulated object in a human-like manner.
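As a rough illustration of this encode-then-fuse design, the PyTorch sketch below uses a point-cloud encoder and a tactile encoder whose features are concatenated and decoded into a reconstructed point cloud. The class name `VisualTactileFusion`, the layer sizes, and the overall layout are assumptions made for illustration, not the paper's actual network architecture.

```python
# A schematic PyTorch sketch of the fusion idea described above: separate visual and
# tactile encoders whose features are concatenated before decoding. This is an
# illustrative stand-in, not the authors' actual model or hyperparameters.
import torch
import torch.nn as nn

class VisualTactileFusion(nn.Module):
    def __init__(self, n_taxels: int = 1152, feat_dim: int = 256, n_out_points: int = 2048):
        super().__init__()
        # Visual branch: a PointNet-style per-point MLP followed by max pooling.
        self.visual_encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Tactile branch: an MLP over the 1152-channel force frame.
        self.tactile_encoder = nn.Sequential(
            nn.Linear(n_taxels, 512), nn.ReLU(),
            nn.Linear(512, feat_dim), nn.ReLU(),
        )
        # Decoder: maps the fused feature to a coarse reconstructed point cloud.
        self.decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_out_points * 3),
        )
        self.n_out_points = n_out_points

    def forward(self, points: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) partial point cloud; tactile: (B, 1152) force frame.
        vis_feat = self.visual_encoder(points).max(dim=1).values   # (B, feat_dim)
        tac_feat = self.tactile_encoder(tactile)                   # (B, feat_dim)
        fused = torch.cat([vis_feat, tac_feat], dim=-1)            # (B, 2*feat_dim)
        out = self.decoder(fused)                                  # (B, n_out_points*3)
        return out.view(-1, self.n_out_points, 3)                  # reconstructed geometry

# Example forward pass with random tensors, just to confirm the shapes line up.
model = VisualTactileFusion()
recon = model(torch.rand(2, 1024, 3), torch.rand(2, 1152))
print(recon.shape)  # torch.Size([2, 2048, 3])
```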

Fig. 4. Hand-object reconstructions based on the ViTaM system

  • Outlook

We believe that the ViTaM system, as a practical tool, achieves the fusion of cross-modal data features and addresses the challenge of deformable object tracking and 3D geometric reconstruction, enhancing the comprehensiveness of recognition results in human-machine interaction. It not only has clear value for developing smarter, more interactive virtual and augmented reality systems, but also represents a key step toward raising the understanding and manipulation capabilities of humanoid robots and intelligent agents to a human-like level.
