A novel multi-stream hand-object interaction network for assembly action recognition

Li Shaochen (School of Mechanical Engineering, Zhejiang University, Hangzhou, China)
Zhenyu Liu (School of Mechanical Engineering, Zhejiang University, Hangzhou, China)
Yu Huang (School of Mechanical Engineering, Zhejiang University, Hangzhou, China)
Daxin Liu (School of Mechanical Engineering, Zhejiang University, Hangzhou, China)
Guifang Duan (School of Mechanical Engineering, Zhejiang University, Hangzhou, China)
Jianrong Tan (School of Mechanical Engineering, Zhejiang University, Hangzhou, China)

Robotic Intelligence and Automation

ISSN: 2754-6969

Article publication date: 2 September 2024

Issue publication date: 18 November 2024

Abstract

Purpose

Assembly action recognition plays an important role in assembly process monitoring and human-robot collaborative assembly. Previous works overlook the interaction relationship between hands and operated objects and lack the modeling of subtle hand motions, which leads to a decline in accuracy for fine-grained action recognition. This paper aims to model the hand-object interactions and hand movements to realize high-accuracy assembly action recognition.

Design/methodology/approach

In this paper, a novel multi-stream hand-object interaction network (MHOINet) is proposed for assembly action recognition. To learn the hand-object interaction relationship in an assembly sequence, an interaction modeling network (IMN) comprising both geometric and visual modeling is used in the interaction stream. The geometric modeling captures the spatial relation between the hand and the interacted parts/tools from their detected bounding boxes, and the visual modeling mines the visual context of hand and object at the pixel level through a position attention model. To model the hand movements, a temporal enhancement module (TEM) with multiple convolution kernels is developed in the hand stream, which captures both short-range and long-range temporal dependencies in hand sequences. Finally, the assembly action is predicted by merging the outputs of the different streams through a weighted score-level fusion. A robotic arm component assembly dataset is created to evaluate the effectiveness of the proposed method.
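
As a rough illustration of the weighted score-level fusion step mentioned above, the following Python sketch combines per-stream class scores into one prediction. It is not the authors' implementation: the stream names ("hand", "object", "interaction"), the use of softmax before fusion, the weights and the example logits are all assumptions made for illustration, and the abstract does not state how the actual weights are chosen.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_logits, weights):
    """Weighted score-level fusion: sum over streams of weight * probabilities.

    stream_logits and weights are dicts keyed by (hypothetical) stream name.
    """
    if stream_logits.keys() != weights.keys():
        raise ValueError("each stream needs exactly one weight")
    num_classes = len(next(iter(stream_logits.values())))
    fused = [0.0] * num_classes
    for name, logits in stream_logits.items():
        probs = softmax(logits)
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p
    return fused


def predict(stream_logits, weights):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(stream_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up logits for three action classes; not data from the paper.
example_logits = {
    "hand": [2.0, 1.0, 0.1],
    "object": [0.5, 2.5, 0.3],
    "interaction": [0.2, 3.0, 0.1],
}
# Assumed weights; they sum to 1 so the fused scores stay a distribution.
example_weights = {"hand": 0.3, "object": 0.3, "interaction": 0.4}

predicted_class = predict(example_logits, example_weights)
```

In this toy example the hand stream alone favors class 0, but the object and interaction streams outweigh it, so the fused prediction is class 1. Fusing at the score level keeps each stream independent until the final step, which makes the weights easy to tune on a validation set.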

Findings

The method achieves recognition accuracies of 97.31% and 95.32% for coarse and fine assembly actions, respectively, outperforming the compared methods. Experiments on human-robot collaboration demonstrate that the method can be applied to industrial production.

Originality/value

The authors propose a novel framework for assembly action recognition that simultaneously leverages the features of hands, objects and hand-object interactions. The TEM enhances the representation of hand dynamics and facilitates the recognition of assembly actions with various time spans. The IMN learns semantic information from hand-object interactions, which is important for distinguishing fine assembly actions.

Acknowledgements

Funding: This work was supported in part by the Key Research and Development Program of Zhejiang Province under Grant 2022C01064, in part by the National Natural Science Foundation of China under Grant U22A600, Grant 52075480 and Grant 51935009, and in part by the High-level Talent Special Support Plan of Zhejiang Province under Grant 2020R52004.

Conflict of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Citation

Shaochen, L., Liu, Z., Huang, Y., Liu, D., Duan, G. and Tan, J. (2024), "A novel multi-stream hand-object interaction network for assembly action recognition", Robotic Intelligence and Automation, Vol. 44 No. 6, pp. 854-870. https://doi.org/10.1108/RIA-01-2024-0020

Publisher

Emerald Publishing Limited

Copyright © 2024, Emerald Publishing Limited
