MULTIMODAL INFORMATION PROCESSING AND BEHAVIOUR GENERATION OF HUMANOID ROBOTS BASED ON PALM 2 MODEL AND MULTIMODAL TRANSFORMER ARCHITECTURE

Jie Fang; Wenliang Ju; Feng Xiao; Rubing Huang; Yanjun Liu

doi:10.2316/J.2025.206-1233

Jie Fang,∗ Wenliang Ju,∗∗ Feng Xiao,∗ Rubing Huang,∗ and Yanjun Liu∗

Keywords

Multimodal transformer, pathways language model 2 (PaLM 2), humanoid robots, behaviour generation, deep Q-network (DQN) algorithm

Abstract

To address the low efficiency of behaviour decision-making and generation in humanoid robots, their reliance on a single data modality, and the difficulty of effectively integrating information from different modalities, this article combines Pathways Language Model 2 (PaLM 2) with a multimodal Transformer architecture to study multimodal information processing and behaviour generation for humanoid robots. The experiment uses preprocessed visual, tactile, and auditory data. A multimodal Transformer architecture fuses the multimodal data, with a soft attention mechanism performing preliminary feature fusion. A multi-head attention mechanism then processes the fused features further and outputs high-dimensional feature vectors. Next, the PaLM 2 model interprets natural speech instructions and models the context to generate accurate task descriptions. Finally, the deep Q-network (DQN) algorithm generates behaviour for the humanoid robot, selecting actions with an ε-greedy strategy, which improves the efficiency of multimodal information processing and behaviour generation. The experiment is based on the public normal robots behaviour dataset and on data from humanoid robot companies. The combined multimodal Transformer, PaLM 2, and DQN approach (Transformer-PaLM 2-DQN) achieves a generation efficiency of 92.67% for humanoid robots, which is 7.38% higher than deep deterministic policy gradient (DDPG) and 10.22% higher than using single-mode visual information. The results show that using the PaLM 2 model and the multimodal Transformer architecture with DQN to generate humanoid robot behaviour improves generation efficiency and accuracy, integrates information from different modalities, and adapts to the environment, which supports wider application of humanoid robots.
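As a rough illustration of two steps named in the abstract, the sketch below shows softmax-weighted ("soft attention") fusion of per-modality feature vectors and greedy action selection with random exploration. It assumes the action-selection strategy is the standard ε-greedy rule used with DQN. The function names, the relevance scores, and the toy feature values are hypothetical and are not taken from the article; the article's actual network, feature dimensions, and hyperparameters are not given here.

```python
import math
import random


def soft_attention_fuse(features, scores):
    """Fuse per-modality feature vectors using softmax weights.

    features: list of equal-length vectors, one per modality
              (e.g. visual, tactile, auditory). Hypothetical inputs.
    scores:   one unnormalised relevance score per modality.
    Returns the weighted-sum vector and the softmax weights.
    """
    # Subtract the max score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(features[0])
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values.

    With probability epsilon choose a uniformly random action (explore);
    otherwise choose the action with the highest Q-value (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    # Toy two-dimensional features for three modalities (illustrative only).
    visual, tactile, auditory = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    fused, weights = soft_attention_fuse(
        [visual, tactile, auditory], scores=[0.0, 0.0, 0.0]
    )
    print("weights:", weights)
    print("fused:", fused)
    print("action:", epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1))
```

In a full system the relevance scores would be learned rather than supplied by hand, and the Q-values would come from the trained network; both are fixed here only to keep the sketch self-contained.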
∗ School of Computer and Information Technology, Anhui Vocational and Technical College, Anhui, China; e-mail: [email protected]; [email protected]; huang robin@ oxmail.com; [email protected]
∗∗ Institute of Statistics and Applied Mathematics, Anhui University of Finance & Economics, Anhui, China; e-mail: [email protected]
Corresponding author: Wenliang Ju

From Journal (206) International Journal of Robotics and Automation - 2025
