MULTIMODAL INFORMATION PROCESSING AND BEHAVIOUR GENERATION OF HUMANOID ROBOTS BASED ON PALM 2 MODEL AND MULTIMODAL TRANSFORMER ARCHITECTURE

Jie Fang,∗ Wenliang Ju,∗∗ Feng Xiao,∗ Rubing Huang,∗ and Yanjun Liu∗

Keywords

Multimodal transformer, pathways language model 2 (PaLM 2), humanoid robots, behaviour generation, deep Q-network (DQN) algorithm

Abstract

Humanoid robots suffer from inefficient behaviour decision-making and generation, reliance on a single data modality, and difficulty in effectively integrating information from different modalities. To address these problems, this article combines Pathways Language Model 2 (PaLM 2) with a multimodal Transformer architecture to study multimodal information processing and behaviour generation for humanoid robots. The experiment uses preprocessed visual, tactile, and auditory data. A multimodal Transformer architecture fuses the multimodal data, with a soft attention mechanism performing preliminary feature fusion. A multi-head attention mechanism then processes the fused features further and outputs high-dimensional feature vectors. Next, the PaLM 2 model interprets natural speech instructions and models the context to generate accurate task descriptions. Finally, the deep Q-network (DQN) algorithm generates behaviour for the humanoid robot, selecting actions with an ε-greedy strategy, which improves the efficiency of multimodal information processing and behaviour generation. The experiment is based on the public normal robots behaviour dataset and data from humanoid robot companies. The combined multimodal Transformer, PaLM 2, and DQN approach (Transformer-PaLM 2-DQN) achieves a generation efficiency of 92.67% for humanoid robots, which is 7.38% higher than deep deterministic policy gradient and 10.22% higher than single-mode visual information. The results show that using the PaLM 2 model and multimodal Transformer architecture with DQN to generate humanoid robot behaviour greatly improves generation efficiency and accuracy, integrates information from different modalities, and adapts to the environment, promoting the widespread application of humanoid robots in today’s society.
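The action-selection step named in the abstract can be illustrated with a minimal sketch of ε-greedy selection over a vector of Q-values. This is not the authors' implementation: the function name `epsilon_greedy`, the example Q-values, and the choice of ε are assumptions made purely for illustration, and in the described system the Q-values would come from the trained DQN rather than a hand-written list.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by an epsilon-greedy rule.

    With probability `epsilon` a uniformly random action is explored;
    otherwise the action with the highest Q-value is exploited.
    `q_values` is a hypothetical list of per-action Q estimates.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Illustrative Q-values for three candidate robot actions (made up).
example_q = [0.1, 0.7, 0.2]
greedy_action = epsilon_greedy(example_q, epsilon=0.0)
```

With ε set to 0 the rule is purely greedy, and with ε set to 1 it is purely random; in practice ε is often decayed over training so that exploration gives way to exploitation.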
∗ School of Computer and Information Technology, Anhui Vocational and Technical College, Anhui, China; e-mail: [email protected]; [email protected]; huang robin@ oxmail.com; [email protected]
∗∗ Institute of Statistics and Applied Mathematics, Anhui University of Finance & Economics, Anhui, China; e-mail: [email protected]
Corresponding author: Wenliang Ju
