JIITA, vol.8 no.1, p666-670, 2024, DOI: 10.22664/ISITA.2021.7.1.666
Hyunduk Kim1), Sang-Heon Lee1) and Myoung-Kyu Sohn1)
1)Division of Automotive Technology, DGIST, Daegu, Republic of Korea
E-mail: hyunduk00@dgist.ac.kr, pobbylee@dgist.ac.kr, smk@dgist.ac.kr
Abstract: We present an algorithm for 3D facial landmark detection and head pose estimation. To solve these two tasks simultaneously, we apply multi-task learning. Our architecture consists of three components: a backbone that extracts features common to both tasks, multiple heads that handle the individual tasks, and linear layers that output the results. The Vision Transformer (ViT) has recently achieved excellent image recognition results, and many variants have since been proposed to improve on the original ViT. In particular, MobileViT combines a CNN and ViT into a light-weight, general-purpose, and mobile-friendly vision transformer. To enable real-time processing, we adopt MobileViT as the backbone network. Moreover, we employ the PCGrad algorithm for stable convergence during training. To evaluate the proposed algorithm, we trained it on the 300W-LP dataset and tested it on the AFLW2000-3D dataset. In the experiments, we compare the accuracy and efficiency of MobileNetV3 and MobileViT as backbone networks.
Keywords: 3D facial landmarks detection, head pose estimation, multi-task learning, vision transformer
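Note: The abstract mentions PCGrad for stable multi-task training. The following is a minimal NumPy sketch of the general gradient-surgery idea behind PCGrad, not the authors' implementation: when the gradient of one task conflicts with that of another (negative dot product), the conflicting component is projected out before the gradients are combined. The function name `pcgrad`, the flat gradient vectors, and the toy values are assumptions made for illustration only.

```python
import numpy as np

def pcgrad(task_grads, rng=None):
    """Combine per-task gradients with PCGrad-style projection.

    task_grads: list of 1-D gradient vectors, one per task (hypothetical
    stand-ins for the landmark and head pose losses).
    Returns the sum of the projected gradients.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = [np.asarray(g, dtype=float) for g in task_grads]
    projected = []
    for i, g_i in enumerate(grads):
        g = g_i.copy()
        # Visit the other tasks in random order.
        others = [j for j in range(len(grads)) if j != i]
        for j in rng.permutation(others):
            g_j = grads[j]
            dot = float(g @ g_j)
            if dot < 0.0:
                # Conflict: remove the component of g that points against g_j.
                g -= dot / float(g_j @ g_j) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)

# Toy example: two conflicting task gradients (dot product is -1).
g_landmarks = [1.0, 0.0]
g_pose = [-1.0, 1.0]
update = pcgrad([g_landmarks, g_pose], rng=np.random.default_rng(0))
print(update)
```

In a real training loop the vectors would be the flattened gradients of the shared backbone parameters with respect to each task loss; task-specific head parameters receive only their own task's gradient and need no projection.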