Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories

Date of this Version

1-1-2009

Document Type

Conference Proceeding

Abstract

We describe a novel approach for determining the audio-visual synchrony of a monologue video sequence, utilizing vocal pitch and facial landmark trajectories as descriptors of the audio and visual modalities, respectively. The visual component is represented by the horizontal and vertical displacements of corresponding facial landmarks between subsequent frames. These facial landmarks are acquired using a statistical modeling technique known as the Active Shape Model (ASM). The audio component is represented by the fundamental frequency, or pitch, obtained using the subharmonic-to-harmonic ratio (SHR). The synchrony between the audio and visual feature vectors is computed using Gaussian mutual information. The raw synchrony estimates obtained with this method may contain spurious values due to over-sensitivity, so a filtering step is employed to discard synchrony values that occur during non-associated audio and visual events. The human visual system is capable of distinguishing rigid from non-rigid motion of an articulator during speech; in an attempt to emulate this process, we separate rigid and non-rigid motion and compute the synchrony attributed to each. Experiments are conducted on a dataset of monologue video clip pairs, each composed of a synchronous and an asynchronous version of the same clip; in the asynchronous clips, the audio signal is temporally displaced with respect to the visual signal. Experimental results indicate that the proposed approach succeeds in detecting facial regions that demonstrate synchrony and in distinguishing between synchronous and asynchronous sequences. © 2009. The copyright of this document resides with its authors.
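To make the synchrony computation concrete, the sketch below illustrates one way the Gaussian mutual information step described in the abstract could be implemented. It is not the authors' code: the ASM landmark tracking and SHR pitch extraction are assumed to have been run elsewhere, the function names (gaussian_mutual_information, landmark_displacements) are hypothetical, and the data in the usage example is synthetic. Under a joint Gaussian assumption, the mutual information between audio features A and visual features V is 0.5 * log( det(C_A) * det(C_V) / det(C_AV) ), where C_AV is the joint covariance.

```python
# Illustrative sketch only; assumes ASM landmarks and SHR pitch are precomputed.
import numpy as np

def gaussian_mutual_information(audio_feats, visual_feats):
    """Estimate I(A; V) assuming jointly Gaussian features:
    0.5 * (log det C_A + log det C_V - log det C_AV)."""
    a = np.asarray(audio_feats, dtype=float).reshape(len(audio_feats), -1)   # (T, d_a)
    v = np.asarray(visual_feats, dtype=float).reshape(len(visual_feats), -1) # (T, d_v)
    joint = np.hstack([a, v])                    # (T, d_a + d_v)
    c_joint = np.cov(joint, rowvar=False)        # joint covariance C_AV
    d_a = a.shape[1]
    c_a = c_joint[:d_a, :d_a]                    # marginal covariance of audio
    c_v = c_joint[d_a:, d_a:]                    # marginal covariance of visual
    # slogdet is numerically safer than det for near-singular covariances
    _, ld_a = np.linalg.slogdet(c_a)
    _, ld_v = np.linalg.slogdet(c_v)
    _, ld_j = np.linalg.slogdet(c_joint)
    return 0.5 * (ld_a + ld_v - ld_j)

def landmark_displacements(landmarks):
    """Horizontal and vertical displacement of each landmark between
    subsequent frames, flattened to one feature vector per frame pair."""
    diffs = np.diff(landmarks, axis=0)           # (T-1, K, 2)
    return diffs.reshape(diffs.shape[0], -1)     # (T-1, 2K)

# Usage example with synthetic data (T frames, K tracked landmarks)
rng = np.random.default_rng(0)
T, K = 100, 5
landmarks = rng.normal(size=(T, K, 2)).cumsum(axis=0)       # stand-in for ASM tracks
pitch = rng.normal(loc=120.0, scale=10.0, size=T)            # stand-in for SHR pitch

vis = landmark_displacements(landmarks)          # visual feature vectors
aud = np.diff(pitch)[:, None]                    # align audio length with displacements
print("Gaussian MI estimate:", gaussian_mutual_information(aud, vis))
```

In practice this estimate would be computed over short sliding windows (and per facial region, or per rigid/non-rigid motion component), yielding the local synchrony values that the paper's filtering step then prunes; the windowing and filtering are omitted here for brevity.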

DOI

10.5244/C.23.10
