Author(s)
Arun M. Raghavan BS
Gavriel D. Kohlberg MD
Noga Lipschitz MD
Joseph T. Breen MD
Ravi N. Samy MD FACS
Affiliation(s)
University of Cincinnati College of Medicine
Abstract:
Educational Objective: At the conclusion of this presentation, participants should be aware of the potential benefits and structure of a visual speech recognition program for augmenting human speech perception.

Objectives: To evaluate the accuracy and speed achieved by a visual speech recognition program (VSRP) based on a long short-term memory (LSTM) neural network.

Study Design: Prospective study.

Methods: A dual video/infrared camera was used to continuously track 35 points around the lips during speech in real time. A real-time geometric transformation was applied to normalize all tracked points to a common three-dimensional axis. A VSRP consisting of three separate LSTM neural networks, each with a Softmax classification layer, was developed to identify 42 sentences from the Bamford-Kowal-Bench Speech-in-Noise (BKB-SIN) test using these data. Each network was evaluated by 10-fold cross-validation on 2,800 samples covering a 14-sentence subset of the 42 BKB-SIN sentences. The network input consisted of a sequence of data frames of 105 features each, and each network had 800 hidden units. Classification time, defined as the time elapsed between the network receiving an input dataset and returning a classification result, was measured across all 2,800 samples.

Results: The VSRP achieved an average 10-fold cross-validation accuracy across the three networks of 75.90% ± 8.42% (mean ± SD). The average classification time was 7.3 ± 2.3 ms (mean ± SE).

Conclusions: The VSRP achieved a high level of accuracy on sentences taken from a common speech battery. Further evaluation is needed to demonstrate the use of this system in augmenting human speech perception. It may assist those with hearing loss, such as hearing aid or cochlear implant users.
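The classification stage described in the Methods can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: only the dimensions come from the abstract (105 input features per frame, i.e. 35 lip points in three dimensions; 800 hidden units; a Softmax layer over the 14 sentence classes handled by one network). The single-layer LSTM, the final-hidden-state readout, and the random placeholder weights are assumptions made for the sake of the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax over class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_classify(frames, params):
    """Run a single-layer LSTM over a sequence of feature frames and
    return Softmax class probabilities from the final hidden state."""
    W, U, b, W_out, b_out = params
    n_hidden = U.shape[1]
    h = np.zeros(n_hidden)  # hidden state
    c = np.zeros(n_hidden)  # cell state
    for x in frames:
        z = W @ x + U @ h + b            # all four gate pre-activations
        i, f, o, g = np.split(z, 4)      # input, forget, output, candidate
        i = 1.0 / (1.0 + np.exp(-i))
        f = 1.0 / (1.0 + np.exp(-f))
        o = 1.0 / (1.0 + np.exp(-o))
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
    return softmax(W_out @ h + b_out)

# Dimensions from the abstract: 105 features per frame, 800 hidden
# units, 14 sentence classes per network. Weights here are random
# placeholders standing in for trained parameters.
rng = np.random.default_rng(0)
n_feat, n_hid, n_cls = 105, 800, 14
params = (rng.standard_normal((4 * n_hid, n_feat)) * 0.01,
          rng.standard_normal((4 * n_hid, n_hid)) * 0.01,
          np.zeros(4 * n_hid),
          rng.standard_normal((n_cls, n_hid)) * 0.01,
          np.zeros(n_cls))
frames = rng.standard_normal((60, n_feat))  # e.g. 60 video frames
probs = lstm_classify(frames, params)       # one probability per sentence
```

Classifying a whole sentence from a short landmark sequence with a single forward pass is what makes the millisecond-scale classification times reported in the Results plausible.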