A project by Michelle Cortese and Anne-Marie Lavigne
[TEMPORARY THESIS]
The voice is an interdisciplinary mechanism: a key piece of the human body whose study involves everything from semantics to biomechanics to Roman war history (the origins of phonetic text) to acoustics. Many relevant fields have engaged with the intricacies of the voice on their own, but it is not often that a research project can weave the study of musculature, acoustics and the Gabor algorithm together with equal focus. We are looking to combine physical and visual face mapping with the science and art of the voice. We will monitor the muscular, structural and vocal cord movements and positions of professional vocalists.
The Mechanism of the Human Voice
When The Mechanism of the Human Voice was written in 1940, the book was one of a kind. “Voice and speech are of such importance to mankind that one might well envisage an enormous literature dealing with the subject in all its complex details. Such a literature does exist, but the numerous writings are scattered throughout the pages of divers journals; indeed, the science and art of vocalization is so many-sided that few persons possess the knowledge necessary for a complete understanding of its many problems.” (Curry, Foreword). Despite the book’s near-ancient status in the world of science, Curry raises a relevant and thought-provoking hypothesis: the science of the voice is an interdisciplinary study, a pursuit that encompasses everything from phonetics to musculature to semantics to acoustics to mechanics and beyond.
The Mechanism of the Human Voice breaks the voice down into the anatomy of the vocal organs, general vocal acoustics, the physiology of phonation and the mechanisms of speech. Additional, smaller chapters explore hearing, singing, vocal disorder, personality and experimental study. Although he identifies the vocal cords as the primary source of the voice’s frequency vibration in nearly every chapter, Curry still takes the time to provide a full anatomical explanation of the voice (face, neck, head, chest, etc.) and of how each element interacts, as well as an exploration of frequency relationships and natural human ranges.
Curry’s most innovative offering lies in the Physiology of Phonation chapter: a truly interdisciplinary look at the mechanism of speech. This chapter introduces the idea that speech is both proprioceptive and exteroceptive, meaning that it is produced by internal neural sensations but also mediated by external factors. External factors such as listening inform much of our speech, which is why perfecting intonation can be quite difficult for deaf speakers learning audible speech. The interdisciplinary view continues with phonation duration, which is controlled by laryngeal muscular action and the state of the glottis during the expiratory breath. Curry argues that one can look to respiratory action in expiration to better understand the speech process. Vocal pitch is divided into the low-pitch range, the ‘covered’ voice, falsetto and the whistle voice (a whistle of the vocal cords, not the lips); pitches are a product of the vocal cords, laryngeal muscles and glottis.
Curry, Robert Oswald Leonard. The Mechanism of the Human Voice. New York, Toronto: Longmans, Green & co., 1940.
Speech and Voice Science
Beginning with the physics of sound, extensively covering breathing, phonation, resonance and articulation (consonants versus vowels), and ending with research topics in production and perception, Behrman is beyond comprehensive. This book is less the pushy interdisciplinary effort of Curry and more an eighth-grade science textbook’s take on the science of speech: simple, comprehensive and filled with dozens upon dozens of diagrams illustrated by Maury Aaseng.
The beauty of Behrman’s simple, diagram-laden take on the mechanics of speech is that she offers plenty of suggestions for continued scientific pursuit, in particular the instrumentation suited to monitoring each factor of speech she lists. Especially notable is the entire chapter devoted to the measurement and instrumentation of phonetics. “Measurement of [frequency] and intensity for clinical and research purposes can be divided generally into three categories of measurement; levels of habitual use of the voice, levels of maximum performance, and degree of regularity.” (Behrman, 181). The instrumentation relevant to the visual, audible and physical facial tracking we will be tackling includes: microphone frequency tests, attention to jitter, measurement of airflow and intramural pressure, various types of vocal cord monitoring devices (stroboscopy, laryngeal imaging, etc.), and awareness of the four recognized vocal registers (noted above).
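As a rough illustration of the "degree of regularity" category, jitter can be estimated from the durations of successive vocal cord cycles. The sketch below assumes those cycle periods have already been extracted from a recording (the sample values are invented) and uses one common definition of local jitter, the mean absolute difference between consecutive periods divided by the mean period. The function names are ours and the formula is not taken from Behrman's text.

```python
def local_jitter(periods):
    """Local jitter as a fraction: mean |T[i+1] - T[i]| divided by mean T."""
    if len(periods) < 2:
        raise ValueError("need at least two cycle periods")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period


def frequency_of_mean_period(periods):
    """Frequency in Hz that corresponds to the mean period (periods in seconds)."""
    return len(periods) / sum(periods)


# Hypothetical cycle periods in seconds, roughly a 200 Hz voice.
example_periods = [0.0050, 0.0051, 0.0049, 0.0050, 0.0052]
jitter_percent = 100 * local_jitter(example_periods)
```

A perfectly regular voice would give a jitter of zero; larger values indicate more cycle-to-cycle irregularity.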
Behrman, Alison. Speech and Voice Science. San Diego: Plural Publishing, 2007.
3-D Facial Tracking
Facial recognition systems are “an important modality of modern human computer interaction”. Most current research is done for medical and diagnostic purposes, for artistic creation or for animation production. We have identified three different approaches:
- Top-down or analytic approaches: They use a combination of points related to the main features or organs of the face, such as the mouth, the eyes, the nose and the eyebrows. These are the fiducial points. The points are connected together, and the distances and angles between them are used to create the facial recognition image.
- Bottom-up approaches: They use parts of the facial organs combined with the position of the organs. The tracking is then optimized further.
- Holistic approaches: Instead of using points or organs, they use the whole face to produce an image. In those cases, “the normalization on face size and rotation is a really important pre-processing to make the recognition robust”.
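To make the analytic approach more concrete, here is a minimal sketch of how distances and angles between fiducial points could be turned into a small feature set. The point names, coordinates and choice of features are hypothetical, and dividing by the distance between the eyes is our own assumption for making the numbers independent of face size; none of this comes from a specific system reviewed here.

```python
import math

# Hypothetical fiducial points as (x, y) pixel coordinates.
landmarks = {
    "left_eye": (120.0, 150.0),
    "right_eye": (200.0, 150.0),
    "nose_tip": (160.0, 200.0),
    "mouth_center": (160.0, 250.0),
}


def distance(p, q):
    """Euclidean distance between two points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between the rays towards p and q."""
    a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
    a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
    deg = abs(math.degrees(a1 - a2))
    return 360.0 - deg if deg > 180.0 else deg


def face_features(points):
    """Distances scaled by the eye-to-eye distance, plus one angle."""
    eye_dist = distance(points["left_eye"], points["right_eye"])
    return {
        "eye_to_nose": distance(points["left_eye"], points["nose_tip"]) / eye_dist,
        "nose_to_mouth": distance(points["nose_tip"], points["mouth_center"]) / eye_dist,
        "eyes_angle_at_nose": angle_at(
            points["nose_tip"], points["left_eye"], points["right_eye"]
        ),
    }


features = face_features(landmarks)
```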
The main issue faced by all of those approaches is that face recognition depends on the appearance of the points on the projective surface. The 3-D modeling of the face thus varies with the pose, the illumination and the expression of the subject. This is what researchers call the PIE problem.
Here is a review of three facial tracking systems: two from medical research and one programmed by an artist.
Medical research
Facial recognition for medical purposes is important because it provides information on the state and physical condition of patients. It can therefore enhance pain recognition. We have found two interesting articles reporting the results of two different approaches to the PIE problem: a multi-camera system and a combination of 2-D and 3-D imagery.
A New Multi-Camera Based Facial Expression Analysis Concept by Niese, R., Al-Hamadi, A. and Michaelis, B. in Campilho, A. and Kamel, (Eds), ICIAR 2012, Part II, LNCS 7325, pp. 64-71
To avoid dealing with complex lighting methods, the model proposed here automatically adapts a generic model to the face currently under observation. The authors combine live capture imagery of the subject with a mesh model that has been developed from a series of stereoscopic scans of several subjects. For every subject, the mesh model is adapted live to the person’s facial features. The adaptation process requires a frontal image and uses information on the positions of the eyes, the lips and the nose. Those are then aligned with the X-axis of the mesh model.
After the frontal image has been paired with the mesh model, a cross correlation is done to find the points on the images captured by the two other cameras.
This model reaches an average classification rate of 81.5%.
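The cross-correlation step can be pictured as template matching: a small patch around a feature in the frontal image is slid over another camera's image, and the position with the highest correlation is kept. The sketch below is a generic zero-mean normalized cross-correlation on tiny hand-made grayscale arrays. It is an assumption about how such a search might look, not the procedure from the article, and the function names are ours.

```python
import math


def ncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat patch carries no pattern to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(image, template):
    """Slide the template over the image; return ((row, col), score)."""
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, -2.0
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(window, template)
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Hypothetical second-camera image containing the feature at row 2, column 1.
other_view = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 9, 3, 1, 1, 1],
    [1, 2, 8, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
feature_patch = [[9, 3], [2, 8]]
position, score = best_match(other_view, feature_patch)
```

Because the score is normalized, it tolerates overall brightness differences between cameras, which is one reason this kind of measure is popular for matching across views.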
Pros and Cons
- This model does not represent the 3-D shape details specific to a facial expression, only the general form of the face based on the positions of four facial features.
- The projective properties of the image capturing device must be taken into account properly. The camera model parameters are obtained in a calibration step that has to be very precise to avoid distortions.
- The results have shown a deviation of 7 degrees for the rotations and 8 cm for the translations of the head.
Combined Online and Offline Information for Tracking Facial Feature Points by Wang, X., Zhang, Y. and Chunlei, C., in C.-Y. Su, S. Rakheja, H. and H. Liu (Eds), ICIRA 2012, Part I, LNAI 7506, pp. 196-206.
This approach combines offline information on movement constraints in 3-D space with online frame-to-frame (25 fps) imagery created using the Gabor wavelet algorithm. The offline and online methods are integrated with a bundle adjustment method. The tracking process is made of three steps:
- define the facial points and construct the initial keyframe;
- estimate the current frame’s feature points and set the previous frame’s feature points;
- optimize the current frame’s feature points with the integration tracking method.
The authors use 14 obvious points of the human face to locate the corresponding points produced by the algorithm. With the frame-by-frame method, they can predict feature points on the following frames.
They then take a 30 x 30 pixel image of the area around every point and transform its pixels to get the image. They do so because “only using spatial and temporal continuity information between successive frames to track often leads to error accumulation and gradually causes the drift”.
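For readers unfamiliar with Gabor wavelets, the sketch below shows the general idea on a single 30 x 30 patch: a Gabor kernel is a sinusoid under a Gaussian envelope, and multiplying a patch by kernels at several orientations gives a short feature vector that describes the local texture. The parameter values (wavelength, sigma, gamma, four orientations) and the synthetic striped patch are assumptions chosen for illustration; the article's actual transform and settings are not reproduced here.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a Gabor kernel as a size x size list of lists."""
    half = (size - 1) / 2.0
    kernel = []
    for row in range(size):
        y = row - half
        line = []
        for col in range(size):
            x = col - half
            # Rotate coordinates so the sinusoid runs along angle theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2 * sigma * sigma))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            line.append(envelope * carrier)
        kernel.append(line)
    return kernel


def patch_response(patch, kernel):
    """Sum of element-wise products of a patch and a same-size kernel."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))


def gabor_features(patch, orientations=4, wavelength=8.0, sigma=4.0):
    """Responses of one patch to kernels at evenly spaced orientations."""
    size = len(patch)
    return [
        patch_response(patch, gabor_kernel(size, wavelength, math.pi * i / orientations, sigma))
        for i in range(orientations)
    ]


PATCH_SIZE = 30
# Hypothetical patch: vertical stripes that repeat every 8 pixels.
stripes = [
    [math.cos(2 * math.pi * (c - (PATCH_SIZE - 1) / 2.0) / 8.0) for c in range(PATCH_SIZE)]
    for _ in range(PATCH_SIZE)
]
responses = gabor_features(stripes)
strongest = max(range(len(responses)), key=lambda i: abs(responses[i]))
```

On this striped patch the kernel whose orientation matches the stripes responds most strongly, which is the property that makes such responses useful for recognizing the same facial point from frame to frame.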
Pros and Cons
- The system avoids the jitter and drift phenomena.
- The application of the Gabor wavelet algorithm is very complex and requires software we do not have access to.
Artistic creation
As soon as the Kinect camera was launched by Microsoft for the Xbox in 2010, digital artists started to use its depth sensor to build interactive interfaces. FaceOSC, a facial tracking system, was developed by the artist Kyle McDonald based on the work of Jason Saragih. It is an add-on for openFrameworks. Here is a video of McDonald explaining the original code of Saragih.
Here is an interview with Kyle McDonald explaining the algorithm of FaceOSC.
FaceOSC is based on a deformable model fitting technique: it takes the form of a face and pushes it until it fits a target (a photo or a camera feed, using landmarks). The algorithm uses points on the face to create areas that are then deformed to fit the model.
FaceOSC can be easily connected to digital audio interfaces.
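The connection to audio tools works because the tracker sends its measurements as OSC (Open Sound Control) messages. As a hedged sketch of what is on the wire, the code below encodes and decodes a single-float OSC message with the standard library only, then scales the value to the 0 to 127 range used by MIDI controllers. The address "/gesture/mouth/height" and the input range of 0 to 10 are assumptions about FaceOSC's output that should be checked against its documentation, and the helper names are ours.

```python
import struct


def _pad(raw):
    """Null-terminate and pad to a multiple of 4 bytes, as OSC strings require."""
    raw += b"\x00"
    return raw + b"\x00" * (-len(raw) % 4)


def encode_float_message(address, value):
    """Build an OSC message carrying one 32-bit big-endian float."""
    return _pad(address.encode("ascii")) + _pad(b",f") + struct.pack(">f", value)


def _read_string(data, offset):
    """Read a padded OSC string; return it with the offset of the next field."""
    end = data.index(b"\x00", offset)
    text = data[offset:end].decode("ascii")
    return text, (end + 4) & ~3  # skip terminator and padding


def decode_float_message(data):
    """Return (address, value) for a message with a single float argument."""
    address, offset = _read_string(data, 0)
    tags, offset = _read_string(data, offset)
    if tags != ",f":
        raise ValueError("expected exactly one float argument, got " + tags)
    (value,) = struct.unpack_from(">f", data, offset)
    return address, value


def to_control_value(value, low, high):
    """Scale value from [low, high] to an integer in 0..127, clamped."""
    ratio = (value - low) / (high - low)
    return round(127 * min(1.0, max(0.0, ratio)))


# Hypothetical message: the address and range are assumptions, not FaceOSC facts.
packet = encode_float_message("/gesture/mouth/height", 2.5)
address, mouth_height = decode_float_message(packet)
control = to_control_value(mouth_height, 0.0, 10.0)
```

In practice a ready-made OSC library or the OSC objects built into Processing and Max/MSP would replace the hand-written decoder; it is spelled out here only to show how little data travels between the tracker and the audio software.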
Pros and Cons
- FaceOSC is an application that can be paired with Processing, Max/MSP and Ableton Live.
- Template code is available online.
- The application does not track the details of the face, but the face in general.