Ron Cole, CU
Javier Movellan, UCSD
We envision a new generation of human computer interfaces that engage users in natural face-to-face conversational interaction with intelligent animated characters. These perceptive animated interfaces will incorporate virtual humans that interact with people much as people interact with each other during face-to-face conversation. The interface will use language processing and machine perception technologies to locate, monitor and interpret the user’s speech, facial expressions, gaze, and hand and body gestures. Lifelike computer characters, with personality and attitude, will orient to the user, provide real-time feedback while the user is speaking (through head nods, facial expressions and other behaviors), and interpret the speaker’s auditory and visual behaviors to infer the user’s intentions and cognitive state. The animated agents will produce natural and expressive speech accompanied by contextually appropriate facial expressions and gestures consistent with the agent’s unique personality. We propose work to establish a vital research community to stimulate and enable research and development of perceptive animated interfaces.
Perceptive animated interfaces will be of great value to society, as they will revolutionize learning, training, interpersonal electronic communication, information access and retrieval, and online transactions. The advent of intelligent animated agents will present unprecedented opportunities to engage and empower individuals to learn new skills, communicate more effectively, and increase their participation in the emerging information society. They can help people learn to read or to speak, and can liberate teachers from some routine teaching tasks and help them tailor the learning process to the specific needs of each student.
The invention of virtual humans within perceptive animated interfaces provides a new and exciting task domain for multidisciplinary research, leading to the development of converging technologies that improve human performance. For example, inventing virtual humans and assessing their effectiveness in different tasks (e.g., Web guides, science tutors, job counselors or therapists) will require the integration of new ideas and technologies about the realization of personalities through communication behaviors, and new architectures that can handle real-time interaction between individuals and agents across modalities operating at different time scales. The proposed workshop provides an opportunity for computer scientists, cognitive psychologists, social psychologists, personality researchers and other interested researchers to brainstorm with program managers from the NSF (and perhaps other agencies) about research challenges, infrastructure needs and program models that can focus talent on cross-cutting efforts leading to perceptive animated interfaces.
While perceptive animated interfaces are currently science fiction, tools and technologies exist today that could enable development and deployment of system prototypes in the next few years (Gratch et al., 2002; Cole et al., 2003). Indeed, initial efforts are currently underway to develop first generation systems incorporating perceptive animated agents in the context of intelligent tutoring systems designed to teach children to read and learn from text (Cole et al., 2003). To accelerate progress, it is crucial to establish a community of researchers from all relevant disciplines that will work together to initiate and undertake the many tasks required to make these interfaces a reality. These activities include defining research goals and challenges, sharing knowledge about prevailing theories and methodologies in each discipline, proposing and designing system architectures for inventing and evaluating virtual humans, defining realistic and challenging task domains, building prototype systems as test beds for research, establishing evaluation criteria for measuring progress, and identifying and developing critical infrastructure that enables this work and is accessible and available to all.
A major goal of the proposed workshop is to assemble a team of interested colleagues who will work together to conceptualize and plan these tasks, and undertake some important first steps toward developing the knowledge base, research objectives and infrastructure needed to accelerate scientific research leading to initial systems incorporating virtual humans. The development of perceptive animated interfaces requires collaboration among researchers in many areas—psychologists, linguists, speech scientists, engineers and computer scientists with multidisciplinary expertise in human communication, interface design, speech and language technologies, dialogue modeling and management, computer vision and computer animation. While individual researchers, research labs and existing research communities represent knowledge and skills in each of these areas, no research community exists today that strives to focus the necessary multidisciplinary resources on research and development of perceptive animated interfaces incorporating virtual humans.
We thus propose to begin efforts to establish this community by bringing together researchers to share their knowledge, expertise, tools and technologies and to conceptualize and plan research and development activities leading to perceptive animated interfaces. We will seek researchers with complementary interests and demonstrated expertise in areas of human communication technology, human computer interaction, computer vision and animation, as well as researchers who study human communication, personality, emotions and gestures. The workshop organizers will also consider inviting individuals in other areas who may contribute to the workshop, such as researchers who study empathy and social resonance between therapists and patients.
Building systems that enable face-to-face communication with intelligent animated agents requires a deep understanding of the auditory and visual behaviors that individuals produce and respond to while communicating with each other. Face-to-face conversation is a virtual ballet of auditory and visual behaviors, with the speaker and listener simultaneously producing and reacting to each other’s sounds and movements. While talking, the speaker produces speech annotated by smiles, head nods and other gestures, while the listener provides simultaneous auditory and visual feedback to the speaker (e.g., “I agree,” “I’m puzzled,” “I want to speak.”). The listener may signal the speaker that she desires to speak; the speaker continues to talk, but acknowledges the nonverbal communication by raising his hand and smiling in a “wait just a moment” gesture. Face-to-face conversation is often characterized by such simultaneous auditory and visual exchanges, in which the sounds of our voices, the visible movements of our articulators, direction of gaze, facial expressions and head and body movements present linguistic information, paralinguistic information, emotions and backchannel cues, all at the same time.
Inventing systems that engage users in accurate and graceful face-to-face conversational interaction is a challenging task. The system must simultaneously interpret and produce auditory and visual signals. The system must interpret the user’s auditory and visible speech, eye movements, facial expressions and gestures, since these cues combine to signal the speaker’s intent—e.g., a head nod can clarify reference, while a shift of gaze can indicate that a response is expected. Paralinguistic information is also critical, since the prosodic contour may signal that the user is being sarcastic. The animated agent must also produce accurate, natural, and expressive auditory and visible speech with facial expressions and gestures appropriate to the physical nature of language production, the context of the dialogue, and the goals of the task. Most importantly, the animated interface must combine perception and production to interact conversationally in real time: while the animated agent is speaking, the system must interpret the user’s auditory and visual behaviors to detect agreement, confusion, desire to interrupt, etc., and while the user is speaking, the system must both interpret the user’s speech and simultaneously provide auditory and/or visual feedback via the animated character.
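To make this real-time requirement concrete, the sketch below shows perception and production running as concurrent tasks in an event loop. It is a minimal sketch in Python, assuming hypothetical component objects; none of the names (interpret, backchannel, next_turn, and so on) refer to an existing system’s API.

```python
# A minimal sketch of the duplex interaction loop described above: the
# system interprets the user while the agent speaks, and gives feedback
# while the user speaks. All component names here are hypothetical.
import asyncio

async def perceive(audio, video, agent):
    # Continuously interpret the user's auditory and visual behaviors,
    # even while the agent holds the floor.
    while True:
        cues = interpret(await audio.get(), await video.get())
        if cues.user_is_speaking:
            agent.backchannel(cues)   # e.g., a head nod or "mm-hmm"
        elif cues.wants_to_interrupt:
            agent.acknowledge(cues)   # e.g., a "wait just a moment" gesture

async def produce(agent, dialogue_manager):
    # Concurrently plan and render the agent's own speech and gestures.
    while True:
        turn = await dialogue_manager.next_turn()
        await agent.speak(turn.text, expressions=turn.expressions)

async def run(audio, video, agent, dialogue_manager):
    # Perception and production run simultaneously, never in lockstep.
    await asyncio.gather(perceive(audio, video, agent),
                         produce(agent, dialogue_manager))
```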
Developing such systems requires advances in speech recognition, natural language generation and synthesis, facial animation, recognition of facial expressions and gestures, dialogue interaction, and imparting personalities to computer agents. Realizing these scenarios also requires a deeper understanding of the nature of human communication and human computer interaction. Most importantly, achieving these advances in knowledge and technology requires a community of researchers willing to work in an interdisciplinary manner and to go beyond the boundaries of well-established research communities. Speech researchers, for example, need to go beyond their traditional area of expertise and interact with computer vision researchers, psychologists, and computer animators. The rudiments of such a community are already established, but they are in dramatic need of consolidation.
What is the current state of research and development of virtual humans, and how effective are these perceptive animated agents in improving human computer interaction? A vital and growing multidisciplinary community of scientists worldwide is addressing these questions, and significant efforts are underway to develop and evaluate virtual humans in various application scenarios. To date, researchers have generated powerful conceptual frameworks, architectures and systems for representing and controlling behaviors of animated characters to make them believable, personable and emotional (Allbeck & Badler, 2002; Badler et al., 2002; Cassell et al., 2001; Gratch & Marsella, 2001; Loyall, 1997; Marsella & Gratch, 2001). Gratch et al. (2002) and Johnson et al. (2000) present excellent overviews of the scope of enquiry and the theoretical, cognitive and computational models underlying current research aimed at developing believable virtual humans capable of natural face-to-face conversations with people.
Animated conversational agents have been deployed in a variety of application domains, including information kiosks, literacy tutors, and language training. In pioneering work conducted over the past 10 years at KTH in Stockholm, Joakim Gustafson (2002) and his colleagues developed a series of multimodal dialogue systems of increasing complexity incorporating animated conversational agents: (1) Waxholm, a travel planning system for ferryboats in the Stockholm archipelago (Blomberg et al., 1993; Bertenstam et al., 1995); (2) August, an information system deployed for several months at the Culture Center in Stockholm (Gustafson et al., 1999; Lundeberg & Beskow, 1999), in which the animated character moved its head and eyes to track the movements of persons walking by the exhibit, and produced facial expressions such as listening gestures and thinking gestures during conversational interaction; and (3) AdApt, a mixed-initiative spoken dialogue system incorporating multimodal inputs and outputs, in which users conversed with a virtual real estate agent to locate apartments in Stockholm (Gustafson & Bell, 2002). AdApt produced accurate visible speech, used several facial expressions to signal different cognitive states and turn-taking behaviors, and used direction of gaze to indicate turn taking and to direct the user to a map indicating apartment locations satisfying expressed constraints. These systems produced important insights into the challenges of developing and deploying multimodal spoken dialogue systems incorporating talking heads in public places.
The poor quality of animated conversational agents developed to date is a major stumbling block to progress. Johnson et al. (2000) argue that it is premature to draw conclusions about the effectiveness of animated agents because they are still in their infancy, and “…nearly every major facet of their communicative abilities needs considerable research. For this reason, it is much too early in their development to conduct comprehensive, definitive empirical studies that demonstrate their effectiveness in learning environments. Because their communicative abilities are still very limited compared to what we expect they will be in the near future, the results of such studies will be skewed by the limitations of the technology.” While significant progress has been made in the development and integration of core technologies into virtual humans since Johnson’s article, it is clearly the case that many of the technologies needed to enable virtual humans to behave like people are still in their infancy.
An Emerging Community: A number of research communities are emerging that focus on different aspects of the general problem of face-to-face communication with intelligent animated agents. For example, there are active (and largely independent) groups of researchers investigating audio-visual speech processing, face and gesture recognition, affective computing, and perceptive interfaces. These groups tend to get together via conference workshops or annual meetings that run in collaboration with larger, more established conferences. Examples of such meetings include the International Conference on Face and Gesture Recognition, the Multisensory Research Conference, the NIPS workshop on affective computing, the ICCV workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, the workshop on Lifelike Computer Characters, the AVSP (Audio Visual Speech Processing) workshop, and the UIST annual workshop on Perceptive User Interfaces.
Despite initial efforts by various groups to address areas of research related to the development of virtual humans, the situation today is that researchers in different fields are working more or less independently in separate areas within psychology, linguistics and computer science, studying problems related to human communication, expression of emotions and gestures during communication, spoken dialogue systems, computer vision, computer animation and gesture recognition. Together, these fields provide a significant base of knowledge and methods that are critical to development of virtual humans. It is therefore critically important to hold workshops that bring researchers from these diverse fields together to share knowledge, and to help establish a coherent research community that can formulate a vision and plan for inventing virtual humans. A workshop organized by Co-PI Jonathan Gratch provides an important first step in this direction, and the journal article resulting from the workshop is an excellent starting point for conceptualizing this field.
To summarize, development of virtual humans is still in its infancy. In the past ten years, a small but emerging community of researchers has made great progress toward identifying the scope of multidisciplinary research required and the key research challenges that need to be addressed, and toward offering strong theoretical, conceptual and computational frameworks that provide a foundation for multidisciplinary research among computer scientists, cognitive scientists, psychologists and researchers in other disciplines. Although much innovative research has been conducted, we conclude that experiments investigating the efficacy of animated agents are limited today by constraints imposed by the state of the art of human communication technologies, including speech and language technologies, computer vision, and real time character animation. A grand challenge is to develop new architectures and technologies that will enable experimentation with perceptive animated agents that are more natural, believable and graceful than those available today. To meet this challenge, it is necessary to bring together a community of researchers that can work together to achieve a vision and research agenda for accelerating research and development of virtual humans.
Establishing a vital research community should be guided by prior efforts that produced successful outcomes, and leverage existing infrastructure. In this section, we describe program models and critical infrastructure already in place that can “jump start” research and development efforts in perceptive animated interfaces.
DARPA speech programs, which now span over three decades, provide important insights for establishing research programs and communities. These programs brought together researchers to define challenging tasks and conceptualize systems that could be developed in 3-4 years to achieve targeted levels of performance on the designated tasks. The research community then worked together to identify the infrastructure needed to develop the proposed systems, and developed rigorous evaluation methodologies and metrics to measure and compare the performance of different systems, and to measure progress over time.
One of the key lessons learned from these programs is the critical importance of infrastructure, and the remarkable amount of work required to produce it. In the area of speech recognition (which is just one component of a perceptive animated interface) infrastructure includes annotated speech corpora, pronunciation dictionaries, lexicons, and tools for training and evaluating speech recognition systems. Development of infrastructure in each of these areas represents many thousands of hours of work!
The good news is that significant infrastructure has been developed in recent years that can be used today to enable research and development of perceptive animated interfaces. This infrastructure includes annotated speech corpora, research tools and research systems that have been placed in the public domain for research use. These include the Interactive Book Architecture (a Java-based extension of the Galaxy architecture developed at MIT) that supports natural, mixed-initiative, spoken dialogue interaction with animated characters, Interactive Book authoring tools, and the CU Animate system developed at the University of Colorado. We briefly describe these systems to show that powerful tools that can be shared with other researchers are now available to support research and development of perceptive animated interfaces.
Interactive Book Architecture & Runtime Environment. Under NSF ITR and IERI grants, researchers at CSLR have developed an authoring and runtime environment for developing Interactive Books that incorporate perceptive animated agents. The animated agents interact with students to help them learn to read and understand what they read. Interactive books enable students to converse with animated characters, read aloud with immediate feedback on the pronunciation of each spoken word, click on words to have them pronounced or defined, and interact with media objects (text, objects in illustrations, etc.) using speech, typing or mouse clicks. The interactive book is displayed on the student’s client machine, but interaction with animated characters and other media objects occurs via communication with and among technology servers in the Interactive Book architecture. These technology servers include:
Audio Server – Receives signals from microphone or telephone and sends them to the recognizer. Also sends synthesized speech to PC speakers or telephone.
Sonic Speech Recognizer – Takes signals from an audio server and produces a word lattice. Developed by Bryan Pellom at CSLR, it will be extended as part of the proposed work.
Phoenix Natural Language Parser – Takes word lattice from recognizer and produces the “best” interpretation of the recognized utterance.
Confidence Server – Takes the hypothesis and semantic parse from the speech recognizer and parser as input, and annotates the words and concepts with levels of confidence.
Dialogue Manager – Resolves ambiguities; estimates confidence in the extracted information; clarifies with the user if required; integrates with the current dialogue context; builds database queries; sends data to NL generation for presentation to the user; prompts the user for information.
Database / Backend Server – Receives SQL queries from the Dialogue Manager; interfaces to the SQL database; retrieves data from the web to enable learning tools to access online information.
Natural Language (NL) Generator – Constructs strings of words to speak back to the user based on the current dialogue action.
Text-to-Speech (TTS) Synthesizer – Receives word strings from NL generation and synthesizes them to be sent to the audio server.
CU Animate Character Animation Server – Receives a string of symbols (phonemes, animation control commands) with start and end times from the TTS server, and produces visible speech, facial expressions and other gestures in synchrony with the speech waveform. (Descriptions of each of these modules, and publications providing more detail, are available on the CSLR Web site at http://cslr.colorado.edu.)
MPL Face Tracker – Tracks faces in real time (at 30 frames per second) under arbitrary illumination conditions and backgrounds (which may include moving objects). The face detector communicates the location of the user’s face to the animation server, which, by triangulating between the user, camera and animated agent, allows the animated agent’s eyes to track the user.
MPL Emotion Monitor – A prototype system that classifies facial expressions into seven emotion categories: neutral, angry, happy, disgusted, fearful, sad, and surprised. The system will be integrated into interactive books in the near future.
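To make the data flow among these servers concrete, the sketch below traces one spoken user turn through the pipeline. The object and method names are invented for illustration and are not the actual Galaxy or Interactive Book API.

```python
# A hypothetical end-to-end trace of one user turn through the technology
# servers listed above; every name here is illustrative, not a real API.
def process_user_turn(audio_input, servers):
    """Route one spoken user turn through the server pipeline."""
    signal = servers.audio.capture(audio_input)           # Audio Server
    lattice = servers.sonic.recognize(signal)             # recognizer -> word lattice
    parse = servers.phoenix.parse(lattice)                # parser -> semantic parse
    scored = servers.confidence.annotate(lattice, parse)  # per-word/concept confidence
    action = servers.dialogue.update(scored)              # Dialogue Manager decides next move
    if action.needs_backend:
        action.data = servers.backend.query(action.sql)   # Database / Backend Server
    reply = servers.nl_generator.realize(action)          # NL Generator -> word string
    speech = servers.tts.synthesize(reply)                # TTS -> waveform + timed phonemes
    servers.audio.play(speech.waveform)                   # audio out
    servers.cu_animate.render(speech.timing)              # visible speech in sync
```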
Interactive Book Authoring Tools. Interactive books provide a test bed for research, development and evaluation of perceptive animated interfaces with virtual humans. (They are currently being used to teach children to read in schools in Boulder, Colorado.) To facilitate application development, authoring tools have been developed to enable designers to create interactive learning experiences. Designers can create content by typing in text, or by scanning text and illustrations. Once text and illustrations have been input, designers can orchestrate interactions between students, animated characters and various media objects using any of the technology servers in the Interactive Book architecture. Developers can cause animated characters to narrate their parts in a story using synthetic or naturally recorded speech, mark up text to control the character’s facial expressions and gestures while speaking, and design interactive spoken dialogues with characters to talk about the story or to test comprehension.
CU Animate. CU Animate is a toolkit designed for research, development and control of 3D animated characters for use in perceptive animated interfaces. Nine characters developed in 3D Studio Max have been imported into CU Animate. Each character was designed with a full body and fully articulated skeletal structure, with sufficient polygon resolution to produce natural animation in regions where precise movements are required, such as the lips, tongue and finger joints. CU Animate provides real time rendering of animated characters by controlling parameters and/or morphing between target states. In addition to providing a public domain platform for research, CU Animate provides a graphical user interface for designing arbitrary animation sequences. These sequences can then be tagged (or iconified) and inserted into text strings, so that characters will say the text while producing the desired emotions and gestures.
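To illustrate the tagging idea, the sketch below splits a marked-up utterance into (gesture, text) segments that an animation engine could render in sequence. The bracket syntax and gesture names are assumptions invented for this example; CU Animate’s actual markup may differ.

```python
# Split a tagged utterance into (gesture, text) segments; the [tag] syntax
# is a hypothetical stand-in for CU Animate's markup.
import re

TAG = re.compile(r"\[(\w+)\]")  # e.g., [smile], [head_nod]

def split_tagged_text(tagged):
    """Return a list of (gesture, text) segments in speaking order."""
    segments, gesture, pos = [], None, 0
    for match in TAG.finditer(tagged):
        text = tagged[pos:match.start()].strip()
        if text:
            segments.append((gesture, text))
        gesture, pos = match.group(1), match.end()
    tail = tagged[pos:].strip()
    if tail:
        segments.append((gesture, tail))
    return segments

print(split_tagged_text("[smile] Nice work! [head_nod] Let's read the next page."))
# [('smile', 'Nice work!'), ('head_nod', "Let's read the next page.")]
```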
UCSD Head Tracking System. A head tracking system has been developed by Co-PI Javier Movellan at UCSD and integrated into the Galaxy architecture for distribution with CSLR toolkits. The head tracker, which accurately locates and tracks the user’s face in real time, represents an important first step toward integration of state of the art computer vision technology and computer animated interfaces. Once the location of the user’s face is known, further research can be undertaken to interpret visible speech, etc., and research advances can be integrated readily into working systems. The system has been placed in the public domain.
The main goals of the workshop are (a) to understand prior work and the research challenges involved in developing perceptive animated interfaces and virtual humans; (b) to determine the practical steps and activities required; and (c) to initiate a set of activities that will help establish a vital community of researchers who will work together to accelerate progress.
The proposed workshop will be held in Boulder, Colorado. It will be hosted by Ron Cole at CSLR at a venue in Boulder, perhaps the Boulderado Hotel. The workshop organizers will form an organizing committee, and work with this committee to develop a detailed agenda and to identify the key issues that will form the topics for breakout sessions.
The workshop will address the following questions:
• What is the state of scientific knowledge about perception, production and interpretation of auditory and visual behaviors during face-to-face communication? How are these behaviors influenced by task domain, social influences, and other variables? What knowledge can be applied immediately to the design of perceptive animated interfaces? What scientific knowledge is missing, and what research is required to gain this knowledge?
• What are the capabilities and limitations of technologies and methodologies currently used in research, development and evaluation of advanced dialogue systems? What sorts of perceptive animated interfaces can these technologies support today? What key research breakthroughs are needed in speech and language technologies, what is required to achieve these breakthroughs, and how will these breakthroughs translate into more effective perceptive animated interfaces?
• What is the state of the art of computer vision technology relative to monitoring and interpreting visual behaviors to enable face-to-face communication with an animated character? What is the missing science? What key research breakthroughs are needed to enable perceptive interfaces? What research tools, corpora, and systems are currently available to enable research and development efforts? What new infrastructure is needed to conduct research? What effort and cost is required to develop this infrastructure?
• What is the state of the art of animation technology? What research and development activities are needed to produce natural and contextually appropriate facial expressions, eye movements, and hand and body movements in different tasks? What infrastructure is required to achieve key research breakthroughs?
• What architectures have been proposed or implemented to support real time dialogue interaction between users and virtual humans? Do proposed system architectures and task domains enable researchers to achieve and measure key research objectives? How do we measure and evaluate the performance of these systems and system modules? How do we compare different systems? How do we measure progress over time?
• What systems could and should be developed to serve as test beds for research and development of perceptive animated interfaces? What task domain(s) should be selected?
• What resources—annotated corpora, research tools, etc.—are needed to study relevant communication behaviors between people or between people and machines, and to enable researchers to train and evaluate machine perception and generation algorithms? (In the appendix below, we explain the critical importance of developing corpora to enable research in perceptive animated interfaces, and provide examples of how development of corpora has accelerated progress in science.)
• What standards are required to assure interoperability of system components and real time interaction over communication channels?
• What metrics and methodologies are required to evaluate and compare systems and system components, and to measure progress within and between research and development sites over time?
• What concrete steps should the research community take to stimulate and sustain research, and to create a strong and enduring community that will realize the vision of perceptive animated interfaces?
A major goal of the workshop is to understand (and hopefully plan some of) the activities required to provide answers to the above questions. To this end, the workshop organizers will work with invitees before the workshop to develop literature reviews and position papers related to key issues, and then to modify these position papers based on the results of the workshop. The workshop organizers will also work with the invitees to develop an agenda for the workshop that consists of focused breakout groups and other activities designed to optimize the outcomes of the workshop, which will be measured in terms of concrete proposals and plans to conduct activities that address the above questions. The workshop will produce a set of recommendations to the NSF on programs or initiatives that could be undertaken to support the activities required to establish a vital community leading to perceptive animated interfaces and virtual humans. Given the new knowledge, technologies and systems, and the positive impact of these systems on society, we believe this is an important and time-critical task.
Although we have asked for funding for a single workshop, the authors of this proposal believe that the objective of establishing a vital research community quickly—one that can develop critical infrastructure and initiate research and development efforts leading to systems that can be evaluated on common tasks—will be better served through a series of workshops. A single workshop can bring scientists together to conceptualize, plan and recommend future steps toward perceptive animated interfaces. A series of workshops could establish a field of research and produce concrete research results, and even initial systems, through collaborative activities. For example, audio and visual recordings of children’s speech and emotions (described in the appendix below) could be annotated and shared by the community for research, development and comparison of children’s speech recognition, face tracking and emotion classification algorithms. Research tools and systems (e.g., UCSD MPL’s face tracking system; CSLR’s Communicator system and animation toolkit) could be shared, and research collaborations initiated.
Outcomes of the proposed work will include published reports of the workshop reflecting the activities, plans and recommendations of the participants, and a final report summarizing the accomplishments and recommendations of the project. In addition, we expect to identify critical infrastructure needs, and to design and develop annotated corpora that can be used to enable research on auditory and visual recognition and synthesis. Finally, we expect to formulate and initiate a plan to conduct collaborative work at multiple sites to develop some initial systems that will serve as test beds for research and development of perceptive animated interfaces.
Success in the proposed work will be realized by the existence of a dedicated community of researchers who meet regularly, demonstrate new systems incorporating research breakthroughs, work with companies to develop products that benefit society, and share advances in science and technology both within and outside the research community.
Ron Cole, Javier Movellan and Jonathan Gratch will organize the workshop and take primary responsibility for authoring the final report. We will also work together to organize smaller satellite meetings at conferences that bring researchers with complementary interests together, and to organize special sessions at conferences to describe workshop recommendations and to engage other researchers. The authors of this proposal have organized over twenty different workshops, including several workshops funded by the NSF focused on developing new research agendas and programs.
Ms. Terry Durham will manage the organization and logistics of the workshop. Ms. Durham has over ten years of experience organizing and running workshops, initially as Center Administrator at the Center for Spoken Language Understanding at OGI, and more recently as Center Administrator at CSLR.
Both during and after the workshop, the PI and co-PIs will work tirelessly to promote and facilitate collaborative research projects among members of the emerging perceptive animated interface community, and enlist colleagues to participate in development of critical infrastructure. To this end, both CSLR and UCSD will work hard to share expertise, tools, technologies and systems with other researchers.
We will work hard to identify university researchers in the U.S., and one or two researchers from the EU (e.g., Bjorn Granstrom from KTH), who are leaders in areas related to research and development of perceptive animated interfaces and virtual humans. Leading researchers from the commercial sector may be invited at their own expense. We will work to achieve a good balance between researchers who study human communication, including individuals who study expression and generation of emotions and gestures in different areas (e.g., there is a field of research that studies social resonance and empathy in therapeutic settings), and individuals who research and develop multimodal systems and component spoken dialogue, computer vision and computer animation technologies. Individuals invited to the workshop will be selected from the areas of communication, cognitive science, computer science and engineering, counseling, linguistics, psychology and speech science.
The speech recognition community provides a good model for these activities, as it has been successful at developing and establishing a large number of publicly available speech corpora for a large number of tasks, including media broadcasts, telephone conversations (in many languages) and spoken dialogue systems. The importance of speech corpora is underscored by the efforts of NSA, DARPA and NSF in establishing the Linguistic Data Consortium, and an NSF-DARPA initiative to develop language resources. Progress in speech recognition and understanding research over the years is tied directly to the development of language resources resulting from these and other initiatives. (We note that PI Cole, while director of the Center for Spoken Language Understanding at OGI, planned and supervised development of twenty speech corpora that were made available to university researchers free of charge.)
In those few cases in which effort has been devoted to developing corpora for computer vision research, the results have been dramatic. One example is the work of Jonathon Phillips at NIST on the FERET database of static images for testing and evaluating face recognition systems. Because of the FERET effort, companies have been founded and commercial products developed for computer vision-based person identification.
To enable development of perceptive animated interfaces, similar efforts must be undertaken to develop digital video databases of spontaneous facial behavior. Corpus development efforts in this area typically involve terabytes of data, at a cost which until very recently was prohibitive for the average research laboratory. For this reason, the application of machine perception research to recognition of video sequences is at an impasse. An initial stage of proof-of-concept developments in the late eighties and early nineties has been followed by slow progress due to the lack of realistic databases to train and evaluate different approaches.
For example, Eric Petajan’s dissertation in 1984 pioneered machine perception work on audio-visual speech recognition (combining information from the acoustic signal and movements of the lips and lower face region). Following this work, a variety of approaches were developed in the early nineties by different research labs, establishing the seeds for an emerging community (Mase & Pentland, 1989; Sejnowski et al., 1989; Stork et al., 1992; Bregler et al., 1993; Movellan, 1994; Luettin et al., 1997). Such systems were promising but, due to the lack of standard databases, research never progressed beyond the proof of concept stage.
Work on automatic expression recognition from video followed a similar path. Early systems developed in the nineties established proof of concept using small, locally developed databases (Mase, 1991; Yacoob & Davis, 1994; Rosenblum, Yacoob, & Davis, 1996; Essa & Pentland, 1997; Terzopoulos & Waters, 1993; Li, Roivainen, & Forchheimer, 1993; Cottrell & Metcalfe, 1991; Bartlett et al., 2000). While these systems demonstrated that automatic recognition of expressions is feasible, progress beyond these initial demonstrations has been slow due to the lack of realistic standard databases. In the absence of such databases it is almost impossible to compare different systems and to evaluate progress toward realistic applications.
In order to make serious progress on machine perception applied to video sequences of the human face, and to produce accurate and graceful animation of conversational behaviors by animated characters, it is crucial to develop corpora of spontaneous facial behavior. Collaboration between machine perception experts and behavioral scientists is needed to choose recording situations that produce rich and interesting spontaneous behaviors in a variety of tasks.
For both video and motion capture data, we will follow two independent paths. First, we will establish a special interest group within the emerging perceptive interface community to conceptualize and propose a set of corpora and associated corpus development activities for video data capture and motion capture. Activities and issues that will be addressed include identifying the behaviors to be measured, selecting tasks and designing protocols to elicit these behaviors, selecting subjects, placement of cameras and motion sensors, transcribing data, representing and analyzing data, and estimating the efforts and costs associated with these activities. Second, we will design, collect, transcribe and analyze data to create some initial corpora. We believe it is crucial to conduct actual video and motion capture corpus development efforts so we can understand firsthand the issues and costs involved for these and future efforts. Moreover, these efforts will produce useful public domain corpora for computer vision and animation research.
Video Data. As part of our collaborative efforts to develop perceptive animated interfaces to help children learn to read, Cole and Movellan are collecting audio and digital video data from 1000 children in first through fifth grade. The protocol for this corpus contains prompted speech, read speech, and spontaneous speech from children in kindergarten through grade 7. A variety of sub-tasks are included, such as individual phonetic sounds, isolated letters and numbers, isolated words, common human-computer related commands, words related to mathematics and time, and grade-specific sentences for the prompted-speech section. Sections of the protocol are also designed to elicit a range of expressions including concentration, confusion and stress. This corpus, which is huge by current computer vision standards, will serve as an ideal test bed to help establish transcription procedures and conventions, and for training and evaluating facial recognition systems for head tracking, speech reading and facial expression recognition. The corpus will be offered to the research community as a possible starting point to develop transcription conventions and to train and compare face recognition and visible speech recognition systems.
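For concreteness, one annotation record in such a corpus might look like the sketch below. Every field name and value is hypothetical, invented to illustrate the kind of metadata involved; it does not reflect the project’s actual transcription conventions.

```python
# A hypothetical annotation record for one recorded prompt.
record = {
    "speaker_id": "child_0042",        # anonymized child identifier
    "grade": 2,
    "task": "prompted_speech",
    "prompt_type": "isolated_word",
    "transcript": "elephant",
    "audio_file": "child_0042_s017.wav",
    "video_file": "child_0042_s017.avi",
    # time-aligned expression labels: (label, start_seconds, end_seconds)
    "expression_labels": [("confusion", 3.20, 4.85)],
}
```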
Motion Capture. Motion capture techniques are commonly used today to create 3D animation sequences in video games and motion pictures. Sensors are placed on the body, and X, Y, Z coordinates are recorded while actors produce specific movement sequences. The sequence of coordinates can then be mapped directly to corresponding points of 3D models to produce animated sequences that are extremely close to the original human behaviors. Motion capture has also been applied to animation of speech movements and facial expressions with great accuracy. See, for example, the video clips at www.pyros.com of 3D characters producing speech, created using motion capture.
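A minimal sketch of the retargeting step described above: recorded sensor coordinates are mapped, frame by frame, onto corresponding control points of a 3D model. The marker names, offsets and values are assumptions made for illustration, not a real motion capture pipeline.

```python
# One captured frame: marker name -> (x, y, z) position (hypothetical data).
capture_frame = {
    "lip_corner_left": (-2.3, -1.1, 9.7),
    "lip_corner_right": (2.2, -1.0, 9.8),
    "jaw_tip": (0.0, -4.6, 8.9),
}

# Calibration measured in a neutral pose: each marker drives one model
# control point via a fixed offset.
marker_to_vertex = {
    "lip_corner_left": ("mouth_L", (0.1, 0.0, -0.2)),
    "lip_corner_right": ("mouth_R", (-0.1, 0.0, -0.2)),
    "jaw_tip": ("chin", (0.0, 0.2, -0.1)),
}

def retarget(frame, mapping):
    """Map one frame of marker positions to model control-point positions."""
    pose = {}
    for marker, (x, y, z) in frame.items():
        vertex, (dx, dy, dz) = mapping[marker]
        pose[vertex] = (x + dx, y + dy, z + dz)
    return pose

print(retarget(capture_frame, marker_to_vertex))
```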
A breakout group at the workshop will address issues related to describing tasks and procedures for collecting and analyzing motion capture data.