Several immersive projection displays have been developed around the world. By connecting these displays to a broadband network, users can communicate with each other with a high quality of presence. In this study, video avatar technology was developed to realize natural communications in networked immersive projection environments. This method represents the user's avatar using a live video image in real time. Using this method, several communication experiments -- such as a presentation, improvised acting, and simultaneous mutual communication -- were conducted. From these results, natural communications in the networked immersive projection displays were realized, and the effectiveness of the video avatar communications was evaluated.
Recently, immersive projection technologies, such as the CAVE system developed at the University of Illinois, have become very popular for virtual reality displays. This type of display generates a high quality of immersion by projecting high-resolution stereo images onto large screens that surround the viewers. For example, the five-screen immersive projection display system CABIN was developed at the University of Tokyo, and the fully immersive display COSMOS, which uses six screens, was developed at the Gifu Technoplaza. These displays were originally conceived as visualization environments for scientific data or design models. However, they can also be used as a communication environment when connected to a network. The MVL (Multimedia Virtual Laboratory) Research Center was founded at the University of Tokyo and the Gifu Technoplaza, and the CABIN and COSMOS systems were connected to each other through the Japan Gigabit Network (see Figure 1).
Figure 1 Network Environment between CABIN and COSMOS
In the networked immersive virtual environment, users can share a virtual world with a high-quality sense of "presence." To achieve natural communications in the shared virtual world, it is also necessary to transmit high-presence images of the user. A video conference system, in which remote users talk to each other face to face, can be used for communication. However, with this system, users cannot share the three-dimensional virtual world, because it can only transmit a two-dimensional video image. On the other hand, in a distributed virtual world constructed on the network, a communication method using an avatar is often used. However, it is difficult to transmit facial expressions or emotions using the avatar, because it is usually a computer graphics image.
In this study, a video avatar was developed by integrating the concepts of the video conference system and avatar technology. This method generates the user's avatar in the three-dimensional shared virtual world by using a video image in real-time. Users can communicate naturally with each other with a high sense of presence in a networked immersive virtual environment. This paper describes the method of making the video avatar, several communication experiments conducted using the video avatar, and the overall effectiveness of this method.
In order to realize natural communications in the networked immersive environment, video avatar technology was developed. The basic process of making a video avatar is as follows.
First, a video camera is placed in front of the user in the immersive projection display to record the user's image. The image taken by the video camera is captured by the graphics workstation, and the user's figure is segmented from the background by comparing the captured image and the background image without the user. In particular, when a blue backdrop is used, the chroma-key technique can be used to segment a clear image of the user. Figure 2 shows an example of the segmentation process for isolating the user's image.
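The two segmentation approaches described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the image representation (nested lists of RGB tuples), the function names, and the threshold values `threshold` and `blue_margin` are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0..255 (assumed representation)
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[bool]]        # True marks a pixel that belongs to the user

def mask_by_background(frame: Image, background: Image, threshold: int = 30) -> Mask:
    """True where the frame differs from the background image taken without the user."""
    return [
        [max(abs(f - b) for f, b in zip(fp, bp)) > threshold
         for fp, bp in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

def mask_by_chroma_key(frame: Image, blue_margin: int = 40) -> Mask:
    """True where blue does not dominate red and green, i.e. the pixel is not backdrop."""
    return [[(b - max(r, g)) <= blue_margin for (r, g, b) in row] for row in frame]

def cut_out(frame: Image, mask: Mask) -> List[List[Tuple[int, int, int, int]]]:
    """RGBA image: user pixels are opaque, background pixels are fully transparent."""
    return [
        [(r, g, b, 255 if keep else 0) for (r, g, b), keep in zip(row, mrow)]
        for row, mrow in zip(frame, mask)
    ]
```

Background comparison works with any static backdrop but is sensitive to lighting changes, whereas the chroma-key test needs only the current frame, which is consistent with the clearer segmentation reported for the blue backdrop.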
Figure 2 Segmentation Process of Video Avatar
Since an electromagnetic sensor tracks the user's position in the immersive projection display, the segmented user image can be transmitted to the other site together with positional data. Then, at the opposite site, the user's video image is superimposed onto the corresponding three-dimensional position in the shared virtual world. In this way, the positional relationship between users in the shared virtual world can be represented. This video image is rendered as a video texture in real time. Thus, the video avatar is generated. By making a video avatar at both sites and transmitting the images to each other, the remote users can communicate mutually in the shared virtual world using their own avatars.
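As an illustration of placing the plane video avatar at the tracked position, the following Python sketch computes the corners of a textured quad. Several details are assumptions not stated in the text: the quad is centred on the tracked position, it is turned about the vertical axis to face the viewer, the coordinate system is right-handed with y pointing up, and the function name `billboard_corners` is hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def billboard_corners(position: Vec3, viewer: Vec3,
                      width: float, height: float) -> List[Vec3]:
    """Corners of a vertical quad centred on `position` and facing `viewer`.

    The order is bottom-left, bottom-right, top-right, top-left as seen
    by the viewer, so the video texture can be mapped directly onto it.
    """
    px, py, pz = position
    dx, dz = viewer[0] - px, viewer[2] - pz   # horizontal direction to viewer
    norm = math.hypot(dx, dz)
    if norm == 0.0:                           # viewer directly above or below
        dx, dz, norm = 0.0, 1.0, 1.0
    # Horizontal unit vector pointing to the viewer's right.
    rx, rz = dz / norm, -dx / norm
    hw, hh = width / 2.0, height / 2.0
    return [
        (px - rx * hw, py - hh, pz - rz * hw),
        (px + rx * hw, py - hh, pz + rz * hw),
        (px + rx * hw, py + hh, pz + rz * hw),
        (px - rx * hw, py + hh, pz - rz * hw),
    ]
```

At the receiving site, the position received with each video frame would be passed as `position`, and the local user's tracked head position as `viewer`.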
In the above-mentioned method, though the video avatar can be placed at the correct three-dimensional position, it cannot represent three-dimensional body motion because it is a two-dimensional plane image. For example, when the user points at an object in the shared virtual world, accurate information about the fingertip position cannot be transmitted to the other users. In order to represent three-dimensional body movements, it is necessary to make the video avatar using a three-dimensional model.
Therefore, in this study, a stereo camera was used to generate a video avatar that incorporates a three-dimensional shape model. With a stereo camera, the distance from the camera position to the object can be measured by triangulation. For the stereo camera, a Triclops Stereo Vision System supplied by Point Grey Research Inc. was used. Since this camera consists of two pairs of stereo camera modules along the vertical and horizontal baselines, it can measure distance relatively accurately. The resolution of the measured depth value was about 5 cm. By calculating the distance for all the pixels of the captured image, a depth image can be generated. Figure 3 shows an example of a captured image and the generated depth image.
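The triangulation step can be sketched with the standard relation for a single rectified camera pair, Z = f B / d, where f is the focal length in pixels, B the baseline, and d the disparity. This is a generic illustration rather than the internal method of the Triclops system, and the function and parameter names are assumptions.

```python
from typing import Optional

def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> Optional[float]:
    """Depth in metres from disparity, assuming a rectified pinhole stereo pair.

    Returns None when no valid match was found (disparity <= 0).
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px

def depth_step(depth_m: float, focal_px: float, baseline_m: float,
               disparity_step_px: float = 1.0) -> float:
    """Approximate depth change caused by one disparity step at a given depth.

    Follows from differentiating Z = f * B / d: |dZ| = Z**2 / (f * B) * |dd|.
    """
    return depth_m ** 2 * disparity_step_px / (focal_px * baseline_m)
```

The second function shows why depth resolution degrades with the square of the distance, so a figure such as the 5 cm reported above holds only over a limited working range. The sketch does not attempt to reproduce that figure, since the camera parameters are not given in the text.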
Figure 3 Depth Image Generated by Stereo Camera
Then, from the depth image, the three-dimensional position of each pixel is calculated in the virtual world coordinate system. By connecting each pixel with triangular meshes, a surface model can be generated. Thus, a 2.5-dimensional video avatar, which has a surface model for the front side, is generated by texture mapping the segmented user's image onto the surface model. In this system, the resolution of the depth image was 160 by 120, and the refresh rate of generating the 2.5-dimensional video avatar was about 10 Hz, using a Pentium III 700 MHz PC.
Though the 2.5-dimensional video avatar makes use of a surface model for the front side, it has no shape on its other sides. Figure 4 shows the appearance of the 2.5-dimensional video avatar as seen from various directions. When the user sees the video avatar from a direction close to that of the video camera used to capture the user's image, it is rendered with good image quality. However, when the user's viewpoint moves away from the camera position, the avatar's image becomes distorted. From the example shown in Figure 4, we can see that when the viewing direction is away from the camera position by more than thirty degrees, the avatar's image appears unnatural because of distortion.
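The construction of the surface model from the depth image can be sketched as follows. The sketch works in camera coordinates under a pinhole model; a rigid transform into the virtual world coordinate system would follow and is omitted. The intrinsic parameters `focal_px`, `cx`, `cy`, the `max_jump` threshold for rejecting cells that span a depth discontinuity, and the use of `None` for pixels without depth are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Depth = List[List[Optional[float]]]   # depth in metres; None where unknown
Index = Tuple[int, int]               # (row, column) of a depth pixel

def back_project(depth: Depth, focal_px: float, cx: float, cy: float):
    """Map each valid pixel (row v, column u) to a 3D point (X, Y, Z)."""
    return [
        [None if z is None else
         ((u - cx) * z / focal_px, (v - cy) * z / focal_px, z)
         for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]

def triangulate_grid(depth: Depth, max_jump: float = 0.15
                     ) -> List[Tuple[Index, Index, Index]]:
    """Connect neighbouring pixels with two triangles per 2x2 cell.

    A cell is skipped if any corner has no depth or if the depth range
    inside the cell exceeds max_jump (likely a silhouette edge).
    """
    triangles = []
    for v in range(len(depth) - 1):
        for u in range(len(depth[0]) - 1):
            a, b, c, d = (v, u), (v, u + 1), (v + 1, u), (v + 1, u + 1)
            zs = [depth[r][col] for r, col in (a, b, c, d)]
            if any(z is None for z in zs) or max(zs) - min(zs) > max_jump:
                continue
            triangles.append((a, c, b))
            triangles.append((b, c, d))
    return triangles
```

Because each vertex corresponds to one pixel, its texture coordinate is simply the pixel's normalized image position. For a 160 by 120 depth image, the grid yields at most 159 × 119 × 2 = 37,842 triangles per frame.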
Figure 4 Appearance of a 2.5-Dimensional Video Avatar
Therefore, in this study, several cameras were placed in front of the user, and the camera nearest to the other user's viewpoint was selected and used. Though at any instant the generated video avatar uses only the surface model for the front side, this method can represent the total three-dimensional shape by switching between several cameras according to the other user's viewpoint. In the field of computer vision research, a method of constructing a complete three-dimensional structure using multiple cameras has been developed. However, that method cannot be used in real time because of the large amount of calculation necessary. The important feature of the method used in this study is that it constructs a stereo model in real time by dividing the cameras into several pairs of stereo camera units.
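One plausible way to select the nearest camera is to compare viewing directions, as in the Python sketch below. The selection criterion (smallest angle between the direction from the captured user to each camera and the direction from that user to the remote viewer), the function names, and the assumption that all positions are expressed in one common coordinate frame are illustrative and are not taken from the system described here.

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]

def angle_deg(a: Vec3, b: Vec3) -> float:
    """Angle between two non-zero vectors, in degrees."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("direction is undefined for a zero-length vector")
    cos_angle = sum(x * y for x, y in zip(a, b)) / (na * nb)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def select_camera(user: Vec3, viewer: Vec3,
                  cameras: Dict[str, Vec3]) -> Tuple[str, float]:
    """Return the camera best aligned with the viewer, and its off-axis angle.

    `user` is the captured user's position, `viewer` is the remote user's
    viewpoint mapped into the same frame, and `cameras` maps hypothetical
    camera names to camera positions.
    """
    view_dir = tuple(v - u for v, u in zip(viewer, user))

    def off_axis(name: str) -> float:
        cam_dir = tuple(c - u for c, u in zip(cameras[name], user))
        return angle_deg(view_dir, cam_dir)

    best = min(cameras, key=off_axis)
    return best, off_axis(best)
```

The returned angle can be compared with the roughly thirty-degree limit observed in Figure 4 to judge whether the selected camera is close enough to the viewing direction to give an undistorted avatar.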
Figure 5 illustrates the concept of representing the 2.5-dimensional video avatar by switching several stereo cameras. In the actual system, these cameras were placed in the corners of the display space, and the selected camera was switched according to the positional relationship between users in the shared virtual world. In this system, by transmitting the stereo video avatar mutually between the networked immersive projection displays, natural communication using the three-dimensional information can be realized.
Figure 5 Representation of Stereo Video Avatar by Camera Switching
Several communication experiments were conducted using the video avatar technology in networked virtual environments. First, the video avatar was used for a presentation of the virtual world. This experiment was performed at the opening ceremony of the MVL Research Center. The immersive projection display COSMOS at the Gifu Technoplaza and the "blue-back" studio at the University of Tokyo were connected through the satellite communication network. The bandwidth of the network was 6 Mbps, and the NTSC video image and the voice signal were transmitted simultaneously through the MPEG2 encoder and decoder.
At the University of Tokyo, a video camera was placed in the blue-back studio to film the user. This image was transmitted to the COSMOS site, and the plane video avatar was generated using the chroma-key method. This video avatar was superimposed on the virtual world displayed in the COSMOS, so that the user in the COSMOS was able to communicate face-to-face with the video avatar in the three-dimensional virtual world.
In this experiment, since the virtual world was simulated only at the COSMOS site, the video avatar was transmitted only from the Tokyo blue-back studio to the COSMOS. At the COSMOS site, a video camera was placed at the entrance of the display space, and the users communicating with the video avatar were filmed from behind. This video image was sent to the University of Tokyo, and the user in the blue-back studio was able to perform while looking at his own likeness.
Figure 6 shows the construction of the communication system used in this experiment, and Figure 7 shows the appearance of the users' communication at the COSMOS site. In this experiment, molecular simulation data were visualized in the COSMOS, and the researcher at the University of Tokyo explained the simulation method and the calculation results by entering the virtual world as the video avatar.
Figure 6 System Construction for Presentation Experiment
Figure 7 Presentation Using Video Avatar
In this system, though the transmission of the video avatar was one-way, the users in COSMOS felt a high degree of presence and were able to communicate with the video avatar quite naturally. This experiment suggests that this type of communication is effective in fields of application in which the information flow is mostly one-way, such as presentations or education.
Next, the video avatar was used to create an improvised acting scenario between remote users. This experiment was performed at the VR Culture Forum held in Yakushima. The conference hall in Yakushima and the immersive projection display CABIN at the University of Tokyo were connected via two 128 kbps ISDN lines. One line was used to transmit the video image, and the other line was used for telephone voice transmission and the computer data. For the transmission of the video image, an NTT Phoenix video conference system was used.
In the conference hall, the dancer played out her part on the stage without a stage set. Her performance was filmed by a video camera and the video image was transmitted to the CABIN at the University of Tokyo. Next, a plane video avatar was generated and superimposed on the virtual world displayed in the CABIN. The user in the CABIN was able to communicate with the dancer's video avatar in the three-dimensional virtual world, and created improvised acting by using body actions such as waving a hand.
The video camera was also placed at the entrance of the CABIN and the user's acting was filmed from behind. This video image was sent back to the conference hall and projected onto the large screen. The dancer on the stage was able to correct her actions and perform her part while looking at the projected scene. Thus, the audience in the conference hall saw improvised acting created by remote users.
Figure 8 shows the system construction used in this experiment, and Figure 9 shows an example of the scenes created. In this experiment, though it was difficult for the dancer in the conference hall to act while looking at her own image, the user in the CABIN was easily able to stage a performance by communicating with the video avatar in the virtual world. In addition, since the bandwidth of the network was insufficient, there was some time delay in the response of the avatar's action. Nevertheless, this experiment demonstrated the possibility of an interesting area of application for the video avatar.
Figure 8 System Construction for Improvised Acting Experiment
Figure 9 Improvised Acting with Video Avatar
Finally, an experiment in mutual communication using a 2.5-dimensional video avatar was conducted between the CABIN at the University of Tokyo and the COSMOS at the Gifu Technoplaza. In this experiment, SGI ONYX2 graphics workstations were used to generate computer graphics images for the CABIN and the COSMOS. These were connected through the 155 Mbps ATM of the Japan Gigabit Network, and the computer graphics data were transmitted between the sites. In addition, an MPEG2 encoder and decoder were also connected to the network, and the users' voices were transmitted using MPEG2 coding.
Figure 10 shows the system construction used in this experiment. In the CABIN and the COSMOS, two stereo video cameras were placed in the corners of the display spaces, and a blue sheet was hung on the back screen. Then the users in the CABIN and the COSMOS were filmed from the front. These images were immediately captured by the graphics workstations and the users' figures were segmented from the background using the chroma-key technique. The data of the generated 2.5-dimensional video avatar were then transmitted mutually to the other site and superimposed on the shared virtual world. Users in the CABIN and the COSMOS were able to communicate with each other while simultaneously looking at the other user's image using the video avatar. In this experiment, the bandwidth used to transmit the video avatar data captured by the two stereo cameras was about 40 Mbps.
Figure 10 System Construction for Mutual Communication Experiment
As for the contents of the virtual world, a wide-area virtual town was shared between remote users in the immersive projection displays. In this world, the users were able to communicate freely using their video avatars. For example, one user guided the other while walking together in the virtual town, or the two users arranged to meet in the virtual town, confirming their locations within it. Figure 11 shows examples of the users' communication in the virtual town.
Figure 11 Mutual communication using video avatar
In this experiment, by transmitting the 2.5-dimensional video avatar mutually within the shared virtual world, effective communication was realized. However, since the users moved over a wide area of the virtual world, situations arose that the front-facing cameras could not handle well. For example, one user often turned his back toward the other user or followed behind the other user in the virtual town. It is therefore necessary to capture the user's image not only from the front, but also from various lateral directions. Although two cameras were placed at the front corners of the field of view in this experiment, the use of more cameras surrounding the user to capture the user's image from various directions would be effective.
In this study, video avatar technology was developed in order to realize high-presence communication in the networked virtual world. This method represents the user's stereo avatar in a shared virtual world, using a live video image in real time. Using this method, several communication experiments -- such as a presentation, improvised acting, and simultaneous mutual communication -- were conducted. From these results, we can say that natural communications using immersive projection displays can be realized, and the effectiveness and range of possibilities enabled by video avatar communications in several fields of application were demonstrated. Future work will include improving the precision of the generated video avatar and applying this technology to collaborative work over international networks.
We thank Dr. Tom DeFanti and Dr. Dan Sandin of the University of Illinois at Chicago and Dr. Hideaki Kuzuoka of the University of Tsukuba for their useful discussions. We also thank the members of the Hirose Laboratory at the University of Tokyo and the Gifu MVL Research Center for their help in the communication experiments.