Recent rapid developments in microprocessor technology gave us more opportunity for scientific, computationally challenging problems. As CPU performance increases, we can make the problems bigger in both temporal and spatial resolution.
In addition, recent rapid developments in the latest network technologies facilitate huge amount of scientific data transmission. Quality of Service (QoS) technology assures network parameters such as jitter, bandwidths and so on. In other words, such technology has the potential to solve the geographical distribution problem of both user and resources.
One of the scientific problems that require both computational solution and geographical solution is the medical problem. The recent development in medical imaging technologies enhances the spatio-temporal resolution of medical images. While it results in more useful information for diagnosis, the analysis time increases. Longer analysis time prevents us from detecting symptoms early, which results in an inefficient diagnosis. In other words, efficient diagnosis requires a computational solution.
On the other hand, a geometrical solution is also required, partly because universal medical service is now demanded. Universal service aims to be a homogeneous service, regardless of the patient's location. Today, however, the quality of medical services is different between rural and urban regions because fundamental medical infrastructures are gathered in urban regions. In order to improve this situation, a geographical solution is essential, as well as a computational solution.
From both medical and social perspectives, medical problems are becoming increasingly important. In these days, advanced countries are confronted with a shift to an aging society. Health problems will increase as the population ages, resulting in higher costs of healthcare . Hence, more efficient diagnosis without compromising quality of services is required.
One of the most serious and severe problems demanding such services is diagnosis of brain disorders. As the population ages, the number of brain disorders such as dementia, cerebrovascular disease, epilepsy and other problems is increasing. In diagnosis of brain disorders, early detection of symptoms is very effective from a therapeutic viewpoint.
Evaluation of brain function has great importance for both clinical and scientific purposes. The brain is a complex organ in comparison with other organs, and it has parts whose mechanisms are still unknown.
A highly sophisticated medical instrumentation system, MEG (Magnetoencephalography) is supposed to be an effective and useful brain imaging technology that enables us to measure brain data accurately and harmlessly [2,3]. However, MEG has been used primarily for scientific studies. This is partly due to the costs of purchase and maintenance. Also, the amount of data that we have to analyze and the difficulty of analysis are other limiting factors. Although the analysis is a data-intensive and computation-intensive task, the process has been conducted with a single processor.
For clinical applications, quick data analysis is required. This means a new approach to data-intensive, computation-intensive analysis is required. The new approach must replace traditional computation methods based on a single processor.
To date, the demand for solving computation-intensive problems has produced several methods for processing with multiple processors such as SMP (symmetric multi-processing), MPP (massively parallel processing) and so on. These methods are very effective for a certain degree of problem size.
High-performance distributed computing systems, or computational grids, enable us to run parallel applications on a heterogeneous collection of computers such as SMP and MPP . The most notable characteristic of the computational grid is that we can utilize high-performance machines that may reside in different administrative domains. This means the computational grid has the potential to bring us far more computation power by integrating diverse computational resources on the Internet.
Our goal in this research is to develop the diagnosis supporting system with MEG modality using computational grid technology. Our system is designed to use far more computation power, and connect specialized and diverse resources such as MEG and supercomputers on the Internet.
For our goal, we have combined the Globus system [4,5] and local parallel programming environment based on MPI (Message Passing Interface) [6,7] within one computer. The Globus system enables us to construct a computational grid environment on the Internet, while MPI provides a parallel programming environment based on message passing methods within a single node, such as a cluster machine or SMP machine. We are planning to realize a grid-enabled system, which has enormous computation power and the capability to solve the geographical distribution problem through the combination of Globus and MPI.
In this paper, HP Exemplar was utilized for constructing and simulating our system. In simulation, we investigated whether our system can solve the problem of geographical distribution. Furthermore, we have explored the possibilities of future remote diagnosis.
The rest of sections are organized as follows. In the next section, we outline the design of our proposed system. In section 3, coarse-grained parallelism with Globus is described in detail. In section 4, fine-grained parallelism with MPI is described in detail. In section 5, we describe the simulation on Exemplar. In section 6, we review the results of simulation. In section 7, we conclude this paper.
In general, analysis of brain data is performed by comparing and investigating the correlation among brain data collected with brain imaging technologies. For the comparison and investigation of correlation between brain data, frequency analysis such as Fourier transform is often used. The comparison and investigation is a time-consuming task, as brain data are measured at multiple points. After that, evaluation of brain function must be performed based on the result of correlation analysis. In other words, diagnosis of brain disorders requires three steps; data collection, data analysis, and evaluation of brain function. This means that seamless processing from data collection to evaluation is necessary for efficient diagnosis.
For the investigation and comparison of brain data, wavelet cross-correlation analysis has been adopted in this research. This analysis can extract frequency information while preserving time information. Therefore, it is effective for non-stationary analysis such as the evaluation of brain function [8,9]. This analysis, however, needs more computation than the analysis based on Fourier transform. In addition, in order to locate the source of brain disorders, our analysis needs to be performed for each pair of measurement points. As a result, diagnosis of brain disorders is time-consuming task for a single workstation. The analysis time must be reduced for seamless processing.
Moreover, for seamless processing, brain data needs to be shared seamlessly among three processes. However, each process needs specialized resources; so the three processes are often distributed, which prevents efficient data sharing. Actually, brain data has been recorded on optical disk and passed from person to person by hand. Sharing data by hand not only wastes time but also may cause loss of data consistency or data itself.
Our fundamental approach to the solution of these two major problems, that is, the reduction of analysis time and the integration of data acquisition and analysis process, is to combine three distributed processes on the Internet. We are planning to construct a diagnosis supporting system on the Internet. Furthermore, our system is designed to enable us to not only integrate three processes, but also to draw computation power from a heterogeneous collection of computers.
Figure 1 illustrates our system design. Our system consists of three parts: data acquisition, computation, and I/O. This system design is based on three processes essential for our diagnosis of brain disorders. In other words, these parts are assumed to be located at a diagnosis organization such as a hospital, an organization equipped with high-performance computers, and an organization equipped with MEG, respectively. Our system integrates these three organizations and enables specialists in the diagnosis organization to access remote MEG data in the data collection organization without being aware of the network. Each function of three processes is explained as follows.
In our analysis, as a means for brain data measurement, MEG with 64 sensors is assumed to be utilized. Accordingly, for the comparison and investigation of correlation between brain data, at most 2016 pairs of wavelet cross-correlation analysis need to be performed. As each wavelet cross-correlation analysis has no relationship with the others, we distribute the workload of at most 2016 wavelet cross-correlation analysis with coarse-grained parallelism, and then distribute the workload of wavelet cross-correlation analysis itself with fine-grained parallelism. Figure 2 shows that our two-stage parallelism is very effective in comparison with traditional serial computation. These two parallelisms are described in detail in the following two sections.
3. I/O part: This part works as user interface. The analysis request from the user is sent to the computation part or data acquisition part. After analysis, the results sent from the computation part are received, and the I/O part visualizes the results.
The most important characteristic of our system is that it is capable of accessing remote data and drawing enormous computational power from multiple multi-processor machines on the Internet. This means that seamless data sharing among three organizations and reduction of analysis time can be realized.
Computational grid technology is being viewed as a critical element of future high-performance computing environments that will enable entirely new classes of computation-oriented applications. In other words, computational grid technology enables pervasive and consistent access to high-performance computational resources that may reside in different administrative domains, despite geographical distribution of both resources and users.
In this research, Globus grid toolkit has been deployed as the underlying computational grid technology. Globus provides various software components essential for computational grid environment, such as resource management, security, communication, and so on. These software services facilitate building and running parallel applications on a wide-area network. We have been developing our system by utilizing the services Globus provides, especially, GRAM (Globus Resource Allocation Manager) and Nexus communication services.
GRAM is a service for resource allocation and process management and provides a standard network-enabled interface to local resource management systems such as Load Sharing Facility (LSF) or Portable Batch System (PBS). This interface enables us to submit jobs to remote machines through GRAM API, despite the heterogeneity of resource management policy.
On the other hand, Nexus is a service for communication and facilitates us to express diverse communication models, ranging from point-to-point message passing to multicast communication, despite network heterogeneity [10,11]. A communication link in Nexus is formed by binding two basic abstractions, that is, startpoint and endpoint. Furthermore, Nexus supports RSR (Remote Service Request) mechanisms. RSR enables us to run the arbitrary module, which we defined in advance, on a remote machine by specifying Handler ID.
Figure 3 shows a system design based on these mechanisms. This design is based on a client-server model. After the server is invoked by GRAM, the server forms a communication link by using Nexus API. In our design, communication between one server and one client is bisectional. Hence, each server has both startpoint and endpoint linked to the client's endpoint and startpoint. The client has both startpoint and endpoint linked to multiple servers (Figure 4).
Our system has been implemented on this communication model. These bisectional communication links enable us to interact with and synchronize client and servers. In detail, the interaction and synchronization are realized by the use of Nexus RSR mechanism. Both client and servers have multiple RSR modules, and interact with each other. Figure 5 illustrates how this mechanism works.
(Steps 7, 8, and 9 work the same as 4, 5, and 6, respectively.)
Nexus and GRAM enable us not only to distribute the workload to heterogeneous collections of machines on the Internet, but also to interact and synchronize them. However, our goal is to construct a system that enables seamless and quick processing from data collection to visualization for diagnosis. Therefore, the analysis time within such a machine, that is, the analysis time for wavelet cross-correlation itself, must be reduced. In the next section, this further reduction is described.
With coarse-grained parallelism, the workload of our analysis is distributed to diverse computational resources. In general, this heterogeneity prevents our uniform design for fine-grained parallelism. In parallel machines, several methods for parallel processing are made available. For example, OpenMP, MPI, or PVM are examples of such methods. However, it is impossible to implement fine-grained parallelism for each architecture type in case 100 or 1,000 computational resources are planned to be used for coarse-grained parallelism in the future. Hence, a uniform method for fine-grained parallelism is required.
For realizing uniform fine-grained parallelism, the workload of wavelet cross-correlation analysis itself must be distributed to multiple processors within a single node with a uniform method. MPI has been utilized for this purpose in this system. MPI provides API for parallel programming based on message passing, which is a paradigm used widely on certain classes of parallel machines, especially those with distributed memory. In other words, MPI enables us to run parallel applications without being aware of the difference of machine architecture.
Wavelet cross-correlation analysis is based on wavelet transform. The wavelet transform of signal f(t) is defined as follows:
where g(t), a, and b is mother wavelet, scale parameter, and shift parameter, respectively. In this research, gaussian wavelet has been employed as a mother wavelet. Gaussian wavelet is suitable for spatio-temporal analysis if it is used for continuous wavelet transform . Gaussian wavelet is defined as follows.
This wavelet transform is performed for each measuring point over changing the parameters in both frequency and time. This operation makes mother wavelet scaled of shifted. Fig. 6 shows how gaussian wavelet as mother wavelet is scaled by parameter a.
Gaussian wavelets in Fig. 6 have no relationship with each other. We distribute the computation workload by using this locality. Fig. 7 illustrates how our analysis is performed with fine-grained parallelism. The workload is equally distributed according to the number of processors. Hence, the acceleration of analysis time critically depends on CPU parameters such as load average or number of processor, since almost no interaction is required between distributed problems.
The simulation for investigating the effect of two-stage parallelism on a computational grid was performed at the Computation Center, Osaka University, Japan. For the coarse-grained distribution of the workload, a 48-processor HP Exemplar was utilized. The Exemplar can be viewed as three different computers since it consists of three nodes, each of which is separately controlled by an operating system within the node. Only two nodes, however, were utilized as server in this simulation because of administrative policy constraint.
In this simulation, the workload is distributed to two nodes through GRAM API. GRAM deployed on the Exemplar is layered on LSF and simple fork scheduling. LSF is a middle ware for the workload distribution and batch scheduling. LSF was used for resource allocation management and process management.
For fine-grained parallelism, HP MPI, which is an implementation of the MPI standard, was utilized. Within each node, at most 16 processors were simultaneously used.
Furthermore, gang scheduling was used for preemptive co-allocation of processors. The performance of processors that participate in this fine-grained parallelism is a critical factor. The degradation of processor performance causes the decrease of whole system performance because the result of parallelized computation must be gathered at one point. Gang scheduling enables our MPI application to run on multiple processors simultaneously, preempting execution of other applications.
In this simulation, 2.4-second data (sampling interval 4 ms) was used and then 50 pairs of wavelet cross-correlation analysis were performed within each node. During this simulation, the load average of node (a) was a little higher than that of node (b).
Figure 8 shows the analysis time with fine-grained parallelism in these two nodes. The analysis time is the average of 30 measurements. In addition, the variance is very small. According to this graph, with an increase of CPU, the analysis time decreases.
Figure 9 shows the acceleration of analysis time, denoting analysis time of single processor as 1.0. In this simulation, the decrease of acceleration with an increase of processor is observed.
In this simulation, at most approximately 11 times acceleration have been observed only with the use of fine-grained parallelism on each node. In other words, through the use of both coarse-grained parallelism with two nodes and fine-grained parallelism, the analysis time has been accelerated approximately 22 times from serial computation.
The acceleration efficiency decreases gradually with the increase of processors (Figure 9). This seems to come from the overhead of I/O processing. Though our MPI program has no communication between processors during the computation, the initialization and finalization for broadcasting MEG data or gathering analysis results needs to be performed. For this purpose, MPI_Bcast or MPI_Gather, which HP MPI provides as basic collective primitive, was utilized. These collective operations generally load system, especially when the degree of parallelism is high.
This phenomenon is also explained by the system overhead of context switch. Though gang scheduling can preempt execution of other applications, it produces overhead time. This overhead seems to become big, especially in high-load state. This also explains the decrease of acceleration in using more than 14 processors, shown in Figure 9.
Interestingly, however, at most approximately 22 times acceleration was observed by combining coarse-grained parallelism and fine-grained parallelism on a computational grid. We consider the case of 64-sensor MEG, where investigations of all (2,016) combinations are required, for example. If this is performed in traditional fashion with single computer, it takes approximately 700 minutes to analyze 2.4-second data. In practice, 10-minute data needs to be analyzed for adequate diagnosis. In this case, one-hour measurement with MEG is performed. Hence, for clinical use of our wavelet cross-correlation, the analysis needs approximately 140,000 minutes (98 days). However, our simulation results show that the analysis time will be reduced to 4.5 days. Furthermore, if more machines can be used for coarse-grained parallelism, the analysis time would be dramatically reduced according to the number of servers. We believe the use of infinite resources on the Internet would have the potential to reduce our analysis time to approximately zero.
However, there are several problems to be solved in constructing the system running on a computational grid. One of them is quality of the network, such as bandwidth, latency, jitter, and so on. These parameters may be critical factors for deciding system performance, especially when the system needs the large amount of data transmission quickly and stably. Our system is not an exception, and will be affected by these parameters. As our system, however, needs the transmission of data just before or after computation on server, the effect by these parameters can be minimized. In general, however, network parameters are serious issues for constructing grid-enabled applications.
For the efficient operation of the system, users must know the network status. In order to address this need, Globus grid toolkit provides Gloperf network performance tool  that monitors and measures available network bandwidth. We can know current network status using Gloperf. Moreover, future advances in network technology such as QoS or transmission technology would lead us to the solution of these problems.
Reliability is also an important factor in building and running parallel applications on a computational grid. In Globus grid environment, in order to use the diverse resources on computational grids, Globus security policy needs to be satisfied. For this purpose, Globus provides GSI (Globus Security Infrastructure). GSI maps authenticated Globus credentials into locally recognized credentials such as Kerberos tickets, or local user names and password. This process performs the lookup of information on diverse resources on computational grid. In our simulation, it took much time and, what is worse, the process sometimes failed due to server disorder. This is explained by the fact that now there is only one server. Hence, this problem would be solved.
Our simulation indicates that a computational grid has the potential to bring us enormous calculation power and to integrate geographically distributed resources. Especially, the solution for geographical distribution problems may lead us to the possibilities to remote diagnosis in the future. In addition, the realization of network-transparent access to remote data would lead to medical databases such as electrical diagnosis. In this research, MEG could not be connected to the Internet because of security problems and regulatory restrictions. However, we think that the medical use of computational grid technology would greatly enhance the quality and efficiency of services in brain function diagnosis as well as medical services.
In our research, the solution for the scientific computationally challenging problems has been explored. As an example, a diagnosis supporting system for MEG has been developed. Our approach that takes maximum advantage of computational grid technology was described in this paper.
In order to investigate the effect of two-stage parallelism with Globus and MPI, the simulation was performed with an HP Exemplar at the Computation Center, Osaka University, Japan. In this simulation, at most 22 times acceleration of analysis time was achieved by using a total of 32 processors. This simulation results shows that the computational grid technology not only enables us to solve computation-intensive problems, but also leads us to the solution of geographical distribution problems of both users and resources.
The medical application of computational grid technology would lead to efficient and high quality medical service. Furthermore, such technology would enable remote diagnosis. In addition, computational grid technology would be effective for other scientific problems demanding computational solution.
This work was supported in part by Research for the Future Program of Japan Society for the Promotion of Science under the Project "Integrated Network Architecture for Advanced Multimedia Application Systems" (JSPS-RFTF97R16301).
 Mizuno-Matsumoto Y., Date S., Tamura S., Sato Y., Zoroofi R.A., Tabuchi Y., Shimojo S., Kadobayashi Y., Tatsumi H., Nogawa H., Shinosaki K., Takeda M., Inouye T. and Miyahara H.: Integration of signal processing and medical image for evaluation of brain function on Globus., Proceeding of Internet Workshop '99 (IWS'99), IEEE press., 241-246, 1999.
 MEG, http://www.ctf.com
 S. Sato. Magnetoencephalography. Advances in Neurology Vol54 Raven Press, NY, 1990.
 I. Foster, C. Kesselman. The Globus Project: A Status Report. Proc. IPPS/SPDP'98 Heterogeneous Computing Workshop, 4-18,1998.
 I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. J. Supercomputer Applications, 2: 115-128,1997.
 MPI, http://www.mpi-forum.org
 Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI---The Complete Reference: Volume 1, The MPI Core, 2nd edition. MIT Press, Cambridge, MA, 1998.
 Mizuno-Matsumoto Y., Tamura S., Sato S., Zoroofi R.A., Date S., Tabuchi Y., Shinosaki K., Ukai S., Ishii R., Inouye T., Tatsumi H., Kadobayashi Y., Shimojo S., Takeda M., Miyahara H.,: Wavelet-crosscorrelation analysis: Non-stationary analysis for neurophysiological signals. IEEE Trans. Biomed. Eng. (revised version submitted) 1999.
 Mizuno-Matsumoto Y., Tamura S., Sato Y., Zoroofi R.A., Yoshimine T., Kato A., Taniguchi M.,, Takeda M., Inouye T., Tatsumi H., Shimojo S., and Miyahara H.: Propagating process of epileptiform discharges using wavelet-crosscorrelation analysis in MEG. Recent Advances in Biomagnetism. Yoshimoto T. (Eds.), Tohoku Uni. Press, Sendai, 782-5, 1999.
 I. Foster, C. Kesslman, R. Olson, et al. Nexus: An interoperability toolkit for parallel and distributed computer systems. Technical Report ANL/MCS-TM-189, Argonne National Laboratory, 1994.
 I. Foster, C. Kesselman and S. Tuecke. The Nexus task-parallel runtime system Proc. 1st Intl Workshop on Parallel Processing, Tata McGraw Hill, 457-462, 1994.
 H. Kikuchi, M. Nakashizuka, H. Watanabe, et al. Fast wavelet transform and its application to detecting detonation. IEICE Trans. Fundamentals., Vol. E75-A, No. 8, pp. 980-987,1992.
 Craig A. Lee, Rich Wolski, Carl Kesselman, Ian Foster. A Network Performance Tool for Grid Environments (Submitted to Supercomputing '99)