Communication Software
We are running a 16-node cluster with Myrinet interconnects at our institute, using the ParaStation communication and administration software. With ParaStation (now sold by ParTec AG, a spin-off from the Department of Computer Science), we achieve high communication throughput between the processes of a parallel application. Because the communication layer had so far focused primarily on MPI programs, we added a network driver in 2002 that simulates Ethernet connections. With this new module, applications can establish standard TCP/IP connections over the high-speed communication hardware without recompilation. The module was evaluated in cooperation with the group for elementary particles and computer-aided physics at the University of Wuppertal. On their cluster ALiCE, they use the parallel file system PVFS for very I/O-intensive quantum chromodynamics simulations. By employing our new module for all PVFS communication, they achieved very good performance. The results will be published later this year.
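To illustrate the point, the following minimal Java sketch is ordinary, unmodified socket code; with the Ethernet-simulating driver loaded and the virtual interface configured, the same program communicates over the Myrinet hardware without recompilation. The host name and port are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Ordinary TCP/IP client code: with the Ethernet-simulating driver loaded,
// the connection is carried over the Myrinet hardware, yet neither this
// code nor the binary has to change. "node05" and port 7000 are placeholders.
public class TcpOverMyrinet {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("node05", 7000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("ping");
            System.out.println("reply: " + in.readLine());
        }
    }
}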
Parallel Programming Environments
In the project "parallel and distributed programming of clusters in Java", we are exploring the advantages of writing parallel applications in Java, for example with respect to the utilization of cluster resources. We developed JavaParty, a domain-specific language for programming cluster applications. JavaParty extends Java with transparent remote objects. It is used by our partners in the DFG-funded RESH project as well as by many external users.
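As a brief illustration, a JavaParty class is an ordinary Java class marked with the additional remote modifier; the class and method names below are made up for this sketch, and the code requires the JavaParty compiler and runtime.

// JavaParty: the "remote" modifier makes instances of this class eligible
// for placement on any node of the distributed JavaParty environment;
// method calls on such objects are transparently turned into remote calls.
public remote class Worker {
    public double simulate(int steps) {
        double result = 0.0;
        for (int i = 0; i < steps; i++) {
            result += Math.sin(i);   // stand-in for real computation
        }
        return result;
    }
}

// Regular Java code using the remote class; the JavaParty runtime decides
// on which node the Worker instance is created.
class Main {
    public static void main(String[] args) {
        Worker w = new Worker();           // may be instantiated remotely
        System.out.println(w.simulate(1000000));
    }
}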
JavaParty realizes a distributed object space through remote method invocation (RMI). Multiple threads execute in parallel to jointly solve a problem. In a remote call, a thread's point of execution moves to a different node, thereby creating a distributed thread. With standard libraries for remote method invocation, deadlocks can occur in synchronization operations, because monitors are no longer reentrant and remote monitor acquisition is impossible. Both problems were solved by adding support for transparent distributed threads to KaRMI, a fast implementation of RMI for clusters. With transparent distributed threads, Java monitors remain reentrant even in recursive remote invocations. Remote monitor acquisition is realized through a combination of an enhanced KaRMI API and a program transformation. In addition, the application gains full control over its threads, since signals are forwarded to the current point of execution.
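The following JavaParty-style sketch (class names invented) shows the kind of recursive remote invocation that deadlocks with standard RMI libraries but works with transparent distributed threads, because the callback is recognized as belonging to the same distributed thread.

// Illustrative only: names are made up. With plain RMI, the call chain
// a.work() -> b.callBack(a) -> a.log() deadlocks, because the callback
// arrives in a fresh server-side thread and blocks on the monitor already
// held by work(). With KaRMI's transparent distributed threads, all three
// calls belong to one distributed thread, so the monitor stays reentrant.
remote class A {
    public synchronized void work(B b) {
        b.callBack(this);        // remote call that will call back into A
    }
    public synchronized void log() {
        System.out.println("callback reached A again");
    }
}

remote class B {
    public void callBack(A a) {
        a.log();                 // re-enters A's monitor
    }
}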
We further designed a concept for checkpointing parallel and distributed systems. By regularly saving the state of a program to a persistent storage medium, recomputation of results obtained so far can be avoided in the case of a system crash. This is especially important for long-running simulation programs. The checkpointing process should be transparent to the application and economical with respect to cluster resources (computing time, main memory and storage, communication bandwidth). Checkpointing a distributed application is more complex than checkpointing a single process because of the dependencies induced by communication operations. Existing strategies focus on message passing systems. A JavaParty extension for distributed checkpointing, however, could address all these problems at the language level, thereby making further optimizations feasible.
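As a single-process illustration of the basic idea (not of the distributed concept itself, which must additionally capture the dependencies induced by communication), the following hedged Java sketch periodically serializes an invented State object to disk and resumes from the last checkpoint after a restart.

import java.io.*;

// Periodically serialize the application state to stable storage so that,
// after a crash, computation resumes from the last checkpoint instead of
// starting over. All names are invented for this sketch.
public class CheckpointDemo {
    static class State implements Serializable {
        long iteration;
        double partialResult;
    }

    static void save(State s, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(s);
        }
    }

    static State load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (State) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File ckpt = new File("state.ckpt");
        State s = ckpt.exists() ? load(ckpt) : new State();     // resume if possible
        for (; s.iteration < 1_000_000; s.iteration++) {
            s.partialResult += Math.sqrt(s.iteration);           // stand-in for real work
            if (s.iteration % 100_000 == 0) {
                save(s, ckpt);                                   // periodic checkpoint
            }
        }
        System.out.println(s.partialResult);
    }
}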
Additional information about JavaParty and KaRMI can be found at https://svn.ipd.kit.edu/trac/javaparty/wiki/JavaParty.
Parallel Filesystems
Clusterfile is a parallel file system for clusters of computers. In 2002, we focused on broadening its application area. The goal of an early design was to efficiently exploit the internal parallelism of applications. Internal parallelism emerges from the I/O accesses of multiple processes of the same application. External parallelism, on the other hand, arises from concurrent accesses of different applications. Our extensions address external parallelism by introducing not only application-specific but also system-wide optimizations.
We implemented the file system partly in user space and partly in the Linux kernel. Based on a kernel module supporting the VFS (Virtual Filesystem Switch) interface, Clusterfile can be mounted into the local directory tree of any cluster node. Metadata is managed through the cooperation of the kernel module with a central server. We introduced collective I/O operations in order to optimize simultaneous accesses from many nodes to the same file. Furthermore, we implemented an MPI-IO interface for Clusterfile, which we are currently comparing with other MPI-IO implementations.
In the future, we plan to increase application performance and scalability by introducing cooperative caching. This is joint work with the subproject Scalable Servers on COTS Clusters. Furthermore, we will improve the scalability of metadata management through decentralization. Currently, we are studying various policies, such as distributing or replicating metadata over the cluster nodes.
Scheduling Policies on Clusters
Gang Scheduling coordinates process task switches on multi-processor systems to enhance performance. It runs groups of intensively communicating processes in parallel, in order to avoid task switches that result from one process waiting to communicate with another (process thrashing).
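A common way to picture this is an Ousterhout-style scheduling matrix with one row per time slice and one column per processor; all processes of a gang occupy the same row, so a coordinated task switch activates them simultaneously. The small Java sketch below is purely illustrative, with invented gang names.

// Conceptual sketch (not the kernel implementation): each row of the matrix
// is one time slice, each column one CPU; a gang fills a row so that all of
// its processes run at the same time and can communicate without waiting.
public class GangMatrix {
    public static void main(String[] args) {
        String[][] slots = {
            {"A0", "A1", "A2", "A3"},   // time slice 0: gang A occupies all CPUs
            {"B0", "B1", "C0", "C1"},   // time slice 1: gangs B and C share the slice
        };
        for (int t = 0; t < slots.length; t++) {
            System.out.print("slice " + t + ":");
            for (int cpu = 0; cpu < slots[t].length; cpu++) {
                System.out.print(" CPU" + cpu + "=" + slots[t][cpu]);
            }
            System.out.println();
        }
    }
}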
While Gang Scheduling is state of the art on classical parallel computers, it creates new challenges on cluster computers. The operating system kernels of a cluster are only loosely coupled and run almost independently of each other. Process coordination across host boundaries can be disrupted by high-priority tasks inside the kernel (such as hardware interrupts or swapping). Moreover, communication latency is high compared to integrated parallel computers, which limits the precision of coordination.
During the reporting period, a mechanism for remotely triggering a process-group-oriented task switch, based on ICMP packets, was implemented in the Linux kernel. For validation, the kernel was instrumented and analysis tools were developed, so that the expected effect of the mechanism could be demonstrated. Building on this work, we will develop, evaluate, and optimize various scheduling policies.
Scalable Servers on COTS Clusters
This project aims at using clusters of computers as a powerful platform for building scalable servers. Our work focuses on two aspects. First, we develop efficient mechanisms for load balancing and cooperative caching among the cluster nodes. Second, we are interested in achieving a good trade-off between the hard-to-reconcile goals of load balancing and data reference locality.
In this regard, we developed Cluster Aware Remote Disks (CARD). CARDs are in-kernel disk drivers that employ cooperative caching algorithms. We designed and implemented one such algorithm, Home-Based Serverless Cooperative Caching (HSCC). For further information on CARDs, HSCC, and their evaluation, see http://www.ipd.uka.de/RESH/publ.html.
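Purely to illustrate the home-based idea, and not as a description of the actual HSCC algorithm (see the publication reference above), the following Java sketch hashes each block to a home node that keeps a small directory of which nodes currently cache the block, so that a miss can be served from a remote memory without a central server. All class and method names are invented.

import java.util.*;

// Hedged illustration of "home-based" cooperative caching: every disk block
// is assigned a home node by hashing its ID; the home node records which
// nodes cache the block and answers lookups on a local miss.
public class HomeBasedDirectory {
    private final int nodeCount;
    // per home node: blockId -> set of nodes caching that block
    private final List<Map<Long, Set<Integer>>> directories = new ArrayList<>();

    public HomeBasedDirectory(int nodeCount) {
        this.nodeCount = nodeCount;
        for (int i = 0; i < nodeCount; i++) directories.add(new HashMap<>());
    }

    public int homeOf(long blockId) {
        return (int) (blockId % nodeCount);          // hash block to its home node
    }

    public void recordCached(long blockId, int node) {
        directories.get(homeOf(blockId))
                   .computeIfAbsent(blockId, k -> new HashSet<>()).add(node);
    }

    // On a local miss, ask the block's home node for a peer holding a copy.
    public OptionalInt findCopy(long blockId) {
        Set<Integer> holders = directories.get(homeOf(blockId)).get(blockId);
        return (holders == null || holders.isEmpty())
                ? OptionalInt.empty()
                : OptionalInt.of(holders.iterator().next());
    }
}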
Furthermore, we designed and implemented Home-Based Locality-Aware Request Distribution (HLARD), a request distribution policy that combines HSCC with TCP connection endpoint migration. When migrating a TCP connection endpoint, a server machine physically hands the endpoint over to another server in the cluster; the client remains completely oblivious to the procedure. HLARD distributes incoming requests according to the locality of the requested data, as advertised by HSCC.
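The following hedged Java sketch (all names invented, endpoint migration not modeled) illustrates the distribution policy: a request is dispatched to the back-end node that already caches the requested data, and only falls back to the least-loaded node on a miss.

import java.util.*;

// Locality-aware request distribution, conceptually: prefer the node that
// already caches the data, otherwise pick the least-loaded node. In HLARD,
// the chosen node would then take over the client's TCP connection via
// endpoint migration, which is only indicated by a comment here.
public class LocalityAwareDispatcher {
    private final Map<String, Integer> cacheLocation = new HashMap<>(); // url -> node
    private final int[] load;                                           // requests per node

    public LocalityAwareDispatcher(int nodes) {
        this.load = new int[nodes];
    }

    public void advertise(String url, int node) {        // locality info from the cache layer
        cacheLocation.put(url, node);
    }

    public int dispatch(String url) {
        Integer cached = cacheLocation.get(url);
        int target;
        if (cached != null) {
            target = cached;                              // data locality wins
        } else {
            target = 0;                                   // otherwise the least-loaded node
            for (int n = 1; n < load.length; n++) {
                if (load[n] < load[target]) target = n;
            }
            cacheLocation.put(url, target);               // the chosen node will cache the data
        }
        load[target]++;
        // here the TCP connection endpoint would be migrated to node 'target'
        return target;
    }
}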