In an investigation of models for the support of nested data parallel computation, we have concluded that a multi-threaded model of execution, tailored to a SPMD computational model, can provide a natural expression of this paradigm. The machine we have designed is a distributed memory multiprocessor SPMD machine whose nodal model of execution is based on a paradigm of self-scheduling threads. This paradigm is similar to others proposed in recent literature in both the parallel hardware community [1] and the parallel software community [2].
Our implementation is defined in two stages. The first is an abstract machine -- essentially a virtual multi-computer. The second is a concrete realization of that machine using the active message library of the CM-5, which we have used as a target for experiments in compilation and performance analysis of parallel functional languages. The abstract machine design allows straightforward instantiation on other platforms -- the main requirement is a message passing library with which threads can be initiated on remote nodes.
At the top-most level of abstraction we define our machine in terms of an abstract machine "program" P and a set of "processing nodes" Ni. An abstract machine program is defined a collection of disjoint "threads" each made up of a sequence of abstract machine instructions. Threads are analogous to traditional procedural units of activation; the instructions in the block are executed in sequence until a distinguished final instruction is reached.
Each node of our machine holds its own distinct copy of the abstract machine program P, and executes threads from that program in a manner that is independent of the state of any other node. That is, our abstract machine is SPMD. Nodes additionally maintain a computational state that is defined as a tuple incorporating a host id, a set of thread activations, a set of result values (which are pairs (key, value)), the node's memory pool, and a currently executing activation.
During the course of thread execution, nodes may engage in communication operations with other nodes. All such operations within our model are framed in terms of asynchronous messages which alter the state of the receiving node in a manner that is invisible to the receiving node's executing activation. Most available distributed memory architectures support this style of communication and hence are described by our model. In this sense our model is architecture-independent.
The abstract machine supports a single form of data aggregate, the one-dimensional vector. More complex data structures may be constructed as a nesting of such vectors. Every such vector is by definition a partitioned aggregate, that is, its component elements are spread among the various nodes of the abstract machine. These elements are stored in the nodal memory pool. The fashion in which this partitioning is performed is encapsulated within a partitioning function which is associated with the vector, and defines the mapping of elements of the aggregate to nodes of the abstract machine.
In traditional SPMD machines, the flow of control between code units is deterministic and fully specified by the instructions within the code. A code unit becomes active on node Ni, after which time instructions from that unit are executed until such time as the unit terminates or it enters a request that a new code unit be activated. In the latter case, the currently executing code unit's execution is suspended, and the execution of the new activation commences. Once the activation it entered has terminated, the original code unit is resumed.
In our abstract machine the model of execution is significantly different. A thread which has become active continues its execution until such time as it explicitly requests suspension or termination. In the course of its execution an activation may request the activation of other threads, however such activations do not occur immediately. Rather, such a request causes an entry to be added to the node's "task pool", which maintains records of all activations, both active, suspended and awaiting execution. Whenever the node becomes idle, through either the termination or voluntary suspension of the presently executing activation, an activation is chosen non-deterministically from eligible activations in this pool.
Possible activation states are "pending", "active" and "suspended". A pending activation is one that has been entered by another activation but has to date not been scheduled for execution. There can be at most one activation with state "active" within a task pool. A "suspended" activation is one which voluntarily suspended its execution. Upon suspension, an activation specifies a "key" (or identifier) which denotes the conditions under which its execution may continue.
Each node maintains a "result pool", a set of tuples, the first element of which is a key, the remaining elements arbitrary values. Its purpose is as a medium of communication and synchronisation between activations. An activation may add a result tuple to either the pool of the node on which it is presently executing, or to another node's pool through a communication operation.
The result pool of a node plays an important role in the scheduling of activations. Activations which have voluntarily suspended their execution may only be re-scheduled when one or more results with the same key as that which on which they suspended exist within the node's pool. The semantics of resumption are that exactly one result which bears the appropriate key is removed from the pool and returned as the result of the call to the suspension function when the code stream in the resuming activation continues.
As can be seen above, an important factor in the scheduling and communication functions of a node are the identifiers known as keys, defined to be elements of an infinite "key-space". It is the responsibility of the abstract machine program to institute a protocol of key management which ensures that results intended for receipt by a particular suspended activation are never intercepted by another.
In addition to traditional algorithmic control constructs (for, while, if) and the standard imperative operation of assignment, the abstract machine has a number of special operations that deal with threads and communications. Instructions are provided to enter an activation, to activate a thread on another node, to deposit values in either the host result pool or that of another node, to suspend and terminate activations on keys, to allocate storage, and to generate unique identifiers.
In terms of this instruction set it is possible to define partitioning-independent nested data parallel operations. At the most abstract level such operations are modelled as a set of thread activations, one for each element of the source vector. Each such activation runs on the node owning the corresponding vector element ("Owner Computes") and accepts the element as input.
We have expressed a number of important data parallel primitives in such a fashion: thread synchronisation, distributed vector allocation, and second-order data-parallel operators map, reduce, and scan [3]. The threaded nature of the abstract machine permits such descriptions to take advantage of latency-hiding. In the near future, we plan to use the virtual machine in a wider context, exploring its realization on alternative message passing systems, and as a target for other language models -- for example, the "wait-by-necessity" construction of Eiffel// is readily supported by our key-based suspension mechanism.
[2] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken. TAM --- a Compiler Controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing}, 18:347--370, 1993.
[3] Dean Engelhardt, D and Andrew L Wendelborn, "A Partitioning-Independent Paradigm for Nested Data Parallelism", International Journal of Parallel Programmimg, vol. 24, no. 4, August 1996.