Data Partition and Migration for High Performance Computation in Distributed Memory Multiprocessors

Data-partition and migration for efficient communication in distributed memory architectures are critical to the performance of data parallel algorithms. This research presents a formal methodology for the process of data-distribution and redistribution using tensor products and stride permutations as mathematical tools. The algebraic expressions representing data-partition and migration operate directly on a data vector, and hence can be conveniently embedded into an algorithm. It is also shown that these expressions are useful for clearly understanding, and efficiently interleaving, problems that involve different data-distributions at different phases. This compatibility allowed us to utilize the expressions in developing and demonstrating matrix transpose and fast Fourier transform algorithms. Using these expressions at data interfaces produced an efficient parallel implementation for solving the Euler partial differential equation. An effort to minimize communication cost using the expressions for data-distribution uncovered a routing scheme for Fourier transform evaluation; the results indicate that, for large parallel machines, this scheme answers today's problems featuring enormous data. Finally, a unique data-distribution technique that effectively uses transpose algorithms for the multiplication of two rectangular matrices is derived. The performance of these algorithms is evaluated by carrying out implementations on Intel's i860-based iPSC/860, Touchstone Delta, Gamma, and Paragon supercomputers.


Preface
The demand for high-speed computers has exceeded the available computing power at every point in the computer era. Even very impressive electronic components cannot satisfy today's appetite for the enormous number of calculations involved in most practical applications. Given these challenges, parallel processing is the way to achieve the desired computing speeds. A parallel computer consists of a collection of processing units that work together to solve an application. Architects of parallel computers have the freedom to select the number of processing units, to link processors through various interconnections, to have shared or distributed memory, to design synchronous or asynchronous operations, etcetera.
For academic researchers, access to supercomputers is still limited. Nonetheless, usage of supercomputers by the community of scientists is increasing every year, and the research projects performed on them have become more ambitious and sophisticated.
To solve problems once thought impractical, supercomputers have become friendly tools.
This dissertation addresses aspects of parallel systems that have distributed memory and are independent of the underlying interconnection network. The problems studied in this dissertation are based on mathematical tools that can represent algorithms for parallel systems. Examples are used as often as possible to illustrate these tools. Distributing a problem onto processors is modeled using these tools, which prove helpful both for optimizing old solutions and for deriving new ones. A list of references to publications where related problems and algorithms are treated is provided at the end.

Introduction
Many scientific computations in engineering, energy resources, medicine, the military, artificial intelligence, and basic research demand fast computers to achieve the required computational performance. Without the existence of superpower computers, the study of many of these applications and the efforts to meet today's challenges could hardly be realized. Since device characteristics are approaching the physical limit, parallel processing is the only way to improve computing power further in order to meet its ever increasing demand. Research in cost-effective, high-speed, massively parallel, and reliable supercomputers has become a very active field in computer engineering.
Two distinct and important parallel computer architectures are shared-memory and message-passing systems [1,2]. A shared-memory machine has a single global memory accessible to all processors; examples include the IBM RP3, Encore Multimax, Cray X-MP, and many workstations. Its important feature is that communication between nodes is done by reading from and writing into the shared memory. However, the shared memory builds a barrier against increasing the number of processing elements. Message-passing systems, also known as distributed memory systems, allocate a stipulated amount of memory to each processing element, but the data does not form a single address space [3]. Communication among processors is done through message-passing.
The Intel iPSC, nCube, and Caltech Mark II hypercube belong to this category. In this architecture, if an application shares data across distinct nodes, the programmer explicitly commands data to be ported from one node to another. Since no resources such as data, cache, CPU time, etc., are globally shared besides the links (a link is accessible to a very limited number of processors), message-passing systems are scalable and preferred by researchers for solving larger problems.
One important characteristic of message-passing machines is that there is a significant timing difference between local and remote data accesses. Remote data access involves message-passing among processors, which takes a significant portion of the total execution time of a computational procedure. The amount of remote data access needed to accomplish a given computation depends mainly upon how data are initially allocated to processors. We refer to this initial data allocation as data-partition. Efficient data-partition in distributed memory systems is essential for achieving high performance of data parallel programs. An optimal data-partition for one individual algorithm (computation module) may not be optimal for others. Therefore, optimizing the data-partition for applications that involve a number of computation algorithms or modules presents an interesting challenge to parallel processing researchers.
Optimization of data-partition is achieved mainly by adhering to the rule that the number of message-passings should be minimal for a given system architecture.
The system architecture of a distributed memory system is mainly determined by its interconnection structure. Various interconnection networks and evaluations of their performance have been reported in the literature [4,5,6,7]. Some well known network schemes for message-passing systems are ring, tree, mesh, hypercube, and star connections. Processors that are directly connected are called neighbors; processors that are not connected through a direct link are called non-neighbors. Distance between processors is calculated as the minimum number of processors through which they are connected. It has also been observed that the farther away a node is, the longer it takes to port data to it. Hence, the performance of an implementation can be optimized by minimizing the distance data must travel between nodes. Extensive efforts have been put forth in minimizing communication distance among processors in the research field of parallel programming [8,9].
Recently, a very important technological advancement known as wormhole routing [10,11] has occurred in interconnection networks to minimize communication cost.
In this scheme, a message is divided into a number of flits (flow control digits). Once the header flit of a message acquires a channel, it governs the route of that message and the remaining flits follow in a pipelined fashion. This pipelined nature of wormhole routing results in a network latency that is relatively insensitive to the distance between processors [12], which in turn has made communication cost largely independent of path length.
Because of the importance of data-partition in data parallel programs, we focused our attention on the study of partitioning an application and migrating the necessary data among processors. This dissertation formulates data-partitioning schemes and derives variants of migration schemes in distributed memory multiprocessor systems using tensor algebra. The essential feature of this formulation is that data-partition and migration are represented using simple tensor algebraic expressions.
Therefore, they can form parts of an algorithm that is already written in tensor algebraic notation. Furthermore, by using tensor notation and stride permutations, our formulation is simple and compact without having to deal with complicated indices in complex data structures. Such a clear mathematical representation of storage schemes greatly helps parallel programmers to look into the inherent structure(s) of an algorithm and the associated communication cost.
With the newly proposed formulations, optimal data-migration at interfaces between computation modules becomes a matter of straightforward algebraic manipulation. This research demonstrates the manipulation of fast Fourier transform (FFT) algorithms for efficient implementations. These algorithms are applied to an application that solves the Euler partial differential equation using the wavelet-Galerkin method, achieving a significant improvement in overall performance. Then, we design an efficient two-dimensional FFT algorithm for distributed systems using the algebraic expressions for mesh-division data-partitioning. This design is shown to be a solution to problems featuring huge data sizes, large machines, and higher dimensionality. Optimal data-partition is also considered for matrix multiplication algorithms. In order to demonstrate the usefulness and significance of our data-partition expressions, we carried out implementations of our algorithms and applications on Intel's supercomputers: the iPSC/860, Paragon, Gamma, and Touchstone Delta.
This dissertation is organized as follows: Chapter 2 reviews the tensor algebraic notation and related theorems that are required for a better understanding of the concepts in our contributions. It also presents a survey of the existing literature related to this dissertation. Chapter 3 presents definitions for three data-allocation schemes in tensor algebraic notation. These expressions are demonstrated for the case of matrix transpose algorithms using all three distinct data-allocation schemes. This chapter also presents a tensor product formulation of the two-dimensional discrete Fourier transform (2D-DFT) for row-division data-partition using the transpose algorithms. Computation of vorticity and stream functions in two-dimensional fluid turbulence using the 2D-DFT is carried out in Chapter 4 to demonstrate the usage of the data-allocation expressions to efficiently interface two computation modules that are efficient for two different data-partition schemes. A new and highly efficient approach to evaluating large 2D-DFTs on large parallel computers is derived using tensor algebra in Chapter 5. Chapter 6 considers matrix multiplication algorithms for distributed memory systems via the matrix transpose algorithm, presenting a unique data allocation scheme for rectangular multiplicands. Chapter 7 discusses future research and possible extensions of the methods developed in this work to other algorithms.

Chapter 2
Preliminaries and Related Work

Introduction
In a distributed environment, the implementation procedure for most applications involves dividing the main computational task into (a) local tasks that depend upon data residing in a node's local memory, and (b) global tasks that depend upon data residing at more than one node. Such an identification and separation gives an estimate of the degree of node balance in an implementation and also of the inherent message-passing overheads. Tensor algebra is a mathematical language that aids in identifying, expressing, and analyzing these tasks in an algorithm. The two most important operations in tensor algebra are tensor products and stride permutations. In the following sections, we introduce these operations and demonstrate their importance with respect to parallel machines. This notation is used in Chapter 3 to develop a set of formal definitions for data-partition schemes. Later parts of Chapter 3 use the same notation to express matrix transpose algorithms and multidimensional fast Fourier transform (FFT) calculations, with an emphasis on their implementation aspects for a distributed memory system. Chapters 4 and 5 also deal with variants of FFT algorithms using the same notation.
Sections 2.2 through 2.5 review the notation and theorems from tensor algebra relevant to this dissertation. A survey of the literature related to data-partitioning schemes is presented in Section 2.6, while those for FFT algorithms and matrix computations are presented in Sections 2.7 and 2.8, respectively. Section 2.9 gives a brief description of the machines on which the experiments in this dissertation are conducted. Section 2.10 concludes the chapter.

Operators Mat and Vect
The operator Mat forms a matrix from a linear array. Its inverse operation, Vect_{KJ}, forms a linear array from a K x J matrix according to the column-major scheme, stacking the columns one below another.
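As an illustration, here is a minimal Python sketch of this operator pair under the column-major convention (the names Mat and Vect follow the text; the implementation itself is ours):

import numpy as np

def Vect(A):
    """Column-major linearization: stack the columns of A into one vector."""
    return A.flatten(order="F")

def Mat(a, K, J):
    """Inverse of Vect: refill a K x J matrix column by column."""
    return a.reshape((J, K)).T

A = np.arange(12).reshape(4, 3)          # a 4 x 3 example matrix
a = Vect(A)                              # [A[0,0], A[1,0], ..., A[3,2]]
assert np.array_equal(Mat(a, 4, 3), A)   # Mat undoes Vect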

Stride Permutation
Stride permutations are a natural way of representing data-shuffling operations. We use P(L_x, S) to represent a stride permutation operation on a vector of length L_x with stride S. Let x be an L_s·S-element vector with L_x = L_s·S. Then the stride permutation y = P(L_x, S)x performs the following operations. The first L_s elements of y are obtained by picking up elements of x starting at x_0 and then taking every Sth element of x, that is, {x_0, x_S, ..., x_{(L_s-1)S}}. The next L_s elements of y are obtained in the same way starting at x_1: {x_1, x_{S+1}, ..., x_{(L_s-1)S+1}}, and so on. Therefore, the stride permutation operation P(L_x, S) is an L_x x L_x matrix filled with zeros and ones.
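The definition translates directly into a few lines of Python; the helper name stride_perm and the use of NumPy are our own choices for illustration:

import numpy as np

def stride_perm(L, S):
    """Stride permutation matrix P(L, S): gathers input elements at stride S."""
    idx = np.arange(L).reshape(-1, S).T.ravel()   # group g lists the indices j*S + g
    return np.eye(L, dtype=int)[idx]

x = np.arange(8)                 # L = 8, S = 2, so L_s = 4
y = stride_perm(8, 2) @ x
print(y)                         # [0 2 4 6 1 3 5 7]: even-indexed elements first,
                                 # then the odd-indexed ones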

Tensor Product
The tensor product is a binary operator between two matrices of any size. Given two matrices A and B of sizes M_A x N_A and M_B x N_B, respectively, a new matrix C of size M_A·M_B x N_A·N_B can be generated as the tensor product of A and B, whose (i,j)th block is a(i,j)B, where a(i,j) is the element in the ith row and jth column of A and a(i,j)B is a scalar-matrix multiplication. In other words, the tensor product is a list of all possible products of one matrix's elements with the other's. The product can also be expressed using the operators Mat and Vect, as in equation (2.8). On a J-processor architecture, this representation gives a mechanism for different processors to simultaneously operate the matrix A on different parts of the input data.
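NumPy's kron implements exactly this element-by-block construction, which gives a quick way to check the definition:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # M_A x N_A = 2 x 3
B = np.array([[0, 1],
              [1, 0]])           # M_B x N_B = 2 x 2
C = np.kron(A, B)                # size (M_A*M_B) x (N_A*N_B) = 4 x 6
# The (i, j)th 2 x 2 block of C is a(i, j)*B; spot-check one entry:
assert C[1*2 + 0, 2*2 + 1] == A[1, 2] * B[0, 1]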

Example 2.3 Consider a 4-processor machine and an operational matrix A to be operated on a vector x as (I_4 ⊗ A)x. The block-diagonal structure of I_4 ⊗ A gives each processor its own copy of A acting on its own segment of x. Therefore, tensor products with a prior identity matrix can be used to determine the existence of local (parallel) tasks. In general, on a k-processor distributed memory machine, execution of (I_J ⊗ A) implies k local tasks, where J = nk and n is a positive integer.
If an identity matrix appears on the right-hand side of a tensor product, as in (A ⊗ I_n), the operation is performed in a way natural for vector computers: each scalar operation of A is applied to vectors of length n rather than to scalars.
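The following sketch illustrates both readings on small examples in the spirit of Example 2.3 (the sizes are our illustrative choices):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.arange(8.0)

# (I_4 tensor A) is block diagonal: each of 4 "processors" applies A to its
# own 2-element segment independently -- purely local, parallel tasks.
y_tensor = np.kron(np.eye(4), A) @ x
y_local = np.concatenate([A @ x[2*p:2*p + 2] for p in range(4)])
assert np.allclose(y_tensor, y_local)

# (A tensor I_4) performs the scalar operations of A on length-4 vectors
# at stride 4 -- the natural form for a vector computer.
z_tensor = np.kron(A, np.eye(4)) @ x
z_vector = np.concatenate([A[i, 0]*x[0:4] + A[i, 1]*x[4:8] for i in range(2)])
assert np.allclose(z_tensor, z_vector)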

Some Useful Theorems
A number of properties that tensor products satisfy in combination with stride permutations will be useful in developing variants of a parallel algorithm. We present these properties here without proof; interested readers can find the proofs in [13,14]. As in the algebra of matrices, a composite tensor product formulation should be read from right to left. One such property (equation (2.15)) is quite often used to derive parallel or vector computations when identity matrices appear in the product.

Theorem 2.4 Commutative Law:
Interchanging the order of the tensor product operands results in permutations: the swapped product equals the original product conjugated by stride permutations.
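Under the stride-permutation convention of Section 2.3, one common statement of this law is B ⊗ A = P(MN, N)(A ⊗ B)P(MN, M) for A of size M x M and B of size N x N; we cannot reproduce the dissertation's own equation here, but the identity itself checks out numerically:

import numpy as np

def stride_perm(L, S):
    idx = np.arange(L).reshape(-1, S).T.ravel()
    return np.eye(L, dtype=int)[idx]

rng = np.random.default_rng(0)
M, N = 2, 3
A = rng.standard_normal((M, M))
B = rng.standard_normal((N, N))

lhs = np.kron(B, A)
rhs = stride_perm(M*N, N) @ np.kron(A, B) @ stride_perm(M*N, M)
assert np.allclose(lhs, rhs)    # swapping the factors costs two stride permutations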

Existing Data-Partition Representations
It is well known that data-distribution in distributed memory multiprocessors is essential to achieving high performance in data-parallel programs. Extensive research has been reported on data-decomposition optimization for distributed memory machines [15,16,17,18,19]. Research in this area can be roughly classified into two categories. One aims at finding optimal data-partitioning schemes for parallel loop constructs as part of a compiler. It has been shown that the problem of finding an optimal data-partition is NP-complete [17,20,15]; therefore, researchers have to rely on heuristic methods [20,21,22,16,23]. The other effort aims at special-purpose implementations, and a large body of work on developing optimal implementations of individual algorithms has been reported [24,25,26].
Typically, an application requires a number of computation modules linked together to accomplish a specific computation. Global optimization depends not only on optimal implementation of the computational modules, but at least equally on the interface between these implementations as determined by the data partition and migration across processors.
In this dissertation, we present a systematic formulation of data-partition and migration on distributed memory multiprocessors in terms of tensor product notation and stride permutations [27]. Data-partition and migration are represented using simple tensor algebraic expressions highlighting the computational and communication complexity of parallel algorithms. Therefore, optimal data-partition and migration at the interfaces between different algorithms become straightforward tensor algebraic manipulations with the aid of well-established theorems in this field. Furthermore, owing to the conciseness of the underlying algebra, the definitions are simple and compact without having to deal with complicated indices in complex data structures.
In order to demonstrate the significance and usefulness of our framework, we have carried out experiments on existing distributed memory multiprocessors such as Intel's Paragon and Touchstone Delta. Initially, our formal definitions are incorporated into three application problems: the matrix transpose algorithm, the two-dimensional discrete Fourier transform algorithm, and the solution of the Euler partial differential equation using the wavelet-Galerkin approach. Then, simple algebraic manipulations of these expressions are carried out to derive optimal data-partition and migration schemes.
Experimental timing results on these machines show that such simple algebraic manipulations result in performance improvements ranging from 30% to 600%.

Existing Multidimensional FFT Algorithms
The Fourier transform of large multidimensional data sets is an essential computation in many scientific and engineering fields, including seismology, meteorology, x-ray crystallography, radar, sonar, and medical imaging. Such fields require multidimensional arrays and large data sets for complete and faithful modeling. The development of powerful parallel computers has given scientists a means of studying problems with greater complexity and higher dimensionality. Classically, a data set is processed one dimension at a time, permitting control over the size of the computation and calling on well-established one-dimensional programs. Multidimensional processing offers a wider range of possible implementations than one-dimensional processing, due to the greater flexibility of movement in the data indexing set. This increased freedom, along with the large data sets typically found in multidimensional applications, places intensive demands on the communication aspects of the computation. Therefore, parallel programmers face ever greater challenges in developing efficient parallel FFT algorithms with minimum communication overhead.
Because of its inherent algorithm structure, the FFT lends itself naturally to parallel computation. There is a substantial amount of literature on parallelizing the FFT, [28,29,30,31,32,33,34] to mention a few. In [28,29,30,31], implementations of one-dimensional parallel FFTs on various multiprocessors were studied. In [32], implementation of a high-radix FFT on Boolean cube networks such as the Connection Machine was considered. Swarztrauber [34] investigated the parallel FFT on general hypercube multiprocessors. He derived an unordered parallel FFT algorithm on hypercube multiprocessors that has a minimum number of parallel data transfers between neighboring processors; it computes a one-dimensional FFT that spans processors. In contrast, in an existing two-dimensional FFT [35], the FFT along each dimension is computed after collecting all the required data, so that each computation is always local to a node.
In Chapter 5, we present an approach for computing the multidimensional DFT on distributed memory systems that effectively exploits the fact that today's distributed memory systems use wormhole routing for interprocessor communications. An extension of the algorithm to three- or more-dimensional problems using stride permutation and tensor product matrices is also presented, facilitating efficient data-partitioning and network setup on distributed memory multiprocessors. The data-partitioning scheme is well suited to boundary value problems in fluid dynamics, finite element analysis, etcetera. Results show that our algorithm is more than six times as fast as the existing algorithm in certain cases.
Moreover, the higher the parallelism, the better the new algorithm performs. Given that physical limits on memory exist at each processor, our new algorithm is a solution for today's large problems that involve multidimensional Fourier transform computations on massively parallel machines.

Survey of Matrix Algorithms
Many applications have numerical solutions in which the computational burden reduces partly or fully to matrix operations. One of the most elementary operations involving matrices is the multiplication of two matrices. However, since matrix multiplication requires substantially more data movement than most other operations, algorithms that address efficient data movement are crucial to effective implementation on concurrent computers.
For shared-memory systems, efficient parallel matrix computations are discussed in [36] within a theoretical framework for parallel random access machines (PRAM). Without any optimization techniques, the number of scalar operations in the multiplication of two N x N matrices is of order O(N^3). However, Strassen [37] discovered an algorithm that uses only O(N^{log_2 7}) scalar operations. Tensor product formulations of Strassen's algorithm are presented in [38,39], along with the capability to translate mathematically equivalent tensor product formulations onto shared-memory architectures.
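For reference, one level of Strassen's recursion replaces the eight block products of the classical method with seven; the following sketch is our own arrangement of the standard formulas:

import numpy as np

def strassen_2x2(A_blocks, B_blocks):
    """One level of Strassen's recursion on 2 x 2 block matrices:
    seven block multiplications instead of eight."""
    (a11, a12), (a21, a22) = A_blocks
    (b11, b12), (b21, b22) = B_blocks
    m1 = (a11 + a22) @ (b11 + b22)
    m2 = (a21 + a22) @ b11
    m3 = a11 @ (b12 - b22)
    m4 = a22 @ (b21 - b11)
    m5 = (a11 + a12) @ b22
    m6 = (a21 - a11) @ (b11 + b12)
    m7 = (a12 - a22) @ (b21 + b22)
    return np.block([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
blocks = lambda X: ((X[:2, :2], X[:2, 2:]), (X[2:, :2], X[2:, 2:]))
assert np.allclose(strassen_2x2(blocks(A), blocks(B)), A @ B)

Applied recursively, the seven-product step yields the O(N^{log_2 7}) operation count.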
Among algorithms constructed with an underlying topology in mind, Gentleman [40] has shown that for a mesh topology at least 0.35N routing steps are needed to compute the product of two N x N matrices, while O(log N) routes are needed for a hypercube topology. For vector processors, Hayes [41] presented an element-by-element technique for matrix-vector multiplication to solve systems of equations using iterative methods, with implementation results on the CYBER 205 demonstrating its effectiveness for irregular problems.
For distributed memory systems, communication between nodes is done through message-passing. If two or more nodes try to access the same data, the requests are queued and each node receives the message in a different time slot. This fact is emphasized in [42]. However, it is assumed in [43,44] that simultaneous memory requests by all processors can be served within the same time slot. In Chapter 6, we review Fox et al.'s broadcast-and-shift matrix multiplication algorithm [25,45] for message-passing architectures. This algorithm assumes mesh-division data-partitioning of the underlying data. The multiplication algorithms in [42,25,45] require movement of the multiplicands irrespective of the size of the resulting matrix. Recently, Johnsson proposed an algorithm [46] to minimize the communication time in matrix multiplication. This algorithm evolves from considering two extreme cases of the multiprocessor algorithm presented in [25,45], but it still requires movement of the multiplicands while results are accumulated in place.
Most existing research reported in the literature on parallel matrix multiplication concentrates on mapping the algorithm onto different topological structures. With the development of wormhole routing, algorithm performance has become more sensitive to the amount of data movement than to the topological structure of the parallel machine being considered. Furthermore, all existing parallel matrix algorithms require moving one or both multiplicands. In Chapter 6, we present an efficient algorithm [47] that requires no access to the multiplicands by other processors, owing to a unique data decomposition strategy. Only partial results need to be moved among processors. Compared to the algorithms in [25,45,46], the messages in this algorithm are comparatively shorter for the case of rectangular arrays. Performance improvements of up to 440% over the algorithm in [25,45] have been observed in actual implementations on Intel's Paragon and iPSC/860.

Experimental Environment
Intel's concurrent supercomputers are a cost-effective solution for large-scale applications. The experiments in this dissertation are performed on the iPSC/860, Touchstone Delta, and Paragon supercomputers. All these systems consist of a set of processing nodes, I/O nodes, peripheral units, and a front-end processor. Each processing node uses one or more i860 microprocessors. Message passing among nodes does not require any "store and forward" because of the Direct-Connect Module™ (DCM). With the DCM, one can view these systems as an ensemble of fully connected nodes with a uniform message latency. This means that programmers do not have to structure their application's communication according to the underlying topology (the physical connections between nodes). On each node, node system software runs to provide message-passing capabilities, memory management, and process management.
The iPSC/860 uses a hypercube topology for the physical connections linking its 64 nodes.
Each node is a processor/memory pair with a memory size of 8M bytes. The runtime software on the iPSC is the NX operating system. The Touchstone Delta uses a mesh topology for the physical connections linking its 512 nodes. Again, each node is a processor/memory pair, with possible memory sizes of 8M bytes and 16M bytes. The runtime software on the Touchstone Delta is the NX/1 operating system. The Paragon also uses mesh connectivity to connect 64 nodes. However, each node has two i860 processors: one to perform computations and another to perform the necessary communication instructions. Hence, the Paragon provides higher communication bandwidth than the iPSC or Delta. The memory at each node can be 16M bytes or 32M bytes. Also, the Paragon uses a more advanced runtime software, OSF/1, provided by the Open Software Foundation.
The runtime environment consists of a set of user interface commands that can be issued at the UNIX prompt and a set of system calls that are available to host and node programs. The most common programming model used with these supercomputers is the "single program, multiple data" (SPMD) model. In this model, the same program runs on each node in the application, but each node works on only part of the data. Any data that a node requires from another node is obtained using the corresponding runtime software. Owing to the underlying assembly-coded routing scheme and the uniform message latency, we assume in our analysis that a message passed between any two nodes, irrespective of which nodes, takes the same communication time.

Conclusion
The notation of a mathematical language for expressing and segregating an application and for allocating tasks on parallel systems has been explained, together with the relevant theorems.
A survey of existing data-partition schemes and of existing multiprocessor Fourier transform and matrix multiplication algorithms has been presented, along with a brief description of the platforms on which the experiments in this dissertation are conducted.

Storing Data in Distributed Memories
Most large scale applications of scientific computing involve manipulations of data that are expressed in terms of matrices and vectors. This is natural because matrix notation gives a compact way to express computation. Moreover, storing matrices or vectors in the memory of a computer system is the first step of any computation.
Different ways of storing data may result in different algorithmic structures as well as different computational performance. While methodologies and algebraic formulations for storing matrices in the linear memory space of a single-processor system exist, such as row-major and column-major, there is neither a formal, commonly agreed way of addressing data stored in distributed memory multiprocessor systems, nor an agreed formal description for the various storage schemes. Programmers of parallel machines usually organize data based on their convenience and on the efficiency of a specific algorithm. As a result, data-allocation and -partition schemes in parallel processing are very diversified. Therefore, there is a need for a unified approach to formalizing data-allocation and -partitioning in parallel machines, and for a clear and convenient mathematical representation of the various data-storage schemes. In parallel computers, particularly in distributed memory multiprocessors, communication costs are directly related to the data-storage schemes. A clear representation of storage schemes greatly helps parallel programmers to look into the structures of implementations and the communication costs associated with algorithms.
Consider a message-passing multiprocessor system with k processors labeled from 0 to k−1, where k = k_1 k_2. We would like to partition and store a two-dimensional (2D) matrix, denoted by A, onto this system. For simplicity and clarity of presentation, we consider only the cases [27] where the data can be evenly divided into k subsets, and we concentrate on our main interest: the algebraic representation of partitioning the matrix and storing the pieces in the processors' memories.
These data-partition definitions operate on a vector a that is formed as Vect_{MN}(A).
We use a boldfaced "P" (P) with an appropriate subscript to represent our data-partition definitions, and an italic "P" (P) to represent the stride permutation operation explained in Section 2.3. In the case of row-division, the I_2 on the right-hand side of P_R(8,8,4) = P(32,4) ⊗ I_2 represents moving vectors of length 2 according to the permutation matrix P(32,4).

Definition 3.2 Column-Division: Let P_C denote the column-division partition operator; its input permutation is the identity.
When this permutation is applied, the resulting data at processor-0 is shown with a dotted line. For column-division data partitioning, since the input permutation is an identity matrix, no action needs to be performed, and the vector a is simply segmented into four parts for allocation to the four processors. For mesh-division data partitioning, the I_2 on the left-hand side of P_M(8,8,2,2) represents an action to divide the vector a into two equal sets and perform the vector-stride action P(8,2) ⊗ I_4 on each set. This vector-stride action further divides each set into eight small subvectors of length 4 and shuffles them according to the permutation P(8,2). Once again, the data residing at processor-0 after the action of the input permutation is shown with a dotted line.
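The following Python sketch reconstructs the three definitions as we read them from this example: P_R(8,8,4) = P(32,4) ⊗ I_2, P_C an identity, and P_M(8,8,2,2) = I_2 ⊗ P(8,2) ⊗ I_4. The general parameterizations used below, P_R(M,N,k) = P(Nk, k) ⊗ I_{M/k} and P_M(M,N,k_1,k_2) = I_{k_2} ⊗ P((N/k_2)k_1, k_1) ⊗ I_{M/k_1}, are our extrapolations from the example, not quotations from the text:

import numpy as np

def stride_perm(L, S):
    idx = np.arange(L).reshape(-1, S).T.ravel()
    return np.eye(L, dtype=int)[idx]

def P_R(M, N, k):          # row-division: P(Nk, k) tensor I_{M/k}
    return np.kron(stride_perm(N*k, k), np.eye(M//k, dtype=int))

def P_C(M, N, k):          # column-division: the identity permutation
    return np.eye(M*N, dtype=int)

def P_M(M, N, k1, k2):     # mesh-division on a k1 x k2 grid
    return np.kron(np.eye(k2, dtype=int),
                   np.kron(stride_perm((N//k2)*k1, k1), np.eye(M//k1, dtype=int)))

A = np.arange(64).reshape(8, 8)      # an 8 x 8 matrix on 4 processors
a = A.flatten(order="F")             # column-major data vector

rows0 = (P_R(8, 8, 4) @ a)[:16]      # processor-0's share under row-division
mesh0 = (P_M(8, 8, 2, 2) @ a)[:16]   # processor-0's share under mesh-division
assert set(rows0) == set(A[:2, :].ravel())    # rows 0-1 of A
assert set(mesh0) == set(A[:4, :4].ravel())   # the top-left 4 x 4 block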

General Usage of Data-Partition Definitions
Consider any computational procedure that is expressed by an operational matrix G acting on a data vector a:

b = G a. (3.29)

[Illustration: the actions of P_R(8,8,4) = P(32,4) ⊗ I_2, P_C(8,8,4), and P_M(8,8,2,2) on the data vector a.]

This equation ignores the underlying data-partition necessary to carry out the computation on a distributed memory multiprocessor system. To bring out the data-partition, let ā (= Q_1 a) be the desired partition of a among the processors, where Q_1 is one of the data-partition schemes (P_R, P_C, or P_M) defined above. If one expects the output data to be in a particular partition after the computation, then the resultant data is of the form b̄, where b̄ = Q_2 b and Q_2 is also one of the definitions P_R, P_C, or P_M. Then equation (3.29) can be rewritten as

b̄ = (Q_2 G Q_1^{-1}) ā. (3.30)

Therefore, Ḡ = Q_2 G Q_1^{-1} is the actual-operational matrix that takes into account the complexity of the considered data-partition.

Moving Data Among Distributed Memories
Once the input data is partitioned among the processors, data migrations at the interfaces between individual algorithms may be necessary in order to achieve globally optimal performance of an application. One frequently used data migration in numerical applications is the well-known matrix transpose. Let a = Vect_{MN}(A_{M x N}); then

b = Vect_{NM}(A^T) = P(MN, M) a. (3.31)

Hence P(MN, M) is the operational matrix for transpose algorithms, that is, G = P(MN, M). When data-partition schemes are to be incorporated, the actual-operational matrix becomes Ḡ (see equation (3.30)):

Ḡ = Q_2 P(MN, M) Q_1^{-1}, (3.32)

and equation (3.31) becomes

b̄ = Ḡ ā. (3.33)

In the following, we show how to derive the operational matrices Ḡ required to transpose a matrix using the data-partitions defined in the previous section (assuming Q_1 = Q_2 for simplicity) and discuss their implementation aspects via the tensor product formulations.
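The transpose identity in equation (3.31) can be checked directly in a few lines (the helper stride_perm follows the earlier sketches):

import numpy as np

def stride_perm(L, S):
    idx = np.arange(L).reshape(-1, S).T.ravel()
    return np.eye(L, dtype=int)[idx]

M, N = 3, 5
A = np.arange(M*N).reshape(M, N)
a = A.flatten(order="F")                      # a = Vect_{MN}(A)
b = stride_perm(M*N, M) @ a                   # b = P(MN, M) a
assert np.array_equal(b, A.T.flatten(order="F"))   # b = Vect_{NM}(A^T)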

ROW-DIVISION
For the row-division data partition, we have Q_1 = Q_2 = P_R, so that Ḡ = P_R P(MN, M) P_R^{-1} (3.34). Expanding this with Definition 3.1 and the theorems of Section 2.5, the actual-operational matrix in equation (3.33) for the row-division partition can be expressed as two stages:

Ḡ = [I_{k^2} ⊗ P(MN/k^2, M/k)] [P(k^2, k) ⊗ I_{MN/k^2}]. (3.38)

The first stage, P(k^2, k) ⊗ I_{MN/k^2}, is a global task that involves message-passing among processors, since the expression does not contain an identity matrix I_k on the left-hand side. The size of each message being passed is MN/k^2, which is (1/k)th of the size of the data set residing at a processor; this is reflected in the tensor product expression by I_{MN/k^2}. The factor P(k^2, k) indicates that each processor has (k−1) subblocks to send out. Such message passings are carried out in (k−1) stages, with one subblock being kept within the processor. The pseudo-code implementation for this stage is shown in Table 3.2 (pseudo-code for message passing in transpose algorithms for either row-division or column-division partitions).
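Since Table 3.2 itself is not reproduced here, the following is a hedged mpi4py reconstruction of the exchange it describes, not the original NX code; the block size and the (rank + step) pairing of partners are our illustrative choices:

from mpi4py import MPI
import numpy as np

# Global stage P(k^2, k) tensor I_{MN/k^2}: processor p sends its q-th
# sub-block to processor q and receives processor q's p-th sub-block.
# Run with k ranks, e.g.: mpiexec -n 4 python transpose_stage.py
comm = MPI.COMM_WORLD
rank, k = comm.Get_rank(), comm.Get_size()
blk = 4                                          # hypothetical MN/k^2

local = (np.arange(k*blk, dtype=np.float64) + 100*rank).reshape(k, blk)
swapped = np.empty_like(local)
swapped[rank] = local[rank]                      # one sub-block stays home
for step in range(1, k):                         # the (k - 1) message stages
    dest = (rank + step) % k
    src = (rank - step) % k
    comm.Sendrecv(local[dest], dest=dest, sendtag=step,
                  recvbuf=swapped[src], source=src, recvtag=step)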
The second stage, I_k ⊗ I_k ⊗ P(MN/k^2, M/k), represents a local task due to the identity matrix I_k on the left-hand side: each processor performs the parallel-stride permutation P(MN/k^2, M/k) on its own data, in parallel with all the others. A numerical check of this two-stage factorization is sketched below.
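The check below relies on our reconstructed definition of P_R (see the sketch in the previous section) and on the fact that the inverse of a permutation matrix is its transpose:

import numpy as np

def stride_perm(L, S):
    idx = np.arange(L).reshape(-1, S).T.ravel()
    return np.eye(L, dtype=int)[idx]

M = N = 4
k = 2
P_R = np.kron(stride_perm(N*k, k), np.eye(M//k, dtype=int))
G_bar = P_R @ stride_perm(M*N, M) @ P_R.T        # Q_2 P(MN, M) Q_1^{-1}

stage1 = np.kron(stride_perm(k*k, k), np.eye(M*N//(k*k), dtype=int))     # global
stage2 = np.kron(np.eye(k*k, dtype=int), stride_perm(M*N//(k*k), M//k))  # local
assert np.array_equal(G_bar, stage2 @ stage1)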

COLUMN-DIVISION
For the column-division partition the input permutation is the identity, so G = P(MN, M) itself must be decomposed. The actual-operational matrix in equation (3.33) for column-division partitioning can be expressed as three stages:

Ḡ = [I_k ⊗ P(MN/k, M/k)] [P(k^2, k) ⊗ I_{MN/k^2}] [I_k ⊗ P(N, k) ⊗ I_{M/k}]. (3.42)

The first stage, I_k ⊗ P(N, k) ⊗ I_{M/k}, represents local data permutations without message-passing, owing to the prior identity I_k: each processor performs the vector-stride permutation P(N, k) ⊗ I_{M/k} on its local data. The second stage, P(k^2, k) ⊗ I_{MN/k^2}, is a global task similar to the message-passing stage explained for the row-division transpose algorithm; the total communication is again (k−1) messages per processor, each of length MN/k^2.
The final stage, I_k ⊗ P(MN/k, M/k), is a simple-stride permutation stage with stride M/k local to each processor. All processors carry out the same operation in parallel without communication.

MESH-DIVISION
For the mesh-division partition, we have Q_1 = Q_2 = P_M (equation (3.43)). According to Definition 3.3 (equation (3.45)), we can obtain the expression for Ḡ by decomposing G = P(MN, M); the actual-operational matrix in equation (3.33) for the mesh-division partition can be expressed as two stages in two different ways (equations (3.47) and (3.48)). In case (a), the first stage, P(k, k_1) ⊗ I_{MN/k}, is a global task involving message-passing, since there is no prior identity matrix; in fact, it is a single message-passing stage in which each processor forwards its data set as one message. The second stage, I_k ⊗ P(MN/k, M/k_1), represents a local simple-stride permutation executed by each processor, because of the prior identity matrix I_k. In fact, if we consider the data at each processor to be a matrix of size M/k_1 x N/k_2, then the action performed in this stage amounts to k local matrix transposes carried out simultaneously on the k processors.

An Example
As an example of applying our definition of data partitioning and migration, this section presents a typical two-dimensional (2D) fast Fourier transform (FFT) algorithm in a distributed memory system using our new formalism [48].
The summation form of the 2D-DFT of a matrix X of size M x N is

Y(u, v) = Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} X(m, n) exp(−j2π(um/M + vn/N)), (3.49)

while the tensor product representation of equation (3.49) can be written as

y = (F_M ⊗ F_N) x = G x, (3.50)

where F_J is a J x J matrix whose entry in row i and column k equals exp(−j2πik/J), and G is the operational matrix.
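Equation (3.50) can be verified numerically. The sketch below uses row-major vectorization of X, under which (F_M ⊗ F_N) vec(X) matches the 2D FFT; with a column-major vec, the two factors would appear in the opposite order:

import numpy as np

M, N = 4, 8
F = lambda J: np.exp(-2j * np.pi * np.outer(np.arange(J), np.arange(J)) / J)
X = np.random.default_rng(2).standard_normal((M, N))
lhs = np.kron(F(M), F(N)) @ X.flatten()      # row-major vec(X)
rhs = np.fft.fft2(X).flatten()
assert np.allclose(lhs, rhs)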
To compute equation (3.50) on a k-processor machine, the operational matrix is parallelized into stages (equation (3.51)). If it is required that the Fourier-transformed data be in the same data-partition scheme as the original data, then the input data is x̄ = P_R x and the output data is ȳ = P_R y. In such a case, equation (3.51) can be rewritten as

ȳ = P_R (F_M ⊗ F_N) P_R^{-1} x̄. (3.52)

Note that if we had used the second parallelization (b), we would have obtained the analogous equation (3.53). In the following, we show how to utilize our new definitions of data-partition and migration to maximize the parallelism and minimize the communication cost while computing equation (3.52).
Consider row-division partitioning and start with the first-stage operator (the rightmost factor) of equation (3.52).

Related Work
Data organization is the key to successful parallelization of data parallel programs.
As indicated in the introduction, there are two tracks of effort in data-partition and migration for distributed memory multiprocessors: automatic data-partitioning for general loop constructs as part of a compiler, and optimal partitioning for a specific algorithm. In this section, we briefly summarize the existing works in this field as they relate to the work presented in this dissertation. For a more comprehensive review of previous work in data-partitioning and redistribution, readers are referred to [17,16,19].
Ramanujam and Sadayappan [16] studied compile-time techniques for data-partitioning in distributed memory systems. They presented an analysis of communication-free partitions with a nice geometric demonstration. The research by Li and Chen [20] focused on minimizing data movement among processors due to cross-references of multiple distributed arrays (alignment of multiple data structures); they also presented a method of automatically generating efficient message-passing routines in parallel programs [20]. Gupta and Banerjee introduced the notion of constraints on data-partitioning to obtain good performance.
In [23], a compiler algorithm was described that automatically finds optimal parallelism and optimal locality in general loop nestings. All these studies aimed at optimizing data-partition and data alignment as part of a compiler. It is known that such optimization problems are NP-complete, and a number of heuristics have been proposed [20,21,22,16,49,15].
The use of tensor product notation to describe parallel algorithms has a long history, beginning with Pease [50]. Johnson et al. [33] presented a comprehensive discussion on how to use tensor notation to design, modify, and implement FFT algorithms on various computer architectures. Attempts to derive variants of FFT algorithms with the underlying architecture in mind have proven successful [24,13].
Huang, Johnson, and Johnson [38] have recently used tensor notation to formulate Strassen's matrix multiplication algorithm. Using the tensor representation, they derived three variant programs and compared their performance characteristics on shared-memory multiprocessors.

Kaushik, Huang, Johnson, and Sadayappan have proposed a very nice approach for data redistribution in distributed memory systems, which appeared recently in [19]. While their approach also utilizes tensor notation as a tool, our work differs in several respects. First of all, our definitions are expressed in matrix form while theirs are in terms of indices (tensor bases). With their model one can estimate the communication cost of a computation precisely, while with our formulations one can easily manipulate the communication structure of a computation to achieve optimal performance. Deriving variants of an algorithm using our definitions is relatively simple because the data communication is easily visible. Secondly, all the distributions presented in [19], such as cyclic, block, and block-cyclic, can be defined using our formulations, as evidenced in Chapter 3, whereas some data-partitions, such as mesh-division, cannot easily be expressed using the notation of [19]. In addition, our representation acts directly on the data vector a(0 : N−1) to achieve the required data-partition and migration scheme, while their representation presents ways to manipulate data indices from one distribution to the other (redistribution). Unlike their representation, ours lets us embed the expressions for data distribution into an algorithm.
As a result, global optimization of an application consisting of several computation modules becomes straightforward: one simply manipulates the algebraic expressions at the interfaces between individual algorithms.

Conclusion
This chapter introduced a unified approach for formalizing data-allocation and -partitioning in parallel machines, and for clear and convenient mathematical expressions of various data-storage schemes. These expressions were successfully used in expressing, deriving, and implementing matrix transpose algorithms. In turn, the expressions derived for the transpose algorithms were used to represent an existing multiprocessor two-dimensional FFT algorithm. A representation of the existing three-dimensional FFT algorithm on Intel's supercomputers is presented in Appendix A.

Introduction
In this chapter we consider the implementation of an application that solves the Euler partial differential equation (PDE) in the two-dimensional case using the wavelet-Galerkin method [26,51,52]. The two most important computation modules in this solution require two different data-partitions for their optimal implementation. Such a situation demands switching the data-partition, which may involve message-passing stages among processors. This chapter demonstrates the benefit of the algebraic expressions defined in Chapter 3 in such a situation.
The first module, Helmholtz, involves two-dimensional filtering with forward and inverse two-dimensional Fourier transform (2D-FFT) techniques. As seen in the previous chapter, an efficient multiprocessor 2D-FFT algorithm exists for row-division or column-division data-partitioning.
The second module computes the Jacobian, which consists of numerous small intra-node matrix multiplications. The Jacobian module requires boundary data from other nodes; but if we neglect for the moment the need for neighboring spatial regions to exchange data, any choice of data-partitioning shows ideal concurrency, with no sequential dependence of one processor's calculation on another's. The departure from ideal speedup in evaluating the Jacobian arises because the elements on the four sides (under mesh-division data-partition), or on two sides (under either row- or column-division data-partition), of any particular processor's region require boundary elements from its geometric neighbors.
Therefore, the Jacobian is optimal under the mesh-division data-partition, since the amount of data to be shared with other processors is smaller than under either the row- or column-division data-partitions.
To exploit the peak performance of these modules individually, switching between row-division and mesh-division data-partitions would be an overhead. This chapter deals with minimizing this overhead by combining the data-partition expressions with the 2D-FFT algorithms.

Section 4.2 presents a brief description of an algorithm to solve a PDE using the wavelet-Galerkin method (see the flowchart shown in Figure 4.2). Readers interested in the details of the wavelet-Galerkin method can refer to [26,52]. Then, Sections 4.3.1 and 4.3.2 derive two variants of the two-dimensional FFT algorithm that start from mesh-division data-partitioning but go through column-division partitioning, to retain the efficiency of message-passing in row- or column-division partitions, and come back to mesh-division data-partitioning to compute the Jacobian. Section 4.4 presents implementation results for these FFT algorithms compared to the FFT algorithm of Section 3.3 for row-division partition. It also presents results on the overall performance of the application, which show up to a 43.61% reduction in time.

Brief Description of Application
Since we are interested here not in the exact computational details but in switching the data-partition between the modules, we present only the required details and concentrate on data structure compatibility. The module Jacobian produces the so-called wavelet-Galerkin operator Θ, which depends on the vorticity function field C, the stream function field W, and the wavelet base matrices Ω^{100} and Ω^{001}. The vorticity and stream function fields are assumed to be periodic wrap-around square matrices, and C(p, q) and W(p, q) are m x m matrices each, for an odd integer m, with entries from C and W centered at the (p, q) element. The (p, q) element of the wavelet-Galerkin operator Θ is then evaluated in terms of these matrices. Optimization in Jacobian is based on observations such as the following: if we know Ω^{100}C(p, q), then Ω^{100}C(p, q+1) = Ω^{100}{C(p, q)δ + S(p, q+1)}, where δ is the matrix that shifts the columns to the left by one and assigns the null column to the last position, and S(p, q+1) is the matrix with null columns except for the last one, which is the last column of C(p, q+1). This reduces the evaluation of Ω^{100}C(p, q+1) to a sequence of shifts and one matrix-vector multiplication. Similarly, W(p, q)Ω^{001} can be reduced to a sequence of matrix-vector multiplications and shifts. The module Helmholtz performs two-dimensional filtering of the vorticity function with the Laplacian matrix, Δ^{-1}, via discrete Fourier transform methods.
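A small Python sketch of this shift-based update (the random operator standing in for Ω, the window size m, and all names below are our illustrative assumptions):

import numpy as np

m = 5
rng = np.random.default_rng(3)
Omega = rng.standard_normal((m, m))        # stand-in for the wavelet base matrix
C_big = rng.standard_normal((m, m + 1))    # field holding two adjacent windows
Cq, Cq1 = C_big[:, :m], C_big[:, 1:]       # windows C(p, q) and C(p, q+1)

delta = np.eye(m, k=-1)       # shifts columns left by one, nulling the last column
W = Omega @ Cq                # assume Omega @ C(p, q) is already known
W_next = W @ delta            # shift the old product: Omega @ (C(p, q) delta)
W_next[:, -1] = Omega @ Cq1[:, -1]   # one matrix-vector product for the new column
assert np.allclose(W_next, Omega @ Cq1)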

Switching Between Data-Partitions
When an algorithm is designed for row-division on data that is already partitioned in mesh-division, it is necessary to apply the inverse mesh-division data-partitioning operation (see equation (3.28)) followed by the forward column-division data-partitioning operation (see equation (3.24)): P_C P_M^{-1} = P_M^{-1}, because P_C is an identity matrix. Similarly, when the output data of an algorithm is in column-division but the required partition is mesh-division, the inverse column-division data-partitioning operation (see equation (3.27)) should be followed by the forward mesh-division data-partitioning operation: P_M P_C^{-1} = P_M, because P_C^{-1} is an identity matrix. Assume that the data is a two-dimensional array of size M x N, and that its mesh-division partition is performed on a k_s x k_s grid, where k_s^2 = k.

Algorithm-1
Starting from the equation analogous to equation (3.53) for computing the 2D-FFT with mesh-division data-partitioning (substitute P_M in place of P_R in equation (3.53)), we derive the complexity of P_M^{-1} as follows:

P_M^{-1} = [I_{k_s} ⊗ P(N, N/k_s) ⊗ I_{M/k_s}] ... = Z_2 Z_1.

Hence the complexity of P_M^{-1} involves two stages. The first stage, Z_1, involves (k_s − 1) communications per processor. This is nothing but a transpose algorithm among the k_s processors that belong to one column of the processor grid; similar transpose algorithms are performed simultaneously in the k_s columns of processors, each column consisting of k_s processors. The second stage, Z_2, is a local vector-stride data-shuffling.
From the above discussion of P_M^{-1}, it is easy to see that P_M for mesh-division has the same complexity (use (AB)^{-1} = B^{-1}A^{-1} and the theorems of Section 2.5).

Algorithm-2
This method differs from algorithm-1 in the way we restructure the output. We therefore present a different decomposition of the last stage, [P_M P(MN, N)], of algorithm-1; first we derive a variant of P(MN, N).

[Table 4.5: Two-dimensional double-precision complex FFT implementation results for (1) the iPSC/860 library code and (2) the interface routines appended at input and output.]

Message-passing procedures for communications involving more than two nodes are no different from one-to-one communications, but depend upon the communication system present in the machine. The results shown are averaged over the entire computational procedure, and up to a 43.61% reduction in time is achieved. Hence, we conclude that the choice of data-partition and efficient manipulation of our algebraic definitions for data-partition indeed help to improve the overall performance of an application.

This distinctive feature of the new algorithm effectively utilizes an important architectural property of most of today's distributed memory multiprocessors: wormhole routing for interprocessor communications. By using the algebra of stride permutations and tensor products as a mathematical tool, we are able to derive and formulate an efficient data-partition and communication scheme that reduces the communication cost from the O(k) required by the best known FFT to O(√k) on a k-processor machine. The data-partition considered here (mesh) is natural and efficient for solving discretized boundary value problems such as the partial differential equations and finite element analyses discussed in Chapter 4. To evaluate the actual performance of our new algorithm in comparison with other existing parallel FFT algorithms, we carried out implementation experiments on Intel's Touchstone Delta. Experimental results show that our algorithm is highly efficient and runs up to 6 times faster than the existing algorithm on a 128- or 256-node machine for complex data sizes ranging from 16K to 1M points. What is more interesting, the finer the parallelism, the better the new algorithm performs relative to the existing one [35] presented in Section 3.3.

Effect of Varying Data Structures on Overall Performance: Results and Conclusion
Consider a distributed memory multiprocessor environment where communication is done through message-passing. Traditionally, research in this field has concentrated on localizing communications so that data messages are passed only between neighboring processors, i.e., processors that are directly connected by a physical link.
The reason for such efforts is that most distributed memory multiprocessors are not fully connected; they use topologies such as hypercubes, meshes, or rings. Each processor is connected only to a few neighboring processors, and communication between non-neighboring processors has to go through one or several intermediate nodes. It was believed that an algorithm allowing communication only among neighboring processors would minimize communication cost. However, in most of today's message-passing multiprocessors, wormhole routing techniques are used to route messages among processors [12,53]. The pipelining nature of wormhole routing makes the network latency insensitive to path length; in other words, communication latency is virtually independent of the physical distance between two communicating processors. Therefore, neighboring communication is no longer the key factor in algorithm design, and the traditional way of designing parallel algorithms may no longer give optimal performance. Our objective here is to show that much greater performance gains are possible by exploiting new architectural features such as wormhole routing in multiprocessor systems.
Using the algebra of stride permutations and tensor products as mathematical tools, we have seen that one can easily manipulate the communication structure of an algorithm, derive a structure best tailored to the underlying architecture, and develop an optimal communication structure. Such an association between tensor notation and algorithm design provides a new way of understanding and developing efficient parallel algorithms. Our new parallel FFT algorithm, which arises from such mathematical manipulation, minimizes the total number of messages that have to be passed among processors in the course of the FFT computation. This reduction in the total number of messages comes at the expense of an increase in the length of each message. Since, owing to pipelining, the network setup time in wormhole routing plays a more significant role in latency than the message length, our algorithm shows significantly better performance than existing parallel algorithms [35].
Using equation (1.1) and the explanation in Section 3.3, one can estimate the total communication cost of the existing algorithm [35] on a k-node machine: Fourier transforming two-dimensional data of size kN_1 x kN_2 involves the inter-processor communication cost given by equation (5.66). The chapter is organized as follows. Below, we analyze the performance bottlenecks of the existing FFT algorithm to find where improvements are possible. Section 5.2 presents our new algorithm using tensor notation and proves its validity.
Experiments and performance measurements are presented in Section 5.3. Finally, Section 5.4 concludes the chapter.

New Approach
Having analyzed the communication cost of the existing FFT algorithm for row-division partition, we focus our effort on reducing the cost expressed in equation (5.66) [54]. It is clear that the setup parameter t_s is a major factor contributing to the total cost compared to the per-element parameter t_e. Our main objective is to minimize the role of the network setup time in the communication overhead. The main idea is similar to the classical divide-and-conquer strategy in algorithm design. Suppose that the size of the machine can be represented by two factors, i.e., k = k_1 k_2. With proper data-partitioning and allocation, we reduce the network setup time from O(k_1 k_2) to O(k_1 + k_2) at the cost of a constant-factor increase in the coefficient of t_e (see equation (5.77) in the next section). Let the input data to be transformed be a two-dimensional array of size kN_1 x kN_2. If this computation is to be carried out on a k-processor machine, then each processor is assigned a sub-vector of length kN_1N_2. Such sub-vectors are obtained by tiling the two-dimensional array into k_1 x k_2 blocks of size k_2N_1 x k_1N_2.
These k_1 x k_2 blocks of size k_2N_1 x k_1N_2 are shown with dotted lines in Figure 5.5.
We then allocate each sub-block to a processor. This process is typical of finite element analysis, where the finer the grid, the higher the complexity of the computation. With k_1 x k_2 blocks, we imagine the associated processors being arranged in k_2 columns, each column consisting of k_1 processors. These processors are numbered in anti-lexicographic manner: processors in the first column are numbered from 0 to (k_1 − 1), those in the second column from k_1 to (2k_1 − 1), and so on.
Similar to the Vect operation in Section 2.2, we form a single vector x out of the input matrix by placing column-(i+1) below column-i, 1 ≤ i ≤ (kN_2 − 1) (column-major). Then the shuffled vector x̄, built from sub-vectors, is assigned to processors 0 to (k−1). The tensor product and stride permutation representing all of the above data-partitioning is seen in equation (3.25), and the partitioned vector can be written block-wise as

x̄(0 : J−1), x̄(J : 2J−1), ..., x̄((k_2−1)J : k_2J−1),

where J = k_2 k_1^2 N_1N_2 and x̄(x_1 : x_2) represents the sub-vector of x̄ formed from the elements with subscripts x_1 through x_2. In this block representation, each block corresponds to the operation performed on an entire column of k_1 processors.
Recall that a tensor product with a post identity matrix, I_{k_2N_1}, represents operations on vectors of length k_2N_1, and with the operational matrix being P(k_1^2 N_2, k_1), it represents picking up vectors of length k_2N_1 with stride k_1. The tensor product with the prior identity matrix I_{k_2} in equation (5.67) represents that similar operations are to be performed on each of the k_2 columns of processors. Extension of such a data-partition and allocation scheme to higher-dimensional problems can be done in a similar way, although the geometrical representation becomes more complicated. This is clear from the zoomed part of Figure 5.5, in which we detail the required partition for an implementation with another grid of size k_2 x k_1 at each node: representing the two-dimensional algorithm this way leads to six-dimensional indexing, as opposed to the tensor product formulation presented below. Each block in this grid consists of N_1 x N_2 elements.
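A Python sketch of this tiling and numbering (the sizes are illustrative; the anti-lexicographic order follows the description above):

import numpy as np

k1, k2, N1, N2 = 2, 3, 2, 2
k = k1 * k2
X = np.arange((k*N1) * (k*N2)).reshape(k*N1, k*N2)   # the (kN_1) x (kN_2) data

blocks = {}
for c in range(k2):                 # block-column index
    for r in range(k1):             # block-row index
        p = c * k1 + r              # anti-lexicographic processor number
        blocks[p] = X[r*k2*N1:(r + 1)*k2*N1, c*k1*N2:(c + 1)*k1*N2]

assert blocks[0].shape == (k2*N1, k1*N2)   # each holds k*N1*N2 elements
assert len(blocks) == k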
With the data allocation made above, the computation of the two-dimensional DFT proceeds in the stages explained in the following. Stride permutations and tensor products are aids to clearly visualizing the complexity and the feasibility of execution on parallel machines.
Rearrange I: (5.68) This stage involves simultaneous message-passing among all the processors that belong to the same column. In fact, any rearrangement that is not preceded by I_k results in inter-processor communication. Here, however, the I_{k₂} preceding the above expression indicates that the k₂ columns of processors perform intra-column message-passing in parallel. An actual implementation involves message-passing from a sender node to a distinct receiver node, in three steps listed below (a code sketch follows).
1. The sender node calls a procedure that sends a vector to the receiver node, with parameters (a) the vector's name, (b) its size, (c) the address of the node to which the data should be directed, which is the receiver in this case, and (d) a user-defined message number that must be the same for all the global communications being performed at that time.
2. The receiver node employs a routine for an asynchronous receive that initiates the receipt of a message from a process.
3. The receiver node utilizes a message-wait routine to block further execution of any instructions that depend on data in transit until the transfer is complete.
Hence, this stage of global data shuffling involves a complexity of (k₁ − 1) messages to be passed by each processor, each message being of length k₂N₁N₂.
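A minimal modern analogue of these three steps, written with mpi4py rather than the Intel NX calls used on the iPSC/860 (the buffer size, tag, and node numbers here are illustrative assumptions, not the original library's API):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
me = comm.Get_rank()

MSG_TAG = 100                                  # user-defined message number
sender, receiver = 0, 1

if me == sender:
    data = np.zeros(1024, dtype=np.complex128)     # sub-vector to ship
    comm.Send(data, dest=receiver, tag=MSG_TAG)    # step 1: send the vector
elif me == receiver:
    buf = np.empty(1024, dtype=np.complex128)
    req = comm.Irecv(buf, source=sender, tag=MSG_TAG)  # step 2: asynchronous receive
    req.Wait()                                         # step 3: block until transfer completes
```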
Rearrange II: (5.69), Compute I: (5.70), Rearrange III: (5.71). In Rearrange II, the tensor products densely pack knowledge about the implementation. First of all, the occurrence of the identity matrix I_k on the left indicates parallel computation: the operation at a node is independent of any other node. The actual operations performed in parallel are data shufflings that gather a complete set of columns of the input matrix, in order, at each processor. As a result of this rearrangement, processor-0 contains the first N₂ columns in natural order, processor-1 contains the next N₂ columns in natural order, and so forth.
After N₂ columns are obtained at each processor, Compute I performs N₂ kN₁-point one-dimensional Fourier transforms on those columns, using an efficient one-dimensional transform routine for complex numbers, cfft1d, developed by Kuck & Associates Inc. It is evident from Theorems 2.9 and 2.5 in Section 2.5 that Rearrange III is the inverse operation of Rearrange II. It scatters the transformed vectors of length k₂N₁ back with stride k₁ to prepare for the global communications in the next stage. This step is once again independent of the data at other nodes and can be executed in parallel.
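In serial form, Compute I amounts to independent column transforms; the sketch below uses numpy's FFT in place of the vendor cfft1d routine, with assumed sizes.

```python
import numpy as np

kN1, N2 = 16, 5                    # assumed sizes: kN1-point transforms, N2 columns
cols = np.random.rand(kN1, N2) + 1j * np.random.rand(kN1, N2)

# N2 independent kN1-point one-dimensional FFTs, one down each column.
transformed = np.fft.fft(cols, axis=0)
```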
Rearrange IV: (5.72) This stage is exactly the same as Rearrange I, with a complexity of (k₁ − 1) global communications, each message of length k₂N₁N₂.
Rearrange V: (5.73) With I_k preceding this operation, it is nothing but a local transpose of the matrix of size k₂N₁ × k₁N₂ at each processor. This transpose prepares the data so that the same stages as above can be performed on the second dimension.
The following six stages are identical to the above six stages, except that they are performed on the second dimension. However, for the purpose of validation, a proof of the overall factorization is given below.

Proof
If we consider data x of size kN₁ × kN₂, arranged on a k₁ × k₂ grid, to be Fourier transformed into data y, then

    y = PM(kN₁, kN₂, k₁, k₂) [I_{kN₁} ⊗ F_{kN₂}] ···

Using equations (3.47) and (3.48) to expand the matrix transposes P(k²N₁N₂, kN₂) and P(k²N₁N₂, kN₁) yields the staged factorization given above.

Performance Evaluation and Comparison
In this section, we evaluate the performance of the new approach developed in the previous section and compare it with the existing algorithm described in equation (5.66). One can find a direct correlation between the theoretical estimation of performance in equation (5.77) and the results in Table 5.7. It can be seen that as the machine size increases, the performance advantage of the new approach increases. This is because the complexity of network startups in the existing algorithm is O(k₁k₂), while that of the new approach is O(k₁ + k₂). In general, for an n-dimensional DFT computation on a (k₁k₂···kₙ)-processor machine, the order of network startups for the existing algorithm would be O(∏ᵢ₌₁ⁿ kᵢ), while that of the new approach would be O(∑ᵢ₌₁ⁿ kᵢ). Therefore, the data-partitioning scheme and communication setup described in the new approach are especially useful for problems that combine huge data sizes, large machines, and higher dimensionality.
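The asymptotic gap is easy to tabulate; the snippet below only illustrates the growth of the two startup counts (constant factors and the tₑ term omitted).

```python
import math

# Startup growth: O(prod k_i) for the existing algorithm versus
# O(sum k_i) for the new approach (constants omitted, illustration only).
for ks in [(4, 4), (8, 8), (16, 16), (8, 8, 8)]:
    print(ks, "existing ~", math.prod(ks), "new ~", sum(ks))
```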

Conclusion
In Section 5.2, we presented an approach for computing multidimensional DFTs on distributed memory systems that effectively exploits the fact that today's distributed memory systems use wormhole routing for interprocessor communication. An approach to extending the algorithm to three- or higher-dimensional problems using stride permutation and tensor product matrices has been presented, which facilitates finding an efficient data-partitioning and network setup on distributed memory multiprocessors. The data-partitioning scheme is well suited to boundary value problems in fluid dynamics, finite element analysis, etcetera. Results showed that our algorithm is more than six times as fast as the existing algorithm in certain cases.
Moreover, the higher the parallelism, the better the new algorithm performs. Given that physical limits on memory exist at each processor, our new algorithm is a solution to today's large problems that involve multidimensional Fourier transform computations on massively parallel machines.

Chapter 6 Parallel Matrix Multiplication Algorithm for Rectangular Arrays

Algorithm
This section reviews the broadcast-and-shift matrix multiplication algorithm presented in [25]. Related research can be found in [55,56,57]. Throughout this chapter, we consider computing C = AB on concurrent processors. Let k be the number of processors in a distributed memory system. Assuming k to be the square of an integer, k = ks², we can use square sub-block decomposition (mesh-division) as shown in Figure 6.6. The multiplicand matrices A and B are then distributed piecewise in two dimensions. The resultant matrix C is expected to be distributed in the same fashion for further processing steps. Denote the processor belonging to the ith row and jth column by p(i,j), 0 ≤ i, j ≤ (ks − 1). Similarly, denote the sub-blocks of matrices A, B, and C in processor p(i,j) by A^(i,j), B^(i,j), and C^(i,j), respectively. The multiplication with respect to blocks can be written as

    C^(i,j) = Σ_{l=0}^{ks−1} A^(i,l₁) B^(l₁,j),    (6.80)

where l₁ = (i + l) (mod ks) and ks is the number of processors in a row. Equation (6.80) represents the operations to be performed at processor p(i,j). These operations are divided into ks stages, one stage for each l, 0 ≤ l ≤ (ks − 1), in the summation. Consider dividing each stage into two tasks: (a) the message-passings needed to obtain A^(i,l₁) and B^(l₁,j) at processor p(i,j), and (b) the computation of the product A^(i,l₁) B^(l₁,j) at processor p(i,j) and accumulation of the result into C^(i,j). To compute equation (6.80) at p(i,j), all ks processors belonging to the same row (same index i) must obtain A^(i,l₁) for all values of l₁, where l₁ = (i + l) (mod ks) and 0 ≤ l ≤ (ks − 1). This is done by broadcasting from processor p(i,l₁) to all the processors in the same row at each stage.
To obtain B^(l₁,j) at processor p(i,j) for each l, the sub-blocks of B are shifted up after each stage, as shown in Figure 6.6. Once sub-blocks A^(i,l₁) and B^(l₁,j) are obtained at processor p(i,j), they are multiplied and the result accumulated into C^(i,j). Note that no movement of the data of the resulting matrix C is necessary; all the message-passings are within the multiplicands A and B. The number of messages passed in each row broadcast among ks processors is clearly (ks − 1), and each shift passes one message. Therefore, the total number of communications is [ks(ks − 1) + (ks − 1)] = (k − 1). However, the sizes of the messages vary with the sizes of matrices A and B. If the multiplicands A and B are of sizes N₁ × N₂ and N₂ × N₃, respectively, then the total communication cost can be written as:

    ks(ks − 1)(N₁N₂/k) + (ks − 1)(N₂N₃/k).    (6.81)
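A serial simulation of the scheme is given below (a Fox-style broadcast-and-shift over a logical ks × ks grid; the grid and block sizes are assumed for illustration). The triple loop mirrors the ks stages: at stage l, the diagonal-offset block A^(i,l₁) is "broadcast" along row i and multiplied with the B block that l upward shifts would have brought to each processor.

```python
import numpy as np

ks, nb = 3, 4                      # assumed grid side and block edge
n = ks * nb
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def Ablk(i, j): return A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
def Bblk(i, j): return B[i*nb:(i+1)*nb, j*nb:(j+1)*nb]

C = np.zeros((n, n))
for l in range(ks):                # ks stages
    for i in range(ks):
        l1 = (i + l) % ks          # block of A broadcast along row i at this stage
        for j in range(ks):        # every processor p(i,j) in the row uses it
            C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Ablk(i, l1) @ Bblk(l1, j)

assert np.allclose(C, A @ B)       # equation (6.80) summed over all stages
```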

Two Extremes of Broadcast-and-Shift Algorithm
When either row-division (a set of complete rows allocated to each processor) or column-division (a set of complete columns allocated to each processor) data-allocation is considered for matrices A and B, either the broadcasting of A or the shifting of B can be eliminated. In the case of row-division partitioning, matrix-vector multiplication is efficient, while in the case of column-division structure, vector-matrix multiplication is efficient. This can be seen clearly in the following evaluation of communication costs for these extreme cases. On a k-processor machine with matrices A and B of sizes N₁ × N₂ and N₂ × N₃, respectively, the broadcast-and-shift algorithm for these decompositions maps into k stages of shifts for row-division (see Figure 6.7(a)) or k stages of broadcasts for column-division (see Figure 6.7(b)), unlike the ks stages of broadcasts and shifts for mesh-division. The communication cost for the row-division decomposition then results from the shifts in B (6.82), which is of order O(k), while that for the column-division decomposition derives from the broadcasts of A (6.83), which is of order O(k²). However, when column-division partitioning is adopted for A and row-division decomposition is used for B, both the broadcast and shift communications are traded for communications of partial results of C that need to be accumulated. This case is studied in the next section and compared to the mesh-division broadcast-and-shift algorithm.

Consider two multiplicand matrices A and B of sizes N₁ × N₂ and N₂ × N₃, respectively, and C = AB. Let matrix A be decomposed using column-division, while matrices B and C are decomposed with row-division. It is acknowledged that the data-partition considered here is not uniform across the matrices A, B, and C.
Embedding this algorithm into any larger computation would require the message-passing overhead necessary to switch between data-partitions. Associating parts of each matrix to each node on a k-processor machine results in processor-i containing A^(i) of size N₁ × N₂/k, B^(i) of size N₂/k × N₃, and C^(i) of size N₁/k × N₃. Initially, it would seem that when each node contains parts of matrices A and B of sizes N₁ × N₂/k and N₂/k × N₃, the resulting matrix obtained by their multiplication at each node would be of size N₁ × N₃, which is as large as the entire resulting matrix that is supposed to reside across all k processors. However, using the matrix transpose technique and further dividing the problem at each node, we can decompose the algorithm into k successive compute, communicate, and accumulate stages, which require storage of size only N₁/k × N₃. The total number of communications is exactly the same as that required by the broadcast-and-shift matrix multiplication algorithm, but the message length varies as the sizes of matrices A and B depart from being square. Moreover, the broadcast-and-shift algorithm's restriction that the number of processors be a perfect square no longer applies to this new approach. Figure 6.8 demonstrates the new approach for matrix multiplication using column-division for the left multiplicand A and row-division for the right multiplicand B on a 4-processor machine. On a k-processor machine, suppose that the processors are numbered p(0), ..., p(k−1). Denote the part of matrix A allocated to processor i as A^(i), and similarly the part of matrix B allocated to processor i as B^(i).
The resulting matrix C can then simply be expressed as

    C = Σ_{i=0}^{k−1} A^(i) B^(i).

If we break up the computations into k stages, then we can form k communication stages, each followed by a computation stage. This is accomplished by breaking up matrix A^(j) at the jth processor into k sub-blocks A^(i,j), 0 ≤ i ≤ (k − 1). Then

    C^(i) = Σ_{j=0}^{k−1} A^(i,j) B^(j).
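A serial sketch of this decomposition (assumed sizes, numpy in place of the distributed kernels): splitting A by columns and B by rows makes C the sum of k block outer products, one contributed per processor.

```python
import numpy as np

k, N1, N2, N3 = 4, 6, 8, 12        # assumed processor count and matrix sizes
A = np.random.rand(N1, N2)
B = np.random.rand(N2, N3)

w = N2 // k                        # width of each column block of A / row block of B
C = sum(A[:, i*w:(i+1)*w] @ B[i*w:(i+1)*w, :] for i in range(k))
assert np.allclose(C, A @ B)       # C = sum_i A^(i) B^(i)
```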

Performance Evaluation
We saw in Section 6.2 the communication overhead of the broadcast-and-shift algorithm on k processors, given in equation (6.81). Deriving the message-passing overhead inherent in the new approach in an analogous fashion, implementation on k processors of the multiplication of matrices of sizes N₁ × N₂ and N₂ × N₃ involves a communication cost that can be expressed as:

    (k − 1)(N₁N₃/k).    (6.86)

Comparing the two algorithms, the new approach performs better than the broadcast-and-shift algorithm if t_mesh > t_new, i.e.,

    ks(ks − 1)(N₁N₂/k) + (ks − 1)(N₂N₃/k) > (k − 1)(N₁N₃/k).    (6.87)

Note that both algorithms have communication complexity of order O(k) on a k-processor machine. Hence, the only difference arises from the sizes of the multiplicands and the resulting matrix. The above inequality says that if N₂ is larger than N₃, then messages in the new approach will be shorter than the communications in the broadcast-and-shift algorithm. Tables 6.9 and 6.10 present results of actual implementations demonstrating the validity of inequality (6.87). Moreover, unlike the broadcast-and-shift algorithm, which is optimal for a square number of processors [25], the performance of the new approach changes uniformly with increasing number of processors. This is demonstrated by our results in Table 6.5.
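The crossover in inequality (6.87) can be probed numerically; the cost functions below encode our reading of equations (6.81) and (6.86) (message volumes only, startup terms omitted), so the exact coefficients should be checked against the original derivation.

```python
def t_mesh(ks, N1, N2, N3):
    # Broadcast-and-shift volume per equation (6.81), with k = ks*ks processors.
    k = ks * ks
    return ks * (ks - 1) * (N1 * N2 / k) + (ks - 1) * (N2 * N3 / k)

def t_new(k, N1, N2, N3):
    # New-approach volume per equation (6.86).
    return (k - 1) * (N1 * N3 / k)

ks = 4
for N1, N2, N3 in [(512, 2048, 256), (512, 256, 2048)]:
    better = t_mesh(ks, N1, N2, N3) > t_new(ks * ks, N1, N2, N3)
    print(N1, N2, N3, "new approach wins:", better)   # wins when N2 > N3
```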

Conclusion
It is observed that varying the data-partitioning yields significant improvements in the performance of matrix multiplication algorithms for rectangular arrays. A careful analysis of an existing multiplication algorithm for distributed systems, seeking minimum communication, resulted in an efficient new approach. Also, a quest to overcome the memory-requirement hurdles faced while developing this algorithm resulted in efficient utilization of the block-transpose algorithm seen in Chapter 3.

Conclusions and Future Research
It is well known that data-distribution in distributed memory multiprocessors is essential to achieving high performance of data parallel algorithms. The central feature of most implementations of these algorithms is the manner in which the expensive interprocessor communication is minimized.
We defined a set of expressions for partitioning data in multiprocessor environments using tensor products and stride permutations. Unlike existing data-partitioning schemes, this representation can form a part of any algorithm that can be represented using tensor products and stride permutations. Hence, using the well-established theorems of tensor algebra, one can easily manipulate the algorithm to clearly visualize the data migration stages.
The expressions defined for data-partitions have been used to demonstrate the representation of existing matrix transpose and two-dimensional FFT algorithms. For a practical application in which switching data-partitions was needed, manipulating the algorithms and deriving the interfaces among them proved successful using our definitions for the data-partition process. This is seen in computing vorticity and stream functions via a partial differential equation solver that used the wavelet-Galerkin method. Then, a variant routing scheme for two- and three-dimensional FFTs was derived that is applicable to today's large computers solving problems with huge data sizes. Finally, a data-allocation scheme was presented for matrix multiplication via transpose algorithms on distributed systems. Such a distribution is seen to be efficient for rectangular matrices.
Tensor products and stride permutations have been extremely useful to transform an algebraic expression that is derived on paper into implementations on parallel machines. Algorithms derived and discussed in this dissertation were implemented on Intel's Paragon, Touchstone Delta, Gamma, and iPSC/860.
An immediate direction for future work based on this research is to introduce notation and develop the theorems required for parallel algorithms that cannot be represented using tensor products and stride permutations alone. Another research direction is to apply these definitions to applications that have several computational modules, each efficient under a distinct data-partition scheme, to further demonstrate the usefulness of our interface definitions.

Appendix A
Tensor Product Representation of 3D-FFT
Considering rows, columns, and depths as the three axes of a three-dimensional array, this appendix describes an algorithm whose domain decomposition strategy partitions along the depth axis, as shown in the figure.
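A serial sketch of this depth (slab) decomposition under assumed sizes: each processor would hold a contiguous set of depth planes and transform the two local axes, after which a global transpose exposes the third axis.

```python
import numpy as np

k, n = 4, 16                        # assumed processor count and cube edge
data = np.random.rand(n, n, n) + 1j * np.random.rand(n, n, n)

slabs = np.array_split(data, k, axis=2)                  # one depth slab per processor
partial = [np.fft.fft2(s, axes=(0, 1)) for s in slabs]   # local 2D transforms per slab
```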