This document is intended as a summary of my experiences with the intent of lessening the learning curve of parallel programing. If any serious errors are found in this document then they can be reported to me. I would like to acknowledge Dr. Eric Tittley who allowed me to use his computers.
My task was to rewrite Dr. Tittley's model, Hydra, to make it run in parallel. Consequently, there is section devoted specifically to my changes to Hydra that should be ignored by laymen. Hydra is a Fortran program written for a UNIX machine. For an annotated list of resources concerning MPI and other pertinent topics, see the Links section.
It is considered a good programming habit to demonstrate a complex idea in a simplistic manner before attempting to implement the idea on the project program. The rewrite of Hydra was modeled after a demonstration program that I wrote, entitled Layers (because of the 3D aspect of Hydra's loops). The program fills a 3d array with dummy integers and has the sum of each layer and its two neighboring layers computed. Each node computes a different layer. Many examples in this document make reference to layers. A Matlab version of layers is provided for those of you that need something to compare the Fortran to (the Matlab version runs on one CPU).
Note: Text in this document that can be run is written in Fixed font.
TopThere are two different ways to run an MPI program. The man file on mpirun says:
"There are two forms of the mpirun command -- one for programs (i.e., SPMD-style applications), and one for application schemas (see appschema(5)). Both forms of mpirun use the following options by default: -c2c -nger -w."
(SPMD stands for Single Processor / Multiple Data.) The man file on appschema says:
"The application schema is an ASCII file containing a description of the programs which constitute an application. It is used by mpirun(1) ..... to start an MPI application. All tokens after the program name will be passed as command line arguments to the new processes. Ordering of the other elements on the command line is not important."
In other words, an application schema is not necessary. If the code for all of the nodes is contained within the same source files then the application can be launched with one line. Layers, for example, can be launched with the following command:
mpirun -O N layers
If an app schema were to be used for this application then the run commands would change as follows:
Command
line:
mpirun -O N app_schema
App
schema:
n0-3 layers
App schemas are only necessary when the code for each node is separated. In this case, the run commands would change to something like this:
Command
line:
mpirun -O N app_schema
App
schema:
n0 layers_master_code
n1-3 layers_slave_code
Though mpirun is used to execute the application, there are several commands which must be executed before and after mpirun. It is highly recommended that these commands be put into a script. An example for Layers would be
RUNME:
#!/bin/csh
source
/etc/proto/LAM
recon -v hostfile
lamboot -v hostfile
mpirun
-O N layers
lamclean -v
lamhalt -v
Hostfile should be a text file with addresses of each of the computers that will make up the LAM. I'm too lazy to provide a full explanation of all of the flags in this example. E-mail me if you want more information about this script.
Note: This section is taken directly from Using MPI: Portable Parallel Programming With the Message-passing Interface as listed in the Other Resources section.
After a few lines of variable definitions, we get to three lines that will probably be found near the beginning of every Fortran MPI program:
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
The call to MPI_INIT is required in every MPI program and must be the first MPI call. It establishes the MPI ''environment." Only one invocation of MPI_INIT can occur in each program execution. Its only argument is an error code. Every Fortran MPI subroutine returns an error code in its last argument, which is either MPI_SUCCESS or an implementation-defined error code.
All MPI communication is associated with a communicator that describes the communication context and an associated group of processes. It is suggested that simple programs use the default communicator, predefined and named MPI_COMM_WORLD, that defines one context and the set of all processes. MPI_COMM_WORLD is one of the items defined in mpif.h.
The call MPI_COMM_SIZE returns (in numprocs ) the number of processes that the user has started for this program. Precisely how the user caused these processes to be started depends on the implementation, but any program can find out the number with this call. The value numprocs is actually the size of the group associated with the default communicator MPI_COMM_WORLD. We think of the processes in any group as being numbered with consecutive integers beginning with 0, called ranks. By calling MPI_COMM_RANK, each process finds out its rank in the group associated with a communicator. Thus, although each process in this program will get the same number in numprocs, each will have a different number for myid.
The line call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr) sends the value of n to all other processes. Note that all processes call MPI_BCAST, both the process sending the data (with rank zero) and all of the other processes in MPI_COMM_WORLD. The MPI_BCAST results in every process (in the group associated with the communicator given in the fifth argument) ending up with a copy of n. The data to be communicated is described by the address (n), the datatype (MPI_INTEGER), and the number of items (1). The process with the original copy is specified by the fourth argument (0 in this case, the master process, which just reads it from the user).
Separating the sending and receiving aspects of MPI_BCAST is done using the following two functions.
|
MPI_Send(address, count, datatype, destination, tag, comm) |
MPI_Recv(address, maxcount, datatype, source, tag, comm, status) |
|---|---|
|
(address, count, datatype) describes count occurrences of items of the form datatype starting at address |
(address, maxcount, datatype) describe the receive buffer as they do in the case of MPI_Send. It is allowable for less than maxcount occurrences of datatype to be received. The arguments tag and comm are as in MPI_Send, with the addition that a wildcard, matching any tag, is allowed. |
|
destination is the rank of the destination in the group associated with the communicator comm |
source is the rank of the source of the message in the group associated with the communicator comm, or a wildcard matching any source |
|
tag is an integer used for message matching |
status holds information about the actual message size, source, and tag |
|
comm identifies a group of processes and a communication context |
|
|
The source, tag, and count of the message actually received can be retrieved from status. |
|
Asynchronous networks are designed for tasks that can only be broken up into different size pieces. Hydra is one such program. Therefore, Hydra and Layers both use an asynchronous design. The pseudocode is as follows.
|
Master Pseudocode |
Slave Pseudocode |
|---|---|
|
Give a job to each of the slaves |
|
|
Receive answers from each of the slaves, and assign each one a new job when they report back. |
Accept jobs from the master. Receive a new job from the master after reporting the answer. The master will tell you when to stop. |
|
Towards the end, if a slave reports in for a new job and there are no jobs left then tell it that it is done. |
|
|
Master Pseudocode |
Master Code |
|---|---|
|
Give a job to each of the slaves |
if ( numsent .lt. (numprocs-1) ) then
call MPI_SEND(buffer, 3*(GRID_SIZE**2),
& MPI_INTEGER, numsent+1, z,
& MPI_COMM_WORLD, error )
numsent = numsent + 1
|
|
Receive answers from each of the slaves, and assign each one a new job when they report back. |
call MPI_RECV(ans, 1, MPI_INTEGER,
& MPI_ANY_SOURCE, MPI_ANY_TAG,
& MPI_COMM_WORLD, status, error )
sender = status(MPI_SOURCE)
ans_tag = status(MPI_TAG)
sums(ans_tag) = ans
call MPI_SEND(buffer, 3*(GRID_SIZE**2),
& MPI_INTEGER, sender, z,
& MPI_COMM_WORLD, error )
|
|
Towards the end, if a slave reports in for a new job and there are no jobs left then tell it that it is done. |
call MPI_SEND( 1.0, 1, MPI_INTEGER, sender,
& 0, MPI_COMM_WORLD, error )
|
Note that the termination code (a zero) is sent in the tag.
|
Slave Pseudocode |
Slave Code |
|---|---|
|
Accept jobs from the master. |
call MPI_RECV( buffer, 3*(GRID_SIZE**2), MPI_INTEGER,
& master, MPI_ANY_TAG, MPI_COMM_WORLD, status,
& error )
|
|
Report the result of the jobs. |
call MPI_SEND( ans, 1, MPI_INTEGER, master,
& status(MPI_TAG), MPI_COMM_WORLD, error )
|
|
Accept new jobs. |
20 continue
...
go to 20
|
|
The master will tell you when to stop. |
if ( status(MPI_TAG) .ne. 0 ) then
|
The advantage of this program structure is that every node is always working. Thus, while one node may do three jobs and another may do twelve, all the nodes have done the same number of computations.
As stated earlier, the parallel version of Hydra is modeled after Layers, with no known deviations from the general design. Therefore, the only thing left to specify is the data that is transmitted between the master and slave nodes. At the beginning of each call to shsph, the a, dn, h, ds, and de arrays are BCAST from the master node to each of the slaves. During each job, the nodes perform computations on elements of these arrays. Though I had time to make sure that my revisions to shsph.F run, I did not have time to make sure that they did not change the output of shsph.F If there is an important array, other than these five, that I failed to handle then it will be show up as numerical errors in the elements of the five aforementioned arrays.
Access dates are provided as a clue to whether or not the content is current.
Message Passing Interface (MPI)
http://www.mpi-forum.org/ (5.29.03) The official MPI website. This site contains standards documents, errata, and archives of the MPI Forum. It is recommended that beginners stay far away from here because of the severe lack of tutorials, FAQs, etc.
Fortran
http://gcc.gnu.org/onlinedocs/g77/ (5.29.03) GNU is one of the few remaining places left to find a Fortran compiler. Therefore, if one is to nominate a website as being the official website for Fortran, then this would have to be it. This specific link is titled "Using and Porting GNU Fortran," a thorough reference guide for the language.
http://www.ictp.trieste.it/~manuals/programming/sun/fortran/f77rm/index.html (5.29.03) As it is titled, this is another good "FORTRAN 77 Language Reference" It only differs from the GNU documentation in the way the pages are layed out. This is provided by Sun.
Network Topology
http://www.webopedia.com/quick_ref/topologies.asp (5.29.03) A typical LAN is described in this document as a mesh. However, notice that a typical master-slave program will behave as a star network, even though the cables are arranged as a mesh.
http://www.numa.uni-linz.ac.at/Staff/haase/Lectures/parvor_e/parvor.html (5.29.03) Title: Parallelization of Numerical Algorithms This appears to be the work of a university student. Therefore, the quality is very good and very comprehensive. Consider this to be a free textbook for beginners.
Gropp, William., Ewing Lusk and Anthony Skjellum. Using MPI: Portable Parallel Programming With the Message-passing Interface. Cambridge, Mass. MIT Press, 1999.
This book is modeled after many introductory computer language books, complete with examples in C, C++, and Fortran. It is available at the UMBC library or at http://www.netlibrary.com/
program layers
* This program fills a 3d array with dummy integers and has the
* sum of each layer and its two neighbors computed. Each node
* computes a different layer.
implicit none
include 'mpif.h'
* finished with includes, begin variables
* declare constants
integer GRID_SIZE
integer master
* set constants
parameter ( GRID_SIZE = 50 )
parameter ( master = 0 )
* declare counters
integer x,y,z, i,j,k
* declare MPI variables
integer numprocs, myid, error
integer sender, ans_tag, ans
integer status(MPI_STATUS_SIZE)
* declare program variables
* the grid should be visualized with z being the height,
* x being the width, and y being the length
integer grid(1:GRID_SIZE, 1:GRID_SIZE, 1:GRID_SIZE)
integer buffer(1:3, 1:GRID_SIZE, 1:GRID_SIZE)
* sums is a array indicating the status of each layer (z value)
* of the grid
integer sums(1:GRID_SIZE)
integer numsent
logical received_all
* declare custom functions
integer wrap
external wrap
* set program variables
do z = 1, GRID_SIZE
do x = 1, GRID_SIZE
do y = 1, GRID_SIZE
grid(z,x,y) = z + x + y
enddo
enddo
enddo
do z = 1, GRID_SIZE
sums(z) = -1
enddo
numsent = 0
received_all = .false.
***** finished with varibles, begin instructions
***** begin MPI initialization instructions
call MPI_Init( error )
call MPI_Comm_size( MPI_COMM_WORLD, numprocs, error )
call MPI_Comm_rank( MPI_COMM_WORLD, myid, error )
***** end MPI init instructions, begin program instructions
print *, "Hello from node ", myid, " of ", numprocs
***** begin instructions for the manager (master node, node 0)
if ( myid .eq. master ) then
do i = 1, 3
do z = i, GRID_SIZE, 3
do j = 1, 3
do x = 1, GRID_SIZE
do y = 1, GRID_SIZE
buffer(j,x,y) = grid(wrap(z-2+j, GRID_SIZE),x,y)
enddo
enddo
enddo
* send three layers of grid to each of the other processes
* tag the transmissions with the middle layer number
if ( numsent .lt. (numprocs-1) ) then
call MPI_SEND(buffer, 3*(GRID_SIZE**2),
& MPI_INTEGER, numsent+1, z,
& MPI_COMM_WORLD, error )
else
call MPI_RECV(ans, 1, MPI_INTEGER,
& MPI_ANY_SOURCE, MPI_ANY_TAG,
& MPI_COMM_WORLD, status, error )
sender = status(MPI_SOURCE)
ans_tag = status(MPI_TAG)
sums(ans_tag) = ans
call MPI_SEND(buffer, 3*(GRID_SIZE**2),
& MPI_INTEGER, sender, z,
& MPI_COMM_WORLD, error )
endif
numsent = numsent + 1
enddo ! loop over z
enddo ! loop over i
if ( numsent .lt. GRID_SIZE ) then
print *, "Error: did not send out all the data"
else
10 continue
call MPI_RECV(ans, 1, MPI_INTEGER,
& MPI_ANY_SOURCE, MPI_ANY_TAG,
& MPI_COMM_WORLD, status, error )
sender = status(MPI_SOURCE)
ans_tag = status(MPI_TAG)
sums(ans_tag) = ans
call MPI_SEND( 1.0, 1, MPI_INTEGER, sender,
& 0, MPI_COMM_WORLD, error )
received_all = .true.
do z = 1, GRID_SIZE
if ( sums(z) .eq. -1 ) then
received_all = .false.
endif
enddo
if ( received_all .eqv. .false. ) then
goto 10
endif
endif
print *, "z: sum"
do z = 1, GRID_SIZE
print *, z, ": ", sums(z)
enddo
***** end manager instructions, begin slave instructions
else
20 continue
call MPI_RECV( buffer, 3*(GRID_SIZE**2), MPI_INTEGER,
& master, MPI_ANY_TAG, MPI_COMM_WORLD, status,
& error )
if ( status(MPI_TAG) .ne. 0 ) then
ans = 0
do z = 1, 3
do x = 1, GRID_SIZE
do y = 1, GRID_SIZE
ans = ans + buffer(z,x,y)
enddo
enddo
enddo
call MPI_SEND( ans, 1, MPI_INTEGER, master,
& status(MPI_TAG), MPI_COMM_WORLD, error )
goto 20
endif
endif
***** end program instructions, begin exit instructions
call MPI_Finalize( error )
stop
end
******************** wraps the indexer ********************
integer function wrap( x , GRID_SIZE )
integer x, GRID_SIZE
if ( x .lt. 1) then
wrap = GRID_SIZE
else if ( x .gt. GRID_SIZE) then
wrap = 1
else
wrap = x
endif
end
grid = [1:50; 1:50; 1:50];
wrap = [50 ...
1 2 3 4 5 ...
6 7 8 9 10 ...
11 12 13 14 15 ...
16 17 18 19 20 ...
21 22 23 24 25 ...
26 27 28 29 30 ...
31 32 33 34 35 ...
36 37 38 39 40 ...
41 42 43 44 45 ...
46 47 48 49 50 ...
1 ];
for i = 1:50
for j = 1:50
for k = 1:50
grid(i,j,k) = i + j + k;
end
end
end
for n = 1:50
ans = 0;
for i = n : n+2
for j = 1:50
for k = 1:50
ans = grid( wrap(i) , j , k ) + ans;
end
end
end
ans
end