How to Change Hydra From a Sequential Program to a Parallel Program
By Alec Colvin

  1. Introduction
  2. How to Start an MPI Application
  3. Basic MPI Commands
  4. The Structure of Layers
  5. Hydra Specific Information
  6. Links
  7. Other Resources
  8. Appendix A: layers.f
  9. Appendix B: layers.m

Introduction

This document is intended as a summary of my experiences with the intent of lessening the learning curve of parallel programing. If any serious errors are found in this document then they can be reported to me. I would like to acknowledge Dr. Eric Tittley who allowed me to use his computers.

My task was to rewrite Dr. Tittley's model, Hydra, to make it run in parallel. Consequently, there is section devoted specifically to my changes to Hydra that should be ignored by laymen. Hydra is a Fortran program written for a UNIX machine. For an annotated list of resources concerning MPI and other pertinent topics, see the Links section.

It is considered a good programming habit to demonstrate a complex idea in a simplistic manner before attempting to implement the idea on the project program. The rewrite of Hydra was modeled after a demonstration program that I wrote, entitled Layers (because of the 3D aspect of Hydra's loops). The program fills a 3d array with dummy integers and has the sum of each layer and its two neighboring layers computed. Each node computes a different layer. Many examples in this document make reference to layers. A Matlab version of layers is provided for those of you that need something to compare the Fortran to (the Matlab version runs on one CPU).

Note: Text in this document that can be run is written in Fixed font.

Top

How to Start an MPI Application

There are two different ways to run an MPI program. The man file on mpirun says:

"There are two forms of the mpirun command -- one for programs (i.e., SPMD-style applications), and one for application schemas (see appschema(5)). Both forms of mpirun use the following options by default: -c2c -nger -w."

(SPMD stands for Single Processor / Multiple Data.) The man file on appschema says:

"The application schema is an ASCII file containing a description of the programs which constitute an application. It is used by mpirun(1) ..... to start an MPI application. All tokens after the program name will be passed as command line arguments to the new processes. Ordering of the other elements on the command line is not important."

In other words, an application schema is not necessary. If the code for all of the nodes is contained within the same source files then the application can be launched with one line. Layers, for example, can be launched with the following command:

mpirun -O N layers

If an app schema were to be used for this application then the run commands would change as follows:

Command line:
mpirun -O N app_schema

App schema:
n0-3 layers

App schemas are only necessary when the code for each node is separated. In this case, the run commands would change to something like this:

Command line:
mpirun -O N app_schema

App schema:
n0 layers_master_code
n1-3 layers_slave_code

Though mpirun is used to execute the application, there are several commands which must be executed before and after mpirun. It is highly recommended that these commands be put into a script. An example for Layers would be


something like this:

RUNME:
#!/bin/csh
source /etc/proto/LAM
recon -v hostfile
lamboot -v hostfile
mpirun -O N layers
lamclean -v
lamhalt -v

Hostfile should be a text file with addresses of each of the computers that will make up the LAM. I'm too lazy to provide a full explanation of all of the flags in this example. E-mail me if you want more information about this script.

Top


Basic MPI Commands

Note: This section is taken directly from Using MPI: Portable Parallel Programming With the Message-passing Interface as listed in the Other Resources section.

After a few lines of variable definitions, we get to three lines that will probably be found near the beginning of every Fortran MPI program:

call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

The call to MPI_INIT is required in every MPI program and must be the first MPI call. It establishes the MPI ''environment." Only one invocation of MPI_INIT can occur in each program execution. Its only argument is an error code. Every Fortran MPI subroutine returns an error code in its last argument, which is either MPI_SUCCESS or an implementation-defined error code.

All MPI communication is associated with a communicator that describes the communication context and an associated group of processes. It is suggested that simple programs use the default communicator, predefined and named MPI_COMM_WORLD, that defines one context and the set of all processes. MPI_COMM_WORLD is one of the items defined in mpif.h.

The call MPI_COMM_SIZE returns (in numprocs ) the number of processes that the user has started for this program. Precisely how the user caused these processes to be started depends on the implementation, but any program can find out the number with this call. The value numprocs is actually the size of the group associated with the default communicator MPI_COMM_WORLD. We think of the processes in any group as being numbered with consecutive integers beginning with 0, called ranks. By calling MPI_COMM_RANK, each process finds out its rank in the group associated with a communicator. Thus, although each process in this program will get the same number in numprocs, each will have a different number for myid.

The line call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr) sends the value of n to all other processes. Note that all processes call MPI_BCAST, both the process sending the data (with rank zero) and all of the other processes in MPI_COMM_WORLD. The MPI_BCAST results in every process (in the group associated with the communicator given in the fifth argument) ending up with a copy of n. The data to be communicated is described by the address (n), the datatype (MPI_INTEGER), and the number of items (1). The process with the original copy is specified by the fourth argument (0 in this case, the master process, which just reads it from the user).

Separating the sending and receiving aspects of MPI_BCAST is done using the following two functions.

MPI_Send(address, count, datatype, destination, tag, comm)

MPI_Recv(address, maxcount, datatype, source, tag, comm, status)

(address, count, datatype) describes count occurrences of items of the form datatype starting at address

(address, maxcount, datatype) describe the receive buffer as they do in the case of MPI_Send. It is allowable for less than maxcount occurrences of datatype to be received. The arguments tag and comm are as in MPI_Send, with the addition that a wildcard, matching any tag, is allowed.

destination is the rank of the destination in the group associated with the communicator comm

source is the rank of the source of the message in the group associated with the communicator comm, or a wildcard matching any source

tag is an integer used for message matching

status holds information about the actual message size, source, and tag

comm identifies a group of processes and a communication context


The source, tag, and count of the message actually received can be retrieved from status.



Top


The Structure of Layers

Asynchronous networks are designed for tasks that can only be broken up into different size pieces. Hydra is one such program. Therefore, Hydra and Layers both use an asynchronous design. The pseudocode is as follows.

Master Pseudocode

Slave Pseudocode

Give a job to each of the slaves


Receive answers from each of the slaves, and assign each one a new job when they report back.

Accept jobs from the master. Receive a new job from the master after reporting the answer. The master will tell you when to stop.

Towards the end, if a slave reports in for a new job and there are no jobs left then tell it that it is done.




Master Pseudocode

Master Code

Give a job to each of the slaves

 if ( numsent .lt. (numprocs-1) ) then
  call MPI_SEND(buffer, 3*(GRID_SIZE**2),
&  MPI_INTEGER, numsent+1, z,
&  MPI_COMM_WORLD, error )
  numsent = numsent + 1
     

Receive answers from each of the slaves, and assign each one a new job when they report back.

 call MPI_RECV(ans, 1, MPI_INTEGER,
& MPI_ANY_SOURCE, MPI_ANY_TAG,
& MPI_COMM_WORLD, status, error )
 sender = status(MPI_SOURCE)
 ans_tag = status(MPI_TAG)
 sums(ans_tag) = ans

 call MPI_SEND(buffer, 3*(GRID_SIZE**2),
& MPI_INTEGER, sender, z,
& MPI_COMM_WORLD, error )
     

Towards the end, if a slave reports in for a new job and there are no jobs left then tell it that it is done.

 call MPI_SEND( 1.0, 1, MPI_INTEGER, sender,
& 0, MPI_COMM_WORLD, error )
     



Note that the termination code (a zero) is sent in the tag.



Slave Pseudocode

Slave Code

Accept jobs from the master.

 call MPI_RECV( buffer, 3*(GRID_SIZE**2), MPI_INTEGER,
& master, MPI_ANY_TAG, MPI_COMM_WORLD, status,
& error )
     

Report the result of the jobs.

 call MPI_SEND( ans, 1, MPI_INTEGER, master,
& status(MPI_TAG), MPI_COMM_WORLD, error )
     

Accept new jobs.

20    continue
...
      go to 20
     

The master will tell you when to stop.

if ( status(MPI_TAG) .ne. 0 ) then
     



The advantage of this program structure is that every node is always working. Thus, while one node may do three jobs and another may do twelve, all the nodes have done the same number of computations.

Top


Hydra Specific Information

As stated earlier, the parallel version of Hydra is modeled after Layers, with no known deviations from the general design. Therefore, the only thing left to specify is the data that is transmitted between the master and slave nodes. At the beginning of each call to shsph, the a, dn, h, ds, and de arrays are BCAST from the master node to each of the slaves. During each job, the nodes perform computations on elements of these arrays. Though I had time to make sure that my revisions to shsph.F run, I did not have time to make sure that they did not change the output of shsph.F If there is an important array, other than these five, that I failed to handle then it will be show up as numerical errors in the elements of the five aforementioned arrays.

Top


Links

Access dates are provided as a clue to whether or not the content is current.

Top


Other Resources



Gropp, William., Ewing Lusk and Anthony Skjellum. Using MPI: Portable Parallel Programming With the Message-passing Interface. Cambridge, Mass. MIT Press, 1999.

This book is modeled after many introductory computer language books, complete with examples in C, C++, and Fortran. It is available at the UMBC library or at http://www.netlibrary.com/

Top


Appendix A: layers.f

      program layers
*     This program fills a 3d array with dummy integers and has the
*     sum of each layer and its two neighbors computed.  Each node
*     computes a different layer.
      implicit none
      include 'mpif.h'

*     finished with includes, begin variables

*     declare constants
      integer GRID_SIZE
      integer master

*     set constants
      parameter ( GRID_SIZE = 50 )
      parameter ( master = 0 )

*     declare counters
      integer x,y,z, i,j,k

*     declare MPI variables
      integer numprocs, myid, error
      integer sender, ans_tag, ans
      integer status(MPI_STATUS_SIZE)

*     declare program variables

*     the grid should be visualized with z being the height,
*     x being the width, and y being the length
      integer grid(1:GRID_SIZE, 1:GRID_SIZE, 1:GRID_SIZE)

      integer buffer(1:3, 1:GRID_SIZE, 1:GRID_SIZE)

*     sums is a array indicating the status of each layer (z value)
*     of the grid
      integer sums(1:GRID_SIZE)

      integer numsent

      logical received_all

*     declare custom functions
      integer wrap
      external wrap

*     set program variables
      do z = 1, GRID_SIZE
       do x = 1, GRID_SIZE
        do y = 1, GRID_SIZE
         grid(z,x,y) = z + x + y
        enddo
       enddo
      enddo

      do z = 1, GRID_SIZE
       sums(z) = -1
      enddo  

      numsent = 0
      received_all = .false.

***** finished with varibles, begin instructions

***** begin MPI initialization instructions
      call MPI_Init( error )
      call MPI_Comm_size( MPI_COMM_WORLD, numprocs, error )
      call MPI_Comm_rank( MPI_COMM_WORLD, myid, error )
***** end MPI init instructions, begin program instructions

      print *, "Hello from node ", myid, " of ", numprocs

***** begin instructions for the manager (master node, node 0)
      if ( myid .eq. master ) then
       do i = 1, 3
        do z = i, GRID_SIZE, 3
         do j = 1, 3
          do x = 1, GRID_SIZE
           do y = 1, GRID_SIZE
            buffer(j,x,y) = grid(wrap(z-2+j, GRID_SIZE),x,y)
           enddo
          enddo
         enddo
*        send three layers of grid to each of the other processes
*        tag the transmissions with the middle layer number

         if ( numsent .lt. (numprocs-1) ) then
          call MPI_SEND(buffer, 3*(GRID_SIZE**2),
     &     MPI_INTEGER, numsent+1, z,
     &     MPI_COMM_WORLD, error )
         else
          call MPI_RECV(ans, 1, MPI_INTEGER,
     &     MPI_ANY_SOURCE, MPI_ANY_TAG,
     &     MPI_COMM_WORLD, status, error )
          sender     = status(MPI_SOURCE)
          ans_tag    = status(MPI_TAG)
          sums(ans_tag) = ans
          call MPI_SEND(buffer, 3*(GRID_SIZE**2),
     &     MPI_INTEGER, sender, z,
     &     MPI_COMM_WORLD, error )
         endif
         numsent = numsent + 1
        enddo ! loop over z
       enddo ! loop over i
       if ( numsent .lt. GRID_SIZE ) then
        print *, "Error:  did not send out all the data"
       else
10      continue
        call MPI_RECV(ans, 1, MPI_INTEGER,
     &   MPI_ANY_SOURCE, MPI_ANY_TAG,
     &   MPI_COMM_WORLD, status, error )
        sender  = status(MPI_SOURCE)
        ans_tag = status(MPI_TAG)
        sums(ans_tag) = ans
        call MPI_SEND( 1.0, 1, MPI_INTEGER, sender,
     &   0, MPI_COMM_WORLD, error )
        received_all = .true.
        do z = 1, GRID_SIZE
         if ( sums(z) .eq. -1 ) then
          received_all = .false.
         endif
        enddo

        if ( received_all .eqv. .false. ) then
         goto 10
        endif
       endif

       print *, "z:  sum"
       do z = 1, GRID_SIZE
        print *, z, ":  ", sums(z)
       enddo

***** end manager instructions, begin slave instructions
      else
20     continue
       call MPI_RECV( buffer, 3*(GRID_SIZE**2), MPI_INTEGER,
     &  master, MPI_ANY_TAG, MPI_COMM_WORLD, status,
     &  error )
       if ( status(MPI_TAG) .ne. 0 ) then
        ans = 0
        do z = 1, 3
         do x = 1, GRID_SIZE
          do y = 1, GRID_SIZE
           ans = ans + buffer(z,x,y)
          enddo
         enddo
        enddo
        call MPI_SEND( ans, 1,  MPI_INTEGER, master,
     &   status(MPI_TAG), MPI_COMM_WORLD, error )
        goto 20
       endif
      endif

***** end program instructions, begin exit instructions
      call MPI_Finalize( error )
      stop
      end

******************** wraps the indexer ********************
      integer function wrap( x , GRID_SIZE )
      integer x, GRID_SIZE

      if ( x .lt. 1) then
       wrap = GRID_SIZE
      else if ( x .gt. GRID_SIZE) then
       wrap = 1
      else
       wrap = x
      endif
     end

Top


Appendix B: layers.m

grid = [1:50; 1:50; 1:50];

wrap = [50  ...
 1   2   3   4   5 ...
 6   7   8   9  10 ...
11  12  13  14  15 ...
16  17  18  19  20 ...
21  22  23  24  25 ...
26  27  28  29  30 ...
31  32  33  34  35 ...
36  37  38  39  40 ...
41  42  43  44  45 ...
46  47  48  49  50 ...
1 ];

for i = 1:50
 for j = 1:50
  for k = 1:50
   grid(i,j,k) = i + j + k;
  end
 end
end

for n = 1:50
 ans = 0;
 for i = n : n+2
  for j = 1:50
   for k = 1:50
    ans = grid( wrap(i) , j , k ) + ans;
   end
  end
 end
 ans
end

Top

HMET

HMET Webmail
Photo Gallery
Wiki
WhIsKI

About Eric
Research
Teaching
HOWTOs
Software
Play

Contact Info


Links:
Astronomy
Weather
Surfer's Paradise

DNS
MPI
Presario_725