Stampede2 supercomputer

Intro

Stampede2 System at TACC

Stampede2 is the flagship supercomputer at The University of Texas at Austin’s Texas Advanced Computing Center (TACC). Stampede2 provides high-performance computing capabilities to researchers across the U.S. TACC has accepted the instructor’s educational resource request to support this course. As a result, you will have access to Stampede2, the 21st fastest supercomputer in the world as of June 2020, with a Max Performance of 10,680 TFlop/s; see https://www.top500.org/. The cluster has 4,200 Dell KNL compute nodes, each with 68 cores, 16GB of high-speed MCDRAM, 96GB of DDR4 memory, and a 200GB SSD.

This section aims to introduce you to Stampede2. In particular, you will learn how to:

  1. log into Stampede2
  2. compile OpenMP programs
  3. specify the number of threads for OpenMP programs
  4. set up batch scripts for parallel jobs
  5. run and analyze the timing output from a simple OpenMP program

You may find useful information about Stampede2 in the Stampede2 User Guide.

1. Logging into Stampede2

Create an account: you will first need to create an XSEDE portal account here: https://portal.xsede.org/. Click on “Create Account” and follow the directions. Make sure you use your UNM email address, not a personal address. After you create a username, send me an email (entitled “471”) with your XSEDE User Portal username. Then I, as the PI of the Stampede2 allocation grant, will request that you be added as a Stampede2 user under the granted allocation.

As XSEDE users, you will need to access Stampede2 via the XSEDE SSO Hub as follows:

  1. Install the Duo app on your cellphone (it has a green icon). At the same time, log in to https://portal.xsede.org/ and go to MY XSEDE -> Profile. Click on “Enroll in Duo” in the upper right corner of your profile page and follow the instructions. At the end you will see “Duo Enrollment Successful” in the upper right corner of your profile page. We will use the Duo app for two-factor authentication later in step 3. You only need to do this step once; for future logins, start from step 2.

  2. From your local machine (or perhaps a remote machine such as linux.unm.edu) use an SSH client to log in to the XSEDE SSO Hub with your XSEDE User Portal (XUP) username and password:

    localhost$ ssh -l XUP_USERNAME login.xsede.org
    

    where XUP_USERNAME is your XUP username.

  3. After entering your password, you’ll be prompted to select an authentication method. Choose “Duo Push” by entering the corresponding number (1) on your keyboard. You should then receive a notification from the Duo app on your cellphone saying “Request waiting: Tap to respond.” Tap it, then tap “Approve” on the next screen.

  4. You should now be logged into the XSEDE SSO hub. From here you can open a GSISSH session to Stampede2: $ gsissh stampede2. Once logged in, the shell will display some useful information about Stampede2:

    [motamed@ssohub ~]$ gsissh stampede2
    ------------------------------------------------------------------------------
                        Welcome to the Stampede2 Supercomputer
         Texas Advanced Computing Center, The University of Texas at Austin
    ------------------------------------------------------------------------------
    
    Welcome to Stampede2, *please* read these important system notes:
    
    --> Stampede2 user documentation is available at:
    https://portal.tacc.utexas.edu/user-guides/stampede2
    
    --------------------- Project balances for user tg845292 ----------------------
    | Name           Avail SUs     Expires |                                      |
    | TG-MTH200004        1420  2021-08-20 |                                      |
    ------------------------ Disk quotas for user tg845292 ------------------------
    | Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
    | /home1              0.0      10.0     0.00           32      200000    0.02 |
    | /work               0.0    1024.0     0.00          168     3000000    0.01 |
    | /scratch            0.0       0.0     0.00          153           0    0.00 |
    -------------------------------------------------------------------------------
    

This connects you to Stampede2 through one of four login nodes (login1-login4). Please note that a login node is a resource shared with many other users; what you do there directly affects them. The login node serves only as a preparation environment where you can edit and manage files, compile code, and submit and track jobs before accessing the compute nodes. Do not run your applications (such as MATLAB, R, or ./a.out) on a login node.

Stampede2 file system

Stampede2 has three parallel file systems:

  • $HOME (permanent, quota’d: 10GB and 200,000 files, backed-up)
  • $WORK (permanent, quota’d: 1TB and 3,000,000 files, not backed-up)
  • $SCRATCH (high-speed purged storage).

These three file systems are associated with your account. Your $HOME directory is small and should mainly be used to store source codes and for compiling. The $WORK filesystem allocation is larger and should be used for most of your computations in homework 6 and homework 7. For this lab session we do not write anything to file, so we can stay in $HOME. The $SCRATCH filesystem is very large but is periodically purged, so it should only be used for very big computations. You will not need to use the $SCRATCH filesystem in this course.

The “cdw” and “cds” aliases are provided as a convenience to change to your $WORK and $SCRATCH directories, respectively.
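
For example, a quick check might look like this (the $WORK path shown is taken from the sample session above; yours will contain your own username):

$ cdw          # change to your $WORK directory
$ pwd
/work/05230/tg845292/stampede2
$ cd           # return to your $HOME directory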

Cloning our repository

Once you are logged in (you will be in your $HOME directory), clone our repository:

$ git clone https://username@bitbucket.org/motamed/hpsc2020.git

and cd into hpsc2020/Stampede2. In that directory there should be a “Makefile”, an OpenMP Fortran program “matrixmul.f90”, and a Slurm script “ompbatch_8.job”.
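
A quick listing should confirm that the clone worked (output sketched from the file names listed above):

$ cd hpsc2020/Stampede2
$ ls
Makefile  matrixmul.f90  ompbatch_8.job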

Compiling, executing, and timing

The Makefile contains something like:

FC = ifort             #on Stampede2
F90FLAGS = -qopenmp    #on Stampede2
FL = gfortran          #on your local machine
FFLAGS = -fopenmp      #on your local machine
EX = ./matrixmul.x
OBJECTS = matrixmul.o

As you can see, we will use the Intel Fortran compiler ifort. The flag -qopenmp gives you access to the OpenMP module and also instructs the compiler to take the OpenMP directives (the lines starting with !$OMP) into account. Refer to the OpenMP section for information on OpenMP directives.
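
If you ever want to compile by hand rather than through the Makefile, the equivalent commands would look roughly like this (first on Stampede2, then on a local machine with gfortran):

$ ifort -qopenmp matrixmul.f90 -o matrixmul.x       # on Stampede2
$ gfortran -fopenmp matrixmul.f90 -o matrixmul.x    # on your local machine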

The program matrixmul.f90 is a very simple example of how OpenMP can be used for parallel computing.

program matrixmul
  use omp_lib
  implicit none
  integer, parameter :: nmax = 800
  integer :: i,j,k,l,myid
  real(kind = 8), dimension(nmax,nmax) :: A,B,C
  real(kind = 8) :: d,time1,time2,time1a,time2a
  A = 1.d0
  B = 2.d0

This code computes and times the matrix product \(C = AB\) using do loops. The program uses 1 to 8 threads; the number of threads is set by the call to omp_set_num_threads().

do l = 1,8
   !$ call omp_set_num_threads(l)
   call cpu_time(time1)
   time1a = omp_get_wtime()

We use two timers, cpu_time() and omp_get_wtime(); the first measures CPU time and the second measures wall-clock time. Our approach to computing \(C = AB\) is straightforward: we simply use the OpenMP directive !$OMP PARALLEL DO, which tells the compiler to execute the next do loop in parallel. By default, the variables inside the do loop are shared, but we can make them private to each thread with the PRIVATE() clause. The loop counter j is in fact private by default, but to make that explicit we put it in the PRIVATE() clause as well.

What could happen if i,k,d were shared?

     !$OMP PARALLEL DO PRIVATE(i,j,k,d)
     do j = 1,nmax
        do i = 1,nmax
           d = 0.d0
           do k = 1,nmax
              d =  d + A(i,k)*B(k,j)
           end do
           C(i,j) = d
        end do
     end do
     call cpu_time(time2)
     time2a = omp_get_wtime()
     write(*,*) "With ", l, " threads this takes: ",time2-time1 ,&
     " of cpu_time but only ",time2a-time1a, " wall clock time."
  end do
end program matrixmul

Start by compiling the program using make and make sure that the executable matrixmul.x has been produced.
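
Something like the following should do it (the executable name comes from the EX variable in the Makefile):

$ make                 # compile with ifort -qopenmp as specified in the Makefile
$ ls -l matrixmul.x    # the executable should now exist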

Submitting jobs on Stampede2

Stampede2’s job scheduler is the Slurm Workload Manager; you can also check this Wikipedia link. Slurm commands (such as sbatch and squeue) enable you to submit, manage, monitor, and control your jobs. We need to create a Slurm script consisting of several Slurm directives (starting with #SBATCH) followed by the shell commands that describe the particulars of the job we are submitting and the computing resources we are requesting. See the Slurm script ompbatch_8.job below. We will discuss the directives in class, but also try to read Running Jobs on Stampede2 Compute Nodes and Accessing Compute Nodes in the Stampede2 User Guide to find out what the lines in the script mean. In particular, pay close attention to Table 6.

#!/bin/bash
#SBATCH -A TG-MTH200004     # account name (project charge number)
#SBATCH -J mmul_omp_t8      # job name
#SBATCH -o mmul8_out.%j     # output file
#SBATCH -e mmul8_err.%j     # error file
#SBATCH -N 1                # total nodes requested
#SBATCH -n 1                # total MPI tasks requested
#SBATCH -p development      # queue name (use development for requested time < 2hrs)
#SBATCH -t 00:05:00         # total wall clock time requested <hh:mm:ss>

export OMP_NUM_THREADS=8    # total number of OpenMP threads requested (1 per KNL core)
./matrixmul.x

The %j in the output and error filenames is replaced by your job ID so that the files do not get overwritten; for example, the job submitted below (job 431971) will write its output to mmul8_out.431971.

Now submit your job:

login4.stampede2(115)$ sbatch ompbatch_8.job

-----------------------------------------------------------------
Welcome to the Stampede 2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05230/tg845292)...OK
--> Verifying availability of your work dir (/work/05230/tg845292/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05230/tg845292)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-MTH200004)...OK
Submitted batch job 431971

If you want to check the status of your job, use squeue -u username. If you realize that you did something wrong, scancel jobid is the command to use (the jobid can be found in the output of squeue -u username). If you do not know your username, you can type the most important command of all time: whoami.
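
For example (the username and job ID below are the ones from the sample session above; yours will differ):

$ whoami
tg845292
$ squeue -u tg845292     # list your jobs and their status
$ scancel 431971         # cancel job 431971 if something went wrong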

Inspect the timing data (in the mmul8_out.jobid file) and compute the speedup. Try to plot the speedup versus the number of threads (on your LOCAL machine, NOT on a Stampede2 login node). Does it behave as you expect?
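
As a reminder, the speedup with \(l\) threads is the ratio of the single-thread wall-clock time to the \(l\)-thread wall-clock time:

\(S(l) = T_{\mathrm{wall}}(1) / T_{\mathrm{wall}}(l)\)

Ideal (linear) scaling corresponds to \(S(l) = l\).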

Two more exercises:

  1. Notice that each node on Stampede2 has 68 cores. This would allow us to choose more than 8 threads. You may therefore want to change l = 1,8 in the loop in matrixmul.f90. If so, also change the maximum number of threads in the Slurm script, and recompile before submitting your job; see the sketch after this list.
  2. Change the program so that the parallel do loop is either the i or the k loop and again compute the speedup.
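
For exercise 1, the changes might look roughly like this (16 threads is just an example; use your editor of choice, nano is shown here):

$ nano matrixmul.f90       # change "do l = 1,8" to, e.g., "do l = 1,16"
$ nano ompbatch_8.job      # change to "export OMP_NUM_THREADS=16"
$ make                     # recompile
$ sbatch ompbatch_8.job    # resubmit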

Good citizenship

You share Stampede2 with thousands of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb:

  • A login node is a good place to edit and manage files, initiate file transfers, compile code, submit new jobs, and track existing jobs.

  • Avoid computationally intensive activity on login nodes. This means:

    1. Don’t run research applications on the login nodes; this includes frameworks like MATLAB and R.
    2. Don’t launch too many simultaneous processes. While it’s fine to compile on a login node, don’t execute your code on a login node.
  • Your $HOME directory is small and should mainly be used to store source codes and for compiling. For this lab session you can stay in $HOME. The $WORK filesystem allocation is larger and should be used for most of your computations in homework 6 and homework 7. You do not need to worry about the $SCRATCH filesystem, as you will not use it.

  • When you submit a job to the scheduler, don’t ask for more time than you really need. The scheduler will have an easier time finding a slot for the 2 hours you need than for the 48 hours you request. This means shorter queue wait times for you and everybody else.

  • Test your submission scripts. Start small: start with a small size problem and make sure everything works on 1 node with 2 cores for OpenMP (or 2 nodes for MPI) before you try 16 cores or 200 nodes. Work out submission bugs with 5 minute jobs that won’t wait long in the queue.

See Good Citizenship for further guidance on being a good citizen of the system.