.. -*- rst -*- -*- restructuredtext -*-

================================
Stampede2 supercomputer
================================

Intro
=====

.. image:: Stampede2_System.jpg
   :height: 500
   :width: 650
   :scale: 50%
   :alt: Stampede2 System at TACC
   :align: right

**Stampede2** is the flagship supercomputer at The University of Texas at
Austin's Texas Advanced Computing Center (`TACC`__). Stampede2 provides
high-performance computing capabilities to many researchers across the U.S.
TACC has accepted the instructor's educational resource request to support
this course. As a result, you will have access to Stampede2, the 21st
fastest supercomputer in the world as of June 2020, with a maximum
performance of 10,680 TFlop/s; see `https://www.top500.org/`__. The cluster
has 4,200 Dell KNL compute nodes, each with 68 cores, 16 GB of high-speed
MCDRAM, 96 GB of DDR4 memory, and a 200 GB SSD drive.

This section introduces you to Stampede2. In particular, you will learn how
to:

1. log into Stampede2
2. compile OpenMP programs
3. specify the number of threads for OpenMP programs
4. set up batch scripts for parallel jobs
5. run and analyze the timing output from a simple OpenMP program

You may find useful information about Stampede2 in the `Stampede2 User
Guide`__.

__ https://www.tacc.utexas.edu/
__ https://www.top500.org/lists/top500/list/2020/06/?page=1
__ https://portal.tacc.utexas.edu/user-guides/stampede2


1. Logging into Stampede2
=================================

**Create an account**: you will first need to create an XSEDE portal account
here: `https://portal.xsede.org/`__ . Click on "Create Account" and follow
the directions. Make sure you use your UNM email address, not any other
personal address. After you create a username, send me an email (entitled
"471") with your XSEDE User Portal username. Then I, as the PI of the
Stampede2 allocation grant, will request that you be added as a Stampede2
user under the granted allocation.

As `XSEDE`__ users, you will need to access Stampede2 via the `XSEDE SSO
Hub`__ as follows:

1. Install the `Duo app`__ on your cellphone. It has a green icon. At the
   same time, log in to `https://portal.xsede.org/`__ and go to MY XSEDE ->
   Profile. Click on "Enroll in Duo" in the upper right corner of your
   profile page and follow the instructions. In the end you will see "Duo
   Enrollment Successful" in the upper right corner of your profile page.
   We will use the Duo app for `two-factor authentication`__ later in
   step 3. You only need to do this step once; for future logins start
   from step 2.

2. From your local machine (or perhaps a remote machine such as
   linux.unm.edu) use an SSH client to log in to the XSEDE SSO Hub with
   your XSEDE User Portal (XUP) username and password:

   .. code-block:: none

      localhost$ ssh -l XUP_USERNAME login.xsede.org

   where XUP_USERNAME is your XUP username.

3. After entering your password, you'll be prompted to select an
   authentication method. Choose "Duo Push" by entering the corresponding
   number (number 1) on your keyboard. You should receive a notification
   on your cellphone saying "Request waiting: Tap to respond." Tap this,
   then tap the "Approve" prompt on the next screen.

4. You should now be logged into the XSEDE SSO Hub. From here you can open
   a GSISSH session to Stampede2: ``$ gsissh stampede2``. Once logged in,
   the shell will display some useful information about Stampede2:
.. code-block:: none

   [motamed@ssohub ~]$ gsissh stampede2
   ------------------------------------------------------------------------------
                      Welcome to the Stampede2 Supercomputer
      Texas Advanced Computing Center, The University of Texas at Austin
   ------------------------------------------------------------------------------
   Welcome to Stampede2, *please* read these important system notes:
   --> Stampede2 user documentation is available at:
       https://portal.tacc.utexas.edu/user-guides/stampede2

   --------------------- Project balances for user tg845292 ----------------------
   | Name           Avail SUs     Expires |                                      |
   | TG-MTH200004        1420  2021-08-20 |                                      |
   ------------------------ Disk quotas for user tg845292 ------------------------
   | Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
   | /home1              0.0      10.0     0.00           32      200000    0.02 |
   | /work               0.0    1024.0     0.00          168     3000000    0.01 |
   | /scratch            0.0       0.0     0.00          153           0    0.00 |
   -------------------------------------------------------------------------------

This connects you to Stampede2 through one of four login nodes
(login1-login4). Please note that a login node is a resource shared with
many other users; what you do there directly affects them. The login node
serves only as a preparation environment where you can edit and manage
files, compile codes, and submit and track jobs before accessing the
compute nodes. Do not run your applications there (Matlab, R, ``./a.out``,
etc.).

__ https://portal.xsede.org/
__ https://portal.xsede.org/
__ https://portal.xsede.org/single-sign-on-hub
__ https://duo.com/
__ https://portal.xsede.org/
__ https://portal.xsede.org/mfa


Stampede2 file system
=====================================

Stampede2 has three parallel file systems:

* $HOME (permanent, quota'd: 10 GB and 200,000 files, backed up)
* $WORK (permanent, quota'd: 1 TB and 3,000,000 files, not backed up)
* $SCRATCH (high-speed, purged storage)

These three file systems are associated with your account. Your $HOME
directory is small and should mainly be used to store source codes and for
compiling. The $WORK allocation is larger and should be used for most of
your computations in homework 6 and homework 7. For this lab session we
don't really write anything to file, so we can stay in $HOME. The $SCRATCH
file system is very large but is periodically purged, so it should only be
used for very big computations; you will not need it in this course. The
``cdw`` and ``cds`` aliases are provided as a convenience to change to your
$WORK and $SCRATCH directories, respectively.


Cloning our repository
=======================================

Once you are logged in (you will be in your $HOME directory), clone our
repository:

.. code-block:: none

   $ git clone https://username@bitbucket.org/motamed/hpsc2020.git

and cd into ``hpsc2020/Stampede2``. In that directory there should be a
"Makefile", an OpenMP Fortran program "matrixmul.f90", and a Slurm script
"ompbatch_8.job".


Compiling, executing, and timing
=======================================================

The makefile contains something like:

.. code-block:: makefile

   FC = ifort            # on Stampede2
   F90FLAGS = -qopenmp   # on Stampede2
   FL = gfortran         # on your local machine
   FFLAGS = -fopenmp     # on your local machine
   EX = ./matrixmul.x
   OBJECTS = matrixmul.o

As you can see, we will use the Intel Fortran compiler ``ifort``. The flag
``-qopenmp`` gives you access to the OpenMP module and also instructs the
compiler to take the OpenMP directives (the lines starting with ``!$OMP``)
into account.
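As a quick sanity check that the OpenMP flag is doing what we expect, you
could compile a minimal program along the lines of the sketch below (the
file name ``hello_omp.f90`` is only an illustration; it is not part of the
repository). When the OpenMP flag is present, ``!$OMP`` directives are
honored and lines starting with the conditional sentinel ``!$`` (such as
the ``!$ call omp_set_num_threads(l)`` line you will meet in
``matrixmul.f90``) are compiled; otherwise they are treated as ordinary
comments.

.. code-block:: fortran

   ! hello_omp.f90 -- minimal OpenMP check (illustrative example, not in the repo)
   program hello_omp
     use omp_lib
     implicit none
     integer :: myid

     ! Each thread executes the body of the parallel region and reports
     ! its own id; myid must be private so the threads do not overwrite it.
     !$OMP PARALLEL PRIVATE(myid)
     myid = omp_get_thread_num()
     write(*,*) "Hello from thread ", myid, " of ", omp_get_num_threads()
     !$OMP END PARALLEL

   end program hello_omp

Compile it with ``ifort -qopenmp`` on Stampede2 (or ``gfortran -fopenmp``
on your local machine), and run it on your local machine or inside a short
batch job rather than on a login node.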
Refer to the `OpenMP section`__ for information on OpenMP directives. The
program ``matrixmul.f90`` is a very simple example of how OpenMP can be
used for parallel computing.

.. code-block:: fortran

   program matrixmul

     use omp_lib
     implicit none
     integer, parameter :: nmax = 800
     integer :: i,j,k,l,myid
     real(kind = 8), dimension(nmax,nmax) :: A,B,C
     real(kind = 8) :: d,time1,time2,time1a,time2a

     A = 1.d0
     B = 2.d0

__ http://math.unm.edu/~motamed/Teaching/Fall20/HPSC/openmp.html

This code computes and times the matrix product :math:`C = AB` using do
loops. The program uses 1 to 8 threads, and the number of threads is set by
the call to ``omp_set_num_threads()``.

.. code-block:: fortran

     do l = 1,8
        !$ call omp_set_num_threads(l)
        call cpu_time(time1)
        time1a = omp_get_wtime()

We use two timers, ``cpu_time()`` and ``omp_get_wtime()``: the first
measures CPU time and the second wall-clock time.

Our approach to computing :math:`C = AB` is straightforward: we simply use
the OpenMP directive ``!$OMP PARALLEL DO``, which tells the compiler to
execute the next do loop in parallel. By default, the variables inside the
do loop are assumed to be shared, but we can make them private to each
thread with the ``PRIVATE()`` clause. The loop counter ``j`` is in fact
private by default, but to make that explicit we put it in the
``PRIVATE()`` clause as well. What could happen if ``i,k,d`` were shared?
(A small experiment illustrating this is sketched at the end of this
section.)

.. code-block:: fortran

        !$OMP PARALLEL DO PRIVATE(i,j,k,d)
        do j = 1,nmax
           do i = 1,nmax
              d = 0.d0
              do k = 1,nmax
                 d = d + A(i,k)*B(k,j)
              end do
              C(i,j) = d
           end do
        end do

        call cpu_time(time2)
        time2a = omp_get_wtime()
        write(*,*) "With ", l, " threads this takes: ",time2-time1 ,&
             " of cpu_time but only ",time2a-time1a, " wall clock time."
     end do

   end program matrixmul

Start by compiling the program using ``make`` and make sure that the
executable ``matrixmul.x`` has been produced.
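Regarding the question above about shared variables: the sketch below (my
own illustrative example, not part of the repository) shows what can go
wrong when several threads update a single shared accumulator without any
protection. Updates are lost, so the printed total is typically smaller
than expected and can change from run to run.

.. code-block:: fortran

   ! race_demo.f90 -- deliberately broken (illustrative example, not in the repo):
   ! the accumulator s is shared by all threads, so their updates race.
   program race_demo
     implicit none
     integer, parameter :: n = 1000000
     integer :: i
     real(kind = 8) :: s

     s = 0.d0
     !$OMP PARALLEL DO PRIVATE(i)
     do i = 1,n
        s = s + 1.d0        ! unsynchronized read-modify-write of a shared variable
     end do
     write(*,*) "Expected ", n, " but got ", s

   end program race_demo

In ``matrixmul.f90`` the problem is avoided by making ``d`` (together with
``i`` and ``k``) private, so that each thread works with its own copy; if
you really did want a single shared sum as above, the appropriate tool
would be a ``REDUCTION(+:s)`` clause.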
Submitting jobs on Stampede2
===============================

Stampede2's job scheduler is the `Slurm Workload Manager`__; see also this
`Wikipedia link`__. Slurm commands (such as ``sbatch`` and ``squeue``)
enable you to submit, manage, monitor, and control your jobs. We need to
create a Slurm script consisting of several Slurm directives (starting with
``#SBATCH``) followed by several shell commands; together they describe the
particulars of the job we are submitting and the computing resources we are
requesting. See the Slurm script ``ompbatch_8.job`` below. We will discuss
the directives in class, but also try to read `Running Jobs on Stampede2
Compute Nodes`__ and `Accessing Compute Nodes`__ in the Stampede2 user
guide and find out what the lines in the script mean. In particular, pay
close attention to `Table 6`__.

.. code-block:: none

   #!/bin/bash
   #SBATCH -A TG-MTH200004   # account name (project charge number)
   #SBATCH -J mmul_omp_t8    # job name
   #SBATCH -o mmul8_out.%j   # output file
   #SBATCH -e mmul8_err.%j   # error file
   #SBATCH -N 1              # total nodes requested
   #SBATCH -n 1              # total MPI tasks requested
   #SBATCH -p development    # queue name (use development for requested time < 2hrs)
   #SBATCH -t 00:05:00       # total wall clock time requested

   export OMP_NUM_THREADS=8  # total number of OpenMP threads requested (1 per KNL core)

   ./matrixmul.x

The ``%j`` in the output and error filenames gets replaced by your job ID
so that the files don't get overwritten. Now submit your job:

.. code-block:: none

   login4.stampede2(115)$ sbatch ompbatch_8.job

   -----------------------------------------------------------------
             Welcome to the Stampede 2 Supercomputer
   -----------------------------------------------------------------

   No reservation for this job
   --> Verifying valid submit host (login4)...OK
   --> Verifying valid jobname...OK
   --> Enforcing max jobs per user...OK
   --> Verifying availability of your home dir (/home1/05230/tg845292)...OK
   --> Verifying availability of your work dir (/work/05230/tg845292/stampede2)...OK
   --> Verifying availability of your scratch dir (/scratch/05230/tg845292)...OK
   --> Verifying valid ssh keys...OK
   --> Verifying access to desired queue (development)...OK
   --> Verifying job request is within current queue limits...OK
   --> Checking available allocation (TG-MTH200004)...OK
   Submitted batch job 431971

If you want to check the status of your job, use ``squeue -u username``. If
you realize that you did something wrong, ``scancel jobid`` is the command
to use (the job ID can be found with ``squeue -u username``). If you do not
know your username, you can type the most important command of all time:
``whoami``.

Inspect the timing data (in the ``mmul8_out.jobid`` file) and compute the
speedup. Try to plot the speedup versus the number of threads (on your
LOCAL machine, NOT on a login node of Stampede2). Does it behave as you
expect?

**Two more exercises**:

1. Notice that each node on Stampede2 has 68 cores. This would allow us to
   choose more than 8 threads. You may therefore want to change
   ``l = 1,8`` in the loop in ``matrixmul.f90``. If so, also change the
   maximum number of threads in the Slurm script, and recompile before
   submitting your job (one possible way to organize this is sketched
   after this list).

2. Change the program so that the parallel do loop is either the ``i`` or
   the ``k`` loop and again compute the speedup.
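For exercise 1, one possible way to organize the sweep (a sketch under my
own assumptions, not the required solution) is to let the upper bound
follow the ``OMP_NUM_THREADS`` value exported in the Slurm script instead
of hard-coding 8: query it once with ``omp_get_max_threads()`` before any
call to ``omp_set_num_threads()``. The stand-alone program below (the file
name ``sweep_threads.f90`` is hypothetical) shows the idea; in
``matrixmul.f90`` the timing and matrix multiplication would take the place
of the print statement.

.. code-block:: fortran

   ! sweep_threads.f90 -- sketch of a thread sweep driven by OMP_NUM_THREADS
   ! (hypothetical stand-alone example, not the version in the repository)
   program sweep_threads
     use omp_lib
     implicit none
     integer :: l, max_threads

     ! omp_get_max_threads() returns the default team size, which is
     ! initialized from OMP_NUM_THREADS (e.g. "export OMP_NUM_THREADS=68"
     ! in the job script); query it before calling omp_set_num_threads.
     max_threads = omp_get_max_threads()

     do l = 1, max_threads
        call omp_set_num_threads(l)
        !$OMP PARALLEL
        !$OMP SINGLE
        write(*,*) "Running with ", omp_get_num_threads(), " threads"
        !$OMP END SINGLE
        !$OMP END PARALLEL
     end do

   end program sweep_threads

After a one-time change along these lines, exercise 1 mostly amounts to
editing ``OMP_NUM_THREADS`` in ``ompbatch_8.job`` (each KNL node has 68
cores, so values up to 68 are natural) and, if necessary, the ``-t`` time
limit.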
Good citizenship
====================================

You share Stampede2 with thousands of other users, and what you do on the
system affects others. Exercise good citizenship to ensure that your
activity does not adversely impact the system and the research community
with whom you share it. Here are some rules of thumb:

* A login node is a good place to edit and manage files, initiate file
  transfers, compile code, submit new jobs, and track existing jobs.

* Avoid computationally intensive activity on login nodes. This means:

  1. Don't run research applications on the login nodes; this includes
     frameworks like MATLAB and R.
  2. Don't launch too many simultaneous processes. While it's fine to
     compile on a login node, don't execute your code there.

* Your $HOME directory is small and should mainly be used to store source
  codes and for compiling. For this lab session you can stay in $HOME. The
  $WORK allocation is larger and should be used for most of your
  computations in homework 6 and homework 7. You do not need to worry
  about the $SCRATCH file system, as you will not use it.

* When you submit a job to the scheduler, don't ask for more time than you
  really need. The scheduler will have an easier time finding a slot for
  the 2 hours you need than for the 48 hours you request. This means
  shorter queue wait times for you and everybody else.

* Test your submission scripts. Start small: begin with a small problem
  size and make sure everything works on 1 node with 2 cores for OpenMP
  (or 2 nodes for MPI) before you try 16 cores or 200 nodes. Work out
  submission bugs with 5-minute jobs that won't wait long in the queue.

See `Good Citizenship`__ in the user guide for further guidance on good
citizenship.

__ https://www.schedmd.com/
__ https://en.wikipedia.org/wiki/Slurm_Workload_Manager
__ https://portal.tacc.utexas.edu/user-guides/stampede2#running-jobs-on-the-stampede2-compute-nodes
__ https://portal.tacc.utexas.edu/user-guides/stampede2#using-computenodes
__ https://portal.tacc.utexas.edu/user-guides/stampede2#table6
__ https://portal.tacc.utexas.edu/user-guides/stampede2#citizenship