================================
OpenMP
================================

Introduction
===================

**OpenMP** (Open Multi-Processing) is an Application Program Interface (API)
that supports Fortran and C/C++. OpenMP is designed for multi-processor (or
multi-core), `shared memory`__ machines. The underlying architecture can be
UMA or NUMA shared memory.

This section gives a quick introduction to OpenMP with Fortran example codes.
For more information refer to `www.openmp.org`__. In particular, you may find
`here`__ and `here`__ useful.

The Fortran codes for the examples in this section can be found in the course
repository in directory "OpenMP".

Basic idea: fork-join model
=============================

OpenMP accomplishes parallelism through the use of threads. A thread of
execution is the smallest unit of processing that can be scheduled by an
operating system. The programmer controls parallelization using the
**fork-join** model of parallel execution:

1) All OpenMP programs begin as a single process, known as the `master
   thread`. The master thread executes until a parallel region construct is
   encountered.

2) The sequential execution branches off (or **forks**) in parallel at
   designated points in the program, and a number of parallel threads execute
   on a number of processors/cores simultaneously. Typically, the number of
   threads matches the number of machine processors/cores. Rule of thumb: one
   thread per processor/core.

3) After the threads execute, they merge (or **join**) at a subsequent point
   and resume sequential execution.

As an example, the figure below (taken from `Wikipedia`__) is a schematic
representation of a parallel code, where three regions of the code (i.e.
parallel tasks) permit parallel execution of a number of threads.

.. figure:: _static/ForkJoin.png
   :width: 600px
   :align: center
   :height: 150px
   :alt: alternate text
   :figclass: align-center

|

Note that threads may need to `synchronize` if there is a `dependency`. The
programmer is responsible for synchronizing inputs and outputs. See below.

General recipe
====================

OpenMP is an explicit (not automatic) programming model, offering you (the
programmer) full control over parallelization. First, given a serial code,
you decide where to fork, where to join, and how to split a parallel region
into a number of tasks. You may also need to synchronize threads.

Next, you convert the serial code into a parallel code in such a way that it
can still run in serial mode. This is done by employing the three components
of OpenMP (see OpenMP components below) and using two directive sentinels,
``!$OMP`` and ``!$``. When a non-OpenMP compiler encounters these sentinels,
it treats the lines as comments; an OpenMP compiler compiles the directives
and statements on those lines. To compile with OpenMP support, you just need
to use ``gfortran`` with the flag ``-fopenmp``.

1. ``!$OMP`` is followed by an OpenMP `directive`. Example:

   ``!$OMP parallel default(shared) private(x,pi)``

2. ``!$`` can be followed by ordinary Fortran statements to enable conditional
   compilation. Example:

   ``!$ print *, "This line will be compiled with -fopenmp flag"``

A minimal example combining both sentinels is given below.
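To make this concrete, here is a minimal sketch of my own (the file name
``sentinels.f90`` and the variable names are made up; it is not one of the
course's OMP files) that uses both sentinels. The same source compiles with
any Fortran compiler; only an OpenMP build activates the ``!$`` line and the
parallel region:

.. code-block:: fortran

   ! sentinels.f90 -- hypothetical file name, not part of the course repository.
   program sentinels
     implicit none
     integer :: nthreads

     nthreads = 1      ! value used when compiled without -fopenmp
     !$ nthreads = 4   ! this line is compiled only with -fopenmp

     print *, "running with", nthreads, "thread(s)"

     ! the directive below is just a comment to a non-OpenMP compiler
     !$omp parallel num_threads(nthreads)
     print *, "hello from a thread"
     !$omp end parallel

   end program sentinels

Compiling the same file both ways shows the effect of the sentinels:

.. code-block:: none

   $ gfortran -fopenmp sentinels.f90 -o sentinels.x   # OpenMP build: 4 "hello" lines
   $ gfortran sentinels.f90 -o sentinels.x            # serial build: 1 "hello" line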
OpenMP components
===============================

OpenMP consists of three components:

* Compiler directives
* Runtime library routines
* Environment variables

We will briefly discuss the first two components.

1. Compiler directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenMP compiler directives are used for various purposes, such as (1)
generating a parallel region; (2) dividing blocks of code among threads; (3)
distributing loop iterations between threads; and (4) synchronizing work
among threads. A parallel region is a block of code that will be executed by
multiple threads; it is the fundamental OpenMP parallel construct.

Directives have the syntax:

``sentinel directive-name [clause1] [clause2] ...``

The sentinel is ``!$OMP``, followed by a white space and the name of a valid
OpenMP directive. Clauses are optional and can appear in any order, but note
that not every clause is accepted by every directive.

Important **directives** include:

* PARALLEL: duplicates the code in the parallel region on all threads, so
  that each thread works independently, executing all instructions in the
  region
* DO: shares the iterations of a loop across the team
* PARALLEL DO: combines PARALLEL and DO
* SECTIONS: breaks work into separate sections, each executed once by a
  thread in the team
* SINGLE: serializes a section of code, so that the section is executed by
  only one thread in the team
* Synchronization directives: sometimes you need to synchronize the work
  among threads. This can be done with the following directives. Note that
  the synchronization directives do not accept clauses.

  1. MASTER
  2. CRITICAL
  3. BARRIER

Important **clauses** include:

* if (a scalar logical expression) `supported by PARALLEL, PARALLEL DO`
* private (list) `supported by PARALLEL, DO, PARALLEL DO, SECTIONS, SINGLE`
* shared (list) `supported by PARALLEL, DO, PARALLEL DO`
* default (private | shared | firstprivate) `supported by PARALLEL, PARALLEL DO`
* firstprivate (list) `supported by PARALLEL, DO, PARALLEL DO, SECTIONS, SINGLE`
* reduction (operator : list) `supported by PARALLEL, DO, PARALLEL DO, SECTIONS`
* schedule (type, chunk) `supported by DO, PARALLEL DO`

2. Runtime library routines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Library routines are used for various purposes, such as (1) setting and
querying the number of threads; (2) querying a thread's unique identifier
(thread ID); and (3) querying wall-clock time. You need to include
``!$ use omp_lib`` in your code to have access to these routines.

Important **library routines** include:

* omp_set_num_threads

  This sets the number of threads that will be used in the next parallel
  region. This routine can only be called from the serial portions of the
  code.

* omp_get_num_threads

  This returns the number of threads that are currently in the team executing
  the parallel region from which it is called.

* omp_get_thread_num

  This returns the thread number of the calling thread within the team.

* omp_get_wtime

  This returns the elapsed time (in seconds), as a double-precision floating
  point value, since some point in the past. To obtain the elapsed time for a
  block of code, the function is used in pairs, with the value of the first
  call subtracted from the value of the second call (see the short sketch
  after this list).

Among the above, the first one is a subroutine, and the other three are
functions. Note that they all need to come after ``!$``.
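As a quick illustration of the ``omp_get_wtime`` pairing, here is a small
sketch of my own (the file name and the loop being timed are made up; it is
not one of the repository files). Note how every OpenMP library call comes
after the ``!$`` sentinel, so the program still compiles without
``-fopenmp``:

.. code-block:: fortran

   ! timing_sketch.f90 -- hypothetical file name, not part of the course repository.
   program timing_sketch
     !$ use omp_lib
     implicit none
     integer, parameter :: n = 50000000
     integer :: i
     double precision, allocatable :: a(:)
     double precision :: t0, t1

     allocate(a(n))
     t0 = 0.d0
     t1 = 0.d0

     !$ t0 = omp_get_wtime()      ! first call: start of the timed block

     !$omp parallel do
     do i = 1, n
        a(i) = sin(dble(i))
     end do
     !$omp end parallel do

     !$ t1 = omp_get_wtime()      ! second call: end of the timed block

     print *, "a(n) =", a(n)
     !$ print *, "elapsed time =", t1 - t0, "seconds"

   end program timing_sketch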
Example 1
====================================

.. literalinclude:: ../hpsc2020/OpenMP/OMP1.f90
   :language: fortran
   :linenos:

* Line 2: we need to use the OpenMP module to have access to its subroutines
  and functions.
* Line 4: an OpenMP library subroutine is used to set the number of threads
  to 4.
* The block between lines 5 and 12 is the parallel region and will be
  executed in parallel by multiple threads (here 4).
* Every thread executes all code enclosed in the parallel region. Each thread
  works independently, executing all instructions in the block.
* Line 5: the thread number (tid) must be private. Each thread must have its
  own version of the tid variable, since each thread has its own number.
* Line 6: an OpenMP library function is used to obtain thread identifiers.
  This call must be inside the parallel region.
* Lines 8-11: only the master thread (which has ID zero in the team) does
  this.
* Line 9: an OpenMP library function is used to obtain the total number of
  threads.
* To compile and run:

  .. code-block:: none

     $ gfortran -fopenmp -c OMP1.f90
     $ gfortran -fopenmp OMP1.o -o OMP1.x
     $ ./OMP1.x

  This may give the following output:

  .. code-block:: none

     This is thread no. 3
     This is thread no. 2
     This is thread no. 0
     This is thread no. 1
     Total number of threads = 4

* The order of execution is not guaranteed.
* You may also compile without the OpenMP flag; in that case the parallel
  block will not be executed.

Example 2
====================

.. literalinclude:: ../hpsc2020/OpenMP/OMP2.f90
   :language: fortran
   :linenos:

* Each processor with ID number N (here N = 0, 1, 2, 3) will compute
  sum = 1 + 2 + ... + N.
* Line 9: the variables tid, i, and sum are declared private because each
  processor must have its own version of these variables.
* Compiling with -fopenmp, we may get the following output:

  .. code-block:: none

     I am processor no. 3 and my sum is 3.0000000000000000
     I am processor no. 1 and my sum is 1.0000000000000000
     I am processor no. 0 and my sum is 6.9532229757727091E-310
     I am processor no. 2 and my sum is 2.0000000000000000

* If you compile without -fopenmp, you would get:

  .. code-block:: none

     I am processor no. 1 and my sum is 1.0000000000000000

Example 3
=================

.. literalinclude:: ../hpsc2020/OpenMP/OMP3.f90
   :language: fortran
   :linenos:

* The iterations of the loop will be distributed dynamically in CHUNK-sized
  pieces.
* The arrays a, b, c and the variable n will be shared by all threads, while
  the variable i will be private to each thread.

Example 4
=================

.. literalinclude:: ../hpsc2020/OpenMP/OMP4.f90
   :language: fortran
   :linenos:

* This is the same as Example 3.

Example 5
==================

.. literalinclude:: ../hpsc2020/OpenMP/OMP5.f90
   :language: fortran
   :linenos:

Example 6
==================

.. literalinclude:: ../hpsc2020/OpenMP/OMP6.f90
   :language: fortran
   :linenos:

* The SINGLE directive is useful, for instance, for initializing or printing
  out a shared variable.
* You can also use ``!$omp master`` instead of ``!$omp single`` in line 15 to
  force execution by the master thread. In general, the user must not make
  any assumptions about which thread will execute a single region.
* Only one thread will execute lines 16-17. All other threads will skip the
  single region and wait at the end of the single construct until all threads
  in the team have reached that point. If the other threads can proceed
  without waiting for the thread executing the single region, a `nowait`
  clause can be specified. For this you would need to replace line 18 by
  ``!$omp end single nowait``.

Example 7 (Buggy code)
=============================

.. literalinclude:: ../hpsc2020/OpenMP/OMP7.f90
   :language: fortran
   :linenos:
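A common source of bugs in examples like this one is a data race: several
threads read and update the same shared variable at the same time, so updates
are lost and the result changes from run to run. The sketch below is a
generic illustration of my own (not the course code) showing the unsafe
pattern and one possible repair with the CRITICAL synchronization directive;
Example 8 next repairs its loop with a REDUCTION clause instead, which is
usually faster:

.. code-block:: fortran

   ! race_sketch.f90 -- hypothetical file name, not from the course repository.
   program race_sketch
     implicit none
     integer, parameter :: n = 1000000
     integer :: i
     double precision :: total

     ! Unsafe pattern: "total" is shared and the read-modify-write below is
     ! not atomic, so threads overwrite each other's updates (a data race).
     total = 0.d0
     !$omp parallel do shared(total) private(i)
     do i = 1, n
        total = total + 1.d0       ! RACE: result may vary from run to run
     end do
     !$omp end parallel do
     print *, "racy    total =", total

     ! One repair: let only one thread at a time execute the update.
     total = 0.d0
     !$omp parallel do shared(total) private(i)
     do i = 1, n
        !$omp critical
        total = total + 1.d0       ! safe, but serializes the update
        !$omp end critical
     end do
     !$omp end parallel do
     print *, "correct total =", total

   end program race_sketch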
Example 8 (Correct code)
=============================

.. literalinclude:: ../hpsc2020/OpenMP/OMP8.f90
   :language: fortran
   :linenos:

* The iterations of the parallel loop will be distributed in equal-sized
  blocks to each thread in the team: SCHEDULE(static, chunk). (A sketch
  contrasting static and dynamic scheduling is given at the end of this
  section.)
* REDUCTION(+:result) will create a private copy of result and initialize it
  (to zero) for each thread. At the end of the parallel loop, all threads add
  their values of result to update the master thread's global copy.

Example 9 (Buggy code)
=============================

.. literalinclude:: ../hpsc2020/OpenMP/OMP9.f90
   :language: fortran
   :linenos:

* The code is buggy! Try to run this program a couple of times; the result
  changes from run to run, which is the signature of a data race:

  .. code-block:: none

     motamed$ ./OMP9.x
     The integral is -3.9999999999998405E-003
     motamed$ ./OMP9.x
     The integral is -6.6599999999999770E-002
     motamed$ ./OMP9.x
     The integral is -0.12253599999999992
     motamed$ ./OMP9.x
     The integral is -4.6559999999999893E-002
     motamed$ ./OMP9.x
     The integral is -0.10002399999999984
     motamed$ ./OMP9.x
     The integral is -4.0527999999999897E-002

Example 10 (Correct code)
=============================

.. literalinclude:: ../hpsc2020/OpenMP/OMP10.f90
   :language: fortran
   :linenos:

* Now run this program a couple of times. Only round-off level differences in
  the last digits remain, caused by the order in which the threads'
  contributions are combined:

  .. code-block:: none

     motamed$ ./OMP10.x
     The integral is -3.9999999999706937E-005
     motamed$ ./OMP10.x
     The integral is -3.9999999999706937E-005
     motamed$ ./OMP10.x
     The integral is -3.9999999999651425E-005
     motamed$ ./OMP10.x
     The integral is -3.9999999999595914E-005

Example 11
=============================

This example shows how to measure the wall-clock time using both the Fortran
subroutine `system_clock` and the OpenMP library routine `omp_get_wtime`.

.. literalinclude:: ../hpsc2020/OpenMP/OMP11.f90
   :language: fortran
   :linenos:

* The output with the -fopenmp flag (on my 3GHz dual-core Intel Core i7 Apple
  MacBook) is:

  .. code-block:: none

     The integral is -3.9999999999651425E-005
     OMP elapsed time = 0.41845393180847168 sec.
     Fortran elapsed time = 0.41844800114631653 sec.

  And without -fopenmp (with only one processor):

  .. code-block:: none

     The integral is -4.0000000000106370E-005
     Fortran elapsed time = 1.0169110298156738 sec.

__ http://math.unm.edu/~motamed/Teaching/Fall20/HPSC/parallel.html#parallel-computer-memory-architectures
__ http://www.openmp.org/
__ https://computing.llnl.gov/tutorials/openMP/
__ http://www.openmp.org/wp-content/uploads/OpenMP_Examples_4.0.1.pdf
__ https://en.wikipedia.org/wiki/Fork%E2%80%93join_model
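Finally, the SCHEDULE clause used in Examples 3, 4, and 8 controls how loop
iterations are handed out to threads. The sketch below is a generic
illustration of my own (not one of the repository files; the program and
variable names are made up) contrasting the two most common choices: with
``schedule(static, chunk)`` the chunk-sized blocks are assigned to threads in
a fixed round-robin fashion before the loop starts, while with
``schedule(dynamic, chunk)`` each thread grabs the next available block
whenever it becomes idle, which helps when iterations have uneven cost.

.. code-block:: fortran

   ! schedule_sketch.f90 -- hypothetical file name, not from the course repository.
   program schedule_sketch
     !$ use omp_lib
     implicit none
     integer, parameter :: n = 16, chunk = 4
     integer :: i, tid

     ! Static schedule: chunk-sized blocks are assigned to threads round
     ! robin, and the assignment is fixed before the loop starts.
     !$omp parallel do schedule(static, chunk) private(tid)
     do i = 1, n
        tid = 0
        !$ tid = omp_get_thread_num()
        print *, "static : iteration", i, "done by thread", tid
     end do
     !$omp end parallel do

     ! Dynamic schedule: each thread requests the next chunk-sized block when
     ! it becomes idle, so the assignment can change from run to run.
     !$omp parallel do schedule(dynamic, chunk) private(tid)
     do i = 1, n
        tid = 0
        !$ tid = omp_get_thread_num()
        print *, "dynamic: iteration", i, "done by thread", tid
     end do
     !$omp end parallel do

   end program schedule_sketch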