Parallel computing

What is parallel computing?

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem. Provided the problem can be broken into discrete pieces of work that can be solved simultaneously, the goal is to solve it in less time with multiple compute resources than with a single one. Luckily, the real world (and hence the mathematical models describing real-world systems) is massively parallel: many complex, interrelated events happen at the same time, yet within a temporal sequence.

Today’s compute resources are typically: (1) a single computer with multiple processors/cores, e.g. an Apple MacBook with a dual-core or quad-core processor; or (2) a supercomputer consisting of a large number of such computers (or nodes) connected by a communication network, forming a cluster. Each node consists of multiple cores that sit together on a socket. For instance, the Stampede2 supercomputer is a cluster with 4,200 nodes, each containing 68 cores, for a total of 285,600 cores.

Generally speaking, each core is an independent unit that executes a stream of instructions. The separate cores can work on unrelated tasks or (by data parallelism) collaborate on a common task.
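As a quick way to see how many cores are available on a node, here is a minimal C sketch (assuming OpenMP is available and the code is compiled with, e.g., gcc -fopenmp) that prints the number of processors visible to the runtime and the default number of threads it would use:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Number of processors/cores the OpenMP runtime can see on this node */
    printf("Available processors: %d\n", omp_get_num_procs());
    /* Default number of threads a parallel region would use */
    printf("Default thread count: %d\n", omp_get_max_threads());
    return 0;
}
```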

Limits of parallel computing

Amdahl’s Law gives the potential program speedup (\(S\)) in terms of the fraction \(P \in [0,1]\) of the code that can be parallelized:

$$S = \frac{T_s}{T_p} \le \frac{1}{1-P},$$

where \(T_s\) and \(T_p\) denote the times required on a sequential and a parallel machine, respectively.

If no fraction of the code can be parallelized (\(P=0\)), then there is no speedup (\(S = 1\)). If all of the code can be parallelized (\(P = 1\)), the speedup is in theory infinite (assuming we have arbitrarily many processors/cores). Typically only part of a computation can be parallelized. For instance, if \(50\%\) of the code can be parallelized (\(P=0.5\)), the maximum speedup is 2, meaning the code will run at most twice as fast no matter how many processors/cores are used. We can gain at most a factor of 2 because the sequential \(50\%\) takes half of the original time, and that time is still required even if the parallel part is reduced to zero.
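To see how quickly the serial fraction limits the achievable speedup, evaluating the bound for a few values of \(P\) gives

$$P = 0.5 \;\Rightarrow\; S \le 2, \qquad P = 0.9 \;\Rightarrow\; S \le 10, \qquad P = 0.99 \;\Rightarrow\; S \le 100.$$

Even with \(99\%\) of the code parallelized, no number of processors can deliver more than a 100-fold speedup.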

Now let us assume that our parallel machine has \(N_p\) processors. Then

$$T_p \ge (1-P) \, T_s + P \, \frac{T_s}{N_p},$$

with equality corresponding to the best case. In the limit of many processors, the second term on the right-hand side vanishes, and we obtain

$$\lim_{N_p \rightarrow \infty} T_p = (1-P) \, T_s = \frac{T_s}{S},$$

where \(S = 1/(1-P)\) is the maximum speedup; that is, we gain at most a factor of \(S\) with many processors.
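Combining the bound on \(T_p\) with the definition of the speedup gives the finite-processor form of Amdahl’s law,

$$S = \frac{T_s}{T_p} \le \frac{1}{(1-P) + P/N_p},$$

which recovers \(S \le 1/(1-P)\) as \(N_p \rightarrow \infty\).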

In practice the speedup is less than (sometimes much less than) the number of processors \(N_p\), due to the overhead of starting processes/threads, communication costs, memory limitations, algorithmic limits to scalability, etc.
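As a rough illustration of this effect, the short C sketch below evaluates the ideal Amdahl speedup \(1/((1-P) + P/N_p)\) alongside a modified estimate that adds a purely illustrative linear overhead term \(c \, N_p\) to the normalized parallel time; the values of \(P\) and \(c\) are made up for demonstration, and real overheads depend on the machine and the algorithm.

```c
#include <stdio.h>

int main(void) {
    const double P = 0.95;   /* parallel fraction (assumed for illustration) */
    const double c = 0.001;  /* illustrative per-processor overhead          */

    for (int Np = 1; Np <= 1024; Np *= 2) {
        /* Ideal Amdahl speedup for Np processors */
        double s_ideal = 1.0 / ((1.0 - P) + P / Np);
        /* Same model with a linear overhead term added to the parallel time */
        double s_overhead = 1.0 / ((1.0 - P) + P / Np + c * Np);
        printf("Np = %4d   ideal S = %6.2f   with overhead S = %6.2f\n",
               Np, s_ideal, s_overhead);
    }
    return 0;
}
```

In this toy model the speedup eventually decreases as more processors are added, because the overhead term grows with \(N_p\).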

Scalability

Scalability refers to the ability of a parallel system (software and/or hardware) to demonstrate a proportional increase in parallel speedup with the addition of more resources.

There are two types of scaling:

  • Strong scaling: How does the algorithm perform as \(N_p\) increases for a fixed problem size \(N\)? The goal is to run the same problem size faster. Perfect scaling means the problem is solved in \(1/N_p\) of the serial time (see the sketch after this list).
  • Weak scaling: How does the algorithm perform when \(N\) increases with \(N_p\)? The goal is to run a larger problem in the same amount of time. For example, if we double \(N_p\), can we solve a problem twice as large in the same time?
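A common way to report strong scaling is the parallel speedup \(T_1 / T_{N_p}\) and the parallel efficiency \(T_1 / (N_p \, T_{N_p})\). The C sketch below computes both from a set of measured wall-clock times; the timings here are invented purely for illustration.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical measured wall-clock times (seconds) for the same problem
       on 1, 2, 4, 8, 16 processors -- purely illustrative numbers. */
    const int    np[]   = {1, 2, 4, 8, 16};
    const double time[] = {100.0, 52.0, 27.5, 15.0, 9.0};
    const int    n_runs = sizeof(np) / sizeof(np[0]);

    for (int i = 0; i < n_runs; ++i) {
        double speedup    = time[0] / time[i];   /* T_1 / T_Np */
        double efficiency = speedup / np[i];     /* S / Np     */
        printf("Np = %2d   speedup = %5.2f   efficiency = %4.2f\n",
               np[i], speedup, efficiency);
    }
    return 0;
}
```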

What does it mean to double the size of a problem?

Example. When solving an \(n \times n\) linear system with Gaussian elimination, we require \({\mathcal O}(n^3)\) FLOPs. Doubling \(n\) would not double the size of the problem, as it would require 8 times as many operations. The problem is indeed twice as large if we increase \(n\) by a factor of \(2^{1/3} \approx 1.26\): with \(n_{new} = 2^{1/3} n\), we require \(n_{new}^3 = 2 \, n^3\) leading-order operations.

Exercise. What if we solve the system in the previous example by an iterative method, such as a multigrid method, which requires \({\mathcal O}(n \log n)\) FLOPs?

Remark. Developing better algorithms is as important as better hardware.

Parallel computer memory architectures

Shared Memory describes a computer architecture in which all processors have direct access to common physical memory as a global address space. Multiple processors can operate independently but share the same memory resources. Shared memory machines are classified as UMA and NUMA.

  • Uniform Memory Access (UMA): Identical processors have equal access and access times to memory. All processors/cores have a consistent view of the data: If one processor updates a location in shared memory, all the other processors know about the update (this is referred to as cache coherency). For instance, if data value x changes on a given core, there must be no risk of other cores using outdated values of x.
  • Non-Uniform Memory Access (NUMA): A NUMA machine is often made by linking a few UMA machines. Not all processors have equal access time to all memories: a processor can access its own local memory faster than non-local memory (memory access across the link is slower). With NUMA, maintaining cache coherency across shared memory has a significant overhead.

In general, the global address space in shared memory machines provides a user-friendly programming perspective to memory. However, shared memory machines suffer from a lack of scalability between memory and processors: adding more processors can geometrically increase traffic on the shared memory-processor path and, for cache coherent systems, geometrically increase the traffic associated with cache/memory management.

Distributed Memory describes a computer architecture where processors have their own local memory. Each processor operates independently. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors. Distributed memory systems require a communication network to connect inter-processor memory. Since the changes that a processor makes to its local memory have no effect on the memory of other processors, the concept of cache coherency does not apply. When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated (or exchanged) between processors. Synchronization between tasks (coordination of parallel tasks) is likewise the programmer’s responsibility. Synchronization is often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application’s wall clock execution time to increase.

A main advantage of distributed memory systems is that memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately. Moreover, each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency. However, one should beware of communication costs and non-uniform memory access times: data residing on a remote node takes longer to access than node local data.

Hybrid Memory: The largest and fastest computers in the world today (such as Stampede2) employ both shared and distributed memory architectures. In such a paradigm, increased scalability is an important advantage, while increased programmer complexity is an important disadvantage.

Parallel programming

We will study two parallel programming models:

1. Threads model on shared memory machines:

  • The main program performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
  • Each thread has local (or private) data, but also shares the entire resources of the program (shared data). This saves the overhead associated with replicating a program’s resources for each thread. Each thread also benefits from a global memory view, because it shares the memory space of the program.
  • Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no two threads update the same global address at the same time.
  • We will use OpenMP to implement this model; a minimal sketch is given below.
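As a taste of the threads model, here is a minimal C/OpenMP sketch (compile with, e.g., gcc -fopenmp). The array and the sum are shared among the threads, each thread keeps its own private loop index, and the reduction clause provides the synchronization needed for the shared sum.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* a and sum are shared among threads; i is private to each thread. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;   /* each thread works on its own chunk of a   */
        sum += a[i];      /* the reduction avoids a data race on sum   */
    }

    printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```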

2. Message passing model on distributed memory machines:

  • We create a set of tasks to be run on processors that use their own local memory during computation.
  • Tasks exchange data through communications by sending and receiving messages.
  • Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
  • We will use MPI to implement this model; a minimal sketch is given below.
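Here is a minimal C/MPI sketch of cooperative message passing, in which a send on rank 0 is matched by a receive on rank 1 (compile with mpicc and run with, e.g., mpirun -np 2):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;
        /* The send on rank 0 must be matched by a receive on rank 1 */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received x = %f\n", x);
    }

    MPI_Finalize();
    return 0;
}
```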

Designing parallel programs

Partitioning

One of the first steps in designing a parallel program is to break the problem into discrete “chunks” of work that can be distributed to multiple tasks. This is known as decomposition or partitioning. There are two basic ways to partition computational work among parallel tasks:

  • Domain Decomposition: In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data. This is a common scenario in scientific computing: a data set (such as a vector or a matrix) is spread over many processors, each working on its part of the data. This may or may not require communication. A sketch of a simple block decomposition is given after this list.
  • Functional Decomposition: In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done, and each task then performs a portion of the overall work. Functional decomposition is well suited to problems that can be split into different tasks. Functional parallelism is often obtained by considering independent subprograms, e.g. when we need to solve the same set of ODEs with various choices of parameters, or in convergence studies where we need to solve the same ODE problem with different step sizes.
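As an example of domain decomposition, the following C sketch computes the contiguous block of indices of a length-\(n\) vector owned by each of \(N_p\) tasks, distributing any remainder over the first few tasks. The function and variable names are only for illustration.

```c
#include <stdio.h>

/* Compute the half-open index range [first, last) owned by task `rank`
   when n entries are split as evenly as possible among np tasks. */
static void block_range(int n, int np, int rank, int *first, int *last) {
    int base = n / np;        /* minimum block size             */
    int rem  = n % np;        /* first `rem` tasks get one more */
    *first = rank * base + (rank < rem ? rank : rem);
    *last  = *first + base + (rank < rem ? 1 : 0);
}

int main(void) {
    int n = 10, np = 4;
    for (int rank = 0; rank < np; rank++) {
        int first, last;
        block_range(n, np, rank, &first, &last);
        printf("task %d owns indices [%d, %d)\n", rank, first, last);
    }
    return 0;
}
```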

Communications

The need for communication between tasks depends on the problem at hand:

  • We do not need communications if the problem can be decomposed and executed in parallel with almost no need for tasks to share data. These types of problems are often called embarrassingly parallel: little or no communication is required. For instance, computing \(a(i) = 2*b(i)\) would not need communication between different processors (corresponding to different \(i\) values). As another example, in Monte Carlo simulations we solve many deterministic problems (corresponding to many realizations of the random parameters) independently. Each deterministic problem can be solved on a processor (or multiple processors) independently of the other problems.
  • We do however need communications in most parallel applications that require tasks to share data with each other. For example, if we solve a 2D heat diffusion problem with a central finite difference scheme, each task needs to know the temperatures calculated by the tasks that own neighboring data; a minimal sketch of such a neighbor (halo) exchange is given below.
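For instance, with a 1D block decomposition of the grid among MPI ranks, each rank can exchange its boundary values with its left and right neighbors using MPI_Sendrecv before every update step. The C/MPI sketch below (with an invented local array size) shows only the exchange of ghost values, not the finite difference update itself.

```c
#include <stdio.h>
#include <mpi.h>

#define NLOC 100  /* illustrative number of local grid points per rank */

/* Exchange boundary values with the left/right neighbors.
   u has two extra "ghost" entries: u[0] and u[NLOC + 1]. */
void halo_exchange(double u[NLOC + 2], MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior value to the left, receive my right ghost value */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send my last interior value to the right, receive my left ghost value */
    MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    double u[NLOC + 2];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOC + 2; i++) u[i] = (double) rank;  /* dummy data */
    halo_exchange(u, MPI_COMM_WORLD);
    printf("rank %d: left ghost = %g, right ghost = %g\n", rank, u[0], u[NLOC + 1]);

    MPI_Finalize();
    return 0;
}
```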