---
title: High Performance Computing 25 SP Dichotomy of Parallel Computing Platforms
date: 2025-03-28T01:03:32.2187720+08:00
tags:
- High Performance Computing
- Study Materials
---

Designing algorithms is always the hardest.

<!--more-->

Flynn's classical taxonomy:

- SISD
- SIMD
- MISD
- MIMD

Multiple instruction, multiple data (MIMD) is currently the most common type of parallel computer.

> A variant: single program, multiple data (SPMD).

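As an illustration of the SPMD style, here is a minimal sketch assuming an MPI environment (the work each rank does is hypothetical): every process runs the same program and branches on its rank to handle its own portion of the data.

```c
/* SPMD sketch: one program, launched as several processes (e.g. mpirun -np 4),
 * each working on its own partition of the data. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

    /* Same code everywhere; the data (and possibly the branch taken)
     * differ per rank. */
    printf("rank %d of %d working on its own partition\n", rank, size);

    MPI_Finalize();
    return 0;
}
```
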
## Dichotomy of Parallel Computing Platforms

The dichotomy is based on the logical and physical organization of parallel platforms.

Logical organization (from a programmer's perspective):

- Control structure: ways of expressing parallel tasks.
- Communication model: interactions between tasks.

Hardware organization:

- Architecture.
- Interconnection networks.

Control structure of parallel platforms: parallel tasks can be specified at various levels of granularity.

Communication model: **shared-address-space platforms** support a common data space that is accessible to all processors. Two types of architectures:

- Uniform memory access (UMA)
- Non-uniform memory access (NUMA)

> NUMA and UMA are defined in terms of memory access times, not cache access times.

![](hpc-2025-parallel-computing/image-20250313193604905.webp)

NUMA and UMA:

- The distinction between NUMA and UMA platforms is important from the point of view of algorithm design.

- Programming these platforms is easier since reads and writes are implicitly visible to other processors.

- Caches in such machines require coordinated access to multiple copies.

> This leads to the cache coherence problem.

- A weaker model of these machines provides an address map but not coordinated access.

**Global Memory Space**:

- Easy to program.

- Read-only interactions:

  Invisible to programmers.

  Same as in serial programs.

- Read/write interactions:

  Mutual exclusion for concurrent accesses, e.g., locks and related mechanisms.

- Programming paradigms: Threads/Directives (see the sketch below).

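A minimal sketch of a read/write interaction on shared data protected by a lock for mutual exclusion (POSIX threads; the thread and iteration counts are hypothetical):

```c
/* Shared-address-space read/write interaction guarded by a mutex. */
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   100000

static long counter = 0;                        /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);              /* mutual exclusion */
        counter++;                              /* read/write interaction */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);

    printf("counter = %ld (expected %d)\n", counter, THREADS * ITERS);
    return 0;
}
```
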
Caches in shared-address-space platforms:

- Address translation mechanism to locate a memory word in the system.
- Well-defined semantics over multiple copies (**cache coherence**).

> Shared-address-space vs shared-memory machine:
>
> Shared address space is a programming abstraction.
>
> Shared memory machine is a physical machine attribute.

Distributed Shared Memory (DSM) or Shared Virtual Memory (SVM):

- Page-based access control: leverage the virtual memory support and manage main memory as a fully associative cache on the virtual address space by embedding a coherence protocol in the page fault handler.
- Object-based access control: more flexible, with no false sharing.

## Parallel Algorithm Design

Steps in parallel algorithm design:

- Identifying portions of the work that can be performed concurrently.
- Mapping the concurrent pieces of work onto multiple processors running in parallel.
- Distributing the input, output and intermediate data associated with the program.
- Managing accesses to data shared by multiple processors.
- Synchronizing the processors at various stages of the parallel program execution.

### Decomposition

Dividing a computation into smaller parts, some or all of which may be executed in parallel.

Tasks: programmer-defined units of computation; they can be of arbitrary size but are indivisible.

Aim: **reducing execution time**.

Ideal decomposition:

- All tasks have similar size.
- Tasks do **not** wait for each other and do **not** share resources.

Dependency graphs:

Task dependency graph: an abstraction to express dependencies among tasks and their relative order of execution.

- Directed acyclic graphs.
- Nodes are tasks.
- Directed edges: dependencies amongst tasks.

> The fewer directed edges, the more parallelism is exposed.

Granularity:

The granularity of the decomposition: the number and size of tasks into which a problem is decomposed.

- Fine-grained: a large number of small tasks.
- Coarse-grained: a small number of large tasks.

Concurrency:

**Maximum degree of concurrency**: the maximum number of tasks that can be executed simultaneously at any given time.

**Average degree of concurrency**: the average number of tasks that can run concurrently over the entire duration of execution.

The critical path determines the average degree of concurrency.

The critical path is the longest directed path between any pair of start and finish nodes, so a shorter critical path favors a higher degree of concurrency.

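Stated as a formula (the standard relation; the critical path length is the sum of the task weights along the critical path):

$$
\text{average degree of concurrency} = \frac{\text{total amount of work}}{\text{critical path length}}
$$
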
**Limited Granularity**:

It may appear that making the decomposition ever finer (and thereby increasing the degree of concurrency) will keep reducing the parallel run time.

But there is an inherent bound on how fine-grained a decomposition a problem permits.

Speedup:

The ratio of serial to parallel execution time. Restrictions on obtaining unbounded speedup come from:

- Limited granularity.
- Degree of concurrency.
- Interaction among tasks running on different physical processors.

Processor:

A computing agent that performs tasks; an abstract entity that uses the code and data of a task to produce the output of the task within a finite amount of time.

Mapping: the mechanism by which tasks are assigned to processors for execution. The task dependency and task interaction graphs play an important role.

Decomposition techniques:

The fundamental step is to split the computation to be performed into a set of tasks for concurrent execution.

1. Recursive decomposition.

A method for inducing concurrency in problems that can be solved using the **divide-and-conquer** strategy, as in the sketch below.

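A minimal sketch of recursive decomposition using OpenMP tasks (a hypothetical divide-and-conquer array sum, not from the lecture): each half of the range becomes an independent task, and the partial results are combined once both halves are done.

```c
/* Recursive (divide-and-conquer) decomposition with OpenMP tasks:
 * compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

long sum(const long *a, int lo, int hi) {
    if (hi - lo < 1024) {                 /* small subproblem: solve serially */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)         /* left half becomes a new task */
    left = sum(a, lo, mid);
    right = sum(a, mid, hi);              /* right half handled here */
    #pragma omp taskwait                  /* combine only after both finish */
    return left + right;
}

int main(void) {
    enum { N = 1 << 20 };
    static long a[N];
    for (int i = 0; i < N; i++) a[i] = 1;

    long total;
    #pragma omp parallel
    #pragma omp single                    /* one thread starts the recursion */
    total = sum(a, 0, N);

    printf("total = %ld\n", total);       /* expect 1048576 */
    return 0;
}
```
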
2. Data decomposition.

A method for deriving concurrency in algorithms that operate on large data structures.

The tasks perform similar operations on different partitions of the data.

The partitioning can be of the output data, the input data, or even the intermediate data, as in the sketch below.

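A minimal sketch of partitioning the output data (a hypothetical dense matrix-vector product): each entry of the output vector, together with the corresponding row of the matrix, forms one independent task.

```c
/* Output-data partitioning: task i owns row i of A and produces y[i].
 * Compile with -fopenmp to run the iterations in parallel. */
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N], x[N], y[N];

    for (int i = 0; i < N; i++) {          /* hypothetical input data */
        x[i] = 1.0;
        for (int j = 0; j < N; j++) A[i][j] = i + j;
    }

    #pragma omp parallel for               /* one task per output entry */
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
    }

    for (int i = 0; i < N; i++) printf("y[%d] = %.1f\n", i, y[i]);
    return 0;
}
```
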
3. Exploratory decomposition.

Decompose problems whose underlying computations correspond to a search of a space for solutions.

Exploratory decomposition appears similar to data decomposition, but the work performed in parallel can be more or less than in the serial algorithm, since the search can stop as soon as a solution is found.

4. Speculative decomposition.

Used when a program may take one of many possible computationally significant branches depending on the output of a preceding computation.

This is similar to evaluating the branches of a *switch* statement in `C` in parallel: the branches are evaluated concurrently with the condition, the result of the correct branch is used, and the other results are discarded (see the sketch below).

The parallel run time is smaller than the serial run time by the amount of time needed to evaluate the condition.

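A minimal sketch of speculative decomposition (the functions are hypothetical stubs): both branches are evaluated in parallel with the condition they depend on, and only the result selected by the condition is kept.

```c
/* Speculative decomposition with OpenMP sections: compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

long slow_condition(void);   /* decides which branch is actually needed */
long branch_a(void);
long branch_b(void);

int main(void) {
    long cond = 0, a = 0, b = 0;

    #pragma omp parallel sections
    {
        #pragma omp section
        cond = slow_condition();   /* the computation the branches depend on */
        #pragma omp section
        a = branch_a();            /* speculatively evaluated */
        #pragma omp section
        b = branch_b();            /* speculatively evaluated */
    }

    long result = cond ? a : b;    /* discard the losing branch */
    printf("result = %ld\n", result);
    return 0;
}

/* Stub implementations so the sketch compiles and runs. */
long slow_condition(void) { return 1; }
long branch_a(void) { return 42; }
long branch_b(void) { return 7; }
```
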
### Characteristics of Tasks

**Task generation**:

- Static: all the tasks are known before the algorithm starts executing.
- Dynamic: the actual tasks and the task dependency graph are not explicitly available a priori.
- Some problems allow either static or dynamic task generation.

**Task Sizes**:

The relative amount of time required to complete the task.

- Uniform
- Non-uniform

The knowledge of task sizes will influence the choice of mapping scheme.

**Inter-Task Interactions**:

- Static versus dynamic.
- Regular versus irregular.
- Read-only versus read-write.
- One-way versus two-way.

### Mapping Techniques

Mapping techniques are used for load balancing.

Good mappings:

- Reduce the interaction time.
- Reduce the idle time.

![](hpc-2025-parallel-computing/image-20250320200524155.webp)

There are two mapping methods:

- **Static mapping**: determined by the programming paradigm and the characteristics of tasks and interactions.

  Static mapping is often used in conjunction with *data partitioning* and *task partitioning*.

- **Dynamic mapping**: distributes the work among processors during the execution. Also referred to as dynamic load balancing.

  In the **centralized scheme**, all executable tasks are maintained in a common central data structure and distributed by a special process (or a subset of processes) acting as the **master**.

  A centralized scheme is easy to implement but has limited scalability.

  In the **distributed scheme**, the set of executable tasks is distributed among the processes, which exchange tasks at run time to balance the work.

**Minimize frequency of interactions**:

There is a relatively high startup cost associated with each interaction on many architectures.

So restructure the algorithm such that shared data are accessed and used in large pieces.

**Minimize contention and hot spots**:

Contention occurs when multiple tasks try to access the same resources concurrently.

Centralized schemes for dynamic mapping are a frequent source of contention, so prefer distributed mapping schemes.

**Overlapping computations with interactions**:

When waiting for shared data, do some useful computation.

- Initiate an interaction early enough that it completes before it is needed.
- In dynamic mapping schemes, a process can anticipate that it is about to run out of work and initiate the work-transfer interaction in advance.

Overlapping computations with interactions requires support from the programming paradigm, the operating system and the hardware.

- Disjoint address-space paradigm: non-blocking message-passing primitives (see the sketch below).
- Shared address-space paradigm: prefetching hardware that can anticipate memory addresses and initiate accesses in advance of when they are needed.

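A minimal sketch of overlapping computation with interaction via non-blocking message-passing primitives (assumes MPI and exactly two processes; the buffers and values are hypothetical):

```c
/* Non-blocking exchange: start the transfer, compute locally, then wait. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send[N], recv[N];
    for (int i = 0; i < N; i++) send[i] = rank + i;

    int peer = 1 - rank;                     /* assumes exactly two processes */
    MPI_Request reqs[2];

    /* Initiate the interaction early... */
    MPI_Isend(send, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...and do useful local computation while the transfer is in flight. */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += send[i] * send[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete before using recv */
    printf("rank %d: local = %.1f, recv[0] = %.1f\n", rank, local, recv[0]);

    MPI_Finalize();
    return 0;
}
```
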
**Replicating data or computations**:

Multiple processors may require frequent read-only access to a shared data structure such as a hash table.

For different paradigms:

- Shared address space: rely on caches.
- Message passing: remote data accesses are more expensive and harder than local accesses, which makes replication attractive.

Data replication increases the memory requirements. In some situations, it may be more cost-effective to recompute intermediate results than to fetch them from another process.

**Using optimized collective interaction operations**:

Examples of collective operations (see the sketch below):

- Broadcasting some data to all processes.
- Adding up numbers each belonging to a different process.

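A minimal sketch of these two collectives, assuming MPI (the values are hypothetical):

```c
/* Optimized collective operations: broadcast and reduction. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Broadcast some data from rank 0 to all processes. */
    int data = (rank == 0) ? 100 : 0;
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Add up numbers, each belonging to a different process. */
    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("data = %d, sum of ranks = %d\n", data, sum);

    MPI_Finalize();
    return 0;
}
```
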
### Parallel Algorithm Models

The way of structuring a parallel algorithm by:

- Selecting a decomposition.
- Selecting a mapping technique.
- Applying the appropriate strategy to minimize interactions.

**Data parallel model**:

The tasks are statically or semi-statically mapped onto processes and each task performs similar operations on different data.

Example: matrix multiplication.

**Task graph model**:

The interrelations among the tasks are utilized to promote locality or to reduce interaction costs.

Example: quicksort, sparse matrix factorization and many other algorithms using divide-and-conquer decomposition.

**Work pool model**:

Characterized by a dynamic mapping of tasks onto processes for load balancing.

Example: parallelization of loops by chunk scheduling, as in the sketch below.

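A minimal sketch of the work pool model via chunk scheduling of a loop (OpenMP dynamic scheduling; the chunk size and workload are hypothetical): idle threads repeatedly take the next chunk of iterations from a shared pool.

```c
/* Work pool via dynamic chunk scheduling: compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    static double work[N];

    /* Iterations have uneven cost, so chunks of 16 iterations are handed
     * out dynamically as threads become idle. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < i; j++)      /* cost grows with i */
            s += (double)j;
        work[i] = s;
    }

    printf("work[N-1] = %.1f\n", work[N - 1]);
    return 0;
}
```
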
**Master-slave model**:

One or more master processes generate work and allocate it to worker processes.

**Pipeline or producer-consumer model**:

A stream of data is passed on through a succession of processes, each of which performs some task on it.

### Analytical Modeling of Parallel Programs

**Performance evaluation**:

Evaluation in terms of execution time.

A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented.

**Sources of overhead in parallel programs**:

A typical execution includes:

- Essential computation

  Computation that would be performed by the serial program for solving the same problem instance.

- Interprocess communication

- Idling

- Excess computation

  Computation that is not performed by the serial program.

**Performance metrics for parallel systems**:

- Execution time
- Overhead function
- Total overhead
- Speedup

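With the usual notation (serial run time $T_S$, parallel run time $T_P$ on $p$ processing elements), these metrics are related as follows, where $T_o$ is the total overhead, $S$ the speedup, $E$ the efficiency and $C$ the cost:

$$
T_o = p T_P - T_S, \qquad S = \frac{T_S}{T_P}, \qquad E = \frac{S}{p}, \qquad C = p T_P
$$
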
> For a given problem, more than one sequential algorithm may be available.

Theoretically speaking, speedup can never exceed the number of processing elements (PEs).

If superlinear speedup occurs, either the work performed by the serial program is greater than that of the parallel formulation, or hardware features (such as the larger aggregate cache of the parallel machine) put the serial implementation at a disadvantage.

**Amdahl's Law**:

![](hpc-2025-parallel-computing/image-20250327194045418.webp)

The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used.

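In symbols (with $f$ the serial, non-parallelizable fraction of the execution and $p$ processing elements), Amdahl's law bounds the speedup as:

$$
S \le \frac{1}{f + \frac{1-f}{p}} \le \frac{1}{f}
$$
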
Efficiency: a measure of the fraction of time for which a PE is usefully employed.

Cost: the product of the parallel run time and the number of processing elements used.

![](hpc-2025-parallel-computing/image-20250327194312962.webp)