Compare commits

...

1 Commits

Author SHA1 Message Date
457316971c
draft: six notebooks 2025-06-16 00:45:16 +08:00
53 changed files with 779 additions and 0 deletions

View File

@ -0,0 +1,369 @@
---
title: High Performance Computing 25 SP NVIDIA
date: 2025-04-24T19:02:36.1077330+08:00
tags:
- 高性能计算
- 学习资料
---
Fxxk you, NVIDIA!
<!--more-->
CPU/GPU Parallelism:
Moore's Law gives you more and more transistors:
- CPU strategy: make a single compute thread run as fast as possible.
- GPU strategy: run as many threads as possible in parallel, maximizing throughput.
GPU Architecture:
- Massively Parallel
- Power Efficient
- Memory Bandwidth
- Commercially Viable Parallelism
- Not dependent on large caches for performance
![image-20250424192311202](./hpc-2025-cuda/image-20250424192311202.webp)
## Nvidia GPU Generations
- 2006: G80-based GeForce 8800
- 2008: GT200-based GeForce GTX 280
- 2010: Fermi
- 2012: Kepler
- 2014: Maxwell
- 2016: Pascal
- 2017: Volta
- 2020: Ampere
- 2022: Hopper
- 2024: Blackwell
### 2006: G80 Terminology
- SP: Streaming Processor, a scalar ALU for a single CUDA thread.
- SPA: Streaming Processor Array.
- SM: Streaming Multiprocessor, consisting of 8 SPs.
- TPC: Texture Processor Cluster, 2 SMs + a texture unit (TEX).
![image-20250424192825010](./hpc-2025-cuda/image-20250424192825010.webp)
Design goal: performance per millimeter of die area.
For GPUs, performance means throughput, so latency is hidden with computation rather than with large caches.
This execution style is single instruction, multiple thread (SIMT).
**Thread Life Cycle**:
A grid is launched on the SPA, and its thread blocks are serially distributed to all SMs.
![image-20250424193125125](./hpc-2025-cuda/image-20250424193125125.webp)
**SIMT Thread Execution**:
Groups of 32 threads are formed into warps. Threads in the same warp always execute the same instruction; some threads become inactive when code paths diverge, and the hardware **automatically handles divergence**.
Warps are the primitive unit of scheduling.
> SIMT execution is an implementation choice: sharing control logic leaves more die area for ALUs.
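A minimal sketch of what divergence looks like in a kernel (illustrative code, not from the lecture): threads in the same warp take different branches, and the hardware serializes the two paths while masking off inactive lanes.
```cuda
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Even and odd lanes of the same warp take different paths; the warp executes
    // both branches one after the other, with non-participating lanes masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;
    else
        out[i] = 3 * i;
}
```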
**SM Warp Scheduling**:
SM hardware implements zero-overhead warp scheduling:
- Warps whose next instruction has its operands ready for consumption are eligible for execution.
- Eligible warps are selected for execution based on a prioritized scheduling policy.
> If 4 clock cycles are needed to dispatch the same instruction for all threads in a warp, one global memory access is made every 4 instructions, and the memory latency is 200 cycles, then 200 / (4 × 4) = 12.5, so at least 13 warps are needed to fully tolerate the memory latency.
SM warp scheduling uses a scoreboard (and similar mechanisms) to track operand readiness.
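The arithmetic in the quote generalizes to a simple rule of thumb (my restatement of the same numbers):
$$
N_{\text{warps}} \;\ge\; \frac{\text{memory latency}}{\text{dispatch cycles per instruction} \times \text{instructions per memory access}} \;=\; \frac{200}{4 \times 4} \;=\; 12.5 \;\Rightarrow\; 13
$$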
**Granularity Consideration**:
Consider that in the G80 GPU, one SM can run up to 768 threads and 8 thread blocks. The best tile size for matrix multiplication is therefore 16 × 16 = 256 threads per block: three such blocks fit in one SM and fully use its 768 threads (see the launch sketch below).
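A launch-configuration sketch of that argument; the kernel name and device pointers are hypothetical.
```cuda
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N);  // hypothetical kernel

void launch(const float *dA, const float *dB, float *dC, int N) {
    // 16 x 16 = 256 threads per block: three resident blocks use all 768 threads
    // a G80 SM can track, while staying within the 8-blocks-per-SM limit.
    // An 8 x 8 = 64-thread block would need 12 resident blocks for full occupancy,
    // which exceeds the 8-block limit and leaves the SM under-used.
    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
}
```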
### 2008: GT200 Architecture
![image-20250424195111341](./hpc-2025-cuda/image-20250424195111341.webp)
### 2010: Fermi GF100 GPU
**Fermi SM**:
![image-20250424195221886](./hpc-2025-cuda/image-20250424195221886.webp)
There are 32 cores per SM and 512 cores in total, and Fermi introduces 64 KB of configurable L1 cache / shared memory.
Execution resources are decoupled internally, and dual-issue pipelines select two warps at a time.
Fermi also debuts the Parallel Thread eXecution (PTX) 2.0 ISA.
### 2012: Kepler GK110
![image-20250424200022880](./hpc-2025-cuda/image-20250424200022880.webp)
### 2014: Maxwell
4 GPCs and 16 SMMs (Maxwell SMs).
![image-20250424200330783](./hpc-2025-cuda/image-20250424200330783.webp)
### 2016: Pascal
Nothing to note here; Pascal's main changes (HBM, unified memory, preemption) are covered in later sections.
### 2017: Volta
Volta first introduces the Tensor Core, a dedicated matrix-multiplication unit.
### 2020: Ampere
The GA100 SM:
![image-20250508183446257](./hpc-2025-cuda/image-20250508183446257.webp)
### 2022: Hopper
Introduce the GH200 Grace Hopper Superchip:
![image-20250508183528381](./hpc-2025-cuda/image-20250508183528381.webp)
The superchip pairs a CPU and a GPU linked by NVLink, and such systems can scale out for machine learning workloads.
![image-20250508183724162](./hpc-2025-cuda/image-20250508183724162.webp)
Memory access across the NVLink:
- GPU to local CPU
- GPU to peer GPU
- GPU to peer CPU
![image-20250508183931464](./hpc-2025-cuda/image-20250508183931464.webp)
These accesses are handled by hardware-accelerated memory coherency. Previously the CPU and GPU kept separate page tables; now they can share a single page table, so the GPU can access memory on both sides directly.
![image-20250508184155087](./hpc-2025-cuda/image-20250508184155087.webp)
### 2024: Blackwell
![image-20250508184455215](./hpc-2025-cuda/image-20250508184455215.webp)
### Compute Capability
A version number exposed to software that identifies the features and specifications of the hardware.
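In CUDA the compute capability can be queried at runtime; a minimal sketch using the runtime API:
```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // Roughly: 1.x = Tesla (G80/GT200), 2.x = Fermi, 3.x = Kepler, 5.x = Maxwell,
    //          6.x = Pascal, 7.0 = Volta, 8.x = Ampere, 9.0 = Hopper.
    printf("Compute capability %d.%d, %d SMs, %zu bytes of shared memory per block\n",
           prop.major, prop.minor, prop.multiProcessorCount, prop.sharedMemPerBlock);
    return 0;
}
```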
## G80 Memory Hierarchy
### Memory Space
Each thread can (see the sketch below the figure):
- Read and write per-thread registers.
- Read and write per-thread local memory.
- Read and write per-block shared memory.
- Read and write per-grid global memory.
- Read per-grid constant memory (read-only).
- Read per-grid texture memory (read-only).
![image-20250508185236920](./hpc-2025-cuda/image-20250508185236920.webp)
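A sketch of how these memory spaces map to CUDA declarations; the variable names are made up and the kernel assumes 256-thread blocks.
```cuda
__constant__ float coeff[16];        // per-grid constant memory, read-only in kernels
__device__   float global_scale;     // statically declared global memory

__global__ void example(const float *global_in, float *global_out) {
    __shared__ float tile[256];            // per-block shared memory
    float local = global_in[threadIdx.x];  // automatic variable: per-thread register
                                           // (spills to per-thread local memory if registers run out)
    tile[threadIdx.x] = local * coeff[threadIdx.x % 16] * global_scale;
    __syncthreads();
    global_out[threadIdx.x] = tile[255 - threadIdx.x];   // per-grid global memory write
}
```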
Parallel memory sharing:
- Local memory is per-thread, used mainly for automatic variables and register spills.
- Shared memory is per-block and can be used for inter-thread communication.
- Global memory is per-application and can be used for inter-grid communication.
### SM Memory Architecture
![image-20250508185812302](./hpc-2025-cuda/image-20250508185812302.webp)
Threads in a block share data and results through shared memory.
Shared memory is dynamically allocated to blocks and is one of the limiting resources.
### SM Register File
Register File (RF): the G80 has 32 KB of registers (8,192 entries) per SM.
The texture (TEX) pipeline and the load/store pipeline can also read and write the register file.
Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is **not** accessible by threads in other blocks, and each thread can only access the registers assigned to itself.
For the matrix multiplication example (generalized in the formula after this list):
- If each thread uses 10 registers and a block has 16 × 16 threads, each SM can hold three blocks, since one block needs 16 × 16 × 10 = 2,560 registers and 3 × 2,560 = 7,680 < 8,192.
- But if each thread needs 11 registers, one SM can only hold two blocks at a time, since 3 × 2,816 = 8,448 > 8,192.
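Generalizing, for a 16 × 16 block and $R$ registers per thread on the G80:
$$
\text{blocks per SM} \;=\; \min\!\left(8,\ \left\lfloor \frac{8192}{16 \times 16 \times R} \right\rfloor,\ \left\lfloor \frac{768}{16 \times 16} \right\rfloor\right)
\;\Rightarrow\; R = 10 \text{ gives } 3 \text{ blocks},\quad R = 11 \text{ gives } 2.
$$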
More on dynamic partitioning: it gives compilers and programmers the flexibility to choose between
1. a smaller number of threads that each require many registers, and
2. a larger number of threads that each require few registers.
So there is a tradeoff between instruction-level parallelism and thread-level parallelism.
### Parallel Memory Architecture
In a parallel machine, many threads access memory at once, so memory is divided into banks to achieve high bandwidth.
Each bank can service one address per cycle; multiple simultaneous accesses to the same bank result in a bank conflict.
Shared memory bank conflicts (a conflict-free example follows this list):
- The fast cases:
  - All threads of a half-warp access different banks: no bank conflict.
  - All threads of a half-warp access the identical address: no bank conflict (the value is broadcast).
- The slow case:
  - Multiple threads in the same half-warp access the same bank (the accesses are serialized).
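A classic way to avoid the slow case is to pad the shared-memory tile so that column accesses land in different banks. A sketch of a tile transpose, assuming 32 × 32 thread blocks and a matrix width that is a multiple of the tile size (current GPUs have 32 banks; the same idea applies to the 16-bank half-warp case):
```cuda
#define TILE 32

__global__ void transpose_tile(float *out, const float *in, int width) {
    // The +1 padding shifts each row by one bank, so reading a column in the
    // second phase no longer hits the same bank once per thread.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // row-wise: conflict-free

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise: conflict-free thanks to padding
}
```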
## Memory in Later Generations
### Fermi Architecture
**Unified Addressing Model** allows local, shared and global memory access using the same address space.
![image-20250508193756274](./hpc-2025-cuda/image-20250508193756274.webp)
**Configurable Caches** allow programmers to configure the split between the L1 cache and shared memory.
The L1 cache works as a counterpart to shared memory (see the sketch below):
- Shared memory improves memory access for algorithms with well-defined access patterns.
- The L1 cache improves memory access for irregular algorithms whose data addresses are not known beforehand.
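In CUDA this split can be requested per kernel through the runtime API; a sketch with hypothetical kernel names:
```cuda
#include <cuda_runtime.h>

__global__ void stencil_kernel(float *out, const float *in, int n);                // hypothetical
__global__ void gather_kernel(float *out, const float *in, const int *idx, int n); // hypothetical

void configure_caches() {
    // Regular, well-understood reuse: favor shared memory (e.g. 48 KB shared / 16 KB L1 on Fermi).
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // Irregular, data-dependent addresses: favor L1 (e.g. 48 KB L1 / 16 KB shared).
    cudaFuncSetCacheConfig(gather_kernel, cudaFuncCachePreferL1);
}
```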
### Pascal Architecture
**High Bandwidth Memory**: a technology which enables multiple layers of DRAM components to be integrated vertically on the package along with the GPU.
![image-20250508194350572](./hpc-2025-cuda/image-20250508194350572.webp)
**Unified Memory** provides a single, unified virtual address space for accessing all CPU and GPU memory in the system.
With it, the CUDA system software no longer needs to synchronize all managed memory allocations to the GPU before each kernel launch; this is enabled by **memory page faulting**.
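A minimal unified-memory sketch: one allocation is touched by the CPU and then by a kernel, with pages migrating on demand instead of being copied up front.
```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // one pointer, valid on both CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // first touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(x, n);      // pages fault over to the GPU on demand (Pascal+)
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```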
## Advanced GPU Features
### GigaThread
The GigaThread engine enables concurrent kernel execution:
![image-20250508195840957](./hpc-2025-cuda/image-20250508195840957.webp)
It also provides dual **Streaming Data Transfer** engines for streaming data transfer, i.e. direct memory access (DMA); a usage sketch follows the figure.
![image-20250508195938546](./hpc-2025-cuda/image-20250508195938546.webp)
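A sketch of how the dual copy engines are used from CUDA: with two streams and pinned host memory, the upload of one chunk, the kernel on the previous chunk, and the download of the chunk before that can all be in flight at once. Kernel and buffer names are made up.
```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n);  // hypothetical kernel

void pipeline(float *h_in, float *h_out, float *d_buf[2], int chunk, int chunks) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int c = 0; c < chunks; ++c) {
        float *d = d_buf[c & 1];
        cudaStream_t st = s[c & 1];
        cudaMemcpyAsync(d, h_in + (size_t)c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);            // DMA engine: host -> device
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk); // SMs compute
        cudaMemcpyAsync(h_out + (size_t)c * chunk, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);            // DMA engine: device -> host
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```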
### GPUDirect
![image-20250508200041910](./hpc-2025-cuda/image-20250508200041910.webp)
### GPU Boost
GPU Boost works through real-time hardware monitoring, as opposed to application-based profiles, and tries to find the appropriate GPU frequency and voltage for a given moment in time.
### SMX Architectural Details
Each unit contains four warp schedulers.
Scheduling functions:
- Register scoreboard for long latency operations.
- Inter-warp scheduling decisions.
- Thread block level scheduling.
### Improving Programmability
![image-20250515183524043](./hpc-2025-cuda/image-20250515183524043.webp)
**Dynamic Parallelism**: the ability to launch new grids from the GPU.
This enables data-dependent parallelism, dynamic work generation, and even batched and nested parallelism (a sketch follows the figures below).
CPU-controlled work batching:
- The CPU program is limited by a single point of control.
- It can run at most tens of threads.
- The CPU is fully consumed with controlling launches.
![](./hpc-2025-cuda/image-20250515184225475.webp)
Batching via dynamic parallelism:
- Move top-level loops to GPUs.
- Run thousands of independent tasks.
- Release CPU for other work.
![image-20250515184621914](./hpc-2025-cuda/image-20250515184621914.webp)
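A sketch of dynamic parallelism (compute capability 3.5+, compiled with `-rdc=true`); the parent grid launches data-dependent child grids without returning to the CPU. All names are hypothetical.
```cuda
__global__ void child_kernel(const float *task, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { /* process one element of this task */ }
}

__global__ void parent_kernel(float *const *tasks, const int *task_sizes, int num_tasks) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < num_tasks) {
        int n = task_sizes[t];
        // Each GPU thread generates work for its own task; the child grid's size
        // depends on data that only becomes known on the device.
        child_kernel<<<(n + 255) / 256, 256>>>(tasks[t], n);
    }
}
```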
### Grid Management Unit
![image-20250515184714663](./hpc-2025-cuda/image-20250515184714663.webp)
Fermi Concurrency:
- Up to 16 grids can run at once.
- But CUDA streams multiplex into a single queue.
- Overlap is only possible at stream edges.
Kepler Improved Concurrency:
- Up to 32 grids can run at once.
- One work queue per stream.
- Concurrency at full-stream level.
- No inter-stream dependencies.
This capability is called **Hyper-Q**.
Without Hyper-Q:
![image-20250515185019590](./hpc-2025-cuda/image-20250515185019590.webp)
With Hyper-Q:
![image-20250515185034758](./hpc-2025-cuda/image-20250515185034758.webp)
In Pascal, **asynchronous concurrent computing** is introduced.
![image-20250515185801775](./hpc-2025-cuda/image-20250515185801775.webp)
### NVLink: High-Speed Node Network
![image-20250515185212184](./hpc-2025-cuda/image-20250515185212184.webp)
> The *consumer* prefix means the product is designed for gamers.
>
> The *big* prefix means the product is designed for HPC.
### Preemption
Pascal can actually preempt at the lowest level, the instruction level.
![image-20250515190244112](./hpc-2025-cuda/image-20250515190244112.webp)
### Tensor Core
Each Tensor Core operates on 4 × 4 matrices and performs D = A × B + C.
![image-20250515190507199](./hpc-2025-cuda/image-20250515190507199.webp)
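The hardware unit works on 4 × 4 tiles, but CUDA exposes it through the WMMA API at warp granularity on 16 × 16 × 16 fragments. A sketch, assuming Volta or newer, FP16 inputs with FP32 accumulation:
```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A * B + C on the tensor cores.
__global__ void wmma_16x16(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // D = A * B + C
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```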
### GPU Multi-Process Scheduling
- Timeslice scheduling: single-process throughput optimization.
- Multi-Process Service: multi-process throughput optimization.
Multi-process time slicing looks like this:
![image-20250515190703918](./hpc-2025-cuda/image-20250515190703918.webp)
Volta introduces the Multi-Process Service:
![image-20250515191142384](./hpc-2025-cuda/image-20250515191142384.webp)

Binary file not shown

View File

@ -0,0 +1,10 @@
---
title: High Performance Computing 25 SP Quantum Computing
date: 2025-06-12T19:26:24.6668760+08:00
tags:
- 高性能计算
- 学习资料
---
<!--more-->

View File

@ -0,0 +1,78 @@
---
title: High Performance Computing 25 SP Potpourri
date: 2025-06-12T18:45:49.2698190+08:00
tags:
- 高性能计算
- 学习资料
---
Potpourri has a good taste.
<!--more-->
## Heterogeneous System Architecture
![image-20250612185019968](./hpc-2025-heterogeneous-system-architecture/image-20250612185019968.webp)
The goals of the HSA:
- Enable power efficient performance.
- Improve programmability of heterogeneous processors.
- Increase the portability of code across processors and platforms.
- Increase the pervasiveness of heterogeneous solutions.
### The Runtime Stack
![image-20250612185221643](./hpc-2025-heterogeneous-system-architecture/image-20250612185221643.webp)
## Accelerated Processing Unit
A processor that combines the CPU and the GPU elements into a single architecture.
![image-20250612185743675](./hpc-2025-heterogeneous-system-architecture/image-20250612185743675.webp)
## Intel Xeon Phi
The goal:
- Leverage X86 architecture and existing X86 programming models.
- Dedicate much of the silicon to floating point ops.
- Cache coherent.
- Increase floating-point throughput.
- Strip expensive features.
The reality:
- Tens of x86-based cores.
- Very high-bandwidth local GDDR5 memory.
- The card runs a modified embedded Linux.
## Deep Learning: Deep Neural Networks
?
## Tensor Processing Unit
A custom ASIC (AI accelerator) for the inference phase of neural networks.
### TPUv1 Architecture
![image-20250612191035632](./hpc-2025-heterogeneous-system-architecture/image-20250612191035632.webp)
### TPUv2 Architecture
![image-20250612191118473](./hpc-2025-heterogeneous-system-architecture/image-20250612191118473.webp)
Advantages of TPUs:
- They make predictions very quickly and respond within a fraction of a second.
- They accelerate linear algebra computation, the core of machine learning applications.
- They minimize the time-to-accuracy when training large and complex network models.
Disadvantages of TPUs:
- Workloads that require heavy branching, or that are not based on element-wise linear algebra, perform poorly.
- Workloads not dominated by matrix multiplication are not likely to perform well on TPUs.
- Workloads that access memory in a sparse manner are a poor fit.
- Workloads that require high-precision arithmetic are a poor fit.

View File

@ -0,0 +1,99 @@
---
title: High Performance Computing 2025 SP OpenCL Programming
date: 2025-05-29T18:29:14.8444660+08:00
tags:
- 高性能计算
- 学习资料
---
Open Computing Language.
<!--more-->
OpenCL is the Open Computing Language:
- An open, royalty-free standard C-language extension.
- For parallel programming of heterogeneous systems using GPUs, CPUs, CBE, DSPs and other processors, including embedded mobile devices.
- Managed by the Khronos Group.
![image-20250529185915068](./hpc-2025-opencl/image-20250529185915068.webp)
### Anatomy of OpenCL
- Platform Layer API
- Runtime API
- Language Specification
### Compilation Model
OpenCL uses a dynamic/runtime compilation model, like OpenGL:
1. The code is compiled to an intermediate representation (IR).
2. The IR is compiled to machine code for execution.
In dynamic compilation, *step 1* is usually done once and the IR is stored. The application loads the IR and performs *step 2* at runtime.
### Execution Model
An OpenCL program is divided into:
- Kernels: the basic units of executable code.
- The host program: a collection of compute kernels and internal functions.
The host program invokes a kernel over an index space called an **NDRange** (*N-Dimensional Range*), which can be a 1-, 2- or 3-dimensional space.
A single kernel instance at a point of this index space is called a **work item**. Work items are further grouped into **work groups**.
### OpenCL Memory Model
![image-20250529191215424](./hpc-2025-opencl/image-20250529191215424.webp)
There are multiple distinct address spaces; address spaces can be collapsed depending on the device's memory subsystem.
Address spaces:
- Private: private to a work item.
- Local: local to a work group.
- Global: accessible by all work items in all work groups.
- Constant: read-only global memory.
> Comparison with CUDA:
>
> ![image-20250529191414250](./hpc-2025-opencl/image-20250529191414250.webp)
Memory region for host and kernel:
![image-20250529191512490](./hpc-2025-opencl/image-20250529191512490.webp)
### Programming Model
#### Data Parallel Programming Model
1. Define an N-dimensional computation domain.
2. Work items can be grouped together into *work groups*.
3. Multiple work groups are executed in parallel.
#### Task Parallel Programming Model
> The data-parallel execution model must be implemented by all OpenCL compute devices, but the task-parallel model is optional for vendors.
Some compute devices such as CPUs can also execute task-parallel compute kernels. A task-parallel kernel:
- executes as a single work item;
- can be a compute kernel written in OpenCL;
- or a native function.
### OpenCL Framework
![image-20250529192022613](./hpc-2025-opencl/image-20250529192022613.webp)
The basic OpenCL program structure:
![image-20250529192056388](./hpc-2025-opencl/image-20250529192056388.webp)
**Contexts** are used to contain and manage the state of the *world*.
A **command queue** coordinates execution of the kernels.

Binary file not shown

View File

@ -0,0 +1,41 @@
---
title: High Performance Computing 2025 SP Programming CUDA
date: 2025-05-15T19:13:48.8893010+08:00
tags:
- 高性能计算
- 学习资料
---
Compute Unified Device Architecture
<!--more-->
## CUDA
General purpose programming model:
- The user kicks off batches of threads on the GPU.
![image-20250515195739382](./hpc-2025-program-cuda/image-20250515195739382.webp)
Compiling C-with-CUDA applications:
![image-20250515195907764](./hpc-2025-program-cuda/image-20250515195907764.webp)
### CUDA APIs
Areas:
- Device management
- Context management
- Memory management
- Code module management
- Execution control
- Texture reference management
- Interoperability with OpenGL and Direct3D
Two APIs:
- A low-level API called the CUDA driver API.
- A higher-level API, the C runtime for CUDA, implemented on top of the CUDA driver API (a minimal sketch using it follows).
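A minimal sketch of the higher-level runtime API in action, the canonical vector add:
```cuda
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // the host kicks off a batch of GPU threads
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```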

Binary file not shown

View File

@ -0,0 +1,41 @@
---
title: High Performance Computing 2025 SP Stored Program Computing
date: 2025-05-29T18:29:28.6155560+08:00
tags:
- 高性能计算
- 学习资料
---
No Von Neumann Machines.
<!--more-->
## Application-Specific Integrated Circuits
Also known as **ASICs**, these pieces of hardware work on their own and are not von Neumann machines.
No stored program concept:
- Input data come in
- Pass through all circuit gates quickly
- Generate output results immediately
Advantages: better performance.
Disadvantages: worse reusability.
> The CPU and GPU are special kinds of ASIC.
Why we need ASICs in computing:
- As an alternative to Moore's law.
- High capacity and high speed.
![image-20250605185212740](./hpc-2025-stored-program-computing/image-20250605185212740.webp)
## Field-Programmable Gate Array
![image-20250612184120333](./hpc-2025-stored-program-computing/image-20250612184120333.webp)

Binary file not shown