feat: a lot of notebooks.
This commit is contained in:
parent
457316971c
commit
3053b83bbf
|
@ -116,7 +116,7 @@ And in Fermi, the debut the Parallel Thread eXecution(PTX) 2.0 ISA.
|
|||
|
||||
### 2016 Pascal
|
||||
|
||||
No thing to propose.
|
||||
No thing to pay attention to.
|
||||
|
||||
### 2017 Volta
|
||||
|
||||
|
|
239
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing.md
Normal file
239
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing.md
Normal file
|
@ -0,0 +1,239 @@
|
|||
---
|
||||
title: High Performance Computing 2025 SP Non Stored Program Computing
|
||||
date: 2025-05-29T18:29:28.6155560+08:00
|
||||
tags:
|
||||
- 高性能计算
|
||||
- 学习资料
|
||||
---
|
||||
|
||||
No Von Neumann Machines.
|
||||
|
||||
<!--more-->
|
||||
|
||||
## Application Specified Integrated Circuits
|
||||
|
||||
As known as **ASIC**, these hardwares can work along and are not von Neumann machines.
|
||||
|
||||
No stored program concept:
|
||||
|
||||
- Input data come in
|
||||
- Pass through all circuit gates quickly
|
||||
- Generate output results immediately
|
||||
|
||||
Advantages: performance is better.
|
||||
|
||||
Disadvantages: reusability is worse.
|
||||
|
||||
> The CPU and GPU are special kinds of ASIC.
|
||||
|
||||
Why we need ASIC in computing:
|
||||
|
||||
- Alternatives to the Moore'a law.
|
||||
- High capacity and high speed.
|
||||
|
||||

|
||||
|
||||
### Full Custom ASICs
|
||||
|
||||
All mask layers are customized in a full-custom ASICs.
|
||||
|
||||
The full-custom ASICs always can offer the highest performance and lowest part cost (smallest die size) for a given design.
|
||||
|
||||
A typical example of full-custom ASICs is the CPU.
|
||||
|
||||
The advantages and disadvantages of full-custom ASICs is shown below.
|
||||
|
||||
| Advantages | Disadvantages |
|
||||
| ------------------------------------------------------------ | -------------------------------------------------------- |
|
||||
| Reducing the area | The design process takes a longer time |
|
||||
| Enhancing the performance | Having more complexity in computer-aided design tool |
|
||||
| Better ability of integrating with other analog components and other pre-designed components | Requiring higher investment and skilled human resources. |
|
||||
|
||||
### Semi Custom ASICs
|
||||
|
||||
All the logical cell are predesigned and some or all of the mask layer is customized.
|
||||
|
||||
There are two types of semi-custom ASICs:
|
||||
|
||||
- Standard cell based ASICs
|
||||
- Gate-array based ASICs.
|
||||
|
||||
The Standard cell based ASICs is also called as **Cell-based ASIC(CBIC)**.
|
||||
|
||||

|
||||
|
||||
> The *gate* is used a unit to measure the ability of semiconductor to store logical elements.
|
||||
|
||||
The semi-custom ASICs is developed as:
|
||||
|
||||
- Programmable Logic Array(PLA)
|
||||
- Complex Programmable Logical Device(CPLD)
|
||||
- Programmable Array Logical
|
||||
- Field Programing Gate Array(FPGA)
|
||||
|
||||
#### Programmable Logical Device
|
||||
|
||||
An integrated circuit that can be programmed/reprogrammed with a digital logical of a curtain level.
|
||||
|
||||
The basic idea of PLD is an array of **AND** gates and an array of **OR** gates. Each input feeds both a non-inverting buffer and an inverting buffer to produce the true and inverted forms of each variable. The AND outputs are called the product lines. Each product line is connected to one of the inputs of each OR gate.
|
||||
|
||||
Depending on the structure, the standard PLD can be divided into:
|
||||
|
||||
- Read Only Memory(ROM): A fixed array of AND gates and a programmable array of OR gates.
|
||||
- Programmable Array Logic(PAL): A programmable array of AND gates feeding a fixed array of OR gates.
|
||||
- Programmable Logic Array(PLA): A programmable array of AND gates feeding a programmable of OR gates.
|
||||
- Complex Programmable Logic Device(CPLD) and Field Programmable Gate Array(FPGA): complex enough to be called as *architecture*.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
## Field Programming Gate Array
|
||||
|
||||
> General speaking, all semiconductor can be considered as a special kind of ASIC. But in practice, we always refer the circuit with a special function as ASIC, a circuit that can change the function as FPGA.
|
||||
|
||||

|
||||
|
||||
### FPGA Architecture
|
||||
|
||||

|
||||
|
||||
#### Configurable Logic Block(CLB) Architecture
|
||||
|
||||
The CLB consists of:
|
||||
|
||||
- Look-up Table(LUT): implements the entries of a logic functions truth table.
|
||||
|
||||
And some FPGAs can use the LUTs to implement small random access memory(RAM).
|
||||
|
||||
- Carry and Control Logic: Implements fast arithmetic operation(adders/subtractors).
|
||||
|
||||
- Memory Elements: configures flip flops/latches (programmable clock edges, set, reset and clock enable). These memory elements usually can be configured as shift-registers.
|
||||
|
||||
##### Configuring LUTs
|
||||
|
||||
LUT is a ram with data width of 1 bit and the content is programmed at power up. Internal signals connect to control signals of MUXs to select a values of the truth tables for any given input signals.
|
||||
|
||||
The below figure shows LUT working:
|
||||
|
||||

|
||||
|
||||
The configuration memory holds the output of truth table entries, so that when the FPGA is restarting it will run with the same *program*.
|
||||
|
||||
And as the truth table entries are just bits, the program of FPGA is called as **BITSTREAM**, we download a bitstream to an FPGA and all LUTs will be configured using the BITSTREAM to implement the boolean logic.
|
||||
|
||||
##### LUT Based Ram
|
||||
|
||||
Let the input signal as address, the LUT will be configured as a RAM. Normally, LUT mode performs read operations, the address decoders can generate clock signal to latches for writing operation.
|
||||
|
||||

|
||||
|
||||
#### Routing Architecture
|
||||
|
||||
The logic blocks are connected to each though programmable routing network. And the routing network provides routing connections among logic blocks and I/O blocks to complete a user-designed circuit.
|
||||
|
||||
Horizontal and vertical mesh or wire segments interconnection by programmable switches called programmable interconnect points(PIPs).
|
||||
|
||||

|
||||
|
||||
These PIPs are implemented using a transmission gate controlled by a memory bits from the configuration memory.
|
||||
|
||||
Several types of PIPs are used in the FPGA:
|
||||
|
||||
- Cross-point: connects vertical or horizontal wire segments allowing turns.
|
||||
- Breakpoint: connects or isolates 2 wire segments.
|
||||
- Decoded MUX: groups of cross-points connected to a single output configured by n configuration bits.
|
||||
- Non-decoded MUX: n wire segments each with a configuration bit.
|
||||
- Compound cross-point: 6 breakpoint PIPs and can isolate two isolated signal nets.
|
||||
|
||||

|
||||
|
||||
#### Input/Output Architecture
|
||||
|
||||
The I/O pad and surrounding supporting logical and circuitry are referred as input/input cell.
|
||||
|
||||
The programmable Input/Output cells consists of three parts:
|
||||
|
||||
- Bi-directional buffers
|
||||
- Routing resources.
|
||||
- Programmable I/O voltage and current levels.
|
||||
|
||||

|
||||
|
||||
#### Fine-grained and Coarse-grained Architecture
|
||||
|
||||
The fine-grained architecture:
|
||||
|
||||
- Each logic block can implement a very simple function.
|
||||
- Very efficient in implementing systolic algorithms.
|
||||
- Has a large number of interconnects per logic block than the functionality they offer.
|
||||
|
||||
The coarse-grained architecture:
|
||||
|
||||
- Each logic block is relatively packed with more logic.
|
||||
- Has their logic blocks packed with more functionality.
|
||||
- Has fewer interconnections which leading to reduce the propagating delays encountered.
|
||||
|
||||
#### Interconnect Devices
|
||||
|
||||
FPGAs are based on an array of logic modules and uncommitted wires to route signal.
|
||||
|
||||
Three types of interconnected devices have been commonly used to connect there wires:
|
||||
|
||||
- Static random access memory (SRAM) based
|
||||
- Anti-fuse based
|
||||
- EEPROM based
|
||||
|
||||
### FPGA Design Flow
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
The FPGA configuration techniques contains:
|
||||
|
||||
- Full configuration and read back.
|
||||
- Partial re-configuration and read back.
|
||||
- Compressed configuration.
|
||||
|
||||
Based on the partially reconfiguration, the runtime reconfiguration is development. The area to be reconfigured is changed based on run-time.
|
||||
|
||||
#### Hardware Description Languages(HDL)
|
||||
|
||||
There are three languages targeting FPGAs:
|
||||
|
||||
- VHDL: VHSIC Hardware Description Language.
|
||||
- Verilog
|
||||
- OpenCL
|
||||
|
||||
The first two language are typical HDL:
|
||||
|
||||
| Verilog | VHDL |
|
||||
| -------------------------------------- | ------------------------------- |
|
||||
| Has fixed data types. | Has abstract data types. |
|
||||
| Relatively easy to learn. | Relatively difficult to learn. |
|
||||
| Good gate level timing. | Poor level gate timing. |
|
||||
| Interpreted constructs. | Compiled constructs. |
|
||||
| Limited design reusability. | Good design reusability. |
|
||||
| Doesn't support structure replication. | Supports structure replication. |
|
||||
| Limited design management. | Good design management. |
|
||||
|
||||
The OpenCL is not an traditional hardare description language. And OpenCL needs to turn the thread parallelism into hardware parallelism, called **pipeline parallelism**.
|
||||
|
||||
The follow figure shows how the OpenCL-FPGA compiler turns an vector adding function into the circuit.
|
||||
|
||||

|
||||
|
||||
The compiler generates three stages for this function:
|
||||
|
||||
1. In the first stage, two loading units are used.
|
||||
2. In the second stage, one adding unit is used.
|
||||
3. In the third stage, one storing unit is used.
|
||||
|
||||
Once cycle, the thread `N` is clocked in the first stage, loading values from the array meanwhile, the thread `N - 1` is in the second stage, adding values from the array and the thread `N - 2` is in the third stage, storing value into the target array.
|
||||
|
||||
So different from the CPU and GPU, the OpenCL on the FPGA has two levels of parallelism:
|
||||
|
||||
- Pipelining
|
||||
- Replication of the kernels and having them run concurrently.
|
||||
|
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250815093113115.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250815093113115.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817183832472.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817183832472.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817184419856.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817184419856.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817185111521.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817185111521.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817185859510.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817185859510.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817192006784.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817192006784.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817194355228.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817194355228.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817195139631.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817195139631.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817195714935.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817195714935.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817200350750.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250817200350750.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250829210329225.png
(Stored with Git LFS)
Normal file
BIN
YaeBlog/source/drafts/hpc-2025-non-stored-program-computing/image-20250829210329225.png
(Stored with Git LFS)
Normal file
Binary file not shown.
|
@ -12,7 +12,7 @@ Open Computing Language.
|
|||
|
||||
OpenCL is Open Computing Language.
|
||||
|
||||
- Open, royalty-free standard C-language extension.
|
||||
- Open, royalty-free standard C-language extension.
|
||||
- For parallel programming of heterogeneous systems using GPUs, CPUs , CBE, DSP and other processors including embedded mobile devices.
|
||||
- Managed by Khronos Group.
|
||||
|
||||
|
|
|
@ -12,7 +12,7 @@ Potpourri has a good taste.
|
|||
|
||||
## Heterogeneous System Architecture
|
||||
|
||||

|
||||

|
||||
|
||||
The goals of the HSA:
|
||||
|
||||
|
@ -23,13 +23,13 @@ The goals of the HSA:
|
|||
|
||||
### The Runtime Stack
|
||||
|
||||

|
||||

|
||||
|
||||
## Accelerated Processing Unit
|
||||
|
||||
A processor that combines the CPU and the GPU elements into a single architecture.
|
||||
|
||||

|
||||

|
||||
|
||||
## Intel Xeon Phi
|
||||
|
||||
|
@ -49,7 +49,7 @@ The reality:
|
|||
|
||||
## Deep Learning: Deep Neural Networks
|
||||
|
||||
?
|
||||
The network can used as a computer.
|
||||
|
||||
## Tensor Processing Unit
|
||||
|
||||
|
@ -57,11 +57,11 @@ A custom ASIC for the phase of Neural Networks (AI accelerator).
|
|||
|
||||
### TPUv1 Architecture
|
||||
|
||||

|
||||

|
||||
|
||||
### THUv2 Architecture
|
||||
### TPUv2 Architecture
|
||||
|
||||

|
||||

|
||||
|
||||
Advantages of TPU:
|
||||
|
|
@ -36,6 +36,6 @@ Areas:
|
|||
|
||||
Two APIs:
|
||||
|
||||
- A low-level API called the CUDA driver API
|
||||
- A low-level API called the CUDA driver API.
|
||||
- A higher-level API called the C runtime for CUDA that is implemented on top of the CUDA driver API.
|
||||
|
||||
|
|
|
@ -1,41 +0,0 @@
|
|||
---
|
||||
title: High Performance Computing 2025 SP Stored Program Computing
|
||||
date: 2025-05-29T18:29:28.6155560+08:00
|
||||
tags:
|
||||
- 高性能计算
|
||||
- 学习资料
|
||||
---
|
||||
|
||||
No Von Neumann Machines.
|
||||
|
||||
<!--more-->
|
||||
|
||||
## Application Specified Integrated Circuits
|
||||
|
||||
As known as **ASIC**, these hardwares can work along and are not von Neumann machines.
|
||||
|
||||
No stored program concept:
|
||||
|
||||
- Input data come in
|
||||
- Pass through all circuit gates quickly
|
||||
- Generate output results immediately
|
||||
|
||||
Advantages: performance is better.
|
||||
|
||||
Disadvantages: reusability is worse.
|
||||
|
||||
> The CPU and GPU are special kinds of ASIC.
|
||||
|
||||
Why we need ASIC in computing:
|
||||
|
||||
- Alternatives to the Moore'a law.
|
||||
- High capacity and high speed.
|
||||
|
||||

|
||||
|
||||
## Field Programming Gate Array
|
||||
|
||||
|
||||
|
||||

|
||||
|
Loading…
Reference in New Issue
Block a user