With the rapid growth of GPU computing use cases, the demand for graphics processing units (GPUs) has surged. Demand has been so high that shortages are now common.
In this guide, we’ll take an in-depth look at the GPU architecture, specifically the Nvidia GPU architecture and CUDA parallel computing platform, to help you understand how GPUs work and why they’re an ideal fit for so many modern applications.
What is a GPU?
A GPU, or graphics processing unit, is a specialized electronic circuit designed to rapidly process and manipulate memory to accelerate the creation of digital images. Where CPUs have a few larger cores, GPUs have hundreds or thousands of smaller cores (depending on the model and intended application), so GPU architecture is optimized for parallel processing. As a result, GPUs can handle many tasks simultaneously and are faster at graphical and mathematical workloads.
What is GPU architecture?
GPU architecture is everything that gives GPUs their functionality and unique capabilities. It includes the core computational units, memory, caches, rendering pipelines, and interconnects. GPU architecture has evolved over time, improving and expanding the functionality and efficiency of GPUs.
GPU architecture vs CPU
As discussed in the GPU vs CPU: Key Differences guide, GPU architecture emphasizes throughput via parallelism, while CPU architecture focuses on low-latency sequential execution and flexibility. GPUs specialize in rapid graphical and math operations and have thousands of smaller cores, while CPUs are more generalized with a few larger cores.
Basic GPU architecture explained
A GPU uses many lightweight processing cores, leverages data parallelism, and has high memory throughput. While the specific GPU architecture components vary by model, fundamentally most modern GPUs use single instruction multiple data (SIMD) stream architecture. To understand what that means -- and why it matters -- let’s take a look at Flynn’s Taxonomy.
What is Flynn’s Taxonomy?
Flynn’s Taxonomy is a categorization of computer architectures by Stanford University’s Michael J. Flynn. The basic idea behind Flynn’s Taxonomy is simple: computations consist of 2 streams (an instruction stream and a data stream), and each can be processed in sequence (1 stream at a time) or in parallel (multiple streams at once). Two types of stream with two possible ways to process each leads to the 4 different categories in Flynn’s Taxonomy. Let’s take a look at each.
Single Instruction Single Data (SISD)
SISD stream is an architecture where a single instruction stream (e.g. a program) executes on one data stream. This architecture is used in older computers with a single-core processor, as well as many simple compute devices.
Single Instruction Multiple Data (SIMD)
A SIMD stream architecture has a single control processor and instruction memory, so only one instruction can be run at any given point in time. That single instruction is copied and run across each core at the same time. This is possible because each processor has its own dedicated memory, which allows for parallelism at the data level (a.k.a. “data parallelism”).
The fundamental advantage of SIMD is that data parallelism allows it to execute computations quickly (multiple processors doing the same thing) and efficiently (only one instruction unit).
Multiple Instruction Single Data (MISD)
MISD stream architecture is effectively the reverse of SIMD architecture. With MISD multiple instructions are performed on the same data stream. The use cases for MISD are very limited today. Most practical applications are better addressed by one of the other architectures.
Multiple Instruction Multiple Data (MIMD)
MIMD stream architecture offers parallelism for both data and instruction streams. With MIMD, multiple processors execute instruction streams independently against different data streams.
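The four categories follow directly from whether there is one or more than one stream of each type. As a minimal sketch (the function name `flynn_category` is hypothetical and used only for illustration), the classification can be written as:

```python
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    """Classify an architecture by its number of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # "S" = single stream, "M" = multiple streams
    instruction_part = "SI" if instruction_streams == 1 else "MI"
    data_part = "SD" if data_streams == 1 else "MD"
    return instruction_part + data_part


print(flynn_category(1, 1))   # one instruction stream, one data stream: SISD
print(flynn_category(1, 32))  # one instruction stream, many data streams: SIMD
print(flynn_category(4, 1))   # many instruction streams, one data stream: MISD
print(flynn_category(4, 4))   # many of both: MIMD
```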
What makes SIMD best for GPUs?
Now that we understand the different architectures, let’s consider why SIMD is the best choice for GPUs. The answer becomes intuitive when you understand that fundamentally graphics processing -- and many other common GPU computing use cases -- are simply running the same mathematical function over and over again at scale. In this case, many processors running the same instruction on multiple data sets is ideal.
Case in point: adjusting video brightness of a pixel relies on simple arithmetic using RGB (red green blue) values. Executing the same function multiple times is what’s needed to produce the desired result, and SIMD is ideal for that use case. Conversely, MIMD is most effective in applications that call for multiple discrete computations to be executed such as computer-aided design (CAD).
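The brightness example can be sketched in a few lines of Python. This is an illustration only: the function names and the sample pixel values are assumptions, and an ordinary loop stands in for the parallel hardware. The point is that one operation (`brighten`) is applied unchanged to every piece of data.

```python
def brighten(channel: int, offset: int) -> int:
    """The single 'instruction': add an offset and clamp to the 0-255 range."""
    return max(0, min(255, channel + offset))


def adjust_brightness(pixels, offset):
    """Apply the same brighten() operation to every channel of every pixel.

    On a GPU, each pixel could be handled by its own core at the same time.
    Here a plain Python loop stands in for that parallel execution.
    """
    return [tuple(brighten(c, offset) for c in pixel) for pixel in pixels]


# Three hypothetical RGB pixels
image = [(10, 20, 30), (200, 220, 250), (0, 128, 255)]
print(adjust_brightness(image, 40))
# [(50, 60, 70), (240, 255, 255), (40, 168, 255)]
```

Because no pixel depends on the result of any other pixel, the work can be split across as many processors as are available.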
What about SIMT?
If you’re familiar with GPUs, you’ve likely heard the term single instruction multiple threads (SIMT). So where does SIMT fit into Flynn’s Taxonomy? SIMT can be viewed as an extension of SIMD. It adds multithreading to SIMD which improves efficiency as there is less instruction fetching overhead.
Next, we’ll look at how Nvidia’s CUDA toolkit has enabled developers to use GPUs without specialized graphics programming knowledge and explain the CUDA GPU architecture.
CUDA parallel computing platform
Our next step in understanding GPU architecture leads us to Nvidia's popular Compute Unified Device Architecture (CUDA) parallel computing platform. By providing an API that enables developers to optimize how GPU resources are used -- without the need for specialized graphics programming knowledge -- CUDA has gone a long way in making GPUs useful for general purpose computing.
Here, we’ll take a look at key CUDA concepts as they relate to GPU architecture.
CUDA compute hierarchy
The processing resources in CUDA are designed to help optimize performance for GPU use cases. Three of the fundamental components of the hierarchy are threads, thread blocks, and kernel grids.
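A thread is the smallest unit of execution in CUDA. Threads run on CUDA cores, the parallel processors that compute floating point math calculations in an Nvidia GPU. The two terms are often used interchangeably, although strictly speaking a thread is a software construct and a CUDA core is hardware. All the data processed by a GPU is processed via a CUDA core. Modern GPUs have hundreds or even thousands of CUDA cores. Each thread has its own registers that are not available to other threads.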
While the relationship between compute power and CUDA cores is not perfectly linear, generally speaking -- and assuming all else is equal -- the more CUDA cores a GPU has, the more compute power it has. However, there are a variety of exceptions to this general idea. For example, different GPU microarchitectures can impact performance and make a GPU with fewer CUDA cores more powerful.
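One rough way to see why core count alone does not decide performance is a back-of-the-envelope estimate of theoretical peak throughput. The sketch below assumes the commonly used approximation of cores × clock speed × floating point operations per core per cycle (2 for a fused multiply-add). Both GPUs and their specifications are hypothetical, and real-world performance also depends on memory bandwidth, caches, and the microarchitecture itself.

```python
def peak_gflops(cuda_cores: int, clock_ghz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Rough theoretical peak in GFLOPS (assumed formula, not a benchmark)."""
    return cuda_cores * clock_ghz * flops_per_core_per_cycle


# Two hypothetical GPUs: B has fewer cores but a higher clock speed
gpu_a = peak_gflops(cuda_cores=2048, clock_ghz=1.0)
gpu_b = peak_gflops(cuda_cores=1536, clock_ghz=1.5)

print(gpu_a)  # 4096.0
print(gpu_b)  # 4608.0
```

In this made-up comparison, the GPU with 25% fewer cores still comes out ahead on paper.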
As the name implies, a thread block -- or CUDA block -- is a grouping of threads that can be executed together in series or in parallel. This logical grouping enables more efficient data mapping. Current CUDA architecture caps the number of threads per block at 1024. Every thread in a given CUDA block can access the same shared memory (more on the different types of memory below).
The next layer of abstraction up from thread blocks is the kernel grid. Kernel grids are groupings of thread blocks on the same kernel. Grids can be used to perform larger computations in parallel (e.g. those that require more than 1024 threads). However, because different thread blocks cannot use the same shared memory, the synchronization that occurs at the block level does not occur at the grid level.
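To make the hierarchy concrete, here is a minimal Python simulation of a one-dimensional grid. It is a sketch, not CUDA code: the function names are hypothetical, and nested loops stand in for threads that would run in parallel on a GPU. The index calculation mirrors the expression commonly used in CUDA kernels, `blockIdx.x * blockDim.x + threadIdx.x`.

```python
MAX_THREADS_PER_BLOCK = 1024  # per-block cap mentioned in the text


def blocks_needed(n_elements: int, threads_per_block: int) -> int:
    """Number of thread blocks needed to cover n_elements (ceiling division)."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    return (n_elements + threads_per_block - 1) // threads_per_block


def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Position of a thread within the whole grid."""
    return block_idx * block_dim + thread_idx


def launch(n_elements: int, threads_per_block: int):
    """Simulate a 1D grid launch and return the element index each thread handles."""
    handled = []
    for block_idx in range(blocks_needed(n_elements, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = global_index(block_idx, threads_per_block, thread_idx)
            if i < n_elements:  # threads past the end of the data do nothing
                handled.append(i)
    return handled


print(blocks_needed(3000, 1024))  # 3 blocks are needed for 3000 elements
print(global_index(2, 1024, 5))   # 2053
print(launch(10, 4))              # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The bounds check matters because the last block usually has more threads than there are elements left to process.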
CUDA memory hierarchy
Like compute resources, memory allocation follows a specific hierarchy in CUDA. While the CUDA compiler automatically handles memory allocation, CUDA developers can and do program to optimize memory usage directly. Here are the key concepts to understand about the CUDA memory hierarchy.
Registers are the memory that gets allocated to individual threads (CUDA cores). Because registers exist in “on-chip” memory and are dedicated to individual threads, the data stored in a register can be processed faster than any other data. The allocation of memory in registers is a complicated process and is handled by compilers as opposed to being controlled by software CUDA developers write.
Read-only (RO) memory is on-chip memory on GPU streaming multiprocessors. It is used for specific tasks such as texture memory, which can be accessed using CUDA texture functions. In many cases, fetching data from read-only memory can be faster and more efficient than using global memory.
L1 Cache/shared memory
Level 1 (L1) cache and shared memory are on-chip memory shared within thread blocks (CUDA blocks). Because they exist on-chip, they are faster than both L2 cache and global memory. The fundamental difference between the two is that shared memory usage is controlled via software, while L1 cache is controlled by hardware.
Level 2 (L2) cache can be accessed by all threads in all CUDA blocks. L2 cache holds data from both global and local memory. Retrieving data from L2 cache is faster than retrieving data from global memory.
Global memory is the memory that resides in a device’s DRAM. Using a CPU analogy, global memory is comparable to RAM. Fetching data from global memory is inherently slower than fetching it from L2 cache.
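The interplay between shared and global memory can be sketched with a block-wise sum. This is a plain Python simulation under stated assumptions: the function names are hypothetical, lists stand in for memory, and loops stand in for parallel threads. Each block copies its slice of global memory into its own shared buffer, reduces it there, and only the per-block results are combined at the grid level.

```python
def block_sum(global_data, block_idx: int, block_dim: int) -> int:
    """Work done by one thread block. block_dim is assumed to be a power of two."""
    # Each block loads its own slice of global memory into a private shared buffer.
    start = block_idx * block_dim
    shared = list(global_data[start:start + block_dim])
    shared += [0] * (block_dim - len(shared))  # pad the last, partially filled block

    # Tree reduction inside the block. On a GPU, the threads of a block would
    # synchronize between steps (e.g. with __syncthreads() in CUDA).
    stride = block_dim // 2
    while stride > 0:
        for thread_idx in range(stride):
            shared[thread_idx] += shared[thread_idx + stride]
        stride //= 2
    return shared[0]


def grid_sum(global_data, block_dim: int = 4) -> int:
    """Combine per-block results. Blocks cannot see each other's shared memory,
    so each partial sum is written out and added up in a separate step."""
    n_blocks = (len(global_data) + block_dim - 1) // block_dim
    partial_sums = [block_sum(global_data, b, block_dim) for b in range(n_blocks)]
    return sum(partial_sums)


data = list(range(10))
print(block_sum(data, 0, 4))  # 6, the sum of [0, 1, 2, 3]
print(grid_sum(data, 4))      # 45
```

Most of the additions happen in the fast per-block buffer, and only one value per block has to travel back through the slower levels of the hierarchy.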
A brief history of Nvidia GPU Architecture
While Nvidia GPUs have certainly made the news more frequently in recent years, they’re by no means new. In fact, there have been multiple iterations of Nvidia GPUs and advances in GPU architecture over the years. So, let’s take a look back at recent history to understand how GPUs have evolved over time. We’ll do that by exploring each of the popular Nvidia GPU microarchitectures released since the year 2000.
Released in 2001, Kelvin was Nvidia’s first new GPU microarchitecture of the millennium. The original Xbox gaming console used an NV2A GPU with the Kelvin microarchitecture. The GeForce 3 and GeForce 4 series GPUs were released with this microarchitecture.
Rankine was the follow-up to Kelvin released in 2003 and used for the GeForce 5 series of Nvidia GPUs. Rankine had support for vertex and fragment programs and increased VRAM size to 256MB.
Curie -- the microarchitecture used by GeForce 6 and 7 series GPUs -- was released as the successor to Rankine in 2004. Curie doubled the amount of VRAM to 512MB and was the first generation of Nvidia GPUs to support PureVideo video decoding.
The Tesla GPU microarchitecture, released in 2006 as Curie’s successor, introduced several important changes to Nvidia’s GPU product line. In addition to being the architecture used by the GeForce 8, 9, 100, 200, and 300 series GPUs, Tesla was used by the Quadro line of GPUs designed for use cases outside of graphics processing.
Confusingly, Tesla was both the name of a GPU microarchitecture and a brand of Nvidia GPUs. In 2020, Nvidia decided to stop using the Tesla name to avoid confusion with the popular electric vehicle brand.
Tesla’s successor Fermi was released in 2010. Fermi introduced a number of enhancements including:
- Support for 512 CUDA cores
- 64KB of configurable on-chip memory per streaming multiprocessor, which can be partitioned between L1 cache and shared memory
- Support for Error Correcting Code (ECC)
The Kepler GPU microarchitecture was released as the successor to Fermi in 2012. Key improvements over Fermi were:
- A new streaming multiprocessor architecture known as SMX
- Support for TXAA (an anti-aliasing method)
- An increase in CUDA cores to 1536
- Less power consumption
- Support for automatic overclocking via GPU boost
- Support for GPUDirect, which allowed GPUs (either in the same computer or with network access to one another) to communicate without accessing the CPU
Maxwell, released in 2014, was the successor to Kepler. According to Nvidia, the first generation of Maxwell GPUs had these advantages over Kepler:
- More efficient multiprocessors as a result of enhancements related to control logic partitioning, clock-gating, instruction scheduling, and workload balancing
- 64KB of dedicated shared memory on each streaming multiprocessor
- Native shared memory atomic operations that offered performance improvements when compared to the lock/unlock paradigm used by Fermi
- Dynamic parallelism support
Pascal succeeded Maxwell in 2016. This Nvidia GPU microarchitecture offered improvements over Maxwell such as:
- Support for NVLink communications, which can offer a significant speed advantage over PCIe
- High Bandwidth Memory 2 (HBM2) with a 4096-bit memory bus, offering memory bandwidth of up to 720 GB/s
- Compute preemption
- Dynamic load balancing to enable optimization of GPU resource utilization
Volta was a somewhat unique microarchitecture iteration released in 2017. While most previous microarchitectures were used in consumer GPUs, Volta GPUs were marketed strictly for professional applications. Volta was also the first microarchitecture to use Tensor Cores.
Tensor Cores are a newer type of processing core that perform specialized math calculations. Specifically, Tensor Cores perform matrix operations that enable AI and deep learning use cases.
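The matrix operation in question is usually described as a fused multiply-accumulate, D = A × B + C, performed on small matrix tiles. The sketch below shows that calculation in plain Python. The function name and the 2×2 tile size are assumptions chosen for readability, and the code says nothing about how the hardware actually implements it.

```python
def matmul_accumulate(a, b, c):
    """Return d = a x b + c for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j] for j in range(cols)]
        for i in range(rows)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(matmul_accumulate(a, b, c))  # [[20, 23], [44, 51]]
```

Neural network layers spend most of their time in exactly this kind of multiply-and-add over matrices, which is why dedicated hardware for it benefits AI and deep learning workloads.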
Turing was released in 2018 and, in addition to supporting Tensor Cores, includes a number of consumer-focused GPUs. Turing is the microarchitecture used by Nvidia’s popular Quadro RTX and GeForce RTX series GPUs. These GPUs support real-time ray tracing (a.k.a. RTX), which is vital to computationally heavy applications such as virtual reality (VR).
The Ampere GPU microarchitecture is just beginning to hit the market. Ampere aims to further enable high-performance computing (HPC) and AI use cases. Enhancements in Ampere include 3rd generation NVLink and Tensor Cores, structural sparsity (the conversion of unneeded parameters to zeros to enable AI model training), 2nd generation ray tracing cores, and multi-instance GPU (MIG), which enables partitioning of A100 GPUs into individual logically isolated and secure GPU instances.
We hope you enjoyed our overview of GPU architecture and how it has evolved. Here at Cherry Servers, we’re passionate about the future of HPC and the use cases that the next generation of GPUs will enable. We are industry leaders in bare metal cloud and experts in helping businesses get the most out of their compute resources. If you’re looking for a cloud-based HPC solution, do not hesitate to contact us today.