Fermi (microarchitecture)

Fermi is the codename for a graphics processing unit microarchitecture developed by Nvidia, first released to retail in April 2010, as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 series and GeForce 500 series. It was followed by Kepler, and used alongside Kepler in the GeForce 600 series, GeForce 700 series, and GeForce 800 series, in the latter two only in mobile GPUs. In the workstation market, Fermi found use in the Quadro x000 series, Quadro NVS models, as well as in Nvidia Tesla computing modules. All desktop Fermi GPUs were manufactured in 40 nm, mobile Fermi GPUs in 40 nm and 28 nm. Fermi is the oldest microarchitecture from NVIDIA that received support for the Microsoft's rendering API Direct3D 12 feature_level 11.
The architecture is named after Enrico Fermi, an Italian physicist.

Overview

Fermi Graphic Processing Units feature 3.0 billion transistors and a schematic is sketched in Fig. 1.

Streaming Multiprocessor : composed of 32 CUDA cores.
GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution.
Host interface: connects the GPU to the CPU via a PCI-Express v2 bus.
DRAM: supported up to 6GB of GDDR5 DRAM memory thanks to the 64-bit addressing capability.
Clock frequency: 1.5 GHz.
Peak performance: 1.5 TFlops.
Global memory clock: 2 GHz.
DRAM bandwidth: 192GB/s.
Streaming multiprocessor

Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units, a 64KB block of high speed on-chip memory and an interface to the L2 cache.

Load/Store Units

Allow source and destination addresses to be calculated for 16 threads per clock. Load and store the data from/to cache or DRAM.

Special Functions Units (SFUs)

Execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.

CUDA core

Integer Arithmetic Logic Unit :
Supports full 32-bit precision for all instructions, consistent with standard programming language requirements. It is also optimized to efficiently support 64-bit and extended precision operations.

Floating Point Unit (FPU)

Implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add instruction for both single and double precision arithmetic. Up to 16 double precision fused multiply-add operations can be performed per SM, per clock.

Polymorph-Engine

Fused multiply-add

perform multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately.

Warp scheduling

The Fermi architecture uses a two-level, distributed thread scheduler.
Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic Fig. 1. For example, the SM can mix 16 operations from the 16 first column cores with 16 operations from the 16 second column cores, or 16 operations from the load/store units with four from SFUs, or any other combinations the program specifies.
Note that 64-bit floating point operations consumes both the first two execution columns. This implies that an SM can issue up to 32 single-precision floating point operations or 16 double-precision floating point operations at a time.

GigaThread Engine

The GigaThread engine schedules thread blocks to various SMs

Dual Warp Scheduler

At the SM level, each warp scheduler distributes warps of 32 threads to its execution units. Threads are scheduled in groups of 32 threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps, and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs.
Most instructions can be dual issued; two integer instructions, two floating instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently.
Double precision instructions do not support dual dispatch with any other operation.

Performance

The theoretical single-precision processing power of a Fermi GPU in GFLOPS is computed as 2 × number of CUDA cores × shader clock speed. Note that the previous generation Tesla could dual-issue MAD+MUL to CUDA cores and SFUs in parallel, but Fermi lost this ability as it can only issue 32 instructions per cycle per SM which keeps just its 32 CUDA cores fully utilized. Therefore, it is not possible to leverage the SFUs to reach more than 2 operations per CUDA core per cycle.
The theoretical double-precision processing power of a Fermi GPU is 1/2 of the single precision performance on GF100/110. However, in practice this double-precision power is only available on professional Quadro and Tesla cards, while consumer GeForce cards are capped to 1/8.

Memory

L1 cache per SM and unified L2 cache that services all operations.

Registers

Each SM has 32K of 32-bit registers. Each thread has access to its own registers and not those of other threads. The maximum number of registers that can be used by a CUDA kernel is 63. The number of available registers degrades gracefully from 63 to 21 as the workload increases by number of threads. Registers have a very high bandwidth: about 8,000 GB/s.

L1+Shared Memory

On-chip memory that can be used either to cache data for individual threads and/or to share data among several threads.
This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache.
Shared memory enables threads within the same thread block to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic.
Shared memory is accessible by the threads in the same thread block. It provides low-latency access and very high bandwidth to moderate amounts of data.
David Patterson says that this Shared Memory uses idea of local scratchpad

Local Memory

Local memory is meant as a memory location used to hold "spilled" registers. Register spilling occurs when a thread block requires more register storage than is available on an SM.
Local memory is used only for some automatic variables. Generally, an automatic variable resides in a register except for the following: Arrays that the compiler cannot determine are indexed with constant quantities; Large structures or arrays that would consume too much register space; Any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.

L2 Cache

768 KB unified L2 cache, shared among the 16 SMs, that services all load and store from/to global memory, including copies to/from CPU host, and also texture requests.
The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.

Global memory

Accessible by all threads as well as host. High latency.

Video decompression/compression

See Nvidia NVDEC as well as Nvidia PureVideo and Nvidia NVENC.

Fermi chips

GF100
GF104
GF106
GF108
GF110
GF114
GF116
GF118
GF119
GF117
General
N. Brookwood,
P.N. Glaskowsky,
N. Whitehead, A. Fit-Florea, , 2011.
S.F. Oberman, M. Siu, "A high-performance area-efficient multifunction interpolator," Proc. of the 17th IEEE Symposium on Computer Arithmetic, Cap Cod, MA, USA, Jul. 27–29, 2005, pp. 272–279.
R. Farber, "CUDA Application Design and Development," Morgan Kaufmann, 2011.
NVIDIA Application Note "Tuning CUDA applications for Fermi".

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...