Graphics Core Next

Graphics Core Next (GCN) is the codename for both a series of microarchitectures and an instruction set. GCN was developed by AMD for their GPUs as the successor to the TeraScale microarchitecture/instruction set. The first product featuring GCN was launched on January 9, 2012.
GCN is a RISC SIMD microarchitecture, in contrast to the VLIW SIMD architecture of TeraScale. GCN requires considerably more transistors than TeraScale, but offers advantages for GPGPU computation: it makes the compiler simpler and should also lead to better utilization.
GCN graphics chips are fabricated with CMOS at 28 nm, and with FinFET at 14 nm and 7 nm. They are available on selected models in the Radeon HD 7000, HD 8000, 200, 300, 400, 500 and Vega series of AMD Radeon graphics cards, including the separately released Radeon VII. GCN is also used in the graphics portion of AMD Accelerated Processing Units (APUs), such as those in the PlayStation 4 and Xbox One.

Instruction set

The GCN instruction set is owned by AMD. It was developed specifically for GPUs and, for example, has no micro-operation for division.
Documentation is available for the individual generations of the instruction set. The documentation for the GCN 4 instruction set is the same as for the 3rd generation.

An LLVM code generator is available for the GCN instruction set. It is used by Mesa 3D.
The GNU Compiler Collection has supported GCN 3 and GCN 5 since 2019 for single-threaded, stand-alone programs, and GCC 10 added offloading via OpenMP and OpenACC.
An open-source RTL implementation of the AMD Southern Islands GPGPU instruction set also exists.
In November 2015, AMD announced the "Boltzmann Initiative", which is intended to enable the porting of CUDA-based applications to a common C++ programming model.
At "Super Computing 15", AMD showed their Heterogeneous Compute Compiler, a headless Linux driver and HSA runtime infrastructure for cluster-class High Performance Computing, and the Heterogeneous-compute Interface for Portability tool for porting CUDA-based applications to a common C++ programming model.

Microarchitectures

As of July 2017, the family of microarchitectures implementing the identically named instruction set "Graphics Core Next" has seen five iterations. The differences between the instruction sets of these iterations are rather minimal. An exception is the fifth generation GCN architecture, which heavily modified the stream processors to improve performance and to support the simultaneous processing of two lower-precision numbers in place of a single higher-precision number.

Command processing

Graphics Command Processor

The "Graphics Command Processor" is a functional unit of the GCN microarchitecture. Among other tasks, it is responsible for Asynchronous Shaders.

Asynchronous Compute Engine

The Asynchronous Compute Engine is a distinct functional block serving computing purposes. Its purpose is similar to that of the Graphics Command Processor.

Scheduler

Since the third iteration of GCN, the hardware contains two schedulers: one to schedule wavefronts during shader execution and a new one to schedule execution of draw and compute queues. The latter helps performance by executing compute operations when the CUs are underutilized because graphics commands are limited by fixed-function pipeline speed or by bandwidth. This functionality is known as Async Compute.
For a given shader, the GPU drivers also need to select a good instruction order to minimize latency. This is done on the CPU, and is sometimes referred to as "scheduling".

Geometric processor

The geometry processor contains the Geometry Assembler, the Tessellator and the Vertex Assembler.
The GCN Tessellator of the geometry processor is capable of doing tessellation in hardware as defined by Direct3D 11 and OpenGL 4.5.
The GCN Tessellator is AMD's most current SIP block for tessellation; earlier units were ATI TruForm and hardware tessellation in TeraScale.

Compute units

One compute unit (CU) combines 64 shader processors with 4 TMUs. The compute units are separate from, but feed into, the render output units. Each compute unit consists of a CU scheduler, a Branch & Message Unit, 4 SIMD Vector Units (SIMD-VUs), four 64 KiB VGPR files, 1 scalar unit (SU), a 4 KiB GPR file, a local data share of 64 KiB, 4 Texture Filter Units, 16 Texture Fetch Load/Store Units and a 16 KiB L1 cache. Four compute units are wired to share a 16 KiB instruction cache and a 32 KiB scalar data cache. These are backed by the L2 cache. A SIMD-VU operates on 16 elements at a time, while an SU operates on one at a time. In addition, the SU handles some other operations such as branching.
Every SIMD-VU has some private memory where it stores its registers. There are two types of registers: scalar registers, which hold one 4-byte number each, and vector registers, which each represent a set of 64 4-byte numbers. An operation on vector registers is performed in parallel on all 64 numbers, so every piece of work takes 64 inputs; for example, 64 different pixels are processed at a time.
Every SIMD-VU has room for 512 scalar registers and 256 vector registers.
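The register counts above are consistent with the 64 KiB VGPR file per SIMD-VU, which can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the constant and function names are made up for this example and do not come from any AMD documentation.

```python
# Sizes taken from the text above; all names here are illustrative only.
BYTES_PER_NUMBER = 4              # each register element is a 4-byte number
LANES_PER_VECTOR_REGISTER = 64    # one vector register holds 64 numbers
VECTOR_REGISTERS_PER_SIMD = 256
SCALAR_REGISTERS_PER_SIMD = 512


def vector_register_file_bytes() -> int:
    """Storage needed for all vector registers of one SIMD-VU."""
    return (VECTOR_REGISTERS_PER_SIMD
            * LANES_PER_VECTOR_REGISTER
            * BYTES_PER_NUMBER)


def scalar_register_bytes() -> int:
    """Storage needed for all scalar registers of one SIMD-VU."""
    return SCALAR_REGISTERS_PER_SIMD * BYTES_PER_NUMBER


if __name__ == "__main__":
    print(vector_register_file_bytes() // 1024, "KiB of vector registers")
    print(scalar_register_bytes(), "bytes of scalar registers")
```

The vector register total comes out at exactly 64 KiB, matching the size of one VGPR file.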

CU scheduler

The CU scheduler is the hardware functional block that chooses which wavefronts each SIMD-VU executes. It picks one SIMD-VU per cycle for scheduling. This is not to be confused with other schedulers, in hardware or software.

Wavefront

In all GCN GPUs, a "wavefront" consists of 64 threads, and in all Nvidia GPUs a "warp" consists of 32 threads.
To hide memory latency, AMD's solution is to assign multiple wavefronts to each SIMD-VU. The hardware distributes the registers among the different wavefronts, and when one wavefront is waiting on a result that lies in memory, the CU scheduler makes the SIMD-VU work on another wavefront. Wavefronts are assigned per SIMD-VU, and SIMD-VUs do not exchange wavefronts. At most 10 wavefronts can be assigned to one SIMD-VU.
AMD CodeXL shows tables relating the number of SGPRs and VGPRs to the number of wavefronts; in essence, the VGPR budget per wavefront is 256/numwavefronts, and the SGPR budget is capped in a similar way.
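The VGPR relationship can be turned around to estimate how many wavefronts fit on one SIMD-VU for a given register usage. This is a minimal sketch under simplifying assumptions: it considers only VGPRs, ignores any register allocation granularity the hardware may apply, and uses a hypothetical function name.

```python
# Limits taken from the text: 256 vector registers per SIMD-VU and
# at most 10 wavefronts per SIMD-VU.
VGPRS_PER_SIMD = 256
MAX_WAVEFRONTS_PER_SIMD = 10


def max_wavefronts(vgprs_per_wavefront: int) -> int:
    """Estimate how many wavefronts one SIMD-VU can hold at once.

    Assumption: only VGPR usage limits occupancy; SGPRs, local data
    share usage and allocation granularity are deliberately ignored.
    """
    if not 1 <= vgprs_per_wavefront <= VGPRS_PER_SIMD:
        raise ValueError("VGPR count must be between 1 and 256")
    by_registers = VGPRS_PER_SIMD // vgprs_per_wavefront
    return min(MAX_WAVEFRONTS_PER_SIMD, by_registers)


if __name__ == "__main__":
    for vgprs in (16, 25, 26, 64, 128, 256):
        print(vgprs, "VGPRs ->", max_wavefronts(vgprs), "wavefronts")
```

Under these assumptions, a shader that needs more than 25 VGPRs can no longer reach the maximum of 10 wavefronts, which leaves the CU scheduler fewer wavefronts to switch to while one is waiting on memory.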
Note that in connection with SSE instructions, this most basic level of parallelism is often called a "vector width". The vector width is characterized by the total number of bits in it.

SIMD Vector Unit

Each SIMD Vector Unit has:

a 16-lane integer and floating-point vector Arithmetic Logic Unit (ALU)
a 64 KiB Vector General Purpose Register (VGPR) file
a 48-bit Program Counter
an instruction buffer for 10 wavefronts (a wavefront is a group of 64 threads: the size of one logical VGPR)

A 64-thread wavefront issues to a 16-lane SIMD unit over four cycles, so it takes 4 cycles to execute one instruction for a whole wavefront.
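The four-cycle figure follows directly from dividing 64 threads by 16 lanes. The sketch below models this in software: it applies one operation to a 64-element wavefront in groups of 16 and counts the groups. It is a conceptual illustration only, not a model of the real pipeline, and the function name is hypothetical.

```python
import operator

WAVEFRONT_SIZE = 64   # threads per wavefront
SIMD_LANES = 16       # lanes of one SIMD Vector Unit


def execute_vector_op(op, a, b):
    """Apply `op` element-wise, SIMD_LANES elements per simulated cycle.

    Returns the result list and the number of cycles that were needed.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = []
    cycles = 0
    for start in range(0, len(a), SIMD_LANES):
        end = start + SIMD_LANES
        # One cycle: up to 16 lanes work on their own element in lockstep.
        result.extend(op(x, y) for x, y in zip(a[start:end], b[start:end]))
        cycles += 1
    return result, cycles


if __name__ == "__main__":
    lhs = list(range(WAVEFRONT_SIZE))
    rhs = [1] * WAVEFRONT_SIZE
    values, cycles = execute_vector_op(operator.add, lhs, rhs)
    print("cycles:", cycles)
```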

Audio and video acceleration blocks

Many implementations of GCN are accompanied by several of AMD's other ASIC blocks, including but not limited to the Unified Video Decoder, Video Coding Engine, and AMD TrueAudio.

Video Coding Engine

TrueAudio

Unified virtual memory

In a preview in 2011, AnandTech wrote about the unified virtual memory, supported by Graphics Core Next.

Heterogeneous System Architecture (HSA)

Some of the specific HSA features implemented in the hardware need support from the operating system's kernel and/or from specific device drivers. For example, in July 2014 AMD published a set of 83 patches to be merged into Linux kernel mainline 3.17 for supporting their Graphics Core Next-based Radeon graphics cards. The special driver, titled "HSA kernel driver", resides in the directory /drivers/gpu/hsa, while the DRM graphics device drivers reside in /drivers/gpu/drm; it augments the already existing DRM driver for Radeon cards. This very first implementation works alongside the existing Radeon kernel graphics driver.

Lossless Delta Color Compression

Hardware schedulers

Hardware schedulers (HWS) perform scheduling and offload the assignment of compute queues to the ACEs from the driver to hardware. They buffer these queues until there is at least one empty queue in at least one ACE, and then immediately assign buffered queues to the ACEs until all queues are full or there are no more queues that can safely be assigned. Part of the scheduling work involves prioritized queues, which allow critical tasks to run at a higher priority than other tasks without preempting the lower-priority tasks. The tasks therefore run concurrently: the high-priority tasks are scheduled to occupy the GPU as much as possible, while other tasks use the resources that the high-priority tasks leave unused. The hardware schedulers are essentially Asynchronous Compute Engines that lack dispatch controllers. They were first introduced in the fourth generation GCN microarchitecture, but were already present in the third generation for internal testing purposes. A driver update has since enabled the hardware schedulers in third generation GCN parts for production use.
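The buffering behaviour described above can be sketched as a small software model. Everything in it is an assumption made for illustration: the class name, the number of ACEs, the number of queue slots per ACE, and the rule of filling the least-loaded ACE first are not taken from AMD documentation, and priorities are left out entirely.

```python
from collections import deque


class HardwareSchedulerModel:
    """Toy model: buffer compute queues until an ACE has a free slot."""

    def __init__(self, num_aces=4, slots_per_ace=2):
        # Both parameters are hypothetical values chosen for the example.
        self.aces = [[] for _ in range(num_aces)]
        self.slots_per_ace = slots_per_ace
        self.pending = deque()

    def submit(self, queue_name):
        """A new compute queue arrives; buffer it, then try to assign."""
        self.pending.append(queue_name)
        self._assign()

    def retire(self, queue_name):
        """A queue finished; free its slot and assign buffered queues."""
        for ace in self.aces:
            if queue_name in ace:
                ace.remove(queue_name)
                break
        self._assign()

    def _assign(self):
        # Hand out buffered queues until every slot is full or nothing
        # is left to assign.
        while self.pending:
            ace = min(self.aces, key=len)   # least-loaded ACE first
            if len(ace) >= self.slots_per_ace:
                return                      # all slots are occupied
            ace.append(self.pending.popleft())


if __name__ == "__main__":
    hws = HardwareSchedulerModel(num_aces=2, slots_per_ace=1)
    for name in ("a", "b", "c"):
        hws.submit(name)
    print(hws.aces, list(hws.pending))
    hws.retire("a")
    print(hws.aces, list(hws.pending))
```

In this model the third queue stays buffered until the first one retires, at which point it is assigned immediately without any involvement of the driver.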

Primitive Discard Accelerator

This unit discards degenerate triangles before they enter the vertex shader and triangles that do not cover any fragments before they enter the fragment shader. This unit was introduced with the fourth generation GCN microarchitecture.
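The two discard conditions can be illustrated with a simple geometric test. The sketch below is not AMD's algorithm: it assumes 2D screen-space vertices, one sample point at the centre of each pixel, and a conservative bounding-box check instead of exact rasterization rules. All function names are hypothetical.

```python
import math


def doubled_signed_area(a, b, c):
    """Twice the signed area of triangle abc (zero when degenerate)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def spans_pixel_center(lo, hi):
    """True if some pixel centre k + 0.5 (k an integer) lies in [lo, hi]."""
    first_center = math.ceil(lo - 0.5) + 0.5
    return first_center <= hi


def can_discard(a, b, c):
    """Conservative check whether a triangle cannot produce any fragment."""
    if doubled_signed_area(a, b, c) == 0:
        return True  # degenerate: the three vertices are collinear
    xs = (a[0], b[0], c[0])
    ys = (a[1], b[1], c[1])
    # If the bounding box misses every pixel centre along either axis,
    # the triangle cannot cover a sample point.
    return not (spans_pixel_center(min(xs), max(xs))
                and spans_pixel_center(min(ys), max(ys)))


if __name__ == "__main__":
    print(can_discard((0, 0), (1, 1), (2, 2)))              # collinear
    print(can_discard((0.1, 0.1), (0.4, 0.1), (0.1, 0.4)))  # tiny triangle
    print(can_discard((0, 0), (4, 0), (0, 4)))              # visible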

Generations

Graphics Core Next 1

support for 64-bit addressing with unified address space for CPU and GPU
support for PCI-E 3.0
GPU sends interrupt requests to CPU on various events
support for Partially Resident Textures, which enable virtual memory support through DirectX and OpenGL extensions
AMD PowerTune support, which dynamically adjusts performance to stay within a specific TDP
support for Mantle

There are Asynchronous Compute Engines controlling computation and dispatching.

ZeroCore Power

ZeroCore Power is a long-idle power-saving technology that shuts off functional units of the GPU when they are not in use. AMD ZeroCore Power technology supplements AMD PowerTune.

Chips

discrete GPUs:

Oland
Cape Verde
Pitcairn
Tahiti

Graphics Core Next 2

GCN 2nd generation was introduced with Radeon HD 7790 and is also found in Radeon HD 8770, R7 260/260X, R9 290/290X, R9 295X2, R7 360, R9 390/390X, as well as Steamroller-based Desktop Kaveri APUs and Mobile Kaveri APUs and in the Puma-based "Beema" and "Mullins" APUs. It has multiple advantages over the original GCN, including FreeSync support, AMD TrueAudio and a revised version of AMD PowerTune technology.
GCN 2nd generation introduced an entity called the "Shader Engine" (SE). A Shader Engine comprises one geometry processor, up to 44 CUs, rasterizers, ROPs, and L1 cache. Not part of a Shader Engine are the Graphics Command Processor, the 8 ACEs, the L2 cache and memory controllers, the audio and video accelerators, the display controllers, the 2 DMA controllers and the PCIe interface.
The A10-7850K "Kaveri" contains 8 CUs and 8 Asynchronous Compute Engines for independent scheduling and work item dispatching.
At AMD Developer Summit in November 2013 Michael Mantor presented the Radeon R9 290X.

Chips

discrete GPUs:

Bonaire
Hawaii

integrated into APUs:

Temash
Kabini
Liverpool
Durango
Kaveri
Godavari
Mullins
Beema
Carrizo-L

Graphics Core Next 3

GCN 3rd generation was introduced in 2014 with the Radeon R9 285 and R9 M295X, which have the "Tonga" GPU. It features improved tessellation performance, lossless delta color compression to reduce memory bandwidth usage, an updated and more efficient instruction set, a new high-quality scaler for video, and a new multimedia engine. Delta color compression is supported in Mesa. However, the double-precision performance of GCN 3rd generation is worse than that of the previous generation.

Chips

discrete GPUs:

Tonga, comes with UVD 5.0
Fiji, comes with UVD 6.0 and High Bandwidth Memory

integrated into APUs:

Carrizo, comes with UVD 6.0
Bristol Ridge
Stoney Ridge

Graphics Core Next 4

GPUs of the Arctic Islands family were introduced in Q2 of 2016 with the AMD Radeon 400 series. The 3D engine is identical to that found in the Tonga chips, but Polaris features a newer display controller engine and UVD version 6.3, among other changes.
All Polaris-based chips other than the Polaris 30 are produced on the 14 nm FinFET process, developed by Samsung Electronics and licensed to GlobalFoundries. The slightly newer refreshed Polaris 30 is built on the 12 nm LP FinFET process node, developed by Samsung and GlobalFoundries. The fourth generation GCN instruction set architecture is compatible with the third generation. The design is optimized for the 14 nm FinFET process, enabling higher GPU clock speeds than with the 3rd GCN generation. Architectural improvements include new hardware schedulers, a new primitive discard accelerator, a new display controller, and an updated UVD that can decode HEVC at 4K resolutions at 60 frames per second with 10 bits per color channel.

Chips

discrete GPUs:

Polaris 10 found on "Radeon RX 470"- and "Radeon RX 480"-branded graphics cards
Polaris 11 found on "Radeon RX 460"-branded graphics cards
Polaris 12 found on "Radeon RX 550" and "Radeon RX 540"-branded graphics cards
Polaris 20, which is a refreshed Polaris 10 with higher clocks, used for "Radeon RX 570" and "Radeon RX 580"-branded graphics cards
Polaris 21, which is a refreshed Polaris 11, used for "Radeon RX 560"-branded graphics cards
Polaris 22, found on "Radeon RX Vega M GH" and "Radeon RX Vega M GL"-branded graphics cards
Polaris 30, which is a refreshed Polaris 20 with higher clocks, used for "Radeon RX 590"-branded graphics cards

Precision Performance

FP64 performance of all GCN 4th generation GPUs is ¹/₁₆ of FP32 performance.

Graphics Core Next 5

AMD began releasing details of their next generation of GCN architecture, termed the 'Next-Generation Compute Unit', in January 2017. The new design was expected to bring increased instructions per clock, higher clock speeds, support for HBM2, and a larger memory address space. The discrete graphics chipsets also include "HBCC", which is absent when the design is integrated into APUs. Additionally, the new chips were expected to include improvements in the rasterisation and render output units. The stream processors are heavily modified from the previous generations to support Rapid Packed Math technology for 8-bit, 16-bit, and 32-bit numbers. This gives a significant performance advantage when lower precision is acceptable.
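The idea behind packed math, two lower-precision numbers occupying the space of one 32-bit value and being processed together, can be illustrated in software. The sketch below packs two IEEE 754 half-precision values into one 32-bit word using Python's struct module and adds two such words element-wise. It only illustrates the data layout; the function names are hypothetical and nothing here reflects the actual GCN instruction encoding.

```python
import struct


def pack_half2(lo: float, hi: float) -> int:
    """Pack two half-precision floats into one 32-bit word (lo in low bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word: int) -> tuple:
    """Split a 32-bit word back into its two half-precision floats."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_add(a: int, b: int) -> int:
    """Add two packed words element-wise, as one packed operation would."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


if __name__ == "__main__":
    x = pack_half2(1.0, 2.0)
    y = pack_half2(0.5, 0.25)
    print(hex(x), unpack_half2(packed_add(x, y)))
```

One packed addition here produces two results from a single pair of 32-bit operands, which is where the throughput advantage for lower-precision work comes from.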
Nvidia introduced tile-based rasterization and binning with Maxwell, and this was a big reason for Maxwell's efficiency increase. In January, AnandTech assumed that Vega would finally catch up with Nvidia regarding energy efficiency optimizations due to the new "DSBR" to be introduced with Vega.
Vega also added support for a new shader stage: primitive shaders. Primitive shaders provide more flexible geometry processing and replace the vertex and geometry shaders in a rendering pipeline. As of December 2018, primitive shaders could not be used because the required API changes had yet to be made.
Vega 10 and Vega 12 use the 14 nm FinFET process, developed by Samsung Electronics and licensed to GlobalFoundries. Vega 20 uses the 7 nm FinFET process developed by TSMC.

Chips

discrete GPUs:

Vega 10 found on "Radeon RX Vega 64", "Radeon RX Vega 56", "Radeon Vega Frontier Edition", "Radeon Pro V340", and "Radeon Pro WX 8200"-branded graphics cards
Vega 12 found on "Radeon Pro Vega 20" and "Radeon Pro Vega 16"-branded mobile graphics cards
Vega 20 found on "Radeon Instinct MI50" and "Radeon Instinct MI60"-branded accelerator cards, and on "Radeon Pro Vega II" and "Radeon VII"-branded graphics cards

integrated into APUs:

Raven Ridge came with VCN 1 which supersedes VCE and UVD and allows full fixed-function VP9 decode.

Precision Performance

FP64 performance of all GCN 5th generation GPUs, except for Vega 20, is ¹/₁₆ of FP32 performance. For Vega 20 it is ¹/₂ of FP32 performance.
All GCN 5th generation GPUs support half-precision floating-point calculations at double the FP32 performance.
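These ratios make it easy to derive the other rates from an FP32 figure. The following is a small worked sketch; the 8 TFLOPS input is a made-up round number rather than the specification of any real card, and the dictionary keys are labels invented for this example.

```python
from fractions import Fraction

# Throughput relative to FP32 for GCN 5th generation, as stated above.
RATE_VS_FP32 = {
    "fp16": Fraction(2),             # half precision: double the FP32 rate
    "fp32": Fraction(1),
    "fp64": Fraction(1, 16),         # all chips except Vega 20
    "fp64_vega20": Fraction(1, 2),   # Vega 20 only
}


def throughput(fp32_tflops, precision):
    """Scale an FP32 throughput figure to another precision."""
    return float(Fraction(fp32_tflops) * RATE_VS_FP32[precision])


if __name__ == "__main__":
    hypothetical_fp32 = 8  # TFLOPS, illustrative value only
    for name in RATE_VS_FP32:
        print(name, throughput(hypothetical_fp32, name), "TFLOPS")
```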
