Teraflops Research Chip


Intel Teraflops Research Chip (codenamed Polaris) is a research manycore processor containing 80 cores, using a network-on-chip architecture, developed by Intel's Tera-Scale Computing Research Program. It was manufactured using a 65 nm CMOS process with eight layers of copper interconnect and contains 100 million transistors on a 275 mm² die. Its design goal was to demonstrate a modular architecture capable of a sustained performance of 1.0 TFLOPS while dissipating less than 100 W. Research from the project was later incorporated into Xeon Phi. The technical lead of the project was Sriram R. Vangal.
The processor was initially presented at the Intel Developer Forum on September 26, 2006, and officially announced on February 11, 2007. A working chip was presented at the 2007 IEEE International Solid-State Circuits Conference, alongside technical specifications.

Architecture

The chip consists of a 10×8 2D mesh network of cores and nominally operates at 4 GHz. Each core, called a tile, contains a processing engine and a 5-port wormhole-switched router with mesochronous interfaces, providing a bandwidth of 80 GB/s and a latency of 1.25 ns at 4 GHz. The processing engine in each tile contains two independent, 9-stage pipelined single-precision floating-point multiply-accumulator (FPMAC) units, 3 KB of single-cycle instruction memory and 2 KB of data memory. Each FPMAC unit can perform 2 single-precision floating-point operations per cycle, so each tile has an estimated peak performance of 16 GFLOPS at the standard configuration of 4 GHz.
A 96-bit very long instruction word encodes up to eight operations per cycle. The custom instruction set includes instructions to send and receive packets to and from the chip's network, as well as instructions for sleeping and waking a particular tile.
Underneath each tile, a 256 KB SRAM module was 3D-stacked, bringing memory nearer to the processor and increasing overall memory bandwidth to 1 TB/s, at the expense of higher cost, thermal stress and latency, and a small total capacity of 20 MB. The chip's network was shown to have a bisection bandwidth of 1.6 Tbit/s at 3.16 GHz and 2.92 Tbit/s at 5.67 GHz.
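The headline figures in this section follow from simple arithmetic, which the sketch below reproduces. It is a back-of-the-envelope check, not a model of the hardware. All function and constant names are illustrative. Two values are assumptions rather than facts stated in the text: the per-port width is derived by dividing the quoted 80 GB/s router bandwidth evenly over the 5 ports, and the bisection is assumed to cut 16 links (8 mesh links counted in both directions).

```python
# Quantities taken from the text above.
TILES = 10 * 8                    # 10x8 mesh of tiles
FPMACS_PER_TILE = 2               # two FPMAC units per processing engine
FLOPS_PER_FPMAC_PER_CYCLE = 2     # one multiply plus one accumulate
ROUTER_PORTS = 5
SRAM_KB_PER_TILE = 256            # stacked SRAM under each tile


def tile_peak_gflops(freq_ghz: float) -> float:
    """Peak single-precision GFLOPS of one tile at the given clock."""
    return FPMACS_PER_TILE * FLOPS_PER_FPMAC_PER_CYCLE * freq_ghz


def chip_peak_tflops(freq_ghz: float) -> float:
    """Peak TFLOPS of all 80 tiles together."""
    return TILES * tile_peak_gflops(freq_ghz) / 1000.0


def stacked_sram_mb() -> float:
    """Total capacity of the stacked SRAM in MB."""
    return TILES * SRAM_KB_PER_TILE / 1024.0


def port_bytes_per_cycle(router_gb_per_s: float = 80.0,
                         freq_ghz: float = 4.0) -> float:
    """Bytes moved per port per cycle.

    Assumption: the quoted router bandwidth is split evenly over the ports.
    """
    return router_gb_per_s / ROUTER_PORTS / freq_ghz


def bisection_tbit_per_s(freq_ghz: float, links_cut: int = 16) -> float:
    """Estimated bisection bandwidth in Tbit/s.

    Assumption: the cut crosses 8 mesh links, counted in both directions.
    """
    bits_per_cycle = links_cut * port_bytes_per_cycle() * 8
    return bits_per_cycle * freq_ghz / 1000.0


if __name__ == "__main__":
    print(tile_peak_gflops(4.0))        # per-tile peak at 4 GHz
    print(chip_peak_tflops(3.16))       # chip peak at 3.16 GHz
    print(stacked_sram_mb())            # total stacked SRAM
    print(bisection_tbit_per_s(3.16))   # bisection estimate at 3.16 GHz
```

Under these assumptions the estimates come out at 16 GFLOPS per tile at 4 GHz, roughly 1.01 TFLOPS for the chip at 3.16 GHz, 20 MB of stacked SRAM, and about 1.62 Tbit/s of bisection bandwidth at 3.16 GHz. The same estimate at 5.67 GHz gives about 2.90 Tbit/s, slightly below the quoted 2.92 Tbit/s.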
Other prominent features of the Teraflops Research Chip include fine-grained power management, with 21 independent sleep regions per tile and dynamic tile sleep, and very high energy efficiency: a theoretical peak of 27 GFLOPS/W at 0.6 V, and 19.4 GFLOPS/W measured on a stencil workload at 0.75 V.
Instruction type    Latency (cycles)
FPMAC               9
LOAD/STORE          2
SEND/RECEIVE        2
JUMP/BRANCH         1
STALL/WFD           ?
SLEEP/WAKE          6
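To put the instruction latencies in terms of time, they can be divided by the clock frequency. The short sketch below does this conversion. It assumes the latencies are expressed in clock cycles, which matches the 9-stage FPMAC pipeline described earlier. The dictionary and function names are illustrative, and the STALL/WFD entry is omitted because its latency is not given.

```python
# Instruction latencies from the table above, assumed to be in clock cycles.
LATENCY_CYCLES = {
    "FPMAC": 9,
    "LOAD/STORE": 2,
    "SEND/RECEIVE": 2,
    "JUMP/BRANCH": 1,
    "SLEEP/WAKE": 6,
}


def latency_ns(cycles: int, freq_ghz: float = 4.0) -> float:
    """Convert a latency in cycles to nanoseconds at the given clock."""
    return cycles / freq_ghz


if __name__ == "__main__":
    for name, cycles in LATENCY_CYCLES.items():
        print(f"{name}: {latency_ns(cycles):.2f} ns at 4 GHz")
```

At the nominal 4 GHz clock, one cycle lasts 0.25 ns, so a 9-cycle FPMAC operation would take 2.25 ns under this assumption.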

Voltage    Frequency    Performance    Power    Temperature
0.60 V     1.0 GHz      0.32 TFLOPS    11 W     110 °C
0.675 V    1.0 GHz      0.32 TFLOPS    15.6 W   80 °C
0.70 V     1.5 GHz      0.48 TFLOPS    25 W     110 °C
0.70 V     1.35 GHz     0.43 TFLOPS    18 W     80 °C
0.75 V     1.6 GHz      0.51 TFLOPS    21 W     80 °C
0.80 V     2.1 GHz      0.67 TFLOPS    42 W     110 °C
0.80 V     2.0 GHz      0.64 TFLOPS    26 W     80 °C
0.85 V     2.4 GHz      0.77 TFLOPS    32 W     80 °C
0.90 V     2.6 GHz      0.83 TFLOPS    70 W     110 °C
0.90 V     2.85 GHz     0.91 TFLOPS    45 W     80 °C
0.95 V     3.16 GHz     1.0 TFLOPS     62 W     80 °C
1.00 V     3.13 GHz     1.0 TFLOPS     98 W     110 °C
1.00 V     3.8 GHz      1.22 TFLOPS    78 W     80 °C
1.05 V     4.2 GHz      1.34 TFLOPS    82 W     80 °C
1.10 V     3.5 GHz      1.12 TFLOPS    135 W    110 °C
1.10 V     4.5 GHz      1.44 TFLOPS    105 W    80 °C
1.15 V     4.8 GHz      1.54 TFLOPS    128 W    80 °C
1.20 V     4.0 GHz      1.28 TFLOPS    181 W    110 °C
1.20 V     5.1 GHz      1.63 TFLOPS    152 W    80 °C
1.25 V     5.3 GHz      1.70 TFLOPS    165 W    80 °C
1.30 V     4.4 GHz      1.39 TFLOPS    ?        110 °C
1.30 V     5.5 GHz      1.76 TFLOPS    210 W    80 °C
1.35 V     5.67 GHz     1.81 TFLOPS    230 W    80 °C
1.40 V     4.8 GHz      1.52 TFLOPS    ?        110 °C
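The trade-off between clock speed and energy efficiency in the measurements above can be made explicit by dividing performance by power. The sketch below does this for three rows copied from the table. The tuple layout and function names are illustrative, and the results are approximate because the table entries are rounded.

```python
# (voltage V, frequency GHz, performance TFLOPS, power W, temperature °C)
# Three rows copied from the table above; rows with unknown power are skipped.
ROWS = [
    (0.60, 1.0, 0.32, 11.0, 110),
    (0.95, 3.16, 1.0, 62.0, 80),
    (1.35, 5.67, 1.81, 230.0, 80),
]


def gflops_per_watt(tflops: float, watts: float) -> float:
    """Energy efficiency in GFLOPS/W for one measurement."""
    return tflops * 1000.0 / watts


def most_efficient(rows):
    """Return the row with the highest GFLOPS/W."""
    return max(rows, key=lambda r: gflops_per_watt(r[2], r[3]))


if __name__ == "__main__":
    for volts, ghz, tflops, watts, temp_c in ROWS:
        eff = gflops_per_watt(tflops, watts)
        print(f"{volts:.2f} V, {ghz} GHz: {eff:.1f} GFLOPS/W")
```

For these three rows the efficiency falls from about 29 GFLOPS/W at 0.60 V to about 16 GFLOPS/W at 0.95 V and about 8 GFLOPS/W at 1.35 V, showing how raising voltage and frequency buys throughput at a steep cost in power.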

Issues

Intel aimed to ease software development for the exotic new architecture by creating a programming model called Ct, designed specifically for the chip. The model never gained the following Intel had hoped for and was eventually incorporated into Intel Array Building Blocks, a now-defunct C++ library.