A 16nm 72kb 120.5TFLOPS/W Versatile-Format Dual-Representation Gain-Cell CIM Macro for General Purpose AI Tasks

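This work presents a versatile-format, dual-representation gain-cell compute-in-memory (GC-CIM) macro built on eDRAM for general-purpose AI tasks. It supports microscaling (MX), logarithmic number system (LNS), floating-point (FP), and integer (INT) formats, and both 2's complement and sign-magnitude (2C-SM) representations.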

Challenges in Modern CIM

2's complement (2C) versus sign-magnitude (SM) area-power trade-off: 2C suffers from high toggle rates because small negative values near zero are sign-extended into mostly 1 bits (low bit-wise sparsity), whereas SM increases hardware area overhead.
Fixed MX block size: A single MX block size (k) constrains the energy-accuracy trade-off because it cannot adapt to different input distributions.
LNS area overhead: Supporting LNS in CIM requires replacing MAC operations with additions and look-up table (LUT) accesses, increasing area overhead and reducing area utilisation since hardware cannot be shared across formats.
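
To make the 2C-versus-SM point concrete, the sketch below encodes a few small values around zero in both representations and counts the bit toggles between consecutive words. The 8-bit width, the sample stream, and the helper names are illustrative assumptions and are not taken from the macro.

```python
def twos_complement_bits(value: int, width: int = 8) -> str:
    """Return the two's-complement bit string of `value` at the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")


def sign_magnitude_bits(value: int, width: int = 8) -> str:
    """Return the sign-magnitude bit string: one sign bit, then |value|."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{width - 1}b")


def toggles(a: str, b: str) -> int:
    """Count bit positions that differ between two equal-width bit strings."""
    return sum(x != y for x, y in zip(a, b))


# Illustrative stream of small values oscillating around zero (an assumption).
stream = [1, -1, 2, -2, 1, -1]

tc = [twos_complement_bits(v) for v in stream]
sm = [sign_magnitude_bits(v) for v in stream]

# Total toggles between consecutive words in each representation.
tc_toggles = sum(toggles(a, b) for a, b in zip(tc, tc[1:]))
sm_toggles = sum(toggles(a, b) for a, b in zip(sm, sm[1:]))

for v, t, s in zip(stream, tc, sm):
    print(f"{v:>3}  2C={t}  SM={s}")
print("2C toggles:", tc_toggles, " SM toggles:", sm_toggles)
```

For this toy stream the 2C words toggle 35 bits in total against 9 for SM, because every sign change in 2C flips almost all upper bits while SM flips only the sign bit and the few magnitude bits that differ.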

Innovations Presented

Compact in-situ weight (W) sparsity booster (CiWSB) with polarity-shift W control (PSWC): Transfers weight polarity to the inputs (INs) to overcome the area-energy trade-off between 2C and SM, enabling reconfigurable 2C-SM computation in CIM macros with minimal area overhead.
Input distribution-aware MX quantizer (IDA-MXQ): Adaptively selects the optimal MX block size k based on IN distributions, improving both energy efficiency and accuracy.
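
The effect of the MX block size k can be illustrated with a simplified software model in which each block of k elements shares one power-of-two scale and stores 8-bit integer elements. This is a minimal sketch under stated assumptions: the scale rule is a simplification rather than the exact MX rule, and `pick_block_size`, its candidate sizes, and its tolerance are hypothetical stand-ins, since the selection criterion used by IDA-MXQ is not described here.

```python
import math
import random


def mx_quantize_block(block, elem_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale."""
    qmax = (1 << (elem_bits - 1)) - 1          # 127 for 8-bit elements
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # Smallest power-of-two scale that keeps the largest element within qmax.
    scale = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    # Round to the integer grid, clamp to the element range, then dequantize.
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]


def mx_error(values, k):
    """Mean squared error after quantizing `values` in blocks of size k."""
    err = 0.0
    for start in range(0, len(values), k):
        block = values[start:start + k]
        deq = mx_quantize_block(block)
        err += sum((x - y) ** 2 for x, y in zip(block, deq))
    return err / len(values)


def pick_block_size(values, candidates=(64, 32, 16), tolerance=1e-4):
    """Hypothetical rule: largest k whose error stays within `tolerance`."""
    for k in sorted(candidates, reverse=True):
        if mx_error(values, k) <= tolerance:
            return k
    return min(candidates)


random.seed(0)
values = [random.gauss(0.0, 0.05) for _ in range(256)]
values[10] = 8.0  # one outlier dominates the shared scale of its block

for k in (16, 32, 64):
    print(f"k={k:>2}  mse={mx_error(values, k):.3e}")
print("chosen k:", pick_block_size(values))
```

Smaller blocks confine an outlier's coarse scale to fewer elements, which lowers the error, while larger blocks amortise each shared scale over more elements. That tension is why a distribution-dependent choice of k can help.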

Multi-format adaptive computing cell (MFA-CC) with delta exponent subtractor (ΔES) and LUT support: Shares hardware across LNS, FP, and INT modes. The ΔES increases bit-wise sparsity in LNS mode, enabling high area utilisation and energy-efficient computation.
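
As background for the LNS mode, the sketch below shows why LNS arithmetic maps to additions and LUT accesses: a product becomes an addition of log-domain codes, and converting back to the linear domain for accumulation uses a small table of powers of two. The 3-bit fraction, the function names, and the floating-point accumulator are assumptions made for illustration. The ΔES and the actual MFA-CC datapath are not modelled.

```python
import math

FRAC_BITS = 3  # hypothetical number of fractional bits in the log-domain code

# LUT holding 2**(f / 2**FRAC_BITS) for every fractional code f.
POW2_LUT = [2.0 ** (f / (1 << FRAC_BITS)) for f in range(1 << FRAC_BITS)]


def lns_encode(x: float):
    """Encode nonzero x as (sign, fixed-point log2 code)."""
    if x == 0.0:
        raise ValueError("zero needs a dedicated flag in LNS; not modelled here")
    sign = -1 if x < 0 else 1
    code = round(math.log2(abs(x)) * (1 << FRAC_BITS))
    return sign, code


def lns_decode(sign: int, code: int) -> float:
    """Decode with a shift for the integer part and the LUT for the fraction."""
    int_part = code >> FRAC_BITS                 # floors, also for negatives
    frac_part = code & ((1 << FRAC_BITS) - 1)    # always in 0 .. 2**FRAC_BITS - 1
    return sign * POW2_LUT[frac_part] * (2.0 ** int_part)


def lns_multiply(a, b):
    """Multiply two LNS numbers: signs multiply, log codes add."""
    return a[0] * b[0], a[1] + b[1]


def lns_dot(weights, inputs) -> float:
    """Dot product: log-domain products, linear-domain accumulation."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += lns_decode(*lns_multiply(lns_encode(w), lns_encode(x)))
    return acc


print(lns_dot([2.0, -4.0], [0.5, 0.25]))  # powers of two are represented exactly
print(lns_decode(*lns_multiply(lns_encode(3.0), lns_encode(5.0))))  # approximates 15
```

Values that are not powers of two pick up a rounding error that depends on the number of fractional bits, which is the accuracy cost paid for replacing multipliers with adders.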

Results

Implemented in TSMC 16nm, operating at 0.8V.
72kb total memory.
MXINT8 – 120.5 TFLOPS/W; 3.18 TOPS/mm²
LNS8 – 98.1 TFLOPS/W; 3.18 TOPS/mm²
BF16 – 50.4 TFLOPS/W; 1.59 TOPS/mm²
INT8 – 138.5 TOPS/W; 3.18 TOPS/mm²

Components

The macro comprises four tile-based matrix-matrix CIM clusters. Each cluster contains an IDA-MXQ and four CIM banks.
Each CIM bank includes 32 sub-banks, an alignment and adder tree, and a FP converter.
Each sub-bank consists of a gain-cell array (8 rows×18 columns), an IN processor, a CiWSB, and four MFA-CC units.
Each CIM bank provides four output channels by broadcasting the same weight to four different inputs across the MFA-CC units. The outputs of the four CIM banks in a cluster are combined to produce the final matrix-matrix result.
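
The hierarchy above can be checked against the stated 72kb capacity with a few lines of arithmetic. The sketch assumes one bit per gain cell and 1 kb = 1024 bits. Neither is stated explicitly, but under those assumptions the component counts reproduce the headline figure.

```python
# Component counts as listed in the description above.
CLUSTERS = 4
BANKS_PER_CLUSTER = 4
SUB_BANKS_PER_BANK = 32
ROWS, COLS = 8, 18            # gain-cell array per sub-bank
MFA_CC_PER_SUB_BANK = 4

sub_banks = CLUSTERS * BANKS_PER_CLUSTER * SUB_BANKS_PER_BANK
total_bits = sub_banks * ROWS * COLS          # assumes one bit per gain cell
total_kb = total_bits / 1024                  # assumes 1 kb = 1024 bits
mfa_cc_units = sub_banks * MFA_CC_PER_SUB_BANK

print("sub-banks:", sub_banks)
print("total bits:", total_bits)
print("capacity (kb):", total_kb)
print("MFA-CC units:", mfa_cc_units)
```

This gives 512 sub-banks, 73,728 bits (72 kb), and 2,048 MFA-CC units across the macro.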