A reconfigurable 2’s-complement and sign-magnitude scheme integrated within a versatile-format CIM macro supporting MX, LNS, FP, and INT for MAC operations.
Stats: 16nm 72kb gain-cell array, energy efficiency of 120.5TFLOPS/W and throughput density of 3.18 TOPS/mm2 in MXINT8 mode.
A 16nm 72kb 120.5TFLOPS/W Versatile-Format Dual-Representation Gain-Cell CIM Macro for General Purpose AI Tasks
J-C. Tien et al., National Tsing Hua University and TSMC, Hsinchu, Taiwan
IEEE International Solid-State Circuits Conference (ISSCC) (2026)
The paper presents a versatile-format (MX-LNS-FP-INT), dual-representation (2C-SM) gain-cell CIM (GC-CIM) macro built on eDRAM for general-purpose AI tasks.
Challenges in Modern CIM
2's complement (2C) versus sign-magnitude (SM) area-power trade-off: 2C representation suffers from high toggle rates because bit-wise sparsity is low for negative values near zero, whereas SM increases hardware area overhead.
Fixed MX block size: Using a single MX block size (k) constrains the energy-accuracy trade-off because it cannot adapt to different input distributions.
LNS area overhead: Supporting LNS in CIM requires replacing MAC operations with additions and look-up table (LUT) accesses, increasing area overhead and reducing area utilisation since hardware cannot be shared across formats.
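The 2C versus SM point above can be made concrete with a small sketch. It is illustrative only and assumes 8-bit signed integers; the helper names (twos_complement, sign_magnitude, ones, toggles) and the sample sequence are hypothetical and do not come from the paper. It counts the set bits in each encoding and the bit flips between consecutive values, which is a rough proxy for switching activity.

```python
def twos_complement(value: int, width: int = 8) -> int:
    """Encode `value` as an unsigned two's-complement bit pattern."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {width}-bit 2C")
    return value & ((1 << width) - 1)


def sign_magnitude(value: int, width: int = 8) -> int:
    """Encode `value` as a sign bit followed by a (width - 1)-bit magnitude."""
    limit = (1 << (width - 1)) - 1
    if abs(value) > limit:
        raise ValueError(f"{value} does not fit in {width}-bit SM")
    sign = 1 if value < 0 else 0
    return (sign << (width - 1)) | abs(value)


def ones(pattern: int) -> int:
    """Number of set bits in a bit pattern."""
    return bin(pattern).count("1")


def toggles(patterns: list) -> int:
    """Total bit flips between consecutive patterns in a sequence."""
    return sum(ones(a ^ b) for a, b in zip(patterns, patterns[1:]))


# Assumed example: small values that alternate in sign around zero.
samples = [1, -1, 2, -2, 0, -3]
tc = [twos_complement(v) for v in samples]
sm = [sign_magnitude(v) for v in samples]

for v, a, b in zip(samples, tc, sm):
    print(f"{v:>3}  2C={a:08b} ({ones(a)} ones)  SM={b:08b} ({ones(b)} ones)")
print("toggles 2C:", toggles(tc), " SM:", toggles(sm))
```

For this sequence the 2C patterns flip far more bits than the SM patterns, since small negative values such as -1 are encoded as almost all ones in 2C but as a sign bit plus a sparse magnitude in SM.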
Innovations Presented
Compact in-situ weight (W) sparsity booster (CiWSB) with polarity-shift W control (PSWC): Transfers weight polarity to the inputs (INs) to overcome the area-energy trade-off between 2C and SM, enabling reconfigurable 2C-SM computation in CIM macros with minimal area overhead.
Input distribution-aware MX quantizer (IDA-MXQ): Adaptively selects the optimal MX block-size k based on IN distributions, improving both energy efficiency and accuracy.
Multi-format adaptive computing cell (MFA-CC) with delta exponent subtractor (ΔES) and LUT support: Shares hardware across LNS, FP, and INT modes. The ΔES increases bit-wise sparsity in LNS mode, enabling high area utilisation and energy-efficient computation.
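The idea of transferring weight polarity to the inputs rests on a simple arithmetic identity: w * x equals |w| * (sign(w) * x), so a dot product is unchanged if each negative weight is stored as its magnitude and the matching input is negated instead. The sketch below shows only this identity in software; it is not a model of the CiWSB or PSWC circuits, and the function names and sample values are hypothetical.

```python
def polarity_shift(weights, inputs):
    """Move each weight's sign onto its paired input.

    Returns (magnitudes, shifted_inputs) such that the dot product of the
    two returned lists equals the dot product of the original lists.
    """
    magnitudes = [abs(w) for w in weights]
    shifted_inputs = [-x if w < 0 else x for w, x in zip(weights, inputs)]
    return magnitudes, shifted_inputs


def dot(a, b):
    """Plain integer dot product."""
    return sum(x * y for x, y in zip(a, b))


# Assumed example values, chosen only for illustration.
weights = [3, -1, 0, -4]
inputs = [2, 5, -7, 1]

mags, shifted = polarity_shift(weights, inputs)
print("original :", dot(weights, inputs))
print("shifted  :", dot(mags, shifted))
```

After the shift, the stored weights are all non-negative, so they carry no sign information and stay sparse at the bit level, while the result of the MAC is unchanged.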

Results
Implemented in TSMC 16nm, operating at 0.8V.
72kb total memory.
MXINT8 – 120.5 TFLOPS/W; 3.18 TOPS/mm2
LNS8 – 98.1 TFLOPS/W; 3.18 TOPS/mm2
BF16 – 50.4 TFLOPS/W; 1.59 TOPS/mm2
INT8 – 138.5 TOPS/W; 3.18 TOPS/mm2
Components
Four tile-based matrix-matrix CIM clusters. Each cluster contains an IDA-MXQ and four CIM banks.
Each CIM bank includes 32 sub-banks, an alignment and adder tree, and a FP converter.
Each sub-bank consists of a gain-cell array (8 rows×18 columns), an IN processor, a CiWSB, and four MFA-CC units.
Each CIM bank provides four output channels, broadcasting the same weight to four different inputs across MFA-CC units. The outputs from the four CIM banks are combined to produce the final matrix–matrix result.
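As a quick consistency check, the hierarchy described above can be multiplied out to recover the stated 72kb capacity. The sketch assumes one bit per gain cell and 1 kb = 1024 bits; both are assumptions made here, not statements from the paper, and the constant names are hypothetical.

```python
# Hierarchy as listed in the Components section.
CLUSTERS = 4
BANKS_PER_CLUSTER = 4
SUB_BANKS_PER_BANK = 32
ROWS_PER_ARRAY = 8
COLS_PER_ARRAY = 18
MFA_CC_PER_SUB_BANK = 4


def total_sub_banks() -> int:
    """Number of sub-banks across the whole macro."""
    return CLUSTERS * BANKS_PER_CLUSTER * SUB_BANKS_PER_BANK


def total_bits() -> int:
    """Total gain cells, assuming one stored bit per cell."""
    return total_sub_banks() * ROWS_PER_ARRAY * COLS_PER_ARRAY


def total_mfa_cc() -> int:
    """Total MFA-CC units across the whole macro."""
    return total_sub_banks() * MFA_CC_PER_SUB_BANK


print("sub-banks:", total_sub_banks())
print("bits     :", total_bits())
print("kb       :", total_bits() / 1024)  # assumes 1 kb = 1024 bits
print("MFA-CCs  :", total_mfa_cc())
```

Under these assumptions the product is 4 × 4 × 32 × 8 × 18 = 73,728 bits, which is exactly 72 × 1024 and matches the 72kb figure quoted for the macro.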