We need 50 or 60% expert pruning please
#3 · opened by hxssgaa
Being able to run this on a single 96GB H200 or RTX Pro 6000 with 4-bit quantisation after expert pruning would be very useful. I don't mind sacrificing a little more performance.
You can fit the 30% REAP at IQ4_XS on it.
Model weights take about 80 GB, leaving roughly 16 GB for the KV cache, which is a bit tight for long-horizon tasks like Claude Code that easily require 64K context.
Quantize the KV cache.
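For a rough sense of scale, here's a back-of-envelope sketch of KV-cache size versus context length and how much an ~8-bit cache saves. The layer/head/dim numbers below are placeholders, not MiniMax-M2's actual config, so plug in the real values before trusting the figures:

```python
# Back-of-envelope KV-cache sizing. The defaults are ASSUMED placeholder dims,
# not MiniMax-M2's actual architecture.
def kv_cache_gib(context_len, num_layers=60, num_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per head dim, per token
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 1024**3

for ctx in (32_768, 65_536, 131_072):
    fp16 = kv_cache_gib(ctx, bytes_per_elem=2)   # fp16 cache
    q8   = kv_cache_gib(ctx, bytes_per_elem=1)   # ~8-bit quantized cache
    print(f"{ctx:>7} ctx: fp16 ≈ {fp16:5.1f} GiB, q8 ≈ {q8:5.1f} GiB")
```

With numbers in that ballpark, a 64K fp16 cache is on the order of 15 GiB, so an 8-bit cache roughly doubles the usable context in the same headroom. If the runtime is llama.cpp, the `--cache-type-k` / `--cache-type-v` options set this.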
@hxssgaa @ha1ry @0xSero
Hey folks, we just dropped a 40% REAP: https://huggingface.co/cerebras/MiniMax-M2-REAP-139B-A10B
We do see a slightly bigger drop of a few percentage points on some benchmarks; please let us know if you see issues with the model!