We need 50 or 60% expert pruning please

#3
by hxssgaa - opened

Being able to run this on a single 96GB H200 or RTX Pro 6000 with 4-bit quantisation after expert pruning would be very useful. I don't mind sacrificing a little more performance.

You can fit 30% IQ4_XS in it.

Model weights take about 80 GB, leaving roughly 16 GB for the KV cache, which is a bit tight for long-horizon tasks like Claude Code that easily require 64K context.
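
To put rough numbers on that, here is a back-of-the-envelope sketch. The layer and head counts below are placeholders, not the actual MiniMax-M2 config, so the figures are only indicative:

```python
# Rough KV cache footprint:
#   2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_element * context_length
# NOTE: these architecture numbers are placeholders, not the real MiniMax-M2 config.
n_layers = 62
n_kv_heads = 8
head_dim = 128
bytes_fp16 = 2

def kv_cache_gib(context_len: int) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16
    return per_token * context_len / 1024**3

for ctx in (32_768, 65_536):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gib(ctx):.1f} GiB of KV cache at fp16")
```

With those placeholder numbers, a 64K-token cache at fp16 comes to roughly 15 GiB, i.e. essentially all of the headroom left after the weights.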

Quantize kv_cache

@hxssgaa @ha1ry @0xSero Hey folks, we just dropped a 40% REAP: https://huggingface.co/cerebras/MiniMax-M2-REAP-139B-A10B
We do see a slightly bigger drop of a few percentage points on some benchmarks; please let us know if you see issues with the model!
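
For anyone combining this with the KV-cache quantization suggestion above, a minimal vLLM sketch might look like the following. It is illustrative only: it assumes vLLM supports the MiniMax-M2 architecture, and since the full-precision weights alone exceed 96 GB you would still need a weight-quantized variant (not shown here) to fit on one card. In llama.cpp the analogous options are --cache-type-k / --cache-type-v.

```python
from vllm import LLM, SamplingParams

# Illustrative settings, not a tested recipe for this checkpoint.
# The full-precision weights do not fit in 96 GB on their own; a weight-quantized
# variant would be needed in practice. fp8 roughly halves the KV cache vs. fp16.
llm = LLM(
    model="cerebras/MiniMax-M2-REAP-139B-A10B",
    kv_cache_dtype="fp8",         # quantize the KV cache
    max_model_len=65536,          # 64K context for long-horizon agent work
    gpu_memory_utilization=0.95,  # leave a little headroom on the card
    trust_remote_code=True,       # assumption: the repo may ship custom modeling code
)

outputs = llm.generate(
    ["Summarize the trade-offs of expert pruning in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```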
