We need 50 or 60% expert pruning please
#3 · opened by hxssgaa
Being able to run this on a single 96GB H200 or RTX Pro 6000 with 4-bit quantisation after expert pruning would be very useful. I don't mind sacrificing a little more performance.
You can fit the 30% REAP at IQ4_XS on it.
Model weights take about 80 GB, leaving roughly 16 GB for the KV cache, which is a bit tight for long-horizon tasks like Claude Code that easily require 64K context.
Quantize the KV cache.
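For a rough sense of scale, here's a back-of-envelope sketch of KV-cache size versus context length and how much an ~8-bit cache saves. The layer/head/dim numbers below are placeholders, not MiniMax-M2's actual config, so plug in the real values before trusting the figures:

```python
# Back-of-envelope KV-cache sizing. The defaults are ASSUMED placeholder dims,
# not MiniMax-M2's actual architecture.
def kv_cache_gib(context_len, num_layers=60, num_kv_heads=8, head_dim=128,
                 bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per head dim, per token
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 1024**3

for ctx in (32_768, 65_536, 131_072):
    fp16 = kv_cache_gib(ctx, bytes_per_elem=2)   # fp16 cache
    q8   = kv_cache_gib(ctx, bytes_per_elem=1)   # ~8-bit quantized cache
    print(f"{ctx:>7} ctx: fp16 ≈ {fp16:5.1f} GiB, q8 ≈ {q8:5.1f} GiB")
```

With numbers in that ballpark, a 64K fp16 cache is on the order of 15 GiB, so an 8-bit cache roughly doubles the usable context in the same headroom. If the runtime is llama.cpp, the `--cache-type-k` / `--cache-type-v` options set this.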
@hxssgaa @ha1ry @0xSero
Hey folks, we just dropped a 40% REAP: https://huggingface.co/cerebras/MiniMax-M2-REAP-139B-A10B
We do see a slightly bigger drop of a few percentage points on some benchmarks; please let us know if you see issues with the model!