From 82e2faa1907e1df81bdeb49dc295beb8d3f1f1c7 Mon Sep 17 00:00:00 2001
From: saharNooby
Date: Mon, 17 Apr 2023 19:17:47 +0400
Subject: [PATCH] Update data type info

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 5b29344..fd04047 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Loading LoRA checkpoints in [Blealtan's format](https://github.com/Blealtan/RWKV
 
 **TODO (contributions welcome!)**:
 
-1. Optimize AVX2 implementation of `Q4_1_O` matmul — currently, it is as slow as `FP32`
+1. Optimize AVX2 implementation of `Q4_1_O` matmul — currently, it is 40% slower than `Q4_1`
 2. Measure latency and perplexity of different model sizes (169M to 14B) and data types (`FP32`, `FP16`, `Q4_0`, `Q4_1`, `Q4_1_O`)
 3. Test on Linux (including Colab) and MacOS
 4. Make required memory calculation more robust (see [#4](https://github.com/saharNooby/rwkv.cpp/issues/4))
@@ -91,9 +91,9 @@ python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-
 
 Formats available:
 
-- `4`: `Q4_1_O`, best quality, very slow (as `FP32`).
-- `3`: `Q4_1`, poor quality, very fast (as `FP16`).
-- `2`: `Q4_0`, worst quality, breaks larger models, moderately fast (between `FP16` and `FP32`).
+- `4`: `Q4_1_O`, best quality, slow (30% slower than `FP16`).
+- `3`: `Q4_1`, poor quality, fast (comparable to `FP16`).
+- `2`: `Q4_0`, worst quality, breaks larger models, very fast.
 
 ### 4. Run the model