# rwkv.cpp
This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.
Besides the usual FP32, it supports FP16 and quantized INT4 inference on CPU. This project is CPU only.
RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV needs only the state from the previous step to compute logits. This makes RWKV very CPU-friendly at large context lengths.
This project provides a C library `rwkv.h` and a convenient Python wrapper for it.
Loading LoRA checkpoints in Blealtan's format is supported through the `merge_lora_into_ggml.py` script.
TODO (contributions welcome!):

- Optimize the AVX2 implementation of the `Q4_1_O` matmul; currently it is 40% slower than `Q4_1`
- Measure latency and perplexity of different model sizes (169M to 14B) and data types (`FP32`, `FP16`, `Q4_0`, `Q4_1`, `Q4_1_O`); a rough timing sketch follows this list
- Test on Linux (including Colab) and MacOS
- Make required memory calculation more robust (see #4)
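One rough starting point for the latency measurement item is timing single-token evaluation through the Python wrapper. This is only a sketch: the model path is a placeholder and the token ids are arbitrary; perplexity measurement would additionally need a tokenized evaluation dataset.

```python
import time

import rwkv_cpp_model
import rwkv_cpp_shared_library

# Placeholder model path; point this at any converted or quantized model
model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    r'C:\rwkv.cpp-169M.bin'
)

logits, state = None, None
timings = []

# Feed 64 arbitrary token ids and time each single-token evaluation
for token in range(64):
    start = time.perf_counter()
    logits, state = model.eval(token, state)
    timings.append(time.perf_counter() - start)

# Skip the first call, which includes one-time setup cost
print(f'Average per-token latency: {1000 * sum(timings[1:]) / len(timings[1:]):.1f} ms')

model.free()
```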
## How to use
### 1. Clone the repo

Requirements: git.

```
git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp
```
### 2. Get the rwkv.cpp library

#### Option 2.1. Download a pre-compiled library

**Windows / Linux / MacOS**

Check out Releases, download the appropriate ZIP for your OS and CPU, and extract the `rwkv` library file into the repository directory.

On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.
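On Linux, one quick way to check the same thing is to look at the CPU flags the kernel reports. A minimal sketch (Linux only):

```python
# Linux-only sketch: check whether the CPU advertises AVX2 / AVX-512 support
with open('/proc/cpuinfo') as f:
    flags = set()
    for line in f:
        if line.startswith('flags'):
            flags.update(line.split(':', 1)[1].split())

print('AVX2:   ', 'avx2' in flags)
print('AVX-512:', any(flag.startswith('avx512') for flag in flags))
```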
#### Option 2.2. Build the library yourself

**Windows**

Requirements: CMake or CMake from anaconda, MSVC compiler.

```
cmake -DBUILD_SHARED_LIBS=ON .
cmake --build . --config Release
```

If everything went OK, the `bin\Release\rwkv.dll` file should appear.
**Linux / MacOS**

Requirements: CMake (Linux: `sudo apt install cmake`, MacOS: `brew install cmake`, Anaconda: the `cmake` package).

```
cmake -DBUILD_SHARED_LIBS=ON .
cmake --build . --config Release
```

Anaconda & M1 users: please verify that `CMAKE_SYSTEM_PROCESSOR` is set to `arm64` after running `cmake -DBUILD_SHARED_LIBS=ON .`; if it detects `x86_64`, edit the `CMakeLists.txt` file under the `# Compile flags` section to add `set(CMAKE_SYSTEM_PROCESSOR "arm64")`.

If everything went OK, the `librwkv.so` (Linux) or `librwkv.dylib` (MacOS) file should appear in the base repo folder.
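To confirm that the build produced a loadable shared library (roughly what the Python wrapper's `load_rwkv_shared_library()` does for you), a quick `ctypes` sanity check, assuming it is run from the repository root:

```python
import ctypes
import platform

# Pick the file name produced by the build on this OS (see above)
library_name = 'librwkv.dylib' if platform.system() == 'Darwin' else 'librwkv.so'

# Assumes the library sits in the current directory (the repository root)
ctypes.CDLL('./' + library_name)
print(f'{library_name} loaded successfully')
```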
### 3. Download an RWKV model from Hugging Face like this one and convert it into ggml format

Requirements: Python 3.x with PyTorch.

```
# Windows
python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16

# Linux / MacOS
python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin float16
```
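If you prefer to fetch the checkpoint programmatically instead of through the browser, here is a sketch using the `huggingface_hub` package; the repo id and filename below are assumptions matching the 169M example used throughout this README:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions based on the 169M example model above
checkpoint_path = hf_hub_download(
    repo_id='BlinkDL/rwkv-4-pile-169m',
    filename='RWKV-4-Pile-169M-20220807-8023.pth',
)
print(checkpoint_path)
```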
#### 3.1. Optionally, quantize the model

To convert the model into INT4 quantized format, run:

```
# Windows
python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1_O.bin 4

# Linux / MacOS
python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin 4
```
Formats available:

- `4`: `Q4_1_O`, best quality, slow (30% slower than `FP16`).
- `3`: `Q4_1`, poor quality, fast (comparable to `FP16`).
- `2`: `Q4_0`, worst quality, breaks larger models, very fast.
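As a rough illustration of what this kind of block-wise INT4 quantization does (a simplified sketch, not the exact rwkv.cpp/ggml code): weights are split into small blocks, and each block stores a minimum, a scale, and one 4-bit index per value.

```python
import numpy as np

def quantize_block_q4(values: np.ndarray):
    """Quantize one block of floats to 4-bit indices plus a (min, scale) pair."""
    vmin = float(values.min())
    scale = (float(values.max()) - vmin) / 15.0 or 1.0  # 15 = largest 4-bit index
    indices = np.round((values - vmin) / scale).astype(np.uint8)  # values 0..15
    return vmin, scale, indices

def dequantize_block_q4(vmin: float, scale: float, indices: np.ndarray) -> np.ndarray:
    """Reconstruct approximate floats from the quantized block."""
    return vmin + scale * indices.astype(np.float32)

block = np.random.randn(32).astype(np.float32)  # ggml-style blocks hold 32 values
vmin, scale, indices = quantize_block_q4(block)
error = np.abs(block - dequantize_block_q4(vmin, scale, indices)).max()
print(f'Max absolute quantization error in this block: {error:.4f}')
```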
### 4. Run the model

Requirements: Python 3.x with PyTorch and tokenizers.

Note: to run the full-weights model, replace the quantized model path below with the path to the non-quantized model.

To generate some text, run:

```
# Windows
python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1_O.bin

# Linux / MacOS
python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
```

To chat with a bot, run:

```
# Windows
python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1_O.bin

# Linux / MacOS
python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
```
Edit `generate_completions.py` or `chat_with_bot.py` to change prompts and sampling settings.
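The sampling settings control how the next token is picked from the logits. As a general illustration (not necessarily the exact logic used in those scripts), temperature plus nucleus (top-p) sampling looks roughly like this:

```python
import numpy as np

def sample_logits(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.5) -> int:
    """Pick a token id from raw logits using temperature and top-p (nucleus) sampling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability reaches top_p
    sorted_ids = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[sorted_ids])
    keep = sorted_ids[:np.searchsorted(cumulative, top_p) + 1]

    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))
```

If the logits come back as a PyTorch tensor, convert them with `logits.numpy()` before passing them to a NumPy-based sampler like this one.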
Example of using `rwkv.cpp` in your custom Python script:

```python
import rwkv_cpp_model
import rwkv_cpp_shared_library

# Change this to the model path used above (quantized or full weights)
model_path = r'C:\rwkv.cpp-169M.bin'

model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    model_path
)

logits, state = None, None

for token in [1, 2, 3]:
    logits, state = model.eval(token, state)

print(f'Output logits: {logits}')

# Don't forget to free the memory after you are done working with the model
model.free()
```
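To feed real text instead of hard-coded token ids, the prompt has to go through the RWKV tokenizer first. Here is a sketch using the `tokenizers` package; the tokenizer file name below is an assumption, so point it at whichever tokenizer file the scripts in this repository use:

```python
import tokenizers

import rwkv_cpp_model
import rwkv_cpp_shared_library

# Both paths are placeholders; adjust them to your setup
tokenizer = tokenizers.Tokenizer.from_file('20B_tokenizer.json')
model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    r'C:\rwkv.cpp-169M.bin'
)

logits, state = None, None

# Feed the prompt one token at a time, carrying the state forward
for token in tokenizer.encode('The quick brown fox').ids:
    logits, state = model.eval(token, state)

# `logits` now scores every vocabulary entry as the next token
print(f'Most likely next token id: {int(logits.argmax())}')

model.free()
```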