
rwkv.cpp

This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.

Besides the usual FP32, it supports FP16 and quantized INT4 inference on CPU. This project is CPU only.
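As a rough sense of scale (a back-of-the-envelope sketch, assuming ggml-style 4-bit blocks of 32 values with one FP32 scale each; Q4_1 also stores a per-block minimum), the per-parameter cost is about 4 bytes for FP32, 2 bytes for FP16, and 0.6-0.75 bytes for INT4:

# Back-of-the-envelope weight sizes; real files add metadata overhead.
bytes_per_param = {
    'FP32': 4.0,
    'FP16': 2.0,
    'Q4_0': 20 / 32,  # per 32-value block: 16 bytes of 4-bit ints + 4-byte scale
    'Q4_1': 24 / 32,  # as above, plus a 4-byte per-block minimum
}

for n_params in (169e6, 1.5e9, 14e9):
    row = ', '.join(f'{fmt}: {n_params * b / 2**30:.2f} GiB'
                    for fmt, b in bytes_per_param.items())
    print(f'{n_params / 1e9:>5.2f}B params -> {row}')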

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers, which use O(n^2) attention over the whole context, RWKV needs only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
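Schematically, token-by-token inference looks like the loop below. rwkv_step and sample are hypothetical stand-ins, not this project's API; the point is that each token costs the same amount of work no matter how long the context already is:

# Illustrative pseudocode, not the real API: RWKV-style recurrent inference.
def generate(rwkv_step, sample, prompt_tokens, n_new_tokens):
    logits, state = None, None            # state has a fixed size, independent of context
    for token in prompt_tokens:           # feed the prompt one token at a time
        logits, state = rwkv_step(token, state)
    out = []
    for _ in range(n_new_tokens):         # generation: constant cost per token
        token = sample(logits)
        logits, state = rwkv_step(token, state)
        out.append(token)
    return out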

This project provides a C library rwkv.h and a convenient Python wrapper for it.

TODO (contributions welcome!):

  1. Measure latency and perplexity of different model sizes (169M to 14B) and data types (FP32, FP16, Q4_0, Q4_1); a rough measurement sketch follows this list
  2. Test on Linux (including Colab) and macOS
  3. Make required memory calculation more robust (see #4)
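For item 1, here is a rough sketch of measuring per-token latency and perplexity with the Python wrapper from this repository. The token list and model path are placeholders; perplexity is computed as exp of the mean negative log-likelihood, and logits is assumed to come back as a tensor, as in the example at the end of this README:

import math
import time
import torch
import rwkv_cpp_model
import rwkv_cpp_shared_library

tokens = [1, 2, 3, 4, 5]  # placeholder: use a real tokenized evaluation text

model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    '/path/to/rwkv.cpp-169M.bin'  # placeholder path
)

logits, state = None, None
nll = 0.0
start = time.time()

for token in tokens:
    if logits is not None:
        # Negative log-likelihood of the token that actually came next.
        nll -= torch.log_softmax(logits, dim=-1)[token].item()
    logits, state = model.eval(token, state)

elapsed = time.time() - start
print(f'Latency: {1000 * elapsed / len(tokens):.1f} ms/token')
print(f'Perplexity: {math.exp(nll / (len(tokens) - 1)):.2f}')
model.free()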

How to use

1. Clone the repo

Requirements: git.

git clone https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp

2. Get the rwkv.cpp library

Option 2.1. Download a pre-compiled library

Windows

Check out Releases, download the ZIP appropriate for your CPU, and extract the rwkv.dll file into the bin\Release\ directory inside the repository.

To check whether your CPU supports AVX2 or AVX-512, use CPU-Z.

Option 2.2. Build the library yourself

Windows

Requirements: CMake (standalone or from Anaconda) and the MSVC compiler.

cmake -DBUILD_SHARED_LIBS=ON .
cmake --build . --config Release

If everything went OK, the file bin\Release\rwkv.dll should appear.

Linux / macOS

Get CMake (Linux: sudo apt install cmake, macOS: brew install cmake, Anaconda: cmake package), then run:

cmake -DBUILD_SHARED_LIBS=ON .
cmake --build . --config Release

If everything went OK, librwkv.dylib (macOS) or librwkv.so (Linux) should appear in the base repository folder.

3. Download an RWKV model from Hugging Face like this one and convert it into ggml format

Requirements: Python 3.x with PyTorch.

# Windows
python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4b-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float32
# Linux / macOS
python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4b-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin float32
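Conceptually, the converter loads the PyTorch checkpoint and re-serializes every tensor into a flat binary that ggml can read. The sketch below shows only the idea; the header fields here are made up, and the real layout is defined in rwkv/convert_pytorch_to_ggml.py:

import struct
import torch

# Illustrative only: dump each checkpoint tensor with a minimal made-up
# per-tensor header. The real format written by convert_pytorch_to_ggml.py
# uses different magic values and field layout.
def convert(pth_path: str, out_path: str, use_fp16: bool) -> None:
    state_dict = torch.load(pth_path, map_location='cpu')
    with open(out_path, 'wb') as out:
        for name, tensor in state_dict.items():
            tensor = tensor.half() if use_fp16 else tensor.float()
            name_bytes = name.encode('utf-8')
            # Hypothetical header: rank, name length, dtype flag, then shape.
            out.write(struct.pack('iii', tensor.dim(), len(name_bytes), int(use_fp16)))
            out.write(struct.pack(f'{tensor.dim()}i', *tensor.shape))
            out.write(name_bytes)
            out.write(tensor.numpy().tobytes())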

3.1. Optionally, quantize the model

To convert the model into INT4 quantized format, run:

# Windows
python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1.bin 3
# Linux / macOS
python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1.bin 3

Pass 2 for Q4_0 format (smaller size, lower quality), 3 for Q4_1 format (larger size, higher quality).
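The difference between the two formats, roughly: both pack each block of 32 weights into 4-bit integers, but Q4_0 stores only a per-block scale, while Q4_1 stores a scale and a minimum, which preserves asymmetric value ranges better. A schematic NumPy version of the idea (not the exact ggml bit layout):

import numpy as np

def quantize_q4_0(block: np.ndarray):
    # One scale per block; values snapped to 16 signed levels around zero.
    scale = np.abs(block).max() / 7.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, q                      # dequantize: q * scale

def quantize_q4_1(block: np.ndarray):
    # Scale plus minimum per block; 16 unsigned levels spanning [min, max].
    lo, hi = block.min(), block.max()
    scale = (hi - lo) / 15.0
    q = np.clip(np.round((block - lo) / scale), 0, 15).astype(np.uint8)
    return scale, lo, q                  # dequantize: q * scale + lo

block = np.random.randn(32).astype(np.float32)
s, q = quantize_q4_0(block)
print('Q4_0 max error:', np.abs(q * s - block).max())
s, lo, q = quantize_q4_1(block)
print('Q4_1 max error:', np.abs(q * s + lo - block).max())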

4. Run the model

Requirements: Python 3.x with PyTorch and tokenizers.

Note: to run with the full-precision weights, pass the path to the non-quantized model instead.

To generate some text, run:

# Windows
python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1.bin
# Linux / macOS
python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1.bin

To chat with a bot, run:

# Windows
python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1.bin
# Linux / macOS
python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1.bin

Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.
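If you write your own script, sampling usually means turning the logits into probabilities with a temperature and drawing from the top-p (nucleus) probability mass. A minimal NumPy sketch of that idea (assuming logits arrives as a 1-D float array):

import numpy as np

def sample_logits(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.5) -> int:
    # Temperature: flatten (>1) or sharpen (<1) the distribution.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest set of tokens covering top_p mass.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[:np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))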


Example of using rwkv.cpp in your custom Python script:

import rwkv_cpp_model
import rwkv_cpp_shared_library

# Change this to the model path used above (quantized or full-precision).
model_path = r'C:\rwkv.cpp-169M.bin'

model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    model_path
)

logits, state = None, None

for token in [1, 2, 3]:
    logits, state = model.eval(token, state)
    
    print(f'Output logits: {logits}')
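
# (Optional continuation sketch, not part of the original example: reuse the
# final state to keep generating. Greedy decoding for brevity; logits is
# assumed to be a tensor supporting argmax. See the sampling sketch above
# for better sampling.)
for _ in range(10):
    next_token = int(logits.argmax())
    logits, state = model.eval(next_token, state)
    print(f'Generated token id: {next_token}')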

# Don't forget to free the memory once you've finished working with the model.
model.free()