Go to file
LoganDark 9e2a0de843
Add rwkv_set_print_errors and rwkv_get_last_error (#68)
* Add rwkv_set_print_errors and rwkv_get_last_error

Fixes #63

This allows retrieving errors from the library without having to
pipe stderr. Also it was annoying that rwkv.cpp assumed control of
the caller process by doing things like calling abort() when it
shouldn't, so I also fixed that.

The basic way this works is:

1. by default, not much is different, except more errors are caught,
   and rwkv.cpp should never abort the process or throw a C++
   exception.

2. the difference comes when you call rwkv_set_print_errors
   (working title):

   1. errors will no longer be printed to stderr automatically
   2. errors will be assigned to a thread-local variable (during
      init/quantization) or a context-local variable (during eval)
   3. the last error can be retrieved using rwkv_get_last_error

I also overhauled the assert macros so more error cases are
handled:

- the file is now closed if rwkv_init_from_file exits early
- the ggml context is freed if rwkv_init_from_file exits early
- if parameters cannot be found an error will be set about it

I also made some optimizations:

- just use fstat instead of opening the file twice
- deduplicated some code / removed edge cases that do not exist
- switched to ggml inplace operations where they exist

test_tiny_rwkv.c seems to run perfectly fine. The Python scripts
also.

The built DLL is perfectly backwards compatible with existing API
consumers like the python library, because it does not remove or
change any functions, only adds some optional ones.

The sad thing is that this will break every PR because the error
handling in this library was terrible and needed to be totally
redone. But I think it is worth it.

* Fix typo

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* Visual Studio lied and _fileno is incorrect

* Fix trailing comma in assert macros

This was an accident left over from something that didn't pan out,
some compilers do not like when function arguments have a trailing
comma.

* Include header file for fstat

* Remove uses of std::make_unique

* Fix width of format string argument on all platforms

* Use C free for smart pointers

* Revert "Use C free for smart pointers" and try nothrow

* Initialize cgraph to zero

* Fix ggml_cgraph initialization

* Zero-initialize allocations

---------

Co-authored-by: Alex <saharNooby@users.noreply.github.com>
2023-05-24 16:06:52 +05:00
.github/workflows Various improvements (#47) 2023-04-30 20:27:14 +05:00
ggml@ff6e03cbcd Various improvements (#52) 2023-05-08 14:28:54 +05:00
rwkv Fix encoding issue when loading prompt data (#58) 2023-05-13 21:53:54 +05:00
tests Add rwkv_set_print_errors and rwkv_get_last_error (#68) 2023-05-24 16:06:52 +05:00
.gitignore deploy : add a Package.swift for SwiftPM support (#393) 2023-03-28 19:39:01 +03:00
.gitmodules Use main ggml repo (#45) 2023-04-29 21:35:36 +05:00
CMakeLists.txt Various improvements (#52) 2023-05-08 14:28:54 +05:00
FILE_FORMAT.md Add support for Q5_0, Q5_1 and Q8_0 formats; remove Q4_1_O format (#44) 2023-04-29 17:39:11 +05:00
LICENSE Add robust automatic testing (#33) 2023-04-20 11:00:35 +05:00
README.md Various improvements (#52) 2023-05-08 14:28:54 +05:00
rwkv.cpp Add rwkv_set_print_errors and rwkv_get_last_error (#68) 2023-05-24 16:06:52 +05:00
rwkv.h Add rwkv_set_print_errors and rwkv_get_last_error (#68) 2023-05-24 16:06:52 +05:00

README.md

rwkv.cpp

This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.

Besides the usual FP32, it supports FP16, quantized INT4 and quantized INT8 inference. This project is CPU only.

This project provides a C library rwkv.h and a convinient Python wrapper for it.

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformer with O(n^2) attention, RWKV requires only state from previous step to calculate logits. This makes RWKV very CPU-friendly on large context lenghts.

Loading LoRA checkpoints in Blealtan's format is supported through merge_lora_into_ggml.py script.

Quality and performance

If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.

Below table is for reference only. Measurements were made on 4C/8T x86 CPU with AVX2, 4 threads.

Format Perplexity (169M) Latency, ms (1.5B) File size, GB (1.5B)
Q4_0 17.507 76 1.53
Q4_1 17.187 72 1.68
Q4_2 17.060 85 1.53
Q5_0 16.194 78 1.60
Q5_1 15.851 81 1.68
Q8_0 15.652 89 2.13
FP16 15.623 117 2.82
FP32 15.623 198 5.64

How to use

1. Clone the repo

Requirements: git.

git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp

2. Get the rwkv.cpp library

Option 2.1. Download a pre-compiled library

Windows / Linux / MacOS

Check out Releases, download appropriate ZIP for your OS and CPU, extract rwkv library file into the repository directory.

On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.

Option 2.2. Build the library yourself

This option is recommended for maximum performance, because the library would be built specifically for your CPU and OS.

Windows

Requirements: CMake or CMake from anaconda, MSVC compiler.

cmake .
cmake --build . --config Release

If everything went OK, bin\Release\rwkv.dll file should appear.

Linux / MacOS

Requirements: CMake (Linux: sudo apt install cmake, MacOS: brew install cmake, anaconoda: cmake package).

cmake .
cmake --build . --config Release

Anaconda & M1 users: please verify that CMAKE_SYSTEM_PROCESSOR: arm64 after running cmake . — if it detects x86_64, edit the CMakeLists.txt file under the # Compile flags to add set(CMAKE_SYSTEM_PROCESSOR "arm64").

If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.

3. Get an RWKV model

Option 3.1. Download pre-quantized Raven model

There are pre-quantized Raven models available on Hugging Face. Check that you are downloading .bin file, NOT .pth.

Option 3.2. Convert and quantize PyTorch model

Requirements: Python 3.x with PyTorch.

This option would require a little more manual work, but you can use it with any RWKV model and any target format.

First, download a model from Hugging Face like this one.

Second, convert it into rwkv.cpp format using following commands:

# Windows
python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16

# Linux / MacOS
python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin float16

Optionally, quantize the model into one of quantized formats from the table above:

# Windows
python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_2.bin Q4_2

# Linux / MacOS
python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_2.bin Q4_2

4. Run the model

Requirements: Python 3.x with PyTorch and tokenizers.

Note: change the model path with the non-quantized model for the full weights model.

To generate some text, run:

# Windows
python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_2.bin

# Linux / MacOS
python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin

To chat with a bot, run:

# Windows
python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_2.bin

# Linux / MacOS
python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin

Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.


Example of using rwkv.cpp in your custom Python script:

import rwkv_cpp_model
import rwkv_cpp_shared_library

# Change to model paths used above (quantized or full weights) 
model_path = r'C:\rwkv.cpp-169M.bin'


model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    model_path
)

logits, state = None, None

for token in [1, 2, 3]:
    logits, state = model.eval(token, state)
    
    print(f'Output logits: {logits}')

# Don't forget to free the memory after you've done working with the model
model.free()