* Rework the entire file parsing system
prepare for future changes
* Estimate memory usage perfectly
Fixes the memory estimation issue that previously affected small models
* Fix file stream ops on macOS
For me, this compiles on Windows 11, Ubuntu 20.04, and macOS 10.14
* Fix rwkv.cpp for non-WIN32 MSVC invocations like bindgen-rs
* Implement Q8_1 quantization
...and disable the type, because GGML doesn't support the ops
required to run inference with it.
It's not worth any nasty hacks or workarounds right now; Q8_0 is
very similar if one wants 8-bit quantization.
* Completely remove Q8_1 type
This type isn't meant to be user-facing in any way so I may as well
get rid of it now since it will probably never exist as a data
format.
* Switch from std::vector to unique array for model layers
These don't ever need to be resized
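The idea above can be sketched as follows. This is a minimal illustration, not rwkv.cpp's actual structs (`rwkv_layer`, `rwkv_model`, and `make_model` here are hypothetical stand-ins): when the element count is fixed once the file header is read, a `std::unique_ptr<T[]>` holds the layers without `std::vector`'s capacity bookkeeping or any resize path.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical layer struct for illustration; the real rwkv.cpp
// layer holds ggml tensors.
struct rwkv_layer {
    float placeholder;
};

struct rwkv_model {
    size_t n_layers;
    // The layer count is fixed after reading the file header, so a
    // plain heap array suffices; it can never be resized by accident.
    std::unique_ptr<rwkv_layer[]> layers;
};

rwkv_model make_model(size_t n_layers) {
    rwkv_model model;
    model.n_layers = n_layers;
    model.layers = std::make_unique<rwkv_layer[]>(n_layers);
    return model;
}
```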
* Factor ffn.key.weight height into memory estimate
Some models set this inconsistently, in various different ways.
Just give up, record its actual size, and use that.
* Make a few more operations inplace
ggml doesn't currently expose most of the stuff it supports, so
force some things. Not 100% sure about this; I don't think the
memory savings are worth it.
* attempt a perfect upper bound size for the scratch space
This should be the largest work_size seen in any model, since it
is always larger than any of the other parameters except vocab
(which does not participate in the graph work size).
* Revert "Make a few more operations inplace"
This reverts commit f94d6eb216040ae0ad23d2b9c87fae8349882f89.
* Make fewer calls to fread
micro-optimization
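The micro-optimization amounts to batching adjacent reads. A minimal sketch, not rwkv.cpp's actual loader (`read_header` and its three fields are hypothetical): several consecutive fixed-size fields are pulled in with one fread into a small buffer instead of one fread per field.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Read three consecutive uint32 header fields with a single fread
// call instead of three separate ones.
bool read_header(FILE * file, uint32_t & magic, uint32_t & version, uint32_t & n_vocab) {
    uint32_t buf[3];
    if (fread(buf, sizeof(uint32_t), 3, file) != 3) {
        return false;
    }
    magic   = buf[0];
    version = buf[1];
    n_vocab = buf[2];
    return true;
}
```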
* Fix memory size estimation for smaller models
ggml works with some larger formats internally
* print location in all assert macros
* remove trailing whitespace
* add type_to_string entry for unknown
* Simplify quantization a bit
* fix cuBLAS compatibility
Adding n_gpu_layers to rwkv_init_from_file won't work;
add an extra function instead.
* fix quantize
* quantize: don't create output file if opening input fails
* Rename gpu offload layers
We might want to avoid branding it with cuBLAS in case we add
something like CLBlast support in the future.
* Remove old read_int32 and write_int32 functions
It's all uints now
* Remove static from things
* Only call gpu_offload_layers if gpu_layer_count > 0
* Add rwkv_ prefix to all structures
* Braces
* Functions naming convention
* Remove blank line after comment
* Capitalize comments
* Re-add quantize explanatory comment
* Re-add histogram comment
* Convert all error messages to uppercase
* Make type conversions extern
for ffi bindings from other langs
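The mechanics here: C++ mangles symbol names, so FFI generators (such as bindgen for Rust, mentioned earlier in this log) can only bind functions declared `extern "C"`. A sketch with a hypothetical mapping, not rwkv.cpp's real table or signature:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical type table; extern "C" keeps the symbol unmangled
// so it is visible to C FFI tooling.
extern "C" const char * demo_type_to_string(int type) {
    switch (type) {
        case 0:  return "F32";
        case 1:  return "F16";
        case 2:  return "Q4_0";
        default: return "unknown"; // catch-all for unrecognized types
    }
}
```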
* Name the state parts
The code in rwkv_eval to initialize the state (when state_in is
NULL) was getting very confusing, so I just put everything in a
struct to name it.
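A sketch of the idea with hypothetical field names (the real struct layout in rwkv.cpp may differ): grouping the per-layer state values in a named struct makes the state_in == NULL initialization path read as assignments to names rather than offsets into a flat buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-layer state; field names are illustrative.
struct rwkv_layer_state {
    float ffn_xx;
    float att_xx;
    float att_aa;
    float att_bb;
    float att_pp;
};

std::vector<rwkv_layer_state> init_state(size_t n_layers) {
    std::vector<rwkv_layer_state> state(n_layers);
    for (rwkv_layer_state & s : state) {
        s.ffn_xx = 0.0f;
        s.att_xx = 0.0f;
        s.att_aa = 0.0f;
        s.att_bb = 0.0f;
        // RWKV initializes the running-max part of the attention
        // state to a very large negative value.
        s.att_pp = -1e30f;
    }
    return state;
}
```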
* Fnvalid
* chore: add ggml import in the head of rwkv.h
* feat: add cublas support
* feat: update rwkv.cpp
* feat: remove unused change
* chore: fix linux build issue
* chore: sync ggml and offload tensor to gpu
* chore: comment out tensors which cause errors on GPU
* chore: update comment and readme
* chore: update ggml to recent
* chore: add more performance test results
* chore: fix reading files larger than 2 GB
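The usual cause of this class of bug: on Windows, fseek/ftell take a 32-bit long, which overflows past 2 GB. A sketch of 64-bit-safe wrappers, assuming the common platform APIs (_fseeki64/_ftelli64 on MSVC, fseeko/ftello on POSIX); the names `file_seek`/`file_tell` are illustrative, not rwkv.cpp's:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// 64-bit-safe seek/tell wrappers; plain fseek/ftell use a 32-bit
// long on Windows and cannot address offsets beyond 2 GB.
#ifdef _WIN32
static int64_t file_tell(FILE * file) {
    return _ftelli64(file);
}
static int file_seek(FILE * file, int64_t offset) {
    return _fseeki64(file, offset, SEEK_SET);
}
#else
static int64_t file_tell(FILE * file) {
    return (int64_t) ftello(file);
}
static int file_seek(FILE * file, int64_t offset) {
    return fseeko(file, (off_t) offset, SEEK_SET);
}
#endif
```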
* chore: merge master
* chore: remove unused comment
* chore: fix for comments
* Update README.md
* Update rwkv.cpp
---------
Co-authored-by: Alex <saharNooby@users.noreply.github.com>
* Use types from typing for better compatibility with older Python versions
* Split last double end of line token as per BlinkDL's suggestion
* Fix MSVC warnings
* Drop Q4_2 support
* Update ggml
* Bump file format version for quantization changes
* Apply suggestions
* Update ggml
* Pack only rwkv.dll for Windows releases
Test executables are no longer packed.
* Move test code into a separate file
* Remove redundant zeroing
* Refactor chat script
* Remove Q4_3 support
* Add Q5_0, Q5_1, Q8_0 support
* Add a clearer message when loading Q4_3 models
* Remove Q4_1_O format
* Fix indentation in .gitmodules
* Simplify sanitizer matrix