Commit Graph

40 Commits

Author SHA1 Message Date
LoganDark fb6708b555
Fix PyTorch storage warnings, fixes #80 (#88)
We seriously don't care what type of storage we get; PyTorch sucks
2023-06-03 15:09:51 +05:00
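For context, a minimal sketch of warning-free storage access, assuming torch >= 1.13; the fallback chain is an illustration, not the converter's exact code:

```python
# Minimal sketch: read a tensor's backing storage without triggering the
# "TypedStorage is deprecated" warning. Only the raw bytes matter here;
# the element type PyTorch assigns to the storage is irrelevant.
import torch

def raw_storage(t: torch.Tensor):
    if hasattr(t, "untyped_storage"):      # torch >= 2.0
        return t.untyped_storage()
    return t.storage()                     # older torch; may still warn

w = torch.zeros(4, dtype=torch.float16)
print(raw_storage(w).nbytes())             # 8 bytes: 4 elements * 2 bytes
```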
LoganDark 363dfb1a06
File parsing and memory usage optimization (#74)
* Rework the entire file parsing system

prepare for future changes

* Estimate memory usage perfectly

Removes whatever issue used to exist with small models

* Fix file stream ops on macOS

For me, this compiles on Windows 11, Ubuntu 20.04, and macOS 10.14

* Fix rwkv.cpp for non-WIN32 MSVC invocations like bindgen-rs

* Implement Q8_1 quantization

...and disable the type, because GGML doesn't support the ops
required to run inference with it.

It's not worth any nasty hacks or workarounds right now; Q8_0 is
very similar if one wants 8-bit quantization.

* Completely remove Q8_1 type

This type isn't meant to be user-facing in any way, so I may as well
get rid of it now, since it will probably never exist as a data
format.

* Switch from std::vector to unique array for model layers

These never need to be resized

* Factor ffn.key.weight height into memory estimate

Some models set this in various unusual ways, so just give up on
predicting it: record its actual size and use that.

* Make a few more operations inplace

ggml doesn't currently expose most of the in-place ops it supports,
so this forces a few of them. Not 100% sure about this; the memory
savings may not be worth it.

* Attempt a perfect upper bound size for the scratch space

This should be the largest work_size seen in any model, since it
is always larger than any of the other parameters except vocab
(which does not participate in the graph work size).

* Revert "Make a few more operations inplace"

This reverts commit f94d6eb216040ae0ad23d2b9c87fae8349882f89.

* Make fewer calls to fread

A micro-optimization: batching reads reduces per-call overhead

* Fix memory size estimation for smaller models

ggml works with some larger formats internally

* Print location in all assert macros

* Remove trailing whitespace

* Add type_to_string entry for unknown

* Simplify quantization a bit

* Fix cuBLAS compatibility

Adding n_gpu_layers to rwkv_init_from_file won't work;
add an extra function instead

* Fix quantize

* quantize: don't create output file if opening input fails

* Rename gpu offload layers

We might want to avoid branding it with cuBLAS in case we add
something like CLBlast support in the future

* Remove old read_int32 and write_int32 functions

It's all uints now

* Remove static from things

* Only call gpu_offload_layers if gpu_layer_count > 0

* Add rwkv_ prefix to all structures

* Braces

* Function naming convention

* Remove blank line after comment

* Capitalize comments

* Re-add quantize explanatory comment

* Re-add histogram comment

* Convert all error messages to uppercase

* Make type conversions extern

Needed for FFI bindings from other languages (see the ctypes sketch
after this entry)

* Name the state parts

The code in rwkv_eval that initializes the state (when state_in is
NULL) was getting very confusing, so I put everything in a struct
to name the parts.

* Fnvalid
2023-05-31 16:31:19 +05:00
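Since the PR above makes the type conversions extern specifically for FFI, here is a hedged ctypes sketch of how another language binds the C API; the library path is a placeholder, and while the function names match rwkv.h from this commit range, treat the exact signatures as assumptions:

```python
# Hedged sketch: binding librwkv from Python via ctypes.
import ctypes

lib = ctypes.CDLL("./librwkv.so")  # platform-specific path (placeholder)

lib.rwkv_init_from_file.argtypes = [ctypes.c_char_p, ctypes.c_uint32]
lib.rwkv_init_from_file.restype = ctypes.c_void_p

lib.rwkv_gpu_offload_layers.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
lib.rwkv_gpu_offload_layers.restype = ctypes.c_bool

ctx = lib.rwkv_init_from_file(b"model.bin", 4)  # 4 CPU threads
n_gpu_layers = 0
# Mirrors "Only call gpu_offload_layers if gpu_layer_count > 0":
if n_gpu_layers > 0:
    lib.rwkv_gpu_offload_layers(ctx, n_gpu_layers)
```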
YorkZero 241350fde6
Feature add cublas support (#65)
* chore: add ggml import in the head of rwkv.h

* chore: add ggml import in the head of rwkv.h

* feat: add cublas support

* feat: update rwkv.cpp

* feat: remove unused change

* chore: fix linux build issue

* chore: sync ggml and offload tensor to gpu

* chore: comment out tensors that cause errors on GPU

* chore: update comment and readme

* chore: update ggml to a recent version

* chore: add more performance test results

* chore: add more performance test results

* chore: fix reading of files larger than 2 GB

* chore: merge master

* chore: remove unused comment

* chore: address review comments

* Update README.md

* Update rwkv.cpp

---------

Co-authored-by: Alex <saharNooby@users.noreply.github.com>
2023-05-29 17:10:19 +05:00
Alex dea929f8ca
Various improvements & upgrade ggml (#75)
* Use types from typing for better compatibility with older Python versions

* Split last double end of line token as per BlinkDL's suggestion (see the token sketch after this entry)

* Fix MSVC warnings

* Drop Q4_2 support

* Update ggml

* Bump file format version for quantization changes

* Apply suggestions
2023-05-27 16:02:24 +05:00
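A hedged sketch of the double end-of-line split; the token ids are assumptions based on the RWKV 20B tokenizer, where "\n" and "\n\n" encode as distinct tokens:

```python
# Hedged sketch: if the encoded prompt ends with the single "\n\n" token,
# replace it with two "\n" tokens. Ids are assumed, not taken from the code.
NEWLINE = 187         # assumed id of "\n"
DOUBLE_NEWLINE = 535  # assumed id of "\n\n"

def split_last_double_newline(tokens: list) -> list:
    if tokens and tokens[-1] == DOUBLE_NEWLINE:
        return tokens[:-1] + [NEWLINE, NEWLINE]
    return tokens

print(split_last_double_newline([100, 535]))  # [100, 187, 187]
```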
LoganDark b61d94aef0
Flush output every token in generate_completions.py (#73) 2023-05-26 17:23:58 +05:00
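The flush fix is a one-liner in spirit; a minimal sketch with a stand-in token stream (not the script's code):

```python
# Minimal sketch: without flush=True, stdout is block-buffered when piped,
# so generated text appears in bursts instead of streaming token by token.
def fake_token_stream():  # stand-in for the model's generation loop
    yield from ["Hello", ",", " world", "!"]

for token_text in fake_token_stream():
    print(token_text, end="", flush=True)  # flush after every token
print()
```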
LoganDark d26791b5bc
Silence PyTorch warnings by using untyped storage (#72) 2023-05-26 17:21:18 +05:00
柏园猫 1c363e6d5f
Fix encoding issue when loading prompt data (#58)
* Fix encoding issue when loading prompt data

* Update chat_with_bot.py

Fix code style

---------

Co-authored-by: Alex <saharNooby@users.noreply.github.com>
2023-05-13 21:53:54 +05:00
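A minimal sketch of the likely shape of this fix, assuming the bug was Python's locale-dependent default encoding; the file name is a placeholder:

```python
# Minimal sketch: open() without an explicit encoding uses the locale default
# (e.g. cp1252 or cp936 on Windows), which mangles non-ASCII prompt data.
# Requesting UTF-8 explicitly makes loading deterministic on every platform.
with open("prompt.txt", "r", encoding="utf-8") as f:
    prompt = f.read()
```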
Alex a3178b20ea
Various improvements (#52)
* Update ggml

* Add link to pre-quantized models in README

* Enable W4 for MSVC

* Fix warnings, clean up code

* Fix LoRA merge script
2023-05-08 14:28:54 +05:00
Alex 5eb8f09c14
Various improvements (#47)
* Update ggml

* Pack only rwkv.dll for Windows releases

Test executables are no longer included in the release archive.

* Move test code into a separate file

* Remove redundant zeroing

* Refactor chat script
2023-04-30 20:27:14 +05:00
Jarrett Ye 3621172428
punish repetitions & break if END_OF_TEXT & decouple prompts from chat script (#37)
* punish repetitions & break if END_OF_TEXT (see the sampling sketch after this entry)

* decouple prompts from chat_with_bot.py

* improve code style

* Update rwkv/chat_with_bot.py

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* Update rwkv/chat_with_bot.py

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* add types

* JSON prompt

---------

Co-authored-by: Alex <saharNooby@users.noreply.github.com>
2023-04-30 18:50:05 +05:00
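A hedged sketch of the two generation changes described in the entry above; the penalty values, END_OF_TEXT id, and model stub are all assumptions, not the script's code:

```python
# Hedged sketch: penalize tokens that already occurred ("punish repetitions")
# and stop generation when END_OF_TEXT is sampled. Constants are assumed.
END_OF_TEXT = 0          # assumed id of <|endoftext|>
PRESENCE_PENALTY = 0.2   # hypothetical values
FREQUENCY_PENALTY = 0.2
MAX_TOKENS = 8

def model_eval(token: int) -> list:  # stand-in for rwkv_eval
    return [0.1, 0.5, 2.0, 0.3]

logits = [0.1, 0.5, 2.0, 0.3]
occurrences: dict = {}
for _ in range(MAX_TOKENS):
    for token_id, count in occurrences.items():
        logits[token_id] -= PRESENCE_PENALTY + count * FREQUENCY_PENALTY
    token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
    if token == END_OF_TEXT:  # "break if END_OF_TEXT"
        break
    occurrences[token] = occurrences.get(token, 0) + 1
    logits = model_eval(token)
```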
Alex 1198892888
Add support for Q5_0, Q5_1 and Q8_0 formats; remove Q4_1_O format (#44)
* Remove Q4_3 support

* Add Q5_0, Q5_1, Q8_0 support

* Add a clearer message when loading a Q4_3 model

* Remove Q4_1_O format

* Fix indentation in .gitmodules

* Simplify sanitizer matrix
2023-04-29 17:39:11 +05:00
Alex c736ef5411
Improve chat_with_bot.py script (#39) 2023-04-22 20:33:58 +05:00
Alex 3587ff9e58
Sync ggml with upstream (#38)
* Sync ggml with upstream

* Remove file filters from Actions triggers

* Update ggml

* Add Q4_2 and Q4_3 support

* Improve output of perplexity measuring script

* Add tests for new formats

* Add token limit argument to perplexity measuring script

* Update README

* Update README

* Update ggml

* Use master branch of ggml
2023-04-22 20:25:29 +05:00
Jarrett Ye ac663631e1
Improve the prompt & fix chinese display issue & support commands (#34)
* update the prompt

* Fix Chinese display issue

* remove debug code

* support commands (#1)

+reset +gen +i +qq +qa +++ ++ +

* run_rnn before decode

* remove debug code

* deep copy logits

* remove extra print()

* print newline if max_tokens_per_generation is reached

* fix typo in init prompt

* Update rwkv/chat_with_bot.py

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* Update rwkv/chat_with_bot.py

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* Update rwkv/chat_with_bot.py

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* Update rwkv/chat_with_bot.py

Co-authored-by: Alex <saharNooby@users.noreply.github.com>

* refine code & type annotation

* add comments for commands

* support changing temp & top_p during chat

* set default language & prompt

---------

Co-authored-by: Alex <saharNooby@users.noreply.github.com>
2023-04-22 12:48:44 +05:00
saharNooby 678f5233a5 Add LoRA loading support 2023-04-15 20:46:30 +04:00
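A hedged sketch of the standard LoRA merge formula this implies, W' = W + (alpha / r) * (B @ A); tensor names and scaling follow the LoRA paper, not necessarily this script:

```python
# Hedged sketch: merging a LoRA delta into a base weight matrix.
import torch

def merge_lora(w: torch.Tensor, lora_a: torch.Tensor,
               lora_b: torch.Tensor, alpha: float) -> torch.Tensor:
    r = lora_b.shape[1]                     # LoRA rank
    return w + (alpha / r) * (lora_b @ lora_a)

w = torch.zeros(8, 16)                      # base weight (out, in)
a = torch.randn(4, 16)                      # LoRA A (r, in)
b = torch.randn(8, 4)                       # LoRA B (out, r)
merged = merge_lora(w, a, b, alpha=8.0)
```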
saharNooby e4268a36c8 Update file format documentation 2023-04-14 18:59:16 +04:00
saharNooby 85db23c7de Add script that measures perplexity 2023-04-08 10:41:16 +04:00
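For reference, a minimal sketch of what such a script computes (not the script itself): perplexity is the exponential of the mean per-token negative log-likelihood:

```python
# Minimal sketch: perplexity = exp(mean NLL over predicted tokens).
import math

def perplexity(token_nlls: list) -> float:
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.1, 1.7, 2.4]))  # ~7.9
```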
saharNooby e04baa032c Remove reference impl comparison test 2023-04-08 10:01:29 +04:00
saharNooby c40941d9d0 Add Q4_1_O format 2023-04-07 09:55:39 +04:00
saharNooby fa9ad13a39 Free ggml context when model is garbage collected 2023-04-06 20:27:33 +04:00
hypnopump a9cb9adfd6
streaming output 2023-04-04 18:27:04 +02:00
PXLKSR 977efba905 We actually build a dylib on macOS 2023-04-04 10:19:06 +02:00
hypnopump 0a0cabc4c7
for consistency 2023-04-03 08:27:00 +02:00
hypnopump 6f3fb01913
suggestions 2023-04-03 08:25:54 +02:00
hypnopump a64aaa81ec
initial addition 2023-04-03 00:52:26 +02:00
saharNooby e0684e8104 Add text generation and chat scripts 2023-04-02 15:03:31 +04:00
saharNooby 935d16f5db Move library wrapper to separate file, refactor code 2023-04-02 12:24:40 +04:00
saharNooby 972e28d48d Implement INT4 conversion and inference 2023-04-01 19:22:01 +04:00
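A hedged sketch of the general idea behind 4-bit block quantization (a simplified symmetric scheme; not ggml's exact Q4 layout):

```python
# Hedged sketch: each block of 32 floats shares one scale, and each value
# is stored as a small signed integer that fits in 4 bits once packed.
import numpy as np

def quantize_block_q4(block: np.ndarray):
    scale = float(np.abs(block).max()) / 7.0 or 1.0  # map into [-7, 7]
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return scale, q

def dequantize_block_q4(scale: float, q: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)
scale, q = quantize_block_q4(block)
print(np.abs(block - dequantize_block_q4(scale, q)).max())  # max error
```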
saharNooby a1e1d34c93 Add Python wrapper for C library 2023-04-01 16:02:22 +04:00
saharNooby 7130a89d1f [FILE FORMAT CHANGED] Reverse dimensions in ggml file (makes it more similar to llama.cpp format) 2023-04-01 14:41:30 +04:00
saharNooby f6d45baec0 Support FP16 inference 2023-04-01 11:53:49 +04:00
saharNooby fe98c94a63 [FILE FORMAT CHANGED] Use ggml_get_rows to get embedding 2023-04-01 11:28:32 +04:00
saharNooby 16ec7a5c18 Add fail-fast version of the test 2023-04-01 11:15:15 +04:00
saharNooby 0fcb7c64c6 Remove reference implementation code and test against pre-created logits 2023-04-01 11:09:24 +04:00
saharNooby 6fe9486cee Finally, FP32 inference 2023-04-01 10:06:39 +04:00
saharNooby 61c6b1a4e0 Add comparison against reference implementation script, implement state & logits saving 2023-03-31 20:23:42 +04:00
saharNooby d00f28581a Add reference implementation of RWKV RNN 2023-03-31 19:57:16 +04:00
saharNooby fe272dc3d3 Minor changes 2023-03-31 10:24:12 +04:00
saharNooby 873cb954d0 Make ln0 work correctly 2023-03-30 20:01:26 +04:00
saharNooby 2f51451561 Initial commit 2023-03-30 17:55:30 +04:00