* Rework the entire file parsing system
prepare for future changes
* Estimate memory usage perfectly
Fixes the memory estimation issue that previously affected small models
* Fix file stream ops on macOS
For me, this compiles on Windows 11, Ubuntu 20.04, and macOS 10.14
* Fix rwkv.cpp for non-WIN32 MSVC invocations like bindgen-rs
* Implement Q8_1 quantization
...and disable the type, because GGML doesn't support the ops
required to run inference with it.
It's not worth any nasty hacks or workarounds right now; Q8_0 is
very similar if one wants 8-bit quantization.
* Completely remove Q8_1 type
This type isn't meant to be user-facing in any way so I may as well
get rid of it now since it will probably never exist as a data
format.
* Switch from std::vector to unique array for model layers
These don't ever need to be resized
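The idea above can be sketched as follows. This is a minimal illustration, not rwkv.cpp's actual structs (`rwkv_layer`, `rwkv_model`, and `make_model` here are hypothetical stand-ins): when the element count is fixed once the file header is read, a `std::unique_ptr<T[]>` holds the layers without `std::vector`'s capacity bookkeeping or any resize path.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical layer struct for illustration; the real rwkv.cpp
// layer holds ggml tensors.
struct rwkv_layer {
    float placeholder;
};

struct rwkv_model {
    size_t n_layers;
    // The layer count is fixed after reading the file header, so a
    // plain heap array suffices; it can never be resized by accident.
    std::unique_ptr<rwkv_layer[]> layers;
};

rwkv_model make_model(size_t n_layers) {
    rwkv_model model;
    model.n_layers = n_layers;
    model.layers = std::make_unique<rwkv_layer[]>(n_layers);
    return model;
}
```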
* Factor ffn.key.weight height into memory estimate
Some models set this inconsistently, in various different ways.
Just give up, record its actual size, and use that.
* Make a few more operations inplace
ggml doesn't currently expose most of the stuff it supports, so
force some things. Not 100% sure about this; I don't think the
memory savings are worth it.
* attempt a perfect upper bound size for the scratch space
This should be the largest work_size seen in any model, since it
is always larger than any of the other parameters except vocab
(which does not participate in the graph work size).
* Revert "Make a few more operations inplace"
This reverts commit f94d6eb216040ae0ad23d2b9c87fae8349882f89.
* Make fewer calls to fread
micro-optimization
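The micro-optimization amounts to batching adjacent reads. A minimal sketch, not rwkv.cpp's actual loader (`read_header` and its three fields are hypothetical): several consecutive fixed-size fields are pulled in with one fread into a small buffer instead of one fread per field.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Read three consecutive uint32 header fields with a single fread
// call instead of three separate ones.
bool read_header(FILE * file, uint32_t & magic, uint32_t & version, uint32_t & n_vocab) {
    uint32_t buf[3];
    if (fread(buf, sizeof(uint32_t), 3, file) != 3) {
        return false;
    }
    magic   = buf[0];
    version = buf[1];
    n_vocab = buf[2];
    return true;
}
```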
* Fix memory size estimation for smaller models
ggml works with some larger formats internally
* print location in all assert macros
* remove trailing whitespace
* add type_to_string entry for unknown
* Simplify quantization a bit
* fix cuBLAS compatibility
Adding n_gpu_layers to rwkv_init_from_file won't work;
add an extra function instead.
* fix quantize
* quantize: don't create output file if opening input fails
* Rename gpu offload layers
We might want to avoid branding it with cuBLAS in case we add
something like CLBlast support in the future.
* Remove old read_int32 and write_int32 functions
It's all uints now
* Remove static from things
* Only call gpu_offload_layers if gpu_layer_count > 0
* Add rwkv_ prefix to all structures
* Braces
* Functions naming convention
* Remove blank line after comment
* Capitalize comments
* Re-add quantize explanatory comment
* Re-add histogram comment
* Convert all error messages to uppercase
* Make type conversions extern
for ffi bindings from other langs
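The mechanics here: C++ mangles symbol names, so FFI generators (such as bindgen for Rust, mentioned earlier in this log) can only bind functions declared `extern "C"`. A sketch with a hypothetical mapping, not rwkv.cpp's real table or signature:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical type table; extern "C" keeps the symbol unmangled
// so it is visible to C FFI tooling.
extern "C" const char * demo_type_to_string(int type) {
    switch (type) {
        case 0:  return "F32";
        case 1:  return "F16";
        case 2:  return "Q4_0";
        default: return "unknown"; // catch-all for unrecognized types
    }
}
```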
* Name the state parts
The code in rwkv_eval to initialize the state (when state_in is
NULL) was getting very confusing, so I just put everything in a
struct to name it.
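A sketch of the idea with hypothetical field names (the real struct layout in rwkv.cpp may differ): grouping the per-layer state values in a named struct makes the state_in == NULL initialization path read as assignments to names rather than offsets into a flat buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-layer state; field names are illustrative.
struct rwkv_layer_state {
    float ffn_xx;
    float att_xx;
    float att_aa;
    float att_bb;
    float att_pp;
};

std::vector<rwkv_layer_state> init_state(size_t n_layers) {
    std::vector<rwkv_layer_state> state(n_layers);
    for (rwkv_layer_state & s : state) {
        s.ffn_xx = 0.0f;
        s.att_xx = 0.0f;
        s.att_aa = 0.0f;
        s.att_bb = 0.0f;
        // RWKV initializes the running-max part of the attention
        // state to a very large negative value.
        s.att_pp = -1e30f;
    }
    return state;
}
```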
* Fnvalid
* chore: add ggml import in the head of rwkv.h
* feat: add cublas support
* feat: update rwkv.cpp
* feat: remove unused change
* chore: fix linux build issue
* chore: sync ggml and offload tensor to gpu
* chore: comment out tensors which cause errors on GPU
* chore: update comment and readme
* chore: update ggml to recent
* chore: add more performance test results
* chore: fix reading files larger than 2 GB
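The usual cause of this class of bug: on Windows, fseek/ftell take a 32-bit long, which overflows past 2 GB. A sketch of 64-bit-safe wrappers, assuming the common platform APIs (_fseeki64/_ftelli64 on MSVC, fseeko/ftello on POSIX); the names `file_seek`/`file_tell` are illustrative, not rwkv.cpp's:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// 64-bit-safe seek/tell wrappers; plain fseek/ftell use a 32-bit
// long on Windows and cannot address offsets beyond 2 GB.
#ifdef _WIN32
static int64_t file_tell(FILE * file) {
    return _ftelli64(file);
}
static int file_seek(FILE * file, int64_t offset) {
    return _fseeki64(file, offset, SEEK_SET);
}
#else
static int64_t file_tell(FILE * file) {
    return (int64_t) ftello(file);
}
static int file_seek(FILE * file, int64_t offset) {
    return fseeko(file, (off_t) offset, SEEK_SET);
}
#endif
```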
* chore: merge master
* chore: remove unused comment
* chore: fix for comments
* Update README.md
* Update rwkv.cpp
---------
Co-authored-by: Alex <saharNooby@users.noreply.github.com>
* Use types from typing for better compatibility with older Python versions
* Split last double end of line token as per BlinkDL's suggestion
* Fix MSVC warnings
* Drop Q4_2 support
* Update ggml
* Bump file format version for quantization changes
* Apply suggestions
* Update ggml
* Pack only rwkv.dll for Windows releases
Test executables are no longer packed.
* Move test code into a separate file
* Remove redundant zeroing
* Refactor chat script
* Remove Q4_3 support
* Add Q5_0, Q5_1, Q8_0 support
* Add a clearer message when loading Q4_3 models
* Remove Q4_1_O format
* Fix indentation in .gitmodules
* Simplify sanitizer matrix