Tim Dettmers
|
ba51d95d43
|
Added more extensive gemv tests; blocksize guard for gemv.
|
2023-07-11 05:55:49 -07:00 |
|
Tim Dettmers
|
5fab673442
|
Added fp32 compute type for gemv_4bit.
|
2023-07-09 21:06:01 -07:00 |
|
Tim Dettmers
|
4b88d69de7
|
Added arbitrary data types; fixed a bug for small matrices.
|
2023-07-09 12:04:09 -07:00 |
|
Tim Dettmers
|
02fd80cb81
|
Added bfloat16 quantizations and tests.
|
2023-07-04 19:58:31 -07:00 |
|
Tim Dettmers
|
dfe6900b94
|
Vectorized loads, conflict free NF4; 52 vs 172.
|
2023-07-04 15:20:10 -07:00 |
|
Tim Dettmers
|
f89ff93e26
|
Initial 4-bit naive batch size 1, 81 vs 185.
|
2023-07-03 18:45:38 -07:00 |
|
Tim Dettmers
|
1b8772a8f3
|
Added PagedLion and bf16 Lion.
|
2023-05-23 19:37:38 -07:00 |
|
Tim Dettmers
|
675baa79d2
|
Merge remote-tracking branch 'origin/main' into merge
|
2023-05-07 13:34:03 -07:00 |
|
Tim Dettmers
|
ec38ba95b0
|
Added paging.
|
2023-05-06 11:14:06 -07:00 |
|
Tim Dettmers
|
264a948539
|
4-bit draft; 128 vector load 240.
|
2023-05-02 16:15:38 -07:00 |
|
Tim Dettmers
|
77f15fdce9
|
Shared memory efficient 240.
|
2023-05-02 11:38:11 -07:00 |
|
Tim Dettmers
|
f9bfea8f23
|
Baseline for debugging.
|
2023-05-02 07:24:12 -07:00 |
|
Tim Dettmers
|
7bfa09d0fc
|
8x32 240 6 warps.
|
2023-05-01 16:38:09 -07:00 |
|
Tim Dettmers
|
3d4a2eadd3
|
16x16 240.
|
2023-05-01 16:23:45 -07:00 |
|
Tim Dettmers
|
7cc8ff4727
|
Warp specialization 362.
|
2023-05-01 08:21:12 -07:00 |
|
Tim Dettmers
|
30d03e0254
|
64 threads, high smem, 434.
|
2023-04-30 18:55:12 -07:00 |
|
Tim Dettmers
|
604bb3fb57
|
Slow non-vector 530.
|
2023-04-30 18:06:01 -07:00 |
|
Tim Dettmers
|
ad07d254fb
|
Slow tensor core solution.
|
2023-04-30 17:43:02 -07:00 |
|
Tim Dettmers
|
21723f796a
|
4-bit draft.
|
2023-04-29 21:52:47 -07:00 |
|
Tim Dettmers
|
cad839941b
|
Added bit template.
|
2023-04-28 22:10:42 -07:00 |
|
Tim Dettmers
|
f3e97ccbd2
|
New implementation for batch size 1.
|
2023-04-28 21:29:40 -07:00 |
|
Tim Dettmers
|
f6df4aef6a
|
Added fp16 and thread/item template.
|
2023-04-28 18:26:52 -07:00 |
|
Tim Dettmers
|
3aef78342a
|
Added template refactor.
|
2023-04-28 17:34:08 -07:00 |
|
Tim Dettmers
|
c1bfb210c5
|
First baseline kernel.
|
2023-04-28 17:19:02 -07:00 |
|
Tim Dettmers
|
9cab14a3ff
|
Added pipeline draft.
|
2023-04-27 15:12:49 -07:00 |
|
Tim Dettmers
|
d1c4c20568
|
Added non-cutlass template.
|
2023-04-27 15:11:26 -07:00 |
|
Tim Dettmers
|
0afc8e9e2f
|
Best attempt at cutlass3.
|
2023-04-26 17:12:34 -07:00 |
|
Tim Dettmers
|
84964db937
|
CUTLASS compiles.
|
2023-04-25 17:15:51 -07:00 |
|
Tim Dettmers
|
6e2544da25
|
Added cutlass example.
|
2023-04-25 16:15:44 -07:00 |
|
Tim Dettmers
|
6bfd7a405f
|
Initial template.
|
2023-04-25 16:13:43 -07:00 |
|
Tim Dettmers
|
7dc198feb7
|
Added 32-bit optimizer for bfloat16 gradients.
|
2023-04-17 18:01:49 -07:00 |
|
Tim Dettmers
|
7140c01405
|
Merge branch 'main' into fp8_merge
|
2023-04-12 11:44:39 -07:00 |
|
Tim Dettmers
|
64cc05920d
|
First draft of NF4.
|
2023-04-02 16:10:35 -07:00 |
|
Tim Dettmers
|
c4cfe4fbdd
|
Added bf16 Adam.
|
2023-04-01 10:33:03 -07:00 |
|
Tim Dettmers
|
8645d1f71c
|
Added normal quant.
|
2023-03-29 18:41:37 -07:00 |
|
Tim Dettmers
|
69810521d3
|
Some small changes.
|
2023-03-27 09:12:57 -07:00 |
|
Phil Wang
|
6c377b39b6
|
always pass beta2 into all the 1state functions
|
2023-03-10 13:00:59 -08:00 |
|
Phil Wang
|
c99b44f774
|
do the epsilon beta2 switcharoo within the cuda code, and not within the python class (so that the state dict still makes sense)
|
2023-03-10 08:57:59 -08:00 |
|
Phil Wang
|
8618bed001
|
swap the order in which momentum and parameters are updated in ops.cu
|
2023-03-10 08:39:06 -08:00 |
|
Phil Wang
|
cb4c3c8c66
|
do a bunch of typical bookkeeping before getting to main lion logic
|
2023-03-09 10:10:19 -08:00 |
|
Tim Dettmers
|
2489d819c5
|
Added more blocksizes for stochastic rounding; fixed dequant blocksize.
|
2023-02-14 13:55:17 -08:00 |
|
Tim Dettmers
|
3ac5840c03
|
Added fp4 quant/dequant and dequant optimizations.
|
2023-02-04 14:52:04 -08:00 |
|
Tim Dettmers
|
c91f592ad7
|
Merge branch 'main' into cleanup
|
2023-01-02 11:19:16 +01:00 |
|
Tim Dettmers
|
c059bd2848
|
Added additional blocksizes: {64, 128, 256}.
|
2022-11-20 14:18:15 -08:00 |
|
Tom Aarsen
|
b104ce3b62
|
Merge branch 'main' into cleanup
|
2022-11-17 15:22:29 +01:00 |
|
Tim Dettmers
|
6bc2b992be
|
Added blocksizes 2048, 1024, and 512 to blockwise quant.
|
2022-11-06 16:27:48 -08:00 |
|
Tom Aarsen
|
1eec77d34c
|
Remove trailing whitespace & ensure newline at EOF
|
2022-10-27 13:11:29 +02:00 |
|
Tim Dettmers
|
ee5b947e63
|
Fixed issue where Pascal was not displaying proper error.
|
2022-08-23 16:00:26 -07:00 |
|
Tim Dettmers
|
a6664de072
|
Enhanced error handling in CUDA SETUP failures.
|
2022-08-16 19:03:19 -07:00 |
|
Tim Dettmers
|
1ed2fa2f21
|
Removed storage() from get_ptr; added boilerplate for bias dequant_mm.
|
2022-08-16 10:56:17 -07:00 |
|