c88f97a9c8
drop support for gfx903 because depending on hipblaslt gums up too many things
2023-10-12 19:16:14 -05:00
arlo-phoenix
0b481bfcc2
Use workaround for ROCm wave32 recognition
Just forces __AMDGCN_WAVEFRONT_SIZE to 32.
Not correct (some GPUs don't support wave32), but works
on the supported GPUs. Can be disabled with DISABLE_WARP_32.
With this, blockwise quantize works, and with that NF4 is supported.
2023-08-08 18:50:26 +00:00
arlo-phoenix
d10197bc93
Add HIP to CUDA defines
Collected by hipifying all files and then comparing with the original
CUDA file.
2023-08-05 02:11:46 +02:00
Tim Dettmers
c82f51c0f7
Increased occupancy.
2023-07-19 16:08:37 -07:00
Tim Dettmers
7be5f2c7b3
Guard for prefetchAsync GPU capability. #470 #451 #477
2023-07-16 21:12:03 -07:00
Tim Dettmers
ba51d95d43
Added more extensive gemv tests; blocksize guard for gemv.
2023-07-11 05:55:49 -07:00
Tim Dettmers
a26a321e07
Removed debugging statement.
2023-07-10 14:34:19 -07:00
Tim Dettmers
306f6b2362
Fixed accidental deletion of limits in kernel.
2023-07-10 14:24:33 -07:00
Tim Dettmers
2221f4cee0
Fixed potential memory leak.
2023-07-10 13:57:44 -07:00
Tim Dettmers
1c774ecebb
Added ARCH guard for bfloat16 computations.
2023-07-10 09:53:23 -07:00
Tim Dettmers
5fab673442
Added fp32 compute type for gemv_4bit.
2023-07-09 21:06:01 -07:00
Tim Dettmers
94168d79d7
Added FP4 fast inference support.
2023-07-09 14:46:19 -07:00
Tim Dettmers
4b88d69de7
Added arbitrary data types; fixed a bug for small matrices.
2023-07-09 12:04:09 -07:00
Tim Dettmers
eefbf60270
Tuning optimization (float accumulation). 185 vs 50.
2023-07-08 16:31:58 -07:00
Tim Dettmers
7e49b5b938
Added warp_shuffle indexing 185 vs 54.
2023-07-08 14:27:12 -07:00
Tim Dettmers
02fd80cb81
Added bfloat16 quantizations and tests.
2023-07-04 19:58:31 -07:00
Tim Dettmers
dfe6900b94
Vectorized loads, conflict free NF4; 52 vs 172.
2023-07-04 15:20:10 -07:00
Tim Dettmers
f89ff93e26
Initial 4-bit naive batch size 1, 81 vs 185.
2023-07-03 18:45:38 -07:00
Tim Dettmers
e54d2730fc
Added debugging functions.
2023-05-30 20:42:21 -07:00
Tim Dettmers
b7f04e2a20
Added lookup table.
2023-05-30 20:07:05 -07:00
Tim Dettmers
ac5550a023
Added changes for deployment.
2023-05-30 19:06:59 -07:00
Tim Dettmers
1b8772a8f3
Added PagedLion and bf16 Lion.
2023-05-23 19:37:38 -07:00
Tim Dettmers
675baa79d2
Merge remote-tracking branch 'origin/main' into merge
2023-05-07 13:34:03 -07:00
Tim Dettmers
ec38ba95b0
Added paging.
2023-05-06 11:14:06 -07:00
Tim Dettmers
264a948539
4-bit draft; 128 vector load 240.
2023-05-02 16:15:38 -07:00
Tim Dettmers
869b7e83b5
Warp multi-specialization 240.
2023-05-02 12:10:32 -07:00
Tim Dettmers
77f15fdce9
Shared memory efficient 240.
2023-05-02 11:38:11 -07:00
Tim Dettmers
89cccd8196
A tile multi-tiling.
2023-05-02 09:40:31 -07:00
Tim Dettmers
4decb3cc68
Removed unnecessary sync.
2023-05-02 09:38:14 -07:00
Tim Dettmers
394749db71
Correct implementation 240.
2023-05-02 08:58:59 -07:00
Tim Dettmers
9192c9de64
Tighter and scaled error analysis.
2023-05-02 07:50:32 -07:00
Tim Dettmers
f9bfea8f23
Baseline for debugging.
2023-05-02 07:24:12 -07:00
Tim Dettmers
7bfa09d0fc
8x32 240 6 warps.
2023-05-01 16:38:09 -07:00
Tim Dettmers
3d4a2eadd3
16x16 240.
2023-05-01 16:23:45 -07:00
Tim Dettmers
7cc8ff4727
Warp specialization 362.
2023-05-01 08:21:12 -07:00
Tim Dettmers
cabcd9b9d5
Halved shared memory 466.
2023-04-30 19:12:42 -07:00
Tim Dettmers
30d03e0254
64 threads, high smem, 434.
2023-04-30 18:55:12 -07:00
Tim Dettmers
e01d4e033d
Fixed bank conflicts in non-vector load 422.
2023-04-30 18:28:52 -07:00
Tim Dettmers
c35ed09b66
Double frag 440.
2023-04-30 18:19:30 -07:00
Tim Dettmers
604bb3fb57
Slow non-vector 530.
2023-04-30 18:06:01 -07:00
Tim Dettmers
ad07d254fb
Slow tensor core solution.
2023-04-30 17:43:02 -07:00
Tim Dettmers
21723f796a
4-bit draft.
2023-04-29 21:52:47 -07:00
Tim Dettmers
cad839941b
Added bit template.
2023-04-28 22:10:42 -07:00
Tim Dettmers
f3e97ccbd2
New implementation for batch size 1.
2023-04-28 21:29:40 -07:00
Tim Dettmers
f6df4aef6a
Added fp16 and thread/item template.
2023-04-28 18:26:52 -07:00
Tim Dettmers
3aef78342a
Added template refactor.
2023-04-28 17:34:08 -07:00
Tim Dettmers
c1bfb210c5
First baseline kernel.
2023-04-28 17:19:02 -07:00
Tim Dettmers
9cab14a3ff
Added pipeline draft.
2023-04-27 15:12:49 -07:00
Tim Dettmers
d1c4c20568
Added non-cutlass template.
2023-04-27 15:11:26 -07:00
Tim Dettmers
0afc8e9e2f
Best attempt at cutlass3.
2023-04-26 17:12:34 -07:00