
Upstreamchanges to base #1

Merged
merged 17 commits into base on Apr 16, 2023

Conversation

YellowRoseCx
Owner

No description provided.

prusnak and others added 16 commits April 14, 2023 15:37
after LostRuins#545 we do not need torch, tqdm and requests in the dependencies
* GGML map ops proof of concept (a usage sketch follows this commit list).

* Various cleanups.

Add handling for task setting.

Add handling for ggml_compute_backward.

Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32

Fix compiler warnings related to casting function pointers and `void *`

Reorder functions and definitions based on the GGML op number.

Use typedefs for map op function pointer types.

* Fix position of map ops cases in ggml_compute_forward
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt

* Add multiline mode, update text input.

* bugfix

* update implementation

* typos

* Change --multiline implementation to be toggled by EOF.

* bugfix

* default multiline mode

* add more configs

* update formating

* update formatting

* apply suggestions
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
# Conflicts:
#	.devops/full.Dockerfile
#	Makefile
#	flake.nix
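A minimal usage sketch of the map ops described in the commit list above, assuming the signatures ggml shipped around this time (ggml_unary_op_f32_t takes an element count, a destination pointer, and a source pointer); scale_by_two and the graph-building calls reflect that era's API and are illustrative, not the commit's own code:

    #include "ggml.h"
    #include <string.h>

    // Matches ggml_unary_op_f32_t: n elements, write dst from src.
    static void scale_by_two(const int n, float * dst, const float * src) {
        for (int i = 0; i < n; ++i) {
            dst[i] = 2.0f * src[i];
        }
    }

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16*1024*1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
        const float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        memcpy(x->data, data, sizeof(data));

        // The mapped op becomes an ordinary node in the compute graph.
        struct ggml_tensor * y = ggml_map_unary_f32(ctx, x, scale_by_two);

        struct ggml_cgraph gf = ggml_build_forward(y);
        ggml_graph_compute(ctx, &gf); // y->data now holds {2, 4, 6, 8}

        ggml_free(ctx);
        return 0;
    }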
YellowRoseCx changed the title from "Upstreamchanges" to "Upstreamchanges to base" on Apr 15, 2023
YellowRoseCx merged commit 02d1e12 into base on Apr 16, 2023
YellowRoseCx added a commit that referenced this pull request Jun 29, 2023
* kquants_iter for hipblas and add gfx803
* Update CMakeLists.txt with hipblas kquants_iter and DMMV_F16
* remove dmmv_f16 for now
YellowRoseCx added a commit that referenced this pull request Aug 25, 2023
commit 3416c98
Merge: 5eb17f0 4c4e435
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Aug 25 13:46:56 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 5eb17f0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Aug 25 13:38:21 2023 -0500

    ROCm Port update

    * use hipblas based on cublas
    * Update Makefile for the Cuda kernels
    * Expand arch list and make it overrideable
    * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
    * add hipBLAS to README
    * new build arg LLAMA_CUDA_MMQ_Y
    * fix half2 decomposition
    * Add intrinsics polyfills for AMD (a sketch follows this commit)
    * AMD assembly optimized __dp4a
    * Allow overriding CC_TURING
    * use "ROCm" instead of "CUDA"
    * ignore all build dirs
    * Add Dockerfiles
    * fix llama-bench
    * fix -nommq help for non CUDA/HIP

    ---------

    Co-Authored-By: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
    Co-Authored-By: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-Authored-By: funnbot <22226942+funnbot@users.noreply.github.com>
    Co-Authored-By: Engininja2 <139037756+Engininja2@users.noreply.github.com>
    Co-Authored-By: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
    Co-Authored-By: jammm <2500920+jammm@users.noreply.github.com>
    Co-Authored-By: jdecourval <7315817+jdecourval@users.noreply.github.com>
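
The "intrinsics polyfills" and "__dp4a" bullets above concern CUDA device intrinsics that HIP does not provide, so the port supplies its own definitions. A sketch of the idea in HIP C++, with the architecture guards and dot-product builtin as assumptions (the real tree's exact guards, and the gfx900 path that was commented out, differ):

    typedef char int8x4_t __attribute__((ext_vector_type(4)));

    // Polyfill for NVIDIA's __dp4a(a, b, c): c + dot(int8x4(a), int8x4(b)).
    static __device__ __forceinline__ int __dp4a(const int a, const int b, int c) {
    #if defined(__gfx906__) || defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx1030__)
        c = __builtin_amdgcn_sdot4(a, b, c, false); // maps to one v_dot4_i32_i8
    #else
        // Generic fallback: unpack the four int8 lanes and accumulate.
        const int8x4_t va = reinterpret_cast<const int8x4_t &>(a);
        const int8x4_t vb = reinterpret_cast<const int8x4_t &>(b);
        c += va[0]*vb[0] + va[1]*vb[1] + va[2]*vb[2] + va[3]*vb[3];
    #endif
        return c;
    }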

commit b34f4bd
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Aug 19 17:12:52 2023 -0500

    Update README.md

commit 7d11961
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Aug 14 23:03:12 2023 -0500

    remove force DMMV

commit cd61aa0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Aug 12 17:24:31 2023 -0500

    restore main_gpu parameter

commit 4a042f3
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Aug 12 10:51:46 2023 +0300

    gfx1100 support

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
    Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>

commit 8913bc6
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 10:16:02 2023 +0300

    Allow overriding CC_TURING

commit e77a4c3
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 10:00:07 2023 +0300

    Merge 'origin/master' into hipblas

commit cc4c4e3
Author: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Date:   Fri Aug 11 09:43:14 2023 +0300

    New __dp4a assembly

    Now compatible with gfx900 and faster as well.

commit 1a03b70
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 09:30:28 2023 +0300

    Undo mess

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>

commit 4366ff9
Author: DannyDaemonic <DannyDaemonic@gmail.com>
Date:   Thu Aug 10 13:11:36 2023 -0700

    Handle `ENABLE_VIRTUAL_TERMINAL_PROCESSING` more gracefully on earlier versions of Windows.
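
    ENABLE_VIRTUAL_TERMINAL_PROCESSING is the Win32 console-mode flag that
    enables ANSI escape handling; it only exists on Windows 10 TH2 and later,
    so SetConsoleMode fails with it on older systems. A minimal sketch of the
    graceful approach (not the commit's actual code):

    #include <windows.h>

    // Try to enable ANSI escape sequences; silently fall back on old Windows.
    static void try_enable_vt(void) {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode = 0;
        if (h == INVALID_HANDLE_VALUE || !GetConsoleMode(h, &mode)) {
            return; // output is redirected or not a console
        }
        // Fails on pre-Windows-10 consoles; ignoring the failure keeps
        // plain (uncolored) output working instead of erroring out.
        SetConsoleMode(h, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING);
    }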

commit 811ff85
Author: Christian Demsar <crasm@git.vczf.us>
Date:   Thu Aug 10 10:28:27 2023 -0400

    Add --n-predict -2 for stopping generation on full context (ggerganov#2565)
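
    A sketch of what the -2 sentinel means inside the generation loop, with
    illustrative names (params, n_past, embd, n_ctx) rather than the commit's
    exact code:

    // --n-predict: -1 generates forever, -2 stops once the context is full
    if (params.n_predict == -2 && n_past + (int) embd.size() >= n_ctx) {
        break; // context window full: stop instead of shifting the context
    }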

commit 37c9717
Author: Martin Krasser <krasserm@googlemail.com>
Date:   Thu Aug 10 12:16:38 2023 +0200

    Fix grammar-based sampling issue in server (ggerganov#2566)

commit d18ecd5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Aug 10 13:19:41 2023 -0500

    make mmq gen faster for amd

commit 243894a
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Aug 10 12:14:40 2023 +0300

    ws fix

commit ac2f14d
Author: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Date:   Thu Aug 10 12:11:27 2023 +0300

    AMD assembly optimized __dp4a

    Doesn't seem to work for gfx900, so commented out.

commit 9dba0c9
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Aug 10 12:09:28 2023 +0300

    Fix merge

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

commit f570b5c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 22:11:20 2023 -0500

    Revert "revert cuda changes as they are bugggy"

    This reverts commit 1541bf8.

commit 1541bf8
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Wed Aug 9 22:36:41 2023 +0800

    revert cuda changes as they are bugggy

commit bacc202
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 20:37:17 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit b7cb4cf
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 20:00:52 2023 -0500

    additional fixes

commit fadae72
Merge: 518eb2a 8f8ab6c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:45:50 2023 -0500

    Merge branch 'hipblas' into develop4Main

commit 518eb2a
Merge: bda0215 cae6a84
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:32:10 2023 -0500

    Merge remote-tracking branch 'upstream/concedo' into develop2Main

commit bda0215
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:17:54 2023 -0500

    update makefile to multisystem path

commit 8f8ab6c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:05:03 2023 -0500

    hipLDFLAG Path change Unix to multisystem in Makefile

    changed the hardcoded Linux hipBLAS LD path from -L/opt/rocm/lib to the defined ROCM_PATH variable, so the build stays flexible with ROCm installs on non-Linux OSes

commit 610ba4c
Merge: 4024f91 25d43e0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 23:54:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 4024f91
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 01:56:44 2023 +0300

    Add intrinsics polyfills for AMD

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
    Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>

commit ab62128
Merge: d91456a f5bfea0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 00:37:01 2023 +0300

    Merge 'origin/master' into hipblas

commit ee9fa2a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 2 01:53:58 2023 -0500

    Update Makefile

commit d91456a
Author: ardfork <134447697+ardfork@users.noreply.github.com>
Date:   Mon Jul 31 20:35:00 2023 +0300

    fix half2 decomposition

commit c1cb70d
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 31 19:56:44 2023 +0300

    new build arg LLAMA_CUDA_MMQ_Y

commit c1664a0
Merge: 4336231 0728c5a
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 31 19:32:27 2023 +0300

    Merge 'origin/master' into hipblas

commit 848558d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 30 20:02:52 2023 -0500

    import vars logic fix

commit b650b84
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 30 00:21:36 2023 -0500

    Update easy_KCPP-ROCm_install.sh

commit 8573a67
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 21:31:12 2023 -0500

    remove duplicate code and fix typo

    remove duplicate tooltip

commit 430986e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 21:07:34 2023 -0500

    hide "missing" if all are built

    move tooltip functions to helper functions section. hides the string "Missing: ..." from showing if all backends are available (guarded by an `if len(runopts)==6 else` expression)

commit dd0db72
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 20:52:31 2023 -0500

    hide "missing" if all are built

    move tooltip functions to helper functions section. hides the string "Missing: ..." from showing if all backends are available

commit 43fffb6
Merge: 0ed65a4 b40550c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 19:13:15 2023 -0500

    Merge branch 'concedo'

commit 0ed65a4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 18:34:21 2023 -0500

    Hide unavailable backends & Add tooltip over backend count

    Hides unavailable backends from the user; if the program is launched without any backends built, it shows an error message stating that no backends were found and that they should be built with the 'make' command

    Add tooltip when hovering over backend count label

    hovering over the new label that shows the backend count explains what the numbers are and shows which backends are not available or not built

commit 2a26398
Merge: cee2e9d 31486eb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 15:16:33 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 4336231
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 18:35:56 2023 +0300

    add hipBLAS to README

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>

commit f8e3fc6
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 14:16:46 2023 +0300

    rocblas init stuff

commit d2ade63
Merge: cde52d6 8a88e58
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 12:59:48 2023 +0300

    Merge 'origin/master' into hipblas

commit cee2e9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 26 23:36:55 2023 -0500

    Only Show Available Backends in GUI

    Hides unavailable backends from the user; if the program is launched without any backends built, it shows an error message stating that no backends were found and that they should be built with the 'make' command

commit 7863610
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 26 13:27:22 2023 -0500

    Update easy_KCPP-ROCm_install.sh

commit 731cd6e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jul 25 22:39:50 2023 -0500

    Create easy_rocm_install.sh

commit f154685
Merge: cbdc1f3 94e0a06
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jul 25 22:25:10 2023 -0500

    Merge branch 'concedo_experimentalMAIN'

commit cbdc1f3
Merge: 5b838d4 9731682
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 16:53:21 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit cde52d6
Merge: 8e8054a 84e09a7
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 24 12:22:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 8e8054a
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 24 12:20:49 2023 +0300

    Add rocblas to build files

commit 1f6294d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:52:01 2023 -0500

    Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)

    * initialize rocblas
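
    rocblas_initialize() eagerly loads rocBLAS's GEMM kernels for the
    installed devices, so calling it once at startup sidesteps the
    lazy-initialization race that appears when GPUs of different AMD
    architectures run concurrently. A hedged sketch (the header path varies
    across ROCm releases):

    #include <rocblas/rocblas.h>

    int main(void) {
        // Must run before any per-GPU thread issues its first GEMM.
        rocblas_initialize();
        // ... create HIP streams / hipblas handles per GPU afterwards ...
        return 0;
    }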

commit 5b838d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:10:35 2023 -0500

    amd multigpu full layer offload w/o vram scratch

commit 9bfb2fd
Merge: b379f9d 66328fc
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:07:44 2023 -0500

    Merge branch 'concedo_experimental'

commit b379f9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:07:00 2023 -0500

    Revert "amd multigpu full layer offload w/o vram scratch"

    This reverts commit 9adfc8e.

commit 9adfc8e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 02:56:40 2023 -0500

    amd multigpu full layer offload w/o vram scratch

commit 05c792e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 00:18:48 2023 -0500

    initialize rocblas

commit ade68d0
Merge: 521ad6b 56995ca
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 23 20:25:05 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 521ad6b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jul 20 21:42:33 2023 -0500

    lazy import_var error handling for saves

commit 9553e52
Merge: cac6650 f036109
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jul 20 19:59:41 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit cac6650
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 17 23:05:02 2023 -0500

    Makefile fix! Allows hip/clblast build together

commit 3db70b5
Merge: 2ec4466 7568d1a
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 18 01:54:17 2023 +0300

    Merge 'origin/master' into hipblas

commit f208670
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 14 02:56:03 2023 -0500

    improve error handling with gpu names

commit 860e738
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 14 00:33:03 2023 -0500

    Show GPU names in GUI, Only show GPUs that exist

    changed the pre-set 1,2,3 and 1,2,3,all settings that the GPU selector had, replacing them with a function that grabs the GPU names and uses those names as the values for the selector boxes

commit 2ec4466
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jul 13 13:44:02 2023 +0300

    Update build flags.

    GGML_CUDA_DMMV_Y is now GGML_CUDA_MMV_Y
    so update your build instructions.

    GGML_CUDA_FORCE_DMMV is always enabled.

    ---------

    Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>

commit cd36b18
Merge: afcb8fe 1cbf561
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jul 13 13:03:01 2023 +0300

    Merge 'origin/master' into hipblas

commit ac7ebc3
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 18:32:18 2023 -0500

    add hipBLAS name scheme to GUI and update README

commit 7f85cc5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 17:35:54 2023 -0500

    update makefile and ggml.c

commit 6ca3499
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 15:43:45 2023 -0500

    ggml.c fix

commit 770e674
Merge: 2b289cd 5941514
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 15:24:36 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 2b289cd
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:30:00 2023 -0500

    Update c-cpp.yml

commit 5dae95a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:28:51 2023 -0500

    Update c-cpp.yml

commit b37cd73
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:27:04 2023 -0500

    Create c-cpp.yml to test Actions

commit afcb8fe
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 18:09:27 2023 +0300

    Add new config option

commit 8c2c497
Merge: e610466 2347463
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 17:53:54 2023 +0300

    Merge 'origin/master' into hipblas

commit e610466
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 17:53:14 2023 +0300

    Expand arch list and make it overrideable

commit 80e4e54
Merge: 7735c5a 1d16309
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 10 02:09:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 8432e9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:55:30 2023 -0500

    Update Makefile

commit b58c189
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:20:00 2023 -0500

    Add multi-gpu CuBLAS support to new GUI

commit 0c1c71b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 07:56:57 2023 -0500

    Update Makefile

commit f864f60
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sat Jul 8 00:25:15 2023 +0200

    CUDA: add __restrict__ to mul mat vec kernels (ggerganov#2140)

commit 4539bc2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 01:36:14 2023 -0500

    update makefile for changes

commit 912e31e
Merge: 74e2703 ddaa4f2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 7 23:15:37 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 74e2703
Merge: cf65429 f9108ba
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 5 15:16:49 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit 7735c5a
Merge: c3e3733 7ee76e4
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 4 17:09:16 2023 +0300

    Merge 'origin/master' into hipblas

commit cf65429
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:56:40 2023 -0500

    print cuda or opencl based on what's used

commit 72c16d2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:45:39 2023 -0500

    Revert "fix my mistake that broke other arches"

    This reverts commit 777aed5.

commit 777aed5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 15:53:32 2023 -0500

    fix my mistake that broke other arches

commit 27780a9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:03:27 2023 -0500

    rocm fixes

commit f52c7d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:02:58 2023 -0500

    Revert "rocm fixes"

    This reverts commit 2fe9927.

commit 2fe9927
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:58:21 2023 -0500

    rocm fixes

commit efe7560
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:43 2023 -0500

    Revert "move HIPBLAS definitions into ggml-cuda.h"

    This reverts commit bf49a93.

commit 4fc0181
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:36 2023 -0500

    Revert "move hipblas definitions to header files"

    This reverts commit 2741ffb.

commit 89eb576
Merge: 2741ffb 3d2907d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 14:44:13 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit c3e3733
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jul 2 15:51:31 2023 +0300

    ROCm fixes

commit 15db19a
Merge: 04419f1 46088f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jul 2 15:39:57 2023 +0300

    Merge 'origin/master' into hipblas

commit 2741ffb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 17:07:42 2023 -0500

    move hipblas definitions to header files

commit bf49a93
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 16:38:50 2023 -0500

    move HIPBLAS definitions into ggml-cuda.h

commit 540f4e0
Merge: 2c3b46f eda663f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 14:58:32 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 2c3b46f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:43:43 2023 -0500

    changes to fix build

commit c9e1103
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:20:07 2023 -0500

    Update ggml_v2-cuda-legacy.cu for ROCM

commit b858fc5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 17:49:39 2023 -0500

    changes to work with upstream

commit 69a0c25
Merge: 096f0b0 1347d3a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 16:59:06 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 04419f1
Merge: bb16eff d3494bb
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 28 23:30:10 2023 +0300

    Merge 'origin/master' into hipblas

commit bb16eff
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:10 2023 -0500

    headers fix; add kquants_iter for hipblas and add gfx803 (#1)

    * kquants_iter for hipblas and add gfx803
    * Update CMakeLists.txt with hipblas kquants_iter and DMMV_F16
    * remove dmmv_f16 for now

commit 096f0b0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:02 2023 -0500

    revert unnecessary hipblas conditionals

commit d81e81a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 14:48:23 2023 -0500

    Update Makefile hipblas nvcc correction

commit c8ae945
Merge: c1e5c83 0be54f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 27 10:50:37 2023 +0300

    Merge 'origin/master' into hipblas

commit 2579ecf
Merge: abed427 d2034ce
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 25 17:50:04 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit c1e5c83
Merge: 35a6031 447ccbe
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 25 21:40:05 2023 +0300

    Merge 'origin/master' into hipblas

commit 35a6031
Merge: df7346c 66a2555
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 25 10:57:48 2023 +0300

    Merge 'origin/master' into hipblas

commit abed427
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 19:16:30 2023 -0500

    reorganize If statements to include proper headers

commit 06c3bf0
Merge: ea6d320 8342fe8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 16:57:20 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit ea6d320
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jun 23 01:53:28 2023 -0500

    Update README.md

commit 4d56ad8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 16:19:43 2023 -0500

    Update README.md

commit 21f9308
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 15:42:05 2023 -0500

    kquants_iter for hipblas and add gfx803

commit df7346c
Merge: 5dd2fbe 7487137
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jun 22 20:51:09 2023 +0300

    Merge 'origin/master' into hipblas

commit b6ff890
Merge: eb094f0 e6ddb15
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 12:42:09 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit eb094f0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 23:59:18 2023 -0500

    lowvram parameter description

commit 3a5dfeb
Merge: 665cc11 b1f00fa
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 16:53:03 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit 665cc11
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 01:13:19 2023 -0500

    add lowvram parameter

commit 222cbbb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 19:03:28 2023 -0500

    add additional hipblas conditions for cublas

commit e1f9581
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 16:51:59 2023 -0500

    Add hip def for cuda v2

commit 3bff5c0
Merge: a7e74b3 266d47a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 13:38:06 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit a7e74b3
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:04:18 2023 -0500

    Update README.md

commit 5e99b3c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:03:42 2023 -0500

    Update Makefile

commit 9190b17
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 21:47:10 2023 -0500

    Update README.md

commit 5dd2fbe
Merge: 67e229b 20568fe
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 20 01:23:12 2023 +0300

    Merge 'origin/master' into hipblas

commit 2780ea2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 15:48:00 2023 -0500

    Update Makefile

commit 04a3e64
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:33:39 2023 -0500

    remove extra line

commit cccbca9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:17 2023 -0500

    attempt adding ROCM hipblas

commit a44a1d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:01 2023 -0500

    attempt adding ROCM hipblas

commit b088184
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:30:54 2023 -0500

    attempt adding ROCM hipblas

commit 67e229b
Merge: 6f7c156 b241649
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 18 00:36:54 2023 +0300

    Merge 'origin/master' into hipblas

commit 6f7c156
Merge: 61df8e9 fc45a81
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jun 17 16:53:22 2023 +0300

    Merge 'origin/master' into hipblas

commit 61df8e9
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 14 22:46:10 2023 +0300

    add cudaMemset

commit a836529
Merge: 85f902d 254a7a7
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 14 22:41:55 2023 +0300

    Merge 'origin/master' into hipblas

commit 85f902d
Merge: 4362e80 b50b570
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jun 8 10:50:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 4362e80
Merge: fa5b3d7 17366df
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 23:14:40 2023 +0300

    Merge 'origin/master' into hipblas

commit fa5b3d7
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:47:00 2023 +0300

    fix makefile.

commit 1ba4ce4
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:41:08 2023 +0300

    Revert "warp size fixes"

    It seems like 32 is faster for me, at least, and it won't cause so many conflicts.

    This reverts commit 5d6eb72.

commit 5d6eb72
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:32:41 2023 +0300

    warp size fixes

commit 33091a9
Merge: 9fdaa1d 2d43387
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 16:19:23 2023 +0300

    Merge 'origin/master' into hipblas

commit 9fdaa1d
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 27 19:17:53 2023 +0300

    Add more defs

    For forward compatibility ggerganov#1607

commit a4648c1
Merge: 4c8b3fb 0ecb1bb
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 27 18:22:39 2023 +0300

    Merge 'origin/master' into hipblas

commit 4c8b3fb
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 01:08:53 2023 +0300

    add configurable vars

commit 30d921a
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 01:03:56 2023 +0300

    and makefile

commit a593a4f
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 00:55:28 2023 +0300

    Add missing parameters

commit 174bf6a
Merge: f80ce7a 1fcdcc2
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 00:44:23 2023 +0300

    Merge 'origin/master' into hipblas

commit f80ce7a
Merge: 600ace3 ac7876a
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 25 00:02:50 2023 +0300

    Merge branch 'origin/master' into hipblas

commit 600ace3
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 23:42:20 2023 +0300

    update warp size

commit b19fefe
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 23:28:08 2023 +0300

    Forwardcompat

commit c66115b
Merge: a0b2d5f b8ee340
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 18:29:31 2023 +0300

    Merge 'origin/master' into hipblas

commit a0b2d5f
Merge: 8bab456 2a5ee02
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue May 16 17:08:29 2023 +0300

    Merge 'origin/master' into hipblas

commit 8bab456
Merge: 2956630 b5c9295
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon May 15 00:01:12 2023 +0300

    Merge 'origin/master' into hipblas

commit 2956630
Merge: 0fe6384 f048af0
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 13 13:12:52 2023 +0300

    Merge 'origin/master' into hipblas

commit 0fe6384
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 12 17:22:11 2023 +0300

    fix makefile

commit 605560d
Merge: 127f68e 089b1c9
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 12 16:12:53 2023 +0300

    Merge 'origin/master' into hipblas

commit 127f68e
Merge: 070cbcc b608b55
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 11 20:21:27 2023 +0300

    Merge 'origin/master' into hipblas

commit 070cbcc
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 18:10:56 2023 +0300

    occupancy function

commit a3296d5
Merge: 0aefa6a e129551
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 18:06:04 2023 +0300

    Merge 'origin/master' into hipblas

commit 0aefa6a
Merge: baeb482 1b0fd45
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 12:24:41 2023 +0300

    Merge 'origin/master' into hipblas

commit baeb482
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 12:24:12 2023 +0300

    Revert to default copy

commit 289073a
Merge: 1107194 173d0e6
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 6 19:59:41 2023 +0300

    Merge 'origin/master' into hipblas

commit 1107194
Merge: 04c0d48 a3b85b2
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 6 00:38:20 2023 +0300

    Merge 'origin/master' into hipblas

commit 04c0d48
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 4 12:31:16 2023 +0300

    Move all HIP stuff to ggml-cuda.cu

commit d83cfba
Merge: b67cc50 799fdc1
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 4 11:31:16 2023 +0300

    Merge 'origin/master' into hipblas

commit b67cc50
Merge: fcbc262 e216aa0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed May 3 15:04:51 2023 +0300

    Merge 'origin/master' into hipblas

commit fcbc262
Merge: c73def1 f4cef87
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon May 1 22:45:29 2023 +0300

    Merge 'origin/master' into hipblas

commit c73def1
Merge: d8ea75e f0d70f1
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Apr 30 18:40:42 2023 +0300

    Merge 'origin/master' into hipblas

commit d8ea75e
Merge: d194586 334637e
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Apr 29 11:25:51 2023 +0300

    Merge 'origin/master' into hipblas

commit d194586
Merge: 2ab9d11 7f15c5c
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 23:03:52 2023 +0300

    Merge 'origin/master' into hipblas

commit 2ab9d11
Merge: 3b4a531 04aaae1
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 16:30:05 2023 +0300

    Merge 'origin/master' into hipblas

commit 3b4a531
Merge: a1caa48 0b2da20
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 10:08:41 2023 +0300

    Merge 'origin/master' into hipblas

commit a1caa48
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 10:08:21 2023 +0300

    add more cuda defines

    This is so 'slaren/cuda-f16f32' would merge.

commit ecc0565
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 01:58:27 2023 +0300

    only the .cu file needs to be compiled as device code

commit ef51e9e
Merge: d571d16 4afcc37
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Apr 26 12:46:26 2023 +0300

    Merge branch 'ggerganov:master' into hipblas

commit d571d16
Merge: 608aa33 dd0eabc
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Apr 25 21:15:33 2023 +0300

    Merge 'origin/master' into hipblas

commit 608aa33
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Apr 25 21:15:04 2023 +0300

    change default GPU arch to match CMake

commit 3a004b2
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Apr 24 02:24:54 2023 +0300

    add rpath

commit db7a012
Merge: 3677235 284685f
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Apr 23 21:49:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 3677235
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Apr 22 23:28:00 2023 +0300

    More build file changes

commit d3e1984
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 21 03:32:06 2023 +0300

    add rpath

commit 0e005f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 21 02:13:00 2023 +0300

    Build file changes

    Now HIP Clang is not required; the CMake scripts will configure the
    needed compiler, which can be the system clang++. Other code can
    still use GCC, but CMake will force clang to do the linking.

commit 54a63c1
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Apr 20 22:19:22 2023 +0300

    Update Makefile for the Cuda kernels

commit 0fd8363
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Apr 20 02:04:00 2023 +0300

    use hipblas based on cublas
YellowRoseCx added a commit that referenced this pull request Aug 29, 2023
* koboldcpp-ROCm Port

commit 3416c98
Merge: 5eb17f0 4c4e435
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Aug 25 13:46:56 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 5eb17f0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Aug 25 13:38:21 2023 -0500

    ROCm Port update

    * use hipblas based on cublas
    * Update Makefile for the Cuda kernels
    * Expand arch list and make it overrideable
    * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
    * add hipBLAS to README
    * new build arg LLAMA_CUDA_MMQ_Y
    * fix half2 decomposition
    * Add intrinsics polyfills for AMD
    * AMD assembly optimized __dp4a
    * Allow overriding CC_TURING
    * use "ROCm" instead of "CUDA"
    * ignore all build dirs
    * Add Dockerfiles
    * fix llama-bench
    * fix -nommq help for non CUDA/HIP

    ---------

    Co-Authored-By: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
    Co-Authored-By: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-Authored-By: funnbot <22226942+funnbot@users.noreply.github.com>
    Co-Authored-By: Engininja2 <139037756+Engininja2@users.noreply.github.com>
    Co-Authored-By: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
    Co-Authored-By: jammm <2500920+jammm@users.noreply.github.com>
    Co-Authored-By: jdecourval <7315817+jdecourval@users.noreply.github.com>

commit b34f4bd
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Aug 19 17:12:52 2023 -0500

    Update README.md

commit 7d11961
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Aug 14 23:03:12 2023 -0500

    remove force DMMV

commit cd61aa0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Aug 12 17:24:31 2023 -0500

    restore main_gpu parameter

commit 4a042f3
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Aug 12 10:51:46 2023 +0300

    gfx1100 support

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
    Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>

commit 8913bc6
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 10:16:02 2023 +0300

    Allow overriding CC_TURING

commit e77a4c3
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 10:00:07 2023 +0300

    Merge 'origin/master' into hipblas

commit cc4c4e3
Author: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Date:   Fri Aug 11 09:43:14 2023 +0300

    New __dp4a assembly

    Now compatible with gfx900 and faster as well.

commit 1a03b70
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Aug 11 09:30:28 2023 +0300

    Undo mess

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>

commit 4366ff9
Author: DannyDaemonic <DannyDaemonic@gmail.com>
Date:   Thu Aug 10 13:11:36 2023 -0700

    Handle `ENABLE_VIRTUAL_TERMINAL_PROCESSING` more gracefully on earlier versions of Windows.

commit 811ff85
Author: Christian Demsar <crasm@git.vczf.us>
Date:   Thu Aug 10 10:28:27 2023 -0400

    Add --n-predict -2 for stopping generation on full context (ggerganov#2565)

commit 37c9717
Author: Martin Krasser <krasserm@googlemail.com>
Date:   Thu Aug 10 12:16:38 2023 +0200

    Fix grammar-based sampling issue in server (ggerganov#2566)

commit d18ecd5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Aug 10 13:19:41 2023 -0500

    make mmq gen faster for amd

commit 243894a
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Aug 10 12:14:40 2023 +0300

    ws fix

commit ac2f14d
Author: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Date:   Thu Aug 10 12:11:27 2023 +0300

    AMD assembly optimized __dp4a

    Doesn't seem to work for gfx900, so commented out.

commit 9dba0c9
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Aug 10 12:09:28 2023 +0300

    Fix merge

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>

commit f570b5c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 22:11:20 2023 -0500

    Revert "revert cuda changes as they are bugggy"

    This reverts commit 1541bf8.

commit 1541bf8
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Wed Aug 9 22:36:41 2023 +0800

    revert cuda changes as they are bugggy

commit bacc202
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 20:37:17 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit b7cb4cf
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 20:00:52 2023 -0500

    additional fixes

commit fadae72
Merge: 518eb2a 8f8ab6c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:45:50 2023 -0500

    Merge branch 'hipblas' into develop4Main

commit 518eb2a
Merge: bda0215 cae6a84
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:32:10 2023 -0500

    Merge remote-tracking branch 'upstream/concedo' into develop2Main

commit bda0215
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:17:54 2023 -0500

    update makefile to multisystem path

commit 8f8ab6c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 9 18:05:03 2023 -0500

    hipLDFLAG Path change Unix to multisystem in Makefile

    changed the hardcoded linux distro hipblas LD path from -L/opt/rocm/lib to use the defined ROCM_PATH variable to be flexible with ROCm on non-Linux OS

commit 610ba4c
Merge: 4024f91 25d43e0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 23:54:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 4024f91
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 01:56:44 2023 +0300

    Add intrinsics polyfills for AMD

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
    Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
    Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>

commit ab62128
Merge: d91456a f5bfea0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Aug 9 00:37:01 2023 +0300

    Merge 'origin/master' into hipblas

commit ee9fa2a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Aug 2 01:53:58 2023 -0500

    Update Makefile

commit d91456a
Author: ardfork <134447697+ardfork@users.noreply.github.com>
Date:   Mon Jul 31 20:35:00 2023 +0300

    fix half2 decomposition

commit c1cb70d
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 31 19:56:44 2023 +0300

    new build arg LLAMA_CUDA_MMQ_Y

commit c1664a0
Merge: 4336231 0728c5a
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 31 19:32:27 2023 +0300

    Merge 'origin/master' into hipblas

commit 848558d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 30 20:02:52 2023 -0500

    import vars logic fix

commit b650b84
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 30 00:21:36 2023 -0500

    Update easy_KCPP-ROCm_install.sh

commit 8573a67
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 21:31:12 2023 -0500

    remove duplicate code and fix typo

    remove duplicate tooltip

commit 430986e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 21:07:34 2023 -0500

    hide "missing" if all are built

    move tooltip functions to helper functions section. hides the string "Missing: ..." from showing if all backends are available
    " if len(runopts)==6 else + "

commit dd0db72
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 20:52:31 2023 -0500

    hide "missing" if all are built

    move tooltip functions to helper functions section. hides the string "Missing: ..." from showing if all backends are available

commit 43fffb6
Merge: 0ed65a4 b40550c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 19:13:15 2023 -0500

    Merge branch 'concedo'

commit 0ed65a4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 18:34:21 2023 -0500

    Hide unavailable backends & Add tooltip over backend count

    Hides unavailable backends from the user and if the program is launched without any backends made, it shows an error message to them stating no backends were found and to make them using the 'make' command

    Add tooltip when hovering over backend count label

    hovering over the new label that shows the backend count will explain what the numbers are, and show the users which backends are not available or built

commit 2a26398
Merge: cee2e9d 31486eb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 29 15:16:33 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 4336231
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 18:35:56 2023 +0300

    add hipBLAS to README

    ---------

    Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>

commit f8e3fc6
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 14:16:46 2023 +0300

    rocblas init stuff

commit d2ade63
Merge: cde52d6 8a88e58
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jul 29 12:59:48 2023 +0300

    Merge 'origin/master' into hipblas

commit cee2e9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 26 23:36:55 2023 -0500

    Only Show Available Backends in GUI

    Hides unavailable backends from the user and if the program is launched without any backends made, it shows an error message to them stating no backends were found and to make them using the 'make' command

commit 7863610
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 26 13:27:22 2023 -0500

    Update easy_KCPP-ROCm_install.sh

commit 731cd6e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jul 25 22:39:50 2023 -0500

    Create easy_rocm_install.sh

commit f154685
Merge: cbdc1f3 94e0a06
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jul 25 22:25:10 2023 -0500

    Merge branch 'concedo_experimentalMAIN'

commit cbdc1f3
Merge: 5b838d4 9731682
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 16:53:21 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit cde52d6
Merge: 8e8054a 84e09a7
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 24 12:22:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 8e8054a
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 24 12:20:49 2023 +0300

    Add rocblas to build files

commit 1f6294d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:52:01 2023 -0500

    Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)

    * initialize rocblas

commit 5b838d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:10:35 2023 -0500

    amd multigpu full layer offload w/o vram scratch

commit 9bfb2fd
Merge: b379f9d 66328fc
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:07:44 2023 -0500

    Merge branch 'concedo_experimental'

commit b379f9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 03:07:00 2023 -0500

    Revert "amd multigpu full layer offload w/o vram scratch"

    This reverts commit 9adfc8e.

commit 9adfc8e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 02:56:40 2023 -0500

    amd multigpu full layer offload w/o vram scratch

commit 05c792e
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 24 00:18:48 2023 -0500

    initialize rocblas

commit ade68d0
Merge: 521ad6b 56995ca
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 23 20:25:05 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 521ad6b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jul 20 21:42:33 2023 -0500

    lazy import_var error handling for saves

commit 9553e52
Merge: cac6650 f036109
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jul 20 19:59:41 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit cac6650
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 17 23:05:02 2023 -0500

    Makefile fix! Allows hip/clblast build together

commit 3db70b5
Merge: 2ec4466 7568d1a
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 18 01:54:17 2023 +0300

    Merge 'origin/master' into hipblas

commit f208670
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 14 02:56:03 2023 -0500

    improve error handling with gpu names

commit 860e738
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 14 00:33:03 2023 -0500

    Show GPU names in GUI, Only show GPUs that exist

    changed the pre-set 1,2,3 and 1,2,3,all settings that the GPU selector had and replaced them with a function that grabs the GPU names and sets the names as the values for the selector boxes.

commit 2ec4466
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jul 13 13:44:02 2023 +0300

    Update build flags.

    GGML_CUDA_DMMV_Y is now GGML_CUDA_MMV_Y
    so update your build instructions.

    GGML_CUDA_FORCE_DMMV is always enabled.

    ---------

    Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>

commit cd36b18
Merge: afcb8fe 1cbf561
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jul 13 13:03:01 2023 +0300

    Merge 'origin/master' into hipblas

commit ac7ebc3
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 18:32:18 2023 -0500

    add hipBLAS name scheme to GUI and update README

commit 7f85cc5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 17:35:54 2023 -0500

    update makefile and ggml.c

commit 6ca3499
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 15:43:45 2023 -0500

    ggml.c fix

commit 770e674
Merge: 2b289cd 5941514
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 15:24:36 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 2b289cd
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:30:00 2023 -0500

    Update c-cpp.yml

commit 5dae95a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:28:51 2023 -0500

    Update c-cpp.yml

commit b37cd73
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 12 14:27:04 2023 -0500

    Create c-cpp.yml to test Actions

commit afcb8fe
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 18:09:27 2023 +0300

    Add new config option

commit 8c2c497
Merge: e610466 2347463
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 17:53:54 2023 +0300

    Merge 'origin/master' into hipblas

commit e610466
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 11 17:53:14 2023 +0300

    Expand arch list and make it overrideable

commit 80e4e54
Merge: 7735c5a 1d16309
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Jul 10 02:09:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 8432e9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:55:30 2023 -0500

    Update Makefile

commit b58c189
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:20:00 2023 -0500

    Add multi-gpu CuBLAS support to new GUI

commit 0c1c71b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 07:56:57 2023 -0500

    Update Makefile

commit f864f60
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sat Jul 8 00:25:15 2023 +0200

    CUDA: add __restrict__ to mul mat vec kernels (ggerganov#2140)

commit 4539bc2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 01:36:14 2023 -0500

    update makefile for changes

commit 912e31e
Merge: 74e2703 ddaa4f2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 7 23:15:37 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 74e2703
Merge: cf65429 f9108ba
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 5 15:16:49 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit 7735c5a
Merge: c3e3733 7ee76e4
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 4 17:09:16 2023 +0300

    Merge 'origin/master' into hipblas

commit cf65429
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:56:40 2023 -0500

    print cuda or opencl based on what's used

commit 72c16d2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:45:39 2023 -0500

    Revert "fix my mistake that broke other arches"

    This reverts commit 777aed5.

commit 777aed5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 15:53:32 2023 -0500

    fix my mistake that broke other arches

commit 27780a9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:03:27 2023 -0500

    rocm fixes

commit f52c7d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:02:58 2023 -0500

    Revert "rocm fixes"

    This reverts commit 2fe9927.

commit 2fe9927
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:58:21 2023 -0500

    rocm fixes

commit efe7560
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:43 2023 -0500

    Revert "move HIPBLAS definitions into ggml-cuda.h"

    This reverts commit bf49a93.

commit 4fc0181
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:36 2023 -0500

    Revert "move hipblas definitions to header files"

    This reverts commit 2741ffb.

commit 89eb576
Merge: 2741ffb 3d2907d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 14:44:13 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit c3e3733
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jul 2 15:51:31 2023 +0300

    ROCm fixes

commit 15db19a
Merge: 04419f1 46088f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jul 2 15:39:57 2023 +0300

    Merge 'origin/master' into hipblas

commit 2741ffb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 17:07:42 2023 -0500

    move hipblas definitions to header files

commit bf49a93
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 16:38:50 2023 -0500

    move HIPBLAS definitions into ggml-cuda.h

commit 540f4e0
Merge: 2c3b46f eda663f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 14:58:32 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 2c3b46f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:43:43 2023 -0500

    changes to fix build

commit c9e1103
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:20:07 2023 -0500

    Update ggml_v2-cuda-legacy.cu for ROCM

commit b858fc5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 17:49:39 2023 -0500

    changes to work with upstream

commit 69a0c25
Merge: 096f0b0 1347d3a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 16:59:06 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 04419f1
Merge: bb16eff d3494bb
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 28 23:30:10 2023 +0300

    Merge 'origin/master' into hipblas

commit bb16eff
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:10 2023 -0500

    headers fix; add kquants_iter for hipblas and add gfx803 (#1)

    * kquants_iter for hipblas and add gfx803
    * Update CMakeLists.txt with hipblas kquants_iter and DMMV_F16
    * remove dmmv_f16 for now

commit 096f0b0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:02 2023 -0500

    revert unnecessary hipblas conditionals

commit d81e81a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 14:48:23 2023 -0500

    Update Makefile hipblas nvcc correction

commit c8ae945
Merge: c1e5c83 0be54f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 27 10:50:37 2023 +0300

    Merge 'origin/master' into hipblas

commit 2579ecf
Merge: abed427 d2034ce
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 25 17:50:04 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit c1e5c83
Merge: 35a6031 447ccbe
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 25 21:40:05 2023 +0300

    Merge 'origin/master' into hipblas

commit 35a6031
Merge: df7346c 66a2555
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 25 10:57:48 2023 +0300

    Merge 'origin/master' into hipblas

commit abed427
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 19:16:30 2023 -0500

    reorganize if statements to include proper headers

commit 06c3bf0
Merge: ea6d320 8342fe8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 16:57:20 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit ea6d320
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jun 23 01:53:28 2023 -0500

    Update README.md

commit 4d56ad8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 16:19:43 2023 -0500

    Update README.md

commit 21f9308
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 15:42:05 2023 -0500

    kquants_iter for hipblas and add gfx803

commit df7346c
Merge: 5dd2fbe 7487137
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jun 22 20:51:09 2023 +0300

    Merge 'origin/master' into hipblas

commit b6ff890
Merge: eb094f0 e6ddb15
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 12:42:09 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit eb094f0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 23:59:18 2023 -0500

    lowvram parameter description

commit 3a5dfeb
Merge: 665cc11 b1f00fa
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 16:53:03 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit 665cc11
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 01:13:19 2023 -0500

    add lowvram parameter

commit 222cbbb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 19:03:28 2023 -0500

    add additional hipblas conditions for cublas

commit e1f9581
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 16:51:59 2023 -0500

    Add hip def for cuda v2

commit 3bff5c0
Merge: a7e74b3 266d47a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 13:38:06 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit a7e74b3
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:04:18 2023 -0500

    Update README.md

commit 5e99b3c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:03:42 2023 -0500

    Update Makefile

commit 9190b17
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 21:47:10 2023 -0500

    Update README.md

commit 5dd2fbe
Merge: 67e229b 20568fe
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 20 01:23:12 2023 +0300

    Merge 'origin/master' into hipblas

commit 2780ea2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 15:48:00 2023 -0500

    Update Makefile

commit 04a3e64
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:33:39 2023 -0500

    remove extra line

commit cccbca9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:17 2023 -0500

    attempt adding ROCM hipblas

commit a44a1d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:01 2023 -0500

    attempt adding ROCM hipblas

commit b088184
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:30:54 2023 -0500

    attempt adding ROCM hipblas

commit 67e229b
Merge: 6f7c156 b241649
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Jun 18 00:36:54 2023 +0300

    Merge 'origin/master' into hipblas

commit 6f7c156
Merge: 61df8e9 fc45a81
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Jun 17 16:53:22 2023 +0300

    Merge 'origin/master' into hipblas

commit 61df8e9
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 14 22:46:10 2023 +0300

    add cudaMemset

commit a836529
Merge: 85f902d 254a7a7
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Jun 14 22:41:55 2023 +0300

    Merge 'origin/master' into hipblas

commit 85f902d
Merge: 4362e80 b50b570
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Jun 8 10:50:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 4362e80
Merge: fa5b3d7 17366df
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 23:14:40 2023 +0300

    Merge 'origin/master' into hipblas

commit fa5b3d7
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:47:00 2023 +0300

    fix makefile.

commit 1ba4ce4
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:41:08 2023 +0300

    Revert "warp size fixes"

    It seems like 32 is faster for me, at least, and it won't cause so many conflicts.

    This reverts commit 5d6eb72.

commit 5d6eb72
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 18:32:41 2023 +0300

    warp size fixes

commit 33091a9
Merge: 9fdaa1d 2d43387
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jun 6 16:19:23 2023 +0300

    Merge 'origin/master' into hipblas

commit 9fdaa1d
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 27 19:17:53 2023 +0300

    Add more defs

    For forward compatibility ggerganov#1607

commit a4648c1
Merge: 4c8b3fb 0ecb1bb
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 27 18:22:39 2023 +0300

    Merge 'origin/master' into hipblas

commit 4c8b3fb
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 01:08:53 2023 +0300

    add configurable vars

commit 30d921a
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 01:03:56 2023 +0300

    and makefile

commit a593a4f
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 00:55:28 2023 +0300

    Add missing parameters

commit 174bf6a
Merge: f80ce7a 1fcdcc2
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 26 00:44:23 2023 +0300

    Merge 'origin/master' into hipblas

commit f80ce7a
Merge: 600ace3 ac7876a
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 25 00:02:50 2023 +0300

    Merge branch 'origin/master' into hipblas

commit 600ace3
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 23:42:20 2023 +0300

    update warp size

commit b19fefe
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 23:28:08 2023 +0300

    Forwardcompat

commit c66115b
Merge: a0b2d5f b8ee340
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 20 18:29:31 2023 +0300

    Merge 'origin/master' into hipblas

commit a0b2d5f
Merge: 8bab456 2a5ee02
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue May 16 17:08:29 2023 +0300

    Merge 'origin/master' into hipblas

commit 8bab456
Merge: 2956630 b5c9295
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon May 15 00:01:12 2023 +0300

    Merge 'origin/master' into hipblas

commit 2956630
Merge: 0fe6384 f048af0
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 13 13:12:52 2023 +0300

    Merge 'origin/master' into hipblas

commit 0fe6384
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 12 17:22:11 2023 +0300

    fix makefile

commit 605560d
Merge: 127f68e 089b1c9
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri May 12 16:12:53 2023 +0300

    Merge 'origin/master' into hipblas

commit 127f68e
Merge: 070cbcc b608b55
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 11 20:21:27 2023 +0300

    Merge 'origin/master' into hipblas

commit 070cbcc
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 18:10:56 2023 +0300

    occupancy function

commit a3296d5
Merge: 0aefa6a e129551
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 18:06:04 2023 +0300

    Merge 'origin/master' into hipblas

commit 0aefa6a
Merge: baeb482 1b0fd45
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 12:24:41 2023 +0300

    Merge 'origin/master' into hipblas

commit baeb482
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun May 7 12:24:12 2023 +0300

    Revert to default copy

commit 289073a
Merge: 1107194 173d0e6
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 6 19:59:41 2023 +0300

    Merge 'origin/master' into hipblas

commit 1107194
Merge: 04c0d48 a3b85b2
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat May 6 00:38:20 2023 +0300

    Merge 'origin/master' into hipblas

commit 04c0d48
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 4 12:31:16 2023 +0300

    Move all HIP stuff to ggml-cuda.cu

commit d83cfba
Merge: b67cc50 799fdc1
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu May 4 11:31:16 2023 +0300

    Merge 'origin/master' into hipblas

commit b67cc50
Merge: fcbc262 e216aa0
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed May 3 15:04:51 2023 +0300

    Merge 'origin/master' into hipblas

commit fcbc262
Merge: c73def1 f4cef87
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon May 1 22:45:29 2023 +0300

    Merge 'origin/master' into hipblas

commit c73def1
Merge: d8ea75e f0d70f1
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Apr 30 18:40:42 2023 +0300

    Merge 'origin/master' into hipblas

commit d8ea75e
Merge: d194586 334637e
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Apr 29 11:25:51 2023 +0300

    Merge 'origin/master' into hipblas

commit d194586
Merge: 2ab9d11 7f15c5c
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 23:03:52 2023 +0300

    Merge 'origin/master' into hipblas

commit 2ab9d11
Merge: 3b4a531 04aaae1
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 16:30:05 2023 +0300

    Merge 'origin/master' into hipblas

commit 3b4a531
Merge: a1caa48 0b2da20
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 10:08:41 2023 +0300

    Merge 'origin/master' into hipblas

commit a1caa48
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 10:08:21 2023 +0300

    add more cuda defines

    This is so 'slaren/cuda-f16f32' would merge.

commit ecc0565
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 28 01:58:27 2023 +0300

    only the .cu file needs to be compiled as device code

commit ef51e9e
Merge: d571d16 4afcc37
Author: Henri Vasserman <henv@hot.ee>
Date:   Wed Apr 26 12:46:26 2023 +0300

    Merge branch 'ggerganov:master' into hipblas

commit d571d16
Merge: 608aa33 dd0eabc
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Apr 25 21:15:33 2023 +0300

    Merge 'origin/master' into hipblas

commit 608aa33
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Apr 25 21:15:04 2023 +0300

    change default GPU arch to match CMake

commit 3a004b2
Author: Henri Vasserman <henv@hot.ee>
Date:   Mon Apr 24 02:24:54 2023 +0300

    add rpath

commit db7a012
Merge: 3677235 284685f
Author: Henri Vasserman <henv@hot.ee>
Date:   Sun Apr 23 21:49:28 2023 +0300

    Merge 'origin/master' into hipblas

commit 3677235
Author: Henri Vasserman <henv@hot.ee>
Date:   Sat Apr 22 23:28:00 2023 +0300

    More build file changes

commit d3e1984
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 21 03:32:06 2023 +0300

    add rpath

commit 0e005f7
Author: Henri Vasserman <henv@hot.ee>
Date:   Fri Apr 21 02:13:00 2023 +0300

    Build file changes

    HIP Clang is no longer required; the CMake scripts will configure the
    needed compiler, which can be the system clang++. Other code can still
    use GCC, but CMake will force clang to do the linking.

commit 54a63c1
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Apr 20 22:19:22 2023 +0300

    Update Makefile for the Cuda kernels

commit 0fd8363
Author: Henri Vasserman <henv@hot.ee>
Date:   Thu Apr 20 02:04:00 2023 +0300

    use hipblas based on cublas

* Merge Fixes

* readme merge fix

* remove old ggmlv2 changes

* bring ggml v2_cuda up to date with AMD changes

* Revert ggml v2_cuda changes because they weren't needed

This reverts commit 3385dd4.

* avoid launching subprocesses to get device names for now; other than that it seems to be working

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
YellowRoseCx pushed a commit that referenced this pull request Sep 30, 2023
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train

* remove unnecessary Adam(W) optimizer tensors.

reduces optimizer memory overhead from 7*modelsize to 2*modelsize.

additionally allows optimizing models with more than 2^31 parameters by replacing int with int64_t.

bumps the training checkpoint file version, but old checkpoints can still be read.
the new version with fewer tensors is saved.

* add gradient clipping to AdamW

* Fix reset of unused g->nodes and g->grads to NULL

* implement gradient checkpointing for training

reduces memory overhead from O(n_layer) to O(sqrt(n_layer))

as explained in readme of https://github.com/cybertronai/gradient-checkpointing

* remove unused compute buffer 3

* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes

GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);

* change AdamW decay parameter to work like the torch AdamW decay parameter

It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.

`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]

* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT

* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW

btw: the default weight decay parameter for torch.optim.AdamW is 0.01

* bug fixes for cross entropy loss

ggml_cross_entropy_loss: sums were not correctly added in the workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues

guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16

cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.

* fix test-grad0 for cross_entropy_loss

the second argument to cross_entropy_loss must sum up to 1 for each row

* fix test-grad0 for soft_max

don't use only sum as aggregation, because the sum of softmax is always 1 -> finite differences would not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)

* improve finite differences of test-grad0 by using double instead of float

* change cross_entropy_loss to output average over all rows

this helps keeping the loss and gradients in a sane range

* improve gradient checkpointing

sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor checkpoints we use, the optimal value is computed differently:

```
  given: n, u, v
  objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
  b=n/a
  minimize(a*u+v*n/a)
  diff(a*u+v*n/a, a) = u - (v*n/a)/a
  diff(a*u+v*n/a, a) == 0
  u - (v*n/a)/a == 0
  u == v*n/(a*a)
  u*a*a = v*n
  a*a = v*n/u
  a = sqrt(n*v/u)
```

this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage.

* disable gradient checkpointing debug output

* llama : fix rope usage in train-text-from-scratch after ChatGLM change

* add more training parameters:

--enable-restart N         Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N        Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N               Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N              Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N              AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N         Adam minimum learning rate alpha, usually 0.1 * alpha

* replace memcpy with reshape operation so that the graph is not cut at the input

this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it

* remove unused function argument from get_example_targets_batch

* measure and print total training time

* add optimization callback to ggml_opt_resume_g

this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).

can be used for dynamic learning schedule and setting input data for batches before each iteration
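
For illustration, a hedged sketch of such a callback; the callback signature and the struct here are assumptions based on the description above, not the verbatim API:

```c
// hypothetical user data passed to the callback
struct opt_callback_data {
    int iter; // current optimizer iteration
    // ... batch buffers, schedule parameters, etc.
};

// invoked before each iteration with custom data and a pointer to the
// learning-schedule parameter used by Adam(W)
static void my_opt_callback(void * vdata, float * sched) {
    struct opt_callback_data * data = (struct opt_callback_data *) vdata;
    // set input data for the next batch here, e.g. copy tokens into input tensors
    *sched = 1.0f; // dynamic learning schedule in [0..1]; constant for brevity
    data->iter++;
}
```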

* use optimization callback in training

allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters

reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration

* add minimum number of tensor dimensions to apply weight decay (default 2)

this allows to not apply weight decay to bias parameters

* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup

* fix increase of model.train_samples and model.train_tokens

now that each optimizer iteration gets its own batch, we need to multiply by the number of opt iterations

* change sampling parameters for prediction after training to defaults of common.h

and clarify what is context for prediction and what are generated tokens

* tighten abs error bounds for cross_entropy_loss in test-grad0

* add conditional compilation of using F16 exp in flash attention

uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention

* tighten abs error bounds for flash_attn in test-grad0

* tighten abs error bounds for sqrt in test-grad0

* remove out-commented vectorized code of opt_adam

the vectorized code might be bit faster for low number of parameters, but it had a big memory usage overhead

* ggml : update ggml_rms_norm_back with configurable eps

* llama training : fix ggml_rms_norm_back calls to pass configurable eps

* remove trailing whitespace

* add train function using automatic gradient checkpointing backward pass and allocator

* in train function replace add_inplace by regular add

because using add_inplace seems to result in different gradients

* don't use allocate hash_map on context

because the context has no_alloc=True when using the memory allocator, resulting in NULL data pointers

* correctly clone reshape and permute operations by also cloning tensor->nb values

* fix variable name and add missing type cast

* terminate recursive tensor cloning when reaching tensor without src tensors

* correctly clone view tensors by setting data pointers

without this the checkpointing would only work when being used together with memory allocator

* fix variable names

* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`

* add input tensors as checkpoints

so that recursive tensor cloning of gradient checkpointing terminates on input tensors

* fix variable name and add missing boolean negation

* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:

output and parameter gradient tensors need to be available at the end of the graph execution

parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration

checkpoint tensors are allocated all together to reduce memory allocator fragmentation

afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs

* fix ASSERT to work with zero layers

* add training options whether to use allocator and/or unified training function

* integrate unified training function which may use memory allocator

the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing

* format name of cloned tensors with " (clone)" suffix

* set names for tensors in unified train function for easier debugging

* allocate graph on context using ggml_new_graph

* remove handwritten training functions

* remove unused training parameters "use_scratch" and "use_unified"

* remove trailing whitespace

* remove unused train params: mem_compute1_gb & mem_compute2_gb

mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)

* remove unused forward_batch function

* add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly

* only use ggml_allocr_alloc when tensor has NULL data and is no view

* fix test when to create temporary backward graph

temporary backward graph is only necessary when using checkpointing

* fix memory "leak" in optimizers

each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.

* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator

with this loop order, gradient checkpointing with the allocator saves 13% memory on a 16-layer model and 2% memory on a 2-layer model.

the computation results are the same

* add API functions to access llama model tensors

* add stub example for finetuning, based on train-text-from-scratch

* move and remove code

* add API functions to access remaining model parameters:

mult, head and rot

* first draft for LORA finetune training

* remove const model and layer arguments in API functions for accessing model tensors

* bug fixes to make finetune compile

automatic allocator does not work yet

* add debug prints for training memory improvements

* fix names of lora tensors

* avoid stack overflow resulting from big ggml_cgraph

replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand

* replace llama API functions to get model tensors by one function to get model tensor by name

LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);

* remove unused call to not existing llama_get_layer_from_model

* implement ggml_compute_forward_out_prod_q_f32

* remove trailing whitespace

* add lora finetune support on quantized base model tensors

* add ggml_add_cast API function

this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
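
For illustration, a minimal usage sketch (tensor names are placeholders):

```c
// base_wq: quantized base weight (src0), lora_delta: F32 LORA update (src1);
// the result tensor is created with the requested F32 type
struct ggml_tensor * patched = ggml_add_cast(ctx, base_wq, lora_delta, GGML_TYPE_F32);
```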

* use ggml_add_cast in finetuning

lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models

* bug fix: actually use result type passed to ggml_add_cast

* make sure base model tensors data cannot be used in viewable operations

the memory allocator would try to make the lora application inplace on base model tensors.
since those are memory mapped, this would result in memory access violations

* fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors

* avoid keeping in memory ALL of the gradients

The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previously freed memory location might lead to non-zero input gradients.

During ggml_compute_backward the gradients are built stepwise by adding or subtracting new values, starting from an OP_NONE tensor which needs to contain zero values. This requires the graph reset.

To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
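
A condensed sketch of that decision (a sketch only; the hash-table lookup is abstracted into a bool here, unlike the actual ggml internals):

```c
// grad: current gradient (may still be the initial zero tensor)
// delta: value to accumulate into the gradient
static struct ggml_tensor * add_or_replace(
        struct ggml_context * ctx,
        struct ggml_tensor  * grad,
        struct ggml_tensor  * delta,
        bool grad_in_zero_table) {     // result of the hash-table lookup
    if (grad_in_zero_table) {
        return delta;                  // replace: the zero tensor needs no memory slot
    }
    return ggml_add(ctx, grad, delta); // regular add; allocator may make it inplace
}
```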

* remove trailing whitespace

* remove debug prints and function to compute tensor data hash

* improve optimization iteration prints

* adjust maximal values to support finetuning 3B models

* change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4

* bug fix: make sure finetune input gradient is allocated at begin and kept until end

* remove unnecessary src tensor from ggml_get_rows_back

we don't need the data of src[2] for computation, only to set up the correct output shape.
remove dependency on src[2], so that allocator can work more freely.

the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.

* remove unnecessary src tensor from ggml_repeat & ggml_repeat_back

we don't need the data of src[1] for computation, only to set up the correct output shape.
remove dependency on src[1], so that allocator can work more freely.

the computational graph is still completely determined, because the output shape is naturally included

* resolve todo

allocator will only make it inplace when they are of the same type

* mixing multiple LORA adapters is now possible

pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.

* add option to save finetune output every N iterations

* also save latest finetune output with ITERATION="LATEST" and print where files are saved

saving with LATEST makes it easier to resume training from the latest checkpoint
the string "LATEST" can be configured with command line option "--fn-latest STR"

* update checkpoint train stats before saving via "--save-every"

* add command line option `--rank-wo N` for rank of wo tensor

* update finetune README

* fix dump_non_result_info_yaml to output multiple lora adapters

* bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t)

* replace llama_n_mult by llama_n_ff

* finetune bug fixes to compile with merged in code from master

* remove prediction related code to reduce duplicated code with main

use main instead

* reduce large memory overhead in train-text-from-scratch

all gradients had to be pinned so that graph_reset works correctly.
this is no longer necessary with the changes to ggml_compute_backward introduced in this PR.

* add comment explaining why finetune checkpoints are allocated in one block

* make default value of float member a float literal

* handle rms_norm and rope parameters the same as in train-text-from-scratch

* remove unused code

* remove vocab related code as it is unnecessary

* add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints

so that they can be differentiated from lora finetune checkpoints

* add gguf constants and load/save functions from train-text-from-scratch

* add load & save lora finetune checkpoints via gguf

* add python script to convert old finetune checkpoint files to gguf

* remove old checkpoint save & load code

* remove code to print data checksums which was used to verify correctness of new gguf code

* omit tokenization when training is disabled, only save llama lora adapter

training can be disabled by passing '-n 0' to finetune

* remove trailing whitespace

* update README.md

* implement ggml_compute_forward_repeat_f16

* avoid stack overflow of large cgraphs in test-grad0

* add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32

ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors.
in case of a non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index and passed to the '_nd' function equivalent.

this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore
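
A brief usage sketch of the new index handling, assuming the signatures follow the description above:

```c
#include "ggml.h"

// read element i of a possibly non-contiguous tensor t:
// unravel the 1d index into a multi index, then use the '_nd' getter
static float get_elem(const struct ggml_tensor * t, int64_t i) {
    int64_t i0, i1, i2, i3;
    ggml_unravel_index(t, i, &i0, &i1, &i2, &i3);
    return ggml_get_f32_nd(t, (int)i0, (int)i1, (int)i2, (int)i3);
}
```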

increase test-grad0 context mem size to accommodate the bigger cgraph

* add sanity check to ggml_compute_backward, asserting the correct shape of gradients

* fix ggml_acc_or_set to return tensor of correct shape

* remove unused 'inplace' argument from ggml_compute_backward function

inplace operations to add gradients are no longer created by ggml_compute_backward
use allocator to automatically make inplace operations

* add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations

* fix error message in ggml_allocr_alloc to display actual max_avail

* fix check_gradient

ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of the forward graph to the backward graph was missing

* use tensor->view_src instead of ggml_is_view and get_view_source

* move gradient checkpointing code into ggml, new API function:

// build gradient checkpointing backward graph gb for gf using provided checkpoints
// gb_tmp will contain original backward graph with rewritten backward process nodes,
// but without the second forward pass nodes.
GGML_API void ggml_build_backward_gradient_checkpointing(
        struct ggml_context   * ctx,
        struct ggml_cgraph    * gf,
        struct ggml_cgraph    * gb,
        struct ggml_cgraph    * gb_tmp,
        struct ggml_tensor  * * checkpoints,
        int                     n_checkpoints);
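
A hedged usage sketch (the checkpoint selection is illustrative):

```c
// gf: forward graph; checkpoints: chosen tensors (e.g. layer inputs) to keep
struct ggml_cgraph * gb     = ggml_new_graph(ctx);
struct ggml_cgraph * gb_tmp = ggml_new_graph(ctx);
ggml_build_backward_gradient_checkpointing(ctx, gf, gb, gb_tmp,
        checkpoints, n_checkpoints);
```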

* replace custom data getters and setters by ggml functions

* train-text-from-scratch can train (full finetune) gguf models

just pass the gguf model via `--checkpoint-in FN`.
after this, to continue training, pass the generated checkpoint instead of the original gguf model.

tested with smaller models, bigger models may exceed available memory.
use (LORA) finetune for those.

* remove trailing whitespace

* add option to save train-text-from-scratch output every N iterations

* update README.md

* fix warnings

* fix warnings

* remove finetune option to disable allocator

the allocator should always be used.
by making sure that it is always used it gets easier to implement automatic memory requirements computation

* add tensor checkpoints only when gradient checkpointing is enabled

* initialize opt ggml context if none was provided

* add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc

GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc);

* finetune: automatically allocate all memory and changes to command line options

remove the '--n_examples N' parameter, as it no longer makes sense to call the optimization process multiple times in a loop.
add the '--only_write_lora' command line option: skips tokenization and training, to only write a llama.cpp compatible LORA adapter.
remove memory buffer related command line options.
improve iteration console output.

* add finetune to Makefile

* update README.md

* print time per iteration and estimate remaining time

* increase measured alloc size by tensor_alignment

ggml_allocr_reset will reduce the given size by up to tensor_alignment-1

* fix README.md

* add some more allocator debug prints

* bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue

* revert last commit

"bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue"

"alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size."

This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue.

* remove unnecessary "0x" before "%p" output

* move measurement memory segment to upper region of the address space

* update README.md

* fix printf format warnings

* add missing gguf_free in load_checkpoint_lora_file

* load default rms_norm and rope parameters from base model

* add gradient accumulation

specify the number of accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.

* fix tracking of train_samples and train_tokens

* build : fix compile warnings

* ggml : fix L-BFGS linesearch loop

* improve finetune time measurement

fix printf warnings on systems where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to the actual time per iteration by removing the very small duration measured before the first iteration was performed.
fix bug in output of total training time: the reported value was 1000 times too small.

* specify default lora rank with '--lora-r N'

'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.

* fix gradient accumulation bug where the same batch was used for each microstep

* fix gradient accumulation bug where the same batch was used for each microstep

* support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back

k and v can now be repeated in q along ne[2]

in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.

in the backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.
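
The index mapping in a compact sketch (variable names follow the description above; the gradient accumulation itself is elided):

```c
// each thread owns one k/v row (ik2,ik3) exclusively and visits all q rows
// that read from it
static void visit_q_rows_for_kv_row(const int ik2, const int nek2, const int neq2) {
    const int nrep = neq2/nek2;            // how often k/v are repeated in q
    for (int irep = 0; irep < nrep; ++irep) {
        const int iq2 = ik2 + irep*nek2;   // q row sharing this k/v row
        (void) iq2; // here the kernel accumulates into k->grad and v->grad
    }
}
```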

since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.

we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion that the second argument is contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.

change test-grad0 to also test for repeated k/v in q.

this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.

* add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.

* fix finetune to support grouped-query-attention (using flash-attention)

note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.

* support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)

* test broadcasting mul_mat backward pass

* decouple random number generator of each operation test

when changing one test, the rng of other tests is not influenced anymore

* add comment briefly describing what ggml_repeat_back does

* simplify broadcasting mul_mat backward using ggml_repeat_back

* add cgraph evaluation order member and corresponding enum type

this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).

* measure max compute size for each cgraph eval order and use best order

this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB

* remove unused command line options

* add sample start patterns and options to force new or by default resume last shuffling

* update shuffle rng state on reshuffle

* exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32

* remove probably unnecessary exception type flags from stringstream

* pass correct max number of tokens to llama_tokenize

* account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]

* use unrolled vec_mad in out_prod

y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.

ggml_vec_mad_f32_unroll will internally loop over x and v with same y.

GGML_VEC_MAD_UNROLL is by default defined to 32.

This value was empirically optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives a 23% performance boost for out_prod.

Full measurements of out-prod runtime in ms:
unroll	unroll_xv	unroll_yv
1	67014.643	87826.469
2	77117.552	89077.656
4	72091.311	109121.657
8	61077.543	88678.334
16	56914.67	79514.947
24	59024.595	84350.254
28	55952.446	83368.73
32	51476.658	85177.745
36	55973.792	84659.92
40	55139.616	93844.738
48	60736.392	93330.267
64	99856.878	116994.99

The unroll_yv column shows the runtime when unrolling yv instead of xv.
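
A rough sketch of the unrolled pattern (the shape of it, not the exact ggml kernel):

```c
#define GGML_VEC_MAD_UNROLL 32

// accumulate GGML_VEC_MAD_UNROLL (x, v) pairs into the same y:
// xv holds the input vectors back to back, vv holds the matching scalars
static void vec_mad_f32_unroll_sketch(const int n, float * y,
                                      const float * xv, const float * vv) {
    for (int i = 0; i < n; ++i) {
        float sum = y[i];
        for (int u = 0; u < GGML_VEC_MAD_UNROLL; ++u) {
            sum += xv[u*n + i]*vv[u]; // same y element reused across all pairs
        }
        y[i] = sum;
    }
}
```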

* set lora_alpha to value of lora_r if it is not set via command line

otherwise only changing lora_r will change scaling of lora adapter used in prediction

* reshuffle original sample order instead of the previous shuffled order

otherwise a resumed reshuffle will not result in the same sample order

* block tiling for out-prod inspired by mul-mat

block sizes are empirically optimized

roughly doubles the flops of out-prod

* exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32

* add static keywords

* remove outcommented old code

* update train-text-from-scratch with tokenization, sample selection and shuffling from finetune

* remove lbfgs related train parameters

* move common train functions into common/train.[h|cpp]

* move train state into struct train_state

* move train data saving code into callback to unify code of opt_callback

train_params are still different in finetune and train-text-from-scratch, so they can't yet be moved to train.h|cpp

* move common train params into common/train

* move common opt_callback into common/train

* fix consume_common_train_arg

* save and load head_count_kv in lora checkpoints

* increase train_samples by used_samples instead of number of batches

one batch can contain more than one sample when the option "fill_with_next_samples" is used

* fix usage of llama_tokenize

* remove static from process_escape since we need it exposed in header

fix code formatting of long function declarations

* fix condition in load_train_state_gguf

* use die("msg") instead of replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg")

* fix saving and loading of training type

* remove terminating '\0' from tokenization

(llama_tokenize is now passed the string length instead of relying on terminating '\0')

* fix compile warnings

* fix compile warnings

* use new/delete for train_state instead of malloc/free

using malloc may result in seg faults when trying to assign string fields

* assert that sample_count > 0, avoiding division by zero

* fix frand to return value in interval [0,1)

* add train option "--sample-random-offsets"

Use samples beginning at random offsets.
The offset is only applied to the first sample in each batch context window.
Together with "--fill-with-next-samples" this may help for training endless text generation.

For example given a dataset containing samples "abcd", "ABCD", "0123".
With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos",
the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc.

With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc.

* deduplicate code into function

* remove n_rot hparam, as it must always be hparam.n_embd_head()

* align code

* assert correct base model tensor shapes

* move some params from lora hparams into model hparams and load model params from gguf

this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters

remove now-unnecessary llama API functions to get model params that were added by this PR

* train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N'

* train-text-from-scratch: automatically allocate opt context

* train-text-from-scratch: automatically allocate input tensors

* train-text-from-scratch: automatically allocate compute memory

* remove unused options and equalize train-text-from-scratch with finetune

* initialize opt->loss_after with zero

* add export-lora program

* remove trailing whitespace

* add export-lora build in Makefile

* remove unused struct tensor_info from export-lora

* add export-lora build dependency to llama

because it depends on common, which depends on llama

* update finetune README.md

* cancel optimization when specified number of epochs is completed

* improve handling of export-lora arguments

print errors and warnings when files could not be read or created

* Fix export-lora.cpp "not enough space in the context's memory pool" (#1)

* Fix export-lora.cpp "not enough space in the context's memory pool"

Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)".

* increase required context size by 5*GGML_MEM_ALIGN instead of plain 16

---------

Co-authored-by: xaedes <xaedes@gmail.com>

* improve handling of not yet supported tensor types

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>
YellowRoseCx pushed a commit that referenced this pull request Oct 9, 2023
* vvhg-code-infill (#1)

* infill in separate example (#2)

* reverted changes to main and added infill example

* cleanup

* naming improvement

* make : add missing blank line

* fix missing semicolon

* brought infill up to current main code

* cleanup

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
YellowRoseCx pushed a commit that referenced this pull request Jul 26, 2024
YellowRoseCx pushed a commit that referenced this pull request Jul 26, 2024
YellowRoseCx pushed a commit that referenced this pull request Aug 20, 2024
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum sequence size
to 1; otherwise it is set to the original value, the result of `max_element()`.

Fixes the following crash, seen when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```
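
The guard described above, sketched (not necessarily the exact committed diff):

```cpp
// avoid max_element() on an empty vector; fall back to a single sequence
ctx_params.n_seq_max = n_pl.empty() ? 1 : *std::max_element(n_pl.begin(), n_pl.end());
```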

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>