Ability for `./main` to keep the model in memory and pass it more text #23

simonw · 2023-03-11T21:00:25Z

The ./main program currently outputs text and then quits.

How hard would it be to add a mode where it could stay running and be ready to accept more text piped to standard input?

This could help avoid the overhead of loading the model again every time the script runs.

Maybe it could output the generated text followed by a marker of some sort when it's done, so a wrapping process could see when it's finished and available to send a new prompt for evaluation.

I'm interested in wrapping it in a tiny Python web server to give myself a UI for interacting with the model.

The text was updated successfully, but these errors were encountered:

j-f1 · 2023-03-11T21:02:41Z

I made a fork (https://github.com/j-f1/forked-llama.cpp/tree/swift) that’s focused around working as a library rather than a standalone program. I don’t know how hard it would be to bridge that to Python but you might find some of the changes useful for writing a C++ command line program you can talk to via the command line.

G2G2G2G · 2023-03-12T02:23:06Z

agreed, a chat mode would be a lot better. The prompts this model generates are very bizarre lmao

ggerganov · 2023-03-12T06:31:42Z

Modifying main.cpp to continuously get new prompts with a single load of the model should be easy. Just have to manage the n_past and embd vars properly.

Interfacing this with the outside world will take some more effort.

bakkot · 2023-03-12T09:59:57Z

Interfacing this with the outside world will take some more effort.

One thing I've done previously is to drop in a single-file HTTP server, like this one, and then make an HTTP API. (Optionally also a single-file JSON parser/serializer, like this one, so that you can make the API JSON-based.)

It's a little silly, vs building as a library or adding python bindings or whatever, but it's cross-platform and very easy to get it working (~60 lines, or a little more if you want to stream the results rather than just sending them all when it's done).

example server

#include <cstdio>
#include "httplib.h"

using namespace httplib;

#define PORT 8080

std::string get_response(std::string message) {
  // call actual API here
  return "got message: " + message;
}

int main(void) {
  Server svr;

  if (!svr.is_valid()) {
    printf("server setup failed\n");
    return -1;
  }

  svr.Get("/", [=](const Request & /*req*/, Response &res) {
    res.set_content("POST api is listening on /api\n", "text/plain");
  });

  svr.Post("/api",
    [&](const Request &req, Response &res, const ContentReader &content_reader) {
      if (req.is_multipart_form_data()) {
        res.set_content("Server does not support multipart form data", "text/html");
        res.status = 500;
        return;
      }
      std::string body;
      content_reader([&](const char *data, size_t data_length) {
        body.append(data, data_length);
        return true;
      });
      // if it's JSON, change the content type to application/json
      res.set_content(get_response(body), "text/plain");
    });

  svr.set_exception_handler([](const Request& req, Response& res, std::exception_ptr ep) {
    auto fmt = "<h1>Error 500</h1><p>%s</p>";
    char buf[BUFSIZ];
    try {
      std::rethrow_exception(ep);
    } catch (std::exception &e) {
      snprintf(buf, sizeof(buf), fmt, e.what());
    } catch (...) {
      snprintf(buf, sizeof(buf), fmt, "Unknown Exception");
    }
    res.set_content(buf, "text/html");
    res.status = 500;
  });

  printf("starting server on port %d\n", PORT);
  svr.listen("localhost", PORT);

  return 0;
}

turbo · 2023-03-12T12:53:56Z

How hard would it be to add a mode where it could stay running and be ready to accept more text piped to standard input?

Would be awesome, because this would allow pre-prompting and spawning interactive sessions like tinygrads's LLaMa personalities (demo video).

G2G2G2G · 2023-03-12T13:31:53Z

well he doesn't want any deps so that's why the interfacing is the hard part, otherwise it's pretty ez part.
I think doing the

Modifying main.cpp to continuously get new prompts with a single load of the model should be easy. Just have to manage the n_past and embd vars properly.

will be most of the work, at least from there it's easy enough to ghetto hack in anyones own test bed to input stuff into stdin for the program

DelusionalLogic · 2023-03-12T16:01:16Z

Assuming nobody cares about windows it would be possible to allocate the giant buffer required for the model with shm_open to retain the loaded model in memory between executions. That way you could still faff about with the executable/parameters as long as they don't impact how you load the llama model.

blackhole89 · 2023-03-12T19:28:49Z

I'm working on adding a sort of interactive mode over in a fork, where (if a flag, -i, is passed) you can either interrupt generation with SIGINT or specify a "reverse prompt" token string which results in control being passed back to the user.

It currently looks like this:

Would you be interested in a PR once I'm done with some further cleanup and testing? I'm still planning to put the colouring behind another flag, and find a solution for some papercuts (among others, it seems that spaces tend to be part of a single token with words that follow them, so you have to use User: rather than User: as a reverse prompt and force the user to enter the space following it).

ggerganov · 2023-03-12T19:47:27Z

@blackhole89
Yes! This is exactly the way to go

Edit: if you cannot make it run on Windows, you can #ifdef it and say it's supported only on Linux / Mac OS

Edit2: regarding spaces - the tokenizer is currently broken. Probably this is causing a lot of trouble such cases and also when Unicode characters are present in the input.

blackhole89 · 2023-03-12T20:32:17Z

I unfortunately don't have access to a Windows machine to test it on right now. Is there a problem with the availability of signal.h/sigaction there? Either way, at least the "reverse prompt" triggered interaction should work even without the signal handler.

Unfortunate about the tokenizer. I guess I will leave the problem untouched for now, hoping for an eventual resolution :)

I think I got it to work to approximately the standard I was hoping for (few lingering issues: probably better to communicate limitations such as that the reverse prompt must be token-exact; subsequent user inputs are not counted towards the context size limit), so I'll go ahead and make a PR.

ggerganov · 2023-03-12T20:36:14Z

@blackhole89

It would be useful to include short section in the README with instruction how to use interactive mode and a screenshot

blackhole89 · 2023-03-12T21:05:11Z

I added a section (and made the PR). Not sure in hindsight if it's the best possible example, since it doesn't show how the usage of \ to submit multiple lines...

blackhole89 · 2023-03-12T21:18:09Z

I realised that the slight imprecision in calling London the largest city in (all of) Europe actually biased the entire generation in a less factual direction, so here's another screenshot that doesn't have that issue and also shows off '\' 🙄. Might be good to replace it...

I also found that having the high repeat_penalty tended to make it abort the conversation early (rather than repeat User:/Bob:), so the corresponding invocation also had --repeat_penalty 1.0 added.

edit:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "User:" -p "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"

No idea why the color resets after "Moscow", needs some investigation...

ggerganov · 2023-03-12T21:33:36Z

@blackhole89
Thanks to your demonstration, prompting and parameters I finally start to feel confident that the llama.cpp inference works correctly. So far the generation examples were a bit underwhelming and I was having doubts about the accuracy of the computation, since I have nothing to compare it to.
Thanks for this contribution!

Will update the README with the new example

blackhole89 · 2023-03-12T21:43:37Z

Great, thanks! I also figured out what was going wrong with the color getting cancelled early; I can make a quick PR for that shortly, though it might be necessary to sit down and think a little bit about code quality (as I'm picking locations in the code to emit the colorcodes that happen to work but aren't particularly related to the lifecycle of tokens as they migrate from input -> embd_inp -> embd -> output).

I found that few-shotting it carefully makes a big difference. I would actually recommend adding more than one example interaction into the prompt, but have been lazy about it because my machine isn't doing too well with 13B (probably at around 500ms per token, and spending a considerable amount of time even to do the processing for the user-provided prompt/input tokens - can this be optimised somehow?).

ggerganov · 2023-03-12T21:49:16Z

Yes - this is a very cool task actually.
We can add a --cache_prompt flag that if added will dump the computed KV caches of the prompt processing to the disk in a file with name produced by the hash of the prompt. Next time you run, it will first check if we have stored KV cache for this hash and load it straight from disk instead of computing it.

blackhole89 · 2023-03-12T22:18:29Z

Ah, nice. If all the work it does there can be cached, this might make the LLaMa quite usable even without finetuning.

I'll do some more experiments to see how well it can do in chatbot mode with other personae than "assistant". I haven't quite gone down the "Shakespearean tsundere goblin with dyslexia" route of specificity yet, but the basics do look pretty decent:

thomasantony · 2023-03-13T00:05:51Z

@simonw I am working on a fork with some python bindings here: https://github.com/thomasantony/llama.cpp/tree/feature/pybind

Dicklesworthstone · 2023-03-13T02:28:53Z

This has been working well for me, using a response from ChatGPT that I shortened by a couple sentences to save more of the context space (oddly, while this works well on my M1 iMac, it gives very wrong looking results on Ubuntu 22.10 using Amd64):

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 512 --repeat_penalty 1.0 --color -i -r "User:"
-p
"Transcript of a dialog, where the User interacts with an brilliant assistant name HAL;
HAL is extremely smart, knowledgeable, capable, and helpful;
HAL always give complete, accurate, and very detailed responses to questions, and never stops a response in mid-sentence or mid-thought.
HAL always answers the question given by the user and then stops to await the next question. Below is an example of a dialog between a user and HAL:

HAL: Hello. I am HAL. How may I help you today?

User: What’s the history of bullfighting in Spain?

HAL: Bullfighting, also known as "tauromachia," has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain.

User:"

jart · 2023-03-16T12:11:09Z

Assuming nobody cares about windows it would be possible to allocate the giant buffer required for the model with shm_open to retain the loaded model in memory between executions.

The way to solve this problem is to use mmap(2) which on Windows is called CreateFileMapping() + MapViewOfFileEx() which are available since Vista. I believe using mmap will reduce startup latency to effectively zero. In order to do that, we need to refactor the codebase so that data structures can be directly mappable, without needing to be loaded and constructed. The issue where everyone is encouraged to help us design that is #91.

jart · 2023-03-16T12:12:22Z

Removing the help wanted tag because it's here. That doesn't mean you can't still participate in helping me do it!

tarruda · 2023-03-20T12:52:27Z

How hard would it be to add a mode where it could stay running and be ready to accept more text piped to standard input?

#278 implements a way to do this, I've also added example shell scripts that you can use to spawn llama.cpp in server mode and open multiple clients to it (limitation is how many CPU threads you have to process generation in parallel).

anzz1 · 2023-03-20T23:24:32Z

Assuming nobody cares about windows it would be possible to allocate the giant buffer required for the model with shm_open to retain the loaded model in memory between executions. That way you could still faff about with the executable/parameters as long as they don't impact how you load the llama model.

I wouldn't go as far to assume:
Statista: Global market share held by operating systems for desktop PCs, January 2023
Windows 74.14% , macOS 15.33% , Linux 2.91% , Chrome OS 2.35% , Others/unknown 5.28%

Maybe the project should split into non-portable and portable forks one since there's a lot of PR's already which decrease portability. There's already the tcp_server and mmap branches, neither of which are exactly portable.

I made the suggestion before that instead of stabbing the main code with non-portable stuff, that implenting a light C API to handle saving/loading the model and its' state in/from/to memory woud allow using whatever platform- or library-dependent solution on top of it , instead of integrating deeply to the main program.

Why even the need of spawning multiple processes instead of multiple threads? Threads are already built-in, unlike sockets or mmap. The only truly portable thing is standard I/O, which can be redirected and also easily communicated with using simple file streams which are supported by everything. Instead of changing the main implementation much at all, you could just build any modules outside the main implementation as modules and communicate using these simple file streams. The main llama input would not need any sort of 'protocol' but just listen to '\n' or EOF like it currently does, and whatever modules could just follow that paradigm while communicating through the file streams. Am I missing something here?

Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in #278 (comment)

@tarruda's counter point was that the current code is not thread-safe, but why exactly is spawning more processes in a non-portable manner ( fork() ) exactly better than just implementing thread-safety? The performance penalty that comes with it could be simply #ifdef THREAD_SAFE , instead of ending up with a main program full of #ifdef's with whatever platform-specific implementations for the other options to do it.

If you want only the model to be preloaded in memory and not the context and have the ability of serving multiple input/output streams concurrently, you could simply spawn new threads with their separate states and only have them share the pointer to the preloaded model. If the preloaded model is read-only , you don't need to even implement any thread-safety at all. Thread-safety would only be needed for sharing the context. I don't understand why fork() is needed to accomplish this?

Something like this? In the case the context doesn't need to be shared, only the preloaded model. No need for fork() or anything else than just threads (which work in every platform). No need for thread-safety, since all the concurrent accesses are just read operations. The context isn't shared between threads, only the model is.

main()
1. preload_model(&model)
2. loop:
     wait for input or signal
     case signal_from_thread_to_exit:
       break
     case input:
       create thread serve_instance(&model, stdin, stdout)
3. exit process

preload_model(struct model *mdl)
  1. load model to *mdl
  2. return

create_context(struct model *mdl, struct context *ctx)
  1. create new context to ctx
  2. return

serve_instance(struct model *mdl, stdin, stdout)
  1. create_context(mdl, &ctx)
  2. loop:
       if stdin==ctrl+c , free_context(ctx), signal_main(exit), end thread
       evaluate(mdl, ctx, stdin, stdout)
       until stdin==eof
  3. free_context(ctx)
  4. end thread

Now if the context was to be shared, you would need to either copy the context or implement a thread-safe access to it, both of which comes with their caveats, thread-safety sacrifices speed while copying increases memory usage and it would only be shared up to the point it was copied. But both the fork() implementation and mmap() share exactly the same caveats anyway. Why is the overhead of spawning a new process of fork() or the increased complexity of mmap() needed here at all? Please enlighten me if there is something I'm completely missing here, as it seems we're trying to use a chainsaw to cut a matchstick here.

With this sort of generic implementation, any non-portable implementation could just use the preload_model, create_context, free_context, serve_instance, evaluate functions however they saw fit, and it could be done outside the main implementation, keeping the main program lean, clean and fast. The stdin and stdout could be piped to wherever the new implementation requires, be it a HTTP server, some node.js implementation, a file on the file system, or even a hardware device like a serial port.

Since all the operations are atomic in regards to each other, you could even load multiple models to memory if you wanted. Create threads or don't create threads, whatever. Load multiple models to memory at the same time and run a single evaluation for each one of them and compare the results. All without introducing any increased complexity or lessened portability to the code.

With the generic implementation using multiple processes could also be done if that is needed. Just have a module which shares the context and model structs using whatever means you want to, save to disk, mmap to memory, whatever. The whole point is that you could do anything you want with it, without any of it having to go inside the main program but could live as its' own standalone module instead.

tarruda · 2023-03-21T09:06:25Z

I wouldn't go as far to assume:
Statista: Global market share held by operating systems for desktop PCs, January 2023
Windows 74.14% , macOS 15.33% , Linux 2.91% , Chrome OS 2.35% , Others/unknown 5.28%

Since a few years ago Microsoft has embraced Linux, and installing Ubuntu on Windows has never been easier with WSL2. You can literally go to the app store and click a button to get a Linux shell that is fully integrated into Windows.

That means you can easily run the Linux version inside windows (the tcp_server branch) and consume from a socket in a native win32 app.

but why exactly is spawning more processes in a non-portable manner ( fork() ) exactly better than just implementing thread-safety?

fork() is part of POSIX and very portable, it is likely that windows is the only mainstream OS that doesn't support it.

If the preloaded model is read-only , you don't need to even implement any thread-safety at all. Thread-safety would only be needed for sharing the context.

I can't give an opinion there since I'm a machine learning newbie and still don't understand how inference is done in language models. But if ggml supports this usage, then this could work.

In any case I can't be of much help to write a cross-platform solution, the only times I wrote C threading/network code that works across Unix/Win32 was using libuv, which is not an option since dependencies are not wanted in this project. Maybe C++ supports cross-platform threads/networking as part of its standard, but that is also outside of my comfort zone.

Meanwhile I will just keep rebasing the tcp_server branch, if someone eventually comes up with a better solution I will close the PR.

anzz1 · 2023-03-21T21:06:29Z

The problem with WSL2 is not about simplicity, it's about performance. While great improvements in speed have been made recently, a virtualization layer simply can never be as efficient as native code is. "Just emulate Linux" is not an answer when the goal is to have native cross-platform compatibility.

You can see from the ggml code that it uses the atomic_* functions for thread-safety between threads using a single context. The work has already been done by @ggerganov . It would be a shame to undo this cross-platform compatibility, don't you think? Just by using separate contexts for the additional "main" threads which can serve I/O concurrently should work.

The beauty of using C structs in favor of STL also means that anything that is read-only is also inherently thread-safe since all the read operations are atomic at the bare metal level.

Currently though the llama_model_load loads the whole model and it's context to a single mutable llama_model struct.
This doesn't pose a problem though, at least not one which fork() solves.

fork() could be replaced with this:

load model using llama_load_model, get a llama_model struct in return.
now instead of directly using this llama_model for evaluation, do:
1. allocate new memory page with sizeof(llama_model)
2. mark the new page as copy-on-write
3. memcpy the llama_model to a new llama_model
use the new model, free it when done, rinse and repeat 2-3

This is essentially what fork() does anyway, but instead of duplicating the whole process (and incurring a performance penalty), only the model would be duplicated instead.

I looked through your tcp_server.cpp and it's simple and succinct, I could easily port it for winsock for you and you wouldn't need to worry about it. No external libraries required.

So the problem isn't your network/socket code at all, it really doesn't even need much changes at all to work with winsock.

The problems are these:
a) use of fork()
b) the PosixStream class , which is a STL construct not in line with the code guidelines. No reason to have the overhead of C++ classes when the same can be implemented in C-style like the rest of the project is.

edit:
Stop the press! As of right now there is this: #370 moving towards an expandable C-style API. Consider the information in this post regarding how the models are loaded and evaluated outdated.

tarruda · 2023-03-22T14:04:06Z

This is essentially what fork() does anyway, but instead of duplicating the whole process (and incurring a performance penalty), only the model would be duplicated instead.

Fork is very lightweight and almost as efficient as spawning new threads in modern Unixes, I suggest man clone if you want to understand how it is done on Linux. Both fork and pthreads use the same clone system call under the hoods, only difference are the parameters. Even if it was not efficient, it is insignificant when compared to the overhead of loading new models.

the PosixStream class , which is a STL construct not in line with the code guidelines. No reason to have the overhead of C++ classes when the same can be implemented in C-style like the rest of the project is.

I agree with this, check this discussion from a couple of days ago: 5c19c70#r105107843

I added the PosixStream class because there's no standard API for wrapping a file descriptor into a C++ stream. Tried to use some non-standard solutions but resulted in failed Mac compilation.

anzz1 · 2023-03-26T15:25:35Z

@tarruda

I checked out the discussion and I agree that the replacement of STL functions with their standard C runtime counterparts is a good idea, especially the part of using standard fopen() file streams for loading the models, as seen in this PR #216 was able to achieve upwards to 10x loading speed by using fopen instead of std::ifstream.

Since nowadays you can achieve +5GB/s sequential read speeds with NVMe SSD's, the read functions are now the bottleneck and not the disk speeds unlike just a few short years ago. The less abstraction overhead there is between the raw kernel I/O syscalls and the consumer program, the better. There is the effect though that by making the read functions fast enough the less benefit overall will be achieved with preloading the model to memory. Like said, modern SSD's are already very fast and they are only getting faster.

About fork() , I'm not an expert on this but from what I've gathered I understand that it's pretty well implemented and lightweight nowadays, I mean the people contributing to the kernel development on Unixes are very smart.

The problem with it was the non-portability of it, as the copy-on-write functionality it uses under the hood to clone the process could be used for just cloning the state and not the whole process, making it portable.

However the points I made in earlier are pretty much outdated now since the C API was introduced. I'm not calling any shots here, but I wouldn't oppose having less-portable solutions under the examples, like having a "examples/linux", "examples/mac", "examples/windows" , "examples/android" style folders since there can be valid cases where portability isn't an option. Or have non-portable examples in their own repos, importing the main llama.cpp repo as a git submodule. I think that would be the cleanest solution overall, especially if the solutions need to span over multiple files and thus would clutter up the main repo. The whole point I tried to make was about keeping the main functionality sleek and portable, which it is, and there is now a simple way of interfacing with llama through the C API which anything can be built upon.

You're obviously free to do whatever you wish, but I have a suggestion:

Since the ability to do exactly what is needed here, sharing the model and its' state, is in the current short-term roadmap , what do you think of the idea of putting the tcp_server functionality on ice for now until that feature is finished.

The thing is that I'm also very interested in the tcp_server functionality. I think it has great promise for developing whatever integrations using whatever languages of anyone's choosing because binding a C/C++ module might not be easy in every programming language / development environment, but the ability of connecting to a TCP socket is implemented in pretty much everything out there.

Using the shared state after it's completed and threads instead of fork() , it could be made easily portable. It could be done in a single file examples/tcp_server.cpp. The great benefit of the shared state would also be using way less memory, since the states are a lot smaller than the model itself. Also it would remove the copy-on-write overhead, which is pretty small but overhead is always overhead.

I would also be interested in working with you on that. You said that you work pretty much exclusively on Linux and aren't too familiar with winsock. I am the opposite side of that coin, working mostly with Windows and am very familiar with winsock. So joining our forces, I'm certain we could make it work very well. You would take care of the linux socket side, while I could implement the same in winsock. Put together, resulting in a portable solution. Food for thought?

jart · 2023-03-26T22:39:43Z

While great improvements in speed have been made recently, a virtualization layer simply can never be as efficient as native code is.

This indicates a real misunderstanding of how virtualization works. Modern Windows is virtualized. WSL2 and Windows itself are both virtualized. The reason why people want WIN32 is because they want to use the MSVC debugger.

tarruda · 2023-03-27T01:29:32Z

@anzz1 I'm also interested in the tcp_server functionality, which is why I'm rebasing and using it on my private fork (Not doing any updates though, so you can consider it "frozen" for now). Not sure if you followed the discussion in #278, but I'm no longer trying to get it merged since there's no interest in having the stdio abstraction which is required to avoid duplicating IO loop logic.

You're free to take the code from my branch and adapt it to use threads and the upcoming model sharing feature. There's nothing special about that code, you'd simply replace fork() with a syscall to spawn thread and handle the connection there.

If you are serious about implementing a cross-platform tcp server functionality, I highly recommend using an existing win32/unix abstraction like libuv. It is a waste of effort redoing functionality that already exists in lightweight C libraries with very permissive licenses. Just add a libuv dependency and use their API, and you will have an efficient network server that works seamless across win32 and unix.

jart · 2023-03-27T03:24:49Z

Or just use POSIX and Cygwin / Mingw / Cosmo / etc. all have you covered.

anzz1 · 2023-03-28T16:42:07Z

Not sure if you followed the discussion in #278, but I'm no longer trying to get it merged since there's no interest in having the stdio abstraction which is required to avoid duplicating IO loop logic.

Yeah I followed it when it was current but that discussion is pretty much outdated now. Funny how it feels like ancient history even though its just been a few days. Nature of everything AI related, haha.

If you are serious about implementing a cross-platform tcp server functionality, I highly recommend using an existing win32/unix abstraction like libuv. It is a waste of effort redoing functionality that already exists in lightweight C libraries with very permissive licenses. Just add a libuv dependency and use their API, and you will have an efficient network server that works seamless across win32 and unix.

I'll take a look at libuv, though for now I don't think it's necessary to use any library for a simple tcp server. There isn't really anything too much to implement the way I'm currently imagining it. Maybe I'll run into a issue which makes me change my mind.

Generally I dislike using libraries, but certainly not all of them all bad. I am especially a fan of single-header C/C++ libraries like the awesome stb libraries by nothings.

In any case I'll start working on it once the model state sharing change is implemented as the environment to work with becomes more clear.

This indicates a real misunderstanding of how virtualization works. Modern Windows is virtualized. WSL2 and Windows itself are both virtualized. The reason why people want WIN32 is because they want to use the MSVC debugger.

I'm not going to get into a argument about semantics what is and isn't virtualization as I specifically said virtualization and not emulation. If you are talking about how the Windows kernel is a hypervisor translating syscalls to low-level I/O and the OS itself runs on top of it, yeah I guess you can call Windows virtualized. And sure, it would be theoretically possible to optimize the code in such a way that a fopen() operation under WSL2 and ZwCreateFile under the native windows OS would amount to exactly the same amount of cpu instructions and use the same codepath resulting in the same amount of cpu cycles used and have the same results in thread scheduling, branch prediction and instruction caching. Obviously in real life this isn't the case though.

I am not saying WSL2 is bad performance wise, nowadays it's actually quite amazing that it comes within spitting distance to running Linux natively. For example, in this benchmark you can see WSL2 achieve 86% performance in Linpack with 5800X.

It still has some way to go though, and it isn't fully POSIX compliant yet. But the way things are going, it's probably only going to improve. Unlike everything else Windows, it's one thing which seems to consistently move in the right direction while the native OS itself is going down the drain, lul.

In any case, I don't want 86% performance, I want 100%. You cannot tell me to settle for lower performance, simple as that. Might be fine for many, and maybe I value performance and optimization too much in someone else's opinion, but you can do you and I can do me , right?

Cygwin / MSYS2 aren't even considered in this conversation. Their performance is god awful. They can be useful for many applications, but for anything where performance is a requirement they are completely out of the question.

noonien · 2023-05-24T22:24:34Z

The discussion here seems to have veered waaaay off topic.

It would also be great to be able to reset the prompt without having to reload the model.

gengel · 2023-05-30T18:32:39Z

Second the above comment. But really I'd like to option to perform lots of potential new generations with the model in-memory. E.g., tweak temperature, parameters, etc. without re-loading the model each time.

Performance increase of 6-7% on K-type quants (40B model only)

Register `freqs_cis` as non-persistent buffer

iTestAndroid · 2023-12-30T03:27:18Z

Any progress on this? Ability to reset the prompt / context without reloading the whole process/model?

penghe2021 · 2024-01-20T01:33:13Z

I made some progress using the ./main example to reset the prompt without reloading the model.

The core idea is to add an if statement after the user input part

llama.cpp/examples/main/main.cpp

Line 804 in 381ee19

std::string line;

like if (buffer.compare(SYSTEM_RESET) == 0) {}

Inside the if statement, remember to

use the buffer again to capture new user input (new prompt), and update the params
reset all related variables
tokenize prmpt into emb_inp
push emb_inp into emb

You can also update other parameters in the first step. It should work, but I haven't tested it.

github-actions · 2024-04-09T01:10:25Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

asdf4w3t5 · 2024-04-09T03:16:40Z

an issue isn't magically solved just because people stop posting on it

phymbert · 2024-04-09T03:21:18Z

And do you have a better solution to offer ?

issues: ci - close inactive issue with workflow #6053

duncster94 · 2024-06-02T01:07:14Z

Hi, are there any plans to implement prompt/context resetting? I have a text summarization task involving 100,000's of distinct, unrelated prompts. Currently, the model has to be reloaded for each prompt, adding significant overhead. It would be nice to have a CLI argument such as --file-dir that specifies a directory containing prompt files which would then be looped over.

This was referenced Mar 12, 2023

Inference on the model shawwn/llama-dl#1

Open

[Request] Support for llama.cpp oobabooga/text-generation-webui#250

Closed

ggerganov added enhancement New feature or request help wanted Extra attention is needed labels Mar 12, 2023

gjmulder mentioned this issue Mar 12, 2023

Fine Tuning #55

Closed

blackhole89 mentioned this issue Mar 12, 2023

Add interactive mode #61

Merged

ggerganov mentioned this issue Mar 12, 2023

Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64

Closed

leszekhanusz mentioned this issue Mar 13, 2023

Longer and infinite output #71

Closed

gjmulder mentioned this issue Mar 13, 2023

Improving quality with 8bit? #53

Closed

MLTQ mentioned this issue Mar 14, 2023

Create json api service #88

Closed

aratic mentioned this issue Mar 15, 2023

Reset context instead of quitting in interactive mode #145

Closed

jart removed the help wanted Extra attention is needed label Mar 16, 2023

This was referenced Mar 21, 2023

Proof of concept TCP server mode #278

Closed

Should use mmap for model loading #91

Closed

MMAP for Windows (not working atm) #341

Closed

flowgrad pushed a commit to flowgrad/llama.cpp that referenced this issue Jun 27, 2023

Adapted the pull 1862 from ikawrakow for ggllm.cpp (ggerganov#23)

d896ebf

Performance increase of 6-7% on K-type quants (40B model only)

rooprob pushed a commit to rooprob/llama.cpp that referenced this issue Aug 2, 2023

Merge pull request ggerganov#23 from awgu/pt2

bd9e837

Register `freqs_cis` as non-persistent buffer

github-actions bot added the stale label Mar 25, 2024

github-actions bot closed this as completed Apr 9, 2024

phymbert mentioned this issue Apr 9, 2024

issues: ci - close inactive issue with workflow #6053

Merged

Ability for ./main to keep the model in memory and pass it more text #23

Ability for ./main to keep the model in memory and pass it more text #23

Comments

simonw commented Mar 11, 2023

j-f1 commented Mar 11, 2023 • edited Loading

G2G2G2G commented Mar 12, 2023

ggerganov commented Mar 12, 2023

bakkot commented Mar 12, 2023 • edited Loading

turbo commented Mar 12, 2023

G2G2G2G commented Mar 12, 2023

DelusionalLogic commented Mar 12, 2023

blackhole89 commented Mar 12, 2023

ggerganov commented Mar 12, 2023 • edited Loading

blackhole89 commented Mar 12, 2023

ggerganov commented Mar 12, 2023

blackhole89 commented Mar 12, 2023

blackhole89 commented Mar 12, 2023 • edited Loading

ggerganov commented Mar 12, 2023

blackhole89 commented Mar 12, 2023

ggerganov commented Mar 12, 2023

blackhole89 commented Mar 12, 2023

thomasantony commented Mar 13, 2023

Dicklesworthstone commented Mar 13, 2023 • edited Loading

jart commented Mar 16, 2023

jart commented Mar 16, 2023

tarruda commented Mar 20, 2023

anzz1 commented Mar 20, 2023 • edited Loading

tarruda commented Mar 21, 2023

anzz1 commented Mar 21, 2023 • edited Loading

tarruda commented Mar 22, 2023 • edited Loading

anzz1 commented Mar 26, 2023 • edited Loading

jart commented Mar 26, 2023

tarruda commented Mar 27, 2023

jart commented Mar 27, 2023

anzz1 commented Mar 28, 2023 • edited Loading

noonien commented May 24, 2023

gengel commented May 30, 2023

iTestAndroid commented Dec 30, 2023

penghe2021 commented Jan 20, 2024 • edited Loading

github-actions bot commented Apr 9, 2024

asdf4w3t5 commented Apr 9, 2024

phymbert commented Apr 9, 2024

duncster94 commented Jun 2, 2024

Ability for `./main` to keep the model in memory and pass it more text #23

Ability for `./main` to keep the model in memory and pass it more text #23

j-f1 commented Mar 11, 2023 •

edited

Loading

bakkot commented Mar 12, 2023 •

edited

Loading

ggerganov commented Mar 12, 2023 •

edited

Loading

blackhole89 commented Mar 12, 2023 •

edited

Loading

Dicklesworthstone commented Mar 13, 2023 •

edited

Loading

anzz1 commented Mar 20, 2023 •

edited

Loading

anzz1 commented Mar 21, 2023 •

edited

Loading

tarruda commented Mar 22, 2023 •

edited

Loading

anzz1 commented Mar 26, 2023 •

edited

Loading

anzz1 commented Mar 28, 2023 •

edited

Loading

penghe2021 commented Jan 20, 2024 •

edited

Loading