Reduce model loading time #43

Merged: 3 commits into ggml-org:master on Mar 13, 2023

Conversation

maekawatoshiki (Contributor) commented on Mar 12, 2023

Hello!

I noticed that the model loader is not using buffered IO, so I added a small piece of code to buffer the reads.
I measured the loading time only for LLaMA 7B on my M1 Pro MacBook, but it reduced the time from 1316 ms to 749 ms.
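For context, a minimal sketch of what such buffering can look like (the 1 MiB size, the hypothetical function name, and the exact call sequence are my assumptions for illustration, not necessarily the patch as merged): the stream is handed a large user-supplied buffer via pubsetbuf before the file is opened, so reads hit the OS in big chunks instead of many small ones.

    #include <cstdlib>
    #include <fstream>
    #include <string>

    // Sketch only: buffer size and call order are assumptions.
    bool load_with_buffer(const std::string & fname) {
        char * f_buf = (char *) malloc(1024 * 1024);   // large user-supplied read buffer
        std::ifstream fin;
        // pubsetbuf must run before open() to take effect on common
        // implementations such as libstdc++.
        fin.rdbuf()->pubsetbuf(f_buf, 1024 * 1024);
        fin.open(fname, std::ios::binary);
        if (!fin) { free(f_buf); return false; }
        // ... read the model file through fin ...
        fin.close();
        free(f_buf);   // matches the free() in the diff below
        return true;
    }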

main.cpp (outdated)
@@ -496,6 +501,8 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
         fin.close();
     }
 
+    free(f_buf);
maekawatoshiki (Contributor, Author) commented on the diff:

f_buf will not be freed if this function returns early, but I think it does not matter since it's a small amount of memory :)
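As an aside, one way to avoid the early-return leak entirely would be to let RAII own the buffer, for example a std::vector declared before the stream (a sketch under my own assumptions, with a hypothetical function name; not necessarily what the PR ended up doing):

    #include <fstream>
    #include <string>
    #include <vector>

    // Sketch: the vector is declared before the stream so it outlives it,
    // and its destructor frees the buffer on every return path.
    bool load_with_raii_buffer(const std::string & fname) {
        std::vector<char> f_buf(1024 * 1024);
        std::ifstream fin;
        fin.rdbuf()->pubsetbuf(f_buf.data(), f_buf.size());
        fin.open(fname, std::ios::binary);
        if (!fin) return false;   // no leak: f_buf cleans itself up
        // ... read the model file through fin ...
        return true;
    }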

maekawatoshiki (Contributor, Author) commented on Mar 13, 2023

Thank you for your review. Fixed as you mentioned.

ggerganov merged commit 63fd76f into ggml-org:master on Mar 13, 2023
rooprob pushed a commit to rooprob/llama.cpp referencing this pull request on Aug 2, 2023: "Speed up rmsnorm by using sqrtf/expf"