Random discussions #2
@cutoken Have you tried removing the
Oops. That seems to be the reason.
@cutoken Be careful! With a higher learning rate, you may get trapped in a sub-optimal solution :)
Will keep that in mind! Btw, here is what it produces at step 1000 while training on a micro sample of nursery rhymes :D
Okay, one more question:
@cutoken You mean using 8-bit floating-point instead of f32? Wow, I love that "Quack, quack." LOL 😆 Yes you can, but it's a bit tricky. You have to make sure the vocabs and vocab sizes of the previous dataset and the new dataset are equal; otherwise, you have to somehow make sure the char->int mappings are the same in both datasets. (And also change the shape of
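A rough sketch of that compatibility check (illustrative code only, not femtoGPT's API): build the char->int map for each dataset the way a char-level tokenizer would, and compare them before trying to reuse trained weights.

```rust
// Illustrative sketch, not femtoGPT code: derive a char -> int mapping for each
// dataset and compare them before reusing trained weights across datasets.
use std::collections::BTreeMap;

fn char_to_int_map(text: &str) -> BTreeMap<char, usize> {
    let mut chars: Vec<char> = text.chars().collect();
    chars.sort();
    chars.dedup();
    chars.into_iter().enumerate().map(|(i, c)| (c, i)).collect()
}

fn main() {
    let old_map = char_to_int_map("the quick brown fox");
    let new_map = char_to_int_map("quick brown fox the");
    if old_map == new_map {
        // Same vocab size and same mapping: the embedding rows still line up.
        println!("vocab compatible: {} chars", old_map.len());
    } else {
        println!("char -> int mappings differ; weights can't be reused as-is");
    }
}
```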
Yes, 8-bit float instead of f32. Can you explain a little about the embedding.dat, optimizer.dat, pos_embedding.dat and tensor_num.dat files? Where do the weights for the layers get stored, and is the vocab the same as the character set or different? Sorry, too many naive questions. If there is some paper or article that makes sense for getting the basic understanding, you can drop that instead :)
Right now, the implementation is not generic, but you can just find/replace f32 with other types 🙂 Watch this: https://www.youtube.com/watch?v=kCc8FmEb1nY
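To make that find/replace less painful, one option (a sketch, not how femtoGPT is currently structured) is to route everything through a single type alias. Note that Rust has no built-in 8-bit float, so an fp8 type would have to come from a crate or a custom newtype; f64 is shown here just to illustrate the mechanism, and the cargo feature name is hypothetical.

```rust
// Sketch only: a single alias makes the precision swap a one-line change,
// assuming all tensor code refers to `Float` instead of `f32` directly.
#[cfg(feature = "double-precision")] // hypothetical cargo feature
type Float = f64;
#[cfg(not(feature = "double-precision"))]
type Float = f32;

fn dot(a: &[Float], b: &[Float]) -> Float {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a: Vec<Float> = vec![1.0, 2.0, 3.0];
    let b: Vec<Float> = vec![4.0, 5.0, 6.0];
    println!("{}", dot(&a, &b)); // prints 32
}
```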
Got it. If I can get the weights alone loadable so that we can move the model between different datasets, I'll raise a PR.
@keyvank
@cutoken I just added an ASCII tokenizer! :)
Awesome @keyvank!
I'm trying to use the ASCII tokenizer but maybe a bit of the implementation is missing. Shouldn't the new Tokenizer trait be used instead of SimpleTokenizer everywhere?
@cutoken It's not used by default. Btw, have you been able to get good results with a larger model?
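For anyone else wondering what the trait-based swap might look like, here is a rough sketch; the trait name comes from the discussion above, but the method signatures and the `AsciiTokenizer` struct are assumptions for illustration, not femtoGPT's actual definitions.

```rust
// Hypothetical sketch of swapping tokenizers behind a trait object; the real
// trait and struct definitions in femtoGPT may differ.
trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<usize>;
    fn untokenize(&self, tokens: &[usize]) -> String;
    fn vocab_size(&self) -> usize;
}

struct AsciiTokenizer;

impl Tokenizer for AsciiTokenizer {
    fn tokenize(&self, text: &str) -> Vec<usize> {
        text.bytes().map(|b| b as usize).collect()
    }
    fn untokenize(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&t| t as u8 as char).collect()
    }
    fn vocab_size(&self) -> usize {
        128 // plain ASCII
    }
}

fn main() {
    // Choosing the tokenizer behind the trait lets the training loop stay unchanged.
    let tokenizer: Box<dyn Tokenizer> = Box::new(AsciiTokenizer);
    let ids = tokenizer.tokenize("Quack, quack.");
    println!("{} tokens, vocab size {}", ids.len(), tokenizer.vocab_size());
}
```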
Understood. I haven't tried larger models yet; I'm actually playing with even smaller ones than the default settings. The previous output I provided was with a 4-layer/4-head network. Today I'm trying 2-layer/2-head and 1-layer/1-head networks. Thanks to the ASCII tokenizing I'm able to switch datasets easily. However, what I would love to see is being able to save and load the weights, biases and other network representations at a more granular level, so that it is possible to visualize the changing weights or even transfer weights more easily between networks. Were you able to get your larger network going? I'm interested in seeing whether the output has improved and what it's producing :) PS: This is what my 1L/1H network is producing after 2.5 hours of training (probably around 8k steps). I don't think the quality will improve as the loss is not reducing anymore, but I'm curious to see what longer training would do. Will run it for a few more hours.
@cutoken Unfortunately I'm away from my powerful machine so I couldn't experiment much. But yeah, we should probably try bigger models before getting anything meaningful. P.s. I just pushed an optimization that makes everything around 3-4x faster.
That is great news. I'm going to give it a whirl! Edit: Yes, definitely faster. Seeing a 3x improvement with the same settings!
Btw, I notice that even with rayon, while all my hyper-threads are engaged, they are not hitting 100% usage. I tried increasing the thread count but it made things slower. Not sure if it's just an artifact of hyper-threading/monitoring or some bug in rayon.
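If it helps with the thread experiments: rayon's global pool size can be pinned explicitly (or via the RAYON_NUM_THREADS environment variable), which makes it easier to tell whether the sub-100% usage comes from the thread count or from the workload itself. A minimal example:

```rust
use rayon::prelude::*;

fn main() {
    // Pin the global rayon pool; values above the physical core count usually
    // just add scheduling overhead, which matches the slowdown described above.
    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .expect("global pool was already initialized");

    let sum: u64 = (0..10_000_000u64).into_par_iter().sum();
    println!("{}", sum);
}
```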
This is after 9 hours of training. I think the words are starting to make sense. The model is 6L/6H. Loss is continuing to reduce nicely. As I mentioned earlier, I have so far been using a much higher learning rate (0.001). When the loss stops reducing I'll go with a higher model size.
@cutoken
The loss was hovering between 1.9 and 2.1 for the above result. I stopped it after a couple of hours to try nanoGPT for comparison. Here are the observations:
Going forward I'll use the above params for testing. We can compare it with nanoGPT on quality at the same number of iterations so that we know the implementation is correct. From my points above I think there is something different in the optimizer part, which we can focus on, as I think femto is converging more slowly than the nano one :)
@cutoken Yeah, something is definitely wrong with the optimizer 😨 Could you plz send me your training data when you get a chance?
Sent it. Hope you can figure it out. I'm going to run this same-settings model continuously for a couple of days to see if the loss will eventually reduce and we get output equivalent to nanoGPT.
@cutoken You are an angel. Thank you! |
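For reference while debugging the optimizer: nanoGPT trains with AdamW, so one way to narrow things down is to compare femto's update step against the textbook AdamW formula. Here is a sketch of that formula for a single parameter (the standard algorithm, not femtoGPT's actual optimizer code).

```rust
// Minimal sketch of one AdamW update step (standard formula, not femtoGPT's
// implementation); useful as a line-by-line reference when comparing against nanoGPT.
struct AdamWState {
    m: f32, // first-moment estimate
    v: f32, // second-moment estimate
    t: u32, // step counter
}

fn adamw_step(
    param: &mut f32,
    grad: f32,
    state: &mut AdamWState,
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    weight_decay: f32,
) {
    state.t += 1;
    state.m = beta1 * state.m + (1.0 - beta1) * grad;
    state.v = beta2 * state.v + (1.0 - beta2) * grad * grad;
    let m_hat = state.m / (1.0 - beta1.powi(state.t as i32));
    let v_hat = state.v / (1.0 - beta2.powi(state.t as i32));
    // Decoupled weight decay is applied directly to the parameter.
    *param -= lr * (m_hat / (v_hat.sqrt() + eps) + weight_decay * *param);
}

fn main() {
    let mut w = 1.0f32;
    let mut st = AdamWState { m: 0.0, v: 0.0, t: 0 };
    // One update with gradient 0.5 and typical GPT hyperparameters.
    adamw_step(&mut w, 0.5, &mut st, 1e-3, 0.9, 0.999, 1e-8, 0.01);
    println!("{}", w);
}
```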
Ran it for 2 days or so. Unfortunately the loss hasn't fallen below 1.7 (it's below 1.8 consistently). Sample output:
@cutoken Looks impressive. Have you tried the new optimizations? We are like 10x faster now. P.s. plz join our Discord! https://discord.gg/wTJFaDVn45
Nice! Maybe we could implement BPE or use the corresponding crate for encoding, @keyvank. That would let us get even better results and faster training times.
@cutoken I've implemented this. It's experimental, and @keyvank most likely will want to have an in-house implementation of these tokenizers without using external libraries, though. I'll try training using SentencePiece + your optimizations and will report back.
@pcranaway I won't insist on writing our own tokenizer though. We could perhaps write our own SentencePiece model “reader” (probably just a hashmap?), something called MappingTokenizer for example, which would be filled by SentencePiece or similar libraries.
That's basically what I meant! We can surely implement a SentencePiece reader ourselves.
@pcranaway let’s do this! |
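A rough sketch of that MappingTokenizer idea, just to show the reader really can be a hashmap plus greedy longest-match. The struct name comes from the comment above, but the one-piece-per-line vocab format and method names are assumptions here (a real SentencePiece .vocab file also carries scores and special pieces).

```rust
use std::collections::HashMap;
use std::fs;

// Sketch only: reads a plain vocab file (one piece per line) into lookup tables.
struct MappingTokenizer {
    piece_to_id: HashMap<String, usize>,
    id_to_piece: Vec<String>,
}

impl MappingTokenizer {
    fn from_vocab_file(path: &str) -> std::io::Result<Self> {
        let id_to_piece: Vec<String> =
            fs::read_to_string(path)?.lines().map(|l| l.to_string()).collect();
        let piece_to_id = id_to_piece
            .iter()
            .enumerate()
            .map(|(i, p)| (p.clone(), i))
            .collect();
        Ok(Self { piece_to_id, id_to_piece })
    }

    // Greedy longest-match; a real reader would also handle <unk> and scores.
    fn tokenize(&self, text: &str) -> Vec<usize> {
        let mut ids = Vec::new();
        let mut rest = text;
        while !rest.is_empty() {
            let mut matched = None;
            for (i, c) in rest.char_indices().rev() {
                let end = i + c.len_utf8();
                if let Some(&id) = self.piece_to_id.get(&rest[..end]) {
                    matched = Some((id, end));
                    break;
                }
            }
            match matched {
                Some((id, end)) => {
                    ids.push(id);
                    rest = &rest[end..];
                }
                // Unknown leading character: skip it.
                None => rest = &rest[rest.chars().next().unwrap().len_utf8()..],
            }
        }
        ids
    }

    fn untokenize(&self, ids: &[usize]) -> String {
        ids.iter().map(|&i| self.id_to_piece[i].as_str()).collect()
    }
}

fn main() -> std::io::Result<()> {
    // Tiny demo vocab; in practice this file would come from SentencePiece.
    fs::write("demo.vocab", "he\nl\no\n ")?;
    let tok = MappingTokenizer::from_vocab_file("demo.vocab")?;
    let ids = tok.tokenize("hello");
    println!("{:?} -> {:?}", ids, tok.untokenize(&ids));
    Ok(())
}
```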
How is the training going, @pcranaway? Hope the loss has come down and you are able to see some good outputs!
@cutoken Haven't you seen the updated README yet? |
Omg! That's insane! And in just 10 hours. What model config were you using? How low was the loss able to drop?
I got disappointed because it was all random words and stopped (also I've been training for a few days and my electricity bill will probably make me homeless) |
@cutoken Check out the model along with train_data on
@pcranaway Now I'm worried about my elec bill :D @keyvank I'll check it out. It would be nice if we could dump the whole training state into a single shareable file for inference, and also have a dedicated executable/command line on the existing one for easy inference. With how fast this is training, I won't be surprised if people start using it to get minor real-world tasks done!
@cutoken That's a really great idea! We can even have a pre-trained model on an English dataset, and just let users fine-tune it with their own datasets.
Ya, exactly. Remember that there are barely any from-scratch 'usable' models on this front in the wild. Everything is derived from a certain grass-eating animal! :D
Hi @keyvank,
Settings are:
Looks like there is some bug in the infer code in main.rs between the main and reddit branches. The version below works well while the one in the main branch doesn't:
@cutoken Can you run the reddit branch properly, without integrating anything into the main branch? Edit: Oh sorry, I just saw that you said the reddit branch is fine. I will try rebasing the reddit branch onto main today.
Already done running the reddit branch (with Shakespeare, as I want everything to be comparable) :) This SentencePiece thingy is quite good. The loss takes a while to stabilize, but the quality even at a higher loss is way better (as expected, since a char-level GPT has to get a lot more correct to look correct). After the above infer code modification, the main branch also works fine. I'm using 1L/1H to quickly compare things (smaller models don't move past a certain loss value, obviously). I gave "He who" as input and look what I got:
@cutoken Did you use a new SentencePiece vocab trained on the Shakespeare dataset? Or did you use the model for the Reddit dataset? This is really good, I like it :)
Generated a new SentencePiece vocab on the Shakespeare dataset with a vocab size of 128. We will have to think about how to package SentencePiece with femto. While it is good, I hope it won't end up bringing Python-ecosystem horrors to femto, which so far is simple to run with no dependencies. Probably shipping the whole lib would make sense.
@cutoken As you said, that error was because some words which occurred in the dataset were not in the vocab. I'm working on an in-house SentencePiece implementation; maybe we can make it generate a vocab and train a SentencePiece model inside femto if it's fast enough.
Hi @keyvank, also, why keep computations as a BTree when in reality it's being used more like a Vec, since we are not even using the id against which each computation is stored? (Please correct me if I missed something :) )
Hey @cutoken, let's continue these kinds of questions in separate GitHub issues; this will help us organize things. Could you make a copy?
Hi @keyvank,
@cutoken You can learn more about this by researching “embedding layers”.
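In case it helps anyone reading along: an embedding layer is essentially a trainable lookup table indexed by token id. A toy sketch (illustrative only, not femtoGPT's actual code):

```rust
// Toy sketch of an embedding lookup: the table has one trainable row of length
// `embedding_degree` per vocab entry, and "embedding" a sequence just means
// gathering the rows for its token ids.
fn embed(table: &[Vec<f32>], token_ids: &[usize]) -> Vec<Vec<f32>> {
    token_ids.iter().map(|&id| table[id].clone()).collect()
}

fn main() {
    let vocab_size: usize = 5;
    let embedding_degree: usize = 4;
    // In a real model this table is randomly initialized and learned during training.
    let table: Vec<Vec<f32>> = (0..vocab_size)
        .map(|i| vec![i as f32; embedding_degree])
        .collect();
    println!("{:?}", embed(&table, &[2, 0, 4]));
}
```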
@cutoken Each CPU thread is processing one instance; that's why I passed 1 here.
Okay. One more question (apologies for so many - till now I have only used femto, while now I'm trying to dig into the code and understand it): for positional encoding as per the Transformer paper, we are supposed to calculate the sin and cos values for each token (for even and odd dimensions respectively) and then add them to the embeddings for each token and provide that info to the first layer. While I see we are looking up the embeddings, I couldn't find where we are doing this sine/cos magic.
@cutoken
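For background on the question above, this is the fixed sinusoidal encoding from the original Transformer paper as a sketch (the standard formula, not femtoGPT's code). The pos_embedding.dat file mentioned earlier suggests femtoGPT keeps a learned positional-embedding table instead of computing sin/cos on the fly, which may be why that step doesn't appear in the code.

```rust
// Sketch of sinusoidal positional encoding from "Attention Is All You Need":
// PE(pos, 2i)   = sin(pos / 10000^(2i / d))
// PE(pos, 2i+1) = cos(pos / 10000^(2i / d))
// Background for the question above, not femtoGPT's implementation.
fn sinusoidal_positional_encoding(num_tokens: usize, embedding_degree: usize) -> Vec<Vec<f32>> {
    (0..num_tokens)
        .map(|pos| {
            (0..embedding_degree)
                .map(|i| {
                    let exponent = (2 * (i / 2)) as f32 / embedding_degree as f32;
                    let angle = pos as f32 / 10000f32.powf(exponent);
                    if i % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}

fn main() {
    // Each row is added element-wise to the token embedding at that position.
    let pe = sinusoidal_positional_encoding(3, 8);
    for row in &pe {
        println!("{:?}", row);
    }
}
```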
Actually let’s implement this! |
Once you do let me know. Always ready to test the new things! :D |
Hi,
An assert fails in tensor/mod.rs when the number of heads equals the number of layers. Not sure if that's just not a good configuration, but in nanoGPT equal values are supported - see the CPU section (apples to oranges, so please ignore if it's not a good comparison):
https://github.com/karpathy/nanoGPT