Random discussions #2

Open · cutoken opened this issue Jun 3, 2023 · 81 comments


cutoken commented Jun 3, 2023

Hi,
An assert fails in tensor/mod.rs when the number of heads equals the number of layers. I'm not sure whether that's a bad combination, but nanoGPT supports equal values; see the CPU section (apples to oranges, so please ignore this if it's not a fair comparison):

https://github.com/karpathy/nanoGPT


keyvank commented Jun 3, 2023

@cutoken Have you tried removing the train_data/ folder before starting the training process on a different architecture?


cutoken commented Jun 3, 2023

Oops. That seems to be the reason.
Btw, the default learning rate given for Adam seems to be really low. With around 0.001 it seems to converge faster. That's the beauty of this project, though: you can play around and see how training and output quality are affected.


keyvank commented Jun 3, 2023

@cutoken Be careful! With a higher learning-rate, you may get trapped into a sub-optimal solution :)


cutoken commented Jun 3, 2023

Will keep that in mind!
Is there any possibility of quantizing with this model? I'm curious where we would need to make changes to make it use 8 bits, for example.

Btw, here is what it produces at step 1000 while training on a micro sample of nursery rhymes :D

Lis tt ucame lintle dor bamy
“Quack, quack.”
Butwont one  s litttle o o ducaour two dame q
A fhe the


cutoken commented Jun 3, 2023

Okay, one more question:
Let's say I'm training it on one dataset.txt. After some time I stop the training and switch to another dataset.txt. Is it possible to port only the weights from the previous training and start training on the new dataset? Currently it asserts in the same place as above. I'm guessing that's because more than just the weights gets stored to disk.


keyvank commented Jun 3, 2023

@cutoken You mean using 8-bit floating-point instead of f32?

Wow I love that "Quack, quack." LOL 😆

Yes you can, but it's a bit tricky. You have to make sure the vocab and vocab size of the previous and new datasets are equal; otherwise you have to somehow make sure the char->int mappings are the same in both datasets (and also change the shape of the token_embedding tensor, which is tricky).
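
As a rough illustration (not code from this repo; the file names and the char_mapping helper are hypothetical), a check like the following could tell you whether two datasets would produce the same char->int mapping under a simple sorted-unique-characters scheme:

use std::collections::BTreeMap;

/// Hypothetical helper: build a char -> int mapping the way a simple
/// character-level tokenizer would (sorted, de-duplicated characters).
fn char_mapping(text: &str) -> BTreeMap<char, usize> {
    let mut chars: Vec<char> = text.chars().collect();
    chars.sort();
    chars.dedup();
    chars.into_iter().enumerate().map(|(i, c)| (c, i)).collect()
}

fn main() -> std::io::Result<()> {
    let old = std::fs::read_to_string("dataset_old.txt")?;
    let new = std::fs::read_to_string("dataset_new.txt")?;
    if char_mapping(&old) == char_mapping(&new) {
        println!("Mappings match: the old weights should be reusable.");
    } else {
        println!("Mappings differ: reusing weights would mis-assign tokens.");
    }
    Ok(())
}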


cutoken commented Jun 3, 2023

Yes, 8-bit float instead of f32.

Can you explain a little about the embedding.dat, optimizer.dat, pos_embedding.dat and tensor_num.dat files? Where do the weights for the layers get stored, and is the vocab the same as the character set or something different? Sorry for so many naive questions. If there is a paper or article that covers the basics, feel free to just drop that instead :)


keyvank commented Jun 3, 2023

Right now, the implementation is not generic, but you can just find/replace f32 with other types 🙂

Watch this: https://www.youtube.com/watch?v=kCc8FmEb1nY

embedding.dat contains the token embeddings
pos_embedding.dat contains the positional embeddings
optimizer.dat contains the state of the Adam optimizer
tensor_{tensor_id}.dat contains the weights, biases, etc. of the different operations (e.g. Linear, LayerNorm, ...) in the network
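
As a minimal sketch of that find/replace idea (the Float alias is a hypothetical name, nothing that exists in femtoGPT): route every use of f32 through a single alias, so a later switch to another type touches only one line.

// Hypothetical refactor sketch: one alias for the float type used everywhere.
type Float = f32;

// Any numeric routine then works against the alias instead of a hard-coded f32.
fn dot(a: &[Float], b: &[Float]) -> Float {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let v = vec![1.0 as Float, 2.0, 3.0];
    println!("{}", dot(&v, &v)); // 14
}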


cutoken commented Jun 3, 2023

Got it. If I can get just the weights to be loadable, so that the model can be switched between different datasets, I'll raise a PR.

cutoken closed this as completed Jun 3, 2023

cutoken commented Jun 3, 2023

@keyvank
Solved the weight-interchangeability-between-datasets issue with the simpler solution of limiting the vocab to a range of ASCII characters. Unfortunately, as you can guess, this not only imposes restrictions on datasets containing other kinds of characters but also imposes a penalty on simpler datasets. However, it lets the same model be used across similar datasets. So no PR is coming, but I can raise one with an optional command-line switch like --vocab=ascii if needed.
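
For illustration, here is a standalone sketch of the idea (the struct and methods below are made up, not femtoGPT's actual Tokenizer trait): treat each ASCII code point as its own token id, which fixes the vocab at 128 regardless of the dataset.

// Standalone sketch of an ASCII-range tokenizer; names are illustrative only.
struct AsciiRangeTokenizer;

impl AsciiRangeTokenizer {
    const VOCAB_SIZE: usize = 128;

    fn tokenize(&self, text: &str) -> Vec<usize> {
        text.chars()
            .filter(|c| c.is_ascii()) // drop anything outside the ASCII range
            .map(|c| c as usize)      // the code point itself is the token id
            .collect()
    }

    fn untokenize(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&t| (t as u8) as char).collect()
    }
}

fn main() {
    let tok = AsciiRangeTokenizer;
    let ids = tok.tokenize("Quack, quack.");
    assert_eq!(tok.untokenize(&ids), "Quack, quack.");
    println!("{} tokens, vocab size {}", ids.len(), AsciiRangeTokenizer::VOCAB_SIZE);
}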


keyvank commented Jun 3, 2023

@cutoken I just added an ASCII tokenizer! :)


cutoken commented Jun 4, 2023

Awesome @keyvank !


cutoken commented Jun 4, 2023

I'm trying to use the ASCII tokenizer, but maybe a bit of the implementation is missing. Shouldn't the new Tokenizer trait be used instead of SimpleTokenizer everywhere?


keyvank commented Jun 4, 2023

@cutoken It's not used by default. In src/main.rs you should change let tokenizer = ... to let tokenizer = AsciiTokenizer {}

Btw, have you been able to get good results with a larger model?


cutoken commented Jun 4, 2023

Understood.

I haven't tried larger models yet; I'm actually playing with even smaller ones than the default settings. The previous output I provided was from a 4-layer/4-head network; today I'm trying 2-layer/2-head and 1-layer/1-head networks. Thanks to the ASCII tokenizing I'm able to switch datasets easily. What I would love to see, though, is a more granular way to save and load the weights, biases and other network state, so that it's possible to visualize the changing weights or transfer weights between networks more easily. Were you able to get your larger network going?

I'm interested in seeing whether the output has improved and what it's producing :)

PS: This is what my 1L/1H network is producing after 2.5 hours of training (probably around 8k steps). I don't think the quality will improve as the loss is not reducing anymore but I'm curious to see what longer training would do. Will run it for a few more hours.

He wen thoun muthere tit wapatit ther bare.
Ththet I trim me o.
My das tearishen ale wadoopha serist


keyvank commented Jun 4, 2023

@cutoken Unfortunately I'm away from my powerful machine so couldn't experiment much. But yeah, probably we should try bigger models before getting anything meaningful.

P.S. I just pushed an optimization that makes everything around 3-4x faster.

keyvank changed the title from "Assertion fail when num_heads = num_layers" to "Random discussions" Jun 4, 2023
keyvank reopened this Jun 4, 2023

cutoken commented Jun 4, 2023

That is great news. I'm going to give it a whirl!

Edit: Yes definitely faster. Seeing 3x improvement with same settings!


cutoken commented Jun 4, 2023

Btw, I notice that even with rayon, while all my hyper-threads are engaged, they are not hitting 100% usage. I tried increasing the thread count, but that made it slower. Not sure if it's just an artifact of hyper-threading/usage display or some bug in rayon.


keyvank commented Jun 4, 2023

@cutoken
Could have different reasons. Maybe some parts of the code are still single-threaded. Maybe there are more memory operations compared to computations, leading to an idle CPU.
#4 could be an important reason why the CPU is idle.


cutoken commented Jun 5, 2023

This is after 9 hours of training. I think the words are starting to make sense. The model is 6L/6H, and the loss is continuing to decrease nicely. As I mentioned earlier, I have so far been using a much higher learning rate (0.001). When the loss stops decreasing I'll move up to a bigger model size.


What like but wore pad wo me che nogns yous dares,
As supt it nind bupart 'the reed:
And hils not es


keyvank commented Jun 5, 2023

@cutoken
I'm getting convinced this is working!! Is the loss already under 2.0?


cutoken commented Jun 5, 2023

The loss was hovering between 1.9 and 2.1 for the above result. I stopped it after a couple of hours to try nanoGPT for comparison. Here are the observations:

  1. They use a larger embedding size but fewer layers as the CPU default: 4L/4H/128 embedding degree.
  2. They do some learning-rate-varying magic with Adam: the loss drops like a rock with each iteration (their loss at 200 iterations equals ours at 2k iterations); see the schedule sketch below.
  3. They keep dropout at 0.0.
  4. Each step takes around 500 ms vs our 2000 ms, which all things considered is not bad at all for the Rust version (they peg the CPU at 100%).

Going forward I'll use the above params for testing. We can compare quality against nanoGPT at the same number of iterations, so we know the implementation is correct. From my points above I think something is different in the optimizer part, which we can focus on, since I think femto is converging slower than the nano one :)
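
For reference, here is a rough sketch of the kind of schedule nanoGPT uses (linear warmup followed by cosine decay); the constants below are illustrative and not taken from either project:

use std::f32::consts::PI;

// Hedged sketch of a warmup + cosine-decay learning-rate schedule.
fn learning_rate(step: usize) -> f32 {
    let (max_lr, min_lr) = (1e-3_f32, 1e-4_f32);
    let (warmup, decay_steps) = (100usize, 5000usize);
    if step < warmup {
        // Linear warmup from ~0 to max_lr.
        return max_lr * (step + 1) as f32 / warmup as f32;
    }
    if step >= decay_steps {
        return min_lr;
    }
    // Cosine decay from max_lr down to min_lr.
    let progress = (step - warmup) as f32 / (decay_steps - warmup) as f32;
    let coeff = 0.5 * (1.0 + (PI * progress).cos()); // goes 1.0 -> 0.0
    min_lr + coeff * (max_lr - min_lr)
}

fn main() {
    for step in [0usize, 50, 100, 1000, 5000] {
        println!("step {step}: lr = {:.6}", learning_rate(step));
    }
}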


keyvank commented Jun 5, 2023

@cutoken Yeah something is definitely wrong with the optimizer😨 Could you plz send me your training data when you got a chance?

keyvankambakhsh@gmail.com


cutoken commented Jun 5, 2023

Sent it. Hope you can figure it out. I'm going to run the model with these same settings continuously for a couple of days to see if the loss eventually comes down and we get output equivalent to nanoGPT.


keyvank commented Jun 5, 2023

@cutoken You are an angel. Thank you!


cutoken commented Jun 7, 2023

Ran it for 2 days or so. Unfortunately the loss hasn't fallen below 1.7 (it's below 1.8 consistently). Sample output:

Giring city. Wherer I is thou losan, You come.

SERIANA:
We is the kis o parst ict as lipp I'd ou nord thought Vearnce
Wiveie ve are minder. Wis nushe shall wome
Frtue quien of hou alok:' avre we year stake,
we ususore sch have prinut to our fileng


keyvank commented Jun 7, 2023

@cutoken Looks impressive. Have you tried the new optimizations? We are like x10 faster now.

P.s. plz join our discord! https://discord.gg/wTJFaDVn45


cutoken commented Jun 9, 2023

Nice! Maybe we could implement BPE or use the corresponding crate for encoding, @keyvank. That would let us get even better results and faster training times.

@pcranaway

@cutoken I've implemented Tokenizers for Sentencepiece and Hugging Face's tokenizers in my fork

It's experimental and @keyvank most likely will want to have an in-house implementation of these tokenizers without using external libraries, though.

I'll try training using SentencePiece + your optimizations and will report back.


keyvank commented Jun 9, 2023

@pcranaway I won't insist on writing our own tokenizer, though. We can perhaps write our own sentencepiece model “reader” (probably just a hashmap?), something called MappingTokenizer for example, which would be filled by sentencepiece or similar libraries.
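
A rough sketch of what such a MappingTokenizer could look like (the vocab file format and the greedy longest-match loop here are assumptions, not an existing implementation):

use std::collections::HashMap;

// Sketch: a plain token->id map plus a greedy longest-match encoder.
struct MappingTokenizer {
    token_to_id: HashMap<String, usize>,
    id_to_token: Vec<String>,
}

impl MappingTokenizer {
    /// Build from lines of "token<TAB>id" (one possible export format).
    fn from_vocab(lines: &str) -> Self {
        let mut token_to_id = HashMap::new();
        let mut id_to_token = Vec::new();
        for line in lines.lines() {
            if let Some((tok, id)) = line.split_once('\t') {
                let id: usize = id.trim().parse().unwrap();
                token_to_id.insert(tok.to_string(), id);
                if id >= id_to_token.len() {
                    id_to_token.resize(id + 1, String::new());
                }
                id_to_token[id] = tok.to_string();
            }
        }
        Self { token_to_id, id_to_token }
    }

    /// Greedy longest-match tokenization over the mapping.
    fn tokenize(&self, text: &str) -> Vec<usize> {
        let max_len = self.token_to_id.keys().map(|t| t.len()).max().unwrap_or(1);
        let mut out = Vec::new();
        let mut rest = text;
        while !rest.is_empty() {
            let mut matched = false;
            for len in (1..=max_len.min(rest.len())).rev() {
                if !rest.is_char_boundary(len) {
                    continue;
                }
                if let Some(&id) = self.token_to_id.get(&rest[..len]) {
                    out.push(id);
                    rest = &rest[len..];
                    matched = true;
                    break;
                }
            }
            if !matched {
                // Unknown character: skip it (a real tokenizer would emit <unk>).
                let mut it = rest.chars();
                it.next();
                rest = it.as_str();
            }
        }
        out
    }

    fn untokenize(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&t| self.id_to_token[t].as_str()).collect()
    }
}

fn main() {
    let vocab = "He\t0\n wh\t1\no\t2\n \t3\nw\t4\nh\t5\ne\t6";
    let tok = MappingTokenizer::from_vocab(vocab);
    let ids = tok.tokenize("He who");
    println!("{:?} -> {:?}", ids, tok.untokenize(&ids));
}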

@pcranaway

> We can perhaps write our own sentencepiece model “reader” (probably just a hashmap?), something called MappingTokenizer for example, which would be filled by sentencepiece or similar libraries.

that's basically what I meant! we can surely implement a SentencePiece reader ourselves


keyvank commented Jun 9, 2023

@pcranaway let’s do this!


cutoken commented Jun 10, 2023

How is the training going @pcranaway ? Hope loss has come down and you are able to see some good outputs!


keyvank commented Jun 10, 2023

@cutoken Haven't you seen the updated README yet?


cutoken commented Jun 10, 2023

omg! That's insane! And in just 10 hours. What model config were you using? How low did the loss manage to drop?

@pcranaway

> How is the training going @pcranaway ? Hope loss has come down and you are able to see some good outputs!

I got disappointed because it was all random words and stopped (also I've been training for a few days and my electricity bill will probably make me homeless)


keyvank commented Jun 10, 2023

@cutoken Check out the model along with train_data on reddit branch


cutoken commented Jun 10, 2023

@pcranaway now I'm worried about my elec bill :D
With a good model config and sentencepiece I'm sure this can be done quite easily!

@keyvank I'll check it out. It would be nice if we could dump the whole training state into a single shareable file for inference, and also have a dedicated executable or command-line option on the existing one for easy inference. With how fast this is training, I wouldn't be surprised if people start using it to get minor real-world tasks done!


keyvank commented Jun 10, 2023

@cutoken That's a really great idea! We can even have a pre-trained model on english dataset, and just let users fine-tune it with their own dataset


cutoken commented Jun 10, 2023

Ya, exactly. Remember that there are barely any from-scratch 'usable' models on this front in the wild; everything is derived from a certain grass-eating animal! :D


cutoken commented Jun 10, 2023

Hi @keyvank,
I'm trying to join the sentencepiece bandwagon, but unfortunately when I try to bring the reddit branch changes into the main branch, I get the exception below. The reddit branch itself works fine. This is with the Shakespeare dataset (which works great on the reddit branch):


Vocab-size: 128 unique characters
Number of parameters: 832896
Generating text:
thread 'main' panicked at 'assertion failed: ind < self.shape[0]', /home/gocnet/nn/femtoGPT/src/tensor/mod.rs:38:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
   1: core::panicking::panic_fmt
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:117:5
   3: femto_gpt::tensor::TensorOps::get
   4: femto_gpt::gpt::GPT<O,R>::infer
   5: femto_gpt::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Settings are:


    let batch_size = 32;

    let num_tokens = 64;
    let vocab_size = tokenizer.vocab_size();
    let embedding_degree = 64;
    let num_layers = 2;
    let num_heads = 2;
    let head_size = embedding_degree / num_heads;
    let dropout = 0.0;

Looks like there's some bug in the infer code in main.rs compared to the reddit branch. The code below works well, while the one on the main branch doesn't:


    println!("Generating text:");

    for _ in 0..50 {
        // Generate 20 tokens from the prompt "He who" with the currently
        // trained model (run 50 times to sample several continuations).
        println!(
            "{}",
            tokenizer.untokenize(&gpt.infer(&tokenizer.tokenize("He who"), 20, 0.7, |_ch| {}))
        );
    }


keyvank commented Jun 10, 2023

@cutoken Can you run the reddit branch properly, without integrating anything into the main branch?

Edit: Oh sorry, I just saw you said the reddit branch is fine. I will try rebasing the reddit branch onto main today.


cutoken commented Jun 10, 2023

I've already run the reddit branch (with Shakespeare, as I want everything to be comparable) :)

This sentencepiece thing is quite good. The loss takes a while to stabilize, but the quality even at a higher loss is way better (as expected, since a character-level GPT has to get a lot more right to look correct). After the above infer-code modification, the main branch also works fine. I'm using a 1L/1H model to compare things quickly (smaller models obviously don't move past a certain loss value).

I gave "He who" as input and look what I got:


He whose my such To more to Rome
He whose point, good a glass,
He whose have what I can country,
He whom, That we should in the grou
He whom inter me gentleman: And
He whose honour! Shall us upon th
He whose is a prove my grown,
He whom I have purse: Sir, I w
He whose not with not, And thou d
He wholess of the fair. DUKE VIN
He whose have dold, good inst
He whole god secress is my mothe
He whom the rest at make a place.
He wholess against be me most g
He whose shall see, give thee, si
He whose me stir. ANTIGONUS:
He whose your such beg, And been
He whose that you, Believe, that'


keyvank commented Jun 10, 2023

@cutoken Did you use a new sentencepiece vocab trained on the Shakespeare dataset, or did you use the model for the Reddit dataset? This is really good, I like it :)


cutoken commented Jun 10, 2023

I generated a new sentencepiece vocab on the Shakespeare dataset with a vocab size of 128. We'll have to think about how to package sentencepiece with femto. While it is good, I hope it won't end up bringing Python-ecosystem horrors to femto, which so far is simple to run with no dependencies. Probably shipping the whole lib would make sense.

@pcranaway

@cutoken As you said, that error was because some words that occurred in the dataset were not in the vocab.

I'm working on an in-house sentencepiece implementation; maybe we can make it generate a vocab and train a sentencepiece model inside femto if it's fast enough.


cutoken commented Jun 11, 2023

Hi @keyvank,
Let's say I want to add a new decoder layer (the one that gets constructed as part of the 0..num_layers loop) at run time, after the gpt::new() call. How do I go about it? As I understand it, you are just pushing the computations one by one with an incrementing tensor id, so adding a layer at a later point in time would need the ids of the following layers to be incremented as well (for example, adding one more decoder layer along with all its sub-layers like attention means incrementing the vocab out and the other variables outside the for loop?).

Also, why keep the computations in a BTree when in reality it's being used more like a Vec, since we aren't even using the id against which each computation is stored (please correct me if I missed something :) )


keyvank commented Jun 11, 2023

Hey @cutoken, let's continue these kinds of questions in separate GitHub issues; that will help us organize things. Could you make a copy?


cutoken commented Jun 13, 2023

Hi @keyvank,
I see that we are calculating gradients for the input layers, and also adjusting the input layers' weights during the training step. Is there a reason for this? My naive understanding is that we only change the weights and biases of the intermediate layers so that we eventually get the optimal solution for the same inputs and outputs. Let me know why we do this.


keyvank commented Jun 13, 2023

@cutoken
Hey! We are essentially letting the model find the best word-to-vector encoding for the words. Words that have similar meanings will be closer in this vector space.

Learn more about this by researching "embedding layers".


cutoken commented Jun 13, 2023

hi @keyvank ,
Is batch_size supposed to be passed as 1 here:

let (xs, ys) = sample_dataset(dataset, 1, self.num_tokens, &mut rng);


keyvank commented Jun 13, 2023

@cutoken Each CPU thread is processing 1 instance, that's why I passed 1 here
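
Roughly the idea, as a hedged sketch (fake_grad stands in for the real forward/backward pass; this is not the actual femtoGPT code): each rayon task handles a single sample, and the per-sample gradients are averaged afterwards.

use rayon::prelude::*;

// Placeholder for a real forward/backward pass on one sample.
fn fake_grad(sample: usize) -> Vec<f32> {
    vec![sample as f32; 4]
}

fn main() {
    let batch_size = 32usize;

    // One sample per rayon task: the effective batch is split across threads.
    let grads: Vec<Vec<f32>> = (0..batch_size)
        .into_par_iter()
        .map(fake_grad)
        .collect();

    // Average the per-sample gradients into one update.
    let mut avg = vec![0.0f32; 4];
    for g in &grads {
        for (a, v) in avg.iter_mut().zip(g) {
            *a += v / batch_size as f32;
        }
    }
    println!("{:?}", avg);
}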


cutoken commented Jun 14, 2023

Okay, one more question (apologies for so many; until now I had only used femto, whereas now I'm trying to dig into the code and understand it):

For positional encoding as per the Transformer paper, we are supposed to calculate sin and cos values for each position (for the even and odd dimensions respectively), add them to the embeddings of each token, and feed that to the first layer. While I see we are looking up the embeddings, I couldn't find where we are doing this sine/cos magic.


keyvank commented Jun 14, 2023

@cutoken
We don't do that here. We let the model discover a code for each position itself (dynamically, via the learned pos_embedding). While this is simpler, giving it a fixed pattern like sine/cosine would probably make training faster and reduce the number of parameters.
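
For reference, a small standalone sketch of the fixed sinusoidal scheme from the paper (not wired into the codebase): even dimensions get sin, odd dimensions get cos, at geometrically spaced frequencies.

// Sinusoidal positional encoding: one vector of length embedding_degree per position.
fn positional_encoding(num_tokens: usize, embedding_degree: usize) -> Vec<Vec<f32>> {
    (0..num_tokens)
        .map(|pos| {
            (0..embedding_degree)
                .map(|i| {
                    // Frequency depends on the dimension pair (i/2), per the paper.
                    let freq = 10000f32.powf(-((i / 2 * 2) as f32) / embedding_degree as f32);
                    let angle = pos as f32 * freq;
                    if i % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let pe = positional_encoding(64, 64);
    println!("{:?}", &pe[1][..4]); // encoding of position 1, first few dims
}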


keyvank commented Jun 14, 2023

Actually let’s implement this!


cutoken commented Jun 14, 2023

Once you do let me know. Always ready to test the new things! :D

Repository owner deleted a comment from ULis3h Feb 23, 2024