Random discussions #2

Open · cutoken opened this issue Jun 3, 2023 · 81 comments


cutoken commented Jun 3, 2023

Hi,
An assert fails in tensor/mod.rs when the number of heads equals the number of layers. I'm not sure whether that's a bad combination, but nanoGPT supports equal values; see the CPU section (apples to oranges, so please ignore this if it's not a fair comparison):

https://github.com/karpathy/nanoGPT


keyvank commented Jun 3, 2023

@cutoken Have you tried removing the train_data/ folder before starting the training process on a different architecture?


cutoken commented Jun 3, 2023

Oops. That seems to be the reason.
Btw, the default learning rate given for Adam seems to be really low. With around 0.001 it seems to converge faster. That's the beauty of this project, though: you can play around and see how training and output quality are affected.


keyvank commented Jun 3, 2023

@cutoken Be careful! With a higher learning-rate, you may get trapped into a sub-optimal solution :)


cutoken commented Jun 3, 2023

Will keep that in mind!
Is there any possibility of quantizing with this model? I'm curious where we would need to make changes to make it use 8 bits, for example.

Btw, here is what it produces at step 1000 while training on a micro sample of nursery rhymes :D

Lis tt ucame lintle dor bamy
“Quack, quack.”
Butwont one  s litttle o o ducaour two dame q
A fhe the


cutoken commented Jun 3, 2023

Okay, one more question:
Let's say I'm training it on one dataset.txt. After some time I stop the training and switch to another dataset.txt. Is it possible to port only the weights from the previous training and start training on the new dataset? Currently it asserts in the same place as above. I'm guessing that's because more than just the weights gets stored to disk.


keyvank commented Jun 3, 2023

@cutoken You mean using 8-bit floating-point instead of f32?

Wow I love that "Quack, quack." LOL 😆

Yes you can, but it's a bit tricky. You have to make sure the vocab and vocab size of the previous and new datasets are equal; otherwise you have to somehow make sure the char->int mappings are the same in both datasets (and also change the shape of the token_embedding tensor, which is tricky).
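
As a rough illustration (not code from this repo; the file names and the char_mapping helper are hypothetical), a check like the following could tell you whether two datasets would produce the same char->int mapping under a simple sorted-unique-characters scheme:

use std::collections::BTreeMap;

/// Hypothetical helper: build a char -> int mapping the way a simple
/// character-level tokenizer would (sorted, de-duplicated characters).
fn char_mapping(text: &str) -> BTreeMap<char, usize> {
    let mut chars: Vec<char> = text.chars().collect();
    chars.sort();
    chars.dedup();
    chars.into_iter().enumerate().map(|(i, c)| (c, i)).collect()
}

fn main() -> std::io::Result<()> {
    let old = std::fs::read_to_string("dataset_old.txt")?;
    let new = std::fs::read_to_string("dataset_new.txt")?;
    if char_mapping(&old) == char_mapping(&new) {
        println!("Mappings match: the old weights should be reusable.");
    } else {
        println!("Mappings differ: reusing weights would mis-assign tokens.");
    }
    Ok(())
}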


cutoken commented Jun 3, 2023

Yes, 8-bit float instead of f32.

Can you explain a little about the embedding.dat, optimizer.dat, pos_embedding.dat and tensor_num.dat files? Where do the weights for the layers get stored, and is the vocab the same as the character set or something different? Sorry for so many naive questions. If there is a paper or article that covers the basics, feel free to just drop that instead :)


keyvank commented Jun 3, 2023

Right now, the implementation is not generic, but you can just find/replace f32 with other types 🙂

Watch this: https://www.youtube.com/watch?v=kCc8FmEb1nY

embedding.dat contains the token embeddings
pos_embedding.dat contains the positional embeddings
optimizer.dat contains the state of the Adam optimizer
tensor_{tensor_id}.dat contains the weights, biases, etc. of the different operations (e.g. Linear, LayerNorm, ...) in the network
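
As a minimal sketch of that find/replace idea (the Float alias is a hypothetical name, nothing that exists in femtoGPT): route every use of f32 through a single alias, so a later switch to another type touches only one line.

// Hypothetical refactor sketch: one alias for the float type used everywhere.
type Float = f32;

// Any numeric routine then works against the alias instead of a hard-coded f32.
fn dot(a: &[Float], b: &[Float]) -> Float {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let v = vec![1.0 as Float, 2.0, 3.0];
    println!("{}", dot(&v, &v)); // 14
}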


cutoken commented Jun 3, 2023

Got it. If I can get just the weights to be loadable, so that the model can be switched between different datasets, I'll raise a PR.

cutoken closed this as completed Jun 3, 2023

cutoken commented Jun 3, 2023

@keyvank
Solved the weight-interchangeability-between-datasets issue with the simpler solution of limiting the vocab to a range of ASCII characters. Unfortunately, as you can guess, this not only imposes restrictions on datasets containing other kinds of characters but also imposes a penalty on simpler datasets. However, it lets the same model be used across similar datasets. So no PR is coming, but I can raise one with an optional command-line switch like --vocab=ascii if needed.
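
For illustration, here is a standalone sketch of the idea (the struct and methods below are made up, not femtoGPT's actual Tokenizer trait): treat each ASCII code point as its own token id, which fixes the vocab at 128 regardless of the dataset.

// Standalone sketch of an ASCII-range tokenizer; names are illustrative only.
struct AsciiRangeTokenizer;

impl AsciiRangeTokenizer {
    const VOCAB_SIZE: usize = 128;

    fn tokenize(&self, text: &str) -> Vec<usize> {
        text.chars()
            .filter(|c| c.is_ascii()) // drop anything outside the ASCII range
            .map(|c| c as usize)      // the code point itself is the token id
            .collect()
    }

    fn untokenize(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&t| (t as u8) as char).collect()
    }
}

fn main() {
    let tok = AsciiRangeTokenizer;
    let ids = tok.tokenize("Quack, quack.");
    assert_eq!(tok.untokenize(&ids), "Quack, quack.");
    println!("{} tokens, vocab size {}", ids.len(), AsciiRangeTokenizer::VOCAB_SIZE);
}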


keyvank commented Jun 3, 2023

@cutoken I just added an ASCII tokenizer! :)


cutoken commented Jun 4, 2023

Awesome @keyvank !


cutoken commented Jun 4, 2023

I'm trying to use the ASCII tokenizer, but maybe a bit of the implementation is missing. Shouldn't the new Tokenizer trait be used instead of SimpleTokenizer everywhere?


keyvank commented Jun 4, 2023

@cutoken It's not used by default. In src/main.rs you should change let tokenizer = ... to let tokenizer = AsciiTokenizer {}

Btw, have you been able to get good results with a larger model?


cutoken commented Jun 4, 2023

Understood.

I haven't tried larger models yet; I'm actually playing with even smaller ones than the default settings. The previous output I provided was from a 4-layer/4-head network; today I'm trying 2-layer/2-head and 1-layer/1-head networks. Thanks to the ASCII tokenizing I'm able to switch datasets easily. What I would love to see, though, is a more granular way to save and load the weights, biases and other network state, so that it's possible to visualize the changing weights or transfer weights between networks more easily. Were you able to get your larger network going?

I'm interested in seeing whether the output has improved and what it's producing :)

PS: This is what my 1L/1H network is producing after 2.5 hours of training (probably around 8k steps). I don't think the quality will improve as the loss is not reducing anymore but I'm curious to see what longer training would do. Will run it for a few more hours.

He wen thoun muthere tit wapatit ther bare.
Ththet I trim me o.
My das tearishen ale wadoopha serist


keyvank commented Jun 4, 2023

@cutoken Unfortunately I'm away from my powerful machine so couldn't experiment much. But yeah, probably we should try bigger models before getting anything meaningful.

P.S. I just pushed an optimization that makes everything around 3-4x faster.

keyvank changed the title from "Assertion fail when num_heads = num_layers" to "Random discussions" Jun 4, 2023
keyvank reopened this Jun 4, 2023

cutoken commented Jun 4, 2023

That is great news. I'm going to give it a whirl!

Edit: Yes definitely faster. Seeing 3x improvement with same settings!


cutoken commented Jun 4, 2023

Btw, I notice that even with rayon, while all my hyper-threads are engaged, they are not hitting 100% usage. I tried increasing the thread count, but that made it slower. Not sure if it's just an artifact of hyper-threading/usage display or some bug in rayon.


keyvank commented Jun 4, 2023

@cutoken
Could have different reasons. Maybe some parts of the code are still single-threaded. Maybe there are more memory operations compared to computations, leading to an idle CPU.
#4 could be an important reason why the CPU is idle.


cutoken commented Jun 5, 2023

This is after 9 hours of training. I think the words are starting to make sense. The model is 6L/6H, and the loss is continuing to decrease nicely. As I mentioned earlier, I have so far been using a much higher learning rate (0.001). When the loss stops decreasing I'll move up to a bigger model size.


What like but wore pad wo me che nogns yous dares,
As supt it nind bupart 'the reed:
And hils not es


keyvank commented Jun 5, 2023

@cutoken
I'm getting convinced this is working!! Is the loss already under 2.0?


cutoken commented Jun 5, 2023

The loss was hovering between 1.9 and 2.1 for the above result. I stopped it after a couple of hours to try nanoGPT for comparison. Here are the observations:

  1. They use a larger embedding size but fewer layers as the CPU default: 4L/4H/128 embedding degree.
  2. They do some learning-rate-varying magic with Adam: the loss drops like a rock with each iteration (their loss at 200 iterations equals ours at 2k iterations); see the schedule sketch below.
  3. They keep dropout at 0.0.
  4. Each step takes around 500 ms vs our 2000 ms, which all things considered is not bad at all for the Rust version (they peg the CPU at 100%).

Going forward I'll use the above params for testing. We can compare quality against nanoGPT at the same number of iterations, so we know the implementation is correct. From my points above I think something is different in the optimizer part, which we can focus on, since I think femto is converging slower than the nano one :)
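
For reference, here is a rough sketch of the kind of schedule nanoGPT uses (linear warmup followed by cosine decay); the constants below are illustrative and not taken from either project:

use std::f32::consts::PI;

// Hedged sketch of a warmup + cosine-decay learning-rate schedule.
fn learning_rate(step: usize) -> f32 {
    let (max_lr, min_lr) = (1e-3_f32, 1e-4_f32);
    let (warmup, decay_steps) = (100usize, 5000usize);
    if step < warmup {
        // Linear warmup from ~0 to max_lr.
        return max_lr * (step + 1) as f32 / warmup as f32;
    }
    if step >= decay_steps {
        return min_lr;
    }
    // Cosine decay from max_lr down to min_lr.
    let progress = (step - warmup) as f32 / (decay_steps - warmup) as f32;
    let coeff = 0.5 * (1.0 + (PI * progress).cos()); // goes 1.0 -> 0.0
    min_lr + coeff * (max_lr - min_lr)
}

fn main() {
    for step in [0usize, 50, 100, 1000, 5000] {
        println!("step {step}: lr = {:.6}", learning_rate(step));
    }
}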


keyvank commented Jun 5, 2023

@cutoken Yeah something is definitely wrong with the optimizer😨 Could you plz send me your training data when you got a chance?

keyvankambakhsh@gmail.com


cutoken commented Jun 5, 2023

Sent it. Hope you can figure it out. I'm going to run the model with these same settings continuously for a couple of days to see if the loss eventually comes down and we get output equivalent to nanoGPT.


keyvank commented Jun 5, 2023

@cutoken You are an angel. Thank you!


cutoken commented Jun 7, 2023

Ran it for 2 days or so. Unfortunately the loss hasn't fallen below 1.7 (it's below 1.8 consistently). Sample output:

Giring city. Wherer I is thou losan, You come.

SERIANA:
We is the kis o parst ict as lipp I'd ou nord thought Vearnce
Wiveie ve are minder. Wis nushe shall wome
Frtue quien of hou alok:' avre we year stake,
we ususore sch have prinut to our fileng


keyvank commented Jun 7, 2023

@cutoken Looks impressive. Have you tried the new optimizations? We are like x10 faster now.

P.s. plz join our discord! https://discord.gg/wTJFaDVn45


cutoken commented Jun 9, 2023

Nice! Maybe we could implement BPE or use the corresponding crate for encoding, @keyvank. That would let us get even better results and faster training times.

@pcranaway

@cutoken I've implemented Tokenizers for Sentencepiece and Hugging Face's tokenizers in my fork

It's experimental and @keyvank most likely will want to have an in-house implementation of these tokenizers without using external libraries, though.

I'll try training using SentencePiece + your optimizations and will report back.


keyvank commented Jun 9, 2023

@pcranaway I won't insist on writing our own tokenizer, though. We can perhaps write our own sentencepiece model “reader” (probably just a hashmap?), something called MappingTokenizer for example, which would be filled by sentencepiece or similar libraries.
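
A rough sketch of what such a MappingTokenizer could look like (the vocab file format and the greedy longest-match loop here are assumptions, not an existing implementation):

use std::collections::HashMap;

// Sketch: a plain token->id map plus a greedy longest-match encoder.
struct MappingTokenizer {
    token_to_id: HashMap<String, usize>,
    id_to_token: Vec<String>,
}

impl MappingTokenizer {
    /// Build from lines of "token<TAB>id" (one possible export format).
    fn from_vocab(lines: &str) -> Self {
        let mut token_to_id = HashMap::new();
        let mut id_to_token = Vec::new();
        for line in lines.lines() {
            if let Some((tok, id)) = line.split_once('\t') {
                let id: usize = id.trim().parse().unwrap();
                token_to_id.insert(tok.to_string(), id);
                if id >= id_to_token.len() {
                    id_to_token.resize(id + 1, String::new());
                }
                id_to_token[id] = tok.to_string();
            }
        }
        Self { token_to_id, id_to_token }
    }

    /// Greedy longest-match tokenization over the mapping.
    fn tokenize(&self, text: &str) -> Vec<usize> {
        let max_len = self.token_to_id.keys().map(|t| t.len()).max().unwrap_or(1);
        let mut out = Vec::new();
        let mut rest = text;
        while !rest.is_empty() {
            let mut matched = false;
            for len in (1..=max_len.min(rest.len())).rev() {
                if !rest.is_char_boundary(len) {
                    continue;
                }
                if let Some(&id) = self.token_to_id.get(&rest[..len]) {
                    out.push(id);
                    rest = &rest[len..];
                    matched = true;
                    break;
                }
            }
            if !matched {
                // Unknown character: skip it (a real tokenizer would emit <unk>).
                let mut it = rest.chars();
                it.next();
                rest = it.as_str();
            }
        }
        out
    }

    fn untokenize(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&t| self.id_to_token[t].as_str()).collect()
    }
}

fn main() {
    let vocab = "He\t0\n wh\t1\no\t2\n \t3\nw\t4\nh\t5\ne\t6";
    let tok = MappingTokenizer::from_vocab(vocab);
    let ids = tok.tokenize("He who");
    println!("{:?} -> {:?}", ids, tok.untokenize(&ids));
}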

@pcranaway

> We can perhaps write our own sentencepiece model “reader” (probably just a hashmap?), something called MappingTokenizer for example, which would be filled by sentencepiece or similar libraries.

that's basically what I meant! we can surely implement a SentencePiece reader ourselves


keyvank commented Jun 9, 2023

@pcranaway let’s do this!


cutoken commented Jun 10, 2023

How is the training going @pcranaway ? Hope loss has come down and you are able to see some good outputs!


keyvank commented Jun 10, 2023

@cutoken Haven't you seen the updated README yet?


cutoken commented Jun 10, 2023

omg! That's insane! And in just 10 hours. What model config were you using? How low did the loss manage to drop?

@pcranaway

> How is the training going @pcranaway ? Hope loss has come down and you are able to see some good outputs!

I got disappointed because it was all random words and stopped (also I've been training for a few days and my electricity bill will probably make me homeless)


keyvank commented Jun 10, 2023

@cutoken Check out the model along with train_data on reddit branch


cutoken commented Jun 10, 2023

@pcranaway now I'm worried about my elec bill :D
With a good model config and sentencepiece I'm sure this can be done quite easily!

@keyvank I'll check it out. It would be nice if we could dump the whole training state into a single shareable file for inference, and also have a dedicated executable or command-line option on the existing one for easy inference. With how fast this is training, I wouldn't be surprised if people start using it to get minor real-world tasks done!


keyvank commented Jun 10, 2023

@cutoken That's a really great idea! We can even have a pre-trained model on english dataset, and just let users fine-tune it with their own dataset


cutoken commented Jun 10, 2023

Ya, exactly. Remember that there are barely any from-scratch 'usable' models on this front in the wild; everything is derived from a certain grass-eating animal! :D


cutoken commented Jun 10, 2023

Hi @keyvank,
I'm trying to join the sentencepiece bandwagon, but unfortunately when I try to bring the reddit branch changes into the main branch, I get the exception below. The reddit branch itself works fine. This is with the Shakespeare dataset (which works great on the reddit branch):


Vocab-size: 128 unique characters
Number of parameters: 832896
Generating text:
thread 'main' panicked at 'assertion failed: ind < self.shape[0]', /home/gocnet/nn/femtoGPT/src/tensor/mod.rs:38:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
   1: core::panicking::panic_fmt
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:117:5
   3: femto_gpt::tensor::TensorOps::get
   4: femto_gpt::gpt::GPT<O,R>::infer
   5: femto_gpt::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Settings are:


    let batch_size = 32;

    let num_tokens = 64;
    let vocab_size = tokenizer.vocab_size();
    let embedding_degree = 64;
    let num_layers = 2;
    let num_heads = 2;
    let head_size = embedding_degree / num_heads;
    let dropout = 0.0;

Looks like there's some bug in the infer code in main.rs compared to the reddit branch. The code below works well, while the one on the main branch doesn't:


    println!("Generating text:");

    for _ in 0..50 {
        // Generate 20 tokens from the prompt "He who" with the currently
        // trained model (run 50 times to sample several continuations).
        println!(
            "{}",
            tokenizer.untokenize(&gpt.infer(&tokenizer.tokenize("He who"), 20, 0.7, |_ch| {}))
        );
    }


keyvank commented Jun 10, 2023

@cutoken Can you run the reddit branch properly, without integrating anything into the main branch?

Edit: Oh sorry, I just saw you said the reddit branch is fine. I will try rebasing the reddit branch onto main today.


cutoken commented Jun 10, 2023

I've already run the reddit branch (with Shakespeare, as I want everything to be comparable) :)

This sentencepiece thing is quite good. The loss takes a while to stabilize, but the quality even at a higher loss is way better (as expected, since a character-level GPT has to get a lot more right to look correct). After the above infer-code modification, the main branch also works fine. I'm using a 1L/1H model to compare things quickly (smaller models obviously don't move past a certain loss value).

I gave "He who" as input and look what I got:


He whose my such To more to Rome
He whose point, good a glass,
He whose have what I can country,
He whom, That we should in the grou
He whom inter me gentleman: And
He whose honour! Shall us upon th
He whose is a prove my grown,
He whom I have purse: Sir, I w
He whose not with not, And thou d
He wholess of the fair. DUKE VIN
He whose have dold, good inst
He whole god secress is my mothe
He whom the rest at make a place.
He wholess against be me most g
He whose shall see, give thee, si
He whose me stir. ANTIGONUS:
He whose your such beg, And been
He whose that you, Believe, that'


keyvank commented Jun 10, 2023

@cutoken Did you use a new sentencepiece vocab trained on the Shakespeare dataset, or did you use the model for the Reddit dataset? This is really good, I like it :)


cutoken commented Jun 10, 2023

I generated a new sentencepiece vocab on the Shakespeare dataset with a vocab size of 128. We'll have to think about how to package sentencepiece with femto. While it is good, I hope it won't end up bringing Python-ecosystem horrors to femto, which so far is simple to run with no dependencies. Probably shipping the whole lib would make sense.

@pcranaway

@cutoken As you said, that error was because some words that occurred in the dataset were not in the vocab.

I'm working on an in-house sentencepiece implementation; maybe we can make it generate a vocab and train a sentencepiece model inside femto if it's fast enough.


cutoken commented Jun 11, 2023

Hi @keyvank,
Let's say I want to add a new decoder layer (the one that gets constructed as part of the 0..num_layers loop) at run time, after the gpt::new() call. How do I go about it? As I understand it, you are just pushing the computations one by one with an incrementing tensor id, so adding a layer at a later point in time would need the ids of the following layers to be incremented as well (for example, adding one more decoder layer along with all its sub-layers like attention means incrementing the vocab out and the other variables outside the for loop?).

Also, why keep the computations in a BTree when in reality it's being used more like a Vec, since we aren't even using the id against which each computation is stored (please correct me if I missed something :) )


keyvank commented Jun 11, 2023

Hey @cutoken, let's continue these kinds of questions in separate GitHub issues; that will help us organize things. Could you make a copy?


cutoken commented Jun 13, 2023

Hi @keyvank,
I see that we are calculating gradients for the input layers, and also adjusting the input layers' weights during the training step. Is there a reason for this? My naive understanding is that we only change the weights and biases of the intermediate layers so that we eventually get the optimal solution for the same inputs and outputs. Let me know why we do this.


keyvank commented Jun 13, 2023

@cutoken
Hey! We are essentially letting the model find the best word-to-vector encoding for the words. Words that have similar meanings will be closer in this vector space.

Learn more about this by researching "embedding layers".


cutoken commented Jun 13, 2023

hi @keyvank ,
Is batch_size supposed to be passed as 1 here:

let (xs, ys) = sample_dataset(dataset, 1, self.num_tokens, &mut rng);


keyvank commented Jun 13, 2023

@cutoken Each CPU thread is processing 1 instance, that's why I passed 1 here
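
Roughly the idea, as a hedged sketch (fake_grad stands in for the real forward/backward pass; this is not the actual femtoGPT code): each rayon task handles a single sample, and the per-sample gradients are averaged afterwards.

use rayon::prelude::*;

// Placeholder for a real forward/backward pass on one sample.
fn fake_grad(sample: usize) -> Vec<f32> {
    vec![sample as f32; 4]
}

fn main() {
    let batch_size = 32usize;

    // One sample per rayon task: the effective batch is split across threads.
    let grads: Vec<Vec<f32>> = (0..batch_size)
        .into_par_iter()
        .map(fake_grad)
        .collect();

    // Average the per-sample gradients into one update.
    let mut avg = vec![0.0f32; 4];
    for g in &grads {
        for (a, v) in avg.iter_mut().zip(g) {
            *a += v / batch_size as f32;
        }
    }
    println!("{:?}", avg);
}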


cutoken commented Jun 14, 2023

Okay, one more question (apologies for so many; until now I had only used femto, whereas now I'm trying to dig into the code and understand it):

For positional encoding as per the Transformer paper, we are supposed to calculate sin and cos values for each position (for the even and odd dimensions respectively), add them to the embeddings of each token, and feed that to the first layer. While I see we are looking up the embeddings, I couldn't find where we are doing this sine/cos magic.


keyvank commented Jun 14, 2023

@cutoken
We don't do that here. We let the model discover a code for each position itself (dynamically, via the learned pos_embedding). While this is simpler, giving it a fixed pattern like sine/cosine would probably make training faster and reduce the number of parameters.
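
For reference, a small standalone sketch of the fixed sinusoidal scheme from the paper (not wired into the codebase): even dimensions get sin, odd dimensions get cos, at geometrically spaced frequencies.

// Sinusoidal positional encoding: one vector of length embedding_degree per position.
fn positional_encoding(num_tokens: usize, embedding_degree: usize) -> Vec<Vec<f32>> {
    (0..num_tokens)
        .map(|pos| {
            (0..embedding_degree)
                .map(|i| {
                    // Frequency depends on the dimension pair (i/2), per the paper.
                    let freq = 10000f32.powf(-((i / 2 * 2) as f32) / embedding_degree as f32);
                    let angle = pos as f32 * freq;
                    if i % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let pe = positional_encoding(64, 64);
    println!("{:?}", &pe[1][..4]); // encoding of position 1, first few dims
}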


keyvank commented Jun 14, 2023

Actually let’s implement this!


cutoken commented Jun 14, 2023

Once you do let me know. Always ready to test the new things! :D

Repository owner deleted a comment from ULis3h Feb 23, 2024