implement gsam in jax #8
Conversation
Thanks for addressing the comments. There are a few new ones regarding the config and my understanding of the schedule.
if lr_max == lr_min:
  sam_rho = rho_max
else:
  sam_rho = rho_min + (rho_max - rho_min) * (lr - lr_min) / (lr_max - lr_min)
From #4:
Lucas:
This makes me wonder (sorry I haven't read the GSAM paper), do you really want to linearly interpolate rho, or would you ideally want to apply the same scheduling function as the learning-rate, e.g. cosine for example?
Juntang:
Sorry for the confusion. I want to apply the same scheduler but with a different scale / upper and lower bound.
In the paper I only used a linear lr scheduler for experiments, and in theory (the proofs part of the paper) the two schedules are both assumed to be inverse sqrt.
Ah, this is really unfortunate; there should be a much cleaner way to implement this, e.g. using a squashed version of the sched_fns from the trainer!
But if you don't want to change the code to do this, then you should put an
assert config.schedule.decay_type == "linear", "GSAM only implemented for linear lr schedule"
into train.py
and add a little comment here in the code that goes something like
# Ideally, we'd use the same schedule as the lr here, just stretched to a different min/max.
# However, here we hard-code the linear scheduler only for convenience.
Hi, sorry I did not explain this clearly. Suppose the learning rate is lr(t) for step t, and there's an effective rho(t) for each step t. The code restricts rho(t) to be linear w.r.t. lr(t); however, rho(t) is not linear w.r.t. t. If we change lr(t) to some non-linear schedule such as cosine, the code here will generate a rho(t) that is also cosine-shaped, except that lr_max != rho_max and lr_min != rho_min.
I tried to use a separate sched_fn for rho(t), but it seems some schedules such as cosine do not have the option to specify a non-zero min value rho_min.
I wonder if you have any suggestions for a neater version using a sched_fn with a configurable min value, or should we keep the schedule code here?
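For reference, a small sketch that just restates the quoted interpolation as a standalone function (names are illustrative): it shows that rho(t) inherits the shape of whatever lr(t) schedule is used, rescaled from [lr_min, lr_max] to [rho_min, rho_max].

def rho_schedule(lr, lr_min, lr_max, rho_min, rho_max):
  # rho follows the shape of the lr schedule, rescaled to [rho_min, rho_max].
  if lr_max == lr_min:  # constant lr -> use the maximal rho
    return rho_max
  return rho_min + (rho_max - rho_min) * (lr - lr_min) / (lr_max - lr_min)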
config.wd = 0.3  # default is 0.0001; paper used 0.3, effective wd=0.3*lr
config.schedule = dict(
    warmup_steps=10_000,
    decay_type='linear',
Maybe append a short inline comment # only linear supported
# config.optax = dict(beta2_cap=0.95)

config.lr = 0.003
config.wd = 0.3  # default is 0.0001; paper used 0.3, effective wd=0.3*lr
If I understand you correctly, this is actually not correct anymore. We changed the code to always use "decoupled" values now. So you should specify here the effective wd you want, which is independent of the lr value (e.g. I think you want 0.001 here, as in 0.3 * 0.003 ≈ 0.001?).
Thanks for pointing it out. Since the old version of the code uses lr * wd as the effective wd, and lr changes with a schedule, the effective wd also has a schedule. Switching to the new configuration, is an effective wd schedule available? I'm concerned that if the effective wd schedule is disabled, using the same hyper-params might not reproduce the results.
Regarding running experiments, I could give it a try at some point, but it's definitely impossible to do so this week. I would run exactly the config you provide, and you need to tell me exactly which number in which table of the paper it's supposed to reproduce.
Thanks a lot! If the effective wd schedule is not figured out, I might need to find some way to either implement the old-version weight decay schedule or tune the hyper-params with the new setting. I wonder if you could point Ting to the docs on how to run this repository internally; I'll submit the code externally, so we could re-run some experiments to reproduce?
Hey, sorry, I got distracted by something urgent to finish; I will get back to this in one of the next two weeks and am optimistic we can get it to work well :) Edit: however, you did not yet tell me which exact number from the paper the config should be reproducing?
Thanks for the response. Sorry about the missing number; it's supposed to reproduce the 76.8 for ViT-B/32 in Table 1 of https://openreview.net/pdf?id=edONMAnhLu- . I'm not fully sure about the new wdecay and lr scheduler. In the old version, the lr scheduler is a single function (here the lr scheduler func seems to be chained with a bunch of other schedulers); in the old version, wdecay is multiplied by lr, so wdecay actually follows a schedule rather than being constant. Is the new wdecay set to a constant?
Hi again, I am now giving it a try, and there were a few more issues remaining. I have written them up as comments, as well as given instructions on how to fix them. I am now able to actually run the trainer and the config, and will train it overnight and see if it already reproduces the result or not.
I'll try a couple of weight decay values to see which is the right one, but FYI, the weight decay still follows the schedule of the lr in the new code (linear decay in this case); it's just that the base lr is not multiplied into it.
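To restate the relationship described above as a small sketch (sched(t) stands for the lr schedule multiplier; the numbers are the ones from this thread, and the function names are made up):

base_lr = 0.003

def effective_wd_old(t, sched, wd=0.3):      # old code: wd coupled to the base lr
  return wd * base_lr * sched(t)

def effective_wd_new(t, sched, wd=0.0009):   # new code: decoupled, but still follows the lr shape
  return wd * sched(t)

# With the new wd set to 0.3 * 0.003 = 0.0009, the two give the same schedule.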
def get_config(arg=None):
  """Config for training."""
  arg = bvcc.parse_arg(arg, variant='B/32', runlocal=False, aug='')
aug='' is not used anymore and should be removed.
This configuration makes use of the "arg" to get_config to select which model
to run, so a few examples are given below:

Run training of a B/16 model:
All of these example commands need to be updated to this config file.
rho_min=0.1,
alpha=0.6,
adaptive_perturbation=False,
minimize_fp=True,
Here we need to add two more parameters:
lr_max=config.get_ref('lr'),
lr_min=config.schedule.get_ref('linear_end'),
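For what it's worth, a sketch of how the resulting config.gsam block inside get_config could look with these two references added (the rho_max value shown is a placeholder assumption, not from this thread; the other fields are the ones quoted above):

config.gsam = dict(
    rho_max=0.6,  # placeholder value, not from this thread
    rho_min=0.1,
    alpha=0.6,
    lr_max=config.get_ref('lr'),
    lr_min=config.schedule.get_ref('linear_end'),
    adaptive_perturbation=False,
    minimize_fp=True,
)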
opt_cpu = jax.jit(tx.init, backend="cpu")(params_cpu)
sched_fns_cpu = [jax.jit(sched_fn, backend="cpu") for sched_fn in sched_fns]

@partial(jax.pmap, axis_name="batch", donate_argnums=(0, 1))
You need to add , static_broadcasted_argnums=(5,)) here or this will not work: step is a scalar, so we need to tell pmap that, or it expects it to be replicated. So the final line should look like:
@partial(jax.pmap, axis_name="batch", donate_argnums=(0, 1),
         static_broadcasted_argnums=(5,))
def update_fn...
Wait, no, that's not what we should do, or it will recompile a new function every step 😅 Instead, we should indeed replicate the step we're passing, for example by passing flax.jax_utils.replicate(step) at the call site.
However, this creates a synchronization point, blocks prefetching, and creates a transfer at each step. Instead, we should really use the step number that is already replicated inside the optimizer. I'll find out how exactly tomorrow.
Sorry, I lost track of this. What we need to do is to not pass any step at all to the function, but instead get the step like this, around line 208:
step = bv_optax.get_count(opt)
learning_rate = sched_fns[0](step) * config.lr
However, it turns out there's a minor issue with get_count
so that it can't be called inside a compiled function. I have a fix for it, but let's not roll too much into this PR, you could leave this as it is currently, and I'll fix it myself after the PR is merged.
Get the GSAM gradient (https://openreview.net/pdf?id=edONMAnhLu-) of the loss function.
Args:
  loss_fn: the loss function.
  base_opt: the base optimizer.
[1/2] This (base_opt.target used below) does not work anymore with optax. Although it looks like you really use base_opt only for getting to the params, so you can replace the argument by an actual params argument, and then use that everywhere you currently use base_opt.target in this function.
    logits=logits, labels=labels)

learning_rate = sched_fns[0](step)
l, grads = gsam_gradient(loss_fn=loss_fn, base_opt=opt, inputs=images, targets=labels,
[2/2] and then here you would pass params=params instead of base_opt=opt.
Oh, and you have a bunch of small issues like wrong indentation, trailing spaces, etc. It would be helpful if you could run pylint with this config over it, then I don't need to fix these later on.
And another minor nitpick: could you rename the config from
# Per-worker perturbation.
if adaptive_perturbation:
  param_sam = jax.tree_multimap(lambda a, b: a + jnp.abs(a) * sam_rho * b / (g_clean_length + eps),
jax.tree_multimap does not exist anymore. It's now just jax.tree_map.
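A tiny self-contained sketch of the rename (the pytree arguments and shapes here are made up, since the quoted line is truncated above; the real code applies this to the model params and the clean gradient):

import jax
import jax.numpy as jnp

params = {"w": jnp.ones((2, 2)), "b": jnp.zeros((2,))}
g_clean = jax.tree_map(jnp.ones_like, params)
sam_rho, g_clean_length, eps = 0.1, 1.0, 1e-12

# jax.tree_map replaces the removed jax.tree_multimap; it already accepts
# multiple pytrees with matching structure.
param_sam = jax.tree_map(
    lambda a, b: a + jnp.abs(a) * sam_rho * b / (g_clean_length + eps),
    params, g_clean)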
Thanks a lot for the experiments; it seems the config is not correct. I'll discuss it with Ting and see if we can directly compare the config file with the one we used for experiments.
So far, no luck: none of (sigmoid->softmax, head-bias init, ) made it any better. Then, I also tried the following things:
So, I tried all the ideas I had regarding configuration, and at this point wonder if maybe there's a bug in the implementation. Could you please try on your side? Note that you don't need TPU access to run big_vision; it works great on GPUs too, and we did update the README with instructions about that. Let me know when you figure out a setting/code change such that the loss no longer explodes in the first hundreds of steps, and I can then try longer runs for you again. (I'll also ping Ting my runs internally.)
I forgot to mention, but I also tried a run with the Adam 1st momentum not in bfloat16 but in regular float32, and it makes no difference. Note this bfloat16 really just affects the 1st momentum buffer, nothing else.
Thanks a lot for the feedback and experiments. I'll dig into it with Ting and will post the working version here. Sorry for all the trouble with this PR.
No worries, I will be happy and thankful to have up-to-date GSAM and SAM in the codebase!
I also tried to run this with alpha=0, and it looks slightly better at the start, but still explodes after 1-2k steps.
I just noticed in one of your changes a few days ago, you did find a bug:
This looks very promising! So I patched it in and tried another run on top of the last one I mentioned here. It looks a lot better! It doesn't explode, and reaches 75.2/81.8/61.0 validation/real/v2 accuracy after 90 epochs. This is not yet the expected 76.8/82.7/63.0 we're trying to reproduce, but it's getting much closer 🥳 However, the missing 1.6% is still significant, so we should find it before merging this. I carefully compared configs (already before, but once again) and didn't find a new discrepancy.
@lucasb-eyer Thanks so much for running experiments! I'm also running an experiment on ViT-S/32, but it takes much longer on my GPU machine; I will also post results here after it finishes. The results for SAM are copied from https://arxiv.org/abs/2106.01548 Table 2. The gap of 1.6% might come from:
In previous updates, I made a few changes that potentially make a difference, including the following:
(I'm not sure if 4 is necessary, just following my old code after meeting with Ting.) For 1, it's my fault that I did not realize it. For 2 and 3, it's also caused by my mistake with the lr schedule. To reproduce the paper results, the absolute learning rate is a linear decay. I have merged the changes above in the latest PR; let me know if you have time to take a look. I'm also reproducing ViT-S/32 results on my machine; it's a bit slow, but I will post them here once I get results. Thanks again for your help with this!
No need to blame yourself alone; I also should have noticed ALL of these during review and testing, but didn't :) Happy you found them now! Let me start some runs right away, for 300ep, and report back later today. I actually ran all experiments on 8x8, but am curious why TPU topology would influence the results?
I have good news. Running for 300ep largely closes the remaining gap. Here are my results:
setting | wd | val | real | v2 |
---|---|---|---|---|
your paper | 0.0009 | 76.8 | 82.7 | 63.0 |
gsam | 0.0009 | 77.18 | 82.77 | 63.24 |
gsam | 0.001 | 77.35 | 83.04 | 64.03 |
gsam (a=0) | 0.0009 | 76.02 | 81.56 | 62.31 |
sam (a=0, rho=0.15) | 0.0009 | 75.56 | 81.12 | 60.97 |
sam for vit/mixer paper | 0.0009 | 73.6 | 80.3 | 60.0 |
I am relatively sure wd=0.0009 is what you ran, but back then it was expressed differently in our configs, and the number you used was prettier. So I also ran 0.001 which is very close and a pretty number too =)
I only left a few more small comments about the code to address, and after that we can merge!
Note: we have further refactored the code a little bit since, but it is fine for you to submit the code as-is, and I will cleanup/update and test once more on my side afterwards, you've done more than enough already!
rho_min=0.1,
alpha=0.6,
adaptive_perturbation=False,
minimize_fp=True,
Those two (adaptive_perturbation and minimize_fp) are set to their default values. From the doc-comment and paper, it does not seem like something a regular user would tune (contrary to rho and alpha), so let's remove them from the config?
perturbation is element-wise multiplied by abs(p).
minimize_fp: if True, min(f_p, h), original GSAM;
if False, min(f, h), where f is the clean loss.
f_p is the perturbed loss, h is the surrogate gap.
The doc comments of adaptive_perturbation and minimize_fp both explain what they do in very technical terms, but it would be good to have a short high-level recommendation at the end as to when or why one would want to change them.
For example (the example is clearly wrong, because I don't understand them, but just to show the spirit of what I'm looking for):
adaptive_perturbation: if False, same perturbation as SAM,
treat all parameters as a single vector,
perturbation norm is calculated as the norm of the whole vector;
if True, for each parameter tensor p,
perturbation is element-wise multiplied by abs(p).
Try setting this to False when you use least-squares loss instead of KL-based ones.
minimize_fp: if True, min(f_p, h), original GSAM;
if False, min(f, h), where f is the clean loss.
f_p is the perturbed loss, h is the surrogate gap.
You probably want to leave this at its default unless you know what you're doing.
return getattr(u, config.get("loss", "sigmoid_xent"))(
    logits=logits, labels=labels)

learning_rate = sched_fns[0](step) * config["lr"]
Since this is a ConfigDict, it can be the slightly nicer config.lr.
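(For reference, attribute and item access on an ml_collections ConfigDict are equivalent, so this is purely cosmetic:)

import ml_collections

config = ml_collections.ConfigDict({"lr": 0.003})
assert config.lr == config["lr"]  # same value either way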
learning_rate = sched_fns[0](step) * config["lr"]
l, grads = gsam_gradient(loss_fn=loss_fn, params=params, inputs=images,
                         targets=labels, lr=learning_rate, **config["gsam"])
Same here, slightly simpler **config.gsam
Cool, I'm really excited to see the updated results; they outperform the numbers in the paper! One minor thing is that GSAM reduces to SAM requires
For the TPU number, it's because GSAM / SAM performs per-worker perturbation based on the per-worker gradient in
Thanks for your patience overall!
I'll merge it now, and will update the trainer according to the latest refactors early next week, such that it actually works :)
I also just realized that we should add a pointer to this from the README. I'll do so early next week too.
Thanks so much for your help with the debugging and the PR! Regarding the
For per-worker perturbation, the model soup paper seems to contradict the original SAM paper (https://arxiv.org/pdf/2010.01412.pdf, section 4.1). It defines
I'm not quite sure about model soup implementations. In my implementation (and SAM), the process is:
I'm not quite sure about model soup, but I suspect that if it draws an opposite conclusion from the SAM paper, it might come from a different implementation. For example, if it switches the order of 3 and 4 and first performs the per-worker parameter update with the per-worker
If we want to perform synced perturbation, we can add
param_sam is the same for all workers
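In case it helps, a hypothetical sketch of that synced-perturbation variant, assuming it runs inside the trainer's pmap(axis_name="batch") and reusing the g_clean / sam_rho / eps names from the snippets quoted earlier (this is not the code in the PR):

import jax
import jax.numpy as jnp

def synced_perturbation(params, g_clean, sam_rho, eps=1e-12):
  # Average the clean gradient across workers first, so every worker computes
  # the same perturbation.
  g_sync = jax.lax.pmean(g_clean, axis_name="batch")
  g_norm = jnp.sqrt(sum(jnp.vdot(g, g)
                        for g in jax.tree_util.tree_leaves(g_sync)))
  # param_sam is now identical on all workers.
  return jax.tree_map(lambda p, g: p + sam_rho * g / (g_norm + eps),
                      params, g_sync)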
Hi, @lucasb-eyer thanks for your review and comments. I reformatted the files and squashed the commits into a new PR (sorry, I messed up the old PR and could not squash commits there). This PR includes:
- GSAM hyper-parameters in config.gsam, calling gsam with l, grads = gsam_gradient(loss_fn=loss_fn, base_opt=opt, inputs=images, targets=labels, lr=learning_rate, **config["gsam"]).
- In big_vision/configs/proj/gsam/vit_1k_gsam_no_aug.py, the network used in the GSAM paper uses pool_type='gap' and rep_size=False, which is different from the default config (see the sketch after this comment).
Regarding reproducing the experiments, I wonder if it's possible for you to run the script (with 8x8 TPU cores to exactly match the paper)? I'm sorry I don't have access to TPU resources since I'm not affiliated with Google now, so I can't run experiments, though the checkpoints and the old-version code that I used were kept on the server. Thanks so much for your code review and help!
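For context, the model settings mentioned above would look roughly like this in a big_vision ViT config (a sketch only; the exact surrounding structure of vit_1k_gsam_no_aug.py is an assumption):

import ml_collections

config = ml_collections.ConfigDict()
config.model_name = 'vit'
# GSAM paper's ViT: global average pooling, no extra pre-logits ("representation") layer.
config.model = dict(variant='B/32', pool_type='gap', rep_size=False)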