
Migration tutorial #1203

Merged 7 commits into pytorch:master on Feb 24, 2021

Conversation

zhangguanheng66 (Contributor)

Add the migration tutorial in the examples/legacy_tutorial folder.

codecov bot commented Feb 23, 2021

Codecov Report

Merging #1203 (906e9cc) into master (db8da95) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1203   +/-   ##
=======================================
  Coverage   73.23%   73.23%           
=======================================
  Files          67       67           
  Lines        3718     3718           
=======================================
  Hits         2723     2723           
  Misses        995      995           

cpuhrsch (Contributor)

Also, can you update the tutorial so it builds against the RC?

cpuhrsch (Contributor) commented Feb 23, 2021

As a side note:

from torch.utils.data import DataLoader
from torchtext.datasets import IMDB

def bucket_iter_func(pool, batch_size=64):
    # Each pool is a shuffled chunk of examples; sort it by text length
    # so each mini-batch contains examples of similar length.
    for rand_item in pool:
        sorted_item = sorted(rand_item,
                             key=lambda x: len(tokenizer(x[1])))  # x is a tuple of (label, text)
        sorted_dataloader = DataLoader(sorted_item, batch_size=batch_size,
                                       shuffle=False,  # shuffle is set to False to keep the sorted order
                                       collate_fn=collate_batch)
        for item in sorted_dataloader:
            yield item

train_iter = IMDB(split='train')
train_list = list(train_iter)
batch_size = 8
rand_pools = DataLoader(train_list, batch_size=batch_size * 100,
                        shuffle=True, collate_fn=lambda x: x)
sorted_train_dataloader = bucket_iter_func(rand_pools, batch_size=batch_size)
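
For reference, the snippet assumes tokenizer and collate_batch are defined elsewhere in the tutorial. A minimal, hypothetical stand-in (the tutorial's own definitions may differ) could be:

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

def collate_batch(batch):
    # batch is a list of (label, text) pairs; tokenize each text and
    # return the labels plus token lists (numericalization and padding omitted)
    labels = [label for label, _ in batch]
    texts = [tokenizer(text) for _, text in batch]
    return labels, texts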

If you just want sublists of size batch_size*100, you can also use Python builtins:

from random import shuffle

shuffle(train_list)
pool_size = batch_size * 100
train_lists = [train_list[i:i + pool_size] for i in range(0, len(train_list), pool_size)]
train_lists = [sorted(pool, key=lambda x: len(tokenizer(x[1]))) for pool in train_lists]
train_lists = sum(train_lists, [])  # Very slow (quadratic) way of flattening
dataloader = DataLoader(train_lists, batch_size=batch_size,
                        shuffle=False,  # shuffle is set to False to keep the sorted order
                        collate_fn=collate_batch)
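
Since sum(lists, []) re-copies the accumulated result on every step, itertools gives a linear-time flatten:

import itertools
train_lists = list(itertools.chain.from_iterable(train_lists))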

zhangguanheng66 (Contributor, Author) commented Feb 23, 2021

> As a side note

Yep, will add it to the tutorial.

zhangguanheng66 merged commit 7d2dbe9 into pytorch:master on Feb 24, 2021