Hey folks, I was hoping someone could tell me a better way to deal with this issue. I am getting a StopIteration error on the dataset, and I am not clear on how to get around it. Here is a minimal example that reproduces the error. I am using torchtext 0.10.0.

In the real code, I am pulling the AG_NEWS dataset into the train_iter variable, building a vocabulary from that train_iter dataset, and then trying to process batches of that same dataset using a DataLoader with a collate function.

The problem seems to be that I iterate through train_iter once to build the vocabulary with the yield_tokens function, so when I then call next(iter(train_iter)), the iterator has already reached its end. Is there a way to copy train_iter so that I can build the vocabulary from the copy? I can probably write some hacky code to work around this, but I wanted to see if there is a better or more appropriate way.
```python
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

train_iter, test_iter = AG_NEWS()

vocab = build_vocab_from_iterator(yield_tokens(train_iter),
                                  specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print(next(iter(train_iter)))
```
The error message generated is:
```
Exception has occurred: StopIteration
exception: no description
  print(next(iter(train_iter)))
```
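As far as I can tell, this is just ordinary Python iterator exhaustion; a minimal sketch (no torchtext involved) that reproduces the same error:

```python
it = iter([1, 2, 3])
for _ in it:
    pass      # consume everything, as yield_tokens does with train_iter
next(it)      # raises StopIteration: the iterator has already reached its end
```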
Yes, this is a known issue/limitation of the current datasets. We are working on improving this user experience in the next release.

For now, the only way around it is to get a new iterator every time you need one (for instance, inside each epoch). Getting the iterator is expensive only the first time around, due to the download and extraction (if any) of the data; after that, it is returned by the function almost immediately.
So for eg you may do something like this:
fromtorch.utils.dataimportDataLoaderfromtorchtext.datasetsimportAG_NEWScollate_fn=None#replace with actual collate functionfor_innum_epochs:
train_iter=AG_NEWS(split='train')
data_loader=DataLoader(train_iter, batch_size=batch_size, collate_fn=collate_fn)
forlabels, inputindata_loader:
# do something
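Applied to the minimal example above, the workaround would look roughly like this (a sketch; the key change is requesting a fresh iterator from AG_NEWS instead of reusing the exhausted one):

```python
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# First pass: consume one iterator to build the vocabulary.
train_iter = AG_NEWS(split='train')
vocab = build_vocab_from_iterator(yield_tokens(train_iter),
                                  specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Second pass: get a new iterator rather than reusing the exhausted one.
train_iter = AG_NEWS(split='train')
print(next(iter(train_iter)))  # no longer raises StopIteration
```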
@parmeet Oh I see, that is interesting. Thanks for explaining the issue; this helps me understand how to run training now. I will keep this pattern in mind.