Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
Expected behavior
This is the proposed fix:

```python
import random
from typing import Any

import torch
from datasets import load_dataset


def get_wikitext2(tokenizer: Any, seqlen: int, nsamples: int, split: str = "train"):
    if split == "train":
        data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    elif split == "validation":
        data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ## length of 288059 should be enough
    # text = "".join([" \n" if s == "" else s for s in data["text"][:1000]])
    dataset = []
    for _ in range(nsamples):
        # Keep sampling rows until we find one with at least seqlen tokens.
        while True:
            i = random.randint(0, len(data) - 1)
            text = data[i]["text"]
            if len(tokenizer.tokenize(text)) >= seqlen:
                enc = tokenizer(text, return_tensors="pt")
                break
        # Cut a random seqlen-sized window out of the encoded row.
        i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = enc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        dataset.append({"input_ids": inp, "attention_mask": attention_mask})
    return dataset
```
Inspired by `get_c4` and `get_c4_new`.
No warning is produced.
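The core of the fix is the sampling loop: pick corpus rows at random until one is at least `seqlen` tokens long, then slice a random fixed-size window out of it. A minimal sketch of that loop, using a whitespace split as a stand-in for the real Hugging Face tokenizer (the function and variable names here are illustrative, not part of optimum):

```python
import random


def sample_windows(texts, tokenize, seqlen, nsamples, seed=0):
    # Sketch of the fix's sampling strategy: retry until a row is long
    # enough, then take a random seqlen-sized window from its tokens.
    rng = random.Random(seed)
    samples = []
    for _ in range(nsamples):
        while True:
            ids = tokenize(texts[rng.randrange(len(texts))])
            if len(ids) >= seqlen:
                break
        start = rng.randint(0, len(ids) - seqlen)
        samples.append(ids[start:start + seqlen])
    return samples


# Stand-in tokenizer: whitespace split (the real code uses tokenizer.tokenize).
texts = ["short", "one two three four five six seven eight nine ten"]
out = sample_windows(texts, str.split, seqlen=4, nsamples=3)
assert all(len(s) == 4 for s in out)  # every sample is exactly seqlen tokens
```

Every returned sample has exactly `seqlen` tokens, so nothing longer than the model's maximum ever needs to be assembled into one input.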
Not sure. This was something TheBloke coded back then. Maybe this is because `data[i]["text"]` is pretty long, so it takes a while to find a text < seqlen?
Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
This does not happen, as we slice the tokenized data afterwards.
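The point being made here, as I read it: the warning is emitted by the tokenizer when it encodes a text longer than its `model_max_length` (2048 here); it says nothing about what the model actually receives after slicing. A sketch with plain lists standing in for `enc.input_ids` (the 73218 figure is taken from the warning above):

```python
import random

# Pretend the full encoding of one long text is 73218 tokens, as in the
# reported warning.
input_ids = list(range(73218))
seqlen = 2048

# Mirror the fix's slicing: a random seqlen-sized window.
i = random.randint(0, len(input_ids) - seqlen - 1)
window = input_ids[i:i + seqlen]
assert len(window) == seqlen  # what the model actually sees
```

So even when the tokenizer warns about the full-length encoding, the sliced window handed to the model stays within bounds.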
System Info
optimum version 1.21.4 (latest)

```
# Use the official Python image from the Docker Hub
FROM public.ecr.aws/docker/library/python:3.10-slim
```
Who can help?
No response
Information
Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
Produce warning:
Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
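A plausible reconstruction of why the warning fires, assuming the current loader joins many corpus rows into one string before tokenizing (the commented-out `"".join(...)` line in the proposed fix suggests it does): the joined text encodes to far more tokens than the tokenizer's 2048-token `model_max_length`. Sketch with a whitespace split standing in for the tokenizer:

```python
# Join many short rows into one string, as the original loader appears
# to do, then tokenize the whole thing in a single call.
rows = ["some text"] * 5000
joined = " \n".join(rows)

# Stand-in for tokenizer.tokenize; the real run reported 73218 tokens.
tokens = joined.split()
assert len(tokens) > 2048  # exceeds model_max_length, hence the warning
```

The fix sidesteps this by tokenizing one row at a time and slicing, instead of encoding the concatenated corpus in one pass.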