
Unsupported tokenizer 'OpenAI.BPE' #1049

Open
jeswan opened this issue Sep 17, 2020 · 7 comments

Comments


jeswan commented Sep 17, 2020

Issue by lovodkin93
Thursday Apr 02, 2020 at 15:05 GMT
Originally opened as nyu-mll/jiant#1049


Hello,
I've been trying to preprocess the data as described in the README file located in the probing directory. I ran the following command (taken from that README):
mkdir -p $JIANT_DATA_DIR
./get_and_process_all_data.sh $JIANT_DATA_DIR

and got the following error message:
Traceback (most recent call last):
  File "./retokenize_edge_data.py", line 97, in <module>
    main(sys.argv[1:])
  File "./retokenize_edge_data.py", line 93, in main
    retokenize_file(fname, args.tokenizer_name, worker_pool=worker_pool)
  File "./retokenize_edge_data.py", line 83, in retokenize_file
    for line in tqdm(worker_pool.imap(map_fn, inputs, chunksize=500), total=len(inputs)):
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
ValueError: Unsupported tokenizer 'OpenAI.BPE'

I tried to download the "openai gpt-2" model, which I saw includes the tokenizer mentioned, but it appears to require Python 3.7, while the jiant environment runs on an older version.
Has anyone seen this error before, or knows how to solve it?
@iftenney


jeswan commented Sep 17, 2020

Comment by iftenney
Friday Apr 03, 2020 at 14:35 GMT


@pyeres @pruksmhc maybe this got renamed recently?

In the meantime, unless you're trying to probe GPT-1, you can just comment out this line: https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35
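That workaround can also be scripted. A minimal sketch using sed, demonstrated on a throwaway file rather than the real script (in the repo, the target would be line 35 of probing/get_and_process_all_data.sh):

```shell
# Generate a 40-line stand-in script so the sketch is self-contained.
seq 1 40 | sed 's/^/echo step /' > /tmp/demo.sh

# Comment out line 35 in place -- the same edit you would make to
# probing/get_and_process_all_data.sh to skip the GPT-1 retokenization step.
sed -i '35s/^/# /' /tmp/demo.sh

sed -n '35p' /tmp/demo.sh   # line 35 now starts with "# "
```

Note that GNU sed's `-i` edits in place; on BSD/macOS sed the equivalent is `sed -i ''`.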


jeswan commented Sep 17, 2020

Comment by pyeres
Friday Apr 03, 2020 at 15:07 GMT


Looks like this is the result of PR nyu-mll/jiant#881 "Replacing old GPT implement with the one from huggingface pytorch transformers". @HaokunLiu, can you take a look?


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Friday Apr 03, 2020 at 15:10 GMT


Okay. You can either choose auto as your tokenizer or simply use the same string as your input_module.



jeswan commented Sep 17, 2020

Comment by lovodkin93
Friday Apr 03, 2020 at 16:51 GMT


Okay. You can either choose auto as your tokenizer or simply use the same string as your input_module

I didn't quite follow you - where should I choose the auto tokenizer? Also, what is the auto-tokenizer? And what do you mean by using the same string as my input_module?
@HaokunLiu


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Friday Apr 03, 2020 at 16:55 GMT


Oh, sorry. I thought it was the main program. For this preprocessing program, when you are using GPT, choose openai-gpt as your tokenizer; when you are using GPT-2, choose gpt2-medium.



jeswan commented Sep 17, 2020

Comment by lovodkin93
Friday Apr 03, 2020 at 16:58 GMT


Oh, sorry. I thought it was the main program. For this preprocessing program, when you are using GPT, choose openai-gpt as your tokenizer; when you are using GPT-2, choose gpt2-medium.

You mean replace the "OpenAI.BPE" in the following line with "openai-gpt" or with "gpt2-medium"?
https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35
@HaokunLiu


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Friday Apr 03, 2020 at 17:33 GMT


Exactly.
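In script form, the fix amounts to a one-line substitution. A hedged sketch, assuming line 35 passes the literal string OpenAI.BPE as the tokenizer name (shown here on a stand-in file rather than the real script, since the exact arguments of the real line are an assumption):

```shell
# Stand-in for line 35 of probing/get_and_process_all_data.sh; the actual
# arguments on the real line are an assumption here.
printf 'python retokenize_edge_data.py OpenAI.BPE "$JIANT_DATA_DIR"\n' > /tmp/line35.sh

# For GPT-1 probing, substitute openai-gpt (use gpt2-medium for GPT-2 instead).
sed -i 's/OpenAI\.BPE/openai-gpt/' /tmp/line35.sh

cat /tmp/line35.sh
```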

