
Unsupported tokenizer 'OpenAI.BPE' #1049

Open
jeswan opened this issue Sep 17, 2020 · 7 comments

Comments


jeswan commented Sep 17, 2020

Issue by lovodkin93
Thursday Apr 02, 2020 at 15:05 GMT
Originally opened as nyu-mll/jiant#1049


Hello,
I've been trying to preprocess the data as described in the README file located in the probing directory. I ran the following command (taken from that README):
mkdir -p $JIANT_DATA_DIR
./get_and_process_all_data.sh $JIANT_DATA_DIR

and got the following error message:
Traceback (most recent call last):
  File "./retokenize_edge_data.py", line 97, in <module>
    main(sys.argv[1:])
  File "./retokenize_edge_data.py", line 93, in main
    retokenize_file(fname, args.tokenizer_name, worker_pool=worker_pool)
  File "./retokenize_edge_data.py", line 83, in retokenize_file
    for line in tqdm(worker_pool.imap(map_fn, inputs, chunksize=500), total=len(inputs)):
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
ValueError: Unsupported tokenizer 'OpenAI.BPE'

I tried to download the "openai gpt-2" model, which I saw includes the tokenizer mentioned, but it appears to require Python 3.7, while the jiant environment runs on an older version.
Has anyone seen this error before, or knows how to solve it?
@iftenney


jeswan commented Sep 17, 2020

Comment by iftenney
Friday Apr 03, 2020 at 14:35 GMT


@pyeres @pruksmhc maybe this got renamed recently?

In the meantime, unless you're trying to probe GPT-1, you can just comment out this line: https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35
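That workaround can also be scripted. A minimal sketch using sed, demonstrated on a throwaway file rather than the real script (in the repo, the target would be line 35 of probing/get_and_process_all_data.sh):

```shell
# Generate a 40-line stand-in script so the sketch is self-contained.
seq 1 40 | sed 's/^/echo step /' > /tmp/demo.sh

# Comment out line 35 in place -- the same edit you would make to
# probing/get_and_process_all_data.sh to skip the GPT-1 retokenization step.
sed -i '35s/^/# /' /tmp/demo.sh

sed -n '35p' /tmp/demo.sh   # line 35 now starts with "# "
```

Note that GNU sed's `-i` edits in place; on BSD/macOS sed the equivalent is `sed -i ''`.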


jeswan commented Sep 17, 2020

Comment by pyeres
Friday Apr 03, 2020 at 15:07 GMT


Looks like this is the result of PR nyu-mll/jiant#881 "Replacing old GPT implement with the one from huggingface pytorch transformers". @HaokunLiu, can you take a look?


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Friday Apr 03, 2020 at 15:10 GMT


Okay. You can either choose auto as your tokenizer or simply use the same string as your input_module.



jeswan commented Sep 17, 2020

Comment by lovodkin93
Friday Apr 03, 2020 at 16:51 GMT


Okay. You can either choose auto as your tokenizer or simply use the same string as your input_module

I didn't quite follow you - where should I choose the auto tokenizer? Also, what is the auto-tokenizer? And what do you mean by using the same string as my input_module?
@HaokunLiu


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Friday Apr 03, 2020 at 16:55 GMT


Oh, sorry. I thought it was the main program. For this preprocessing program, when you are using GPT, choose openai-gpt as your tokenizer; when you are using GPT-2, choose gpt2-medium.



jeswan commented Sep 17, 2020

Comment by lovodkin93
Friday Apr 03, 2020 at 16:58 GMT


Oh, sorry. I thought it was the main program. For this preprocessing program, when you are using GPT, choose openai-gpt as your tokenizer; when you are using GPT-2, choose gpt2-medium.

You mean replace the "OpenAI.BPE" in the following line with "openai-gpt" or with "gpt2-medium"?
https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35
@HaokunLiu


jeswan commented Sep 17, 2020

Comment by HaokunLiu
Friday Apr 03, 2020 at 17:33 GMT


Exactly.
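In script form, the fix amounts to a one-line substitution. A hedged sketch, assuming line 35 passes the literal string OpenAI.BPE as the tokenizer name (shown here on a stand-in file rather than the real script, since the exact arguments of the real line are an assumption):

```shell
# Stand-in for line 35 of probing/get_and_process_all_data.sh; the actual
# arguments on the real line are an assumption here.
printf 'python retokenize_edge_data.py OpenAI.BPE "$JIANT_DATA_DIR"\n' > /tmp/line35.sh

# For GPT-1 probing, substitute openai-gpt (use gpt2-medium for GPT-2 instead).
sed -i 's/OpenAI\.BPE/openai-gpt/' /tmp/line35.sh

cat /tmp/line35.sh
```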

