
Add XGen info to README and example config #306

Merged
merged 1 commit into main from xgen on Jul 22, 2023

Conversation

ethanhs
Copy link
Contributor

@ethanhs ethanhs commented Jul 21, 2023

I should probably add a note that the tokenizer breaks with the sharegpt prompt format at the moment... maybe a footnote in the README?

@ethanhs ethanhs mentioned this pull request Jul 21, 2023
@NanoCode012
Copy link
Collaborator

Can you clarify what you mean by "break"?

You could add a logging.warn within the validate_config function if it's bad

@ethanhs
Copy link
Contributor Author

ethanhs commented Jul 21, 2023

Can you clarify what you mean by "break"?

I get the following error:

Traceback (most recent call last):
  File "scripts/finetune.py", line 355, in <module>
    fire.Fire(train)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/finetune.py", line 225, in train
    train_dataset, eval_dataset = load_prepare_datasets(
  File ".../llm-research/axolotl/src/axolotl/utils/data.py", line 395, in load_prepare_datasets
    dataset = load_tokenized_prepared_datasets(
  File ".../llm-research/axolotl/src/axolotl/utils/data.py", line 270, in load_tokenized_prepared_datasets
    samples = samples + list(d)
  File ".../llm-research/axolotl/src/axolotl/datasets.py", line 42, in __iter__
    yield self.prompt_tokenizer.tokenize_prompt(example)
  File ".../llm-research/axolotl/src/axolotl/prompt_tokenizers.py", line 345, in tokenize_prompt
    user_token = self._get_user_token()
  File ".../llm-research/axolotl/src/axolotl/prompt_tokenizers.py", line 51, in _get_user_token
    id_or_ids = self.tokenizer.convert_tokens_to_ids("<|USER|>")
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 575, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 588, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
  File ".../.cache/huggingface/modules/transformers_modules/Salesforce/xgen-7b-8k-base/1a0f468309372dbd65c17bde049c1cd35d551c14/tokenization_xgen.py", line 152, in _convert_token_to_id
    return self.encoder.encode_single_token(token)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/tiktoken/core.py", line 226, in encode_single_token
    return self._core_bpe.encode_single_token(text_or_bytes)
KeyError: [60, 124, 85, 83, 69, 82, 124, 62]

Checking for this in validate_config could be nice (ideally it might be possible to work around it?)
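A check like that could be a small helper called from `validate_config`. The sketch below is hypothetical (the helper name and the exact set of required tokens are assumptions, not axolotl's actual code): it probes the tokenizer's vocabulary for the special tokens the sharegpt format relies on and warns instead of crashing later during tokenization.

```python
import logging

def warn_missing_special_tokens(vocab, required=("<|USER|>", "<|ASSISTANT|>")):
    """Return the required special tokens absent from `vocab`, logging a
    warning if any are missing. `vocab` is a token-to-id mapping, e.g. the
    result of tokenizer.get_vocab()."""
    missing = [tok for tok in required if tok not in vocab]
    if missing:
        logging.warning(
            "Tokenizer is missing special tokens %s; the sharegpt prompt "
            "format will fail when converting them to ids.",
            missing,
        )
    return missing
```

With the XGen tokenizer, `"<|USER|>"` is not in the vocabulary, so the helper would return it in the missing list rather than raising the `KeyError` seen above.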

@winglian
Copy link
Collaborator

@ethanhs #307 should fix that. thanks

Copy link
Collaborator

@winglian winglian left a comment


great! thank you!

@winglian winglian merged commit dcdec44 into axolotl-ai-cloud:main Jul 22, 2023
@ethanhs ethanhs deleted the xgen branch July 22, 2023 21:58
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
Add XGen info to README and example config
3 participants