
Add XGen info to README and example config #306

Merged
merged 1 commit into main from xgen on Jul 22, 2023

Conversation

ethanhs
Copy link
Contributor

@ethanhs ethanhs commented Jul 21, 2023

I should probably add a note that the tokenizer breaks with the sharegpt prompt format at the moment... maybe a footnote in the README?

@ethanhs ethanhs mentioned this pull request Jul 21, 2023
@NanoCode012
Copy link
Collaborator

Can you clarify what you mean by "break"?

You could add a logging.warn within the validate_config function if it's bad

@ethanhs
Copy link
Contributor Author

ethanhs commented Jul 21, 2023

Can you clarify what you mean by "break"?

I get the following error:

Traceback (most recent call last):
  File "scripts/finetune.py", line 355, in <module>
    fire.Fire(train)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/finetune.py", line 225, in train
    train_dataset, eval_dataset = load_prepare_datasets(
  File ".../llm-research/axolotl/src/axolotl/utils/data.py", line 395, in load_prepare_datasets
    dataset = load_tokenized_prepared_datasets(
  File ".../llm-research/axolotl/src/axolotl/utils/data.py", line 270, in load_tokenized_prepared_datasets
    samples = samples + list(d)
  File ".../llm-research/axolotl/src/axolotl/datasets.py", line 42, in __iter__
    yield self.prompt_tokenizer.tokenize_prompt(example)
  File ".../llm-research/axolotl/src/axolotl/prompt_tokenizers.py", line 345, in tokenize_prompt
    user_token = self._get_user_token()
  File ".../llm-research/axolotl/src/axolotl/prompt_tokenizers.py", line 51, in _get_user_token
    id_or_ids = self.tokenizer.convert_tokens_to_ids("<|USER|>")
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 575, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 588, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
  File ".../.cache/huggingface/modules/transformers_modules/Salesforce/xgen-7b-8k-base/1a0f468309372dbd65c17bde049c1cd35d551c14/tokenization_xgen.py", line 152, in _convert_token_to_id
    return self.encoder.encode_single_token(token)
  File ".../miniconda3/envs/llms/lib/python3.8/site-packages/tiktoken/core.py", line 226, in encode_single_token
    return self._core_bpe.encode_single_token(text_or_bytes)
KeyError: [60, 124, 85, 83, 69, 82, 124, 62]

Checking for this in validate_config could be nice (ideally it might be possible to work around it?)
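A check like that could be a small helper called from `validate_config`. The sketch below is hypothetical (the helper name and the exact set of required tokens are assumptions, not axolotl's actual code): it probes the tokenizer's vocabulary for the special tokens the sharegpt format relies on and warns instead of crashing later during tokenization.

```python
import logging

def warn_missing_special_tokens(vocab, required=("<|USER|>", "<|ASSISTANT|>")):
    """Return the required special tokens absent from `vocab`, logging a
    warning if any are missing. `vocab` is a token-to-id mapping, e.g. the
    result of tokenizer.get_vocab()."""
    missing = [tok for tok in required if tok not in vocab]
    if missing:
        logging.warning(
            "Tokenizer is missing special tokens %s; the sharegpt prompt "
            "format will fail when converting them to ids.",
            missing,
        )
    return missing
```

With the XGen tokenizer, `"<|USER|>"` is not in the vocabulary, so the helper would return it in the missing list rather than raising the `KeyError` seen above.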

@winglian
Copy link
Collaborator

@ethanhs #307 should fix that. thanks

Copy link
Collaborator

@winglian winglian left a comment


great! thank you!

@winglian winglian merged commit dcdec44 into axolotl-ai-cloud:main Jul 22, 2023
@ethanhs ethanhs deleted the xgen branch July 22, 2023 21:58
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
Add XGen info to README and example config
3 participants