
Feat: Add Completion dataset type #15

Merged
2 commits merged into axolotl-ai-cloud:main from feat/completion on May 8, 2023

Conversation

NanoCode012
Collaborator

Proposal

Train a completion model, like the base llama, before fine-tuning on a specific domain. I am testing this method by fine-tuning a completion model on another low-resource language (to give it knowledge of the language) before fine-tuning with instruct types (into a chat model).

Usage

datasets:
  - path: data/text.jsonl
    type: completion

Data format:

Each row is a JSON object of the form {"text": "..."}.
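
For example, a data/text.jsonl file would contain one JSON object per line (the text here is placeholder content, not from the actual dataset):

{"text": "First raw document used for completion training."}
{"text": "Another plain-text sample; no instruction or response fields are needed."}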

Misc

  • I tried to write less repetitive code while following your class format. The code can definitely be shorter; some functions just return the input (see the sketch below this list).
  • I did not apply a formatter like black.
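
For context, here is a minimal sketch of that pass-through idea. The class and method names are hypothetical, not the exact code in this PR, and it assumes a Hugging Face-style tokenizer:

class CompletionPrompter:
    def build_prompt(self, text):
        # Completion training uses the raw text as-is, so there is nothing to template.
        return text

class CompletionPromptTokenizingStrategy:
    def __init__(self, prompter, tokenizer, max_length=2048):
        self.prompter = prompter
        self.tokenizer = tokenizer
        self.max_length = max_length

    def tokenize_prompt(self, example):
        # example is one dataset row: {"text": "..."}
        prompt = self.prompter.build_prompt(example["text"])
        tokens = self.tokenizer(prompt, truncation=True, max_length=self.max_length)
        # Plain language modeling: labels are just a copy of the input ids.
        tokens["labels"] = list(tokens["input_ids"])
        return tokens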

Results so far

On a 10k-row dataset, it does not perform particularly well. It does manage to answer in the correct language, but nothing special. My training also failed several times, so I don't think the results are conclusive yet.

@NanoCode012 NanoCode012 changed the title Feat: Add CompletionPrompt type Feat: Add Completion dataset type May 7, 2023
@NanoCode012 NanoCode012 marked this pull request as ready for review May 8, 2023 15:27
@NanoCode012
Collaborator Author

@winglian, I'm wondering if this PR is ok? I use a similar one for my own fine-tuning.

@winglian
Collaborator

winglian commented May 8, 2023

@NanoCode012 yup, this looks good to me. One thing I want to do down the line is convert the prompters and tokenizers into a plugin-type system where each one registers its own namespace for the dataset transform step. That way, only new classes would need to be added, and ideally no clunky conditionals would be needed to map them.
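
For illustration, a registry along those lines might look roughly like this (hypothetical names, not an API that exists in the repo):

PROMPT_STRATEGIES = {}

def register_strategy(name):
    # Decorator that registers a tokenizing strategy under a dataset-type name.
    def wrapper(cls):
        PROMPT_STRATEGIES[name] = cls
        return cls
    return wrapper

@register_strategy("completion")
class CompletionPromptTokenizingStrategy:
    ...

def load_strategy(dataset_type, *args, **kwargs):
    # Look up the registered class instead of chaining if/elif conditionals.
    if dataset_type not in PROMPT_STRATEGIES:
        raise ValueError(f"unknown dataset type: {dataset_type}")
    return PROMPT_STRATEGIES[dataset_type](*args, **kwargs)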

@winglian winglian merged commit 3f9c953 into axolotl-ai-cloud:main May 8, 2023
@NanoCode012
Collaborator Author

Yes, agreed. I was initially also going to move the _tokenizer function to the parent class, since all of them were using the same one, but it would've been better in a separate PR.

Your idea sounds better though!

@NanoCode012 NanoCode012 deleted the feat/completion branch May 9, 2023 02:36
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023