
Feat: Add Completion dataset type #15

Merged
2 commits merged into axolotl-ai-cloud:main from feat/completion on May 8, 2023

Conversation

NanoCode012
Collaborator

Proposal

Train a completion model, like the base llama, before fine-tuning on a specific domain. I am testing this method by fine-tuning a completion model on another low-resource language (to give it knowledge of the language) before fine-tuning with instruct types (into a chat model).

Usage

datasets:
  - path: data/text.jsonl
    type: completion

Data format:

Each row is a JSON object of the form {"text": "..."}.
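
For example, a data/text.jsonl file would contain one JSON object per line (the text here is placeholder content, not from the actual dataset):

{"text": "First raw document used for completion training."}
{"text": "Another plain-text sample; no instruction or response fields are needed."}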

Misc

  • I tried to write less repetitive code while following your class format. The code can definitely be shorter; some functions just return the input (see the sketch below this list).
  • I did not apply a formatter like black.
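
For context, here is a minimal sketch of that pass-through idea. The class and method names are hypothetical, not the exact code in this PR, and it assumes a Hugging Face-style tokenizer:

class CompletionPrompter:
    def build_prompt(self, text):
        # Completion training uses the raw text as-is, so there is nothing to template.
        return text

class CompletionPromptTokenizingStrategy:
    def __init__(self, prompter, tokenizer, max_length=2048):
        self.prompter = prompter
        self.tokenizer = tokenizer
        self.max_length = max_length

    def tokenize_prompt(self, example):
        # example is one dataset row: {"text": "..."}
        prompt = self.prompter.build_prompt(example["text"])
        tokens = self.tokenizer(prompt, truncation=True, max_length=self.max_length)
        # Plain language modeling: labels are just a copy of the input ids.
        tokens["labels"] = list(tokens["input_ids"])
        return tokens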

Results so far

On a 10k-row dataset, it does not perform particularly well. It does manage to answer in the correct language, but nothing special. My training also failed several times, so I don't think the results are conclusive yet.

@NanoCode012 NanoCode012 changed the title Feat: Add CompletionPrompt type Feat: Add Completion dataset type May 7, 2023
@NanoCode012 NanoCode012 marked this pull request as ready for review May 8, 2023 15:27
@NanoCode012
Collaborator Author

@winglian, I'm wondering if this PR is ok? I use a similar one for my own fine-tuning.

@winglian
Collaborator

winglian commented May 8, 2023

@NanoCode012 yup, this looks good to me. One thing I want to do down the line is convert the prompters and tokenizers into a plugin-type system where each one registers its own namespace for the dataset transform step. That way, only new classes would need to be added, and ideally no clunky conditionals would be needed to map them.
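
For illustration, a registry along those lines might look roughly like this (hypothetical names, not an API that exists in the repo):

PROMPT_STRATEGIES = {}

def register_strategy(name):
    # Decorator that registers a tokenizing strategy under a dataset-type name.
    def wrapper(cls):
        PROMPT_STRATEGIES[name] = cls
        return cls
    return wrapper

@register_strategy("completion")
class CompletionPromptTokenizingStrategy:
    ...

def load_strategy(dataset_type, *args, **kwargs):
    # Look up the registered class instead of chaining if/elif conditionals.
    if dataset_type not in PROMPT_STRATEGIES:
        raise ValueError(f"unknown dataset type: {dataset_type}")
    return PROMPT_STRATEGIES[dataset_type](*args, **kwargs)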

@winglian winglian merged commit 3f9c953 into axolotl-ai-cloud:main May 8, 2023
@NanoCode012
Collaborator Author

Yes, agreed. I was initially also going to move the _tokenizer function to the parent class, since all of them were using the same one, but it would've been better in a separate PR.

Your idea sounds better though!

@NanoCode012 NanoCode012 deleted the feat/completion branch May 9, 2023 02:36
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023