Project codebase for the paper A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts.
The objectives:
- Linear probing of SOTA LLMs (e.g., Llama-2, Falcon).
- Fine-tuning adapters (e.g., LoRA).
For the project environment and dependency management we use conda. If you do not have conda, we recommend installing miniconda rather than anaconda, as the latter ships with many packages that this project does not use. The initial setup can be done as follows.
git clone git@github.com:devrimcavusoglu/nonwestlit.git
cd nonwestlit
conda env create -f environment.yml
Activate the created conda environment with:
conda activate nonwestlit
The following packages must also be installed for integer quantization support (4-bit & 8-bit) and for LoRA adapters.
pip install "bitsandbytes>=0.41.2.post2"  # for integer quantization support
pip install peft==0.7.1  # for LoRA adapters
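To sanity-check that quantization and LoRA support work in your environment, you can try a minimal sketch like the one below. This is only an illustration, not the project's training entry point; the model name, num_labels, and LoRA hyperparameters are placeholders taken from the training example further down.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForSequenceClassification.from_pretrained(
    "tiiuae/falcon-7b", num_labels=3, quantization_config=bnb_config
)

# Attach a LoRA adapter (peft) for sequence classification.
lora_config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, target_modules=["query_key_value"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()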
Also, add the project root directory to PYTHONPATH to develop more smoothly without path-related problems. For earlier conda versions this was possible with conda develop <project root dir>, but that command is deprecated, so you may need to add the project root to PYTHONPATH manually. Heads up: conda develop may require installing conda-build with miniconda.
As an alternative way to add the project root to PYTHONPATH permanently for the environment, try the following:
- Go to your conda distribution path (e.g. anaconda3, miniconda3, mamba), usually located at $HOME/anaconda3. From now on this path is referred to as CONDA_ROOT.
- Find the activation bash file located at CONDA_ROOT/envs/nonwestlit/etc/conda/activate.d. For conda version 4.14.0 the file under this directory is libxml2_activate.sh (it could be different for other versions).
- Open the file, add the following line, then save and close it.
export PYTHONPATH="${PYTHONPATH}:/path/to/project/root"
- Deactivate and reactivate the environment nonwestlit. The project root is now permanently added to PYTHONPATH.
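To verify that the path is picked up, you can run the following from a directory outside the repository; it should exit without an ImportError:
python -c "import nonwestlit"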
To access the CLI and get information about the available commands, run the following:
python nonwestlit --help
# or python nonwestlit <command> --help
❗Important Note: The terminal commands given below are for demonstration purposes only and may not show the full capability of the train arguments. The entry point nonwestlit train seamlessly supports all HF TrainingArguments; just pass each argument by its exact name with a valid value. Please refer to TrainingArguments to see all supported arguments for training.
You can start training with the following command from the project root. This example command was used to train nonwestlit/falcon-7b-lora-seq-cls.
python nonwestlit train \
  --model-name-or-path tiiuae/falcon-7b \
  --train-data-path data/data-json/train.json \
  --eval-data-path data/data-json/val.json \
  --output-dir outputs/falcon_7b_lora_seq_cls \
  --adapter lora \
  --lora-target-modules ["query_key_value"] \
  --bnb-quantization 4bit \
  --experiment-tracking 1 \
  --num-labels 3 \
  --save-strategy steps \
  --save-steps 0.141 \
  --save-total-limit 1 \
  --per-device-train-batch-size 2 \
  --weight-decay 0.1 \
  --learning-rate 0.00003 \
  --eval-steps 0.047 \
  --bf16 1 \
  --logging-steps 1 \
  --num-train-epochs 7
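Since any HF TrainingArguments field can be passed by its exact name, you can append additional flags to the command above as needed. For example (illustrative values; gradient_accumulation_steps and warmup_ratio are standard TrainingArguments fields):
--gradient-accumulation-steps 4 --warmup-ratio 0.03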
The help docs/docstrings are not available yet; they will be ready soon.
For prediction, run:
python nonwestlit predict --model-name-or-path gpt2 --device cpu
GPT-2 is used for test purposes, so there is no harm in trying the command above (you will need it for the tests anyway); it is used for high-level functionality tests (training and prediction).
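Once you have a trained checkpoint, you can presumably point the same flag at a local output directory instead of a hub model id. The path below is hypothetical, following the training example above:
python nonwestlit predict --model-name-or-path outputs/falcon_7b_lora_seq_cls --device cuda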
Deepspeed
Deepspeed support is currently experimental; it is not yet available in the main branch and has not been tested. It can only be used with GPUs having the Ampere architecture or newer (RTX 3000 series and up, or equivalent workstation GPUs).
Use the following command to push your trained model to the huggingface-hub.
❗Heads Up: Before pushing the model to the HF Hub you may want to remove the optimizer file saved along with the model; for big LLMs (e.g. Llama-2, Falcon) the optimizer binary files are usually around 3 GB and thus significantly slow down the upload. As far as we know, there is no reliable way to ignore specific files with huggingface-cli, so you have to either remove the optimizer file or move it outside the model directory.
huggingface-cli upload nonwestlit LOCAL_MODEL_DIR HF_REPO_OR_MODEL_NAME --private --repo-type model
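Regarding the heads-up above, one way to move the optimizer file out of the model directory before uploading is the command below (assuming the Trainer's default file name optimizer.pt; adjust to your checkpoint layout):
mv LOCAL_MODEL_DIR/optimizer.pt /tmp/  # or remove it: rm LOCAL_MODEL_DIR/optimizer.pt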
We are using Neptune.ai as the experiment tracking tool. To start logging to Neptune, create a simple config file as shown below, name it neptune.cfg, and place it under the project root. Enter values without any quotation marks (as shown in the example below).
[credentials]
api_token=<YOUR_NEPTUNE_TOKEN>
[first-level-classification]
project=nonwestlit/first-level-classification
[second-level-classification]
project=nonwestlit/second-level-classification
# Append as necessary (if you create a new project)
[project-key]
project=<PROJECT_NAME>
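For reference, a file in this format can be read with Python's standard configparser. The sketch below is illustrative only; the project's own config loading may differ.
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("neptune.cfg")  # expected under the project root
api_token = cfg["credentials"]["api_token"]
project = cfg["first-level-classification"]["project"]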
Alternatively, you can set them as environment variables: set NEPTUNE_PROJECT for the project to be logged to, and NEPTUNE_API_TOKEN for the API token. You can use export (macOS, Linux) or set (Windows). For Linux you can set the environment variables as follows:
export NEPTUNE_PROJECT=<PROJECT_NAME> && export NEPTUNE_API_TOKEN=<YOUR_NEPTUNE_TOKEN>
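On Windows (cmd), the equivalent would be:
set NEPTUNE_PROJECT=<PROJECT_NAME> && set NEPTUNE_API_TOKEN=<YOUR_NEPTUNE_TOKEN>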
Note that these are set for the current session only, so you need to set them again once the session ends (e.g. on reboot).
We mainly use black for code formatting. Use the following command to format the codebase.
python -m scripts.run_code_style format
To check that the codebase is formatted, use
python -m scripts.run_code_style check
To run tests, use
python -m scripts.run_tests