Skip to content

LIBXSL is a machine learning package that leverages large language models (LLMs) for text classification tasks. It provides a flexible and scalable framework for training and evaluating text classification models using state-of-the-art LLMs.

License

Notifications You must be signed in to change notification settings

ryaninhust/LibXSL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LIBXSL: A ML Package for Applying LLMs to Text Classification

LIBXSL is a machine learning package that leverages large language models (LLMs) for text classification tasks. It provides a flexible and scalable framework for training and evaluating text classification models using state-of-the-art LLMs.

Features

  • Easy integration with Hugging Face Transformers
  • Support for distributed training with PyTorch
  • Customizable loss functions for various classification tasks
  • Comprehensive logging and evaluation metrics

Installation

To install the package, run:

pip install libxsl

Special Dependency

LIBXSL also depends on a library available from a GitHub repository. This dependency will be automatically installed:

pip install git+https://github.com/ryaninhust/pyxclib.git

Usage

Training a Model

To train a model, you need a configuration file (in YAML format) specifying the training parameters, dataset paths, and model configurations. Here's an example configuration file:

model_name: "bert-base-uncased"
train_data_file: "path/to/train/data"
test_data_file: "path/to/test/data"
max_length: 128
batch_size: 32
num_epochs: 10
pretrained_lr: 2e-5
label_embedding_lr: 1e-3
pretrained_weight_decay: 0.01
label_embedding_weight_decay: 0.01
positive_weight: 1.0
loss_fn: "LRLR"
omega: 1.0
kernel_approx: true
log_file_path: "training.log"
model_save_path: models/model.pth
prediction_save_path: outputs/predictions.npy

Run the training script with the following command:

with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

tokenizer = AutoTokenizer.from_pretrained(config['model_name'])

world_size = torch.cuda.device_count()

train_dataset = TextClassificationDataset(config['train_data_file'], tokenizer, config['max_length'])
test_dataset = TextClassificationDataset(config['test_data_file'], tokenizer, config['max_length'], num_classes=train_dataset.num_classes)
mp.spawn(train, args=(world_size, train_dataset, test_dataset, config), nprocs=world_size, join=True)
mp.spawn(predict, args=(world_size, test_dataset, config), nprocs=world_size, join=True)

Customizing the Model and Loss Functions

You can customize the model and loss functions by editing the corresponding files in the package. For example, to add a new loss function, update the loss_fn.py file and add the new function to the loss function dictionary.

Contributing

We welcome contributions to the NEO project! If you have any ideas, bug reports, or improvements, please submit an issue or a pull request on our GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

About

LIBXSL is a machine learning package that leverages large language models (LLMs) for text classification tasks. It provides a flexible and scalable framework for training and evaluating text classification models using state-of-the-art LLMs.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages