Module to facilitate the integration of a sklearn training pipeline into a deploy and retraining system
pip install gpam_training
First of all, it is needed to have a dataframe from pandas in memory. The csv must be in the following format:
process_id,page_text_extract,tema
1,Lorem ipsum dolor sit amet,1
1,Lorem ipsum dolor sit amet,2
2,Lorem ipsum dolor sit amet,2
2,Lorem ipsum dolor sit amet,3
4,Lorem ipsum dolor sit amet,1
4,Lorem ipsum dolor sit amet,2
5,Lorem ipsum dolor sit amet,2
42,Lorem ipsum dolor sit amet,2
To train the model, do as shown bellow:
from gpam_training import MultilabelTraining
import pandas as pd
df = pd.read_csv('example.csv')
model = MultilabelTraining(df)
model.train()
To dump a pickle file with the trained model, do the following:
model_pickle = model.get_pickle()
- df (default=pandas.DataFrame()): A pandas dataframe;
- x_column_name (default="page_text_extract"): The name of the text column;
- group_processes (default=True): Wheter the labels of the processes must be grouped or not. So, for a csv like the one above, where there are one row for each label associated with a single process, this argument must true;
- classifier (default=PassiveAgressiveClassifier(random_state=42)): The estimator to be used;
- vectorizer (default=HashingVectorizer(n_features=2 ** 14)): The vectorizer to be used;
- target_themes (default=DEFAULT_TARGET_THEMES): The values to be considered as target values. The ones that are different from this list will be switched to the value of the attribute other_themes_values;
- other_themes_value (default=OTHER_THEMES_VALUE): The value
- remove_processes_without_theme (default=True): If the processes labeled without theme (represented by the theme 0) must be removed;
- is_incremental_training (default=False): Wheter the train is incremental or not;
- vocab_path (default=""): Path to a list of words representing a vocabulary. Words out of this list will be removed.