trouble using OCRDataset #151

kghezelbash · 2024-02-17T12:42:07Z

kghezelbash
Feb 17, 2024

Hi again, I am trying to build a dataset with the OCRDataset class from this project. In OCRDatasetConfig, can 'path' be a path to a CSV file, or should I convert my data to another format? If CSV is not supported, why does the config take column names?


arxyzan · 2024-02-17T13:01:43Z

arxyzan
Feb 17, 2024
Maintainer

Hello @kghezelbash,
To clarify: to train a model with Hezar on your own data, you need to provide a regular PyTorch Dataset subclass. Hezar ships its own dataset classes for common tasks such as OCR, image captioning, and text classification. You are not required to use them; they exist only to make things easier. If your dataset is very different or needs a lot of customization, you can easily write your own dataset class.

You can use this template as an example:

import pandas as pd

from hezar.data import OCRDataset, OCRDatasetConfig
from hezar.models import CRNNImage2TextConfig, CRNNImage2Text
from hezar.preprocessors import ImageProcessor
from hezar.trainer import Trainer, TrainerConfig


class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        # Load a dataframe here and make sure the split is fetched
        data = pd.read_csv(self.config.path)
        # preprocess if needed
        return data

    def __getitem__(self, index):
        # `.values` is a property, not a method; assumes the CSV columns are ordered (image path, text)
        path, text = self.data.iloc[index].values
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs


dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    reverse_digits=True,
)

train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
    CRNNImage2TextConfig(
        id2label=train_dataset.config.id2label,
        map2seq_in_dim=1024,
        map2seq_out_dim=96
    )
)
preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

train_config = TrainerConfig(
    output_dir="crnn-plate-fa-v1",
    task="image2text",
    device="cuda",
    batch_size=8,
    num_epochs=20,
    metrics=["cer"],
    metric_for_best_model="cer"
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)
trainer.train()
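
The `_load` template above reads the whole CSV regardless of `split`, so the train and eval datasets would contain the same rows. A minimal sketch of a split-aware helper follows, assuming a single CSV with no predefined split column. The name `select_split` and the parameters `test_fraction` and `seed` are hypothetical and not part of Hezar.

```python
import pandas as pd


def select_split(data: pd.DataFrame, split=None, test_fraction=0.1, seed=42):
    """Return the rows for `split` ("train" or "test"), or all rows if split is None."""
    if split is None:
        return data.reset_index(drop=True)
    # Shuffle with a fixed seed so repeated calls produce the same, non-overlapping partitions
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = max(1, int(len(shuffled) * test_fraction))
    if split == "test":
        return shuffled.iloc[:n_test].reset_index(drop=True)
    if split == "train":
        return shuffled.iloc[n_test:].reset_index(drop=True)
    raise ValueError(f"Unknown split: {split!r}")
```

Inside `_load`, this could be used as `return select_split(pd.read_csv(self.config.path), split)`.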


kghezelbash · 2024-03-17T13:43:47Z

kghezelbash
Mar 17, 2024
Author

Hi again. If we want to fine-tune the OCR model with text_split_type='tokenize', which id2label should we use?


arxyzan · 2024-03-17T16:27:43Z

arxyzan
Mar 17, 2024
Maintainer

@kghezelbash Hi. The tokenize type only applies to transformer models like TrOCR that need a tokenizer, so id2label is not necessary there. In general, we do not recommend that method since it was never tested.
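
For contrast, character-level splitting is where id2label matters: every character of a label is looked up in the inverse mapping. The sketch below illustrates the idea only; it is not Hezar's implementation, and `build_label2id`, `encode_text`, and `decode_ids` are hypothetical names.

```python
def build_label2id(id2label):
    """Invert an id -> character mapping."""
    return {label: idx for idx, label in id2label.items()}


def encode_text(text, label2id):
    """Map each character to its id; fail loudly on characters missing from the vocabulary."""
    unknown = sorted({ch for ch in text if ch not in label2id})
    if unknown:
        raise ValueError(f"Characters missing from id2label: {unknown}")
    return [label2id[ch] for ch in text]


def decode_ids(ids, id2label):
    """Map ids back to characters."""
    return "".join(id2label[i] for i in ids)


# Tiny illustrative vocabulary; index 0 is the empty string, as in the scripts in this thread
id2label = dict(enumerate(["", "ا", "ب", "س", "ل", "م"]))
label2id = build_label2id(id2label)
ids = encode_text("سلام", label2id)
```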

1 reply

kghezelbash Mar 23, 2024
Author

Thank you so much for your time, consideration, and answer. Where in the training config should I change the pre-trained model to TrOCR?

arxyzan · 2024-03-23T09:12:02Z

arxyzan
Mar 23, 2024
Maintainer

@kghezelbash Appreciate it. Thanks.
Why would you want to use TrOCR? That model requires a lot of labeled data (tens of millions of samples!) and still won't beat models like CRNN. Plus, as I mentioned, we never tested TrOCR training in Hezar: we trained it with Transformers beforehand and ported it to Hezar, so only inference was tested thoroughly.

5 replies

kghezelbash Mar 23, 2024
Author

I tried to fine-tune the CRNN model on a big dataset of Persian words written in the BYekan font, but I was not successful. After fine-tuning, the model predicts "" for all the words, so I thought maybe I should change the pre-trained model.

arxyzan Mar 23, 2024
Maintainer

Actually, it is rare for CRNN to behave this badly. I think there might be a bug in your training script or in Hezar's Trainer. Can you paste or send me your full training script?
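
One common way a CRNN-style model ends up predicting "" for every image can be sketched as follows. This assumes CTC-style greedy decoding with index 0 as the blank symbol, which the thread does not confirm for Hezar's decoder, and `ctc_greedy_decode` is a hypothetical name. If every frame's most likely class is the blank, the decoded string is empty.

```python
def ctc_greedy_decode(frame_ids, id2label, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (standard greedy CTC decoding)."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the class changes and is not the blank
        if idx != previous and idx != blank_id:
            chars.append(id2label[idx])
        previous = idx
    return "".join(chars)


id2label = {0: "", 1: "a", 2: "b"}
healthy = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0], id2label)  # "aab"
collapsed = ctc_greedy_decode([0, 0, 0, 0, 0, 0, 0], id2label)  # ""
```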

kghezelbash Mar 23, 2024
Author

Of course. In this code, "IDDATA2.csv" is a CSV containing a path and a label for each image. Here is my training script:

import csv
import os
from hezar.data.datasets.ocr_dataset import TextSplitType
from hezar.constants import TaskType
from hezar.data.datasets.ocr_dataset import OCRDatasetConfig
from hezar.preprocessors import ImageProcessor
from hezar.trainer import Trainer, TrainerConfig
import pandas as pd
from hezar.models import CRNNImage2TextConfig, CRNNImage2Text

from hezar.data import OCRDataset, OCRDatasetConfig
from hezar.preprocessors.image_processor import ImageProcessorConfig
from tqdm import tqdm

csv_file = "IDDATA2.csv"

fa_characters = [
"", "آ", "ا", "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "د", "ذ", "ر", "ز", "ژ", "س", "ش",
"ص", "ض", "ط", "ظ", "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "و", "ه", "ی", " " , "ي"
]
fa_numbers = ["۱", "۲", "۳", "۴", "۵", "۶", "۷", "۸", "۹", "۰"]
fa_special_characters = ["ء", "ؤ", "ئ", "أ", "ّ", 'ٓ', 'ٕ', "ٔ", "\u200c" , "j", "p", "g"]
fa_symbols = ["/", "(", ")", "+", "-", ":", "،", "!", ".", "؛", "=", "%", "؟"]
en_numbers = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]
all_characters = fa_characters + fa_numbers + fa_special_characters + fa_symbols + en_numbers

ID2LABEL = dict(enumerate(all_characters))

image_processor_config = ImageProcessorConfig(
mean=[0.5], # Example mean values for normalization
std=[0.5], # Example standard deviation values for normalization
rescale=1.0, # Example rescaling factor
resample=2, # Example resampling filter (2 for BICUBIC)
size=(224, 224), # Example target image size (width, height)
mirror=False, # Example mirror augmentation
gray_scale=True # Example grayscale conversion
)

class PersianOCRDataset(OCRDataset):
def init(self, config: OCRDatasetConfig, split=None, **kwargs):
super().init(config=config, split=split, **kwargs)

def _load(self, split=None):
    data = pd.read_csv(self.config.path)
    return data

def __getitem__(self, index):
    path = self.data.iloc[index][0]
    text = self.data.iloc[index][1]
    pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
    labels = self._text_to_tensor(text)
    inputs = {
        "pixel_values": pixel_values,
        "labels": labels,
    }
    return inputs

dataset_config = OCRDatasetConfig(
path=csv_file,
text_split_type='char_split',
id2label=ID2LABEL,
text_column="text",
images_paths_column="image_path",
max_length=100,
invalid_characters=[],
reverse_text=False,
reverse_digits=False,
image_processor_config=image_processor_config
)

train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
CRNNImage2TextConfig(
id2label=train_dataset.config.id2label,
map2seq_in_dim=7168,
map2seq_out_dim=96
)
)

preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

train_config = TrainerConfig(
output_dir="crnn-plate-fa-v5",
task="image2text",
device="CPU",
batch_size=10,
num_epochs=10,
metrics=["cer"],
metric_for_best_model="cer"
)

trainer = Trainer(
config=train_config,
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=train_dataset.data_collator,
preprocessor=preprocessor,
)

trainer.train()

arxyzan Mar 23, 2024
Maintainer

Sorry to ask, but could you wrap the code in a Python code block (three backticks followed by python on the line before the code, and three backticks on the line after it)? It is really hard to read otherwise.

arxyzan Mar 23, 2024
Maintainer

I think you used quotation marks instead of backticks. That's why it still shows up incorrectly. You can also use the Preview button at the top of the text box to see how your text will look to me.

kghezelbash · 2024-03-23T09:44:34Z

kghezelbash
Mar 23, 2024
Author

Thank you again =)).

import csv
import os
from hezar.data.datasets.ocr_dataset import TextSplitType
from hezar.constants import TaskType
from hezar.data.datasets.ocr_dataset import OCRDatasetConfig
from hezar.preprocessors import ImageProcessor
from hezar.trainer import Trainer, TrainerConfig
import pandas as pd
from hezar.models import CRNNImage2TextConfig, CRNNImage2Text

from hezar.data import OCRDataset, OCRDatasetConfig
from hezar.preprocessors.image_processor import ImageProcessorConfig
from tqdm import tqdm

csv_file = "IDDATA2.csv"

fa_characters = [
    "", "آ", "ا", "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "د", "ذ", "ر", "ز", "ژ", "س", "ش",
    "ص", "ض", "ط", "ظ", "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "و", "ه", "ی", " " ,  "ي"
]
fa_numbers = ["۱", "۲", "۳", "۴", "۵", "۶", "۷", "۸", "۹", "۰"]
fa_special_characters = ["ء", "ؤ", "ئ", "أ", "ّ",  'ٓ', 'ٕ', "ٔ", "\u200c" , "j", "p", "g"]
fa_symbols = ["/", "(", ")", "+", "-", ":", "،", "!", ".", "؛", "=", "%", "؟"]
en_numbers = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]
all_characters = fa_characters + fa_numbers + fa_special_characters + fa_symbols + en_numbers

ID2LABEL = dict(enumerate(all_characters))

image_processor_config = ImageProcessorConfig(
    mean=[0.5],  # Example mean values for normalization
    std=[0.5],   # Example standard deviation values for normalization
    rescale=1.0,           # Example rescaling factor
    resample=2,            # Example resampling filter (2 for BICUBIC)
    size=(224, 224),       # Example target image size (width, height)
    mirror=False,           # Example mirror augmentation
    gray_scale=True       # Example grayscale conversion
)

class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        data = pd.read_csv(self.config.path)
        return data

    def __getitem__(self, index):
        path = self.data.iloc[index][0]
        text = self.data.iloc[index][1]
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs


dataset_config = OCRDatasetConfig(
    path=csv_file,  
    text_split_type='char_split',  
    id2label=ID2LABEL, 
    text_column="text",
    images_paths_column="image_path",
    max_length=100,  
    invalid_characters=[],  
    reverse_text=False,   
    reverse_digits=False,   
    image_processor_config=image_processor_config  
)


train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
    CRNNImage2TextConfig(
        id2label=train_dataset.config.id2label,
        map2seq_in_dim=7168,
        map2seq_out_dim=96
    )
)

preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

train_config = TrainerConfig(
    output_dir="crnn-plate-fa-v1",
    task="image2text",
    device="CPU",
    batch_size=10,
    num_epochs=1,
    metrics=["cer"],
    metric_for_best_model="cer"
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)

trainer.train()
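
Before training with a hand-written character list like the one above, it can help to check that every character in the label column is covered by `ID2LABEL`. A minimal diagnostic sketch, with `find_unknown_characters` being a hypothetical helper and not part of Hezar:

```python
from collections import Counter


def find_unknown_characters(texts, id2label):
    """Count characters that occur in the labels but are missing from the vocabulary."""
    known = set(id2label.values())
    counts = Counter()
    for text in texts:
        for ch in str(text):
            if ch not in known:
                counts[ch] += 1
    return counts


# Illustrative data; with the script above one could pass the "text" column of the CSV instead
sample_id2label = {0: "", 1: "a", 2: "b"}
missing = find_unknown_characters(["ab", "ac", "xa"], sample_id2label)
```

An empty result means every label character has an id; anything else lists the characters to add or to clean from the data.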

14 replies

arxyzan Apr 2, 2024
Maintainer

Alright, now I see: your dataset is not ALPR at all, it's just standard Persian OCR.
First off, the images seem pretty easy to recognize. I tested a couple of them, and the vanilla CRNN model at hezarai/crnn-fa-printed-96-long recognizes them pretty accurately (I ran a test on 20 samples from your data and not a single one was recognized incorrectly).

But anyway, if you still want to train a model on this data, here is code that should work:

import pandas as pd

from hezar.trainer import Trainer, TrainerConfig
from hezar.models import Model
from hezar.data import OCRDataset, OCRDatasetConfig
from hezar.preprocessors import ImageProcessorConfig

csv_file = "IDDATA2.csv"

fa_characters = [
    "", "آ", "ا", "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "د", "ذ", "ر", "ز", "ژ", "س", "ش",
    "ص", "ض", "ط", "ظ", "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "و", "ه", "ی", " ", "ي"
]
fa_numbers = ["۱", "۲", "۳", "۴", "۵", "۶", "۷", "۸", "۹", "۰"]
fa_special_characters = ["ء", "ؤ", "ئ", "أ", "ّ", 'ٓ', 'ٕ', "ٔ"]
fa_symbols = ["/", "(", ")", "+", "-", ":", "،", "!", ".", "؛", "=", "%", "؟"]
en_numbers = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]
additional_characters = ["\u200c", "j", "p", "g"]  # Append extra characters to the end for faster convergence

all_characters = fa_characters + fa_numbers + fa_special_characters + fa_symbols + en_numbers + additional_characters
ID2LABEL = dict(enumerate(all_characters))


class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        data = pd.read_csv(self.config.path)
        return data

    def __getitem__(self, index):
        # Positional access via .iloc[row, col] avoids integer-key lookups on a label-indexed row
        path = self.data.iloc[index, 0]
        text = self.data.iloc[index, 1]
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs


if __name__ == '__main__':
    image_processor_config = ImageProcessorConfig(
        mean=[0.6595],
        std=[0.1501],
        rescale=1 / 255,
        size=(384, 32),
        mirror=True,
        gray_scale=True,
    )

    dataset_config = OCRDatasetConfig(
        path=csv_file,
        text_split_type='char_split',
        id2label=ID2LABEL,
        text_column="text",
        images_paths_column="image_path",
        max_length=48,
        invalid_characters=[],
        reverse_digits=True,
        image_processor_config=image_processor_config
    )

    train_dataset = PersianOCRDataset(dataset_config, split="train")
    eval_dataset = PersianOCRDataset(dataset_config, split="test")

    model = Model.load("hezarai/crnn-fa-printed-96-long", id2label=ID2LABEL)

    train_config = TrainerConfig(
        output_dir="crnn-fa-custom",
        task="image2text",
        device="cpu",
        batch_size=16,
        num_epochs=20,
        metrics=["cer"],
        metric_for_best_model="cer"
    )

    trainer = Trainer(
        config=train_config,
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=train_dataset.data_collator,
    )

    trainer.train()
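
The trainer config above tracks `cer` (character error rate). As a rough sketch of what that metric measures, here is the textbook definition: total character-level edit distance divided by total reference length. This is not necessarily how Hezar computes it, and `levenshtein` and `cer` are hypothetical helper names.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def cer(predictions, references):
    """Total edit distance divided by total number of reference characters."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

Under this definition, a model that outputs "" for every image scores a CER of 1.0, since every reference character counts as one missing character.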

kghezelbash Apr 2, 2024
Author

You are right. The model works perfectly on this dataset, but it has a little trouble when the images are blurred. I want to fine-tune the model on a dataset in this font so that it reads those blurred images perfectly too.

arxyzan Apr 2, 2024
Maintainer

I see. Let me know if you encounter any other problems. Good luck.

kghezelbash Apr 2, 2024
Author

Thank you so much for your help, it is working now =)))))))))))))

arxyzan Apr 2, 2024
Maintainer

So happy to hear that! 🍻🍻
