Changing the target language for OCR'ing #12

Open
anhhaibkhn opened this issue Mar 2, 2022 · 5 comments

Comments

@anhhaibkhn

Hi @MrZilinXiao,

Thank you for sharing the project. Is it possible to adapt your model to other languages?

I was able to extract the cell coordinates from the table, but I am having difficulty reconstructing the table, especially for tables with merged cells. For example:
[Image: numerous_merged_cells_table]

Could you explain the idea behind the table reconstruction in more detail?

Thanks for your time.

@MrZilinXiao
Owner

Hi @anhhaibkhn,
You can refer to the specific reconstruction logic here:

def build_structure(self, delta_y=10, delta_x=10, overlap_thr=0.3):

Feel free to contact me if the code troubles you.

@anhhaibkhn
Author

Hi @MrZilinXiao ,
Thank you for your reply.

Is it possible to adapt your model to non-English languages (such as Japanese)?
I did not grasp the intuition behind your reconstruction algorithm. Could you explain the input parameters (delta_y=10, delta_x=10, overlap_thr=0.3) and the final output in more detail?

Thanks in advance.

@MrZilinXiao
Owner

MrZilinXiao commented Mar 7, 2022

In fact, the structure reconstruction model (a UNet in this project) is independent of the language of your input, since it only produces the structure of the given table. If you'd like to reproduce the demo in our GIF, all you need to do is replace the OCR module with one that handles your target language. An OCRHandler class is here: https://github.com/MrZilinXiao/Hyper-Table-OCR/blob/main/ocr/__init__.py. You can simply subclass it, naming it JapaneseOCRHandler or something similar.
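
A minimal sketch of the subclassing idea, for illustration only. The real base-class interface lives in ocr/__init__.py and is not reproduced in this thread, so the OCRHandler below is a hypothetical stand-in: the method name get_result, the recognize_fn hook, and the (coordinates, text, confidence) tuple layout are all assumptions, and fake_japanese_ocr only returns canned data in place of a real OCR engine.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Sequence, Tuple

# One recognised text block: (xyxyxyxy coordinates, text, confidence).
OCRResult = Tuple[Sequence[float], str, float]


class OCRHandler(ABC):
    """Hypothetical stand-in for the project's OCRHandler base class."""

    @abstractmethod
    def get_result(self, image) -> List[OCRResult]:
        """Return the text blocks found in `image`."""


class JapaneseOCRHandler(OCRHandler):
    """Delegates recognition to any Japanese-capable OCR backend."""

    def __init__(self, recognize_fn: Callable[[object], List[OCRResult]]):
        # `recognize_fn` is an assumed hook: wrap whichever OCR engine you use.
        self.recognize_fn = recognize_fn

    def get_result(self, image) -> List[OCRResult]:
        results = self.recognize_fn(image)
        # Drop whitespace-only strings so they are never matched to a cell.
        return [r for r in results if r[1].strip()]


def fake_japanese_ocr(image) -> List[OCRResult]:
    """Placeholder backend returning canned output, for demonstration only."""
    return [
        ((0, 0, 50, 0, 50, 20, 0, 20), "名前", 0.98),
        ((60, 0, 120, 0, 120, 20, 60, 20), "  ", 0.40),
    ]


handler = JapaneseOCRHandler(fake_japanese_ocr)
blocks = handler.get_result(image=None)
```

Because the structure model never sees the text, only this handler would need to change when switching languages; check the actual method names in ocr/__init__.py before adapting the sketch.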

Coming back to the code, here is the meaning of each default parameter:

  • delta_y: the vertical gap, in pixels, between two horizontal lines beyond which the algorithm considers a new row to begin.
  • delta_x: the horizontal gap, in pixels, between two vertical lines beyond which the algorithm considers a new column to begin.
  • overlap_thr: the algorithm sometimes produces boxes that overlap with others, which leads to an incorrect reconstructed structure. So at the end of reconstruction, we remove boxes whose overlap ratio with other boxes exceeds overlap_thr.

The output is self.rows: List[List[TableCell]], which is a 2D table reconstruction result.
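
To make the thresholds concrete, here is a simplified sketch and not the project's build_structure. It assumes axis-aligned boxes given as (x0, y0, x1, y1), defines the overlap ratio as intersection area divided by the box's own area, and filters overlaps before grouping for brevity (the description above filters at the end). The helper names area, overlap_ratio, drop_overlapping and group_rows are hypothetical, and delta_x is left out because it would work the same way along the other axis.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of `a` (an assumed definition)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter) / area(a) if area(a) > 0 else 0.0


def drop_overlapping(boxes: List[Box], overlap_thr: float = 0.3) -> List[Box]:
    """Keep a box only if it does not overlap an already-kept box too much."""
    kept: List[Box] = []
    for box in boxes:
        if all(overlap_ratio(box, k) <= overlap_thr for k in kept):
            kept.append(box)
    return kept


def group_rows(boxes: List[Box], delta_y: float = 10) -> List[List[Box]]:
    """Start a new row when a top edge is more than delta_y below the row's first box."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(box[1] - rows[-1][0][1]) <= delta_y:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order each row from left to right.
    return [sorted(row, key=lambda b: b[0]) for row in rows]


boxes = [
    (0, 0, 50, 20),
    (50, 2, 100, 20),
    (0, 20, 100, 40),
    (48, 1, 100, 20),  # near-duplicate of the second box
]
rows = group_rows(drop_overlapping(boxes, overlap_thr=0.3), delta_y=10)
```

Here the near-duplicate box is discarded because about 91% of its area overlaps the second box, and the remaining three boxes fall into two rows: the first two share a row since their top edges differ by only 2 pixels.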

BTW, be aware of this issue: #2 (use Google Translate if the Chinese is a problem).

@anhhaibkhn
Author

Hi @MrZilinXiao!
Thank you very much for your detailed and prompt reply.

From what I understand (after running the linked issue through Google Translate), this project currently only supports cell merge detection in the row direction. I am sorry, but I still haven't fully understood how the reconstruction algorithm works, for example self.rows: np.ndarray[List[TableCell]] = np.array(self.rows) and the attributes in the TableCell constructor. Could you perhaps give a simple table example so that I can get a quick look at how it works?

I will also try to adapt it to the Japanese language in the OCR settings module and let you know how it goes.

@MrZilinXiao
Owner

Yes, this project currently only supports cell merge detection in the row direction, so it has difficulty with the example image you provided in the description of this issue, since that table has merged cells in both the row and column directions.

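from typing import List, Union

import numpy as np
from shapely.geometry import Polygon


class OCRBlock(object):
    def __init__(self, coord, content, conf=-1.0):
        self.coord: np.ndarray = coord  # xyxyxyxy: four (x, y) corner points
        self.conf = conf  # confidence
        assert len(coord) == 8, "xyxyxyxy not fit for OCRBlock!"
        # Polygon built from the four corners, used to compute overlap area
        self.shape = Polygon([coord[0:2], coord[2:4], coord[4:6], coord[6:]])
        self.ocr_content: Union[List[str], str] = content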


class TableCell(OCRBlock):
    def __init__(self, coord):
        super(TableCell, self).__init__(coord, [])
        self.matched = False
        self.row_range = [-1, -1]
        self.col_range = [-1, -1]

    @property
    def upper_y(self):
        return self.coord[[1, 3]].mean()

    @property
    def left_x(self):
        return self.coord[[0, 6]].mean()

    @property
    def right_x(self):
        return self.coord[[2, 4]].mean()

TableCell subclasses OCRBlock. The attributes are as follows:

  • coord: coordinates of a cell, in the format of xyxyxyxy

  • conf: confidence

  • shape: a Polygon object from shapely.geometry, used to compute overlap area

  • ocr_content: the content that OCRHandler would fill in

  • matched: whether the TableCell has been matched to an OCRBlock
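
The three properties on TableCell imply the corner order of the xyxyxyxy format: upper_y averages indices 1 and 3, and left_x averages indices 0 and 6, so the points appear to run top-left, top-right, bottom-right, bottom-left. A small worked example, using a plain list instead of a numpy array so that it runs without dependencies (the helper name edges is hypothetical):

```python
from statistics import mean
from typing import Dict, Sequence


def edges(coord: Sequence[float]) -> Dict[str, float]:
    """Mirror TableCell.upper_y / left_x / right_x on a plain xyxyxyxy sequence."""
    assert len(coord) == 8, "expected four (x, y) points"
    return {
        "upper_y": mean([coord[1], coord[3]]),  # y of top-left and top-right
        "left_x": mean([coord[0], coord[6]]),   # x of top-left and bottom-left
        "right_x": mean([coord[2], coord[4]]),  # x of top-right and bottom-right
    }


# A slightly skewed cell: top-left, top-right, bottom-right, bottom-left.
cell = [10, 20, 110, 22, 112, 60, 12, 58]
result = edges(cell)
```

Averaging two corners per edge gives a single representative value even when the detected cell is slightly skewed, as in this example (upper_y is 21, left_x is 11, right_x is 111).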

row_range and col_range might be difficult to understand, so I drew a picture by hand (all indexes are 0-based):

[Image: IMG_0115]

Consider a table containing only row-merged cells: the row_range of cells A, B, and C is [0, 0], [0, 0], and [2, 2] respectively, and the col_range of cells A, B, and C is [0, 0], [1, 2], and [1, 2]. The dotted line is only there to make the picture clearer; it does not exist in the table.
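
The ranges above can be read as inclusive [start, end] grid indexes. The sketch below assumes that reading and shows how many rows and columns each cell covers and which grid slots it occupies; the helpers span and occupied_slots are hypothetical and not part of the project.

```python
from typing import Dict, List, Tuple

# Ranges quoted above for cells A, B and C (inclusive, 0-indexed).
cells: Dict[str, Dict[str, List[int]]] = {
    "A": {"row_range": [0, 0], "col_range": [0, 0]},
    "B": {"row_range": [0, 0], "col_range": [1, 2]},
    "C": {"row_range": [2, 2], "col_range": [1, 2]},
}


def span(rng: List[int]) -> int:
    """Number of rows (or columns) covered by an inclusive [start, end] range."""
    return rng[1] - rng[0] + 1


def occupied_slots(cell: Dict[str, List[int]]) -> List[Tuple[int, int]]:
    """All (row, col) grid positions that the cell covers."""
    r0, r1 = cell["row_range"]
    c0, c1 = cell["col_range"]
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


# (rows covered, columns covered) per cell.
spans = {
    name: (span(c["row_range"]), span(c["col_range"]))
    for name, c in cells.items()
}
slots_b = occupied_slots(cells["B"])
```

Under this reading, A covers one row and one column, while B and C each cover one row and two columns, which matches a merge within a row: B occupies grid slots (0, 1) and (0, 2).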

Hope this solves your problem. Sorry that I don't have a running example with debug info, since I have already shifted my research interest away from Table OCR.
