##Task ## Your task is to develop a machine learning approach for spotting keywords in the provided documents. You can test your approach on the provided training and validation dataset where you find a list of keywords that you can find for certain at least once in each set.
In this repository you'll find all the data necessary for your KeywordSpotting Task.
You find the following folders:
Contains ground-truth data.
Contains the transcription of all words (on a character level) of the whole dataset. The Format is as follows:
- XXX-YY-ZZ: XXX = Document Number, YY = Line Number, ZZ = Word Number
- Contains the character-wise transcription of the word (letters seperated with dashes)
- Special characters denoted with s_
- numbers (s_x)
- punctuation (s_pt, s_cm, ...)
- strong s (s_s)
- hyphen (s_mi)
- semicolon (s_sq)
- apostrophe (s_qt)
- colon (s_qo)
Contains bounding boxes for all words in the svg-format.
- XXX.svg: File containing the bounding boxes for the given documents
- **id** contains the same XXX-YY-ZZ naming as above
Contains the original images in jpg-format.
Contains three files:
####train.txt / valid.txt #### Contains a splitting of the documents into a training and a validation set.
Contains a list of keywords of which each will be at least once in the training and validation dataset.