Question Regarding End Model #66

Closed

tschamp31 opened this issue Aug 26, 2019 · 3 comments

@tschamp31

Preface: My background coming into machine learning is mainly just programming. I've gained some solid knowledge about data framing, etc., but the math of neural networks is still a major weakness.

I believe I already know the answers to my questions, but I want to verify before I waste time on these efforts.
1.) Will the MJSynth dataset only teach the model how to break down/identify/process single words?

2.) Assuming yes to (1): that means it needs to be taught how to read sentence and paragraph/spacing structures, correct?

3.) Assuming yes to (2): is that where your team's work on MapTextSynthesizer came into play?

4.) Assuming yes to (3): is the finished MJSynth model then trained on the MapTextSynthesizer dataset, or is it fine-tuned/scoped to that dataset?

5.) Assuming yes to (4): what global_step/loss/learning rate, etc., is ideal for training on that dataset?

Yes/no should suffice for all five; if no, maybe a very short rationale. Thank you again for making this project public. I wish your team well at ICDAR 2019. I will also post a 1-million-step model trained on a single GPU if your team would like a copy on hand or to provide to the public.

@weinman (Owner) commented Aug 27, 2019

  1. Yes, the mjsynth dataset contains only images of single words.
  2. Yes, training on mjsynth alone will not work well for segmenting words (i.e., with spaces).
  3. No. We created and used MapTextSynthesizer because the visual properties of MJSynth were not a good match for our application. By default, it also generates only images of single words, but with more complicated backgrounds and wider inter-character spacing (on average).
  4. For the results in our ICDAR'19 paper, we train the model from scratch solely on the MapTextSynthesizer stream.
  5. The training schedule we use is given in the paper (Table II), with average (per-word) loss on the real map and MJSynth data given in Figure 6. See issue #42 ("often recognize 'u' wrongly") for some additional context/examples.

Indirectly, you could probably train a sequence recognizer with MapTextSynthesizer. You could generate a static list of captions (phrases) to sample from as if they were words (though I'm not sure whether the spaces would render properly, maybe @arthurhero knows), but the better thing to do would be to choose the random phrase dynamically on the fly, which would require some more substantial modifications to the code.
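A minimal sketch of the static-list idea, assuming you have some word corpus on hand to draw from (the file names, phrase lengths, and caption count here are illustrative placeholders, not part of the repo):

```python
# Build a static caption list of multi-word phrases that the synthesizer
# could sample from as if each phrase were a single "word".
import random

random.seed(0)

# corpus.txt is a stand-in for whatever word source you have available.
with open('corpus.txt') as f:
    words = f.read().split()

# Join 2-4 random words into each caption; 10,000 captions is arbitrary.
phrases = [' '.join(random.choices(words, k=random.randint(2, 4)))
           for _ in range(10000)]

with open('captions.txt', 'w') as f:
    f.write('\n'.join(phrases))
```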

In either case, you could then use the CTCWordBeamSearch module in a multi-word mode to recognize the text (or plain TensorFlow CTC beam search if you don't want a lexicon). Just remember to include a space among the output characters in charset.py.
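A minimal sketch of the plain-TensorFlow route, assuming a charset that already includes the space character (the random logits are just a placeholder for your model's output, and this charset is illustrative, not the one in charset.py):

```python
# Decode multi-word text with TensorFlow's built-in CTC beam search.
import tensorflow as tf

out_charset = 'abcdefghijklmnopqrstuvwxyz0123456789 '  # note the space

# Stand-in for model output: [max_time, batch_size, num_classes],
# where the extra class is the CTC blank label.
logits = tf.random.normal([50, 1, len(out_charset) + 1])
seq_len = tf.constant([50])

decoded, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len, beam_width=10)
dense = tf.sparse.to_dense(decoded[0], default_value=-1).numpy()[0]
print(''.join(out_charset[i] for i in dense if i >= 0))
```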

@arthurhero

The spacing should be fine. But since phrases tend to be longer than words, pay attention to the hard upper limit on the image width, which is set in mts_texthelper.cpp at line 562:

```c++
surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 40*height, height);
```

Currently the hard limit is 40 times the image height. You might want to set it higher for phrases.
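For example, a hypothetical edit (the 80x factor is just an illustration; pick whatever accommodates your longest phrases):

```c++
// Raise the width cap from 40x to 80x the height so that longer,
// multi-word captions are not clipped when the surface is created.
surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 80*height, height);
```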

@tschamp31 (Author)

Perfect, that feedback was exactly what I needed. Thank you both. I will continue updating the code to cleaner TF 2.0, ideally getting rid of all "tf.compat.vX" calls.
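For what it's worth, many of those shims map one-to-one onto native TF2 symbols; a hypothetical before/after (not claiming this exact call appears in the repo):

```python
import tensorflow as tf

logits = tf.random.normal([50, 1, 37])  # placeholder model output
seq_len = tf.constant([50])

# TF1-style call kept alive through the compat shim:
d1, _ = tf.compat.v1.nn.ctc_beam_search_decoder(logits, seq_len)

# Native TF2 spelling, no compat prefix:
d2, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len)
```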

As I said before, thank you again for making this project public.
