Question Regarding End Model #66

Closed

tschamp31 opened this issue Aug 26, 2019 · 3 comments

@tschamp31

Preface: My background coming into machine learning is mainly just programming. I've gained some solid knowledge about data framing, etc., but the math of neural networks is still a major weakness.

I believe I already know the answers to my questions, but I want to verify before I waste time on these efforts.
1.) Will the MJSynth dataset only teach the model how to break down/identify/process single words?

2.) Assuming yes to (1): that means it needs to be taught how to read sentence and paragraph/spacing structures, correct?

3.) Assuming yes to (2): is that where your team's work on MapTextSynthesizer came into play?

4.) Assuming yes to (3): is the finished MJSynth model then trained on the MapTextSynthesizer dataset, or is it fine-tuned/scoped to that dataset?

5.) Assuming yes to (4): what global_step/loss/learning rate, etc., is ideal for training on that dataset?

Yes/no should suffice for all five; if no, maybe a very short rationale. Thank you again for making this project public. I wish your team well at ICDAR 2019. I will also post a 1-million-step model trained on a single GPU if your team would like a copy on hand or to provide to the public.

@weinman (Owner) commented Aug 27, 2019

  1. Yes, the mjsynth dataset contains only images of single words.
  2. Yes, training on mjsynth alone will not work well for segmenting words (i.e., with spaces).
  3. No. We created and used MapTextSynthesizer because the visual properties of MJSynth were not a good match for our application. By default, it also generates only images of single words, but with more complicated backgrounds and wider inter-character spacing (on average).
  4. For the results in our ICDAR'19 paper, we train the model from scratch solely on the MapTextSynthesizer stream.
  5. The training schedule we use is given in the paper (Table II), with average (per-word) loss on the real map and MJSynth data given in Figure 6. See issue #42 ("often recognize 'u' wrongly") for some additional context/examples.

Indirectly, you could probably train a sequence recognizer with MapTextSynthesizer. You could generate a static list of captions (phrases) to sample from as if they were words (though I'm not sure whether the spaces would render properly, maybe @arthurhero knows), but the better thing to do would be to choose the random phrase dynamically on the fly, which would require some more substantial modifications to the code.
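A minimal sketch of the static-list idea, assuming you have some word corpus on hand to draw from (the file names, phrase lengths, and caption count here are illustrative placeholders, not part of the repo):

```python
# Build a static caption list of multi-word phrases that the synthesizer
# could sample from as if each phrase were a single "word".
import random

random.seed(0)

# corpus.txt is a stand-in for whatever word source you have available.
with open('corpus.txt') as f:
    words = f.read().split()

# Join 2-4 random words into each caption; 10,000 captions is arbitrary.
phrases = [' '.join(random.choices(words, k=random.randint(2, 4)))
           for _ in range(10000)]

with open('captions.txt', 'w') as f:
    f.write('\n'.join(phrases))
```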

In either case, you could then use the CTCWordBeamSearch module in a multi-word mode to recognize the text (or plain TensorFlow CTC beam search if you don't want a lexicon). Just remember to include a space among the output characters in charset.py.
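A minimal sketch of the plain-TensorFlow route, assuming a charset that already includes the space character (the random logits are just a placeholder for your model's output, and this charset is illustrative, not the one in charset.py):

```python
# Decode multi-word text with TensorFlow's built-in CTC beam search.
import tensorflow as tf

out_charset = 'abcdefghijklmnopqrstuvwxyz0123456789 '  # note the space

# Stand-in for model output: [max_time, batch_size, num_classes],
# where the extra class is the CTC blank label.
logits = tf.random.normal([50, 1, len(out_charset) + 1])
seq_len = tf.constant([50])

decoded, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len, beam_width=10)
dense = tf.sparse.to_dense(decoded[0], default_value=-1).numpy()[0]
print(''.join(out_charset[i] for i in dense if i >= 0))
```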

@arthurhero

The spacing should be fine. But since phrases tend to be longer than words, pay attention to the hard upper limit on the image width, which is set in mts_texthelper.cpp at line 562:

```c++
surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 40*height, height);
```

Currently the hard limit is 40 times the image height. You might want to set it higher for phrases.
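For example, a hypothetical edit (the 80x factor is just an illustration; pick whatever accommodates your longest phrases):

```c++
// Raise the width cap from 40x to 80x the height so that longer,
// multi-word captions are not clipped when the surface is created.
surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 80*height, height);
```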

@tschamp31 (Author)

Perfect, that feedback was exactly what I needed. Thank you both. I will continue updating the code to cleaner TF 2.0, ideally getting rid of all "tf.compat.vX" calls.
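For what it's worth, many of those shims map one-to-one onto native TF2 symbols; a hypothetical before/after (not claiming this exact call appears in the repo):

```python
import tensorflow as tf

logits = tf.random.normal([50, 1, 37])  # placeholder model output
seq_len = tf.constant([50])

# TF1-style call kept alive through the compat shim:
d1, _ = tf.compat.v1.nn.ctc_beam_search_decoder(logits, seq_len)

# Native TF2 spelling, no compat prefix:
d2, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len)
```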

As I said before, thank you again for making this project public.
