
Improve the documentation associated with SentenceSplittingMode #39

Open
garethbirduk opened this issue Oct 8, 2023 · 1 comment

@garethbirduk

Describe the bug
This is a documentation-improvement suggestion more than a bug report.

  1. The expected impact of the enum DeepL.SentenceSplittingMode (SentenceSplittingMode.cs), as used by DeepL.Translator.TranslateTextAsync, is poorly documented.
  2. The expected impact cannot be determined from the corresponding unit test either; the tests are poorly implemented, as they assert only that no exception is thrown rather than a positively expected outcome.

Specifically, whether and how a newline character (\n) is retained in the output is not explained.

To Reproduce
Steps to reproduce the behavior:

  1. Review the documentation:

     Enum controlling how input translation text should be split into sentences.

     This says nothing about how the splitting of the input text affects the output translation text.

  2. Review the unit tests:
      var translator = CreateTestTranslator();
      const string text = "If the implementation is hard to explain, it's a bad idea.\n" +
                          "If the implementation is easy to explain, it may be a good idea.";

      await translator.TranslateTextAsync(
            text,
            null,
            "DE",
            new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.Off });

Expected behavior

  1. Documentation should be added describing how each enum value affects the expected output translation text.
  2. The existing unit test should be replaced by unit tests that verify this documented behaviour is adhered to.
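To make the request concrete, here is a hedged sketch of what an outcome-asserting test could look like. It assumes (and this is exactly the assumption the documentation should confirm or refute) that with SentenceSplittingMode.NoNewlines the \n in the input survives into the output; the enum value name and the xUnit-style assertion are taken from the deepl-dotnet repository's conventions but should be checked against the actual code:

```csharp
// Hypothetical sketch: assert on structure (newline retention), not on exact wording,
// since the translated text itself may change between model versions.
var translator = CreateTestTranslator();
const string text = "If the implementation is hard to explain, it's a bad idea.\n" +
                    "If the implementation is easy to explain, it may be a good idea.";

var result = await translator.TranslateTextAsync(
      text,
      null,
      "DE",
      new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.NoNewlines });

// ASSUMPTION to verify and then document: the newline is preserved, so the
// output still contains exactly two lines.
Assert.Equal(2, result.Text.Split('\n').Length);
```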
@JanEbbing
Member

Hi @garethbirduk, thanks for the feedback. There is some more explanation of the parameter in our API docs (the "Multiple Sentences" section, and the split_sentences option in the /translate endpoint documentation above it).

I'm happy to improve the documentation/unit testing here, but I'm not sure how best to achieve either.

(The following involves some simplification.) Our neural networks are trained on translated sentences: the expected input is one sentence in the source language, and the expected output is the translated sentence in the target language. Real-world data, unfortunately, is not always perfectly punctuated, perfectly spelled, or even a full sentence. For example, if you use the DeepL API to translate chat messages, people might rarely end a sentence with a period ".".

That is why the sentence-splitting option exists: to let users configure what constitutes a sentence in their use case. In the chat setting, we might want to split on punctuation and, more importantly, on newlines; when translating a website, we might want to split only on punctuation, as the HTML (or similar markup) might require newlines inside sentences.

Now, what happens if you use the wrong setting for your use case? Our models go out of distribution, and predicting exactly what will happen then is an open problem in machine learning. We can, however, expect the translation quality to go down (in a good case, the model merely tries to translate two sentences as one).
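The two use cases above can be sketched as option choices in the .NET client. This is an illustrative sketch only; the enum member names (Off, All, NoNewlines) mirror the API's split_sentences values of 0, 1, and nonewlines, but should be confirmed against the current deepl-dotnet source:

```csharp
// Chat-style input: sentences often end at newlines rather than periods,
// so split on both punctuation and newlines (the API default).
var chatOptions = new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.All };

// Markup-heavy input where newlines can occur mid-sentence: split on
// punctuation only, so a newline does not force a sentence boundary.
var htmlOptions = new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.NoNewlines };

// Pre-segmented input (one sentence per request): disable splitting entirely.
var preSplitOptions = new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.Off };
```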

On the unit tests (I'd argue these are more integration tests): could you suggest a way to test any of this? The only approach I can imagine is detecting how many sentences there are in the output and comparing that with how many there were in the input, but this is very brittle; model changes might break the assumption, etc. Hence we test these things on the model side, not in the client libraries: the client libraries should only check that they correctly pass the setting to the API, the API tests should check that it is passed correctly to the models, and so on.
