
Improve the documentation associated with SentenceSplittingMode #39

Open
garethbirduk opened this issue Oct 8, 2023 · 1 comment

@garethbirduk

Describe the bug
This is a documentation-improvement suggestion more than a bug report.

  1. The expected impact of the enum DeepL.SentenceSplittingMode (SentenceSplittingMode.cs), as used by DeepL.Translator.TranslateTextAsync, is poorly documented.
  2. The expected impact cannot be determined from the corresponding unit test either; the tests are poorly implemented, as they assert only that no exception is thrown rather than a positively expected outcome.

Specifically, whether and how a newline character (\n) is retained in the output is not explained.

To Reproduce
Steps to reproduce the behavior:

  1. Review the documentation:

     Enum controlling how input translation text should be split into sentences.

     This says nothing about how the splitting of the input text affects the output translation text.

  2. Review the unit tests:
      var translator = CreateTestTranslator();
      const string text = "If the implementation is hard to explain, it's a bad idea.\n" +
                          "If the implementation is easy to explain, it may be a good idea.";

      await translator.TranslateTextAsync(
            text,
            null,
            "DE",
            new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.Off });

Expected behavior

  1. Documentation should be added describing how each enum value affects the expected output translation text.
  2. The existing unit test should be replaced by unit tests that verify this documented behaviour is adhered to.
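To make the request concrete, here is a hedged sketch of what an outcome-asserting test could look like. It assumes (and this is exactly the assumption the documentation should confirm or refute) that with SentenceSplittingMode.NoNewlines the \n in the input survives into the output; the enum value name and the xUnit-style assertion are taken from the deepl-dotnet repository's conventions but should be checked against the actual code:

```csharp
// Hypothetical sketch: assert on structure (newline retention), not on exact wording,
// since the translated text itself may change between model versions.
var translator = CreateTestTranslator();
const string text = "If the implementation is hard to explain, it's a bad idea.\n" +
                    "If the implementation is easy to explain, it may be a good idea.";

var result = await translator.TranslateTextAsync(
      text,
      null,
      "DE",
      new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.NoNewlines });

// ASSUMPTION to verify and then document: the newline is preserved, so the
// output still contains exactly two lines.
Assert.Equal(2, result.Text.Split('\n').Length);
```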
@JanEbbing
Member

Hi @garethbirduk, thanks for the feedback. There is some more explanation of the parameter in our API docs (the "Multiple Sentences" section, and the split_sentences option in the /translate endpoint documentation above it).

I'm happy to improve the documentation/unit testing here, but I'm not sure how best to achieve either.

(The following involves some simplification.) Our neural networks are trained on translated sentences: the expected input is one sentence in the source language, and the expected output is the translated sentence in the target language. Real-world data, unfortunately, is not always perfectly punctuated, perfectly spelled, or even a full sentence. For example, if you use the DeepL API to translate chat messages, people might rarely end a sentence with a period ".".

That is why the sentence-splitting option exists: to let users configure what constitutes a sentence in their use case. In the chat setting, we might want to split on punctuation and, more importantly, on newlines; when translating a website, we might want to split only on punctuation, as the HTML (or similar markup) might require newlines inside sentences.

Now, what happens if you use the wrong setting for your use case? Our models go out of distribution, and predicting exactly what will happen then is an open problem in machine learning. We can, however, expect the translation quality to go down (in a good case, the model merely tries to translate two sentences as one).
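The two use cases above can be sketched as option choices in the .NET client. This is an illustrative sketch only; the enum member names (Off, All, NoNewlines) mirror the API's split_sentences values of 0, 1, and nonewlines, but should be confirmed against the current deepl-dotnet source:

```csharp
// Chat-style input: sentences often end at newlines rather than periods,
// so split on both punctuation and newlines (the API default).
var chatOptions = new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.All };

// Markup-heavy input where newlines can occur mid-sentence: split on
// punctuation only, so a newline does not force a sentence boundary.
var htmlOptions = new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.NoNewlines };

// Pre-segmented input (one sentence per request): disable splitting entirely.
var preSplitOptions = new TextTranslateOptions { SentenceSplittingMode = SentenceSplittingMode.Off };
```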

On the unit tests (I'd argue these are more integration tests): could you suggest a way to test any of this? The only approach I can imagine is detecting how many sentences there are in the output and comparing that with how many there were in the input, but this is very brittle; model changes might break the assumption, etc. Hence we test these things on the model side, not in the client libraries: the client libraries should only check that they correctly pass the setting to the API, the API tests should check that it is passed correctly to the models, and so on.
