Skip to content

Conversation

@Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Jul 28, 2025

Besides a couple changes that may come as a result of outstanding comments in the machine.py PR, it is all ported. I have yet to do a careful look through myself, but reviewing can begin in the meantime so we can get it all in ASAP.


This change is Reviewable

@Enkidu93 Enkidu93 requested a review from ddaspit July 28, 2025 22:48
@Enkidu93
Copy link
Collaborator Author

(I don't seem to be able to request @benjaminking's review)

@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2025

Codecov Report

❌ Patch coverage is 91.80815% with 193 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.18%. Comparing base (66ce53a) to head (cc00f4e).

Files with missing lines Patch % Lines
src/SIL.Machine/PunctuationAnalysis/TextSegment.cs 75.24% 25 Missing ⚠️
...ne/PunctuationAnalysis/QuotationMarkStringMatch.cs 81.30% 15 Missing and 5 partials ⚠️
...hine/PunctuationAnalysis/UsfmStructureExtractor.cs 79.59% 20 Missing ⚠️
...tuationAnalysis/DepthBasedQuotationMarkResolver.cs 95.56% 11 Missing and 6 partials ⚠️
...chine/PunctuationAnalysis/QuotationMarkMetadata.cs 72.13% 16 Missing and 1 partial ⚠️
...L.Machine/Corpora/FallbackQuotationMarkResolver.cs 87.60% 11 Missing and 4 partials ⚠️
src/SIL.Machine/Corpora/UsfmToken.cs 42.30% 13 Missing and 2 partials ⚠️
...rpora/ZipParatextProjectQuoteConventionDetector.cs 0.00% 13 Missing ⚠️
.../Corpora/ParatextProjectQuoteConventionDetector.cs 62.50% 12 Missing ⚠️
....Machine/PunctuationAnalysis/QuoteConventionSet.cs 94.76% 9 Missing ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
+ Coverage   70.75%   72.18%   +1.42%     
==========================================
  Files         390      414      +24     
  Lines       32806    35129    +2323     
  Branches     4612     4858     +246     
==========================================
+ Hits        23213    25359    +2146     
- Misses       8527     8681     +154     
- Partials     1066     1089      +23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think PunctuationAnalysis should be a top-level namespace, i.e. SIL.Machine.PunctuationAnalysis. We will need to make a corresponding change in Machine.py.

Reviewed 45 of 50 files at r1, 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/PunctuationAnalysis/TextSegment.cs line 40 at r2 (raw file):

        }

        public override bool Equals(object obj)

We should implement IEquatable.


src/SIL.Machine/Corpora/UsfmToken.cs line 67 at r2 (raw file):

        }

        public override bool Equals(object obj)

You should implement the IEquatable interface for proper type checking.


src/SIL.Machine/Corpora/QuoteConventionChangingUsfmUpdateBlockHandler.cs line 12 at r2 (raw file):

        private readonly QuoteConvention _targetQuoteConvention;
        private readonly QuotationMarkUpdateSettings _settings;
        protected QuotationMarkFinder _quotationMarkFinder;

Generally, I don't like exposing fields to external classes. If it is exposed to an external class, it should be a property.


src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 5 at r2 (raw file):

namespace SIL.Machine.Corpora.PunctuationAnalysis
{
    public class StandardQuoteConventions

Maybe call this class QuoteConventions and the constant Standard. Also, the class can be static.


src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 7 at r2 (raw file):

    public class StandardQuoteConventions
    {
        public static QuoteConventionSet QuoteConventions = new QuoteConventionSet(

This should be read-only.


src/SIL.Machine/Corpora/PunctuationAnalysis/Chapter.cs line 7 at r2 (raw file):

    public class Chapter
    {
        public Chapter(List<Verse> verses)

It can be dangerous assigning a mutable collection to a field in a class constructor. I'm guessing you are following the Python code, so it might have the same issue. It is safer to make a copy of the collection. I would pass in an IEnumerable and call ToList.


src/SIL.Machine/Corpora/QuotationMarkUpdateFirstPass.cs line 15 at r2 (raw file):

        private readonly QuotationMarkFinder _quotationMarkFinder;
        private readonly DepthBasedQuotationMarkResolver _quotationMarkResolver;
        public bool WillFallbackModeWork;

This should be a property.


src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 10 at r2 (raw file):

    public class QuotationMarkResolverState
    {
        public Stack<QuotationMarkMetadata> Quotations { get; private set; }

It is usually better to expose a read-only interface for an internal collection, so that callers don't accidentally mutate the collection.


src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 386 at r2 (raw file):

    public class DepthBasedQuotationMarkResolver : IQuotationMarkResolver
    {
        public readonly IQuotationMarkResolutionSettings Settings;

These should be properties.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 26 at r2 (raw file):

        }

        public override bool Equals(object obj)

You should implement the IEquatable interface.


tests/SIL.Machine.Tests/Corpora/QuotationDenormalizationTests.cs line 45 at r2 (raw file):

        UsfmParser.Parse(normalizedUsfm, quotationMarkDenormalizationFirstPass);
        var bestChapterStrategies = quotationMarkDenormalizationFirstPass.FindBestChapterStrategies();

var should only be used if the type is explicit elsewhere in the line. Since there are so many unit tests that are being added, I'm fine if you want to skip these small niggles.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkMetadata.cs line 37 at r2 (raw file):

        }

        public override bool Equals(object obj)

You should implement the IEquatable interface.


tests/SIL.Machine.Tests/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzerTests.cs line 13 at r2 (raw file):

    {
        var apostropheProportionStatistics = new ApostropheProportionStatistics();
        apostropheProportionStatistics.CountCharacters(new TextSegment.Builder().SetText("'").Build());

I am seeing a lot of code to build text segments. I feel like there has got to be a more convenient/concise way to build a TextSegment. Maybe a function like TextSegment.Create that accepts an Action<TextSegment.Builder> delegate. Once again, just a suggestion. It is perfectly understandable if you don't want to mess with the unit tests.


src/SIL.Machine/Corpora/PunctuationAnalysis/Verse.cs line 8 at r2 (raw file):

    public class Verse
    {
        public List<TextSegment> TextSegments { get; private set; }

Can this be IReadOnlyList?


src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 59 at r2 (raw file):

        public void Reset()
        {
            _wordInitialOccurrences = new Dictionary<string, int>();

It is usually more efficient to call Clear rather than construct a new collection. These fields can be made read-only.


src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 154 at r2 (raw file):

        public void Reset()
        {
            _earlierQuotationMarkCounts = new Dictionary<string, int>();

See comment above.


src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 207 at r2 (raw file):

    {
        private readonly QuoteConventionSet _quoteConventions;
        private Dictionary<string, List<QuotationMarkStringMatch>> _groupedQuotationMarks;

This should be read-only.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuoteConventionSet.cs line 27 at r2 (raw file):

        }

        public override bool Equals(object obj)

We should implement IEquatable.


src/SIL.Machine/Corpora/PunctuationAnalysis/IQuotationMarkResolver.cs line 7 at r2 (raw file):

    public interface IQuotationMarkResolver
    {
        IEnumerable<QuotationMarkMetadata> ResolveQuotationMarks(List<QuotationMarkStringMatch> quotationMarkMatches);

It is usually better to pass around read-only collections by default. Only pass mutable collections if necessary.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 200 at r2 (raw file):

            if (characterString.Length == 1)
            {
                return UnicodeInfo.GetName(characterString[0]).Contains(attribute);

This feels inefficient. It is basically performing a string search over the Unicode character name for every character. It would be more efficient to get the Unicode category for the character. What did .NET not support that requires the use of this dependency?

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Aug 5, 2025

I'll try and address these comments tomorrow.

For convenience, this PR will also address #319 and will include porting the work from sillsdev/machine.py#210 and sillsdev/machine.py#207.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made PunctuationAnalysis a top-level namespace. I'll follow up in machine.py once this is through.

Reviewable status: 6 of 62 files reviewed, 18 unresolved discussions (waiting on @ddaspit)


src/SIL.Machine/Corpora/QuotationMarkUpdateFirstPass.cs line 15 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be a property.

Done.


src/SIL.Machine/Corpora/QuoteConventionChangingUsfmUpdateBlockHandler.cs line 12 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Generally, I don't like exposing fields to external classes. If it is exposed to an external class, it should be a property.

Done. A lot of this exposure is for testing. I don't like it either - I think it's part of the reason that Ben was able to get code coverage so high. I wasn't sure how best to address it.


src/SIL.Machine/Corpora/UsfmToken.cs line 67 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should implement the IEquatable interface for proper type checking.

Done.


tests/SIL.Machine.Tests/Corpora/QuotationDenormalizationTests.cs line 45 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

var should only be used if the type is explicit elsewhere in the line. Since there are so many unit tests that are being added, I'm fine if you want to skip these small niggles.

I'll see what I can do about this.


src/SIL.Machine/Corpora/PunctuationAnalysis/Chapter.cs line 7 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It can be dangerous assigning a mutable collection to a field in a class constructor. I'm guessing you are following the Python code, so it might have the same issue. It is safer to make a copy of the collection. I would pass in an IEnumerable and call ToList.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 10 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is usually better to expose a read-only interface for an internal collection, so that callers don't accidentally mutate the collection.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 386 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These should be properties.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/IQuotationMarkResolver.cs line 7 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is usually better to pass around read-only collections by default. Only pass mutable collections if necessary.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 59 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is usually more efficient to call Clear rather than construct a new collection. These fields can be made read-only.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 154 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

See comment above.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 207 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be read-only.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkMetadata.cs line 37 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should implement the IEquatable interface.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 26 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should implement the IEquatable interface.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 200 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This feels inefficient. It is basically performing a string search over the Unicode character name for every character. It would be more efficient to get the Unicode category for the character. What did .NET not support that requires the use of this dependency?

It's very inefficient, but I haven't been able to find an alternative. The issue is that I'm trying to match behavior available in the python regexes like \p{QuotationMark} and \p{script=Latin}. Using the Unicode category, unfortunately, is not sufficient. The dependency allows you to get info that's aware of Unicode properties. Unfortunately, script is not one they've added so far, but the property is present in the name. We can discuss this further in our meeting if you like.


src/SIL.Machine/Corpora/PunctuationAnalysis/QuoteConventionSet.cs line 27 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should implement IEquatable.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 5 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Maybe call this class QuoteConventions and the constant Standard. Also, the class can be static.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 7 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be read-only.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/TextSegment.cs line 40 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should implement IEquatable.

Done.


src/SIL.Machine/Corpora/PunctuationAnalysis/Verse.cs line 8 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Can this be IReadOnlyList?

Yep, done


tests/SIL.Machine.Tests/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzerTests.cs line 13 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I am seeing a lot of code to build text segments. I feel like there has got to be a more convenient/concise way to build a TextSegment. Maybe a function like TextSegment.Create that accepts an Action<TextSegment.Builder> delegate. Once again, just a suggestion. It is perfectly understandable if you don't want to mess with the unit tests.

I agree, but yes, I'm not sure how feasible it is to rework. Maybe something we can think about once it's on QA? Wouldn't affect functionality...

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 62 files reviewed, 18 unresolved discussions (waiting on @ddaspit)


src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 200 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

It's very inefficient, but I haven't been able to find an alternative. The issue is that I'm trying to match behavior available in the python regexes like \p{QuotationMark} and \p{script=Latin}. Using the Unicode category, unfortunately, is not sufficient. The dependency allows you to get info that's aware of Unicode properties. Unfortunately, script is not one they've added so far, but the property is present in the name. We can discuss this further in our meeting if you like.

Switched to using PCRE.NET 🥳 - thank you, guys, for turning me onto that.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Aug 7, 2025

I believe I fixed all the unclear uses of var. Ready for re-review, @ddaspit.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was a lot of tedious work. Good job.

Reviewed 1 of 4 files at r3, 56 of 56 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)


src/SIL.Machine/Corpora/QuoteConventionChangingUsfmUpdateBlockHandler.cs line 12 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Done. A lot of this exposure is for testing. I don't like it either - I think it's part of the reason that Ben was able to get code coverage so high. I wasn't sure how best to address it.

It is fine to do this occasionally when it is difficult to test code without a direct call.

@Enkidu93 Enkidu93 force-pushed the port_quotation_denormalization branch from 4a8232d to 61428af Compare August 11, 2025 13:04
@Enkidu93 Enkidu93 merged commit 2670829 into master Aug 11, 2025
3 of 4 checks passed
@Enkidu93 Enkidu93 deleted the port_quotation_denormalization branch August 11, 2025 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants