-
-
Notifications
You must be signed in to change notification settings - Fork 17
Port quotation denormalization #316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
(I don't seem to be able to request @benjaminking's review) |
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think PunctuationAnalysis should be a top-level namespace, i.e. SIL.Machine.PunctuationAnalysis. We will need to make a corresponding change in Machine.py.
Reviewed 45 of 50 files at r1, 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine/Corpora/PunctuationAnalysis/TextSegment.cs line 40 at r2 (raw file):
} public override bool Equals(object obj)
We should implement IEquatable.
src/SIL.Machine/Corpora/UsfmToken.cs line 67 at r2 (raw file):
} public override bool Equals(object obj)
You should implement the IEquatable interface for proper type checking.
src/SIL.Machine/Corpora/QuoteConventionChangingUsfmUpdateBlockHandler.cs line 12 at r2 (raw file):
private readonly QuoteConvention _targetQuoteConvention; private readonly QuotationMarkUpdateSettings _settings; protected QuotationMarkFinder _quotationMarkFinder;
Generally, I don't like exposing fields to external classes. If it is exposed to an external class, it should be a property.
src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 5 at r2 (raw file):
namespace SIL.Machine.Corpora.PunctuationAnalysis { public class StandardQuoteConventions
Maybe call this class QuoteConventions and the constant Standard. Also, the class can be static.
src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 7 at r2 (raw file):
public class StandardQuoteConventions { public static QuoteConventionSet QuoteConventions = new QuoteConventionSet(
This should be read-only.
src/SIL.Machine/Corpora/PunctuationAnalysis/Chapter.cs line 7 at r2 (raw file):
public class Chapter { public Chapter(List<Verse> verses)
It can be dangerous assigning a mutable collection to a field in a class constructor. I'm guessing you are following the Python code, so it might have the same issue. It is safer to make a copy of the collection. I would pass in an IEnumerable and call ToList.
src/SIL.Machine/Corpora/QuotationMarkUpdateFirstPass.cs line 15 at r2 (raw file):
private readonly QuotationMarkFinder _quotationMarkFinder; private readonly DepthBasedQuotationMarkResolver _quotationMarkResolver; public bool WillFallbackModeWork;
This should be a property.
src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 10 at r2 (raw file):
public class QuotationMarkResolverState { public Stack<QuotationMarkMetadata> Quotations { get; private set; }
It is usually better to expose a read-only interface for an internal collection, so that callers don't accidentally mutate the collection.
src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 386 at r2 (raw file):
public class DepthBasedQuotationMarkResolver : IQuotationMarkResolver { public readonly IQuotationMarkResolutionSettings Settings;
These should be properties.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 26 at r2 (raw file):
} public override bool Equals(object obj)
You should implement the IEquatable interface.
tests/SIL.Machine.Tests/Corpora/QuotationDenormalizationTests.cs line 45 at r2 (raw file):
UsfmParser.Parse(normalizedUsfm, quotationMarkDenormalizationFirstPass); var bestChapterStrategies = quotationMarkDenormalizationFirstPass.FindBestChapterStrategies();
var should only be used if the type is explicit elsewhere in the line. Since there are so many unit tests that are being added, I'm fine if you want to skip these small niggles.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkMetadata.cs line 37 at r2 (raw file):
} public override bool Equals(object obj)
You should implement the IEquatable interface.
tests/SIL.Machine.Tests/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzerTests.cs line 13 at r2 (raw file):
{ var apostropheProportionStatistics = new ApostropheProportionStatistics(); apostropheProportionStatistics.CountCharacters(new TextSegment.Builder().SetText("'").Build());
I am seeing a lot of code to build text segments. I feel like there has got to be a more convenient/concise way to build a TextSegment. Maybe a function like TextSegment.Create that accepts an Action<TextSegment.Builder> delegate. Once again, just a suggestion. It is perfectly understandable if you don't want to mess with the unit tests.
src/SIL.Machine/Corpora/PunctuationAnalysis/Verse.cs line 8 at r2 (raw file):
public class Verse { public List<TextSegment> TextSegments { get; private set; }
Can this be IReadOnlyList?
src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 59 at r2 (raw file):
public void Reset() { _wordInitialOccurrences = new Dictionary<string, int>();
It is usually more efficient to call Clear rather than construct a new collection. These fields can be made read-only.
src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 154 at r2 (raw file):
public void Reset() { _earlierQuotationMarkCounts = new Dictionary<string, int>();
See comment above.
src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 207 at r2 (raw file):
{ private readonly QuoteConventionSet _quoteConventions; private Dictionary<string, List<QuotationMarkStringMatch>> _groupedQuotationMarks;
This should be read-only.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuoteConventionSet.cs line 27 at r2 (raw file):
} public override bool Equals(object obj)
We should implement IEquatable.
src/SIL.Machine/Corpora/PunctuationAnalysis/IQuotationMarkResolver.cs line 7 at r2 (raw file):
public interface IQuotationMarkResolver { IEnumerable<QuotationMarkMetadata> ResolveQuotationMarks(List<QuotationMarkStringMatch> quotationMarkMatches);
It is usually better to pass around read-only collections by default. Only pass mutable collections if necessary.
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 200 at r2 (raw file):
if (characterString.Length == 1) { return UnicodeInfo.GetName(characterString[0]).Contains(attribute);
This feels inefficient. It is basically performing a string search over the Unicode character name for every character. It would be more efficient to get the Unicode category for the character. What did .NET not support that requires the use of this dependency?
|
I'll try and address these comments tomorrow. For convenience, this PR will also address #319 and will include porting the work from sillsdev/machine.py#210 and sillsdev/machine.py#207. |
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made PunctuationAnalysis a top-level namespace. I'll follow up in machine.py once this is through.
Reviewable status: 6 of 62 files reviewed, 18 unresolved discussions (waiting on @ddaspit)
src/SIL.Machine/Corpora/QuotationMarkUpdateFirstPass.cs line 15 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be a property.
Done.
src/SIL.Machine/Corpora/QuoteConventionChangingUsfmUpdateBlockHandler.cs line 12 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Generally, I don't like exposing fields to external classes. If it is exposed to an external class, it should be a property.
Done. A lot of this exposure is for testing. I don't like it either - I think it's part of the reason that Ben was able to get code coverage so high. I wasn't sure how best to address it.
src/SIL.Machine/Corpora/UsfmToken.cs line 67 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should implement the
IEquatableinterface for proper type checking.
Done.
tests/SIL.Machine.Tests/Corpora/QuotationDenormalizationTests.cs line 45 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
varshould only be used if the type is explicit elsewhere in the line. Since there are so many unit tests that are being added, I'm fine if you want to skip these small niggles.
I'll see what I can do about this.
src/SIL.Machine/Corpora/PunctuationAnalysis/Chapter.cs line 7 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It can be dangerous assigning a mutable collection to a field in a class constructor. I'm guessing you are following the Python code, so it might have the same issue. It is safer to make a copy of the collection. I would pass in an
IEnumerableand callToList.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 10 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It is usually better to expose a read-only interface for an internal collection, so that callers don't accidentally mutate the collection.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/DepthBasedQuotationMarkResolver.cs line 386 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
These should be properties.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/IQuotationMarkResolver.cs line 7 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It is usually better to pass around read-only collections by default. Only pass mutable collections if necessary.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 59 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It is usually more efficient to call
Clearrather than construct a new collection. These fields can be made read-only.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 154 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
See comment above.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzer.cs line 207 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be read-only.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkMetadata.cs line 37 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should implement the
IEquatableinterface.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 26 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should implement the
IEquatableinterface.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 200 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This feels inefficient. It is basically performing a string search over the Unicode character name for every character. It would be more efficient to get the Unicode category for the character. What did .NET not support that requires the use of this dependency?
It's very inefficient, but I haven't been able to find an alternative. The issue is that I'm trying to match behavior available in the python regexes like \p{QuotationMark} and \p{script=Latin}. Using the Unicode category, unfortunately, is not sufficient. The dependency allows you to get info that's aware of Unicode properties. Unfortunately, script is not one they've added so far, but the property is present in the name. We can discuss this further in our meeting if you like.
src/SIL.Machine/Corpora/PunctuationAnalysis/QuoteConventionSet.cs line 27 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We should implement
IEquatable.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 5 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Maybe call this class
QuoteConventionsand the constantStandard. Also, the class can be static.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/StandardQuoteConventions.cs line 7 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be read-only.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/TextSegment.cs line 40 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We should implement
IEquatable.
Done.
src/SIL.Machine/Corpora/PunctuationAnalysis/Verse.cs line 8 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Can this be
IReadOnlyList?
Yep, done
tests/SIL.Machine.Tests/Corpora/PunctuationAnalysis/PreliminaryQuotationMarkAnalyzerTests.cs line 13 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I am seeing a lot of code to build text segments. I feel like there has got to be a more convenient/concise way to build a
TextSegment. Maybe a function likeTextSegment.Createthat accepts anAction<TextSegment.Builder>delegate. Once again, just a suggestion. It is perfectly understandable if you don't want to mess with the unit tests.
I agree, but yes, I'm not sure how feasible it is to rework. Maybe something we can think about once it's on QA? Wouldn't affect functionality...
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 62 files reviewed, 18 unresolved discussions (waiting on @ddaspit)
src/SIL.Machine/Corpora/PunctuationAnalysis/QuotationMarkStringMatch.cs line 200 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
It's very inefficient, but I haven't been able to find an alternative. The issue is that I'm trying to match behavior available in the python regexes like
\p{QuotationMark}and\p{script=Latin}. Using the Unicode category, unfortunately, is not sufficient. The dependency allows you to get info that's aware of Unicode properties. Unfortunately, script is not one they've added so far, but the property is present in the name. We can discuss this further in our meeting if you like.
Switched to using PCRE.NET 🥳 - thank you, guys, for turning me onto that.
|
I believe I fixed all the unclear uses of |
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this was a lot of tedious work. Good job.
Reviewed 1 of 4 files at r3, 56 of 56 files at r4, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
src/SIL.Machine/Corpora/QuoteConventionChangingUsfmUpdateBlockHandler.cs line 12 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Done. A lot of this exposure is for testing. I don't like it either - I think it's part of the reason that Ben was able to get code coverage so high. I wasn't sure how best to address it.
It is fine to do this occasionally when it is difficult to test code without a direct call.
4a8232d to
61428af
Compare
Besides a couple changes that may come as a result of outstanding comments in the machine.py PR, it is all ported. I have yet to do a careful look through myself, but reviewing can begin in the meantime so we can get it all in ASAP.
This change is