-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve English subtypes refs finder, add tests #1155
base: master
Are you sure you want to change the base?
Conversation
self.subtypes_string = '|'.join([re.escape(s) for s in self.subtype_names + self.subtype_abbreviations]) | ||
self.full_subtypes = Subtype.objects.all() | ||
self.subtypes = [s.name for s in self.full_subtypes] + \ | ||
[s.abbreviation for s in self.full_subtypes] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given what we've learnt from term matching and italics, does it make sense to sort these longest first?
(No\.?\s*)? | ||
(?P<num>\d+) | ||
(\s+of\s+|/) | ||
(?P<year>\d{{4}}) | ||
.{{0,60}}(\s|/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me a bit nervous, it's pretty broad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neil had an interesting idea which maybe we need to start exploring: start with a broad regex, and then filter down edge cases with supplemental regexes. His argument is it helps to keep the regexes simple and understandable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eg. this would be a false positive: as published in GN No 102. In section 4378 of the Act...
This stuff is tricky.
<paragraph eId="sec_1.paragraph-0"> | ||
<content> | ||
<p>Something to do with <ref href="/akn/za/act/gn/2012/102">GN no 102 published on 5 March 2012</ref>.</p> | ||
<p>And another thing about <ref href="/akn/za/act/si/1998/4">SI 4, published in Government Gazette 12345 of 31 January 1998</ref>.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should only markup the GN no 102
and SI 4
?
No description provided.