tld: comprehensive rewrite #1939

Merged: 3 commits, Sep 24, 2020

Conversation

@dgw (Member) commented on Sep 17, 2020

Description

In short, this fixes Sopel being unable to find known-valid TLDs, and it makes things better than before: no more ugly, unmaintainable regex!

  • Pulls from IANA for verifying that a TLD is recognized (see the sketch after this list)
  • If a TLD is not in the IANA list, the command stops and says so
  • If the given TLD exists in the IANA list, it returns at least a confirmation of existence
  • If the TLD also exists in the Wikipedia article we've been using for years, more details from there are included
  • Wikipedia table data is now transformed into an in-memory dictionary using HTMLParser
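
For illustration, here is roughly what that lookup flow looks like. The function names, reply wording, and the use of `requests` are mine, not the PR's actual code; the URL is the TLD list IANA publishes:

```python
import requests

# Published list of every TLD currently recognized by IANA: one name per
# line, uppercase, preceded by a '#' comment line with version/date info.
IANA_LIST_URI = 'https://data.iana.org/TLD/tlds-alpha-by-domain.txt'


def get_iana_tld_list():
    """Fetch the IANA TLD list and parse it into a set of lowercase names."""
    lines = requests.get(IANA_LIST_URI).text.splitlines()
    return {line.lower() for line in lines if line and not line.startswith('#')}


def check_tld(tld, iana_tlds, wikipedia_data):
    """Build the reply for a given TLD query."""
    tld = tld.lstrip('.').lower()
    if tld not in iana_tlds:
        # Not recognized by IANA: stop here and say so.
        return "'%s' is not in IANA's list of valid TLDs." % tld
    # IANA recognizes it, so at minimum confirm that it exists...
    reply = "The '%s' TLD exists." % tld
    # ...and append extra detail when Wikipedia's tables also cover it.
    details = wikipedia_data.get(tld)
    if details:
        reply += ' ' + details
    return reply
```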

Both the IANA and Wikipedia data sources are cached locally, persisted/restored to/from the bot's database on shutdown/startup, and periodically updated by a plugin job (minimum age 7 days).
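
A minimal sketch of that lifecycle, assuming Sopel's standard plugin hooks and the `@plugin.interval` decorator (`sopel.plugin` as of 7.1, `sopel.module` in older versions). The key names and cache structure are made up for illustration, and `get_iana_tld_list()` is the fetch helper sketched above:

```python
import time

from sopel import plugin

PLUGIN = 'tld'
CACHE_KEY = 'iana_cache'        # hypothetical key name
MAX_AGE = 7 * 24 * 60 * 60      # refresh once the data is 7 days old


def setup(bot):
    # On startup, restore the cached list from the bot's database.
    cached = bot.db.get_plugin_value(PLUGIN, CACHE_KEY)
    if cached:
        bot.memory[CACHE_KEY] = cached


def shutdown(bot):
    # On shutdown, persist the in-memory cache so it survives restarts.
    if CACHE_KEY in bot.memory:
        bot.db.set_plugin_value(PLUGIN, CACHE_KEY, bot.memory[CACHE_KEY])


@plugin.interval(60 * 60)  # wake hourly, but refetch only when stale
def refresh_cache(bot):
    cache = bot.memory.get(CACHE_KEY)
    if cache and time.time() - cache['updated'] < MAX_AGE:
        return  # still fresh enough
    bot.memory[CACHE_KEY] = {
        'updated': time.time(),
        # a list (not a set), so the value serializes cleanly into the db
        'tlds': sorted(get_iana_tld_list()),
    }
```

Storing the timestamp alongside the data is what lets the job enforce the 7-day minimum age without a separate bookkeeping key.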

About the only thing I didn't do was add example-tests, but I haven't decided against doing so; I just haven't gotten to it yet.

Checklist

  • I have read CONTRIBUTING.md
  • I can and do license this contribution under the EFLv2
  • No issues are reported by make qa (runs make quality and make test)
  • I have tested the functionality of the things this change touches

See also

This might be sufficient to consider #1648 as "done". I've been unable to find any viable alternative data source that actually provides the information contained in Wikipedia's tables—even the basics like whether a TLD allows second-level registrations or supports DNSSEC/IDN/etc.

Commits

List is cached in `bot.memory` at runtime, and persisted to a plugin
value in `bot.db` at shutdown. If a cached list exists, it is loaded
into `bot.memory` at startup.

The list is updated automatically once the data is 7 days old.

Replaced those horrid (and ugly) regex patterns with a bare-bones HTML
parser, an in-memory lookup dictionary, and dynamic generation of the
output line depending on what fields are available.
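
As a rough illustration of the approach (not the PR's actual class, which also handles header rows and table quirks this sketch doesn't), a bare-bones `HTMLParser` subclass that folds each table row into a dictionary keyed by TLD, plus a hypothetical formatter showing the "only the fields that are available" idea:

```python
from html.parser import HTMLParser


class TLDTableParser(HTMLParser):
    """Collect each table row into {tld: [remaining cell texts]}."""

    def __init__(self):
        super().__init__()
        self.tlds = {}
        self._cells = None   # current row's cells, or None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag in ('td', 'th') and self._cells is not None:
            self._cells.append('')
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            # A cell can contain several text nodes; accumulate them all.
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._in_cell = False
        elif tag == 'tr' and self._cells:
            row = [cell.strip() for cell in self._cells]
            # First column is the TLD itself; the rest are its details.
            self.tlds[row[0].lstrip('.').lower()] = row[1:]
            self._cells = None


def format_reply(tld, headings, row):
    # Dynamic output: only mention columns that actually have a value.
    fields = ['%s: %s' % (h, v) for h, v in zip(headings, row) if v]
    return '.%s (%s)' % (tld, '; '.join(fields))


# Usage: feed the article HTML, then look rows up by name.
parser = TLDTableParser()
parser.feed(article_html)   # article_html: the fetched Wikipedia page body
parser.tlds.get('com')      # -> the rest of .com's row, or None
```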

Surely there are some remaining edge cases (though I threw a number of
different TLDs at this and knocked out some already), but this is an
absolutely MASSIVE improvement in maintainability IMO.

Uses the same caching logic as the IANA TLD list, so any given Sopel
instance will hit Wikipedia at most once a week or so.

Also moved the data-processing step inside the parser object, because
having a helper function for that at the module level didn't make sense.

dgw added the Feature label on Sep 17, 2020
dgw added this to the 7.1.0 milestone on Sep 17, 2020
dgw requested a review from a team on Sep 17, 2020 at 06:36
@dgw (Member, Author) commented on Sep 17, 2020

Update: I was gathering examples of the new output and found one table that gets parsed wrong. Will need to investigate that one. A quick look at the HTML just now didn't reveal any obvious structural difference, and I unfortunately can't spend any more time on this until Friday evening (9/18 in central US, for our non-American project members) at the earliest.

Still feels good to have mostly done, even if a gremlin or two remain lurking.

@Exirel (Contributor) left a comment

AFAICT it's working fine.

So yeah, sure, one table is messing with us. It's still far better than the current version.

@dgw (Member, Author) commented on Sep 24, 2020

I'll just merge this as-is and follow up with another PR to fix that table later (or an issue, if I don't get to it soon). Like @Exirel said, "it's working fine", and I agree that one table misbehaving a bit shouldn't block the vast improvement that is the rest of this PR.
