
new api #116

Merged: masklinn merged 3 commits into ua-parser:master from new_api on Feb 6, 2024
Conversation

masklinn (Contributor) commented May 2, 2022

  • migrate results to dataclasses, trivially convertible to the old dict results via dataclasses.asdict (see the sketch after this list)
  • add an intermediate result type which is not fully set, useful both for caching partial results and for signalling behaviour to the caller more clearly
  • introduce a high-level "Parser" API for the root operation (instead of free functions), and replace the current parsers with matchers, which serve a similar role but are less of a hard requirement
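A minimal sketch of the first point, with illustrative class and field names (the actual dataclasses may differ):

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Illustrative result type, not the library's actual definition.
@dataclass(frozen=True)
class UserAgent:
    family: str = "Other"
    major: Optional[str] = None
    minor: Optional[str] = None
    patch: Optional[str] = None

ua = UserAgent(family="Firefox", major="100")
legacy = asdict(ua)  # old-style dict result, for backwards compatibility
assert legacy == {"family": "Firefox", "major": "100", "minor": None, "patch": None}
```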

QUESTIONS

  • are the Pretty* functions actually useful / necessary? They've been here from the start, but it's not entirely clear that they earn their keep

TODO

  • better testing for the newly accessible bits
  • Demo alternative parser? Implemented an re2-based parser which seems to work really well (exactly how well will have to wait for #163 (Add benchmarking), and possibly course correction)
  • move the addition of mypy to a separate commit and tighten it significantly; see #179 (mypy --strict) for details
  • check if putting all the regexes in the same re2.Filter (and post-filter) works better than creating a filter per domain
    After testing it, the result is mixed: parsing through the data file of #163 (Add benchmarking) is about 7% faster, but parsing just one domain is 30% to 270% slower (not a typo: parse_user_agent goes from 0.58s to 1.57s, parse_device fares best).
    This does make sense: the unified parser essentially always does the same amount of work, and that work is largely dictated by the devices set, which is by far the largest and most complex, so it severely impacts the UA and OS sets. The single match does gain an edge when all three domains are parsed, but I don't think that edge is sufficient to justify the loss in performance when parsing individual domains.

@masklinn masklinn force-pushed the new_api branch 4 times, most recently from 09ef7fc to e747b6a on May 2, 2022 11:10
@masklinn masklinn marked this pull request as draft May 2, 2022 11:27
@masklinn masklinn force-pushed the new_api branch 2 times, most recently from 8800508 to b9674f8 on May 2, 2022 11:38
masklinn (Contributor, Author) commented May 2, 2022

@jab @ThiefMaster since you were apparently interested in a more idiomatic API, this is my starting point / current thinking. It is very much patterned after the idea of utility functions in front of a flexible (and hopefully reasonably compositional) pile of classes.

So the basic use should be the same as today (a small handful of functions performing standard tasks, just better named, and returning dataclasses instead of unspecified dicts), but the intermediate Parser object would provide users of the library with a lot more flexibility in terms of caching strategy, matchers, or even parsing.

@masklinn masklinn force-pushed the new_api branch 5 times, most recently from d9b166f to 45b50f7 on May 3, 2022 12:10
@masklinn masklinn linked an issue Aug 21, 2022 that may be closed by this pull request
@masklinn masklinn changed the title from "new api first draft" to "new api draft" May 3, 2023
@masklinn masklinn force-pushed the new_api branch 2 times, most recently from 42c329c to a7d49b1 on May 3, 2023 10:20
@masklinn masklinn force-pushed the new_api branch 6 times, most recently from ea47fab to 902d755 on May 3, 2023 20:48
masklinn added a commit to masklinn/uap-python that referenced this pull request Nov 3, 2023
@masklinn masklinn marked this pull request as ready for review November 3, 2023 19:53
@masklinn masklinn changed the title from "new api draft" to "new api" Nov 3, 2023
masklinn added a commit to masklinn/uap-python that referenced this pull request Jan 14, 2024
@masklinn masklinn mentioned this pull request Jan 15, 2024
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 3, 2024
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 3, 2024
@masklinn masklinn mentioned this pull request Feb 4, 2024
Remove partial typing on the legacy API, whose only effect is to break
typechecking.

Fixes ua-parser#179

Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter` which, unlike the C++ `FilteredRE2`, bundles
prefiltering using an `re2.Set`, so it is likely less efficient than
providing one's own prefilter (e.g. aho-corasick) but avoids having to
implement one.

At first glance, according to pytest's `--durations 0`, this is quite
successful (unlike using `re2.Set`, which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, although the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
@masklinn masklinn merged commit e719a7e into ua-parser:master Feb 6, 2024
29 checks passed
masklinn added a commit that referenced this pull request Feb 6, 2024
New API with full typing
========================

Seems pretty self-explanatory: rather than returning somewhat ad-hoc
dicts, this API works off of dataclasses; it should be compatible with
the legacy version through the magic of ~~buying two of them~~
`dataclasses.asdict`.

Parser API
==========

The legacy version had "parsers" which really represented individual
parsing rules. In the new API the job of a parser is what the
top-level functions did: it wraps the entire job of parsing a
user-agent string.

The core API is just `__call__`, with a selection flag for the domains
("domain" seems like the least bad term for what "user agent", "os",
and "device" are; the alternatives I considered are "component" and
"category", but I'm still ambivalent). Overridable helpers are
provided which match the old API's methods (with PEP8 conventions), as
well as the same style of helpers at the package toplevel.
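A sketch of that shape, with illustrative names (the real flag, result, and helper names may differ):

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional, Protocol

class Domain(Flag):
    USER_AGENT = auto()
    OS = auto()
    DEVICE = auto()
    ALL = USER_AGENT | OS | DEVICE

@dataclass(frozen=True)
class PartialResult:
    domains: Domain  # which domains were actually resolved
    user_agent: Optional[object] = None
    os: Optional[object] = None
    device: Optional[object] = None

class Parser(Protocol):
    def __call__(self, ua: str, domains: Domain, /) -> PartialResult: ...

# An overridable helper in the old API's style is then a one-liner:
def parse_user_agent(parser: Parser, ua: str) -> Optional[object]:
    return parser(ua, Domain.USER_AGENT).user_agent
```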

This resolves a number of limitations:

Concurrency
-----------

While the library should be thread-safe (and I need to find a way to
test that), the ability to instantiate parsers should provide the
opportunity for things like thread-local parsers, or actual
parallelism if we start using native extensions (regex, re2).

It also allows running multiple *parser configurations* concurrently,
including e.g. multiple independent custom yaml sets. Not sure there's
a use for it, but why not?

At the very least it should make using custom YAML datasets much
easier than having to set envvars.

The caching parser being stateful, it's protected by an optional lock,
which seems like the best way to make caching thread-safe. When only
using a single thread, or using thread-local parsers, locking can be
disabled by passing a `contextlib.nullcontext` as the lock.
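Concretely, the lock only needs to be a context manager, which is what makes `contextlib.nullcontext` a valid opt-out; a minimal illustration (names are not the library's API):

```python
import contextlib
import threading

def make_lock(single_threaded: bool):
    # A real lock and a nullcontext are interchangeable as context managers.
    return contextlib.nullcontext() if single_threaded else threading.Lock()

cache: dict = {}
with make_lock(single_threaded=True):  # no-op here, mutual exclusion otherwise
    cache.setdefault("Mozilla/5.0 ...", "cached result")
```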

Customization
-------------

Public APIs are provided both to instantiate and tune parsers, and to
set the global parser. Hopefully this makes evaluating proposed
parsers as well as evaluating & tuning caches (algorithm & size)
easier. Even more so as we should provide some sort of evaluation CLI
in #163.

Caches
------

In the old API, the package-provided cache could only be global, with
a single implementation, as it had to integrate with the toplevel
parsing functions. By reifying the parsing job, a cache is just a
parser which delegates the parse if it doesn't have a hit.

This allows more easily providing, testing, and evolving alternative
cache strategies.
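A sketch of that delegation, assuming a parser is any callable from a user-agent string to a result (illustrative, not the package's actual class):

```python
from typing import Callable, Dict, Generic, TypeVar

R = TypeVar("R")

class CachingParser(Generic[R]):
    """A cache is just a parser: on a hit return it, on a miss delegate and store."""

    def __init__(self, inner: Callable[[str], R], maxsize: int = 200) -> None:
        self.inner = inner
        self.maxsize = maxsize
        self.cache: Dict[str, R] = {}

    def __call__(self, ua: str) -> R:
        if ua in self.cache:
            return self.cache[ua]
        result = self.inner(ua)
        if len(self.cache) < self.maxsize:  # naive policy; LRU etc. would slot in here
            self.cache[ua] = result
        return result

# Usage: wrap any base parser (here a stub returning a constant result).
cached = CachingParser(lambda ua: {"family": "Other"})
cached("Mozilla/5.0 ...")  # miss: delegates and stores
cached("Mozilla/5.0 ...")  # hit: returned from the cache
```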

Bulk APIs
---------

The current parser checks rules (regexes) one at a time against the
input, but there are advanced regex APIs which can check a regex *set*
and return which one(s) matched, allowing much more efficient bulk
matching (e.g. Google's re2, Rust's regex).

With the old scheme, this would have been a pretty significant change
in use / behaviour, bypassing the "parsers" entirely with no
recourse. Under the new parsing scheme, these can just be different
"base" parsers: they can be the default, they can be cached, and users
can instantiate their own parser instead.
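For illustration, the "which one(s) matched" shape with the google-re2 bindings mentioned above (the `re2.Set` names are assumed and should be checked against the actual bindings):

```python
import re2  # google-re2 bindings; Set API assumed, check signatures

s = re2.Set.SearchSet()
idx = [s.Add(p) for p in (r"Android", r"iPhone", r"Windows NT")]
s.Compile()
# A single scan reports which patterns matched, instead of N separate scans.
print(s.Match("Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)"))  # [1]
```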

Misc
----

The new API's UA extractor pipeline supports `patch_minor`, though
that requires excluding that bit from the tests as there are
apparently broken test cases around that
item (ua-parser/uap-core#562).

Init Helpers
============

Having proper parsers is the opportunity to allow setting parsers at
runtime more easily (instead of via load-time envvars); however,
optional constructors (classmethods) turn out to be iffy from both an
API and a typing perspective.

Instead have the "base" parsers (the ones doing the actual parsing of
the UAs) just take a uniform parsed data set, and have utility loaders
provide that from various data sources (precompiled, preformatted, or
data files). This avoids redundancy and the need for mixins /
inheritance, and mypy is *much* happier.
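Sketch of the idea: loaders are plain functions producing one uniform data shape, so base parsers need no alternate constructors (all names illustrative):

```python
from typing import List, Tuple

# One uniform "parsed data set" shape consumed by every base parser.
Matchers = Tuple[List[object], List[object], List[object]]  # (ua, os, device)

def load_builtin() -> Matchers:
    return ([], [], [])  # e.g. from a precompiled module

def load_yaml(path: str) -> Matchers:
    return ([], [], [])  # e.g. parse a custom regexes.yaml into the same shape

data = load_builtin()  # any base parser can be constructed from `data`
```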

Legacy Parsers -> New Matchers
==============================

The bridging of the legacy parsers and the new results turned out to
be pretty mid.

Instead, the new API relies on similar but better-typed matcher
classes with a slightly different API: they return `None` on a match
failure instead of a triplet, which makes them compose better in
iteration (e.g. failures can just be `filter`ed out).
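A sketch of why `None`-on-failure composes well (the matcher shape is illustrative):

```python
import re
from typing import List, Optional

class Matcher:
    def __init__(self, pattern: str, family: str) -> None:
        self.regex = re.compile(pattern)
        self.family = family

    def __call__(self, ua: str) -> Optional[str]:
        # None on failure rather than a sentinel triplet.
        return self.family if self.regex.search(ua) else None

matchers: List[Matcher] = [Matcher(r"Firefox/", "Firefox"), Matcher(r"Chrome/", "Chrome")]
ua = "Mozilla/5.0 ... Chrome/120.0 Safari/537.36"
# Failures simply filter out; the first hit wins.
print(next(filter(None, (m(ua) for m in matchers)), None))  # Chrome
```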

Add a `Matchers` alias (a tuple of lists of matchers) to carry them
around for convenience, as well as to serve as the base parsers' data
parameter.

Also clarify the replacer rules, and hopefully implement the thing
more clearly.

Fixes #93, fixes #142, closes #116
@masklinn masklinn deleted the new_api branch February 6, 2024 19:08
Successfully merging this pull request may close these issues.

Make the API more pythonic (snake_case function names)