
new api #116

Merged: masklinn merged 3 commits into ua-parser:master from new_api on Feb 6, 2024
Conversation

masklinn (Contributor) commented May 2, 2022

  • migrate results to dataclasses, trivially convertible to the old dict results via dataclasses.asdict (see the sketch after this list)
  • add an intermediate result type which is not fully set, useful both for caching partial results and for signalling behaviour to the caller more clearly
  • introduce a high-level "Parser" API for the root operation (instead of free functions), and replace the current parsers with matchers, which serve a similar role but are less of a hard requirement
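A minimal sketch of the first point, with illustrative class and field names (the actual dataclasses may differ):

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Illustrative result type, not the library's actual definition.
@dataclass(frozen=True)
class UserAgent:
    family: str = "Other"
    major: Optional[str] = None
    minor: Optional[str] = None
    patch: Optional[str] = None

ua = UserAgent(family="Firefox", major="100")
legacy = asdict(ua)  # old-style dict result, for backwards compatibility
assert legacy == {"family": "Firefox", "major": "100", "minor": None, "patch": None}
```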

QUESTIONS

  • are the Pretty* functions actually useful / necessary? They've been here from the start, but it's not entirely clear that they earn their keep

TODO

  • better testing for the newly accessible bits
  • Demo alternative parser? Implemented an re2-based parser which seems to work really well (exactly how well will have to wait for #163 (Add benchmarking), and possibly course correction)
  • move the addition of mypy to a separate commit and tighten it significantly; see #179 (mypy --strict) for details
  • check if putting all the regexes in the same re2.Filter (and post-filter) works better than creating a filter per domain
    After testing it, the result is mixed: parsing through the data file of #163 (Add benchmarking) is about 7% faster, but parsing just one domain is 30% to 270% slower (not a typo: parse_user_agent goes from 0.58s to 1.57s, parse_device fares best).
    This does make sense: the unified parser essentially always does the same amount of work, and that work is largely dictated by the devices set, which is by far the largest and most complex, so it severely impacts the UA and OS sets. The single match does gain an edge when all three domains are parsed, but I don't think that edge is sufficient to justify the loss in performance when parsing individual domains.

@masklinn masklinn force-pushed the new_api branch 4 times, most recently from 09ef7fc to e747b6a on May 2, 2022 11:10
@masklinn masklinn marked this pull request as draft May 2, 2022 11:27
@masklinn masklinn force-pushed the new_api branch 2 times, most recently from 8800508 to b9674f8 on May 2, 2022 11:38
masklinn (Contributor, Author) commented May 2, 2022

@jab @ThiefMaster since you were apparently interested in a more idiomatic API, this is my starting point / current thinking. It is very much patterned after the idea of utility functions in front of a flexible (and hopefully reasonably compositional) pile of classes.

So the basic use should be the same as today (a small handful of functions performing standard tasks, just better named, and returning dataclasses instead of unspecified dicts), but the intermediate Parser object would provide users of the library with a lot more flexibility in terms of caching strategy, matchers, or even parsing.

@masklinn masklinn force-pushed the new_api branch 5 times, most recently from d9b166f to 45b50f7 on May 3, 2022 12:10
@masklinn masklinn linked an issue Aug 21, 2022 that may be closed by this pull request
@masklinn masklinn changed the title from "new api first draft" to "new api draft" May 3, 2023
@masklinn masklinn force-pushed the new_api branch 2 times, most recently from 42c329c to a7d49b1 on May 3, 2023 10:20
@masklinn masklinn force-pushed the new_api branch 6 times, most recently from ea47fab to 902d755 on May 3, 2023 20:48
masklinn added a commit to masklinn/uap-python that referenced this pull request Nov 3, 2023
@masklinn masklinn marked this pull request as ready for review November 3, 2023 19:53
@masklinn masklinn changed the title from "new api draft" to "new api" Nov 3, 2023
masklinn added a commit to masklinn/uap-python that referenced this pull request Jan 14, 2024
@masklinn masklinn mentioned this pull request Jan 15, 2024
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 3, 2024
masklinn added a commit to masklinn/uap-python that referenced this pull request Feb 3, 2024
@masklinn masklinn mentioned this pull request Feb 4, 2024
Remove partial typing on the legacy API, whose only effect is to break
typechecking.

Fixes ua-parser#179

Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter` which, unlike the C++ `FilteredRE2`, bundles
prefiltering using an `re2.Set`, so it is likely less efficient than
providing one's own prefilter (e.g. aho-corasick) but avoids having to
implement one.

At first glance, according to pytest's `--durations 0`, this is quite
successful (unlike using `re2.Set`, which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, although the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
@masklinn masklinn merged commit e719a7e into ua-parser:master Feb 6, 2024
29 checks passed
masklinn added a commit that referenced this pull request Feb 6, 2024
New API with full typing
========================

Seems pretty self-explanatory: rather than returning somewhat ad-hoc
dicts, this API works off of dataclasses; it should be compatible with
the legacy version through the magic of ~~buying two of them~~
`dataclasses.asdict`.

Parser API
==========

The legacy version had "parsers" which really represented individual
parsing rules. In the new API the job of a parser is what the
top-level functions did: it wraps the entire job of parsing a
user-agent string.

The core API is just `__call__`, with a selection flag for the domains
("domain" seems like the least bad term for what "user agent", "os",
and "device" are; the alternatives I considered are "component" and
"category", but I'm still ambivalent). Overridable helpers are
provided which match the old API's methods (with PEP8 conventions), as
well as the same style of helpers at the package toplevel.
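A sketch of that shape, with illustrative names (the real flag, result, and helper names may differ):

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional, Protocol

class Domain(Flag):
    USER_AGENT = auto()
    OS = auto()
    DEVICE = auto()
    ALL = USER_AGENT | OS | DEVICE

@dataclass(frozen=True)
class PartialResult:
    domains: Domain  # which domains were actually resolved
    user_agent: Optional[object] = None
    os: Optional[object] = None
    device: Optional[object] = None

class Parser(Protocol):
    def __call__(self, ua: str, domains: Domain, /) -> PartialResult: ...

# An overridable helper in the old API's style is then a one-liner:
def parse_user_agent(parser: Parser, ua: str) -> Optional[object]:
    return parser(ua, Domain.USER_AGENT).user_agent
```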

This resolves a number of limitations:

Concurrency
-----------

While the library should be thread-safe (and I need to find a way to
test that), the ability to instantiate parsers should provide the
opportunity for things like thread-local parsers, or actual
parallelism if we start using native extensions (regex, re2).

It also allows running multiple *parser configurations* concurrently,
including e.g. multiple independent custom yaml sets. Not sure there's
a use for it, but why not?

At the very least it should make using custom YAML datasets much
easier than having to set envvars.

The caching parser being stateful, it's protected by an optional lock,
which seems like the best way to make caching thread-safe. When only
using a single thread, or using thread-local parsers, locking can be
disabled by passing a `contextlib.nullcontext` as the lock.
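Concretely, the lock only needs to be a context manager, which is what makes `contextlib.nullcontext` a valid opt-out; a minimal illustration (names are not the library's API):

```python
import contextlib
import threading

def make_lock(single_threaded: bool):
    # A real lock and a nullcontext are interchangeable as context managers.
    return contextlib.nullcontext() if single_threaded else threading.Lock()

cache: dict = {}
with make_lock(single_threaded=True):  # no-op here, mutual exclusion otherwise
    cache.setdefault("Mozilla/5.0 ...", "cached result")
```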

Customization
-------------

Public APIs are provided both to instantiate and tune parsers, and to
set the global parser. Hopefully this makes evaluating proposed
parsers as well as evaluating & tuning caches (algorithm & size)
easier. Even more so as we should provide some sort of evaluation CLI
in #163.

Caches
------

In the old API, the package-provided cache could only be global, with
a single implementation, as it had to integrate with the toplevel
parsing functions. By reifying the parsing job, a cache is just a
parser which delegates the parse if it doesn't have a hit.

This allows more easily providing, testing, and evolving alternative
cache strategies.
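A sketch of that delegation, assuming a parser is any callable from a user-agent string to a result (illustrative, not the package's actual class):

```python
from typing import Callable, Dict, Generic, TypeVar

R = TypeVar("R")

class CachingParser(Generic[R]):
    """A cache is just a parser: on a hit return it, on a miss delegate and store."""

    def __init__(self, inner: Callable[[str], R], maxsize: int = 200) -> None:
        self.inner = inner
        self.maxsize = maxsize
        self.cache: Dict[str, R] = {}

    def __call__(self, ua: str) -> R:
        if ua in self.cache:
            return self.cache[ua]
        result = self.inner(ua)
        if len(self.cache) < self.maxsize:  # naive policy; LRU etc. would slot in here
            self.cache[ua] = result
        return result

# Usage: wrap any base parser (here a stub returning a constant result).
cached = CachingParser(lambda ua: {"family": "Other"})
cached("Mozilla/5.0 ...")  # miss: delegates and stores
cached("Mozilla/5.0 ...")  # hit: returned from the cache
```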

Bulk APIs
---------

The current parser checks rules (regexes) one at a time against the
input, but there are advanced regex APIs which can check a regex *set*
and return which one(s) matched, allowing much more efficient bulk
matching (e.g. Google's re2, Rust's regex).

With the old scheme, this would have been a pretty significant change
in use / behaviour, bypassing the "parsers" entirely with no
recourse. Under the new parsing scheme, these can just be different
"base" parsers: they can be the default, they can be cached, and users
can instantiate their own parser instead.
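For illustration, the "which one(s) matched" shape with the google-re2 bindings mentioned above (the `re2.Set` names are assumed and should be checked against the actual bindings):

```python
import re2  # google-re2 bindings; Set API assumed, check signatures

s = re2.Set.SearchSet()
idx = [s.Add(p) for p in (r"Android", r"iPhone", r"Windows NT")]
s.Compile()
# A single scan reports which patterns matched, instead of N separate scans.
print(s.Match("Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)"))  # [1]
```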

Misc
----

The new API's UA extractor pipeline supports `patch_minor`, though
that requires excluding that bit from the tests as there are
apparently broken test cases around that
item (ua-parser/uap-core#562).

Init Helpers
============

Having proper parsers is the opportunity to allow setting parsers at
runtime more easily (instead of via load-time envvars); however,
optional constructors (classmethods) turn out to be iffy from both an
API and a typing perspective.

Instead have the "base" parsers (the ones doing the actual parsing of
the UAs) just take a uniform parsed data set, and have utility loaders
provide that from various data sources (precompiled, preformatted, or
data files). This avoids redundancy and the need for mixins /
inheritance, and mypy is *much* happier.
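Sketch of the idea: loaders are plain functions producing one uniform data shape, so base parsers need no alternate constructors (all names illustrative):

```python
from typing import List, Tuple

# One uniform "parsed data set" shape consumed by every base parser.
Matchers = Tuple[List[object], List[object], List[object]]  # (ua, os, device)

def load_builtin() -> Matchers:
    return ([], [], [])  # e.g. from a precompiled module

def load_yaml(path: str) -> Matchers:
    return ([], [], [])  # e.g. parse a custom regexes.yaml into the same shape

data = load_builtin()  # any base parser can be constructed from `data`
```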

Legacy Parsers -> New Matchers
==============================

The bridging of the legacy parsers and the new results turned out to
be pretty mid.

Instead, the new API relies on similar but better-typed matcher
classes with a slightly different API: they return `None` on a match
failure instead of a triplet, which makes them compose better in
iteration (e.g. failures can just be `filter`ed out).
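A sketch of why `None`-on-failure composes well (the matcher shape is illustrative):

```python
import re
from typing import List, Optional

class Matcher:
    def __init__(self, pattern: str, family: str) -> None:
        self.regex = re.compile(pattern)
        self.family = family

    def __call__(self, ua: str) -> Optional[str]:
        # None on failure rather than a sentinel triplet.
        return self.family if self.regex.search(ua) else None

matchers: List[Matcher] = [Matcher(r"Firefox/", "Firefox"), Matcher(r"Chrome/", "Chrome")]
ua = "Mozilla/5.0 ... Chrome/120.0 Safari/537.36"
# Failures simply filter out; the first hit wins.
print(next(filter(None, (m(ua) for m in matchers)), None))  # Chrome
```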

Add a `Matchers` alias (a tuple of lists of matchers) to carry them
around for convenience, as well as to serve as the base parsers' data
parameter.

Also clarify the replacer rules, and hopefully implement the thing
more clearly.

Fixes #93, fixes #142, closes #116
@masklinn masklinn deleted the new_api branch February 6, 2024 19:08
Successfully merging this pull request may close these issues.

Make the API more pythonic (snake_case function names)