-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep "lifecycle log" in models #2863
Comments
As I see it, the main events to store with respect to the *2Vec models are:
The most typical model uses would just have about 3-5 such events. Some advanced uses might involve being loaded/expanded/trained/saved multiple times, but still most likely just dozens of events – tiny compared to the overall model. So I'd see it as a plain Python list of events, where each event is just a dict. It might be capped as only the last N events – but a cap is probably not needed. It'd automatically be part of the model serialization. A utility method might dump the events as a series of JSON lines, or something prettier - and a quick glance at that output would both answer the most common questions about a problem situations, and quickly highlight to what extent previous/current gensim versions were involved. |
This might be as simple as a single method on
It might also check that all top-level keys are from some declared/documented list in source code. It'd be easy & encouraged to expand this list, with a short note on the key's interpretation, but error or warn when an unknown top-level key is used (unless we add a convention for private/x-/etc ad hoc keys). Even long-lived models would only accumulate a few KB of data, and reviewing this log (either in user reports or our own older model) would quickly reveal many sources of oddness or error. |
* fix docs * re #2863: record lifecycle events * log events even if lifecycle attribute is turned off * fix overlong lines + use f-strings * bump up internal version + improve docs * record more info for load + save * lifecycle events for KeyedVectors * add lifecycle events to remaining models * ask for lifecycle log in ISSUE_TEMPLATE * improve logging * Update word2vec.py remove unused import * Update CHANGELOG.md Co-authored-by: Michael Penkov <m@penkov.dev>
I realize my review is late - but I think the addition/use of the If there must be a Separately, a short utility accessor that pretty-prints the (most-recent N lines of) log as JSON could be helpful to complete the value & demo-ability of the feature. |
My "original spirit" of the logs was to store strings only. One event = one string. I suppose this highlights a different view to the intended / envisioned purpose of the log:
My implementation does support arbitrary dicts (your proposal), for flexibility. But I guess I filled the dicts primarily with the stuff that I'd like to see in user reports (strings).
It doesn't. It's just a choice I made, in some places (not all). Or I don't understand what you mean by this paragraph. |
I wasn't thinking particularly for machine processing, just clarity to expert users. Free-form narrative log lines can vary a lot in their terminology/grammar/choice-of-what-to-emphasize/units/etc. Historically, plenty of Gensim logging text has been of uneven style/quality - but that of course is not unique to Gensim, and I'm sure I've contributed some vague/under-proof-read output, as well. Reusing prominent exact in-code variable names, treating the whole property dict as a self-describing 'log event', should be more robust against uneven style/language. My hope would be for any major model lifecycle event, from a small controlled-vocabulary of possible events, the event name is all the context necessary. (Anything extra is just more maintenance effort, or risk that the unstructured text drifts from code functionality.) |
Why
When Gensim users report issues with models, we have trouble deciding what they actually did, what version they used, etc. They're often confused themselves.
This leads to protracted communication => increased support effort => slower issue resolution and more annoyance all around. Not good.
What
Keep an internal "lifecycle log" inside each model. This could be as simple as an internal list-of-strings attribute, e.g.
word2vec.lifecycle = ["init() created on 2020-06-20 15:49:03 UTC with Gensim 3.8.3 on Python 3.6", "train() called on 2020-06-20 15:49:04 UTC", "load() called on 2020-06-29 8:7:58 UTC", …]
. We would serialize this log onsave()
, and ask users to provide its value when investigating an issue.We probably want to keep this simple and human-readable, both in the API and log content. Not to over-engineer to start with – not attempting a full model recreation from the log or anything, I can see how this could turn into a rabbit hole.
Originally conceived by @gojomo in #2698 (comment).
The text was updated successfully, but these errors were encountered: