New caching and formatting functionality #634
Conversation
Force-pushed 151b51b to 9ec8a7d
On the todos:
Do you really mean file-notify? Or is this necessary to avoid the update when running …?
Right, as it stands the cache never gets updated at all, even when calling ….
First glance looks good. When you think it's ready to merge to the other branch, mark it ready for review? I'll test more a bit later.
Or if easier, I suppose you could rebase on main, and I can close the other. Up to you.
Sure, I'll rebase on main so we don't have to do another merge dance in the future.
Except why keep track of the checksums if we need file-notify? With the new cache design, maybe we should turn on file-notify by default?
We don't need file-notify, but it's nice to have. File-notify can also report spurious change events in some cases, so the checksum guards against unnecessary updates. And yes, we can turn on file-notify by default now. But we should probably have a fallback for people who can't use file-notify (maybe they're on a network file system or something?). We'll have to fine-tune the details on this and add appropriate customization options. For example, I can imagine somebody might have a bibliography on a TRAMP connection and they don't want Emacs to try to update it every time they start completion. Even just checking the file size and time could involve a long round-trip they'd rather not have.
Can you also take a look at the new capf and make sure it works? I have a problem testing it. Should just require a slight tweak to work with this.
I'm going to edit your TODO directly, to keep it all in one place. |
I'll put some general notes here.

First, I tried ….

Second, performance: first, not surprisingly, this version of …:

ELISP> (benchmark-run-compiled 100 (citar--ref-completion-table))
(12.98203636 43 6.759782593000001)

Alas, see my comment below on UX snappiness with large libraries: #634 (comment). In that scenario, not caching the completion candidates means it's slower overall.

Aside: maybe we should change the name to ….
citar.el
Outdated
@@ -179,6 +193,25 @@ All functions that match a particular field are run in order."
  :type '(alist :key-type (choice (const t) (repeat string))
          :value-type function))

(defcustom citar-prefilter-entries '(nil . t)
@roshanshariff WDYT of this idea of mine?
This would address #449. Earlier I had closed it because I couldn't figure out a flexible way to do this, but this seems like a good approach, and a practical use for filtering?
I think it's better to implement a Consult-like narrowing system. I looked at its code, and behind the scenes it maintains a list of predicates that are combined into `minibuffer-completion-predicate`. We'd bind some keys to allow predicates to be enabled and disabled on the fly, just like Consult does.

Then we could make `citar-select-ref` activate a certain predicate by default, but you'd always be able to press a key to remove the filter and/or add other filters.
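A rough sketch of how such a predicate stack might work (all names here are hypothetical; this is not existing Citar or Consult code):

```emacs-lisp
(require 'seq)

;; Hypothetical sketch: keep a list of active filter predicates and
;; combine them into one predicate, Consult-style.
(defvar my/citar-active-predicates nil
  "List of predicates currently filtering completion candidates.")

(defun my/citar-combined-predicate (cand)
  "Return non-nil if CAND satisfies every active predicate."
  (seq-every-p (lambda (pred) (funcall pred cand))
               my/citar-active-predicates))

(defun my/citar-toggle-predicate (pred)
  "Enable PRED if it is inactive, otherwise disable it.
Intended to be bound to a key in the minibuffer, so filters can
be switched on and off on the fly."
  (setq my/citar-active-predicates
        (if (memq pred my/citar-active-predicates)
            (delq pred my/citar-active-predicates)
          (cons pred my/citar-active-predicates)))
  (setq minibuffer-completion-predicate #'my/citar-combined-predicate))
```

`citar-select-ref` could then seed `my/citar-active-predicates` with a default filter, which the user removes or augments via the toggle key.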
I agree that would be valuable, without us requiring consult.
But this (really the car) is orthogonal, and is just saying whether to pre-filter with `citar-open-notes` and such, as per #449.
As in, this and that are complementary.
In any case, feel free to remove or adapt this at will.
Responsiveness with large libraries

This approach has the major advantage that the "has-" indicators are always in sync. It also seems a clean architecture, with appropriate separation of concerns. But ….

Try this 6000+ reference/3.2 MB file: http://www-alg.ist.hokudai.ac.jp/~thomas/COLTBIB/colt.bib.gz

There is a noticeable pause, akin to having no cache on my ~1000 item bib. What if we cache the completion candidates too? This approximates that (I don't include the affixation function, but it doesn't make an obvious difference):

ELISP> (setq my/completions (citar--ref-completion-table))
#<hash-table equal 6934/8311 0x475237>
ELISP> (completing-read "test: " my/completions)

The difference is obvious (though not with my own smaller library). I guess we'll need to decide what kind of performance users should expect, with what size libraries? My own view is one shouldn't expect the citar UI to be highly responsive with such large files. But this approach and caching the completions are not mutually exclusive, so we could always add it later if needed?

Context: I'm running these tests on a Lenovo ThinkPad X1 Carbon 7th gen laptop, with an i5-8265U processor and 16 GB of RAM. Not the fastest machine, for sure, but no slouch.
Why?
Thanks for that big file; I will test with it on top of my own ~1000 key bibliography. I'm implementing some low-hanging fruit in terms of improving performance that should help. However, you're right that no matter how hard we try, performance will be a concern for large enough libraries.

I'd prefer not to add another cache, however, with all the accompanying headaches. Instead, I think we should expose a few options that gradually make the results less "live" but improve performance. The idea would be to maintain the current architecture, but have the "preformat" stage do more and more work. So, for example, we currently adapt the completions to the current frame width. We could have a defcustom to specify a fixed width for completion candidates. Then the preformat can produce a single string of that width instead of a list of strings, and there's little work to be done at display time. An even more extreme option would make the preformat include the has:note and has:file markers, and they would only be updated when the bibliography changes or you run ….

We'd have these options disabled by default, but document them in the Readme as "options for large bibliography files". What do you think? If you enabled all these options, I think you'd get pretty much the performance of having a candidates cache (but only a few users will need to go that far).
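The fixed-width option floated above might look something like this; `citar-candidate-width` is a hypothetical name, not an existing defcustom:

```emacs-lisp
(defcustom citar-candidate-width nil
  "Width of completion candidates, or nil to adapt to the frame width.
Setting a fixed integer width lets the preformat stage produce a
single final string per entry, leaving little work for display time.
This trades display flexibility for speed on large bibliographies."
  :group 'citar
  :type '(choice (const :tag "Adapt to frame width" nil)
                 (integer :tag "Fixed width")))
```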
Without file-notify, if the bibliography changes, you will have to wait for it to be parsed the next time you call completion. People might prefer to have Emacs re-parse it automatically (perhaps when it's idle) so there's no wait.
I modified ….
In any case, I do like the idea of incremental iteration (certainly not doing everything in this PR), in response to real user needs, but to have a clean foundation. Do you agree this foundation is, indeed, cleaner, and easier to maintain going forward, than the current code?
Do you imagine two boolean defcustoms? Or one with choices? EDIT: or maybe one defcustom called something like ….
Right. But then it would want to be done when idle (or using the async library, though that would add a dependency).
+1
Yes, I think all the future improvements I can think of would be easier to build upon this new architecture than what we have now. As you say, we should respond to real user needs. If we decide that we need better performance on very large files, I'm not opposed to adding new caches if needed. But this should be a last resort. Sometimes it's easy to avoid improving performance just by adding a cache on top, but the complexity of the system increases rapidly when you do this. And if we do end up adding another cache in the end, it's good if it can be invalidated and reconstructed quickly anyway :)
I'll have to think about this more, and I think the correct design will become apparent as we go. I need to do some profiling to figure out where the bottlenecks currently are so we're not optimizing the wrong thing.
It might be that the defcustom values should be user-oriented. For instance, the current behavior gives constantly updated related-resource indicators, which sacrifices some ultimate performance and responsiveness, while the other optimizations you note sacrifice the former to optimize the latter. EDIT: maybe a defcustom like ….
Actually, all this might be moot. With some optimization (a little hacky for now), I'm able to get it down to about 800 ms to format 7000 entries, with live has:notes and has:files indicators and adaptive column widths. And I think there are some more avenues for improvement. Turns out ~70% of the time was being spent in garbage collection, not in string handling...
This sounds great! Just to add a user perspective: I currently always use a file exported from Zotero with my full library (ca. 6500 entries), so keeping that fast is important for me. The initial parsing always takes 5-10 seconds or so (with JSON), which is of course unavoidable, but if significant, slow work on all entries is done at other stages, I think the suggestion of configuring a more "static" display could be good. I'll gladly help test when you think it is ready enough.
Did you ever test JSON vs BibTeX? My hypothesis is JSON would be faster, but I've never tested. If it is, we might note that in the docs.
I think once Roshan pushes that commit he's working on, it should be ready to test this piece.
That's great. But will it be fast enough for …?

Now that he confirms using a very large library (I wasn't sure how common that was), we should probably make him happy :-)
And the upshot of that? We can tune to reduce the impact?
Force-pushed dcf79a1 to aa05050
I would be inclined to remove the file-notify functionality for now. The code is rather complex, and I can't think of a good way to have it work without locking out the UI. Without it, you get the parsing delay the first time you use a Citar command after the bibliography has changed, rather than every time it changes. In the future I'd like to look into a way of doing the parsing in the background, but that is going to be challenging. A background Emacs process would need to send the parsed data back as text, parsing which is probably just as slow as ….

EDIT: Sorry, I'd missed your recent messages. Needless to say, I agree :)
Bibliography files are now automatically re-parsed if their size or modification time have changed and their hash has changed. Also re-parse if the bibliography file has not changed but `citar--fields-to-parse` returns a different list. There should no longer be any need to manually refresh the cache, since calling any citar command will do that if needed. Therefore, remove the `citar-cache-refresh` command.

Some other small changes:

* Calculate the bibliography file hash only if its size and/or modification time have changed.
* Update the preformatted strings if the format string has changed, even if the bibliography hasn't.
I've pushed a new commit which changes the caching behaviour. Now calling any citar command will refresh any files if and only if needed; see the commit message. I feel this is the right default for almost all users. I've removed the ….

If there is demand, we can add a mode where Citar never updates files after first reading them, unless you manually request it. This could be useful for people who have large, frequently changing bibliographies and want Citar to intentionally use outdated information. But before doing this, I'd like to take a hard look at the performance of parsing and see what the bottlenecks are.

Finally, as a small help for benchmarking, Citar now prints a message whenever it's updating a bibliography, along with the elapsed time. If you want to benchmark the candidate formatting, which is the other performance-sensitive feature, use:

(citar-get-bibliographies) ;; make sure bibliographies are in cache
(benchmark-run-compiled 10 (citar--format-candidates)) ;; Remember to divide the elapsed time by 10
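The re-parsing rule in the commit message (compare the cheap size/mtime attributes first, and compute the hash only when they differ) could be sketched like this; the function and its arguments are illustrative, not the actual citar-cache internals:

```emacs-lisp
(defun my/bibliography-stale-p (file cached-size cached-mtime cached-hash)
  "Return non-nil if FILE should be re-parsed.
CACHED-SIZE, CACHED-MTIME, and CACHED-HASH were recorded when FILE
was last parsed.  The hash is only computed when the cheap
size/mtime comparison indicates a possible change."
  (let* ((attrs (file-attributes file))
         (size (file-attribute-size attrs))
         (mtime (file-attribute-modification-time attrs)))
    (and (not (and (eql size cached-size)
                   (time-equal-p mtime cached-mtime)))
         ;; Attributes changed; confirm with the expensive hash.
         (not (equal cached-hash
                     (with-temp-buffer
                       (insert-file-contents file)
                       (buffer-hash)))))))
```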
Also add convenience functions for adding and removing notes backends.
Force-pushed eb368b8 to 11f0fbe
Force-pushed 11f0fbe to b04dd2b
(defun citar-file-has-notes (&optional _entries)
  "Return predicate testing whether cite key has associated notes."
  ;; REVIEW why this optional arg when not needed?
  (let ((files (citar-file--get-notes-hash)))
    (lambda (key)
      (gethash key files))))
@roshanshariff can you review this little thing on `citar-has-note`?
See my comment here.
I didn't need entries here, and I don't need them in the org-roam one.
Or maybe I'm misunderstanding how this should work?
I'll do a fuller code review later this week when I have some time, but the main reason is for commonality with the file-related functions. There it's useful to have the `entries` argument to check if an entry has the `file` field.

On the other hand, with our current notes backends, it is true that you only need the cite key. But I can imagine adding support in the future for the BibTeX note field, so maybe it's a good idea to leave the API as-is and just ignore the `entries` argument if your backend doesn't need it.
Ok, we can think on it a bit. But if a function needs the entries, it can pull them itself, no? If I understand right, it doesn't seem justified to design for a hypothetical need when we have two concrete backends, neither of which needs it.
Maybe this is a matter of taste to some extent. The code that's calling `has-notes` has a set of entries that it is operating on. To me, it seems cleaner to pass that on to the backend, which can either use it or ignore it. The alternative is for the backend to call back into Citar to get the entries, but this is not trivial. It involves asking the major-mode function for the list of bibliographies, getting them from the cache, merging their entries into another big hash table (which can use a lot of memory for no reason), etc. This is the reason some other API functions accept an optional `:entries` keyword argument, so they can avoid calling `citar-get-entries` if the `entries` hash table is already available.

And finally, it's easier to unit test a function just by passing it a hash table, knowing that its behaviour only depends on the arguments you pass in (rather than having to set up an environment where `citar-get-entries` returns the right thing).

In general, I prefer functions that take arguments for the information they need and return results that depend only on those arguments rather than global state. Of course, global state is unavoidable, but I think it's better for testability, reasoning, etc. if access to it is not scattered throughout the code base but carefully controlled to a few places.

EDIT: Also, if the file has changed on disk between the first time `citar-get-entries` was called by `citar--format-candidates` and when the backends call it again, then they could end up using different sources of information (not to mention parsing the bibliography files twice or more). I know this is not very likely, but even the theoretical possibility of race conditions like this makes the software-engineer part of me very uncomfortable.
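To make the shape of the argument concrete, a backend predicate that receives the entries table and is free to use or ignore it might look like this (all names here are hypothetical):

```emacs-lisp
(defun my/notes-backend-has-notes (&optional entries)
  "Return a predicate testing whether a cite key has notes.
ENTRIES is the hash table the caller is already operating on.  This
backend ignores it, but a backend for, say, a BibTeX note field
could consult it directly instead of calling back into the cache."
  (ignore entries)
  ;; `my/notes-backend--note-keys' is a hypothetical helper returning
  ;; a hash table whose keys are the cite keys that have notes.
  (let ((notes (my/notes-backend--note-keys)))
    (lambda (key) (gethash key notes))))
```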
Ok. In any case, I'll defer to you; just thought I'd raise it.
Sure, and of course I might come around to a different view later. But also see my edit above about race conditions.

In general, accessing the cache is not "free", especially now that it can unpredictably decide to re-parse files. The correct way to use it is to get everything you need from the cache at the beginning of some operation (i.e. call `citar-get-entries` in this case), use that information throughout, and then throw it away only when you're done.
In other news, I fixed at least one obvious bug (that was causing a compilation error noted in the MELPA review) in the capf. Perhaps it works now?
I can't easily test it.
I'll do a more full code review later this week when I have some time ...
I'm not sure if it's useful, but I just added this branch.
You should have write access, if you need/want it.
Force-pushed b04dd2b to a3bee3c
Crossing fingers, and ... merged! I also submitted the ….

There are still a couple of little things; please take a look, including at my question above, and feel free to submit (nice and small) PRs!
Fixes regression introduced in #634.
Updated, and everything seems to work great! Some feedback on real-world performance, on my 6500-item library with native JSON support.

Running:

(benchmark-run-compiled 10 (citar--format-candidates))

Gives: …

Thanks to the updates of org-cite-csl-activate, @andras-simonyi, one parsing step is cut out of my workflow as well. I guess people on older computers may need to take action. Nevertheless, my switch from citing in org-mode with org-zotxt, which calls the really slow Zotero server, to citing with citar feels like a great update. Citing, once the library is parsed, is really quick!
@andersjohansson - we have identified some performance issues in #659, which should be addressed in parsebib and in citar "soon". But it only impacts initial parsing, or when the file changes. |
This pull request continues the work in #628.
Public API
This is a summary of all the Citar public functions.

Bibliography data

- `citar-get-entry`: (KEY)
- `citar-get-entries`: ()
- `citar-get-value`: (FIELD KEY-OR-ENTRY)
- `citar-get-field-with-value`: (FIELDS KEY-OR-ENTRY): Returns (FIELD . VALUE), the first existing field among FIELDS.
- `citar-get-display-value`: (FIELDS KEY-OR-ENTRY)
Associated files, notes, and links

These functions take an optional keyword argument `:entries`, which defaults to calling `(citar-get-entries)`. If this argument is provided, they won't access the cache. They also look for resources through keys found in the `crossref` field.

- `citar-get-files`: (KEY-OR-KEYS)
- `citar-get-links`: (KEY-OR-KEYS)
- `citar-has-files`: ()
- `citar-has-notes`: ()
- `citar-has-links`: ()

The `citar-has-` functions return predicates that take a KEY argument and return whether that key has files/notes/links. The returned functions can be used with the `:filter` argument to `citar-select-ref`:

(citar-select-ref :filter (citar-has-files))
Selecting references

- `citar-select-refs`: (&key multiple filter): Always returns a list of keys, regardless of whether `:multiple` is nil.
- `citar-select-ref`: (&key filter): Always returns a single key.

At-point information

- `citar-key-at-point`
- `citar-citation-at-point`
Commands
...
Formatting
This is in `citar-format.el`. Formatting references is now done in three stages: … (`citar--ref-completion-table`) … Metadata like the "has:notes" and "has:files" tags is looked up and added. The first step is done when the bibliography is parsed, while the second and third are fast enough to do for every completion session.

Behind the scenes, there is new code to parse format strings in the existing format into a new, efficient, list-based format. All the functions now operate on this parsed representation internally. When formatting or pre-formatting entire bibliographies, this parsing only needs to be done once.
Caching
This functionality is in `citar-cache.el`. Instead of buffer-local and global caches, we now maintain a centralized cache which holds records for each bibliography file in use. This cache also keeps track of which buffers are using each bibliography, so that the cached files can be discarded when they are no longer used by any buffer.

The cache also holds "pre-formatted" candidate strings (i.e. the result of step 1 above). These strings do not depend on any transient information like whether each entry has files/notes or the screen width.
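The buffer-tracking idea can be sketched as follows; `my/bib-cache` and the record layout are illustrative, not the actual citar-cache data structures:

```emacs-lisp
(defvar my/bib-cache (make-hash-table :test #'equal)
  "Map from bibliography filename to a (DATA . BUFFERS) record.")

(defun my/bib-cache-release (file buffer)
  "Record that BUFFER no longer uses FILE; evict FILE when unused.
Keeping the set of using buffers in each record lets the cache
drop a file's data as soon as no buffer needs it."
  (let ((record (gethash file my/bib-cache)))
    (when record
      (setcdr record (delq buffer (cdr record)))
      (unless (cdr record)
        (remhash file my/bib-cache)))))
```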
Remaining work
- ~~Allow the cache to be refreshed manually using `citar-refresh`. Currently the cache is never updated after the first time each bibliography is parsed.~~ Now, the bibliography is automatically updated whenever you call a citar command and it has changed on disk. If there is demand, we can add a mode where Citar never updates the bibliography unless explicitly requested.
- ~~Integrate the file-notify functionality so the cache can be updated whenever the underlying bibliography file changes.~~ We are removing the file-notify functionality for now.
- `org-cite-insert`
- `embark-act`, both in minibuffer and buffer (there are bugs in both)