-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alot aborted after tagging three times. #1460
Comments
After more testing just calling :tag killed three times in sequence crashes alot. This happens on any search I have and using any tag. Most likely some version issues with Xapian database but I am unable to debug this. I tried using python3-gdb but I get no symbols or backtrace to see what is going on. Any tips on how to debug this would be appreciated.
|
After more investigation by setting debug messages all over the place I found the crash happens in this line: alot/ui.py#L655. After setting a tag in a search buffer that causes that thread to be removed from the buffer, alot goes like this:
If the thread is not removed from the seach buffer then the crash does not occur. This can be tested by commenting out the searchbuffer.threadlist.remove() call. This is a urwid MainLoop that is outside of my debugging skills. This same program path works fine twice but the third time always aborts. Most probably some race condition or similar going on inside that MainLoop due to the async/await being used. For reference I tried this with urwid 2.0.1 and 2.1.0 versions and it crashes with both. |
Thanks a lot for the debugging effort! It sounds like a problem that I had as well but I was not able to reproduce it reliably.
I have the impression that it has something to do with the underlying notmuch library being called in an unsafe way,
because no python library would simply segfault in this way without any error logs..
I'll try to look into it over the holiday break.
Cheers,
P
Quoting Horacio Sanson (2019-12-12 14:04:24)
… After more investigation by setting debug messages all over the place I found
the crash happens in this line: alot/ui.py#L655.
After setting a tag in a search buffer that causes that thread to be removed
from the buffer, alot goes like this:
1. Call TagCommand apply() method.
2. After applying the tags it invokes FlushCommand with an inner refresh()
function as callback.
3. Since the message tagged no longer matches the search query, the refresh()
method removes it by invoking searchbuffer.threadlist.remove() on the
thread.
4. The refresh() callback finishes by calling ui.update() that crashes
(aborts) when it reaches draw_screen() in the Mainloop object.
If the thread is not removed from the seach buffer then the crash does not
occur. This can be tested by commenting out the searchbuffer.threadlist.remove
() call.
This is a urwid MainLoop that is outside of my debugging skills. This same
program path works fine twice but the third time always aborts. Most probably
some race condition or similar going on inside that MainLoop dot to the async/
await being used.
For reference I tried this with urwid 2.0.1 and 2.1.0 versions and it crashes
with both.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.*
|
I can reproduce this as well, bisecting points at commit e7e0c52 as the culprit. |
I keep hitting the problem as well, The last seemingly relevant lines in the log are:
Let me know what I could do to narrow down the issue, or even if there's a hot fix to try out. HTH. |
So you also see an abrupt segfault of the underlying python interpreter, correct?
I'd *really* like to know what causes this or even how to debug this.
Again, I'm quite sure that it is down to the notmuch bindings...
|
Quoting Patrick Totzke (2020-05-08 21:36:39)
Again, I'm quite sure that it is down to the notmuch bindings...
Most likely. The memory safety of the python binding is highly suspect.
In my local alot fork, this goes away by switching to notmuch2 bindings.
…--
Anton Khirnov
|
@elenril do you have a patch of your fix I could try, please? |
Quoting Frank LENORMAND (2020-05-09 07:49:30)
@elenril do you have a patch of your fix I could try, please?
My branch has diverged from master quite considerably, so testing just
this fix is not possible. Aalso, I am becoming less sure of my previous
comment - this might be a problem in notmuch itself, not the bindings.
As far as I can tell, the crash is an unhandled
Xapian::DatabaseModifiedError, triggered by the database being modified
while a query is being read.
But you can simply revert e7e0c52.
…--
Anton Khirnov
|
Could you today your findings too the notmuch mailing list, @lenormf ? if much rather fix the bindings that reverting to this arcane work around in separate processes. |
I will, if I have information not already reported here. I see that users in this thread are involved in Could it be a better move to have a new |
Quoting Frank LENORMAND (2020-05-09 10:48:33)
Could it be a better move to have a new `alot` branch that uses
`notmuch2` bindings, which allegedly don't have the crash issue? We
would only have to live with the crash for as long as it took to get
`notmuch2` up to par (as far as I understand, they would deprecate the
old ones), and this could bolster improvements and more generally
upstream contributions to `notmuch2`.
I retract my comment from above - the problem seems NOT to be in the
python bindings, so changing those would not fix anything.
I can actually reproduce the same crash (at least I think it's the same
crash) just with the notmuch tool in the following way:
- run a long-running-qury in one terminal, such as
notmuch search --output=threads '*' > /dev/null
- in another terminal, modify the database twice, e.g.
notmuch tag +foo 'id:somemessageinyourdb'; notmuch tag -foo 'id:somemessageinyourdb'
while the first query is running
The first crash will abort. IIUC it is caused by Xapian throwing the
DatabaseModifiedError exception, which is not handled by libnotmuch. The
solution is fixing libnotmuch to handle it. That might require API
changes, since I think there is no way for a threads or messages
iterator to report and error.
The caller (i.e. alot in this case) would also have to re-execute the
query.
…--
Anton Khirnov
|
I ran
With @elenril's post, I think there's now enough information to close this issue and handle the crash on the |
Quoting Frank LENORMAND (2020-05-09 12:45:09)
I ran `gdb` to try to get a more precise stacktrace, this is what I got:
```
#0 0x00007ffff788ace5 in raise () at /usr/lib/libc.so.6
#1 0x00007ffff7874857 in abort () at /usr/lib/libc.so.6
#2 0x00007ffff534a0ff in () at /usr/lib/libtalloc.so.2
#3 0x00007ffff534a0f7 in () at /usr/lib/libtalloc.so.2
#4 0x00007ffff54615ab in notmuch_thread_destroy () at /usr/lib/libnotmuch.so.5
#5 0x00007ffff6183a8d in () at /usr/lib/libffi.so.7
#6 0x00007ffff618301b in () at /usr/lib/libffi.so.7
[Python interpreter calls…]
#184 0x00007ffff7cd3b38 in PyRun_SimpleFileExFlags (fp=0x5555555b7240, filename=<optimized out>, closeit=1, flags=0x7fffffffe498) at Python/pythonrun.c:428
#185 0x00007ffff7e3e7a8 in pymain_run_file (cf=0x7fffffffe498, config=0x55555557de50) at Modules/main.c:381
#186 pymain_run_python (exitcode=0x7fffffffe490) at Modules/main.c:565
#187 Py_RunMain () at Modules/main.c:644
#188 0x00007ffff7e3e899 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:698
#189 0x00007ffff7876023 in __libc_start_main () at /usr/lib/libc.so.6
#190 0x000055555555507e in _start
```
`notmuch_thread_destroy(thread)` simply calls `talloc_free(thread)`, which raises `SIGABRT` (corrupted memory? double free? race?).
With @elenril's post, I think there's now enough information to close this issue and handle the crash on the `notmuch` side.
That seems like a different problem though. The one I am seeing is
triggered in xapian, whereas in your backtrace the abort() comes from
libtalloc, which is used by libnotmuch itself. That actually could be
caused by the bindings managing memory incorrectly.
…--
Anton Khirnov
|
Correct, but in any case this is a |
Quoting Frank LENORMAND (2020-05-09 13:14:41)
Correct, but in any case this is a `notmuch` issue, and doesn't need to be handled here.
Not necessarily. The current python bindings are quite finicky about being
used in exactly the right manner, it could be that alot is not handling
some detail correctly.
I think that was one of the motivations for replacing them with
notmuch2.
…--
Anton Khirnov
|
Fair enough. |
Quoting Frank LENORMAND (2020-05-18 12:05:13)
@elenril it doesn't look like the `Xapian::DatabaseModifiedError` abortion was reported [upstream](https://nmbug.notmuchmail.org/status/#Bugs), have you notified them about it?
Yes, I have discussed it with developers on IRC and intend to send a
patch fixing it.
…--
Anton Khirnov
|
Great to hear, thanks! |
I've been running into the same problem, consistent crashes after a tag command causes a thread to be removed from a search buffer, requiring more threads to be loaded. (It doesn't happen if the complete search results are shown, so one workaround is to keep your inbox under control!) Quoting @elenril:
For me it's also usually an abort() from libtalloc. Quoting @lenormf:
It actually seems to be a use after free. If you set e.g.
Which comes from the
Quoting @elenril:
Well, even if notmuch is fixed so as not to abort, it still needs to be addressed on the alot side, whether you wait for an error to come back from notmuch or proactively cancel any open queries after writing to the db. In any case it breaks the search buffer's lazy loading mechanism... |
Quoting Steven Lawrance (2020-06-18 20:30:47)
I've been running into the same problem, consistent crashes after a tag command causes a thread to be removed from a search buffer, requiring more threads to be loaded. (It doesn't happen if the complete search results are shown, so one workaround is to keep your inbox under control!)
In my experience it also does not happen in the last page: So if you
have a list longer than one screenful, then an option is to work your
way up from the bottom.
|
I'm also hitting this problem very frequently running alot from master. @elenril did you manage to write a patch for notmuch? do you need any help testing it or something? |
Are people still hitting this? I just installed alot from FreeBSD ports and ran into it again, and remembered I had a small patch as a workaround:
This just pre-loads all the thread IDs from a search query, so the search itself can be closed. It does slow things down noticeably when there are a lot (10k+) of results but it's at least usable for me. |
Yes, I still suffer from this issue. |
your patch works well for me, @stlawrance - Thanks! |
This is magic, I don't see why this should work but hey :)
Can you send a PR?
Thanks!
P
|
I no longer experience this issue with release v0.10 (as packaged for Debian). |
no, I was was mistaken (forgot that failure occured when _changing a listing - e.g. deleting entry - 3 times, not simply scrolling 3 pages): Issue still exists for me with release v0.10 :-( |
This is an updated patch which (slows down but) avoids crashing when deleting an item 3 times in a list longer than 1 screen: --- manager.py.orig 2021-09-26 19:40:37.000000000 +0200
+++ manager.py 2021-09-28 10:30:43.505410082 +0200
@@ -335,10 +335,11 @@
"""
assert sort in self._sort_orders
db = Database(path=self.path, mode=Database.MODE.READ_ONLY)
- for t in db.threads(querystring,
+ thread_ids = [t.threadid for t in db.threads(querystring,
sort=self._sort_orders[sort],
- exclude_tags=self.exclude_tags):
- yield t.threadid
+ exclude_tags=self.exclude_tags)]
+ for t in thread_ids:
+ yield t
def add_message(self, path, tags=None, afterwards=None):
""" |
thank you @jonassmedegaard ! I forgot this thread and actually wanted to include the fix in the new release.. |
sorry, but I find Github unethical and only ever use their issue trackers. If more helpful for you that I provide a git fork with the patch applied then I'd be happy to do so - but elsewhere, not on Github. If you will only credit contributions provided as a Github PR then that's your choice - otherwise I am Jonas Smedegaard dr@jones.dk - that and more also provided at https://dr.jones.dk/info. |
OK no worries.
I "credit" people by posting the output of `git shortlog` in the announcement mails, so there'd have to be a commit by you.
But it's no big issue if you prefer not to post commits here.
Thanks anyway,
P
Quoting Jonas Smedegaard (2021-09-28 11:03:35)
… sorry, but I find Github unethical and only ever use their issue trackers.
If more helpful for you that I provide a git fork with the patch applied then
I'd be happy to do so - but elsewhere, not on Github.
If you will only credit contributions provided as a Github PR then that's your
choice - otherwise I am Jonas Smedegaard ***@***.*** - that and more also
provided at https://dr.jones.dk/info.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or unsubscribe.
Triage notifications on the go with GitHub Mobile for iOS or Android. *
|
Sorry I didn't get around to adapting/testing the patch for current master, I'm glad it got there in the end though! Thanks. |
@pazz Thanks for applying the patch (and even mentioning me in commit message - often that's not the case when I don't serve commits on a silver plate). It seems, however, that the patch does not work as intended (I am really sorry - I did test and it did improve things, but later not - weird!). It seems that the later commit a3b6018 makes my alot no longer crash, however. So long story short: please simply revert commit caf87d9 and (unless others disagree) this issue can be closed - where credit goes to Kieran Bingham |
I'm not sure I can see how my commit has an effect on this specific issue, except that it may have also been an issue that you would hit if you use "thread_subject = oldest" in your config, which would have prevented you from moving to the latest master otherwise. That now uses the new notmuch2 bindings, which I suspect might have a lot more changes that could be relevant. |
please disregard my previous post: I now again experienced crash after 3 times tagging a list more than a screenful in size, and re-applying patch caf87d9 made alot work again. Perhaps the scenario triggered the crash is not simply "list of more than a screenful" but involves other factors as well. |
As I said above in this thread - the problem is in libnotmuch, it cannot
be fixed in alot. I intend to send a patch there to handle this.
Also, in my experience it happens a lot less frequently if you upgrade
your notmuch database to glass.
|
Quoting elenril (2021-11-22 07:22:48)
Also, in my experience it happens a lot less frequently if you upgrade
your notmuch database to glass.
I am not aware of this. How do I update notmuchs db?
just by running notmuch compact?
|
Quoting Patrick Totzke (2021-11-23 10:33:04)
I am not aware of this. How do I update notmuchs db?
just by running notmuch compact?
See
https://xapian.org/docs/admin_notes.html#converting-a-chert-database-to-a-glass-database
But I hear dump-recreate-restore may be faster.
…--
Anton Khirnov
|
Quoting elenril (2021-11-23 09:42:51)
Quoting Patrick Totzke (2021-11-23 10:33:04)
>
> I am not aware of this. How do I update notmuchs db?
> just by running notmuch compact?
See
https://xapian.org/docs/admin_notes.html#
converting-a-chert-database-to-a-glass-database
Thanks. Debians xapian-examples actually only comes with the source and I cannot be bothered to find out how to compile this.
But I hear dump-recreate-restore may be faster.
That is notmuch dump?
|
Quoting Patrick Totzke (2021-11-23 11:13:30)
Quoting elenril (2021-11-23 09:42:51)
> Quoting Patrick Totzke (2021-11-23 10:33:04)
> >
> > I am not aware of this. How do I update notmuchs db?
> > just by running notmuch compact?
>
> See
> https://xapian.org/docs/admin_notes.html#
> converting-a-chert-database-to-a-glass-database
Thanks. Debians xapian-examples actually only comes with the source and I cannot be bothered to find out how to compile this.
the binary is shipped in xapian-tools
> But I hear dump-recreate-restore may be faster.
That is notmuch dump?
Yes, like https://notmuchmail.org/howto/#index6h2
…--
Anton Khirnov
|
coolcool, many thanks!
P
|
Does anybody still hit this bug or now any updates? We have three different indicators/ideas how to fix this collected in this thread:
In #1647 it was suggested to revert caf87d9 as it slows down the buffer creation for search queries with many results. So the question is also: can anyone still reproduce this without caf87d9? @elenril you proposed to send a patch upstream, how did that go? |
Quoting Lucas Hoffmann (2024-09-21 08:42:58)
Does anybody still hit this bug or now any updates? We have three different indicators/ideas how to fix this collected in this thread:
1. update your database to glass (#1460 (comment))
2. switch the python bindings from notmuch to notmuch2 which alot did in #1547 (and #1673)
These can only reduce the rate at which this happens, not eliminate it
completely.
3. preload threads into a list instead of using the generator, proposed by @jonassmedegaard and merge into alot in caf87d9
This would fix it, but I'd expect a high performance and responsiveness
cost.
@elenril you proposed to send a patch upstream, how did that go?
It's been several years since my attempt so I don't remember the details
anymore, but it turned out to be harder than I expected. I do intend to
return to this issue eventually though, but can't say when.
…--
Anton Khirnov
|
@elenril - Out of curiosity, do you have anything you can share that helps us understand what the issue is, and what'd be required to fix it (e.g., a mailing list thread)? |
Describe the bug
In search buffer if I run this command three times, alot aborts:
Software Versions
To Reproduce
Steps to reproduce the behaviour:
:untag --no-flush inbox; untag --no-flush spam; tag killed
three timesError Log
Log output below but there does not seem to be any errors in it.
The text was updated successfully, but these errors were encountered: