Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alot aborted after tagging three times. #1460

Open
hsanson opened this issue Dec 9, 2019 · 46 comments
Open

Alot aborted after tagging three times. #1460

hsanson opened this issue Dec 9, 2019 · 46 comments
Labels

Comments

@hsanson
Copy link

hsanson commented Dec 9, 2019

Describe the bug
In search buffer if I run this command three times, alot aborts:

untag --no-flush inbox; untag --no-flush spam; tag killed

Software Versions

  • Python version: Python 3.6.9
  • Notmuch version: notmuch 0.26
  • Alot version: 0.8.1 (installed manually from git master)

To Reproduce
Steps to reproduce the behaviour:

  1. open search buffer
  2. run the command :untag --no-flush inbox; untag --no-flush spam; tag killed three times
  3. Alot closes and an "Aborted" message appears in the terminal.

Error Log
Log output below but there does not seem to be any errors in it.

DEBUG:__init__:cmd parms {'flush': False, 'allmessages': False, 'tags': 'spam', 'action': 'remove'}
DEBUG:search:all? False
DEBUG:search:q: (tag:inbox AND NOT tag:killed AND NOT tag:spam AND NOT tag:replied) AND thread:0000000000037292
DEBUG:ui:search command string: " tag killed"
DEBUG:__init__:mode:search got commandline " tag killed"
DEBUG:__init__:ARGS: ['tag', 'killed']
DEBUG:__init__:cmd parms {'flush': 'True', 'allmessages': False, 'tags': 'killed', 'action': 'add'}
DEBUG:search:all? False
DEBUG:search:q: (tag:inbox AND NOT tag:killed AND NOT tag:spam AND NOT tag:replied) AND thread:0000000000037292
DEBUG:manager:write-out item: ('untag', None, '(tag:inbox AND NOT tag:killed AND NOT tag:spam AND NOT tag:replied) AND thread:0000000000037292', ['inbox'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:flush finished
DEBUG:manager:write-out item: ('untag', None, '(tag:inbox AND NOT tag:killed AND NOT tag:spam AND NOT tag:replied) AND thread:0000000000037292', ['spam'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:flush finished
DEBUG:manager:write-out item: ('tag', None, '(tag:inbox AND NOT tag:killed AND NOT tag:spam AND NOT tag:replied) AND thread:0000000000037292', ['killed'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:flush finished
DEBUG:search:remove thread from result list: thread:0000000000037292: PROBLEM: Clustering not runnig on dev-xmpp03
@pazz pazz added the bug label Dec 10, 2019
@hsanson
Copy link
Author

hsanson commented Dec 10, 2019

After more testing just calling :tag killed three times in sequence crashes alot. This happens on any search I have and using any tag.

Most likely some version issues with Xapian database but I am unable to debug this. I tried using python3-gdb but I get no symbols or backtrace to see what is going on. Any tips on how to debug this would be appreciated.

 [~]>  gdb /usr/bin/python3.6-dbg core
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python3.6-dbg...done.

warning: core file may not match specified executable file.
[New LWP 17123]
Core was generated by `/usr/bin/python3 /home/ryujin/.local/bin/alot'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f76cf7dbe97 in ?? ()
(gdb) py-bt
Unable to locate python frame
(gdb) bt
#0  0x00007f76cf7dbe97 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb)

@hsanson
Copy link
Author

hsanson commented Dec 12, 2019

After more investigation by setting debug messages all over the place I found the crash happens in this line: alot/ui.py#L655.

After setting a tag in a search buffer that causes that thread to be removed from the buffer, alot goes like this:

  1. Call TagCommand apply() method.
  2. After applying the tags it invokes FlushCommand with an inner refresh() function as callback.
  3. Since the message tagged no longer matches the search query, the refresh() method removes it by invoking searchbuffer.threadlist.remove() on the thread.
  4. The refresh() callback finishes by calling ui.update() that crashes (aborts) when it reaches draw_screen() in the Mainloop object.

If the thread is not removed from the seach buffer then the crash does not occur. This can be tested by commenting out the searchbuffer.threadlist.remove() call.

This is a urwid MainLoop that is outside of my debugging skills. This same program path works fine twice but the third time always aborts. Most probably some race condition or similar going on inside that MainLoop due to the async/await being used.

For reference I tried this with urwid 2.0.1 and 2.1.0 versions and it crashes with both.

@pazz
Copy link
Owner

pazz commented Dec 12, 2019 via email

@elenril
Copy link
Contributor

elenril commented Dec 20, 2019

I can reproduce this as well, bisecting points at commit e7e0c52 as the culprit.

@lenormf
Copy link
Contributor

lenormf commented May 8, 2020

I keep hitting the problem as well, v0.9.1. It's hard to reproduce reliably, especially when debugging is enabled, but I always hit the issue when toggling a tag (s).

The last seemingly relevant lines in the log are:

DEBUG:globals:flush complete
DEBUG:ui:Got key (['s'], [115])
DEBUG:ui:cmdline: 'toggletags unread'
DEBUG:ui:search command string: "toggletags unread"
DEBUG:__init__:mode:search got commandline "toggletags unread"
DEBUG:__init__:ARGS: ['toggletags', 'unread']
DEBUG:__init__:cmd parms {'flush': 'True', 'tags': 'unread', 'action': 'toggle'}
DEBUG:search:all? False
DEBUG:search:q: (tag:unread) AND thread:0000000000002773
DEBUG:manager:write-out item: ('untag', <function Thread.remove_tags.<locals>.myafterwards at 0x7fa4d15855e0>, 'thread:0000000000002773', ['unread'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:<function Thread.remove_tags.<locals>.myafterwards at 0x7fa4d15855e0>
DEBUG:manager:called callback
DEBUG:manager:flush finished
DEBUG:manager:write-out item: ('tag', <function Thread.add_tags.<locals>.myafterwards at 0x7fa4d1585790>, 'thread:0000000000002773', [])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:<function Thread.add_tags.<locals>.myafterwards at 0x7fa4d1585790>
DEBUG:search:remove thread from result list: thread:0000000000002773: [RSS][hacker-news][11/28] SIGGRAPH 2020 Technical Paper: N-Dimensional Rigid Body Dynamics

Let me know what I could do to narrow down the issue, or even if there's a hot fix to try out.

HTH.

@pazz
Copy link
Owner

pazz commented May 8, 2020 via email

@elenril
Copy link
Contributor

elenril commented May 9, 2020 via email

@lenormf
Copy link
Contributor

lenormf commented May 9, 2020

@elenril do you have a patch of your fix I could try, please?

@elenril
Copy link
Contributor

elenril commented May 9, 2020 via email

@pazz
Copy link
Owner

pazz commented May 9, 2020

Could you today your findings too the notmuch mailing list, @lenormf ? if much rather fix the bindings that reverting to this arcane work around in separate processes.

@lenormf
Copy link
Contributor

lenormf commented May 9, 2020

I will, if I have information not already reported here.

I see that users in this thread are involved in notmuch's CFFI bindings. What is the state of those, are they currently in a good-enough state that they could be used instead of the regular bindings?

Could it be a better move to have a new alot branch that uses notmuch2 bindings, which allegedly don't have the crash issue? We would only have to live with the crash for as long as it took to get notmuch2 up to par (as far as I understand, they would deprecate the old ones), and this could bolster improvements and more generally upstream contributions to notmuch2.

@elenril
Copy link
Contributor

elenril commented May 9, 2020 via email

@lenormf
Copy link
Contributor

lenormf commented May 9, 2020

I ran gdb to try to get a more precise stacktrace, this is what I got:

#0  0x00007ffff788ace5 in raise () at /usr/lib/libc.so.6
#1  0x00007ffff7874857 in abort () at /usr/lib/libc.so.6
#2  0x00007ffff534a0ff in  () at /usr/lib/libtalloc.so.2
#3  0x00007ffff534a0f7 in  () at /usr/lib/libtalloc.so.2
#4  0x00007ffff54615ab in notmuch_thread_destroy () at /usr/lib/libnotmuch.so.5
#5  0x00007ffff6183a8d in  () at /usr/lib/libffi.so.7
#6  0x00007ffff618301b in  () at /usr/lib/libffi.so.7
[Python interpreter calls…]
#184 0x00007ffff7cd3b38 in PyRun_SimpleFileExFlags (fp=0x5555555b7240, filename=<optimized out>, closeit=1, flags=0x7fffffffe498) at Python/pythonrun.c:428
#185 0x00007ffff7e3e7a8 in pymain_run_file (cf=0x7fffffffe498, config=0x55555557de50) at Modules/main.c:381
#186 pymain_run_python (exitcode=0x7fffffffe490) at Modules/main.c:565
#187 Py_RunMain () at Modules/main.c:644
#188 0x00007ffff7e3e899 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:698
#189 0x00007ffff7876023 in __libc_start_main () at /usr/lib/libc.so.6
#190 0x000055555555507e in _start

notmuch_thread_destroy(thread) simply calls talloc_free(thread), which raises SIGABRT (corrupted memory? double free? race?).

With @elenril's post, I think there's now enough information to close this issue and handle the crash on the notmuch side.

@elenril
Copy link
Contributor

elenril commented May 9, 2020 via email

@lenormf
Copy link
Contributor

lenormf commented May 9, 2020

Correct, but in any case this is a notmuch issue, and doesn't need to be handled here.

@elenril
Copy link
Contributor

elenril commented May 9, 2020 via email

@lenormf
Copy link
Contributor

lenormf commented May 9, 2020

Fair enough.

@lenormf
Copy link
Contributor

lenormf commented May 18, 2020

@elenril it doesn't look like the Xapian::DatabaseModifiedError abortion was reported upstream, have you notified them about it?

@elenril
Copy link
Contributor

elenril commented May 18, 2020 via email

@lenormf
Copy link
Contributor

lenormf commented May 18, 2020

Great to hear, thanks!

@stlawrance
Copy link

I've been running into the same problem, consistent crashes after a tag command causes a thread to be removed from a search buffer, requiring more threads to be loaded. (It doesn't happen if the complete search results are shown, so one workaround is to keep your inbox under control!)

Quoting @elenril:

That seems like a different problem though. The one I am seeing is
triggered in xapian, whereas in your backtrace the abort() comes from
libtalloc, which is used by libnotmuch itself. That actually could be
caused by the bindings managing memory incorrectly.

For me it's also usually an abort() from libtalloc.

Quoting @lenormf:

notmuch_thread_destroy(thread) simply calls talloc_free(thread), which raises SIGABRT (corrupted memory? double free? race?).

It actually seems to be a use after free. If you set e.g. TALLOC_FREE_FILL=5, you'll get an out of bounds read a litte earlier:

#0  strlen (str=0x505050505050505 <Address 0x505050505050505 out of bounds>)
    at /usr/src/lib/libc/string/strlen.c:100
#1  0x000000080871e9d8 in z_get (ptr=0x7ffffffb53a0, size=8) at cfield.c:1346
#2  0x0000000808715078 in GetResult (restype=0x80691b440, result=0x7ffffffb53a0, checker=0x0)
    at callproc.c:921
#3  0x0000000808714758 in _ctypes_callproc (pProc=0x809bcd3a0 <notmuch_thread_get_thread_id>, 
    argtuple=0x80da90680, flags=4353, argtypes=0x807e3db50, restype=0x80691b440, checker=0x0)
    at callproc.c:1223

Which comes from the yield t.get_thread_id() in alot's DBManager,get_threads. I'm not sure where that thread object (or its parent threads or query) actually got freed -- it's not coming from the _destroy finalizer calls in the notmuch python bindings. I suspect it's after the db.close() in DBManager.flush. And Database.close does have the warning:

This function closes the notmuch database. From that point
on every method invoked on any object ever derived from
the closed database may cease to function and raise a
NotmuchError.

Quoting @elenril:

Yes, I have discussed it with developers on IRC and intend to send a
patch fixing it.

Well, even if notmuch is fixed so as not to abort, it still needs to be addressed on the alot side, whether you wait for an error to come back from notmuch or proactively cancel any open queries after writing to the db. In any case it breaks the search buffer's lazy loading mechanism...

@jonassmedegaard
Copy link

jonassmedegaard commented Jun 18, 2020 via email

@meskio
Copy link
Contributor

meskio commented Aug 25, 2020

I'm also hitting this problem very frequently running alot from master. @elenril did you manage to write a patch for notmuch? do you need any help testing it or something?

@stlawrance
Copy link

Are people still hitting this? I just installed alot from FreeBSD ports and ran into it again, and remembered I had a small patch as a workaround:

diff --git a/alot/db/manager.py b/alot/db/manager.py
index d95ea332..713d8d73 100644
--- a/alot/db/manager.py
+++ b/alot/db/manager.py
@@ -277,8 +277,9 @@ class DBManager:
         if exclude_tags:
             for tag in exclude_tags:
                 q.exclude_tag(tag)
-        for t in q.search_threads():
-            yield t.get_thread_id()
+        thread_ids = [t.get_thread_id() for t in q.search_threads()]
+        for t in thread_ids:
+             yield t
 
     def query(self, querystring):
         """

This just pre-loads all the thread IDs from a search query, so the search itself can be closed. It does slow things down noticeably when there are a lot (10k+) of results but it's at least usable for me.

@jonassmedegaard
Copy link

Yes, I still suffer from this issue.
Will try your patch - thanks!

@jonassmedegaard
Copy link

your patch works well for me, @stlawrance - Thanks!

@pazz
Copy link
Owner

pazz commented Jul 20, 2021 via email

@jonassmedegaard
Copy link

I no longer experience this issue with release v0.10 (as packaged for Debian).

@jonassmedegaard
Copy link

no, I was was mistaken (forgot that failure occured when _changing a listing - e.g. deleting entry - 3 times, not simply scrolling 3 pages): Issue still exists for me with release v0.10 :-(

@jonassmedegaard
Copy link

This is an updated patch which (slows down but) avoids crashing when deleting an item 3 times in a list longer than 1 screen:

--- manager.py.orig     2021-09-26 19:40:37.000000000 +0200
+++ manager.py  2021-09-28 10:30:43.505410082 +0200
@@ -335,10 +335,11 @@
         """
         assert sort in self._sort_orders
         db = Database(path=self.path, mode=Database.MODE.READ_ONLY)
-        for t in db.threads(querystring,
+        thread_ids = [t.threadid for t in db.threads(querystring,
                             sort=self._sort_orders[sort],
-                            exclude_tags=self.exclude_tags):
-            yield t.threadid
+                            exclude_tags=self.exclude_tags)]
+        for t in thread_ids:
+             yield t

     def add_message(self, path, tags=None, afterwards=None):
         """

@pazz
Copy link
Owner

pazz commented Sep 28, 2021

thank you @jonassmedegaard ! I forgot this thread and actually wanted to include the fix in the new release..
Anyhow, mind sending this as PR? that way you also get some credit :)

@jonassmedegaard
Copy link

sorry, but I find Github unethical and only ever use their issue trackers.

If more helpful for you that I provide a git fork with the patch applied then I'd be happy to do so - but elsewhere, not on Github.

If you will only credit contributions provided as a Github PR then that's your choice - otherwise I am Jonas Smedegaard dr@jones.dk - that and more also provided at https://dr.jones.dk/info.

@pazz
Copy link
Owner

pazz commented Sep 28, 2021 via email

@stlawrance
Copy link

Sorry I didn't get around to adapting/testing the patch for current master, I'm glad it got there in the end though! Thanks.

@jonassmedegaard
Copy link

@pazz Thanks for applying the patch (and even mentioning me in commit message - often that's not the case when I don't serve commits on a silver plate).

It seems, however, that the patch does not work as intended (I am really sorry - I did test and it did improve things, but later not - weird!).

It seems that the later commit a3b6018 makes my alot no longer crash, however.

So long story short: please simply revert commit caf87d9 and (unless others disagree) this issue can be closed - where credit goes to Kieran Bingham

@kbingham
Copy link
Contributor

kbingham commented Nov 20, 2021

I'm not sure I can see how my commit has an effect on this specific issue, except that it may have also been an issue that you would hit if you use "thread_subject = oldest" in your config, which would have prevented you from moving to the latest master otherwise.

That now uses the new notmuch2 bindings, which I suspect might have a lot more changes that could be relevant.

@jonassmedegaard
Copy link

please disregard my previous post: I now again experienced crash after 3 times tagging a list more than a screenful in size, and re-applying patch caf87d9 made alot work again.

Perhaps the scenario triggered the crash is not simply "list of more than a screenful" but involves other factors as well.

@elenril
Copy link
Contributor

elenril commented Nov 22, 2021 via email

@pazz
Copy link
Owner

pazz commented Nov 23, 2021 via email

@elenril
Copy link
Contributor

elenril commented Nov 23, 2021 via email

@pazz
Copy link
Owner

pazz commented Nov 23, 2021 via email

@elenril
Copy link
Contributor

elenril commented Nov 23, 2021 via email

@pazz
Copy link
Owner

pazz commented Nov 23, 2021 via email

@lucc
Copy link
Collaborator

lucc commented Sep 21, 2024

Does anybody still hit this bug or now any updates? We have three different indicators/ideas how to fix this collected in this thread:

  1. update your database to glass (Alot aborted after tagging three times. #1460 (comment))
  2. switch the python bindings from notmuch to notmuch2 which alot did in 0.9.1 notmuch2 #1547 (and notmuch2 copatibility #1673)
  3. preload threads into a list instead of using the generator, proposed by @jonassmedegaard and merge into alot in caf87d9

In #1647 it was suggested to revert caf87d9 as it slows down the buffer creation for search queries with many results. So the question is also: can anyone still reproduce this without caf87d9?

@elenril you proposed to send a patch upstream, how did that go?

@elenril
Copy link
Contributor

elenril commented Sep 29, 2024 via email

@lfos
Copy link
Contributor

lfos commented Nov 7, 2024

@elenril - Out of curiosity, do you have anything you can share that helps us understand what the issue is, and what'd be required to fix it (e.g., a mailing list thread)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants