Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a Public Inbox Plugin #329

Merged
merged 2 commits into from
Apr 29, 2024
Merged

Create a Public Inbox Plugin #329

merged 2 commits into from
Apr 29, 2024

Conversation

mripard
Copy link
Contributor

@mripard mripard commented Sep 26, 2023

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.

This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?
  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Thanks!

@sandrobonazzola
Copy link
Contributor

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

That's nice

This plugins allows to fetch from those archives the threads that the user started or was involved in and integrate them into the report.

This is a draft PR because there's a couple of things I'm unsure of and would like some feedback on:

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?
  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?

I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.

  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Makes sense to me using the markdown format for adding links.

@psss
Copy link
Owner

psss commented Nov 10, 2023

Public Inbox is a mailing-list archive project notably used by the Linux Foundation to host the lore.kernel.org ML archive.

That's nice

Yeah, thanks for working on this improvement!

  • It's fairly slow at the moment. This is in part due to the fact that for each mail from the user, we fetch the entire thread, and multiple times if the user participated multiple times. I think a local cache could help with that. Is there any policy wrt caching? If so, how should it be done?

So far it seems no did plugin uses caching. I'd say we could start simple: Either create the first proof-of-concept without caching at all or implement simple cache in memory (just store and index all threads which have been already fetched before).

  • Other plugins seem to create unit tests. However, Given that this plugin depends fairly heavily on the archive itself, we would need to mock it somehow, which is pretty involved. Is the testing mandatory? If so, do we need a mock, or is there some other preferred strategy?

I think you can create a test fetching your own emails on a single day on the kernel mailing list; shouldn't overload the mail archive server and should be pretty quick if you choose a day with a single email and no long threads on it.

Agreed: Choosing a narrow since/until interval on a public public-inbox instance should be completely sufficient I'd say.

  • I've deviated a bit from the output the other plugins follow to show the mail thread and a link to it. I'd assume the preferred way to do that now would be to use the markdown formatting? If it's disabled, should we remove the link entirely, or something else?

Makes sense to me using the markdown format for adding links.

Sounds good to me to provide the rich output including links in the markdown format and use something concise (just the subject?) for the text output.

@mripard mripard marked this pull request as ready for review December 13, 2023 14:09
@mripard
Copy link
Contributor Author

mripard commented Dec 13, 2023

I've just updated the branch taking your comments into account:

  • I've added caching, reducing the runtime on my machine, with last week data, from 2 minutes to 17s. It's been mostly about reducing the network requests to a minimum, so hopefully we don't do them twice.
  • I've added markdown support too. I still like the old output format so I kept it, but let me know if it's an issue
  • I've added a couple of unit tests to make sure we get a somewhat sane list of messages.

Copy link
Owner

@psss psss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks much for the changes! Looks very good. I'm proposing just a couple of rather minor adjustments in 339d788. Could you please have a look if the changes are ok? Included is the --verbose mode for showing the full url of the message. I would like to keep this formatting (e.g. one-line per item by default) consistent across all plugins.


return msgs

def __fetch_thread_root(self, msg: Message) -> Message:

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
did/plugins/public_inbox.py Outdated Show resolved Hide resolved
@psss psss added this to the 0.22 milestone Jan 2, 2024
@psss psss self-assigned this Jan 2, 2024
@mripard
Copy link
Contributor Author

mripard commented Jan 4, 2024

Thanks much for the changes! Looks very good. I'm proposing just a couple of rather minor adjustments in 339d788. Could you please have a look if the changes are ok? Included is the --verbose mode for showing the full url of the message. I would like to keep this formatting (e.g. one-line per item by default) consistent across all plugins.

It looks great to me, thanks!

@stefano-garzarella
Copy link

@mripard thanks for this PR. Really useful!

I used this plugin for a while without issues, but I just hit one today while running did --since 2024-01-08 --until 2024-01-14.

My configuration is:

[general]
email = "Stefano Garzarella" <sgarzare@redhat.com>
width = 160
plugins = ~/.did/plugins

[ml]
type = public_inbox
url = https://lore.kernel.org

...

For now I have public_inbox.py as an external plugin in ~/.did/plugins, but it is a link to the file contained in this PR.

This is the error:

$ ~/repos/did/bin/did --since 2024-01-08 --until 2024-01-14 > W02.md
Traceback (most recent call last):
  File "/home/stefano/.did/plugins/public_inbox.py", line 110, in __get_mbox_from_content
    content = gzip.decompress(content)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/gzip.py", line 627, in decompress
    if _read_gzip_header(fp) is None:
       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/gzip.py", line 456, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'No')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stefano/repos/did/bin/did", line 42, in <module>
    did.cli.main()
  File "/usr/lib/python3.12/site-packages/did/cli.py", line 238, in main
    user_stats.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
    self.fetch()
  File "/home/stefano/.did/plugins/public_inbox.py", line 225, in fetch
    self.stats = [
                 ^
  File "/home/stefano/.did/plugins/public_inbox.py", line 190, in get_all_threads
    self.threads_cache[(since, until)] = self.__fetch_all_threads(since, until)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefano/.did/plugins/public_inbox.py", line 185, in __fetch_all_threads
    mbox = self.__get_mbox_from_content(resp.content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefano/.did/plugins/public_inbox.py", line 111, in __get_mbox_from_content
    except BadGzipFile:
           ^^^^^^^^^^^
NameError: name 'BadGzipFile' is not defined

As rapid fix, I applied the following patch, but I'm not sure if it is the right thing to do:

diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index ec8e333..d3df993 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -106,7 +106,10 @@ class PublicInbox(object):
                 item(self._get_message_url(msg), level=2, options=opt)
 
     def __get_mbox_from_content(self, content: bytes) -> mailbox.mbox:
-        content = gzip.decompress(content)
+        try:
+            content = gzip.decompress(content)
+        except gzip.BadGzipFile:
+            return {}
 
         with tempfile.NamedTemporaryFile() as tmp:
             tmp.write(content)

@mripard
Copy link
Contributor Author

mripard commented Jan 29, 2024

@stefano-garzarella Hi, I'm glad it's helpful. I've pushed a new version that should fix it, plus a unit test to cover the failing case (we were ignoring the HTTP request status code and assuming the result was a GZIP archive).

I've also squashed the patch @psss did earlier.

@lukaszachy
Copy link
Collaborator

/packit test

did/plugins/public-inbox.py Fixed Show fixed Hide fixed
@mripard
Copy link
Contributor Author

mripard commented Feb 26, 2024

Hi, is there anything I can help with to get this merged?

@psss
Copy link
Owner

psss commented Mar 7, 2024

Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:

    def __get_thread_root(self, msg: Message) -> Message:
        log.debug("Looking for thread root of message %s" % msg.id())
        if msg.is_thread_root():
            log.debug("Message is thread root already. Returning.")
            return msg
    
        parent_id = msg.parent_id()
        if parent_id not in self.messages_cache:
            root = self.__fetch_thread_root(msg)
            log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
            return root
    
        while True:
            log.debug("Parent is %s" % parent_id)
>           assert parent_id in self.messages_cache
E           AssertionError

../../did/plugins/public-inbox.py:158: AssertionError

Could you please have a look at it?

Also, I'm not sure it's a good idea to call the plugin file public-inbox.py as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?

@mripard
Copy link
Contributor Author

mripard commented Mar 7, 2024

Sorry for late response, just too many things on my plate... I checked the latest version but I get the following error when running the test:

    def __get_thread_root(self, msg: Message) -> Message:
        log.debug("Looking for thread root of message %s" % msg.id())
        if msg.is_thread_root():
            log.debug("Message is thread root already. Returning.")
            return msg
    
        parent_id = msg.parent_id()
        if parent_id not in self.messages_cache:
            root = self.__fetch_thread_root(msg)
            log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
            return root
    
        while True:
            log.debug("Parent is %s" % parent_id)
>           assert parent_id in self.messages_cache
E           AssertionError

../../did/plugins/public-inbox.py:158: AssertionError

Could you please have a look at it?

I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)

Also, I'm not sure it's a good idea to call the plugin file public-inbox.py as it cannot be directly imported because of the dash in the name. What about just substituting the underscore with dash when importing the plugin modules?

I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?

@psss
Copy link
Owner

psss commented Mar 7, 2024

I think I fixed it already but wasn't sure whether I should push it or wait for it to be merged and then send it. I guess I have my answer :)

:)

I don't have an opinion here, but also I'm not sure how to do that in Python. Would you have any pointers?

It would be just slightly modifying the way how plugins are imported. The relevant code should be around the load_components() function:

def load_components(*paths, **kwargs):

@mripard
Copy link
Contributor Author

mripard commented Mar 7, 2024

I just pushed a new version with public_inbox.py, and the bug you mentioned fixed with a unit test to cover the failing condition

@mripard
Copy link
Contributor Author

mripard commented Apr 23, 2024

Is there anything else I can do to help get this merged?

Copy link
Owner

@psss psss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and sorry for the late response. Looks good. Added just a few last nitpicks.

did/plugins/public_inbox.py Outdated Show resolved Hide resolved
did/stats.py Outdated Show resolved Hide resolved
did/plugins/public_inbox.py Outdated Show resolved Hide resolved
tests/plugins/test_public_inbox.py Outdated Show resolved Hide resolved
@mripard
Copy link
Contributor Author

mripard commented Apr 29, 2024

Thanks for the suggestions, I've added your suggestions or reworked the code to address your concerns.

@mripard mripard requested a review from psss April 29, 2024 08:42
Some plugins like public-inbox might use dashes in their name to be more
user-friendly, but that will break python module loading system.

Let's replace all the dashes in the plugin names by underscores.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
@psss
Copy link
Owner

psss commented Apr 29, 2024

Thanks for addressing the comments! Now looks good. I've just modified the first commit summary and removed the prefix, see the recommendations for some more context.

Public Inbox is a mailing-list archive project notably used by the Linux
Foundation to host the lore.kernel.org ML archive.

This plugins allows to fetch from those archives the threads that the
user started or was involved in and integrate them into the report.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
@psss psss merged commit 4d57c64 into psss:main Apr 29, 2024
16 checks passed
@stefano-garzarella
Copy link

@mripard I just received this error today:

did --format markdown --since 2024-05-06 --until 2024-06-12
...
Traceback (most recent call last):
  File "/usr/bin/did", line 42, in <module>
    did.cli.main()
  File "/usr/lib/python3.12/site-packages/did/cli.py", line 239, in main
    user_stats.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 158, in check
    stat.check()
  File "/usr/lib/python3.12/site-packages/did/stats.py", line 78, in check
    self.fetch()
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 231, in fetch
    self.stats = [
                 ^
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 207, in get_all_threads
    root = self.__get_thread_root(msg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/did/plugins/public_inbox.py", line 153, in __get_thread_root
    log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
                                                        ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'id'

Any idea what is wrong?

@mripard
Copy link
Contributor Author

mripard commented Jun 4, 2024

Sorry for the late answer.

I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?

@stefano-garzarella
Copy link

Sorry for the late answer.

I don't have an immediate answer. Could you share your config (or at least the lore part) so I can reproduce it?

Sure, my config contains:

[general]
email = "Stefano Garzarella" <sgarzare@redhat.com>
width = 160

[ml]
type = public-inbox
url = https://lore.kernel.org
...

I had the issue running did --format markdown --since 2024-05-06 --until 2024-06-12.

For now, I'm using the following workaround:

diff --git a/did/plugins/public_inbox.py b/did/plugins/public_inbox.py
index 888f41c..70e1233 100644
--- a/did/plugins/public_inbox.py
+++ b/did/plugins/public_inbox.py
@@ -150,6 +150,9 @@ class PublicInbox(object):
         parent_id = msg.parent_id()
         if parent_id not in self.messages_cache:
             root = self.__fetch_thread_root(msg)
+            if root is None:
+                return None
+
             log.debug("Found root message %s for message %s" % (root.id(), msg.id()))
             return root
 
@@ -164,6 +167,9 @@ class PublicInbox(object):
             parent_id = parent.parent_id()
             if parent_id not in self.messages_cache:
                 root = self.__fetch_thread_root(msg)
+                if root is None:
+                    return None
+
                 log.debug(
                     "Found root message %s for message %s" %
                     (root.id(), msg.id()))
@@ -205,6 +211,9 @@ class PublicInbox(object):
 
             if not msg.is_thread_root():
                 root = self.__get_thread_root(msg)
+                if root is None:
+                    continue
+
                 root_id = root.id()
                 if root_id in found:
                     log.debug("Root message already encountered... Skip.")

@martinezjavier
Copy link

martinezjavier commented Jul 15, 2024

@stefano-garzarella @mripard it seems that issue fixed at mripard#1 happens again. I see that the latest code that landed does not check if root is None before attempting to get root.id().

@stefano-garzarella IMO your code is not a workaround but the proper fix since the problem is AFAIU when the thread is broke by some MUA or the list and there's no root message in the thread to reference anymore.

@stefano-garzarella
Copy link

@stefano-garzarella IMO your code is not a workaround but the proper fix since the problem is AFAIU when the thread is broke by some MUA or the list and there's no root message in the thread to reference anymore.

@martinezjavier I'm not sure if continue is the best thing to do in get_all_threads() if we can't find root.

@mripard
Copy link
Contributor Author

mripard commented Jul 15, 2024

I've spent some more time looking into it. The offending mail you first reported this issue for is: https://lore.kernel.org/all/AS2P194MB2170A39251AC72302E8F6F459ACB2@AS2P194MB2170.EURP194.PROD.OUTLOOK.COM/ and it looks like the culprit is due to root message not being archived by lore, and thus while the message has a correct in-reply-to chain, lore doesn't give us the root message.

I've started to work on a proper fix (that includes fixing the type hints too, since a significant part were wrong / poorly handled), and indeed, I think that yielding the current message directly when we don't have a root makes it a bit nicer than just ignoring it.

@stefano-garzarella
Copy link

I've started to work on a proper fix (that includes fixing the type hints too, since a significant part were wrong / poorly handled), and indeed, I think that yielding the current message directly when we don't have a root makes it a bit nicer than just ignoring it.

yeah, that would be much better! Thanks for working on a proper fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants