-
Notifications
You must be signed in to change notification settings - Fork 8
Threading Algorithms
##message threading. #© 1997-2002 Jamie Zawinski jwz@jwz.org
#Introduction In this document, I describe what is, in my humble but correct opinion, the best known algorithm for threading messages (that is, grouping messages together in parent/child relationships based on which messages are replies to which others.) This is the threading algorithm that was used in Netscape Mail and News 2.0 and 3.0, and in Grendel.
Sadly, my C implementation of this algorithm is not available, because it was purged during the 4.0 rewrite, and Netscape refused to allow me to free the 3.0 source code.
However, my Java implementation is available in the Grendel source. You can find a descendant of that code on ftp.mozilla.org. Here's the original source release: grendel-1998-09-05.tar.gz; and a later version, ported to more modern Java APIs: grendel-1999-05-14.tar.gz. The threading code is in view/Threader.java. See also IThreadable and TestThreader. (The mailsum code in storage/MailSummaryFile.java and the MIME parser in the mime/ directory may also be of interest.)
This is not the algorithm that Netscape 4.x uses, because this is another area where the 4.0 team screwed the pooch, and instead of just continuing to use the existing working code, replaced it with something that was bloated, slow, buggy, and incorrect. But hey, at least it was in C++ and used databases!
This algorithm is also described in the imapext-thread Internet Draft: Mark Crispin and Kenneth Murchison formalized my description of this algorithm, and propose it as the THREAD extension to the IMAP protocol (the idea being that the IMAP server could give you back a list of messages in a pre-threaded state, so that it wouldn't need to be done on the client side.) If you find my description of this algorithm confusing, perhaps their restating of it will be more to your taste.
I'm told this algorithm is also used in the Evolution and Balsa mail readers. Also, Simon Cozens and Richard Clamp have written a Perl version; Frederik Dietz has written a Ruby version; and Max Ogden has written a JavaScript version. (I've not tested any of these implementations, so I make no claims as to how faithfully they implement it.)
First some background on the headers involved.
#In-Reply-To: The In-Reply-To header was originally defined by RFC 822, the 1982 standard for mail messages. In 2001, its definition was tightened up by RFC 2822.
RFC 822 defined the In-Reply-To header as, basically, a free-text header. The syntax of it allowed it to contain basically any text at all. The following is, literally, a legal RFC 822 In-Reply-To header:
In-Reply-To: thirty-five ham and cheese sandwiches
So you're not guaranteed to be able to parse anything useful out of In-Reply-To if it exists, and even if it contains something that looks like a Message-ID, it might not be (especially since Message-IDs and email addresses have identical syntax.)
However, most of the time, In-Reply-To headers do have something useful in them. Back in 1997, I grepped over a huge number of messages and collected some damned lies, I mean, statistics, on what kind of In-Reply-To headers they contained. The results:
In a survey of 22,950 mail messages with In-Reply-To headers:
18,396 had at least one occurrence of <>-bracketed text.
4,554 had no <>-bracketed text at all (just names and dates.) 714 contained one <>-bracketed addr-spec and no message IDs. 4 contained multiple message IDs. 1 contained one message ID and one <>-bracketed addr-spec.
The most common forms of In-Reply-To seemed to be:
31% NAME's message of TIME ID@HOST 22% ID@HOST 9% ID@HOST from NAME at "TIME" 8% USER's message of TIME ID@HOST 7% USER's message of TIME 6% Your message of "TIME" 17% hundreds of other variants (average 0.4% each?) Of course these numbers are very much dependent on the sample set, which, in this case, was probably skewed toward Unix users, and/or toward people who had been on the net for quite some time (due to the age of the archives I checked.)
However, this seems to indicate that it's not unreasonable to assume that, if there is an In-Reply-To field, then the first <>-bracketed text found therein is the Message-ID of the parent message. It is safe to assume this, that is, so long as you still exhibit reasonable behavior when that assumption turns out to be wrong, which will happen a small-but-not-insignificant portion of the time.
RFC 2822, the successor to RFC 822, updated the definition of In-Reply-To: by the more modern standard, In-Reply-To may contain only message IDs. There will usually be only one, but there could be more than one: these are the IDs of the messages to which this one is a direct reply (the idea being that you might be sending one message in reply to several others.)
#References: The References header was defined by RFC 822 in 1982. It was defined in, effectively, the same way as the In-Reply-To header was defined: which is to say, its definition was pretty useless. (Like In-Reply-To, its definition was also tightened up in 2001 by RFC 2822.)
However, the References header was also defined in 1987 by RFC 1036 (section 2.2.5), the standard for USENET news messages. That definition was much tighter and more useful than the RFC 822 definition: it asserts that this header contain a list of Message-IDs listing the parent, grandparent, great-grandparent, and so on, of this message, oldest first. That is, the direct parent of this message will be the last element of the References header.
It is not guaranteed to contain the entire tree back to the root-most message in the thread: news readers are allowed to truncate it at their discretion, and the manner in which they truncate it (from the front, from the back, or from the middle) is not defined.
Therefore, while there is useful info in the References header, it is not uncommon for multiple messages in the same thread to have seemingly-contradictory References data, so threading code must make an effort to do the right thing in the face of conflicting data.
RFC 2822 updated the mail standard to have the same semantics of References as the news standard, RFC 1036.
In practice, if you ever see a References header in a mail message, it will follow the RFC 1036 (and RFC 2822) definition rather than the RFC 822 definition. Because the References header both contains more information and is easier to parse, many modern mail user agents generate and use the References header in mail instead of (or in addition to) In-Reply-To, and use the USENET semantics when they do so.
You will generally not see In-Reply-To in a news message, but it can occasionally happen, usually as a result of mail/news gateways.
So, any sensible threading software will have the ability to take both In-Reply-To and References headers into account.
Note: RFC 2822 (section 3.6.4) says that a References field should contain the contents of the parent message's References field, followed by the contents of the parent's Message-ID field (in other words, the References field should contain the path through the thread.) However, I've been informed that recent versions of Eudora violate this standard: they put the parent Message-ID in the In-Reply-To header, but do not duplicate it in the References header: instead, the References header contains the grandparent, great-grand-parent, etc.
This implies that to properly reconstruct the thread of a message in the face of this nonstandard behavior, we need to append any In-Reply-To message IDs to References.
##The Algorithm
This algorithm consists of five main steps, and each of those steps is somewhat complicated. However, once you've wrapped your brain around it, it's not really that complicated, considering what it does.
In defense of its complexity, I can say this:
This algorithm is incredibly robust in the face of garbage input, and even in the face of malicious input (you cannot construct a set of inputs that will send this algorithm into a loop, for example.) This algorithm has been field-tested by something on the order of ten million users over the course of six years. It really does work incredibly well. I've never seen it produce results that were anything less than totally reasonable. Well, enough with the disclaimers.
##Definitions:
A Container object is composed of:
- Message message; // (may be null)
- Container parent;
- Container child; // first child
- Container next; // next element in sibling list, or null
- A Message object only has a few fields we are interested in:
- String subject;
- ID message_id; // the ID of this message
- ID *references; // list of IDs of parent messages
The References field is populated from the
References
and/orIn-Reply-To
headers. If both headers exist, take the first thing in the In-Reply-To header that looks like a Message-ID, and append it to the References header.
If there are multiple things in In-Reply-To that look like Message-IDs, only use the first one of them: odds are that the later ones are actually email addresses, not IDs.
These ID objects can be strings, or they can be any other token on which you can do meaningful equality comparisons.
Only two things need to be done with the subject strings: ask whether they begin with Re:
, and compare the non-Re parts for equivalence. So you can get away with interning or otherwise hashing these, too. (This is a very good idea: my code does this so that I can use == instead of strcmp()
inside the loop.)
The ID objects also don't need to be strings, for the same reason. They can be hashes or numeric indexes or anything for which equality comparisons hold, so it's way faster if you can do pointer-equivalence comparisons instead of strcmp()
.
The reason the Container and Message objects are separate is because the Container fields are only needed during the act of threading: you don't need to keep those around, so there's no point in bulking up every Message structure with them.
The id_table is a hash table associating Message-IDs with Containers. An "empty container" is one that doesn't have a message in it, but which shows evidence of having existed. For whatever reason, we don't have that message in our list (maybe it is expired or canceled, maybe it was deleted from the folder, or any of several other reasons.) At presentation-time, these will show up as unselectable "parent" containers, for example, if we have the thread
-- A
|-- B
\-- C
-- D
and we know about messages B and C, but their common parent A does not exist, there will be a placeholder for A, to group them together, and prevent D from seeming to be a sibling of B and C. These "dummy" messages only ever occur at depth 0.
##The Algorithm:
#For each message:
-
If id_table contains an empty Container for this ID:
- Store this message in the Container's message slot.
-
Else:
- Create a new Container object holding this message;
- Index the Container by Message-ID in id_table.
-
For each element in the message's References field:
- Find a Container object for the given Message-ID:
- If there's one in id_table use that;
- Otherwise, make (and index) one with a null Message.
- Link the References field's Containers together in the order implied by the References header.
- If they are already linked, don't change the existing links.
- Do not add a link if adding that link would introduce a loop: that is, before asserting A->B, search down the children of B to see if A is reachable, and also search down the children of A to see if B is reachable. If either is already reachable as a child of the other, don't add the link.
- Find a Container object for the given Message-ID:
-
Set the parent of this message to be the last element in References. Note that this message may have a parent already: this can happen because we saw this ID in a References field, and presumed a parent based on the other entries in that field. Now that we have the actual message, we can be more definitive, so throw away the old parent and use this new one. Find this Container in the parent's children list, and unlink it.
-
Note that this could cause this message to now have no parent, if it has no references field, but some message referred to it as the non-first element of its references. (Which would have been some kind of lie...)
-
Note that at all times, the various "parent" and "child" fields must be kept inter-consistent.
Just say no to databases!