Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getDateTime() giving wrong date for invalid H (hour) #119

Closed
ThomasLandauer opened this issue Aug 3, 2020 · 23 comments
Closed

getDateTime() giving wrong date for invalid H (hour) #119

ThomasLandauer opened this issue Aug 3, 2020 · 23 comments
Labels

Comments

@ThomasLandauer
Copy link
Contributor

If the email contains this header (notice the invalid hour)

Date: Mon, 17 Sep 2018 28:59:30 +0600

then

$message->getHeader('Date')->getDateTime();

gives "2018-09-24 04:59:30.0 +06:00" (notice that the day "jumped" from 17 to 24).

I just observed this behavior in the wild (on spam emails, I have to admit :-)

The reason is that your DatePart::parseDateToken() uses PHP's \DateTime::createFromFormat – for which I've just filed a PHP bug: https://bugs.php.net/bug.php?id=79926

Just wanted to let you know. Maybe you want to add your own validity check for the hour? IMO it's OK to return null in this case (since it's invalid), or return the next day (which seems to be the expected behavior of PHP). But returning the next week just doesn't make any sense ;-)

@zbateson
Copy link
Owner

zbateson commented Aug 3, 2020

Hi @ThomasLandauer --

That's very interesting, and their response is interesting too -- I would've expected the overflow to be like you expected but I guess it's just down to their parsing.

I think for my part though my take has been to focus on parsing an email rather than validating an email, and that goes for headers too. I've taken the approach of trying to silently ignore problems and give a result, and do my best to support bad implementations (within reason, I can't support every bad implementation some random person wrote for one website for example)... but even supporting an erroneous implementation, I've taken as being preferable to giving nothing or erroring out.

To me this approach makes sense: if I'm using a library for something like email parsing, I rarely care if the email is invalid... if it is I'd rather see what the library can show me rather than tell me nothing. I can't say this is the definite correct way to go about this, or that it makes sense for everyone. This does mean two things though:

  1. I don't really do any error handling, even for extreme cases... and maybe that's wrong
  2. if you are interested in the validity of something you have to do the checking

Also for date headers which is relying on PHP's date parsing for example, if you want the DateTime object I'm not really doing much beyond just parsing the header and passing it to php's parser. This seems to work for the most part, over the years #116 is probably the first report of a potential issue where parsing a date failed -- I wasn't able to find it happening myself, and hadn't heard of it happening otherwise... so I'm not sure how widespread that is, and if it warrants code to handle it.

I'd be interested in hearing your perspective on this approach if you feel differently, for sure 👍

One thing I definitely seem to have failed as is creating a community around this... for what it's worth I'm happy to take other people's comments/work/suggestions for where this goes. I know that's not enough to make a community magically pop up, but finding the time to work on "community" always seemed secondary to me to just working on it, and I haven't had much time for a while to really do much on it in general.

@ThomasLandauer
Copy link
Contributor Author

Since the main (only?) use case of this library is to parse incoming emails, I agree 100% with your approach! Telling me all the errors somebody else has made, doesn't help anybody ;-)
I would even suggest that you include the essence of your above post right upfront in README.md!

Back to the getDateTime() issue:

Answering your question from the other issue:

What system is producing these date formats? Is it just an erroneous system one individual built, or something produced by a commonly-used mail client or software library?

In this case, it was just a poorly programmed spam bot ;-)

But I think your question is incomplete - you should also ask:

How difficult is it to "fix" it?

Why? Looking at Murphy's law, it's quite probable that sooner or later even some "commonly-used mail client" will produce the same errors as the "erroneous system". So fixing today's erroneous system's bugs is the best "preparation" for dealing with tomorrow's commonly-used system's bugs ;-)

So in this case I'd say: It's easy, so let's fix it!

For my (internal) app to investigate this, I used this regexp:

preg_match('/^(\w\w\w,\s+)?\d\d?\s+\w\w\w\s+\d\d\d\d\s+(\d\d)\:/', $date, $matches)
if (((int) $matches[2]) >= 24)

@zbateson
Copy link
Owner

zbateson commented Aug 4, 2020

Excellent, that's encouraging to hear 👍

I would even suggest that you include the essence of your above post right upfront in README.md!

It's what I meant by forgiving in goal number 2 in the readme, namely:

Standards-compliant but forgiving

I could explain what I mean more by those goals somewhere I suppose, hehe.

To be honest I humbly disagree with adding code to support every erroneous implementation, because:

  • it discourages the bad implementation from fixing itself
  • while it's true that anything could go wrong, we can't program for every possible thing that can, it's just impossible, and it's not a goal of this project
  • trying to program for every bad implementation really muddies your code. Imagine I added your fix, then something in case someone added a bad year, plus another line for one person implementation that had good seconds, except sometimes instead of going to the next minute it used '6a', etc... etc... etc... it would get convoluted pretty quickly, and I can 'imagine' any number of bad possibilities, those are infinite.

In addition, each additional regex requires processing as well, although on their own each one may not add much, but given an infinite possibility of issues, we'd eventually end up with a huge number of regexes handling errors only caused by one or two very uncommon systems, slowing down parsing for the 99.999999% of emails that don't have that problem.

I.e., does it make sense to add that regex check to every single email, when so many will not have that problem?

I think Murphy's law in IT should be considered more in terms of your coding style -- if something can go wrong with what you've coded, it eventually will. For instance if I had a condition that for some reason failed to parse emails at 1:15am if your server is set to the GMT+6 timezone (I'm pretty bad at thinking of realistic examples I guess 😝 ), then yes, given that the code you made becomes more widely used and distributed, that may happen.

Therefore, given that, my approach is:

  1. can I identify that the issue is widespread, or used by a large implementation? If so, fix
  2. If not, can I identify if other parsers (e.g. Thunderbird usually since that's what I use as my mail client) handle that situation? If so, fix
  3. If not, leave it out, revisit if 1 or 2 changes

Note that I wouldn't always take these as gospel either, for instance in this case for the dates, even if another parser handles it differently/better, I don't think I'd code that -- php has it's own DateTime parser, and I'd only be interested in number 1, because I'm not interested in parsing dates.

Does that make sense to you too, and if not, what do you propose as an approach to address that, while also addressing the concerns I have? 😁

@ThomasLandauer
Copy link
Contributor Author

it discourages the bad implementation from fixing itself

I see what you mean: feedback leads to improvement. But for email, this is an illusion! Look at the communication flow:
You (=library maintainer) <--> library users (=programmers) <--> Their end-users <--> Their email partners <--> Their software vendors
That's the very reason why all browsers are error tolerant, and don't even enforce the most basic HTML rules...

And see it from another perspective: I you were looking for an email parsing library, how would you proceed? You'd probably take a sample of (real?) emails and run tests on it (that's what I'm doing right now, BTW). But on which emails would you decide which parser is the "best"? On those 99.9% "perfect" emails any parser can deal with? Or on those 0.1% edge cases where all others are failing?

trying to program for every bad implementation really muddies your code.

Yeah, that's certainly a point! 2 lines for 99.9%, then 50 more lines for the remaining 0.1% ;-) But what about separating the code? There could be a "core" parser (which you maintain). And in all cases where your parser gives up (i.e. in this case something like if ( ! $outcome instanceof \DateTime)), pass it to some other class which is an endless series of elseifs where basically anybody from the community can drop in another check. No performance impact and no cluttered code.
I even had an idea for your other point:

I don't really do any error handling, even for extreme cases... and maybe that's wrong

If you'd set up a "framework" for it, those community classes could do something like $this->addParsingError('Invalid hour'), and the user could retrieve all those errors with something like $message->getParserErrors(). This would at least be an attempt to give some feedback (see above) and improve implementations in the long run...

then something in case someone added a bad year, plus another line for one person implementation that had good seconds, except sometimes instead of going to the next minute it used '6a', etc... etc... etc...

Yeah, but the Date header is certainly the most extreme case! For a Subject, there's just not so many things that can go wrong.

@ThomasLandauer
Copy link
Contributor Author

Just another example (spambot too):

Date: Tue, 07 Jan 2020 -3:43:40 -0700

@ThomasLandauer
Copy link
Contributor Author

One more point: Once you do have such "community classes", and there's another error popping up: It's less work to just add another elseif (i.e. just "fix" it for good), than trying to get answers to those really hard questions: Which software is it? How many users does it have? Will they fix it soon?

@zbateson
Copy link
Owner

zbateson commented Aug 4, 2020

haha, sorry, but I remain unconvinced.

I see what you mean: feedback leads to improvement. But for email, this is an illusion! Look at the communication flow:

That's true, there's no direct communication flow... but say you were testing your new outgoing email system which had a silly error in it, you sent it to mail-mime-parser for testing, and it worked, you're seeing the result you expect even though there's a problem, and you don't bother fixing it. You then send out to everyone based on that, but Thunderbird doesn't display your email, and you're baffled -- you tested it in mail-mime-parser.

My library has a much, much, much smaller user-base than say Thunderbird, which is why I have a mix of judgment based on how widespread I perceive a bad implementation to be, and what they decided (if anything) to do about it.

For my library adding these cases:

  1. Eventually breaks my first and primary goal (well-written)
  2. Isn't in itself a goal of my project (to support every bad implementation -- I have no desire to add support for every spambot writing a bad email)

I also disagree that the email you found is 1 in 100 emails... if I had to guess it would probably be in the 1 in 1,000,000,000 or less -- consider there's an estimated 293.6 billion emails going in a single day, spam is 107 billion, and the vast, vast majority of them won't have this specific problem. Not only that, but it may be 1 in 1,000,000,000 emails sent out today for example assuming it's an issue happening right now from some spambot somewhere, but when the maker of that spambot discovers their emails aren't showing up in Outlook/Thunderbird/whatever correctly because of their weird date header, or are easily identified as spam because of it, they'll fix it... I'll still have the code checking though indefinitely, even though it's potentially 0 of 293.6 billion emails going out on a daily basis.

There could be a "core" parser (which you maintain). And in all cases where your parser gives up (i.e. in this case something like if ( ! $outcome instanceof \DateTime)),

That also won't always work, as you've observed with this issue, the $outcome is in fact an instance of DateTime -- I'd still have to add (in this case) date validation code to check if the date is valid according to some standard first, which I'm not sure I care to do (validation is different from parsing in my view), and then on failure, check if we handle a handful of bad cases.

That means the onus of handling this bad case for this one email falls on you as a unique user who wants this very specialized and rare handling, and so you can add that check in your code, where I feel it belongs.

if (preg_match('blah', $themessage->getHeader('date')->getRawValue()) {
   // handle it differently
}

If you'd set up a "framework" for it, those community classes could do something like

I'd not be against this if someone were to write and submit a wonderful PR for it :) it would have to maintain my first goal of being well written (and designed) for inclusion, but apart from that could be done, and would welcome discussion/ideas on how to approach it nicely.

@ThomasLandauer
Copy link
Contributor Author

On this specific Date issue:

That also won't always work, as you've observed with this issue, the $outcome is in fact an instance of DateTime

Oops, you're right - I forgot that in the heat of the moment ;-) But something else just occurred to me: Why do you use ::createFromFormat() at all?? Why don't you just drop the $dateToken into the \DateTime() constructor, as you're doing at https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/Part/DatePart.php#L52 as a last resort anyway? This worked for all my test mails (PHP 7.2), and fails nicely on malformed dates:

DateTime::__construct(): Failed to parse time string (Tue, 07 Jan 2020 -3:43:40 -0700) at position 22 (:): Unexpected character

@zbateson
Copy link
Owner

zbateson commented Aug 4, 2020

Oof, sorry my memory's not that good -- it's possible if you take those out and run the tests you will (hopefully) find some failing test-cases, or otherwise the git log would have to be dug through. I see for instance in comments that one reason is documented there:

// First check as RFC822 which allows only 2-digit years

I think very old versions of outlook used to send emails with 2-digits (again, bad memory so I could be very wrong about the sender, haha).

The final 'attempt' with the constructor is just a catch-all, I'm not sure I had specific cases for it -- vague memory of the opposite not being true though (that I did it with createFromFormat to catch something specific).

@ThomasLandauer
Copy link
Contributor Author

I've never done this before, so I'll rather ask you:
I cloned the project, then did composer install, then:

./vendor/bin/phpunit tests --configuration tests/phpunit.xml

Is this really the easiest way?

Now for the good news:
With this being my parseDateToken():

try {
    $date = new DateTime($dateToken);
} catch (Exception $e) {
    return false;
}
return $date;

there's just one failure:

--- Expected
+++ Actual
@@ @@
-'2014-03-13T15:02:47+0000'
+'0000-03-16T15:02:47+0000'

So DateTime() doesn't fail but sets the year to 0000. If I fix this by deleting $date === false && at https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/Part/DatePart.php#L47, all tests are passing!

PR?

@zbateson
Copy link
Owner

zbateson commented Aug 4, 2020

Is this really the easiest way?

Yup, you got it :) you can shorten --configuration to just -c for convenience if you like, hehe.

Unfortunately that probably just means either I didn't add relevant tests or it could be failing in a specific PHP version... sure a PR would help (which will get travis to test in all versions also). If you have time looking at the commits that would be very helpful, otherwise I shall in due course.

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Hi @ThomasLandauer -- I'm not seeing what you see.

On PHP7.4, removing all calls to parseDateToken gives me 6 errors (unexpected nulls) and 6 failures.

From what I can tell, RFC(2)822 aren't specifically supported by the DateTime constructor: https://www.php.net/manual/en/datetime.formats.compound.php

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

As far as I can tell I have tests covering every line in those functions, commenting any line out results in an error or failure running tests.

ThomasLandauer added a commit to ThomasLandauer/MailMimeParser that referenced this issue Aug 6, 2020
See zbateson#119 (comment)

Quick and dirty, just to see if the tests pass ;-)
@ThomasLandauer
Copy link
Contributor Author

Looks like all tests are passing: #129 (the PR is just a quick proof of concept, no final solution)

For the PHP docs I just filed a bug: https://bugs.php.net/bug.php?id=79939

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Hmmmm, nicely done. Thanks for persisting, and also for the bug report... I guess my bad testing produced bad results, haha.

It may also just be a misunderstanding, perhaps the 'compound formats' page is meant to imply that the 'date' formats, the 'time' formats, and also 'these additional formats' are supported? I dunno -- will be interesting to see what they say.

Having said that, it doesn't look to me like it's even a supported format here though: https://www.php.net/manual/en/datetime.formats.date.php quick glance and I don't see a date starting with a day of week.

I wonder if it's just the locale that's handling it and maybe it won't work for everyone. Anyway, let's see what they say :)

@ThomasLandauer
Copy link
Contributor Author

Answer at the PHP bug:

Because those formats don't need any special handling. They're just regular date strings. Notice how the other RFCs aren't mentioned either?

...which probably means something like: "It's so natural that we support those RFCs that there's no need to mention it somewhere" ;-)

So what's next? Do you want to keep parseDateToken() at all? (So I'd just remove the old, now obsolete, code). Or do you want to remove the function completely?

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Hmmm, that is a weird answer though in a way, you'd think they'd document it... at least one line "RFC date formats are supported" or something.

I'm also wondering if I originally did that for hhvm support. Anyway, based on that I'm happy to drop parseDateToken completely (remove the function, there's no need for it, and keep the existing exceptions... I'm okay with your regex to check if the year is only 2 characters, maybe use createFromFormat for that one case with the RFC822 format?)

@ThomasLandauer
Copy link
Contributor Author

I'm okay with your regex to check if the year is only 2 characters, maybe use createFromFormat for that one case with the RFC822 format?

Which regex? This one? https://github.com/zbateson/mail-mime-parser/pull/129/files#diff-a82427206217fd4cf35fb32cd2a2a051R47
That's your regex, I just deleted the false check, since DateTime() doesn't return false.

Special check for 2-digit year is not necessary - that works out of the box.

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Aah, right that's for the timezone, haha. DateTime() ftw then!

@ThomasLandauer
Copy link
Contributor Author

I'd like to try to finish this, but please help me with Git(Hub)! I forked the repository and then cloned it to my local machine. But I need patch-4 https://github.com/ThomasLandauer/MailMimeParser/tree/patch-4, and git branch just gives me master. Also, my patch-4 is "2 commits behind zbateson:master" - what's the easiest way to update it?

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Haha, for sure --

You should be good to go if you git fetch to update available branches, etc... (a git pull does that too iirc). You should then be able to git checkout patch-4.

You shouldn't need to worry about your commits being off, you should still be able to create a pull request from your patch-4 branch... so long as the files you've changed and the files changed here don't conflict, that should be an easy merge. If they do conflict it means I'd have to see what the differences are first and do a manual merge, but I wouldn't worry about that, that's normal process :)

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Generally I find it easier to keep my changes in separate branches on a fork, that way I can update my 'master' more easily and create new branches off it... you can then add the original repo as a secondary 'remote'... at least this is my favourite way of doing it (there may be better ways, hehe).

git remote add upstream git@github.com:zbateson/mail-mime-parser.git

Then if you want to update it with whatever changes happened 'upstream', you can (on your master branch):

git fetch upstream
git pull upstream master

This will work so long as your local master branch hasn't 'diverged' with your own commits.

Edit: as-in will work without any additional 'merging'... which is fine if it happens, you just might have to resolve conflicts if there are any.

@ThomasLandauer
Copy link
Contributor Author

Thanks - I guess it worked - see #129
The part I was missing was git branch -a and then checkout a previously "invisible" branch :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants