Get decoded/unfolded semantically "full" header #122

ThomasLandauer · 2020-08-04T13:49:51Z

I'm done with Date:, From: is next :-)

I have this:

From: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
 <company@example.com>

And I want to get this:

fööbär <company@example.com>

Currently, there's no easy way to get it, right?

echo $message->getHeader('from'); gives:
```
From: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
 <company@example.com>
```
See AbstractHeader::__toString()

$message->getHeader('from')->getRawValue() gives:

=?ISO-8859-1?Q?f=F6=F6b=E4r?=
 <company@example.com>

$message->getHeaderValue('from') gives:
```
company@example.com
```
So all that's left is to use something like ->getAddresses()[0]->getName() and ->getAddresses()[0]->getEmail() and reassemble them manually, right?

php-mime-mail-parser/php-mime-mail-parser has a getHeader() which does exactly what I mean.

So I'm suggesting that besides giving the raw header, or the deconstructed parts, you should offer an "intermediate" function that gives the semantically full header (i.e. full header decoded and unfolded).

The text was updated successfully, but these errors were encountered:

zbateson · 2020-08-04T18:23:44Z

Yes, I think your examples are right -- there's also a convenience method in AddressHeader to get the email/name of the first header, so you could $message->getHeader('from')->getPersonName() or ->getEmail() directly if memory serves me right 😝 but that's besides the point you're making hehe.

Again, not entirely against it -- so would it display kind of like an email client shows email addresses? "Blah user@email, Second user@secondemail"?

Perhaps AddressHeader::getDisplayValue or something?

ThomasLandauer · 2020-08-05T13:41:35Z

Perhaps AddressHeader::getDisplayValue or something?

No, sorry, I didn't point that out clear enough: This is a general header issue, not just about From!

So if the email contains:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=

There should be three functions to get:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
=?ISO-8859-1?Q?f=F6=F6b=E4r?=
fööbär

For the naming, I'm proposing a new system:

All functions should follow the pattern ->getHeader()->getSomething()

->getHeader()->getRawLine() for 1. The word "line" indicates that you'll get the full line, including Subject: . This would replace all occurrences of just ->getHeader() in README.md (cause Header is an object, and relying on __toString() is always "risky"). This would elegantly solve Remove header name from ->getHeader() #123 :-)
->getHeader()->getRawContent() for 2. Same as ->getRawValue(), but I'd rename it for consistency - see below.
->getHeader()->getContent() for 3.

I would drop (i.e. deprecate) ->getHeaderValue() cause it's inconsistent with this naming system. Besides, its current behavior is inconsistent between Subject and From headers.
The current ->getValue() is IMO dangerous/misleading, since for multiple addresses it returns just the first. So this should be dropped completely. ->getHeader()->getContent() should return the entire decoded content (as string). And for each single address (or other part), keep ->getParts()

ThomasLandauer · 2020-08-05T15:34:57Z

Just noticed that RFC 5322 uses the term "body" for what I called "content":

Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF.

So maybe ->getBody() insted of ->getContent()? But this could lead to confusion with the email's body...

zbateson · 2020-08-05T17:36:29Z

I still don't see the value in changing __toString or it's meaning (the "string" representation of the header).

There should be three functions to get:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
=?ISO-8859-1?Q?f=F6=F6b=E4r?=
fööbär

But isn't there? __toString, getRawValue, getValue

I get that there's some inconsistency in getHeaderValue/getValue with addresses, but often for AddressHeader's you might only be interested in the first address, so I made it inconsistent so you didn't have to worry about AddressHeader potentially representing more than one address for a 'from' header for instance. I'm not sure what's worse though, both have their drawbacks imo :(

But yeah, that could've been a mistake. I'll give it some though. Maybe a convenient very explicit "getFirstAddress" "getFirstName" or something would've made more sense.

And yeah, if we're to change it we'd have to deprecate getHeaderValue and getValue and change them to getHeaderContent and getContent (or Body).

ThomasLandauer · 2020-08-05T18:46:34Z

But isn't there? __toString, getRawValue, getValue

No, since for a To header, getValue() only returns (a) the first address, and (b) just the email part of that. This behavior is deceptive IMO, since it gives the false impression that the first address is all there is!

In which situations would somebody only be interested in the first address?

zbateson · 2020-08-05T19:04:23Z

I get what you're saying about AddressHeader being inconsistent with its getValue and acknowledge that. In your example though, Subject, getValue would give result 3.

I can think of quite a few examples where you might only care about a single address... the primary being if you're checking 'From' for an address, is it preferable to go if (count($fromHeader->getAddresses()) > 0 && strcasecmp($from->getAddresses()[0]->getEmail(), 'email@example.com')) or is there value in being able to do if (strcasecmp($from->getValue(), 'email@example.com')?

ThomasLandauer · 2020-08-05T20:11:50Z

Yeah, sure the second is shorter - but how can you know in advance that the address you're looking for must be the first one?

zbateson · 2020-08-05T20:22:42Z

I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.

ThomasLandauer · 2020-08-06T12:52:59Z

In your example though, Subject, getValue would give result 3.

Yeah, sorry, I missed that cause I was misled by the "errors" caused by #130

I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.

I was about to say that, but was afraid you'd come up with some mailing list (or whatever) example ;-)

Anyway: What about To and Cc?
The real point is: If getValue() (or getHeaderValue()) returns the entire content for most headers, it cannot be that it just returns just some part for some headers! That's really deceptive!

So what I'm mainly suggesting is to change the function's behavior. And since this is a BC break, you consequently need to change its name too. And this is a nice opportunity to rethink the entire naming concept.

zbateson · 2020-08-06T15:53:51Z

I get it and am not disagreeing on this one as well. I feel like 'getValue' is so standard that it might be preferred even if there's a problem with it though... and the deprecation and introduction of 'getContent' also seems at least as equally annoying to me as the problem it's trying to fix. Everyone will have to change their calls from getValue to getContent, and they're all already used to the existing functionality.

ThomasLandauer · 2020-08-06T18:59:38Z

Well, I for myself don't care how you handle the transition :-) Semantic versioning probably demands the deprecation since it's a BC break.

BTW: How could you "send" a deprecation warning message at all? And where/how could users see it?

About the very name: The more I think of it, the more I'm torn towards getBody(). Why? Because it's the term the RFC uses! And nobody was ever fired for sticking to the RFC... ;-)

zbateson · 2020-08-06T20:07:17Z

The only deprecation I've done so far I've put up on the main readme (and main page of mail-mime-parser.org). I'd also put it up in the release notes for the specific version it changed in.

ThomasLandauer · 2020-08-06T20:27:10Z

Here's the answer I was looking for: https://stackoverflow.com/a/194135/1668200

zbateson · 2020-08-06T20:32:02Z

Aah, I see what you're getting at -- yeah, I didn't do that. I only added an @deprecated annotation. That should be added to the deprecated functions I have too then 👍 even better in the comment on that one E_USER_DEPRECATED was mentioned as well.

ThomasLandauer · 2020-08-10T09:30:50Z

Two more points about the naming:

Just noticed that $message->getHeaderValue('content-type') only returns text/plain if the full header is Content-Type: text/plain;charset=utf-8. So I think you're using getHeaderValue() for something like "give me the most important info from this header". It is OK to have a convenience function like that. But it shouldn't be called "getXValue" Can't think of a better name (yet), cause I don't know what it gives for other headers ;-)
So are there more headers (besides To, From, and Content-Type) where getHeaderValue() doesn't return everything? If yes, I'd suggest to set up an overview table listing each header and each function and display what it returns. This will make it easier to find better names.
Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?

zbateson · 2020-08-10T15:47:00Z

so I think you're using getHeaderValue() for something like "give me the most important info from this header"

Yeah, actually I think in that case it's different though too... if you consider the type of header Content-Type is, it's a header with optional additional 'parameters', so the value of the header is "text/plain", but it may have additional parameters like 'charset', which has a default value of 'us-ascii'... I think this is also different from my processing of emails which I agree is confusing.

You can see what I'm getting at here: RFC 1341 (I'm sure this must be mentioned in RFC822 and/or derivatives somewhere but I couldn't find it with a quick glance lol... anyway, specifically:

Content-Type header field value is defined as follows:
Content-Type := type "/" subtype *[";" parameter]

Note that the 'parameter' is optional, and can be different depending on type/subtype -- for multipart, it could be 'boundary', for content-type it could be charset, etc... the 'value' is one thing, and the 'parameters' are another. This is true for other parameterized headers too, like 'content-disposition'... it's 'value' could be 'attachment', and it could optionally have a 'filename' parameter for example.

Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?

I think most people expect the comments to be removed in the value, they don't need to worry about that being explicitly stated imo.

ThomasLandauer · 2020-08-11T14:22:38Z

the 'value' is one thing, and the 'parameters' are another

Well, another argument for "getBody()" :-)

I think most people ...

What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".

Comments:
A week ago I didn't even know that there is such a thing :-) The only encounter I had so far was the suffixed timezone name in the Date header (e.g. (GMT)) which sometimes causes problems, see php-mime-mail-parser/php-mime-mail-parser#328
If that's the only occurrence of comments in the wild, there's indeed no need for a getBodyWithoutComments(). Anybody interested in this piece of information could use getRawBody().

BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?

zbateson · 2020-08-11T15:53:05Z

What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".

I'd say you're wrong about most people. Consider these examples:

https://javaee.github.io/javamail/docs/api/javax/mail/Header.html#getValue--

https://swiftmailer.symfony.com/docs/headers.html :

// note that value is one part, parameter is another
$headers->addParameterizedHeader(
  'Header-Name', 'header value',
  ['foo' => 'bar']
  );

The only instance of 'getBody' I've seen is here:

http://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/Field.html#getBody()

And in this case it's like my getRawValue.

BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?

You should be able to do something like this if you want:

$di = new \ZBateson\Container();
$mlpf = $di->getMimeLiteralPartFactory();
$mlpf->newInstance('=?ISO-8859-1?Q?f=F6=F6b=E4r?=');

You could also create an instance of \ZBateson\MailMimeParser\Header\Part\MimeLiteralPart, pass it a \ZBateson\MbWrapper\MbWrapper instance as first parameter, and encoded word as second parameter. My example here though gives you options with dependency injection (in case future readers are doing that).

ThomasLandauer · 2020-08-12T11:28:44Z

Yeah, sure "getBody" is uncommon. But what's your suggestion?? IMO getValue is ruined now, since you sometimes use it indifferently (just give me whatever is in there), and sometimes you use it specifically (value vs. parameter vs. comment). So I'm guessing it's easier to educate people to use a new function, rather than understand that an existing did change.

Besides: Since you follow the RFCs where ever possible, why not take the names from them too?

zbateson · 2020-08-12T15:01:10Z

I never "just give you whatever's in there"... it's never the equivalent of "getRawValue".

The only instance it's a little different is the way it gives the first email rather than 'all emails and names' in a string.

zbateson · 2020-08-12T15:01:58Z

(unless the header doesn't require any processing of course)

zbateson · 2020-08-12T16:29:38Z

How about I deprecate getValue only in AddressHeader, introduce a getCompleteValue (again just for AddressHeader) in a 1.3... then in 2.0 I deprecate getCompleteValue and restore getValue with the functionality of getCompleteValue.

It might affect a few people upgrading from an older version to the latest, but still less than deprecating all of 'getValue', introducing a new method like 'getBody' or 'getContent' which to me definitely implies the whole header instead of the value of it for something like a parameterized header, and also means most people can just be 'aware' of the change without changing every call to 'getValue' to something else.

I'd say I'm between that or living with the inconsistency, and I'll give it some thought.

ThomasLandauer · 2020-08-12T22:59:47Z

(unless the header doesn't require any processing of course)

This is exactly the problem! To me, a header is a header. I don't know which headers need processing...

How about I deprecate getValue only in AddressHeader

And what about Content-Type?

Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page (is this still possible on GitHub?) with a big table: All headers and all your functions; and in each cell, what the function returns. This will give us an overview which cases we need to cover. This table will ultimately move to README, so it's no lost work but a great documentation for the future :-)

And maybe we should also get Vincent on board, cause you said that was the original idea behind https://gitlab.com/mailcare/parser ? Might take a while till we reach a consensus, but on the other hand: If you two agree on a common interface, this is going to be the de-facto standard for any PHP mail parsing project!

zbateson · 2020-08-12T23:04:19Z

This is exactly the problem! To me, a header is a header. I don't know which headers need processing...

No, you misunderstood me...

Subject: this doesn't need any processing

Subject: =?utf-8?Q=but_this_does?=

I'm always processing headers, some look the same regardless though (first example)... in that case there's no difference between the two.

Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page

Yup, I can set it up for you if you want to undertake that 👍

And maybe we should also get Vincent on board, cause you said that was the original idea behind

That's his project... he's not currently doing much with headers, I think he was more interested in using my header parser (when that was being discussed).

zbateson · 2020-08-12T23:10:47Z

You can have a look at what type of Header object is created for what type of header here:

https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/HeaderFactory.php

The default is a "GenericHeader" which filters out comments and processes quoted parts.

ThomasLandauer · 2020-08-14T09:51:20Z

Yup, I can set it up for you if you want to undertake that

OK, please!

ThomasLandauer · 2020-08-14T18:23:49Z

I think most people expect the comments to be removed in the value

Some data I've just stumbled upon: For the Auto-Submitted header (RFC 3834), adding a comment seems to be quite common: 16% of the emails I'm analyzing are having one. And it indeed adds some value (pun!) to the information:

auto-generated (confirmation)
auto-replied (vacation)

zbateson · 2020-08-17T15:41:24Z

The information in that case is still available via getRawValue if you want it that way of course, but parts are accessible separately too via getParts https://mail-mime-parser.org/api/1.2/classes/ZBateson.MailMimeParser.Header.GenericHeader.html --

That returns all the parsed HeaderPart objects, including any CommentPart objects... although having a 'getComments' defined in AbstractHeader might make that easier.

zbateson · 2020-11-11T18:46:45Z

Hi @ThomasLandauer --

I've had a chance to look at this some more and think about it and the behaviour isn't quite the way you described it.

'getValue' actually consistently returns the value of the first part of a header. Consider:

ParameterHeader returns the value of the first part, e.g. 'text/html' for a Content-Type header, not the value of all parts
AddressHeader returns the value of the first part (the first part being an AddressPart or AddressGroupPart, the 'value' of an Address being the email address)
all other headers only contain a single part.

You can see that this is true in that AddressHeader doesn't override 'getValue', and neither does 'ParameterHeader'. The default implementation of getValue does this in AbstractHeader:

if (!empty($this->parts)) {
    return $this->parts[0]->getValue();
}
return null;

This actually does make sense to me and I think it's still my preferred way... it allows a user to not have to handle knowing if an address contains more than one address in some cases like with a From. If the user expects more than one address, they can specifically handle it.

Having said that, it may be useful to have a convenience method to get all the names/addresses in an email field after decoding, that displays them like Thunderbird would or something. I'm a bit hesitant though because it has to make some decisions for the user -- how should I display this email address for example: "My <weird> Name" <email@example.com> (or an address with mime-encoded non-ascii characters?)... I think what makes sense to me is to return the name always quoted, with non-ascii characters in mime encoded parts included in the quotes as well. It may not make sense or be useful to everyone though.

fkoyer · 2022-05-25T21:10:54Z

I know this is an old topic but I am having a similar issue. There seems to be no way to get the "full" decoded and unfolded contents of a header. For example:

Header

This is the header as it appears in the rfc2822 message:

Authentication-Results: cf04.mxguardian.net;
	dkim=pass (2048-bit key)
	header.d=effectwebagency-com.20150623.gappssmtp.com

Desired output

What I would like to retrieve is the full unfolded header without the header name, like so:

cf04.mxguardian.net; dkim=pass (2048-bit key) header.d=effectwebagency-com.20150623.gappssmtp.com

Code

$result = $message->getHeader('Authentication-Results');
echo $result->getValue(),"\n\n";
echo $result->getRawValue(),"\n\n";
echo $result->__toString(),"\n\n";

Output

cf04.mxguardian.net

cf04.mxguardian.net;
        dkim=pass (2048-bit key)
        header.d=effectwebagency-com.20150623.gappssmtp.com

Authentication-Results: cf04.mxguardian.net;
        dkim=pass (2048-bit key)
        header.d=effectwebagency-com.20150623.gappssmtp.com

As you can see, none of these methods returns the complete header on one line. It seems like my best option is to use getRawValue and do the unfolding myself, however, getRawValue doesn't do any MIME decoding. So if I'm working with a header that is MIME encoded then getRawValue doesn't work and I have to use a different workaround.

Can we have a method such as getDecodedAndUnfoldedValue that works with any header?

I'm using version 2.1.0.

Thanks
Kent

zbateson · 2022-05-25T23:17:32Z

Hi @fkoyer,

Looks like you just need a quick regex for it: echo preg_replace('/\s+/', ' ', $result->getRawValue());.

Edit: should really read things properly before responding, sorry.

fkoyer · 2022-06-01T17:58:26Z

Hi @zbateson,

Thanks for the quick response. Yes, I can definitely do the unfolding myself. It would just be handy to have a function that returns the full decoded and unfolded header text so that we don't have to do it every time we access a header. TBH, that's what I expected getValue() to do. It's too late to change that now but another convenience method would be a welcome addition.

Thanks
Kent

zbateson added enhancement help wanted labels Aug 6, 2020

ThomasLandauer mentioned this issue Aug 20, 2020

Remove header name from ->getHeader() #123

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get decoded/unfolded semantically "full" header #122

Get decoded/unfolded semantically "full" header #122

ThomasLandauer commented Aug 4, 2020

zbateson commented Aug 4, 2020 •

edited

Loading

ThomasLandauer commented Aug 5, 2020

ThomasLandauer commented Aug 5, 2020

zbateson commented Aug 5, 2020

ThomasLandauer commented Aug 5, 2020

zbateson commented Aug 5, 2020 •

edited

Loading

ThomasLandauer commented Aug 5, 2020

zbateson commented Aug 5, 2020

ThomasLandauer commented Aug 6, 2020 •

edited

Loading

zbateson commented Aug 6, 2020

ThomasLandauer commented Aug 6, 2020

zbateson commented Aug 6, 2020

ThomasLandauer commented Aug 6, 2020

zbateson commented Aug 6, 2020

ThomasLandauer commented Aug 10, 2020

zbateson commented Aug 10, 2020 •

edited

Loading

ThomasLandauer commented Aug 11, 2020

zbateson commented Aug 11, 2020 •

edited

Loading

ThomasLandauer commented Aug 12, 2020

zbateson commented Aug 12, 2020

zbateson commented Aug 12, 2020

zbateson commented Aug 12, 2020

ThomasLandauer commented Aug 12, 2020

zbateson commented Aug 12, 2020

zbateson commented Aug 12, 2020

ThomasLandauer commented Aug 14, 2020

ThomasLandauer commented Aug 14, 2020

zbateson commented Aug 17, 2020

zbateson commented Nov 11, 2020

fkoyer commented May 25, 2022

zbateson commented May 25, 2022 •

edited

Loading

fkoyer commented Jun 1, 2022

Get decoded/unfolded semantically "full" header #122

Get decoded/unfolded semantically "full" header #122

Comments

ThomasLandauer commented Aug 4, 2020

zbateson commented Aug 4, 2020 • edited Loading

ThomasLandauer commented Aug 5, 2020

ThomasLandauer commented Aug 5, 2020

zbateson commented Aug 5, 2020

ThomasLandauer commented Aug 5, 2020

zbateson commented Aug 5, 2020 • edited Loading

ThomasLandauer commented Aug 5, 2020

zbateson commented Aug 5, 2020

ThomasLandauer commented Aug 6, 2020 • edited Loading

zbateson commented Aug 6, 2020

ThomasLandauer commented Aug 6, 2020

zbateson commented Aug 6, 2020

ThomasLandauer commented Aug 6, 2020

zbateson commented Aug 6, 2020

ThomasLandauer commented Aug 10, 2020

zbateson commented Aug 10, 2020 • edited Loading

ThomasLandauer commented Aug 11, 2020

zbateson commented Aug 11, 2020 • edited Loading

ThomasLandauer commented Aug 12, 2020

zbateson commented Aug 12, 2020

zbateson commented Aug 12, 2020

zbateson commented Aug 12, 2020

ThomasLandauer commented Aug 12, 2020

zbateson commented Aug 12, 2020

zbateson commented Aug 12, 2020

ThomasLandauer commented Aug 14, 2020

ThomasLandauer commented Aug 14, 2020

zbateson commented Aug 17, 2020

zbateson commented Nov 11, 2020

fkoyer commented May 25, 2022

Header

Desired output

Code

Output

zbateson commented May 25, 2022 • edited Loading

fkoyer commented Jun 1, 2022

zbateson commented Aug 4, 2020 •

edited

Loading

zbateson commented Aug 5, 2020 •

edited

Loading

ThomasLandauer commented Aug 6, 2020 •

edited

Loading

zbateson commented Aug 10, 2020 •

edited

Loading

zbateson commented Aug 11, 2020 •

edited

Loading

zbateson commented May 25, 2022 •

edited

Loading