[PATCH] remove note about BOM from 'use utf8' docs #13609
From efimov@reg.ru
Proposed patch attached. Some related discussion: currently BOM does not seem to trigger 'use utf8' behaviour.
=== Platform: Characteristics of this binary (from libperl):
From efimov@reg.ru
0001-Removing-note-about-Byte-Order-Mark-from-utf8-docs.-.patch
From 54bcadfbceca8d8155745f4b0bf258737f5c6117 Mon Sep 17 00:00:00 2001
From: Victor Efimov <efimov@reg.ru>
Date: Tue, 18 Feb 2014 12:43:47 +0400
Subject: [PATCH] Removing note about Byte Order Mark from 'utf8' docs. UTF-8
 BOM does not seem to work as alternative to 'use utf8'.

---
 lib/utf8.pm | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 43c7277..67a57dc 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -57,8 +57,7 @@ script is written in UTF-8.> The utility functions described below are
 directly usable without C<use utf8;>.

 Because it is not possible to reliably tell UTF-8 from native 8 bit
-encodings, you need either a Byte Order Mark at the beginning of your
-source code, or C<use utf8;>, to instruct perl.
+encodings, you need C<use utf8;>, to instruct perl.

 When UTF-8 becomes the standard source format, this pragma will
 effectively become a no-op. For convenience in what follows the term
--
1.7.9.5
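As background for the discussion that follows, here is a minimal sketch of what "'use utf8' behaviour" means for a string literal. It assumes the file holding the code is saved as UTF-8; the literal is chosen for illustration only and does not come from the ticket.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is the
# two bytes C3 A9 on disk (U+00E9 encoded as UTF-8).

# Without the pragma, perl sees two separate bytes.
my $bytes = do { no utf8; length("é") };

# With the pragma (lexically scoped to this block), perl decodes the
# literal into a single character.
my $chars = do { use utf8; length("é") };

print "without use utf8: $bytes\n";   # 2
print "with use utf8:    $chars\n";   # 1
```

The same source bytes yield different string lengths depending on the pragma, which is why the thread treats the question as a behaviour change rather than a cosmetic one.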
From @tonycoz
On Tue Feb 18 01:04:36 2014, efimov@reg.ru wrote:
A UTF-16 BOM is recognized though, and perl then treats the source as Unicode.
Tony
The RT System itself - Status changed from 'new' to 'open'
From @demerphq
On 19 February 2014 01:46, Tony Cook via RT <perlbug-followup@perl.org> wrote:
Personally I think if we support UTF-16 BOM we should support them all.
Yves
From @Tux
On Wed, 19 Feb 2014 11:48:14 +0100, demerphq <demerphq@gmail.com> wrote:
So do I
From @Hugmeir
On Wed, Feb 19, 2014 at 11:48 AM, demerphq <demerphq@gmail.com> wrote:
Agree, but easier said than done :)
UTF-16 is supported by having our own
Some months back I tried replacing all of that mess with just an encoding
[0] Scripts would actually need that turned *off* by default; after the
From efimov@reg.ru
2014-02-19 4:46 GMT+04:00 Tony Cook via RT <perlbug-followup@perl.org>:
Yes, indeed, starting from ~5.12. I wonder whether this should be documented in 'utf8' or not:
From @ikegami
On Wed, Feb 19, 2014 at 5:48 AM, demerphq <demerphq@gmail.com> wrote:
That would break our code. We have UTF-8 files, but we don't use C<< use
From victor@vsespb.ru
On Wed Feb 19 02:48:40 2014, demerphq wrote:
I am opposed to this.
1) 'use utf8' changes how a program behaves. It's not just cosmetic/metadata.
2) What will patches that add or remove a BOM look like? How do diff/git/github/IDEs/other tools display a BOM? Will a patch that adds 'use utf8' make sense if applied to a file with a BOM?
3) People will have to control BOMs - they'll invent Test::BOM, Test::NoBOM etc. to make sure their code is not broken because someone committed a BOM by accident.
4) How do other programming languages control this? I know Ruby uses a special pragma "#encoding". They've chosen not to use a BOM. What about others?
5) TemplateToolkit uses a BOM for a similar purpose. My observation is that it's a PITA. Everytime something breaks I have to
6) There are still people out there who use UTF-8 files without 'use utf8' - they want utf-8 constants without the flag.
7) Security issue? Someone compromises a server, adds a BOM to a file, introduces a security hole, and no one can find it.
The above does not apply to UTF-16, because UTF-8 and UTF-16 are different: UTF-8 is ASCII-compatible and UTF-16 is not. So
a) UTF-16 cannot live without a BOM because it's not ASCII compatible. It needs a BOM anyway. (Although I think that even with a UTF-16 BOM, the BOM should just be ignored and the text parsed as UTF-16, but without 'use utf8' behaviour.)
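To make point 3 concrete, here is a rough sketch of the kind of check a BOM-policing test might perform. The helper names `detect_bom` and `file_bom` are hypothetical and do not come from any existing module; only the BOM byte sequences themselves are standard.

```perl
use strict;
use warnings;

# Hypothetical helper: name the BOM (if any) at the start of a byte string.
# UTF-32LE is tested before UTF-16LE because its BOM also starts with FF FE
# (so UTF-16LE text that begins with a NUL character is ambiguous here).
sub detect_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return;    # no BOM found
}

# Hypothetical helper: inspect the first four bytes of a file in raw mode.
sub file_bom {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    read($fh, my $head, 4);
    close($fh);
    return detect_bom($head // '');
}

my $with_bom    = detect_bom("\xEF\xBB\xBFprint 1;") // 'none';
my $without_bom = detect_bom("print 1;")             // 'none';
print "$with_bom\n";       # UTF-8
print "$without_bom\n";    # none
```

A test suite could call `file_bom` on every source file and fail when the result does not match the project's policy, which is exactly the extra maintenance burden this point warns about.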
From @demerphq
On 19 February 2014 16:23, Eric Brine <ikegami@adaelis.com> wrote:
Do they have BOMs in them?
Yves
From @Hugmeir
On Wed, Feb 19, 2014 at 4:48 PM, demerphq <demerphq@gmail.com> wrote:
Corollary: Why would you ever have a UTF-8 BOM?
From @demerphq
On 19 February 2014 16:35, Victor Efimov via RT
All of the above points apply equally to UTF-8 and UTF-16. If we
Why is this relevant? Why not look at how OS'es handle this? On
Yes, TT is doing it right. This is what BOM's are for.
So then they should not put a BOM on it.
without supporting evidence that this actually happened I will ignore
This doesn't make sense to me.
You could argue that all Unicode files need a BOM otherwise you can't
True, but then that is what I expect from Perl encountering a utf8 bom as well.
IMO that doesn't make sense. The 'use utf8' behavior tells perl that
Yves
From @demerphq
On 19 February 2014 16:57, Brian Fraser <fraserbn@gmail.com> wrote:
As I recall Windows editors normally insert them when writing in utf8.
http://en.wikipedia.org/wiki/Byte_order_mark is pretty good for understanding these issues.
Yves
From @Hugmeir
On Wed, Feb 19, 2014 at 5:02 PM, demerphq <demerphq@gmail.com> wrote:
Hm, I think that the question that I meant to ask was, Why would you ever
From victor@vsespb.ru
On Wed Feb 19 08:01:09 2014, demerphq wrote:
No, I explained how UTF-8 differs from UTF-16. UTF-8 is ascii compatible,
Because Perl is a programming language (interpreter), not an OS or a text editor. So the only thing that is relevant is how other programming languages behave.
No, a BOM is there to distinguish LE from BE. And a BOM for UTF-8 is something non-standard: http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
Again, I explained that there will be a mess just like with Tabs vs Spaces.
What happened? A UTF-8 BOM does not trigger anything in any released version of perl yet, so nothing could have happened.
Agreed, it would be pretty strange to parse UTF-16, convert it to UTF-8 and use it without the utf-8 flag. But allowing file format
From perl5-porters@perl.org
Yves Orton wrote:
Some of mine do, and fit that description perfectly. It happens when
From @demerphq
On 19 February 2014 17:16, Victor Efimov via RT
You keep saying that. And I keep thinking you must mean something
Think about squares and rectangles. Squares are a subset of
UTF-8 is NOT ASCII compatible. You cannot take any arbitrary UTF-8
You can convert ASCII to UTF-8. And indeed ASCII is a subset of UTF-8.
Ok. Well IMO most programming languages don't know anything about
Sigh. I guess we have a different definition of "non-standard".
I don't think that is relevant to a discussion about why we should
So it *is* FUD.
I think we will just have to agree to disagree.
Yves
From perl5-porters@perl.org
Yves Orton wrote:
Backward compatibility.
So my scripts that already have BOMbs will start to behave
From @demerphq
On 19 February 2014 17:26, Father Chrysostomos <sprout@cpan.org> wrote:
Since you are the king of consistency I am a bit surprised. :-)
The core of the issue here is that if you took that file and naively
IMO either we should respect BOM's or we should not. Which encoding is
Yves
From victor@vsespb.ru
On Wed Feb 19 08:29:05 2014, demerphq wrote:
http://www.cl.cam.ac.uk/~mgk25/ucs/man-utf-8.html
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
In our case that means that perl won't compile UTF-16 files with a missing BOM, because perl operators are all in ASCII and UTF-16 is not ASCII compatible. Thus you can't mess with the UTF-16 BOM.
Ruby does. And it does not use a BOM. There should be a success story of a programming language changing program behaviour because of a UTF-8 BOM. Otherwise it's risky. As for TemplateToolkit - I have experience of HTML designers refusing to deal with a BOM. And thus we had no Unicode in our templates (we had UTF-8 without the utf-8 flag). That just made migration to Unicode harder.
ok, not recommended.
it's relevant imho.
It's not FUD. I explained why it can be a security issue. And you've just asked me to provide proof that there _was_ a security problem with a _not-yet-released_ feature? That *is* BS.
Also, here is an example: my $s = <<"END"; print length $s;
It prints 6 with both LF and CRLF line endings: program behaviour is not affected by file format. And now we're going to break this. Any code sent over email would need an attached note: "use without BOM" or "use with BOM".
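The ASCII-compatibility argument in this message can be checked at the byte level. The following is a small sketch using the core Encode module; the keyword `print` is just an illustrative choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $keyword = 'print';

# ASCII text keeps exactly the same bytes when encoded as UTF-8 ...
my $utf8 = encode('UTF-8', $keyword);

# ... but as UTF-16LE every character takes two bytes, one of them NUL.
# ('UTF-16LE' names the byte order explicitly, so no BOM is prepended.)
my $utf16 = encode('UTF-16LE', $keyword);

printf "UTF-8:    %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8;
printf "UTF-16LE: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $utf16;
```

A parser that expects ASCII keywords can read the UTF-8 bytes unchanged, whereas the interleaved NUL bytes of UTF-16 force it to know the encoding before it can recognize a single token.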
From @ikegami
On Wed, Feb 19, 2014 at 10:57 AM, Brian Fraser <fraserbn@gmail.com> wrote:
Yes, so that UltraEdit detects them as UTF-8.
To distinguish UTF-8 files from native files. This is particularly common
From victor@vsespb.ru
On Wed Feb 19 07:35:49 2014, vsespb wrote:
UPD:
1) ruby1.8 - does not support UTF-8 BOM
2) ruby1.9 - does support it (together with the "encoding" pseudo-comment), so you cannot use the wrong encoding for a file, and you cannot change program behaviour by adding or removing a BOM (unlike Perl). (Note that if the file is pure ASCII, you can actually use it with and without a BOM, and the string constant encoding will be different,
3) python 2.7.3 - supports both the "encoding" pseudo-comment and UTF-8 BOM. (Note that I don't know python at all, I might miss something?)
4) python does not support UTF-16
From @Leont
On Wed, Feb 19, 2014 at 11:48 AM, demerphq <demerphq@gmail.com> wrote:
No, we should not.
A BOM was never intended as an encoding marker. It's intended as a byte
Actually, I'm not sure UTF-16 support was a good idea either. Given it
Leon
From perl5-porters@perl.org
Yves Orton:
But I never intend to do that! :-)
Honestly, I don't care how UTF-16 files are treated.
From @tonycoz
On Wed Feb 19 03:56:08 2014, efimov@reg.ru wrote:
Turns out perl recognizes a UTF-8 BOM but simply skips it.
Despite this from perlunicode.pod:
=item C<BOM>-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF16-BE,
the source doesn't appear to be treated as unicode:
tony@mars:.../git/perl2$ hd test.pl
So, except for the above, the behaviour appears to be documented in perlunicode, which is one of the logical places to look for it. I don't think it needs further documentation in perlrun.
As to support for other BOMs - that belongs in a new ticket.
Tony
So, what should be done with this ticket?
Migrated from rt.perl.org#121269 (status was 'open')
Searchable as RT121269$