diff --git a/README.md b/README.md index 2fd8612..269ffb2 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,8 @@ into the ASCII character space. The focus will be on truly problematic characters. Older releases and version-specific branches are still available if you need -that functionality. +that functionality. During this transition, the old tables are also available +in `table/legacy/` --- diff --git a/man/detox.1 b/man/detox.1 index 89c8c8e..9f74631 100644 --- a/man/detox.1 +++ b/man/detox.1 @@ -6,7 +6,7 @@ .\" For the full copyright and license information, please view the LICENSE .\" file that was distributed with this source code. .\" -.Dd February 24, 2021 +.Dd March 31, 2024 .Dt DETOX 1 .Os .Sh NAME @@ -32,12 +32,12 @@ .Sh DESCRIPTION The .Nm -utility renames files to make them easier to work with under Unix and Unix-like -operating systems. +utility renames files to make them easier to work with under Linux and other +Unix-like operating systems. It replaces characters that make it hard to type out a filename with dashes and underscores. -It also provides transliteration-based filters, converting ISO 8859-1 or UTF-8 -to ASCII, in part or in whole. +It also provides transcoding-based filters, converting ISO-8859-1 or CP-1252 to +UTF-8. An additional filter unescapes CGI-escaped filenames. .Ss Sequences .Nm @@ -55,8 +55,8 @@ filters. Other examples of pre-configured sequences are .Ar iso8859_1 and -.Ar utf_8 , -which both provide transliteration to ASCII and then finish with the +.Ar iso8859_1-legacy , +which both provide transcoding to UTF-8, and then finish with the .Ar safe and .Ar wipeup @@ -125,16 +125,14 @@ unless .Fl f has been specified, in which case, it is ignored. .It Pa /usr/share/detox/cp1252.tbl -The provided CP-1252 transliteration table. +The provided CP-1252 transcoding table. .It Pa /usr/share/detox/iso8859_1.tbl -The provided ISO 8859-1 transliteration table. +The provided ISO-8859-1 transcoding table. .It Pa /usr/share/detox/safe.tbl The provided safe character translation table. .It Pa /usr/share/detox/unicode.tbl -The provided Unicode transliteration table, used by the UTF-8 filter. -.It Pa /usr/share/detox/unidecode.tbl -An additional Unicode tranlsiteration table, based on -.Xr Text::Unidecode 3pm . +The provided Unicode control character filtering table, used by the UTF-8 +filter. .El .Sh EXAMPLES .Bl -tag -width Fl @@ -151,7 +149,6 @@ showing their filters and options. .El .Sh SEE ALSO .Xr inline-detox 1 , -.Xr Text::Unidecode 3pm , .Xr detox.tbl 5 , .Xr detoxrc 5 , .Xr ascii 7 , @@ -172,7 +169,8 @@ I created to clean up these files. .Pp Version 2.0 stepped back from transliteration out of the box, instead focusing -on ease of use. +on ease of use. Version 3.0 further shifted this, by removing most of the +transliteration from the tables. The primary motivations for this were user-provided feedback, and the fact that many modern Unix-like OSs use UTF-8 as their primary character set. Transliterating from UTF-8 to ASCII in this scenario is lossy and pointless. diff --git a/man/detox.1.pdf b/man/detox.1.pdf index 2074769..d0dc664 100644 Binary files a/man/detox.1.pdf and b/man/detox.1.pdf differ diff --git a/man/detox.tbl.5 b/man/detox.tbl.5 index c293925..efdf62b 100644 --- a/man/detox.tbl.5 +++ b/man/detox.tbl.5 @@ -15,7 +15,7 @@ .Xr detox 1 .Sh OVERVIEW .Cm detox -allows for configuration of how the safe, ISO 8859-1, and UTF-8 (Unicode) +allows for configuration of how the safe, ISO-8859-1, and UTF-8 (Unicode) filters operate. Through text-based translation tables, it is possible to tune how these character sets are interpreted. diff --git a/man/detox.tbl.5.pdf b/man/detox.tbl.5.pdf index ecc70cd..b50afb9 100644 Binary files a/man/detox.tbl.5.pdf and b/man/detox.tbl.5.pdf differ diff --git a/man/detoxrc.5 b/man/detoxrc.5 index 1c4f058..bc7a949 100644 --- a/man/detoxrc.5 +++ b/man/detoxrc.5 @@ -81,8 +81,8 @@ block. .It Cm iso8859_1 ; .It Cm iso8859_1 Bro Cm builtin Qo Ar name Qc ; Brc ; .It Cm iso8859_1 Bro Cm filename Qo Ar /path/to/filename Qc ; Brc ; -This transliterates ISO 8859-1 characters between 0xA0 and 0xFF into lower -ASCII equivalents. +This transcodes ISO-8859-1 characters between 0xA0 and 0xFF into their UTF-8 +equivalents, with a few exceptions. The output is not necessarily safe, and should also be run through the .Ar safe filter. @@ -95,7 +95,7 @@ Under normal circumstances, the filename syntax is not needed. .Cm detox looks in several locations for a file called .Pa iso8859_1.tbl , -which is a set of rules defining how an ISO 8859-1 character should be +which is a set of rules defining how an ISO-8859-1 character should be translated. If .Cm detox @@ -118,8 +118,7 @@ filter. .It Cm utf_8 ; .It Cm utf_8 Bro Cm builtin Qo Ar name Qc ; Brc ; .It Cm utf_8 Bro Cm filename Qo Ar /path/to/filename Qc ; Brc ; -This transliterations Unicode characters, encoded using UTF-8, into lower ASCII -equivalents. +This filters Unicode control characters, encoded using UTF-8. .Pp This operates in a manner similar to .Ar iso8859_1 , @@ -179,22 +178,23 @@ It only works on ASCII characters. .Sh BUILTIN TABLES .Bl -tag -width 0.25i .It cp1252 -A translation table for transliterating CP-1252 characters to ASCII. +A translation table for transcoding CP-1252 characters to UTF-8, with a few +exceptions. This is no longer a common use case, and has been moved to a separate table. .It iso8859_1 -A translation table for transliterating single-byte characters with the high -bit set from ISO 8859-1 to ASCII. +A translation table for transcoding single-byte characters with the high bit +set from ISO-8859-1 to UTF-8. .It safe A replacement table for characters that are hard to work with under Unix and Unix-like OSs. .It unicode -A translation table for transliterating multi-byte characters encoded in UTF-8 -to ASCII. +A translation table for converting multi-byte control characters encoded in +UTF-8 to safe characters. .El .Sh EXAMPLES .Bd -literal .\" START SAMPLE -# transliterate UTF-8 to ASCII (using chained tables), clean up +# filter UTF-8 control characters to ASCII (using chained tables), clean up sequence utf8 { utf_8 { filename "/usr/local/share/detox/custom.tbl"; @@ -212,7 +212,7 @@ sequence utf8 { length 128; }; }; -# decode CGI, transliterate CP-1252 to ASCII, clean up +# decode CGI, transcode CP-1252 to UTF-8, clean up sequence "cgi-cp1252" { uncgi; iso8859_1 { diff --git a/man/detoxrc.5.pdf b/man/detoxrc.5.pdf index 6ed1e37..506c7f4 100644 Binary files a/man/detoxrc.5.pdf and b/man/detoxrc.5.pdf differ diff --git a/man/inline-detox.1 b/man/inline-detox.1 index 9c0f6c0..81b6001 100644 --- a/man/inline-detox.1 +++ b/man/inline-detox.1 @@ -6,7 +6,7 @@ .\" For the full copyright and license information, please view the LICENSE .\" file that was distributed with this source code. .\" -.Dd February 24, 2021 +.Dd March 31, 2024 .Dt INLINE-DETOX 1 .Os .Sh NAME @@ -33,12 +33,12 @@ .Sh DESCRIPTION The .Nm -utility generates new filenames to make them easier to work with under Unix and -Unix-like operating systems. +utility generates new filenames to make them easier to work with under Linux +and other Unix-like operating systems. It replaces characters that make it hard to type out a filename with dashes and underscores. -It also provides transliteration-based filters, converting ISO 8859-1 or UTF-8 -to ASCII, in part or in whole. +It also provides transcoding-based filters, converting ISO-8859-1 or CP-1252 to +UTF-8. An additional filter unescapes CGI-escaped filenames. .Pp .Nm @@ -70,8 +70,8 @@ filters. Other examples of pre-configured sequences are .Ar iso8859_1 and -.Ar utf_8 , -which both provide transliteration to ASCII and then finish with the +.Ar iso8859_1-legacy , +which both provide transcoding to UTF-8, and then finish with the .Ar safe and .Ar wipeup @@ -115,16 +115,14 @@ unless .Fl f has been specified, in which case, it is ignored. .It Pa /usr/share/detox/cp1252.tbl -The provided CP-1252 transliteration table. +The provided CP-1252 transcoding table. .It Pa /usr/share/detox/iso8859_1.tbl -The provided ISO 8859-1 transliteration table. +The provided ISO-8859-1 transcoding table. .It Pa /usr/share/detox/safe.tbl The provided safe character translation table. .It Pa /usr/share/detox/unicode.tbl -The provided Unicode transliteration table, used by the UTF-8 filter. -.It Pa /usr/share/detox/unidecode.tbl -An additional Unicode tranlsiteration table, based on -.Xr Text::Unidecode 3pm . +The provided Unicode control character filtering table, used by the UTF-8 +filter. .El .Sh EXAMPLES .Bl -tag -width Fl @@ -135,7 +133,6 @@ listing any changes and returning the result to the output stream. .El .Sh SEE ALSO .Xr detox 1 , -.Xr Text::Unidecode 3pm , .Xr detox.tbl 5 , .Xr detoxrc 5 , .Xr ascii 7 , @@ -156,7 +153,8 @@ I created to clean up these files. .Pp Version 2.0 stepped back from transliteration out of the box, instead focusing -on ease of use. +on ease of use. Version 3.0 further shifted this, by removing most of the +transliteration from the tables. The primary motivations for this were user-provided feedback, and the fact that many modern Unix-like OSs use UTF-8 as their primary character set. Transliterating from UTF-8 to ASCII in this scenario is lossy and pointless. diff --git a/man/inline-detox.1.pdf b/man/inline-detox.1.pdf index 5510950..ffdc583 100644 Binary files a/man/inline-detox.1.pdf and b/man/inline-detox.1.pdf differ diff --git a/tests/legacy/man-page-example/detoxrc.detoxrc.5 b/tests/legacy/man-page-example/detoxrc.detoxrc.5 index 2254a71..3d563f1 100644 --- a/tests/legacy/man-page-example/detoxrc.detoxrc.5 +++ b/tests/legacy/man-page-example/detoxrc.detoxrc.5 @@ -1,5 +1,5 @@ # START SAMPLE -# transliterate UTF-8 to ASCII (using chained tables), clean up +# filter UTF-8 control characters to ASCII (using chained tables), clean up sequence utf8 { utf_8 { filename "/usr/local/share/detox/custom.tbl"; @@ -17,7 +17,7 @@ sequence utf8 { length 128; }; }; -# decode CGI, transliterate CP-1252 to ASCII, clean up +# decode CGI, transcode CP-1252 to UTF-8, clean up sequence "cgi-cp1252" { uncgi; iso8859_1 {