Utf 16 fixes (#166)
* Use embedded nulls to detect UTF-16 without BOM

Qt's UTF-8 decoder decodes a UTF-16 string that contains only low code points as the expected text with a null character after each real character. That result encodes back to exactly the same bytes, so the reverse conversion check does not catch the mistake.
If there's a BOM, its bytes are not valid UTF-8 and won't decode, so the check catches it. The problem therefore only occurred when there was no BOM.

QStringConverter::encodingForData returns nullopt if there's no BOM to identify the encoding instead of trying to work it out, so it's safer to guess UTF-16 if there's no identified encoding but there were null bytes.
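
The failure mode described in the commit message can be reproduced without Qt. The following is a minimal sketch in standard C++, not code from this commit: `asciiToUtf16Le`, `isSevenBitOnly`, `hasEmbeddedNulls`, and `demoRoundTripProblem` are hypothetical helper names. It relies on the fact that every byte below 0x80 (including 0x00) is a complete one-byte UTF-8 sequence, so data made only of such bytes survives a UTF-8 decode and re-encode unchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: encode ASCII-only text as UTF-16LE,
// optionally prefixed with the little-endian BOM (FF FE).
Bytes asciiToUtf16Le(const std::string& ascii, bool withBom)
{
  Bytes out;
  if (withBom) {
    out.push_back(0xFF);
    out.push_back(0xFE);
  }
  for (unsigned char c : ascii) {
    out.push_back(c);     // low byte carries the character
    out.push_back(0x00);  // high byte is zero for low code points
  }
  return out;
}

// True if every byte is below 0x80. Such data is valid UTF-8 and
// round-trips through a UTF-8 decoder and encoder without change.
bool isSevenBitOnly(const Bytes& data)
{
  for (std::uint8_t b : data) {
    if (b >= 0x80) {
      return false;
    }
  }
  return true;
}

// True if the data contains at least one zero byte.
bool hasEmbeddedNulls(const Bytes& data)
{
  for (std::uint8_t b : data) {
    if (b == 0x00) {
      return true;
    }
  }
  return false;
}

// Usage: without a BOM the round-trip check alone cannot tell the data is
// not UTF-8, but the null bytes give it away. With a BOM, the bytes FF FE
// are not valid UTF-8, so the round-trip check already fails.
void demoRoundTripProblem()
{
  const Bytes noBom = asciiToUtf16Le("abc", false);
  assert(isSevenBitOnly(noBom));
  assert(hasEmbeddedNulls(noBom));

  const Bytes withBom = asciiToUtf16Le("abc", true);
  assert(!isSevenBitOnly(withBom));
}
```

The sketch works on raw bytes, whereas the commit checks the decoded `QString` for null characters. For the low-code-point case the two are equivalent, because each zero byte decodes to one null character.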
AnyOldName3 authored Feb 28, 2025
1 parent 55f6346 commit e80028c
Showing 2 changed files with 31 additions and 9 deletions.
35 changes: 28 additions & 7 deletions src/utility.cpp
@@ -847,7 +847,7 @@ bool shellDeleteQuiet(const QString& fileName, QWidget* dialog)
   return true;
 }
 
-QString readFileText(const QString& fileName, QString* encoding)
+QString readFileText(const QString& fileName, QString* encoding, bool* hadBOM)
 {
 
   QFile textFile(fileName);
@@ -856,26 +856,37 @@ QString readFileText(const QString& fileName, QString* encoding)
   }
 
   QByteArray buffer = textFile.readAll();
-  return decodeTextData(buffer, encoding);
+  return decodeTextData(buffer, encoding, hadBOM);
 }
 
-QString decodeTextData(const QByteArray& fileData, QString* encoding)
+QString decodeTextData(const QByteArray& fileData, QString* encoding, bool* hadBOM)
 {
   QStringConverter::Encoding codec = QStringConverter::Encoding::Utf8;
   QStringEncoder encoder(codec);
-  QStringDecoder decoder(codec);
+  QStringDecoder decoder(codec, QStringConverter::Flag::ConvertInitialBom);
   QString text = decoder.decode(fileData);
 
+  // embedded nulls probably mean it was UTF-16 - they're rare/illegal in text files
+  bool hasEmbeddedNulls = false;
+  for (const auto& character : text) {
+    if (character.isNull()) {
+      hasEmbeddedNulls = true;
+      break;
+    }
+  }
+
   // check reverse conversion. If this was unicode text there can't be data loss
   // this assumes QString doesn't normalize the data in any way so this is a bit unsafe
-  if (encoder.encode(text) != fileData) {
+  if (hasEmbeddedNulls || encoder.encode(text) != fileData) {
     log::debug("conversion failed assuming local encoding");
     auto codecSearch = QStringConverter::encodingForData(fileData);
     if (codecSearch.has_value()) {
       codec = codecSearch.value();
-      decoder = QStringDecoder(codec);
+      decoder = QStringDecoder(codec, QStringConverter::Flag::ConvertInitialBom);
     } else {
-      decoder = QStringDecoder(QStringConverter::Encoding::System);
+      // encodingForData doesn't handle UTF-16 without BOM
+      decoder = QStringDecoder(hasEmbeddedNulls ? QStringConverter::Encoding::Utf16
+                                                : QStringConverter::Encoding::System);
     }
     text = decoder.decode(fileData);
   }
@@ -884,6 +895,16 @@ QString decodeTextData(const QByteArray& fileData, QString* encoding)
     *encoding = QStringConverter::nameForEncoding(codec);
   }
 
+  if (!text.isEmpty() && text.startsWith(QChar::ByteOrderMark)) {
+    text.remove(0, 1);
+
+    if (hadBOM != nullptr) {
+      *hadBOM = true;
+    }
+  } else if (hadBOM != nullptr) {
+    *hadBOM = false;
+  }
+
   return text;
 }
 
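
The fallback order that `decodeTextData` follows after this change can be summarized as a small pure function. This is a sketch under stated assumptions, not code from the commit: the `Encoding` enum and `chooseDecoderEncoding` are hypothetical stand-ins, and the three parameters represent the results of the round-trip check, the null scan, and `QStringConverter::encodingForData` respectively.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the QStringConverter::Encoding values involved.
enum class Encoding { Utf8, Utf16, System };

// Returns the encoding the decoder ends up using:
//  1. UTF-8 if the data round-trips and contains no null characters.
//  2. Otherwise, whatever the detection step identified, if anything.
//  3. Otherwise UTF-16 when nulls were seen, else the system encoding.
Encoding chooseDecoderEncoding(bool roundTripsAsUtf8, bool hasEmbeddedNulls,
                               std::optional<Encoding> detected)
{
  if (roundTripsAsUtf8 && !hasEmbeddedNulls) {
    return Encoding::Utf8;
  }
  if (detected.has_value()) {
    return *detected;
  }
  return hasEmbeddedNulls ? Encoding::Utf16 : Encoding::System;
}

// Usage: BOM-less UTF-16 with low code points round-trips as UTF-8,
// is not identified by detection, and is now routed to UTF-16.
void demoChooseDecoderEncoding()
{
  assert(chooseDecoderEncoding(true, true, std::nullopt) == Encoding::Utf16);
}
```

The sketch returns the encoding used for decoding. In the diff, `codec` is only reassigned when `encodingForData` succeeds, so the name reported through `encoding` is not necessarily the same value.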
5 changes: 3 additions & 2 deletions src/utility.h
@@ -452,7 +452,8 @@ QDLLEXPORT QString getStartMenuDirectory();
  *the encoding used
  * @return the textual content of the file or an empty string if the file doesn't exist
  **/
-QDLLEXPORT QString readFileText(const QString& fileName, QString* encoding = nullptr);
+QDLLEXPORT QString readFileText(const QString& fileName, QString* encoding = nullptr,
+                                bool* hadBOM = nullptr);
 
 /**
  * @brief decode raw text data. This tries to guess the encoding used in the file
@@ -462,7 +463,7 @@ QDLLEXPORT QString readFileText(const QString& fileName, QString* encoding = nul
  * @return the textual content of the file or an empty string if the file doesn't exist
  **/
 QDLLEXPORT QString decodeTextData(const QByteArray& fileData,
-                                  QString* encoding = nullptr);
+                                  QString* encoding = nullptr, bool* hadBOM = nullptr);
 
 /**
  * @brief delete files matching a pattern
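
The contract of the new `hadBOM` out-parameter, as implemented in `decodeTextData` in src/utility.cpp, can be illustrated without Qt. The following is a rough sketch using `std::u16string` in place of `QString`. `stripLeadingBom` is a hypothetical name and is not part of the library.

```cpp
#include <cassert>
#include <string>

// Removes a single leading U+FEFF from already-decoded text.
// If hadBOM is non-null, it is always written: true when a BOM was
// removed, false otherwise. Callers that don't care can omit it.
std::u16string stripLeadingBom(std::u16string text, bool* hadBOM = nullptr)
{
  const bool found = !text.empty() && text.front() == u'\uFEFF';
  if (found) {
    text.erase(0, 1);
  }
  if (hadBOM != nullptr) {
    *hadBOM = found;
  }
  return text;
}

// Usage: the returned text never starts with the BOM, and the flag
// records whether one was present in the input.
void demoStripLeadingBom()
{
  bool hadBOM = false;
  const std::u16string text = stripLeadingBom(u"\uFEFFabc", &hadBOM);
  assert(text == u"abc");
  assert(hadBOM);
}
```

Because the parameter defaults to `nullptr` in the header, existing callers of `readFileText` and `decodeTextData` keep compiling unchanged.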
