Description
Describe the bug and add attachments
With some documents of the MsDoc type, parsing the document structure returns text in the wrong encoding.
Here is one such document and the code used for parsing. In this case, instead of the original text Test credit card 4316020000630490 Thank you
it returns 敔瑳挠敲楤⁴慣摲㐠ㄳ〶〲〰㘰〳㤴ര桔湡
Unfortunately, GitHub does not allow attaching this file type to an issue, so here is a link to download it.
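A note on the symptom: the garbled output looks like single-byte text that was decoded two bytes at a time as UTF-16LE. For example, 敔 is U+6554, whose little-endian bytes 0x54 0x65 are the ASCII characters "Te". The sketch below reverses that step to show the original text is still recoverable. The helper name `undo_utf16le_misdecoding` is hypothetical and not part of PHPWord, and the code assumes the `mbstring` extension is available. This only illustrates the symptom; I have not confirmed where in the MsDoc reader it originates.

```php
<?php
// Hypothetical helper (not part of PHPWord): re-encode the garbled UTF-8
// string as UTF-16LE, which yields the original single-byte text.
// Assumes the mbstring extension is installed.
function undo_utf16le_misdecoding(string $garbled): string
{
    return mb_convert_encoding($garbled, 'UTF-16LE', 'UTF-8');
}

// The exact string returned by getText() in this report.
$garbled = '敔瑳挠敲楤⁴慣摲㐠ㄳ〶〲〰㘰〳㤴ര桔湡';

$recovered = undo_utf16le_misdecoding($garbled);

// The paragraph break comes back as a bare carriage return.
echo str_replace("\r", PHP_EOL, $recovered), PHP_EOL;
```

The recovered string starts with the expected text and the card number, followed by a carriage return and the first letters of the second paragraph. The returned string is also shorter than the original, which may mean the character count is applied to 16-bit units rather than bytes.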
Expected behavior
The parser returns the text in the correct encoding.
Steps to reproduce
Here is the sample (simplified) code used for parsing.
$text = '';
$parser = \PhpOffice\PhpWord\IOFactory::load('/path/to/base_c1.doc', 'MsDoc');
$sections = $parser->getSections();
foreach ($sections as $section) {
    $elements = $section->getElements();
    parse_document($elements, $text);
}

function parse_document($elements, &$text) {
    foreach ($elements as $element) {
        $class = get_class($element);
        if (method_exists($class, 'getText')) {
            $text .= $element->getText() . PHP_EOL;
        } else {
            if ($class == 'PhpOffice\PhpWord\Element\TextRun') {
                parse_document($element->getElements(), $text);
            }
            $text .= PHP_EOL;
        }
    }
}
PHPWord version(s) where the bug happened
1.3.0
PHP version(s) where the bug happened
8.1
Priority
- I want to crowdfund the bug fix (with @algora-io) and fund a community developer.
- I want to pay the bug fix and fund a maintainer for that. (Contact @Progi1984)