Wrong encoding detection while reading a text from the MS Word document #2713

Open
@xm74

Description

Describe the bug and add attachments

I encountered an issue with some MS Word (.doc) documents loaded with the MsDoc reader: the parsed structure returns text in the wrong encoding.
Here is one such document and the code used for parsing. In this case, instead of the original text Test credit card 4316020000630490 Thank you, it returns 敔瑳挠敲楤⁴慣摲㐠ㄳ〶〲〰㘰〳㤴ര桔湡.
Unfortunately, GitHub does not allow attaching this file, so here is a link to download it.
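
The garbled output above looks like single-byte text that was read two bytes at a time as UTF-16LE ("Te" is 0x54 0x65, which as one little-endian code unit is U+6554, 敔). The following is a small check of that reading, not a statement about what the reader does internally. It assumes the mbstring extension is available and only re-encodes the string quoted above.

```php
<?php
// Garbled string copied verbatim from the report above.
$garbled = '敔瑳挠敲楤⁴慣摲㐠ㄳ〶〲〰㘰〳㤴ര桔湡';

// Re-encode the UTF-8 string as UTF-16LE to get back the raw byte stream
// that would have produced these characters.
$bytes = mb_convert_encoding($garbled, 'UTF-16LE', 'UTF-8');

// json_encode makes the embedded carriage return visible as \r.
echo json_encode($bytes), PHP_EOL;
// "Test credit card 4316020000630490\rThan"
```

The recovered bytes match the start of the expected text, with a carriage return where the paragraph break sits, which suggests the text itself is intact and only the byte width is being misjudged.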

Expected behavior

Returns the text in the correct encoding

Steps to reproduce

Here is the sample (simplified) code used for parsing.

// Collects the plain text of every section into $text.
$text = '';

$parser = \PhpOffice\PhpWord\IOFactory::load('/path/to/base_c1.doc', 'MsDoc');
$sections = $parser->getSections();
foreach ($sections as $section) {
	$elements = $section->getElements();
	parse_document($elements, $text);
}

// Appends the text of each element to $text, descending into TextRun children.
function parse_document($elements, &$text) {
	foreach ($elements as $element) {
		$class = get_class($element);
		if (method_exists($class, 'getText')) {
			$text .= $element->getText() . PHP_EOL;
		} else {
			if ($class == 'PhpOffice\PhpWord\Element\TextRun') {
				parse_document($element->getElements(), $text);
			}
			$text .= PHP_EOL;
		}
	}
}
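
Until the reader handles such files, a caller could post-process the extracted text. Below is a minimal sketch of that idea. It assumes the garbling is always single-byte ASCII text misread as UTF-16LE, as the sample output in this report suggests. `repair_misdecoded_text` is a hypothetical helper, not part of PHPWord, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper (not a PHPWord API): undo a suspected
// "8-bit text read as UTF-16LE" mix-up when the result is plain ASCII.
function repair_misdecoded_text(string $text): string
{
    if ($text === '' || !mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Recover the raw bytes behind the UTF-16LE code units.
    $candidate = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');

    // Accept the candidate only if every byte is printable ASCII,
    // a tab, a line feed, or a carriage return.
    if (preg_match('/\A[\x09\x0A\x0D\x20-\x7E]+\z/', $candidate) === 1) {
        return $candidate;
    }

    return $text;
}

echo repair_misdecoded_text('敔瑳'), PHP_EOL;  // Test
echo repair_misdecoded_text('Hello'), PHP_EOL; // Hello
```

This is a heuristic with real limits: genuine CJK text whose code units happen to consist of two printable ASCII bytes would be converted by mistake, and text containing non-ASCII single-byte characters would be left garbled. It is only a stopgap for documents known to be affected.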

PHPWord version(s) where the bug happened

1.3.0

PHP version(s) where the bug happened

8.1

Priority

  • I want to crowdfund the bug fix (with @algora-io) and fund a community developer.
  • I want to pay the bug fix and fund a maintainer for that. (Contact @Progi1984)
