Can't get title and subject from metadata #529
Comments
Did this work before? If not, it's a feature request, isn't it?
Hi @k00ni! Thank you for the reply.
Hi!
No, sorry.
This happens because the metadata contains a name with a hexadecimal-encoded space: "Document#20Type". The fix proposed by @philippze adds a position increment so that the loop eventually skips the offending metadata name instead of parsing it. The correct fix is to allow the '#' character in property names and then decode each "#xx" sequence into the character it represents. From Element.php ...
if (!$only_values) {
    // Allow '#' in names so that hex escapes such as "#20" are captured.
    if (!preg_match('/\G\s*(?P<name>\/[A-Z#0-9\._]+)(?P<value>.*)/si', $content, $match, 0, $position)) {
        break;
    } else {
        // Decode every "#xx" escape (two hex digits, including A-F) to its character.
        $name = preg_replace_callback(
            '/#([0-9a-f]{2})/i',
            function ($m) {
                return \chr((int) hexdec($m[1]));
            },
            ltrim($match['name'], '/')
        );
        $value = $match['value'];
        $position = strpos($content, $value, $position + \strlen($match['name']));
    }
} else {
... The result of
Add ability for PdfParser to parse metadata names with hexadecimal encoded characters such as "Document#20Type" where \#20 is a space. Resolves Issue #529
* Update Element.php
  Add ability for PdfParser to parse metadata names with hexadecimal encoded characters such as "Document#20Type" where \#20 is a space. Resolves Issue #529
* Update ElementTest.php
  Add test for spaces in metadata property names.
* Make sure we fully support hex
  Too quick on the commit! Make sure our two 'digit' regexp also finds A-F hex digits. Add a test for #2d which is a hyphen.
* fixed coding style issue in Element.php
---------
Co-authored-by: Konrad Abicht <hi@inspirito.de>
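The decoding step described in these commits can be tried in isolation. The following is only a minimal sketch: `decodePdfName` is a hypothetical helper name invented for illustration and is not part of PdfParser's API. It assumes that every "#" followed by two hex digits (0-9, A-F, either case) stands for one byte.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not PdfParser's API): strips the leading "/" from a
 * raw PDF name and decodes each "#xx" hex escape to the byte it represents.
 */
function decodePdfName(string $rawName): string
{
    $name = ltrim($rawName, '/');

    $decoded = preg_replace_callback(
        '/#([0-9a-fA-F]{2})/',            // exactly two hex digits after '#'
        static function (array $m): string {
            return \chr((int) hexdec($m[1]));
        },
        $name
    );

    // preg_replace_callback() returns null only on a regex error;
    // fall back to the undecoded name in that case.
    return $decoded ?? $name;
}

echo decodePdfName('/Document#20Type'), PHP_EOL; // "#20" is a space
echo decodePdfName('/Sub#2dject'), PHP_EOL;      // "#2d" is a hyphen
```

A pattern of `#(\d\d)` would decode "#20" but leave "#2d" untouched, which is why the character class has to cover A-F as well.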
Description:
When trying to get this specific PDF metadata, I only get the author. Also need the title and subject which are there as you can see on the screenshot below:
https://prnt.sc/aOy-Sl6sSIUW
PDF input
https://www.alliancehealthplan.org/document-library/67033/
Expected output & actual output
Expected output:
Actual output:
Code
Note: It works as expected, for example, with this file:
https://www.alliancehealthplan.org/document-library/70434/
Output: