Can't get title and subject from metadata #529

borgir · 2022-04-24T01:33:44Z

PHP Version: 7.3.21
PDFParser Version: 2.2.0

Description:

When reading the metadata of this specific PDF, I only get the author. I also need the title and subject, which are present in the file, as you can see in the screenshot below:
https://prnt.sc/aOy-Sl6sSIUW

PDF input

https://www.alliancehealthplan.org/document-library/67033/

Expected output & actual output

Expected output:

Array
(
    [Author] => Tasha Jennings
    [Company] => Alliance Health
    [CreationDate] => 2022-04-21T11:25:35-04:00
    [Creator] => Acrobat PDFMaker 22 for PowerPoint
    [Pages] => 3
    [Title] => Enrollment and Client Update Overview
    [Subject] => Enrollment and Client Update Overview
)

Actual output:

Array
(
    [Author] => Tasha Jennings
    [Company] => Alliance Health
    [CreationDate] => 2022-04-21T11:25:35-04:00
    [Creator] => Acrobat PDFMaker 22 for PowerPoint
    [Pages] => 3
)

Code

    try {
        $parser = new \Smalot\PdfParser\Parser();
        $pdf = $parser->parseFile($file_path);
        $meta_data = $pdf->getDetails();
    } catch (Exception $e) {
        $field['value'] = 'Error: ' . $e->getMessage();
    }

Note: It works as expected, for example, with this file:
https://www.alliancehealthplan.org/document-library/70434/

Output:

Array
(
    [Author] => Hewett-RobinsonL
    [Company] => SRAHEC
    [CreationDate] => 2022-04-19T10:36:25-04:00
    [Creator] => Acrobat PDFMaker 22 for Word
    [Keywords] => 
    [ModDate] => 2022-04-19T10:36:26-04:00
    [Producer] => Adobe PDF Library 22.1.149
    [SourceModified] => 2022-04-19T14:36:12+00:00
    [Subject] => Social Determinants of Health: Now is the Time
    [Title] => Social Determinants of Health: Now is the Time
    [Pages] => 2
)


k00ni · 2022-04-25T06:40:29Z

Did this work before? If not, it's a feature request, isn't it?

borgir · 2022-04-26T01:59:49Z

Hi @k00ni ! Thank you for the reply.
It works with some PDF files. With this in particular, it doesn't work.
I've updated the question with an example that works correctly.
Thks.

borgir · 2022-04-29T23:37:37Z

Hi!
Any clue on what might be the problem?
Thks

k00ni · 2022-04-30T06:35:25Z

No, sorry.

GreyWyvern · 2023-07-06T15:54:25Z

This is happening because there is a 'name' in the metadata that includes a hexadecimal-encoded space: "Document#20Type". The fix proposed by @philippze adds a position increment that loops and eventually bypasses the offending metadata name instead of parsing it.

The correct fix is to allow the '#' character in property names, and then convert them to the correct characters.

From Element.php

            ...
            if (!$only_values) {
                // Allow '#' in property names so hex escapes such as "#20" are matched.
                if (!preg_match('/\G\s*(?P<name>\/[A-Z#0-9\._]+)(?P<value>.*)/si', $content, $match, 0, $position)) {
                    break;
                } else {
                    // Decode each "#xx" escape: two hex digits, so A-F must match too
                    // (e.g. "#20" => space, "#2d" => hyphen).
                    $name = preg_replace_callback(
                        '/#([0-9a-f]{2})/i',
                        function ($m) {
                            return \chr(hexdec($m[1]));
                        },
                        ltrim($match['name'], '/')
                    );
                    $value = $match['value'];
                    $position = strpos($content, $value, $position + \strlen($match['name']));
                }
            } else {
            ...

The result of getDetails() will then be:

array(9) {
  ["Author"]=>
  string(14) "Tasha Jennings"
  ["Company"]=>
  string(15) "Alliance Health"
  ["CreationDate"]=>
  string(25) "2022-04-21T11:25:35-04:00"
  ["Creator"]=>
  string(34) "Acrobat PDFMaker 22 for PowerPoint"
  ["Document Type"]=>
  string(9) "Templates"
  ["ModDate"]=>
  string(25) "2022-04-21T11:25:36-04:00"
  ["Producer"]=>
  string(26) "Adobe PDF Library 22.1.149"
  ["Subject"]=>
  string(37) "Enrollment and Client Update Overview"
  ["Title"]=>
  string(37) "Enrollment and Client Update Overview"
}
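
For anyone who wants to try the decoding step outside the parser, here is a minimal standalone sketch. The function name `decodePdfName` is hypothetical and not part of PdfParser; it only mirrors the idea described above (strip the leading slash, then turn each `#` followed by two hex digits into the character it encodes).

```php
<?php

/**
 * Hypothetical helper, not part of PdfParser: decode "#xx" hex escapes
 * in a PDF name, e.g. "/Document#20Type" => "Document Type".
 */
function decodePdfName(string $rawName): string
{
    // Drop the leading slash that marks a PDF name, if present.
    $name = ltrim($rawName, '/');

    // Replace each '#' followed by two hex digits with the byte it encodes.
    return preg_replace_callback(
        '/#([0-9a-f]{2})/i',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $name
    );
}

var_dump(decodePdfName('/Document#20Type')); // string(13) "Document Type"
```

Names without any `#` escape, such as `/Title`, pass through unchanged apart from the stripped slash.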

* Update Element.php: Add ability for PdfParser to parse metadata names with hexadecimal encoded characters such as "Document#20Type" where #20 is a space. Resolves Issue #529
* Update ElementTest.php: Add test for spaces in metadata property names.
* Make sure we fully support hex: Too quick on the commit! Make sure our two 'digit' regexp also finds A-F hex digits. Add a test for #2d which is a hyphen.
* fixed coding style issue in Element.php

Co-authored-by: Konrad Abicht <hi@inspirito.de>
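
To illustrate why the two-digit pattern has to accept A-F, here is a small sketch comparing a decimal-only pattern with a hex-aware one. The sample name `Date#2dCreated` is made up for illustration; only the `#2d` escape (a hyphen) comes from the commit notes.

```php
<?php

// Illustrative name only; "#2d" is the hex escape for a hyphen.
$name = 'Date#2dCreated';

// Decimal-only pattern: "2d" is not two decimal digits, so nothing matches.
$decimalOnly = preg_match('/#(\d\d)/', $name);

// Hex-aware pattern: matches "#2d" regardless of letter case.
$hexAware = preg_match('/#([0-9a-f]{2})/i', $name, $m);

var_dump($decimalOnly);        // int(0)
var_dump($hexAware);           // int(1)
var_dump(chr(hexdec($m[1])));  // string(1) "-"
```

With the decimal-only pattern the escape is left undecoded in the name, so the property would show up as `Date#2dCreated` rather than `Date-Created`.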

k00ni added the needs more info label Apr 25, 2022

k00ni added bug and removed needs more info labels Apr 26, 2022

philippze added a commit to philippze/pdfparser that referenced this issue May 15, 2022

risky fix for issue smalot#529

0a052f7

GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this issue Jul 6, 2023

Update Element.php

caa0af0

Add ability for PdfParser to parse metadata names with hexadecimal encoded characters such as "Document#20Type" where #20 is a space. Resolves Issue smalot#529

GreyWyvern mentioned this issue Jul 6, 2023

Support metadata element names containing spaces #612

Merged

k00ni closed this as completed in #612 Jul 11, 2023
