Can't get title and subject from metadata #529

Closed
borgir opened this issue Apr 24, 2022 · 5 comments · Fixed by #612

borgir commented Apr 24, 2022

  • PHP Version: 7.3.21
  • PDFParser Version: 2.2.0

Description:

When reading the metadata of this specific PDF, the Title and Subject fields are missing from the result, even though they are present in the file, as you can see in the screenshot below:
https://prnt.sc/aOy-Sl6sSIUW

PDF input

https://www.alliancehealthplan.org/document-library/67033/

Expected output & actual output

Expected output:

Array
(
    [Author] => Tasha Jennings
    [Company] => Alliance Health
    [CreationDate] => 2022-04-21T11:25:35-04:00
    [Creator] => Acrobat PDFMaker 22 for PowerPoint
    [Pages] => 3
    [Title] => Enrollment and Client Update Overview
    [Subject] => Enrollment and Client Update Overview
)

Actual output:

Array
(
    [Author] => Tasha Jennings
    [Company] => Alliance Health
    [CreationDate] => 2022-04-21T11:25:35-04:00
    [Creator] => Acrobat PDFMaker 22 for PowerPoint
    [Pages] => 3
)

Code

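    try {
        // Parse the file and read its document-level metadata.
        $parser = new \Smalot\PdfParser\Parser();
        $pdf = $parser->parseFile($file_path);
        $meta_data = $pdf->getDetails();
    } catch (\Exception $e) {
        // Fully qualified so the catch also works inside a namespace.
        $field['value'] = 'Error: ' . $e->getMessage();
    }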

Note: It works as expected, for example, with this file:
https://www.alliancehealthplan.org/document-library/70434/

Output:

Array
(
    [Author] => Hewett-RobinsonL
    [Company] => SRAHEC
    [CreationDate] => 2022-04-19T10:36:25-04:00
    [Creator] => Acrobat PDFMaker 22 for Word
    [Keywords] => 
    [ModDate] => 2022-04-19T10:36:26-04:00
    [Producer] => Adobe PDF Library 22.1.149
    [SourceModified] => 2022-04-19T14:36:12+00:00
    [Subject] => Social Determinants of Health: Now is the Time
    [Title] => Social Determinants of Health: Now is the Time
    [Pages] => 2
)
Collaborator

k00ni commented Apr 25, 2022

Did this work before? If not, it's a feature request, isn't it?

Author

borgir commented Apr 26, 2022

Hi @k00ni! Thank you for the reply.
It works with some PDF files, but not with this one in particular.
I've updated the question with an example that works correctly.
Thanks.

k00ni added the bug label and removed the needs more info label Apr 26, 2022
Author

borgir commented Apr 29, 2022

Hi!
Any clue on what might be the problem?
Thks

Collaborator

k00ni commented Apr 30, 2022

No, sorry.

philippze added a commit to philippze/pdfparser that referenced this issue May 15, 2022
Contributor

GreyWyvern commented Jul 6, 2023

This is happening because there is a 'name' in the metadata that includes a hexadecimal-encoded space: "Document#20Type". The fix proposed by @philippze adds a position increment that loops and eventually bypasses the offending metadata name instead of parsing it.

The correct fix is to allow the '#' character in property names, and then convert them to the correct characters.

From Element.php

            ...
            if (!$only_values) {
                if (!preg_match('/\G\s*(?P<name>\/[A-Z#0-9\._]+)(?P<value>.*)/si', $content, $match, 0, $position)) {
                    break;
                } else {
                    // Decode "#xx" escapes; both characters are hex digits, so match A-F as well as 0-9.
                    $name = preg_replace_callback(
                        '/#([0-9a-f]{2})/i',
                        function ($m) {
                            return \chr(hexdec($m[1]));
                        },
                        ltrim($match['name'], '/')
                    );
                    $value = $match['value'];
                    $position = strpos($content, $value, $position + \strlen($match['name']));
                }
            } else {
            ...

The result of getDetails() will then be:

array(9) {
  ["Author"]=>
  string(14) "Tasha Jennings"
  ["Company"]=>
  string(15) "Alliance Health"
  ["CreationDate"]=>
  string(25) "2022-04-21T11:25:35-04:00"
  ["Creator"]=>
  string(34) "Acrobat PDFMaker 22 for PowerPoint"
  ["Document Type"]=>
  string(9) "Templates"
  ["ModDate"]=>
  string(25) "2022-04-21T11:25:36-04:00"
  ["Producer"]=>
  string(26) "Adobe PDF Library 22.1.149"
  ["Subject"]=>
  string(37) "Enrollment and Client Update Overview"
  ["Title"]=>
  string(37) "Enrollment and Client Update Overview"
}
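
To try the name-decoding step in isolation, here is a minimal standalone sketch. The helper name `decodePdfName` is hypothetical and not part of PdfParser; it only illustrates the idea of turning `#` followed by two hexadecimal digits into the corresponding character, assuming the raw name still carries its leading slash.

```php
<?php

declare(strict_types=1);

/**
 * Decode "#xx" escapes in a raw PDF name such as "/Document#20Type".
 * Hypothetical helper for illustration only; not part of PdfParser.
 */
function decodePdfName(string $rawName): string
{
    // Drop the leading slash that marks a PDF name.
    $name = ltrim($rawName, '/');

    // Replace each "#" followed by two hex digits (0-9, a-f, A-F)
    // with the character that has that code.
    $decoded = preg_replace_callback(
        '/#([0-9a-fA-F]{2})/',
        static function (array $m): string {
            return chr((int) hexdec($m[1]));
        },
        $name
    );

    // preg_replace_callback() returns null on a regex error; fall back to the undecoded name.
    return $decoded ?? $name;
}

echo decodePdfName('/Document#20Type'), PHP_EOL; // Document Type
echo decodePdfName('/Sub#2dtitle'), PHP_EOL;     // Sub-title
```

Matching only decimal digits after the `#` would decode `#20` but leave escapes such as `#2d` (a hyphen) untouched, so the pattern accepts the full hexadecimal range.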

GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this issue Jul 6, 2023
Add ability for PdfParser to parse metadata names with hexadecimal encoded characters such as "Document#20Type" where \#20 is a space.
Resolves Issue smalot#529
k00ni added a commit that referenced this issue Jul 11, 2023
* Update Element.php

Add ability for PdfParser to parse metadata names with hexadecimal encoded characters such as "Document#20Type" where \#20 is a space.
Resolves Issue #529

* Update ElementTest.php

Add test for spaces in metadata property names.

* Make sure we fully support hex

Too quick on the commit! Make sure our two 'digit' regexp also finds A-F hex digits. Add a test for #2d which is a hyphen.

* fixed coding style issue in Element.php

---------

Co-authored-by: Konrad Abicht <hi@inspirito.de>