Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added calculateTextWidth function to Font #517

Merged
merged 3 commits into from
Mar 21, 2022

Conversation

shtayerc
Copy link
Contributor

Character width is read from font details field 'Widths'.

@shtayerc shtayerc force-pushed the font-calculate-text-width branch from 311ed58 to 538708a Compare February 25, 2022 11:35
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for another pull request @shtayerc!

I have one question regarding $missing parameter. I don't like the current approach, because $missing seems to be an array but is outlined as a non-typed variable. Furthermore having a parameter which contains additional data and is given by reference might lead to undefined/unexpected behavior. What if inside the for-loop an index reference is not set? How reliable is $missing in such situations?

@shtayerc
Copy link
Contributor Author

@k00ni
$missing parameter is solving problem about user not knowing if text width is correct or not. Now we give user a list of missing characters so he can handle errors by himself. I also wanted to make that parameter optional. I saw similar usage in preg_match function. I am not sure what undefined behavior you mean.

If you prefer to not use parameter by reference, it could be done without reference by returning array, for example:

[
    'width' => 123,
    'missing' => ['a', 'b']
]

@k00ni
Copy link
Collaborator

k00ni commented Mar 2, 2022

I am not sure what undefined behavior you mean.

Can we assume that all calls using an array-index inside the function are succeed (meaning all keys we wanna access are set)? Related variables are $width_map[$index_map[$char]], $index_map[$char] and $widths[$width_index].

@shtayerc shtayerc force-pushed the font-calculate-text-width branch from 538708a to c2f801f Compare March 2, 2022 09:18
@shtayerc
Copy link
Contributor Author

shtayerc commented Mar 2, 2022

I am not sure, both arrays are read from pdf file. I added check for all indexes.

@k00ni
Copy link
Collaborator

k00ni commented Mar 2, 2022

I am not sure, both arrays are read from pdf file. I added check for all indexes.

Thanks, this way we can be sure that an index is always set before accessing it.

Thinking about the return value, we should keep float as return value. But is it semantically correct to assume that the width is a float equal or higher than 0? I am not sure here, maybe PDF specification is more clear. What do you think about setting $width = null; instead of $width = 0; (and also adapt return value of function to ?float)?

@shtayerc shtayerc force-pushed the font-calculate-text-width branch 2 times, most recently from 91726cb to c33a4b5 Compare March 2, 2022 14:58
@shtayerc
Copy link
Contributor Author

shtayerc commented Mar 2, 2022

$width = null; is better (character width can be 0. Now user can see if loop was skipped or width is actually 0), changed. According to pdf spec widths are integers, but the array returned by Font::getDetails() contains floats. I guess it makes sense to continue returning float or null.

@k00ni
Copy link
Collaborator

k00ni commented Mar 3, 2022

According to pdf spec widths are integers [...]

We should be in line with the PDF specification here, even though other functions do this differently. Can you paste the part of the specification where you found this information, please? Also change return value from float to integer.

@shtayerc
Copy link
Contributor Author

shtayerc commented Mar 3, 2022

Both specs were found here
In PDF 1.0 spec chapter 4.3

PDF provides two types of numbers, integer and real. Integers may be specified by
signed or unsigned constants. Reals may only be in decimal format. Throughout
this book, number means an object whose type is either integer or real.

In examples of Widths array in the same pdf there are only intergers. Because reals should be written only in decimal format I assumed that Widths are integers.

But in PDF 1.7 spec chapter 3.2.2

Throughout this book, the term number refers to an object whose type may be
either integer or real. Wherever a real number is expected, an integer may be used
instead and is automatically converted to an equivalent real value. For example, it
is not necessary to write the number 1.0 in real format; the integer 1 is sufficient.

But here it says that reals can be written without decimal part. I only checked spec 1.0 before.
Now I think it makes sense to keep float type. It handles both specs (it does not hurt to have int represented as float).

@k00ni
Copy link
Collaborator

k00ni commented Mar 3, 2022

Thank you! OK, lets keep float.

@k00ni
Copy link
Collaborator

k00ni commented Mar 7, 2022

I merged #513, can you add a basic example in https://github.com/smalot/pdfparser/blob/master/doc/Usage.md how to use this new function, please?

Character width is read from font details field 'Widths'.
@shtayerc shtayerc force-pushed the font-calculate-text-width branch from c33a4b5 to 769e594 Compare March 7, 2022 11:32
@shtayerc
Copy link
Contributor Author

shtayerc commented Mar 7, 2022

I added basic example

@k00ni
Copy link
Collaborator

k00ni commented Mar 7, 2022

I extended your example a bit.

I wanna make sure I understand the meaning of this function: You call it with a text part (which must be in the PDF?) and a variable (in which any characters are stored, which have no width). It tries to find the width for the whole text part?

@shtayerc
Copy link
Contributor Author

shtayerc commented Mar 7, 2022

Yes, it is close to what you said. Text part must be written in font from which you call the function - otherwise it is missing. Preferably it should be used with getDataTm() text part and font identifier (from my other PR) to make sure that widths for all characters are available.

@k00ni
Copy link
Collaborator

k00ni commented Mar 7, 2022

Preferably it should be used with getDataTm() text part and font identifier (from my other PR) to make sure that widths for all characters are available.

If you think that is important, please add a hint and/or example to the usage passage. Both of your PRs will be merged so it doesn't matter in which you do this. That will help users and avoid confusion.

Thank you for your time and effort btw.

@k00ni
Copy link
Collaborator

k00ni commented Mar 16, 2022

If you think that is important, please add a hint and/or example to the usage passage. Both of your PRs will be merged so it doesn't matter in which you do this.

@shtayerc, can you give me a short status info about remaining tasks here? Thanks!

@shtayerc
Copy link
Contributor Author

@k00ni I think everything is done. Example usage with dataTm was added in previous PR (at the end of this section)

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for getting back to me so fast.

PR looks good. I will leave it open until end of this week. If there are no objections or comments, it will be merged next week.

@k00ni k00ni merged commit ef70983 into smalot:master Mar 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants