
Range Error on freshly generated and valid PDF #11

Closed
dnnsjrng opened this issue Mar 8, 2020 · 17 comments
dnnsjrng commented Mar 8, 2020

Hey!

First of all, thanks for your amazing repos! I love using them.

Today I encountered a problem with freshly generated and valid PDFs (from InDesign / Affinity Publisher, Adobe Acrobat, or even self-generated with qpdf).

Scenario:

Read the file with FileReader as a binary string -> hand it over to a PDF stream extraction function (regex out the FlateDecode part) -> UZIP.inflate -> put the PDF file back together

FileReader:

var request = new XMLHttpRequest();
request.open('GET', file, true);
request.responseType = 'blob';
request.onload = function() {
    var reader = new FileReader();
    reader.onload = function(e) {
        console.log(e.target.result);
        stripPDFStream(e.target.result); // binary string, one character per byte
    };
    reader.readAsBinaryString(request.response);
};
request.send();

Stream Extraction:

function stripPDFStream(input) {
    var regex = /FlateDecode.*?stream(.*?)endstream/gs;
    var output = input;
    var result;
    while ((result = regex.exec(input)) !== null) {
        console.log(result[1]); // <---- stream data
        var stripped = result[1].replace(/^\s+|\s+$/g, '');
        var inflated = inflatePDF(stripped);
        output = output.replace(result[0], inflated); // accumulate replacements
    }
    return output;
}

Inflate:

function inflatePDF(input) {
    var enc = new TextEncoder(); // always utf-8
    var uint8array = enc.encode(input);
    var data = UZIP.inflateRaw(uint8array);
    console.log(data);
    return data;
}

UZIP throws the error: Uncaught RangeError: Invalid typed array length.

Over at the pako.js repo you said that your UZIP repository is capable of inflating a PDF FlateDecode stream, and in another issue here that the array length doesn't matter, because you copy over to a bigger array if the data does not fit into the output array.

It would be awesome if you could help. I'm at a loss. :(

Greetings,
Dennis
test.pdf

photopea commented Mar 8, 2020

Hi, first, let's check whether your uint8array is valid input. Can you try using pako.js and call pako.inflateRaw(...) instead of UZIP.inflateRaw(...)? Does it work?

dnnsjrng commented Mar 8, 2020

Thanks for your quick response!

Pako throws the following error:
Uncaught invalid stored block lengths

Btw., I read about your problems with pako (nodeca/pako#174); that's why I remembered your repositories and switched to UZIP :)

photopea commented Mar 8, 2020

In the case of nodeca/pako#174, the input data was not completely correct.

pako.js is a direct rewrite of the ZLIB library, so if something does not work in pako.js, your input is probably wrong (and UZIP.js will not help you). The only advantage of UZIP.js over pako.js is that it is faster :)

dnnsjrng commented Mar 8, 2020

I appreciate your quick responses here, thank you!

Did you take a look at the PDF? Unless it is due to pako/UZIP, the only way I can explain it is my regex pattern. Maybe you have an idea; after all, you have much more experience reading PDFs. Hopefully :)

photopea commented Mar 8, 2020

I am not that good with Blobs and regexes :) What exactly is your goal? Maybe you could use UDOC.js.

Could you open your PDF in a text editor and check, if the bytes you are extracting correspond to what you want?

dnnsjrng commented Mar 8, 2020

To avoid unnecessary upload time with very large PDFs (>3 GB), I check the PDFs in a client-side JS preflight for print press.

Because of the PDF size, I read the PDFs in chunks, parse them into a clean array, search for stream objects (FlateDecode), and try to inflate them. If a stream contains further objects and maybe images, I keep the objects and discard the image data except for the metadata.

At the end I will hopefully have more or less a PDF skeleton with a clean dictionary, incl. page tree and so on, without any images, in a handy array format, and can throw it into the preflight. If the preflight succeeds, I upload the original PDF.

The preflight system is done already and works well. Because of the chunked reading, the stream objects give me a headache. Without the chunked reading I could let pdf.js do the work.

photopea commented Mar 8, 2020

So you want to remove images from a PDF, but keep the rest as a valid PDF file? You would have to rebuild the XREF table, etc. It is not that easy.

dnnsjrng commented Mar 8, 2020

The PDF freed from the images no longer needs to be valid. I only need all information like TrimBoxes, OutputIntent etc. for the preflight. The PDF is no longer displayed or needed.

For later printing I upload the unmodified original PDF, of course.

Is your udoc.js capable of that?

photopea commented Mar 8, 2020

UDOC.js only extracts the graphic content and allows your own code to process it (e.g. render it or convert it to another format). It does not read metadata etc.

If you have a valid Flate stream, both pako.js and UZIP.js should work. If you load a file in chunks, a single stream can start in one chunk and end in another.
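One way to deal with streams spanning chunk boundaries is a small carry buffer: keep the unmatched tail of the previous chunk and prepend it to the next one, so a "stream ... endstream" pair split across a boundary is still found once the closing marker arrives. A minimal sketch, not from the thread (the helper name and callback shape are illustrative, and it ignores the "endstream"-inside-a-stream case discussed below):

```javascript
// Hypothetical sketch: scan chunked input for complete stream ... endstream
// spans, carrying any unfinished tail over to the next chunk.
function makeStreamScanner(onStream) {
    var carry = '';
    return function feed(chunk) {
        var buf = carry + chunk;
        var regex = /stream\r?\n([\s\S]*?)endstream/g;
        var lastEnd = 0, m;
        while ((m = regex.exec(buf)) !== null) {
            onStream(m[1]);              // complete stream body found
            lastEnd = regex.lastIndex;
        }
        // Keep everything after the last complete match; it may contain
        // the beginning of a stream that only ends in a later chunk.
        carry = buf.slice(lastEnd);
    };
}
```

Feeding chunks in order, a body split across two chunks surfaces only after the chunk containing its endstream has been fed.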

dnnsjrng commented Mar 8, 2020

Thanks! Nice. Reading the RFC to find out whether I have to wait for the end of the stream when decoding Deflate was still on my to-do list. Do I understand you correctly that I don't have to wait for a stream that is possibly not yet complete in my chunk, but can decode it directly?

If I may ask one more question about the stream: are there any characters between "stream" and "endstream" that I need to escape? The PDF Reference doesn't mention such a thing, or I didn't find it. So can I actually read from the first byte after "stream" to the last byte before "endstream" and hand it to pako/UZIP for decompression?

Thanks a lot for your help!

photopea commented Mar 8, 2020

Yes, if you extract a Flate stream from a PDF, you can decompress it.
PDF is a bit tricky: there can be line ends after "stream" and before "endstream". In theory, there could even be "endstream" inside a stream. The true length of a stream is stored in its stream dictionary, e.g. << .... /Length 268
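A hedged sketch of slicing a stream body by its dictionary /Length instead of searching for "endstream" (the function name is illustrative; it handles only direct numeric /Length values, not indirect object references, which real PDFs also use):

```javascript
// Sketch: given the text of one PDF object (dictionary + stream),
// use the /Length entry to slice the exact body bytes.
function sliceStreamByLength(objText) {
    var lenMatch = /\/Length\s+(\d+)/.exec(objText);
    if (!lenMatch) return null;             // indirect /Length not handled
    var length = parseInt(lenMatch[1], 10);
    var start = objText.indexOf('stream');
    if (start === -1) return null;
    start += 'stream'.length;
    // The keyword may be followed by CRLF or a single LF.
    if (objText[start] === '\r') start++;
    if (objText[start] === '\n') start++;
    // Exact body, even if it happens to contain the bytes "endstream".
    return objText.substr(start, length);
}
```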

dnnsjrng commented Mar 8, 2020

So I went back to the beginning, just to make sure everything is right so far:

a) I generate a deflate base64 string eJwrycgszixOLE7MLchJLUmtKAEAQN0HPQ== from this source thisisasampletext
b) I checked if it is working back and forth with some online generators
c) Check it with pako / uzip in a simple function

And voila. It is working back and forth.

.... 1h later ...

I got rid of the TextEncoder:

var enc = new TextEncoder(); // always utf-8
var uint8array = enc.encode(input);

and now pass the binary string straight to pako:

function inflatePDF(input) {
    var data = pako.inflate(input);
    console.log(data);
    return data;
}

and check! I finally got my FlateDecoded PDF stream object.
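For context on why removing the TextEncoder helps: a binary string from readAsBinaryString maps each byte to one character, but TextEncoder re-encodes the string as UTF-8, so every code point above 0x7F becomes two bytes and the Flate stream is corrupted before it ever reaches the decompressor. A quick demonstration (the helper name is illustrative):

```javascript
// The zlib header byte 0x9C survives as the single character "\u009c"
// in a binary string, but TextEncoder turns it into two UTF-8 bytes.
var enc = new TextEncoder();
var bytes = enc.encode('\u009c');
console.log(bytes); // two bytes: 0xC2 0x9C -- no longer a single 0x9C

// A byte-preserving conversion keeps one byte per character instead:
function binaryStringToBytes(str) {
    var out = new Uint8Array(str.length);
    for (var i = 0; i < str.length; i++) out[i] = str.charCodeAt(i) & 0xFF;
    return out;
}
console.log(binaryStringToBytes('\u009c')); // one byte: 0x9C
```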

It was an instructive weekend about PDF streams.
Thanks a lot for your help!

dnnsjrng closed this as completed Mar 8, 2020
dnnsjrng commented Mar 8, 2020

PS: When I use your UZIP library, I still get the Uncaught RangeError: Invalid typed array length: 0 error. With pako everything is fine.

photopea commented Mar 8, 2020

I uploaded a new version of UZIP.js. Does it work with it?

dnnsjrng commented Mar 8, 2020

I'll give it a try tomorrow!

@dnnsjrng

It is working now! Thanks a lot!

@photopea

Great :)
