
Range Error on freshly generated and valid PDF #11

Closed
dnnsjrng opened this issue Mar 8, 2020 · 17 comments
dnnsjrng commented Mar 8, 2020

Hey!

First of all, thanks for your amazing repos! I love using them.

Today I encountered a problem with freshly generated and valid PDFs (from InDesign / Affinity Publisher, Adobe Acrobat, or even self-generated with qpdf).

Scenario:

Read the file with FileReader as a binary string -> hand it over to a PDF stream extraction function (regex out the FlateDecode part) -> UZIP.inflate -> put the PDF file back together

FileReader:

var request = new XMLHttpRequest();
request.open('GET', file, true);
request.responseType = 'blob';
request.onload = function() {
    var reader = new FileReader();
    reader.onload = function(e) {
        console.log(e.target.result);
        stripPDFStream(e.target.result); // binary string, one character per byte
    };
    reader.readAsBinaryString(request.response);
};
request.send();

Stream Extraction:

function stripPDFStream(input) {
    var regex = /FlateDecode.*?stream(.*?)endstream/gs;
    var output = input;
    var result;
    while ((result = regex.exec(input)) !== null) {
        console.log(result[1]); // <---- stream data
        var stripped = result[1].replace(/^\s+|\s+$/g, '');
        var inflated = inflatePDF(stripped);
        output = output.replace(result[0], inflated); // accumulate replacements
    }
    return output;
}

Inflate:

function inflatePDF(input) {
    var enc = new TextEncoder(); // always utf-8
    var uint8array = enc.encode(input);
    var data = UZIP.inflateRaw(uint8array);
    console.log(data);
    return data;
}

UZIP throws the error: Uncaught RangeError: Invalid typed array length.

Over at the pako.js repo you said that your UZIP repository is capable of inflating a PDF FlateDecode stream, and in another issue here that the array length doesn't matter, because you copy over to a bigger array if the data does not fit into the output array.

It would be awesome if you could help. I'm at a loss. :(

Greetings,
Dennis
test.pdf

photopea commented Mar 8, 2020

Hi, first, let's check whether your uint8array is valid input. Can you try using pako.js and call pako.inflateRaw(...) instead of UZIP.inflateRaw(...)? Does it work?

dnnsjrng commented Mar 8, 2020

Thanks for your quick response!

Pako throws the following error:
Uncaught invalid stored block lengths

Btw., I read about your problems with pako (nodeca/pako#174); that's why I remembered your repositories and switched to UZIP :)

photopea commented Mar 8, 2020

In the case of nodeca/pako#174, the input data was not completely correct.

pako.js is a direct rewrite of the ZLIB library, so if something does not work in pako.js, your input is probably wrong (and UZIP.js will not help you). The only advantage of UZIP.js over pako.js is that it is faster :)

dnnsjrng commented Mar 8, 2020

I appreciate your quick responses here, thank you!

Did you take a look at the PDF? Unless it is due to pako/UZIP, the only way I can explain it is my regex pattern. Maybe you have an idea; after all, you have much more experience reading PDFs. Hopefully :)

photopea commented Mar 8, 2020

I am not that good with Blobs and regexes :) What exactly is your goal? Maybe you could use UDOC.js.

Could you open your PDF in a text editor and check, if the bytes you are extracting correspond to what you want?

dnnsjrng commented Mar 8, 2020

To avoid unnecessary upload time with very large PDFs (>3 GB), I check the PDFs in a client-side JS preflight for print press.

Because of the PDF size, I read the PDFs in chunks, parse them into a clean array, search for stream objects (FlateDecode), and try to inflate them. If a stream contains further objects and maybe images, I keep the objects and discard the image data except for the metadata.

At the end I will hopefully have more or less a PDF skeleton with a clean dictionary, incl. page tree and so on, without any images, in a handy array format, and can throw it into the preflight. If the preflight succeeds, I upload the original PDF.

The preflight system is done already and works well. Because of the chunked reading, the stream objects give me a headache. Without the chunked reading I could let pdf.js do the work.

photopea commented Mar 8, 2020

So you want to remove images from a PDF, but keep the rest as a valid PDF file? You would have to rebuild the XREF table, etc. It is not that easy.

dnnsjrng commented Mar 8, 2020

The PDF freed from the images no longer needs to be valid. I only need all information like TrimBoxes, OutputIntent etc. for the preflight. The PDF is no longer displayed or needed.

For later printing I upload the unmodified original PDF, of course.

Is your udoc.js capable of that?

photopea commented Mar 8, 2020

UDOC.js only extracts the graphic content and allows your own code to process it (e.g. render it or convert it to another format). It does not read metadata etc.

If you have a valid Flate stream, both pako.js and UZIP.js should work. If you load a file in chunks, a single stream can start in one chunk and end in another.
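One way to deal with streams spanning chunk boundaries is a small carry buffer: keep the unmatched tail of the previous chunk and prepend it to the next one, so a "stream ... endstream" pair split across a boundary is still found once the closing marker arrives. A minimal sketch, not from the thread (the helper name and callback shape are illustrative, and it ignores the "endstream"-inside-a-stream case discussed below):

```javascript
// Hypothetical sketch: scan chunked input for complete stream ... endstream
// spans, carrying any unfinished tail over to the next chunk.
function makeStreamScanner(onStream) {
    var carry = '';
    return function feed(chunk) {
        var buf = carry + chunk;
        var regex = /stream\r?\n([\s\S]*?)endstream/g;
        var lastEnd = 0, m;
        while ((m = regex.exec(buf)) !== null) {
            onStream(m[1]);              // complete stream body found
            lastEnd = regex.lastIndex;
        }
        // Keep everything after the last complete match; it may contain
        // the beginning of a stream that only ends in a later chunk.
        carry = buf.slice(lastEnd);
    };
}
```

Feeding chunks in order, a body split across two chunks surfaces only after the chunk containing its endstream has been fed.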

dnnsjrng commented Mar 8, 2020

Thanks! Nice. Reading the RFC to find out whether I have to wait for the end of the stream when decoding Deflate was still on my to-do list. Do I understand you correctly that I don't have to wait for a stream that is possibly not yet complete in my chunk, but can decode it directly?

If I may ask one more question about the stream: are there any characters between "stream" and "endstream" that I need to escape? The PDF Reference doesn't mention such a thing, or I didn't find it. So can I actually read from the first byte after "stream" to the last byte before "endstream" and hand it to pako/UZIP for decompression?

Thanks a lot for your help!

photopea commented Mar 8, 2020

Yes, if you extract a Flate stream from a PDF, you can decompress it.
PDF is a bit tricky: there can be line ends after "stream" and before "endstream". In theory, there could even be "endstream" inside a stream. The true length of a stream is stored in its stream dictionary, e.g. << .... /Length 268
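A hedged sketch of slicing a stream body by its dictionary /Length instead of searching for "endstream" (the function name is illustrative; it handles only direct numeric /Length values, not indirect object references, which real PDFs also use):

```javascript
// Sketch: given the text of one PDF object (dictionary + stream),
// use the /Length entry to slice the exact body bytes.
function sliceStreamByLength(objText) {
    var lenMatch = /\/Length\s+(\d+)/.exec(objText);
    if (!lenMatch) return null;             // indirect /Length not handled
    var length = parseInt(lenMatch[1], 10);
    var start = objText.indexOf('stream');
    if (start === -1) return null;
    start += 'stream'.length;
    // The keyword may be followed by CRLF or a single LF.
    if (objText[start] === '\r') start++;
    if (objText[start] === '\n') start++;
    // Exact body, even if it happens to contain the bytes "endstream".
    return objText.substr(start, length);
}
```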

dnnsjrng commented Mar 8, 2020

So I went back to the beginning, just to make sure everything is right so far:

a) I generate a deflate base64 string eJwrycgszixOLE7MLchJLUmtKAEAQN0HPQ== from this source thisisasampletext
b) I checked if it is working back and forth with some online generators
c) Check it with pako / uzip in a simple function

And voila. It is working back and forth.

.... 1h later ...

I got rid of the TextEncoder:

var enc = new TextEncoder(); // always utf-8
var uint8array = enc.encode(input);

and now pass the binary string straight to pako:

function inflatePDF(input) {
    var data = pako.inflate(input);
    console.log(data);
    return data;
}

and check! I finally got my FlateDecoded PDF stream object.
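For context on why removing the TextEncoder helps: a binary string from readAsBinaryString maps each byte to one character, but TextEncoder re-encodes the string as UTF-8, so every code point above 0x7F becomes two bytes and the Flate stream is corrupted before it ever reaches the decompressor. A quick demonstration (the helper name is illustrative):

```javascript
// The zlib header byte 0x9C survives as the single character "\u009c"
// in a binary string, but TextEncoder turns it into two UTF-8 bytes.
var enc = new TextEncoder();
var bytes = enc.encode('\u009c');
console.log(bytes); // two bytes: 0xC2 0x9C -- no longer a single 0x9C

// A byte-preserving conversion keeps one byte per character instead:
function binaryStringToBytes(str) {
    var out = new Uint8Array(str.length);
    for (var i = 0; i < str.length; i++) out[i] = str.charCodeAt(i) & 0xFF;
    return out;
}
console.log(binaryStringToBytes('\u009c')); // one byte: 0x9C
```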

It was an instructive weekend about PDF streams.
Thanks a lot for your help!

dnnsjrng closed this as completed Mar 8, 2020
dnnsjrng commented Mar 8, 2020

PS: When I use your UZIP library, I still get the Uncaught RangeError: Invalid typed array length: 0 error. With pako everything is fine.

photopea commented Mar 8, 2020

I uploaded a new version of UZIP.js. Does it work with it?

dnnsjrng commented Mar 8, 2020

I'll give it a try tomorrow!

@dnnsjrng

It is working now! Thanks a lot!

@photopea

Great :)
