Bug: no method matching read_internal_stream_data(::IOStream, ::CosDict, ::Base.GenericIOBuffer{Array{UInt8,1}}) #90

Closed
jakewilliami opened this issue Jun 12, 2020 · 6 comments

Comments

@jakewilliami

Once again I am recursively reading PDFs and I am trying to use this tool. However, I get to the attached pdf and it throws this error:

ERROR: LoadError: MethodError: no method matching read_internal_stream_data(::IOStream, ::CosDict, ::Base.GenericIOBuffer{Array{UInt8,1}})
Closest candidates are:
  read_internal_stream_data(::IO, ::CosDict, !Matched::Int64) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:256
Stacktrace:
 [1] postprocess_indirect_object(::IOStream, ::Int64, ::CosDict, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:331
 [2] parse_indirect_obj(::IOStream, ::Int64, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:359
 [3] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosNullType, ::CosIndirectObjectRef, ::PDFIO.Cos.CosObjectLoc) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:282
 [4] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosIndirectObjectRef) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:275
 [5] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosDict, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:248
 [6] find_resource(::PDFIO.PD.PDFormXObject, ::CosName, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:54
 [7] get_xobject(::PDFIO.PD.PDFormXObject, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:62
 [8] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:846
 [9] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
 [10] Do(::PDFIO.PD.PDFormXObject, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:92
 [11] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:848
 [12] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
 [13] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:145
 [14] pdPageEvalContent at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:144 [inlined]
 [15] pdPageExtractText at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:178 [inlined]
 [16] (::var"#3#4"{PDFIO.PD.PDDocImpl})(::IOStream) at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:34
 [17] open(::var"#3#4"{PDFIO.PD.PDDocImpl}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:298
 [18] open at ./io.jl:296 [inlined]
 [19] getPDFText at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:23 [inlined]
 [20] scanFiles(::String, ::String) at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:67
 [21] top-level scope at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:91
 [22] include(::Module, ::String) at ./Base.jl:377
 [23] exec_options(::Base.JLOptions) at ./client.jl:288
 [24] _start() at ./client.jl:484
in expression starting at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:91

Any idea why?

Thanks for all the work you do on this! It really is excellent.

1.0 (Limits and Continuity).pdf

@sambitdash
Owner

sambitdash commented Jun 12, 2020

file.txt

I see no issues in my set up. Can you check if you are using the current versions?

@sambitdash
Owner

sambitdash commented Jun 12, 2020

You can follow these steps.

  1. Create a fresh directory and change into that.
  2. $ julia
  3. julia> ]activate .
  4. (dir) pkg> add PDFIO
  5. julia> getPDFText("file.pdf", "file.txt")

Now upload any errors you see. Along with your error, send me the Project.toml and Manifest.toml files, and also the output of versioninfo(). The following is the output from my machine.

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600X Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
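
For anyone who prefers a script to typing the steps into the REPL, the environment setup above could be sketched roughly as follows. The helper name `fresh_env_report` is hypothetical (it is not part of PDFIO or Pkg), and the sketch assumes network access to the package registry.

```julia
using Pkg
using InteractiveUtils   # provides versioninfo() when run as a script

# Hypothetical helper: installs `pkgname` into a fresh temporary project
# and prints the information requested above.
function fresh_env_report(pkgname::String)
    dir = mktempdir()
    Pkg.activate(dir)      # same effect as `]activate .` inside `dir`
    Pkg.add(pkgname)       # needs network access to the registry
    Pkg.status()           # lists the resolved package versions
    versioninfo()
    return dir             # Project.toml and Manifest.toml are written here
end
```

Calling `fresh_env_report("PDFIO")` returns the directory that holds the Project.toml and Manifest.toml files to attach.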

@jakewilliami
Author

I followed your steps and they worked. So I was very confused why this wasn't working. I found out that the file that was throwing the error was not the one I originally sent—sorry!!

So I followed your steps again with the new file (the compressed directory is attached):

  1. $ mkdir ~/Desktop/test
  2. $ cd ~/Desktop/test/
  3. $ mv file.pdf ~/Desktop/test/
  4.  $ julia
                    _
        _       _ _(_)_     |  Documentation: https://docs.julialang.org
       (_)     | (_) (_)    |
        _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
       | | | | | | |/ _` |  |
       | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
      _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
     |__/                   |
    
    
  5.  (@v1.4) pkg> activate .
      Activating new environment at `~/Desktop/test/Project.toml`
    
  6.  (test) pkg> add PDFIO
     Updating registry at `~/.julia/registries/General`
     Updating git-repo `https://github.com/JuliaRegistries/General.git`
     Resolving package versions...
     Updating `~/Desktop/test/Project.toml`
     [4d0d745f] + PDFIO v0.1.9
     Updating `~/Desktop/test/Manifest.toml`
     [1520ce14] + AbstractTrees v0.2.1
     [715cd884] + AdobeGlyphList v0.1.1
     [9e28174c] + BinDeps v0.8.10
     [34da2185] + Compat v2.2.0
     [2e475f56] + LabelNumerals v0.1.0
     [4d0d745f] + PDFIO v0.1.9
     [27ebfcd6] + Primes v0.4.0
     [9a9db56c] + Rectangle v0.1.2
     [37834d88] + RomanNumerals v0.3.1
     [30578b45] + URIParser v0.4.1
     [2a0f44e3] + Base64
     [ade2ca70] + Dates
     [8bb1440f] + DelimitedFiles
     [8ba89e20] + Distributed
     [b77e0a4c] + InteractiveUtils
     [76f85450] + LibGit2
     [8f399da3] + Libdl
     [37e2e46d] + LinearAlgebra
     [56ddb016] + Logging
     [d6f4376e] + Markdown
     [a63ad114] + Mmap
     [44cfe95a] + Pkg
     [de0858da] + Printf
     [3fa0cd96] + REPL
     [9a3f8284] + Random
     [ea8e919c] + SHA
     [9e88b42a] + Serialization
     [1a1011a3] + SharedArrays
     [6462fe0b] + Sockets
     [2f01184e] + SparseArrays
     [10745b16] + Statistics
     [8dfed614] + Test
     [cf7118a7] + UUIDs
     [4ec0a83e] + Unicode
    
  7. (test) pkg> ^C (to escape the pkg environment)
  8.  julia> using PDFIO
     [ Info: Precompiling PDFIO [4d0d745f-9d9a-592e-8d18-1ad8a0f42b92]
     Updating registry at `~/.julia/registries/General`
     Updating git-repo `https://github.com/JuliaRegistries/General.git`
     Resolving package versions...
     Updating `~/Desktop/test/Project.toml`
     [458c3c95] + OpenSSL_jll v1.1.1+2
     Updating `~/Desktop/test/Manifest.toml`
     [458c3c95] + OpenSSL_jll v1.1.1+2
    ┌ Warning: Package PDFIO does not have OpenSSL_jll in its dependencies:
    │ - If you have PDFIO checked out for development and have
    │   added OpenSSL_jll as a dependency but haven't updated your primary
    │   environment's manifest file, try `Pkg.resolve()`.
    │ - Otherwise you may need to report an issue with PDFIO
    └ Loading OpenSSL_jll into PDFIO from project dependency, future warnings for PDFIO are suppressed.
     Resolving package versions...
     Updating `~/Desktop/test/Project.toml`
     [83775a58] + Zlib_jll v1.2.11+10
     Updating `~/Desktop/test/Manifest.toml`
     [83775a58] + Zlib_jll v1.2.11+10
    
  9. (need to add the function)
    julia> function getPDFText(src, out)
           # handle that can be used for subsequent operations on the document.
           doc = pdDocOpen(src)
    
           # Metadata extracted from the PDF document.
           # This value is retained and returned as the return from the function.
           docinfo = pdDocGetInfo(doc)
           open(out, "w") do io
    
               # Returns number of pages in the document
               npage = pdDocGetPageCount(doc)
    
               for i=1:npage
    
                   # handle to the specific page given the number index.
                   page = pdDocGetPage(doc, i)
    
                   # Extract text from the page and write it to the output file.
                   pdPageExtractText(io, page)
    
               end
           end
           # Close the document handle.
           # The doc handle should not be used after this call
           pdDocClose(doc)
           return docinfo
       end
    getPDFText (generic function with 1 method)
  10. julia> getPDFText("file.pdf", "file.txt")
    ERROR: MethodError: no method matching read_internal_stream_data(::IOStream, ::CosDict, ::Base.GenericIOBuffer{Array{UInt8,1}})
    Closest candidates are:
    read_internal_stream_data(::IO, ::CosDict, ::Int64) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:256
    Stacktrace:
     [1] postprocess_indirect_object(::IOStream, ::Int64, ::CosDict, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:331
     [2] parse_indirect_obj(::IOStream, ::Int64, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:359
     [3] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosNullType, ::CosIndirectObjectRef, ::PDFIO.Cos.CosObjectLoc) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:282
     [4] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosIndirectObjectRef) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:275
     [5] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosDict, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:248
     [6] find_resource(::PDFIO.PD.PDFormXObject, ::CosName, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:54
     [7] get_xobject(::PDFIO.PD.PDFormXObject, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:62
     [8] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:846
     [9] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
     [10] Do(::PDFIO.PD.PDFormXObject, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:92
     [11] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:848
     [12] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
     [13] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:145
     [14] pdPageEvalContent at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:144 [inlined]
     [15] pdPageExtractText at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:178 [inlined]
     [16] (::var"#3#4"{PDFIO.PD.PDDocImpl})(::IOStream) at ./REPL[4]:19
     [17] open(::var"#3#4"{PDFIO.PD.PDDocImpl}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:298
     [18] open at ./io.jl:296 [inlined]
     [19] getPDFText(::String, ::String) at ./REPL[4]:8
     [20] top-level scope at REPL[5]:1
    
  11. versioninfo()
    Julia Version 1.4.1 
    Commit 381693d3df* (2020-04-14 17:20 UTC) 
    Platform Info:
      OS: macOS (x86_64-apple-darwin18.7.0)
      CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
    

I'm very sorry about sending the wrong file! I must have read the error file incorrectly. Thank you for your help

@jakewilliami
Author

test.zip

@sambitdash
Owner

The bug occurs because the length of the stream object is an indirect object embedded in an object stream. The current implementation does not look for the length attribute in object streams.
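
To illustrate the idea, here is a toy sketch only, not PDFIO's actual code or the actual fix. Every type and function name in it is hypothetical. It models a stream length that is either a direct integer or an indirect reference, where the referenced object may live at a file offset or inside an object stream, and shows that a reader accepting only an integer length raises a MethodError for anything else.

```julia
# Toy model only: every type and function name below is hypothetical,
# not PDFIO's actual code.

struct IndirectRef          # stand-in for a reference such as `12 0 R`
    num::Int
end

struct FileLoc              # object stored at a byte offset in the file
    offset::Int
end

struct ObjStmLoc            # object stored inside a compressed object stream
    stream_num::Int
    index::Int
end

# Pretend storage for already-parsed integer objects.
const FILE_OBJECTS = Dict(100 => 42)
const OBJSTM_OBJECTS = Dict((7, 1) => 5)

# One lookup method per kind of location; the object-stream case is the
# one the comment above describes as not being handled.
lookup(loc::FileLoc) = FILE_OBJECTS[loc.offset]
lookup(loc::ObjStmLoc) = OBJSTM_OBJECTS[(loc.stream_num, loc.index)]

# The length may be a direct integer or an indirect reference.
resolve_length(len::Int, xref) = len
resolve_length(ref::IndirectRef, xref) = lookup(xref[ref])

# Like the method in the error message, this accepts only an integer length.
read_stream_data(data::Vector{UInt8}, len::Int) = data[1:len]

xref = Dict(IndirectRef(3) => FileLoc(100), IndirectRef(4) => ObjStmLoc(7, 1))
data = collect(codeunits("hello world"))

# Resolve the length through the object stream, then read the payload.
len = resolve_length(IndirectRef(4), xref)
payload = String(read_stream_data(data, len))
println(payload)

# Passing an unresolved, non-integer value fails in dispatch.
try
    read_stream_data(data, IOBuffer())
catch err
    println(err isa MethodError)
end
```

In this sketch the length is always resolved to an integer before the read, whichever kind of location the referenced object has.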

@sambitdash
Owner

Fix in c8c3c57
file.txt
