Tabulizer does not read top rows when there is no line on top #121

gorkang · 2020-07-01T19:30:11Z

This issue is about a possible bug.

Please see the reprex below. In brief, tabulizer extracts only 684 of the 746 rows in this document. It mostly skips the first rows of each page, starting at page 2. Those rows do not have a ruling line above them.

## rJava loads successfully
# install.packages("rJava")
library("rJava")

## load package
library("tabulizer")

## code goes here


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

table_list = extract_tables("https://github.com/ropensci/tabulizer/files/4860026/DOC.pdf", method = "lattice")
temp_df = 1:length(table_list) %>% map_df(~ table_list[[.x]]%>% as_tibble)  
#> Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.

nrow(temp_df)
#> [1] 684

# COMMENT: tabulizer extracts only 684 of the 746 rows. It mostly skips the first rows of each page (starting at page 2); those rows do not have a line above them.
  


## session info for your system
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
#>  [3] LC_TIME=es_CL.UTF-8           LC_COLLATE=en_US.UTF-8       
#>  [5] LC_MONETARY=es_CL.UTF-8       LC_MESSAGES=en_US.UTF-8      
#>  [7] LC_PAPER=es_CL.UTF-8          LC_NAME=es_CL.UTF-8          
#>  [9] LC_ADDRESS=es_CL.UTF-8        LC_TELEPHONE=es_CL.UTF-8     
#> [11] LC_MEASUREMENT=es_CL.UTF-8    LC_IDENTIFICATION=es_CL.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] purrr_0.3.4     dplyr_1.0.0     tabulizer_0.2.2 rJava_0.9-12   
#> 
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.29          magrittr_1.5        tidyselect_1.1.0   
#>  [4] R6_2.4.1            rlang_0.4.6         stringr_1.4.0      
#>  [7] highr_0.8           tools_3.6.3         xfun_0.15          
#> [10] png_0.1-7           htmltools_0.5.0     ellipsis_0.3.1     
#> [13] yaml_2.2.1          digest_0.6.25       tibble_3.0.1       
#> [16] lifecycle_0.2.0     crayon_1.3.4        vctrs_0.3.1        
#> [19] glue_1.4.1          evaluate_0.14       rmarkdown_2.3      
#> [22] stringi_1.4.6       compiler_3.6.3      pillar_1.4.4       
#> [25] tabulizerjars_1.0.1 generics_0.0.2      pkgconfig_2.0.3

Created on 2020-07-01 by the reprex package (v0.3.0)

DOC.pdf
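
To see which pages lose rows, it may help to compare the number of rows extracted per table with the number expected. The following is only a minimal sketch: `table_list` here is a hypothetical stand-in built from empty matrices rather than real `extract_tables()` output, and the `expected_rows` values are made up for illustration (in practice they would have to be counted from the PDF).

```r
# Hypothetical stand-in for the list returned by extract_tables():
# one character matrix per page, with made-up sizes.
table_list <- list(
  matrix("", nrow = 40, ncol = 5),
  matrix("", nrow = 37, ncol = 5),
  matrix("", nrow = 38, ncol = 5)
)

# Rows extracted per table, in page order.
rows_per_table <- vapply(table_list, nrow, integer(1))

# Hypothetical expected counts, e.g. obtained by counting rows in the PDF by hand.
expected_rows <- c(40L, 38L, 39L)

# One line per page: how many rows were extracted and how many are missing.
page_report <- data.frame(
  page      = seq_along(table_list),
  extracted = rows_per_table,
  expected  = expected_rows,
  missing   = expected_rows - rows_per_table
)
print(page_report)
```

Pages with a non-zero `missing` value are the ones to inspect for a first row without a line on top.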

gorkang · 2020-07-01T19:42:02Z

For the sake of completeness (or in case it is useful to someone), my workaround was the following:

library(tabulizer)
library(dplyr)
library(purrr)

# Define the PDF location once so every call uses the same file
filename = "https://github.com/ropensci/tabulizer/files/4860026/DOC.pdf"

# Manually set the table areas as c(top, left, bottom, right); page 1 differs from the rest
p1_area = list(c(154.24309, 17.75138, 782.15470, 600.81215))
p2_area = list(c(95.17127, 24.31492, 780, 594.24862))

# Count the pages so that there is one area per page.
# This assumes the lattice method finds exactly one table on every page.
num_pages = length(extract_tables(filename, method = "lattice"))
areas_all = c(p1_area, rep(p2_area, num_pages - 1))

# Extract the tables using the manual areas and bind them into one tibble
table_list = extract_tables(filename, method = "lattice", area = areas_all)
temp_df = map_df(table_list, as_tibble)
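
The step that builds one area per page can also be wrapped in a small function. This is just a sketch: `build_page_areas()` is a hypothetical helper, not part of tabulizer, and the page count of 4 is an arbitrary example value. The coordinates are the ones from the workaround above.

```r
# Hypothetical helper (not part of tabulizer): build one area per page,
# using a different area for the first page than for the remaining pages.
build_page_areas <- function(first_area, other_area, n_pages) {
  stopifnot(
    is.numeric(first_area), length(first_area) == 4,
    is.numeric(other_area), length(other_area) == 4,
    length(n_pages) == 1, n_pages >= 1
  )
  # rep() with times = 0 returns an empty list, so a one-page document also works.
  c(list(first_area), rep(list(other_area), n_pages - 1))
}

# Coordinates are c(top, left, bottom, right), copied from the workaround.
p1 <- c(154.24309, 17.75138, 782.15470, 600.81215)
p2 <- c(95.17127, 24.31492, 780, 594.24862)

# 4 pages is an arbitrary example value, not the page count of DOC.pdf.
areas_all <- build_page_areas(p1, p2, n_pages = 4)
length(areas_all)
```

The resulting list can be passed as `area = areas_all` to `extract_tables()`, the same way as in the workaround. The input checks are there so that a page count of zero fails early instead of reaching `rep()` with a negative count.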
