Assert fail in src/ccstruct/pageres.cpp, line 1502 with specific image and language combination - all languages from ubuntu repos #4270

Closed
cchadowitz opened this issue Jun 18, 2024 · 3 comments

Current Behavior

When running this command line:

$ tesseract /opt/sample_013741.jpg -- -l ara+chi_tra

The following occurs:

Estimating resolution as 303
!w_it.cycled_list():Error:Assert failed:in file src/ccstruct/pageres.cpp, line 1502
Aborted (core dumped)

This is reproducible with the following sequence of commands (output is clipped for brevity until the end). They start a clean Ubuntu 24.04 Docker container, update the existing packages, and install tesseract-ocr (for command-line usage) along with the two language packages in question, tesseract-ocr-ara and tesseract-ocr-chi-tra. The test image is the same one as in #4148; wget downloads it for the test. It is also attached to this ticket below.

$ docker run --rm -it ubuntu:24.04 /bin/bash

# apt update && apt upgrade -y && apt install -y wget tesseract-ocr tesseract-ocr-ara tesseract-ocr-chi-tra

# wget -O test.jpg https://private-user-images.githubusercontent.com/5814160/277405666-ac322732-7d8b-45f8-875c-83a04b278165.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg3MzIyNzMsIm5iZiI6MTcxODczMTk3MywicGF0aCI6Ii81ODE0MTYwLzI3NzQwNTY2Ni1hYzMyMjczMi03ZDhiLTQ1ZjgtODc1Yy04M2EwNGIyNzgxNjUuanBnP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MThUMTczMjUzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzZlZDY1ZDgyYmRhNmNkYzY4M2QxYTRkMTk5ZWExMjE3MjI5MWUxZjg5NjEzZDY2ZGVlYTMzNDQ4MWEwNDYwOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.RKjXwn4LLeWj619ALbtICZ21i-qcU6cv_Qr_Q6G3jZY
--2024-06-18 17:34:55--  https://private-user-images.githubusercontent.com/5814160/277405666-ac322732-7d8b-45f8-875c-83a04b278165.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg3MzIyNzMsIm5iZiI6MTcxODczMTk3MywicGF0aCI6Ii81ODE0MTYwLzI3NzQwNTY2Ni1hYzMyMjczMi03ZDhiLTQ1ZjgtODc1Yy04M2EwNGIyNzgxNjUuanBnP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MThUMTczMjUzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzZlZDY1ZDgyYmRhNmNkYzY4M2QxYTRkMTk5ZWExMjE3MjI5MWUxZjg5NjEzZDY2ZGVlYTMzNDQ4MWEwNDYwOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.RKjXwn4LLeWj619ALbtICZ21i-qcU6cv_Qr_Q6G3jZY

# tesseract test.jpg -- -l ara+chi_tra
Estimating resolution as 303
!w_it.cycled_list():Error:Assert failed:in file src/ccstruct/pageres.cpp, line 1502
Aborted (core dumped)
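
For scripted reproduction, the abort can be detected from the exit status rather than from the console output. The following is a minimal sketch, assuming a shell such as bash or dash that reports death by signal as 128 plus the signal number (SIGABRT is 6, giving 134). `describe_status` is a hypothetical helper, not part of tesseract.

```shell
#!/bin/sh
# Hypothetical helper: turn an exit status into a short description.
# Assumption: the shell encodes "killed by signal N" as 128 + N
# (true for bash and dash; a program that itself exits with a code
# above 128 would be misreported as a signal).
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

# 134 is the status expected after "Aborted (core dumped)".
describe_status 134   # prints "killed by signal 6"
```

Inside the container, running `tesseract test.jpg -- -l ara+chi_tra` and then `describe_status $?` should print `killed by signal 6` if the assert fires as shown above.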

Backtrace:

GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tesseract...

warning: could not find '.gnu_debugaltlink' file for /usr/bin/tesseract
(No debugging symbols found in tesseract)
(gdb) run sample_013741.jpg -- -l ara+chi_tra
Starting program: /usr/bin/tesseract sample_013741.jpg -- -l ara+chi_tra
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/liblber.so.2
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlidec.so.1
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlicommon.so.1
Estimating resolution as 303
[New Thread 0x71f6f001f6c0 (LWP 4281)]
[New Thread 0x71f6ef81e6c0 (LWP 4282)]
[New Thread 0x71f6ef01d6c0 (LWP 4283)]
!w_it.cycled_list():Error:Assert failed:in file src/ccstruct/pageres.cpp, line 1502

Thread 1 "tesseract" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44	./nptl/pthread_kill.c: No such file or directory
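
The session above stops at the signal without listing any stack frames, and the packaged binary has no debugging symbols. A non-interactive run that also prints a backtrace could be sketched as follows. It relies on gdb's `-batch`, `-ex`, and `--args` options; `gdb_bt_command` is a hypothetical helper that only prints the command line so it can be reviewed before running, and meaningful frame names assume a tesseract build with debugging symbols.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a gdb command line that
# runs the given program to the crash and then dumps a backtrace.
# Note: arguments are joined with spaces, so arguments that contain
# whitespace would need quoting before the printed line is reused.
gdb_bt_command() {
    printf 'gdb -batch -ex run -ex bt --args'
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Same program arguments as in the gdb session above.
gdb_bt_command tesseract sample_013741.jpg -- -l ara+chi_tra
```

Printing the command instead of executing it keeps the sketch usable on machines where gdb or tesseract is not installed; the printed line can be pasted into the container as is.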

This also occurred when using the latest package from the daily-dev PPA; its version is included below.

The test image is included here again for reference.
sample_013741

Expected Behavior

As in #4148 and #4146, the expectation is that this combination of languages and image does not cause a SIGABRT.

Suggested Fix

No known suggested fixes at this time.

tesseract -v

Current Ubuntu 24.04 tesseract-ocr package:

 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
 Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7

From the latest package currently available from the daily-dev PPA:

tesseract --version
tesseract 5.4.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
 Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7

Operating System

Ubuntu 24.04 Noble

Other Operating System

Ubuntu 24.04 docker container

uname -a

(from within the container)
Linux dc417289b0ea 6.5.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:55:15 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-linux-gnu/13/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 13.2.0-23ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-13 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-libstdcxx-backtrace --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-13-uJ7kn6/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-13-uJ7kn6/gcc-13-13.2.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 

CPU

vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz
stepping	: 1
microcode	: 0xb000040
cpu MHz		: 1637.511
cache size	: 20480 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts vnmi md_clear flush_l1d
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit mmio_stale_data
bogomips	: 3400.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

Virtualization / Containers

$ docker --version
Docker version 26.1.4, build 5650f9b

Other Information

I opened this new ticket even though it is closely related to #4146 and #4148, because the crash is entirely reproducible with the latest Ubuntu packages for both the tesseract command-line tool and the languages used. This implies that, although the image may not contain text discernible to the OCR process, it still causes a SIGABRT with a "standard configuration".

@cchadowitz

In addition, this is the version of liblept5 that is installed in a clean Ubuntu 24.04 container when installing tesseract-ocr (there is no newer version of liblept5 available from the above tesseract daily-dev ppa):

# apt info liblept5
Package: liblept5
Version: 1.82.0-3build4
Priority: optional
Section: universe/libs
Source: leptonlib
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Jeff Breidenbach <jab@debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 2726 kB
Depends: libc6 (>= 2.33), libgif7 (>= 5.1), libjpeg8 (>= 8c), libopenjp2-7 (>= 2.0.0), libpng16-16t64 (>= 1.6.2), libtiff6 (>= 4.0.3), libwebp7 (>= 1.3.2), libwebpmux3 (>= 1.3.2), zlib1g (>= 1:1.1.4)
Breaks: libleptonica (>= 1.69~)
Replaces: libleptonica (>= 1.69~)
Homepage: http://www.leptonica.org
Task: kubuntu-desktop, kubuntu-full, ubuntustudio-video, ubuntustudio-graphics, ubuntustudio-publishing
Download-Size: 1099 kB
APT-Manual-Installed: no
APT-Sources: http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages
Description: image processing library

stweil added the bug label Nov 8, 2024

stweil commented Nov 8, 2024

Stack trace with latest code when running
tesseract sample_013741.jpg - -l tessdata_fast/ara+tessdata_fast/chi_tra:

(gdb) i s
#0  tesseract::ERRCODE::error (this=0x5555559d3ce0 <tesseract::ASSERT_FAILED>, caller=0x5555558744d3 "!w_it.cycled_list()", action=tesseract::ABORT, format=0x555555873d71 "in file %s, line %d")
    at ../../../src/ccutil/errcode.cpp:78
#1  0x0000555555694e95 in tesseract::PAGE_RES_IT::DeleteCurrentWord (this=0x7fffffffd860) at ../../../src/ccstruct/pageres.cpp:1502
#2  0x00005555555f4ea2 in tesseract::Tesseract::recog_all_words (this=0x7ffff41b5010, page_res=0x555555dd3810, monitor=0x0, target_word_box=0x0, word_config=0x0, dopasses=0)
    at ../../../src/ccmain/control.cpp:446
#3  0x0000555555597cb0 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffdf90, monitor=0x0) at ../../../src/api/baseapi.cpp:833
#4  0x0000555555599285 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffdf90, pix=0x555555d99ae0, page_index=0, filename=0x7fffffffe537 "sample_013741.jpg", retry_config=0x0, timeout_millisec=0, 
    renderer=0x555555a254e0) at ../../../src/api/baseapi.cpp:1218
#5  0x0000555555599003 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffdf90, filename=0x7fffffffe537 "sample_013741.jpg", retry_config=0x0, timeout_millisec=0, renderer=0x555555a254e0)
    at ../../../src/api/baseapi.cpp:1181
#6  0x00005555555984d8 in tesseract::TessBaseAPI::ProcessPages (this=0x7fffffffdf90, filename=0x7fffffffe537 "sample_013741.jpg", retry_config=0x0, timeout_millisec=0, renderer=0x555555a254e0)
    at ../../../src/api/baseapi.cpp:998
#7  0x000055555558ab1f in main (argc=5, argv=0x7fffffffe248) at ../../../src/tesseract.cpp:868

stweil commented Nov 8, 2024

This looks like a duplicate of issue #4148, so I am closing it here.
