Commit

* Upgrade presets for Tesseract 5.5.0
saudet committed Nov 12, 2024
1 parent 33994e0 commit d682f6d
Showing 11 changed files with 292 additions and 14 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -9,7 +9,7 @@
* Build FFmpeg with zimg to enable zscale filter ([pull #1481](https://github.com/bytedeco/javacpp-presets/pull/1481))
* Enable PulseAudio support for FFmpeg on Linux ([pull #1472](https://github.com/bytedeco/javacpp-presets/pull/1472))
* Virtualize `btCollisionWorld`, `btOverlapFilterCallback`, `btOverlapCallback` from Bullet Physics SDK ([pull #1475](https://github.com/bytedeco/javacpp-presets/pull/1475))
-* Upgrade presets for OpenCV 4.10.0, FFmpeg 7.1, Spinnaker 4.0.0.116 ([pull #1524](https://github.com/bytedeco/javacpp-presets/pull/1524)), MKL 2025.0, DNNL 3.6.1, OpenBLAS 0.3.28, CMINPACK 1.3.11, GSL 2.8, CPython 3.13.0, NumPy 2.1.3, SciPy 1.14.1, LLVM 19.1.3, LibRaw 0.21.2 ([pull #1520](https://github.com/bytedeco/javacpp-presets/pull/1520)), Leptonica 1.85.0, Tesseract 5.4.1, libffi 3.4.6, CUDA 12.6.2, cuDNN 9.5.1, NCCL 2.23.4, nvCOMP 4.1.0.6, OpenCL 3.0.17, NVIDIA Video Codec SDK 12.2.72, PyTorch 2.5.1 ([pull #1466](https://github.com/bytedeco/javacpp-presets/pull/1466)), SentencePiece 0.2.0, TensorFlow Lite 2.18.0, TensorRT 10.6.0.26, Triton Inference Server 2.51.0, ONNX 1.17.0, ONNX Runtime 1.20.0, TVM 0.18.0, and their dependencies
+* Upgrade presets for OpenCV 4.10.0, FFmpeg 7.1, Spinnaker 4.0.0.116 ([pull #1524](https://github.com/bytedeco/javacpp-presets/pull/1524)), MKL 2025.0, DNNL 3.6.1, OpenBLAS 0.3.28, CMINPACK 1.3.11, GSL 2.8, CPython 3.13.0, NumPy 2.1.3, SciPy 1.14.1, LLVM 19.1.3, LibRaw 0.21.2 ([pull #1520](https://github.com/bytedeco/javacpp-presets/pull/1520)), Leptonica 1.85.0, Tesseract 5.5.0, libffi 3.4.6, CUDA 12.6.2, cuDNN 9.5.1, NCCL 2.23.4, nvCOMP 4.1.0.6, OpenCL 3.0.17, NVIDIA Video Codec SDK 12.2.72, PyTorch 2.5.1 ([pull #1466](https://github.com/bytedeco/javacpp-presets/pull/1466)), SentencePiece 0.2.0, TensorFlow Lite 2.18.0, TensorRT 10.6.0.26, Triton Inference Server 2.51.0, ONNX 1.17.0, ONNX Runtime 1.20.0, TVM 0.18.0, and their dependencies

### January 29, 2024 version 1.5.10
* Introduce `macosx-arm64` builds for PyTorch ([pull #1463](https://github.com/bytedeco/javacpp-presets/pull/1463))
2 changes: 1 addition & 1 deletion README.md
@@ -213,7 +213,7 @@ Each child module in turn relies by default on the included [`cppbuild.sh` scrip
* libpostal 1.1 https://github.com/openvenues/libpostal
* LibRaw 0.21.x https://www.libraw.org/download
* Leptonica 1.85.x http://www.leptonica.org/download.html
-* Tesseract 5.4.x https://github.com/tesseract-ocr/tesseract
+* Tesseract 5.5.x https://github.com/tesseract-ocr/tesseract
* Caffe 1.0 https://github.com/BVLC/caffe
* OpenPose 1.7.0 https://github.com/CMU-Perceptual-Computing-Lab/openpose
* CUDA 12.6.x https://developer.nvidia.com/cuda-downloads
2 changes: 1 addition & 1 deletion platform/pom.xml
@@ -257,7 +257,7 @@
<dependency>
<groupId>org.bytedeco</groupId>
<artifactId>tesseract-platform</artifactId>
-<version>5.4.1-${project.version}</version>
+<version>5.5.0-${project.version}</version>
</dependency>
<!-- <dependency>-->
<!-- <groupId>org.bytedeco</groupId>-->
4 changes: 2 additions & 2 deletions tesseract/README.md
@@ -9,7 +9,7 @@ Introduction
------------
This directory contains the JavaCPP Presets module for:

-* Tesseract 5.4.1 https://github.com/tesseract-ocr
+* Tesseract 5.5.0 https://github.com/tesseract-ocr

Please refer to the parent README.md file for more detailed information about the JavaCPP Presets.

@@ -47,7 +47,7 @@ We can use [Maven 3](http://maven.apache.org/) to download and install automatic
<dependency>
<groupId>org.bytedeco</groupId>
<artifactId>tesseract-platform</artifactId>
-<version>5.4.1-1.5.11-SNAPSHOT</version>
+<version>5.5.0-1.5.11-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
4 changes: 3 additions & 1 deletion tesseract/cppbuild.sh
@@ -7,7 +7,7 @@ if [[ -z "$PLATFORM" ]]; then
exit
fi

-TESSERACT_VERSION=5.4.1
+TESSERACT_VERSION=5.5.0
download https://github.com/tesseract-ocr/tesseract/archive/$TESSERACT_VERSION.tar.gz tesseract-$TESSERACT_VERSION.tar.gz

mkdir -p $PLATFORM
@@ -48,6 +48,8 @@ export PKG_CONFIG_PATH=$INSTALL_PATH/lib/pkgconfig/

CMAKE_CONFIG="-DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=$LEPTONICA_PATH -DCMAKE_INSTALL_PREFIX=$INSTALL_PATH -DCMAKE_INSTALL_LIBDIR=$INSTALL_PATH/lib -DDISABLE_ARCHIVE=ON -DDISABLE_CURL=ON -DMARCH_NATIVE_OPT=OFF -DOPENMP_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DBUILD_TRAINING_TOOLS=OFF -DLEPT_TIFF_RESULT=1"

+patch -RNp1 < ../../../tesseract.patch

case $PLATFORM in
android-arm)
patch -Np1 < ../../../tesseract-android.patch
2 changes: 1 addition & 1 deletion tesseract/platform/pom.xml
@@ -12,7 +12,7 @@

<groupId>org.bytedeco</groupId>
<artifactId>tesseract-platform</artifactId>
-<version>5.4.1-${project.parent.version}</version>
+<version>5.5.0-${project.parent.version}</version>
<name>JavaCPP Presets Platform for Tesseract</name>

<properties>
2 changes: 1 addition & 1 deletion tesseract/pom.xml
@@ -11,7 +11,7 @@

<groupId>org.bytedeco</groupId>
<artifactId>tesseract</artifactId>
-<version>5.4.1-${project.parent.version}</version>
+<version>5.5.0-${project.parent.version}</version>
<name>JavaCPP Presets for Tesseract</name>

<dependencies>
2 changes: 1 addition & 1 deletion tesseract/samples/pom.xml
@@ -12,7 +12,7 @@
<dependency>
<groupId>org.bytedeco</groupId>
<artifactId>tesseract-platform</artifactId>
-<version>5.4.1-1.5.11-SNAPSHOT</version>
+<version>5.5.0-1.5.11-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
@@ -145,15 +145,15 @@ public class tesseract extends org.bytedeco.tesseract.presets.tesseract {
// clang-format off

public static final int TESSERACT_MAJOR_VERSION = 5;
-public static final int TESSERACT_MINOR_VERSION = 4;
-public static final int TESSERACT_MICRO_VERSION = 1;
+public static final int TESSERACT_MINOR_VERSION = 5;
+public static final int TESSERACT_MICRO_VERSION = 0;

public static final int TESSERACT_VERSION =
(TESSERACT_MAJOR_VERSION << 16 |
TESSERACT_MINOR_VERSION << 8 |
TESSERACT_MICRO_VERSION);

-public static final String TESSERACT_VERSION_STR = "5.4.1";
+public static final String TESSERACT_VERSION_STR = "5.5.0";

// clang-format on
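
The `TESSERACT_VERSION` constant in this hunk packs the three version components into a single integer. The following is a minimal C++ sketch of that packing, assuming the minor and micro components each fit in 8 bits. `pack_version` and `unpack_version` are hypothetical helper names used only for illustration; they are not part of Tesseract or the JavaCPP presets.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical helper: mirrors the (major << 16 | minor << 8 | micro)
// expression shown in the diff.
constexpr int pack_version(int major, int minor, int micro) {
  return (major << 16) | (minor << 8) | micro;
}

// Hypothetical helper: the inverse operation, valid only under the
// assumption that minor and micro are each below 256.
constexpr std::tuple<int, int, int> unpack_version(int packed) {
  return std::make_tuple(packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF);
}

// The old and new versions from this commit.
static_assert(pack_version(5, 4, 1) == 0x050401, "5.4.1 packs to 0x050401");
static_assert(pack_version(5, 5, 0) == 0x050500, "5.5.0 packs to 0x050500");
static_assert(pack_version(5, 5, 0) > pack_version(5, 4, 1),
              "packed values compare in version order");
```

Because the major component occupies the highest bits, comparing two packed values orders the versions correctly as long as the 8-bit assumption holds.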

@@ -42,9 +42,9 @@
@Platform(define = "TESS_CAPI_INCLUDE_BASEAPI", include = {"tesseract/export.h", /*"tesseract/osdetect.h",*/ "tesseract/unichar.h",
"tesseract/version.h", "tesseract/publictypes.h", "tesseract/pageiterator.h", "tesseract/ocrclass.h", "tesseract/ltrresultiterator.h",
"tesseract/renderer.h", "tesseract/resultiterator.h", "tesseract/baseapi.h", "tesseract/capi.h", "locale.h"},
-compiler = "cpp14", link = "tesseract@.5.4.1"/*, resource = {"include", "lib"}*/),
+compiler = "cpp14", link = "tesseract@.5.5"/*, resource = {"include", "lib"}*/),
@Platform(value = "android", link = "tesseract"),
-@Platform(value = "windows", link = "tesseract54", preload = "libtesseract54") })
+@Platform(value = "windows", link = "tesseract55", preload = "libtesseract55") })
public class tesseract implements InfoMapper {
static { Loader.checkVersion("org.bytedeco", "tesseract"); }

276 changes: 276 additions & 0 deletions tesseract/tesseract.patch
@@ -0,0 +1,276 @@
diff --git a/src/api/baseapi.cpp b/src/api/baseapi.cpp
index 72503636c0..bae30ab8bb 100644
--- a/src/api/baseapi.cpp
+++ b/src/api/baseapi.cpp
@@ -64,6 +64,7 @@
#include <cmath> // for round, M_PI
#include <cstdint> // for int32_t
#include <cstring> // for strcmp, strcpy
+#include <filesystem> // for std::filesystem
#include <fstream> // for size_t
#include <iostream> // for std::cin
#include <locale> // for std::locale::classic
@@ -82,15 +83,9 @@
#endif

#if defined(_WIN32)
-# include <fcntl.h>
-# include <io.h>
-#else
-# include <dirent.h> // for closedir, opendir, readdir, DIR, dirent
-# include <libgen.h>
-# include <sys/stat.h> // for stat, S_IFDIR
-# include <sys/types.h>
-# include <unistd.h>
-#endif // _WIN32
+# include <fcntl.h> // for _O_BINARY
+# include <io.h> // for _setmode
+#endif

namespace tesseract {

@@ -149,61 +144,18 @@ static void ExtractFontName(const char* filename, std::string* fontname) {

/* Add all available languages recursively.
*/
-static void addAvailableLanguages(const std::string &datadir, const std::string &base,
+static void addAvailableLanguages(const std::string &datadir,
std::vector<std::string> *langs) {
- auto base2 = base;
- if (!base2.empty()) {
- base2 += "/";
- }
- const size_t extlen = sizeof(kTrainedDataSuffix);
-#ifdef _WIN32
- WIN32_FIND_DATA data;
- HANDLE handle = FindFirstFile((datadir + base2 + "*").c_str(), &data);
- if (handle != INVALID_HANDLE_VALUE) {
- BOOL result = TRUE;
- for (; result;) {
- char *name = data.cFileName;
- // Skip '.', '..', and hidden files
- if (name[0] != '.') {
- if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) == FILE_ATTRIBUTE_DIRECTORY) {
- addAvailableLanguages(datadir, base2 + name, langs);
- } else {
- size_t len = strlen(name);
- if (len > extlen && name[len - extlen] == '.' &&
- strcmp(&name[len - extlen + 1], kTrainedDataSuffix) == 0) {
- name[len - extlen] = '\0';
- langs->push_back(base2 + name);
- }
- }
- }
- result = FindNextFile(handle, &data);
- }
- FindClose(handle);
- }
-#else // _WIN32
- DIR *dir = opendir((datadir + base).c_str());
- if (dir != nullptr) {
- dirent *de;
- while ((de = readdir(dir))) {
- char *name = de->d_name;
- // Skip '.', '..', and hidden files
- if (name[0] != '.') {
- struct stat st;
- if (stat((datadir + base2 + name).c_str(), &st) == 0 && (st.st_mode & S_IFDIR) == S_IFDIR) {
- addAvailableLanguages(datadir, base2 + name, langs);
- } else {
- size_t len = strlen(name);
- if (len > extlen && name[len - extlen] == '.' &&
- strcmp(&name[len - extlen + 1], kTrainedDataSuffix) == 0) {
- name[len - extlen] = '\0';
- langs->push_back(base2 + name);
- }
- }
- }
+ for (const auto& entry :
+ std::filesystem::recursive_directory_iterator(datadir,
+ std::filesystem::directory_options::follow_directory_symlink |
+ std::filesystem::directory_options::skip_permission_denied)) {
+ auto path = entry.path().lexically_relative(datadir).string();
+ auto extPos = path.rfind(".traineddata");
+ if (extPos != std::string::npos) {
+ langs->push_back(path.substr(0, extPos));
}
- closedir(dir);
}
-#endif
}

TessBaseAPI::TessBaseAPI()
@@ -444,7 +396,7 @@ void TessBaseAPI::GetLoadedLanguagesAsVector(std::vector<std::string> *langs) co
void TessBaseAPI::GetAvailableLanguagesAsVector(std::vector<std::string> *langs) const {
langs->clear();
if (tesseract_ != nullptr) {
- addAvailableLanguages(tesseract_->datadir, "", langs);
+ addAvailableLanguages(tesseract_->datadir, langs);
std::sort(langs->begin(), langs->end());
}
}
diff --git a/src/ccutil/ccutil.cpp b/src/ccutil/ccutil.cpp
index 5e93e70866..930aa2636e 100644
--- a/src/ccutil/ccutil.cpp
+++ b/src/ccutil/ccutil.cpp
@@ -10,10 +10,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.

-#if defined(_WIN32)
-# include <io.h> // for _access
-#endif
-
#include "ccutil.h"
#include "tprintf.h" // for tprintf

@@ -63,7 +59,7 @@ void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {
/* Use tessdata prefix from the environment. */
datadir = tessdata_prefix;
#if defined(_WIN32)
- } else if (datadir.empty() || _access(datadir.c_str(), 0) != 0) {
+ } else if (datadir.empty() || !std::filesystem::exists(datadir)) {
/* Look for tessdata in directory of executable. */
char path[_MAX_PATH];
DWORD length = GetModuleFileName(nullptr, path, sizeof(path));
@@ -73,7 +69,7 @@ void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {
*separator = '\0';
std::string subdir = path;
subdir += "/tessdata";
- if (_access(subdir.c_str(), 0) == 0) {
+ if (std::filesystem::exists(subdir)) {
datadir = subdir;
}
}
diff --git a/unittest/pagesegmode_test.cc b/unittest/pagesegmode_test.cc
index 9689e407e1..781e67d3f9 100644
--- a/unittest/pagesegmode_test.cc
+++ b/unittest/pagesegmode_test.cc
@@ -9,13 +9,9 @@
// See the License for the specific language governing permissions and
// limitations under the License.

-#if defined(_WIN32)
-# include <io.h> // for _access
-#else
-# include <unistd.h> // for access
-#endif
#include <allheaders.h>
#include <tesseract/baseapi.h>
+#include <filesystem>
#include <string>
#include "helpers.h"
#include "include_gunit.h"
@@ -24,15 +20,6 @@

namespace tesseract {

-// Replacement for std::filesystem::exists (C++-17)
-static bool file_exists(const char *filename) {
-#if defined(_WIN32)
- return _access(filename, 0) == 0;
-#else
- return access(filename, 0) == 0;
-#endif
-}
-
// The fixture for testing Tesseract.
class PageSegModeTest : public testing::Test {
protected:
@@ -86,7 +73,7 @@ class PageSegModeTest : public testing::Test {
// and differently to line and block mode.
TEST_F(PageSegModeTest, WordTest) {
std::string filename = file::JoinPath(TESTING_DIR, "segmodeimg.tif");
- if (!file_exists(filename.c_str())) {
+ if (!std::filesystem::exists(filename)) {
LOG(INFO) << "Skip test because of missing " << filename << '\n';
GTEST_SKIP();
} else {
diff --git a/unittest/tatweel_test.cc b/unittest/tatweel_test.cc
index d0d8f2ae6f..10e673b217 100644
--- a/unittest/tatweel_test.cc
+++ b/unittest/tatweel_test.cc
@@ -9,12 +9,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.

-#if defined(_WIN32)
-# include <io.h> // for _access
-#else
-# include <unistd.h> // for access
-#endif
-
+#include <filesystem>
#include "dawg.h"
#include "include_gunit.h"
#include "trie.h"
@@ -23,15 +18,6 @@

namespace tesseract {

-// Replacement for std::filesystem::exists (C++-17)
-static bool file_exists(const char *filename) {
-#if defined(_WIN32)
- return _access(filename, 0) == 0;
-#else
- return access(filename, 0) == 0;
-#endif
-}
-
class TatweelTest : public ::testing::Test {
protected:
void SetUp() override {
@@ -41,7 +27,7 @@ class TatweelTest : public ::testing::Test {

TatweelTest() {
std::string filename = TestDataNameToPath("ara.wordlist");
- if (file_exists(filename.c_str())) {
+ if (std::filesystem::exists(filename)) {
std::string wordlist("\u0640");
CHECK_OK(file::GetContents(filename, &wordlist, file::Defaults()));
// Put all the unicodes in the unicharset_.
@@ -77,7 +63,7 @@ TEST_F(TatweelTest, DictIgnoresTatweel) {
// This test verifies that the dictionary ignores the Tatweel character.
tesseract::Trie trie(tesseract::DAWG_TYPE_WORD, "ara", SYSTEM_DAWG_PERM, unicharset_.size(), 0);
std::string filename = TestDataNameToPath("ara.wordlist");
- if (!file_exists(filename.c_str())) {
+ if (!std::filesystem::exists(filename)) {
LOG(INFO) << "Skip test because of missing " << filename;
GTEST_SKIP();
} else {
@@ -91,7 +77,7 @@ TEST_F(TatweelTest, UnicharsetLoadKeepsTatweel) {
// This test verifies that a load of an existing unicharset keeps any
// existing tatweel for backwards compatibility.
std::string filename = TestDataNameToPath("ara.unicharset");
- if (!file_exists(filename.c_str())) {
+ if (!std::filesystem::exists(filename)) {
LOG(INFO) << "Skip test because of missing " << filename;
GTEST_SKIP();
} else {
diff --git a/src/ccutil/ccutil.cpp b/src/ccutil/ccutil.cpp
index 7cf57f2ee9..483d5bc1ee 100644
--- a/src/ccutil/ccutil.cpp
+++ b/src/ccutil/ccutil.cpp
@@ -17,7 +17,8 @@
#include "ccutil.h"

#include <cstdlib>
-#include <cstring> // for std::strrchr
+#include <cstring> // for std::strrchrA
+#include <filesystem> // for std::filesystem

namespace tesseract {

@@ -48,6 +49,12 @@ void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {

const char *tessdata_prefix = getenv("TESSDATA_PREFIX");

+ // Ignore TESSDATA_PREFIX if there is no matching filesystem entry.
+ if (tessdata_prefix != nullptr && !std::filesystem::exists(tessdata_prefix)) {
+ tprintf("Warning: TESSDATA_PREFIX %s does not exist, ignore it\n", tessdata_prefix);
+ tessdata_prefix = nullptr;
+ }
+
if (!argv0.empty()) {
/* Use tessdata prefix from the command line. */
datadir = argv0;
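
Note that `cppbuild.sh` applies this patch with `-R`, so the `std::filesystem` code on the `+` lines appears to be what the build reverts rather than what it adds. For readers unfamiliar with that API, here is a standalone C++17 sketch of a recursive `.traineddata` scan in the same spirit. It is not the patched Tesseract function: `list_trained_languages` and `language_from_relative_path` are hypothetical names, the sketch matches on the file extension instead of using `rfind`, and it reports iteration failures through `std::error_code` instead of exceptions.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: turns "script/Latin.traineddata" into "script/Latin".
// Returns nullopt for paths that do not end in ".traineddata".
std::optional<std::string> language_from_relative_path(const fs::path &rel) {
  if (rel.extension() != ".traineddata") {
    return std::nullopt;
  }
  fs::path stem = rel;
  stem.replace_extension();  // drop the ".traineddata" suffix
  return stem.generic_string();
}

// Hypothetical helper: walks datadir recursively and collects language names.
// A missing or unreadable datadir yields an empty list instead of throwing.
std::vector<std::string> list_trained_languages(const fs::path &datadir) {
  std::vector<std::string> langs;
  std::error_code ec;
  const auto options = fs::directory_options::follow_directory_symlink |
                       fs::directory_options::skip_permission_denied;
  fs::recursive_directory_iterator it(datadir, options, ec);
  const fs::recursive_directory_iterator end;
  while (!ec && it != end) {
    std::error_code entry_ec;  // per-entry errors must not stop the walk
    if (it->is_regular_file(entry_ec) && !entry_ec) {
      auto lang = language_from_relative_path(it->path().lexically_relative(datadir));
      if (lang) {
        langs.push_back(*lang);
      }
    }
    it.increment(ec);
  }
  std::sort(langs.begin(), langs.end());
  return langs;
}
```

Keeping the name extraction in a separate pure function makes that part checkable without touching the filesystem. The extension test is stricter than the substring search in the patch, which would also accept names that merely contain `.traineddata`.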
