Add a PROJ_DB_FAST_BUILD=ON/OFF CMake option (default OFF) #4279

Closed
rouault wants to merge 1 commit

@rouault (Member) commented Oct 16, 2024

@rouault added the backport 9.5 (Backport to 9.5 branch) label Oct 16, 2024

The "trigger" for this (pun intended) is that most of the time spent
building the GDAL Docker image when cross-building to arm64 goes into
building proj.db (close to 7.5h for an Ubuntu 24.04 arm64 target!).
Setting this new option should cut that to a few minutes.

```

.. option:: PROJ_DB_FAST_BUILD=OFF

    .. versionadded:: 9.5.1

    By default, creation of :file:`proj.db` involves inserting consistency check
    triggers before inserting data records, to be able to catch potential
    inconsistencies. Such checks are useful for core PROJ developers when they
    update the database content, or for advanced PROJ users that customize the
    content of the database. However, those checks come with a non-negligible cost.
    On modern hardware, building :file:`proj.db` with those checks enabled takes
    about 50 to 60 seconds (and in scenarios where PROJ is built for other
    architectures with full emulation, several hours). When this option is set
    to ON, those triggers are inserted after data records, which decreases the
    build time to about 3 seconds.
    In short, setting this option to ON is safe if you do not customize
    the .sql files used to build :file:`proj.db` yourself.
```
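
To illustrate why the order matters, here is a minimal Python sketch using the standard `sqlite3` module. The `unit` table, the `unit_check` trigger and the sample rows are made up for the illustration and are not taken from the PROJ .sql files:

```python
import sqlite3

SCHEMA = "CREATE TABLE unit(code TEXT PRIMARY KEY, conv_factor REAL);"

# Hypothetical consistency check: reject non-positive conversion factors.
TRIGGER = """
CREATE TRIGGER unit_check BEFORE INSERT ON unit
FOR EACH ROW BEGIN
    SELECT RAISE(ABORT, 'conv_factor must be > 0')
    WHERE NEW.conv_factor <= 0;
END;
"""

# The second row is deliberately inconsistent.
ROWS = [("metre", 1.0), ("bad_unit", -1.0)]


def build(triggers_first: bool) -> str:
    """Build an in-memory database and report whether the rows were accepted."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    if triggers_first:
        # "Slow" order: checks are active while data is inserted.
        con.executescript(TRIGGER)
    try:
        con.executemany("INSERT INTO unit VALUES (?, ?)", ROWS)
    except sqlite3.DatabaseError as exc:
        return f"rejected: {exc}"
    if not triggers_first:
        # "Fast" order: checks are created only after the data is in place.
        con.executescript(TRIGGER)
    count = con.execute("SELECT COUNT(*) FROM unit").fetchone()[0]
    return f"accepted {count} rows"


print(build(triggers_first=True))
print(build(triggers_first=False))
```

With the trigger created first, the invalid row aborts the insert. When the trigger is created afterwards, both rows go in unchecked, which is why the fast order is only safe for unmodified .sql files.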

Timings on my machine:

- before this PR:

```
$ time make generate_proj_db
[100%] Generating proj.db
[100%] Built target generate_proj_db

real	0m54,752s
user	0m53,968s
sys	0m0,648s

$ md5sum data/proj.db
beecdc018b4a5131229709b3c7747036  data/proj.db

$ echo ".dump" | sqlite3 data/proj.db | md5sum
64e446efdc5c18e398cc7b6b2e4b3086  -
```

- with this PR, not setting PROJ_DB_FAST_BUILD (so OFF):

Same as above

- with this PR, setting PROJ_DB_FAST_BUILD=ON:

```
$ cmake .. -DPROJ_DB_FAST_BUILD=ON

$ time make generate_proj_db
[100%] Generating proj.db
[100%] Built target generate_proj_db

real	0m3,243s
user	0m2,876s
sys	0m0,204s

$ md5sum data/proj.db
1955dfdc3f7abada3890bf9b7592770a  data/proj.db

$ echo ".dump" | sqlite3 data/proj.db | md5sum
64e446efdc5c18e398cc7b6b2e4b3086  -
```

Note that the binary content of proj.db is not exactly the same, but
the result of dumping it to SQL is identical. The reason for the slight
difference is that with PROJ_DB_FAST_BUILD=ON we also skip creating a
fake table and trigger, which influences the "schema version number" of
the SQLite3 database. That difference is not significant.
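
The same kind of comparison can be sketched in Python. This is only an illustration: it uses `sqlite3.Connection.iterdump()`, whose output is not byte-identical to the `.dump` command of the `sqlite3` CLI, and the table, trigger and rows are placeholders rather than PROJ content:

```python
import hashlib
import sqlite3

TABLE = "CREATE TABLE t(id INTEGER PRIMARY KEY, name TEXT NOT NULL);"
TRIGGER = (
    "CREATE TRIGGER t_check BEFORE INSERT ON t FOR EACH ROW BEGIN "
    "SELECT RAISE(ABORT, 'empty name') WHERE NEW.name = ''; END;"
)
INSERTS = "INSERT INTO t VALUES (1, 'a'); INSERT INTO t VALUES (2, 'b');"


def dump_md5(script: str) -> str:
    """Run the script in a fresh in-memory database and hash its SQL dump."""
    con = sqlite3.connect(":memory:")
    con.executescript(script)
    dump = "\n".join(con.iterdump())
    return hashlib.md5(dump.encode("utf-8")).hexdigest()


# Trigger created before the data ("slow") vs. after the data ("fast").
slow = dump_md5(TABLE + TRIGGER + INSERTS)
fast = dump_md5(TABLE + INSERTS + TRIGGER)
print(slow == fast)
```

Comparing dumps rather than raw files ignores bookkeeping bytes in the file header and looks only at the logical content.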

See the diff of the ``od -x`` output, which shows that only a few bytes
in the SQLite3 header are different.

```
$ diff -u proj.db.slow.txt proj.db.fast.txt
--- proj.db.slow.txt	2024-10-16 08:50:07.211601573 +0200
+++ proj.db.fast.txt	2024-10-16 08:50:16.155615860 +0200
@@ -1,9 +1,9 @@
 0000000 5153 694c 6574 6620 726f 616d 2074 0033
-0000020 0010 0101 4000 2020 0000 1100 0000 d208
-0000040 0000 0000 0000 0000 0000 6700 0000 0400
+0000020 0010 0101 4000 2020 0000 2500 0000 d208
+0000040 0000 0000 0000 0000 0000 6300 0000 0400
 0000060 0000 0000 0000 0000 0000 0100 0000 0000
 0000100 0000 0000 0000 0000 0000 0000 0000 0000
-0000120 0000 0000 0000 0000 0000 0000 0000 1100
+0000120 0000 0000 0000 0000 0000 0000 0000 2500
 0000140 2e00 d93f 0005 0000 0f1a 007e 0000 d208
 0000160 fb0f f60f f10f ec0f e70f e20f dd0f d80f
 0000200 d30f ce0f c90f c40f bf0f ba0f b50f b00f
```
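
The differing bytes can be mapped to named header fields. Below is a minimal Python sketch, assuming the 100-byte header layout described in the SQLite file format documentation (big-endian 32-bit file change counter at offset 24, schema cookie at offset 40, version-valid-for number at offset 92). The function names and file paths are illustrative and not part of PROJ:

```python
import struct

# Offsets of big-endian 32-bit fields in the 100-byte SQLite 3 header.
HEADER_FIELDS = {
    "file_change_counter": 24,
    "schema_cookie": 40,
    "version_valid_for": 92,
}


def read_header_fields(header: bytes) -> dict:
    """Decode the selected fields from the first 100 bytes of a database."""
    if len(header) < 100 or not header.startswith(b"SQLite format 3\x00"):
        raise ValueError("not a SQLite 3 database header")
    return {
        name: struct.unpack_from(">I", header, offset)[0]
        for name, offset in HEADER_FIELDS.items()
    }


def diff_headers(a: bytes, b: bytes) -> dict:
    """Return {field: (value_in_a, value_in_b)} for the fields that differ."""
    fields_a, fields_b = read_header_fields(a), read_header_fields(b)
    return {
        name: (fields_a[name], fields_b[name])
        for name in fields_a
        if fields_a[name] != fields_b[name]
    }


def load_header(path: str) -> bytes:
    """Read the 100-byte header of a database file."""
    with open(path, "rb") as f:
        return f.read(100)


# Example with placeholder paths:
# print(diff_headers(load_header("proj.db.slow"), load_header("proj.db.fast")))
```

Reading the `od -x` words above as little-endian 16-bit values, the changed bytes sit at offsets 24-27 (0x11 vs 0x25), 40-43 (0x67 vs 0x63) and 92-95 (0x11 vs 0x25), which correspond to these three fields.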

@hobu (Contributor) commented Oct 16, 2024

Why would we add an option called PROJ_DB_FAST_BUILD? What is the reason for having SLOW?

@rouault (Member, Author) commented Oct 16, 2024

> What is the reason for having SLOW?

As explained in the doc ;-) "Such checks are useful for core PROJ developers when they update the database content, or for advanced PROJ users that customize the content of the database."

We could change the default, but that would mean that when integrating a new EPSG / ESRI / whatever release, we must remember to do the build and a test run at least once with the checks enabled. That said, it could also be the job of a CI configuration to have that turned on.

@hobu (Contributor) commented Oct 16, 2024

My point was to ask why we should make this an option that users would have to make a decision about. I wonder if the behavior should be:

  • default to FAST
  • add PROJ_DB_EXTRA_VALIDATION=ON for the SLOW mode

@jjimenezshaw (Contributor) commented

> "Such checks are useful for core PROJ developers when they update the database content, or for advanced PROJ users that customize the content of the database"

I think I have been both things already, and those checkers saved my life in both cases.

@rouault (Member, Author) commented Oct 16, 2024

What could potentially be done is to check the md5sum of the concatenated all.sql.in file against a reference value. If it matches, we use the fast way. If it doesn't match, we run once with the slow checks, and once proj.db has successfully built with them, we output the new md5sum so the maintainer can update it in data/CMakeLists.txt. That way we would have the best of both worlds.
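
A rough Python sketch of that decision logic, for illustration only: the function name, return values and message are hypothetical, and this is not the CMake code from the follow-up PR.

```python
import hashlib


def choose_build_mode(all_sql: bytes, reference_md5: str) -> tuple[str, str]:
    """Pick the build mode from the checksum of the concatenated SQL input.

    Returns (mode, actual_md5), where mode is "fast" when the content matches
    the recorded reference checksum and "slow" otherwise.
    """
    actual_md5 = hashlib.md5(all_sql).hexdigest()
    mode = "fast" if actual_md5 == reference_md5.lower() else "slow"
    return mode, actual_md5


# Example with placeholder content standing in for all.sql.in.
content = b"CREATE TABLE example(x);"
reference = hashlib.md5(content).hexdigest()

mode, md5 = choose_build_mode(content + b" -- local edit", reference)
if mode == "slow":
    # After a successful checked build, report the checksum to record.
    print(f"content changed; new md5sum to record: {md5}")
```

Unmodified inputs take the fast path, while any edit to the SQL files automatically falls back to the checked build.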

@rouault (Member, Author) commented Oct 16, 2024

> That way we would have the best of both worlds.

Just did that. Works just fine. Closing this PR as superseded by #4280.

@rouault closed this Oct 16, 2024