Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use whitelist table in gravity.db directly #600

Merged
merged 5 commits into from
Jun 29, 2019

Conversation

DL6ER
Copy link
Member

@DL6ER DL6ER commented Jun 23, 2019

By submitting this pull request, I confirm the following (please check boxes, eg [X]) Failure to fill the template will close your PR:

Please submit all pull requests against the development branch. Failure to do so will delay or deny your request

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.

How familiar are you with the codebase?:

10

Use vw_whitelist view of table whitelist when checking if a domain is contained in the whitelist instead of copying whitelisted domains on receipt of SIGHUP into FTL's memory.
This change will not only decrease FTL's memory footprint but also allow a much better scaling to very large / huge whitelists (looking into the 100k+ domains regime).


To achieve good performance, we constantly keep the database connection to gravity.db open to have a prepared (= precompiled to SQLite byte code) statement ready for querying the whitelist.

In vivo tests on a Raspberry Pi with a whitelist containing 100.000 whitelisted domains combined with the regex . (matches any character) confirms that checking the whitelist should not take longer than typically 2 milliseconds, even for very large whitelists. The fast reaction is achieved by using SELECT EXISTS(...) which is highly optimized:

SELECT EXISTS(SELECT domain from vw_whitelist WHERE domain = ?);

Note that querying strategies like

SELECT COUNT(*) FROM vw_whitelist WHERE domain = ?;

are less performant as they will almost always result in an iterating co-routine (even when the column in the WHERE clause is UNIQUE!)
In contrast, SELECT EXISTS(...) immediately returns when it finds the first occurrence of the contained SELECT statement. The domain itself will be bound at lookup time using SQLite bindings (placeholder ?).


A drawback of relying on the always-open connection to the whitelist view is that when the database is locked (e.g., pihole -g is currently storing the blocked domains), we cannot check whether a domain in contained in the whitelist or not. To avoid blocking the DNS service until gravity is eventually done, we immediately return and assume that the domain was not on the whitelist. In this case, we log like

[2019-06-23 18:41:18.419 22527] in_whitelist("badbad.de"): Database is busy, assuming domain is NOT whitelisted",

As gravity will send SIGHUP at the end, this possibly wrong cache entry will be removed soon after the query was made. On the next query of this domain, the database should not be locked any more and the query to the database can succeed.


Overall, the benefits seem to outweigh the drawbacks but this PR may serve as a discussion platform.

This PR fixes #596 (at least partially).

… domains inside FTL itself.

Signed-off-by: DL6ER <dl6er@dl6er.de>
… should prevent the database from staying in a locked state after the first query.

Signed-off-by: DL6ER <dl6er@dl6er.de>
Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER DL6ER added this to the v5.0 milestone Jun 23, 2019
@DL6ER DL6ER requested a review from a team June 23, 2019 16:53
…ase is busy.

Signed-off-by: DL6ER <dl6er@dl6er.de>
@AzureMarker
Copy link
Contributor

Is it possible to avoid locking the entire database by using WAL? https://www.sqlite.org/draft/wal.html

gravity.c Show resolved Hide resolved
gravity.c Show resolved Hide resolved
@DL6ER
Copy link
Member Author

DL6ER commented Jun 26, 2019

Is it possible to avoid locking the entire database by using WAL? https://www.sqlite.org/draft/wal.html

I looked at WAL before, however, I'm not sure whether it is a good idea mainly due to the following limitation (quoting the documentation):

All processes using a database must be on the same host computer; WAL does not work over a network filesystem.

…DB_count(). Return early in gravityDB_close() if the database is not available.

Signed-off-by: DL6ER <dl6er@dl6er.de>
@AzureMarker
Copy link
Contributor

Looking at the SQLite locking states (https://www.sqlite.org/draft/lockingv3.html), it seems that as long as FTL holds a read-only connection to the database, gravity will not be able to acquire a write lock. How does gravity acquire the write lock when FTL is holding a read lock?

@DL6ER
Copy link
Member Author

DL6ER commented Jun 26, 2019

Looking at the SQLite locking states (https://www.sqlite.org/draft/lockingv3.html), it seems that as long as FTL holds a read-only connection to the database, gravity will not be able to acquire a write lock. How does gravity acquire the write lock when FTL is holding a read lock?

The two relevant locks are SHARED and EXCLUSIVE

SHARED - The database may be read but not written. Any number of processes can hold SHARED locks at the same time, hence there can be many simultaneous readers. But no other thread or process is allowed to write to the database file while one or more SHARED locks are active.

EXCLUSIVE - An EXCLUSIVE lock is needed in order to write to the database file. Only one EXCLUSIVE lock is allowed on the file and no other locks of any kind are allowed to coexist with an EXCLUSIVE lock. In order to maximize concurrency, SQLite works to minimize the amount of time that EXCLUSIVE locks are held.

When FTL reads from the database (performs the prepared statement), it requests a SHARED lock through sqlite3_step().

There are two possible situations:

  1. We succeed in obtaining the lock. This means nobody else is currently writing to the database.
    Consequence: gravity can not write to the database while FTL holds the SHARED lock.

    FTL releases the SHARED lock once it is done by calling sqlite3_reset(). The fact that we keep the database connection alive is of no consequence here.

  2. We fail in obtaining the lock. This means gravity is currently writing to the file.
    Consequence: sqlite3_step() hits the EXCLUSIVE lock and cannot continue.

    In this case, we assume a domain was not whitelisted as this is less problematic if we'd assume the contrary.

While case 2 seems to be the best we can do, it looks like we should add something like

.timeout 5000

to the sqlite3 calls in gravity.sh to have sqlite3 wait up to 5 seconds when trying to write to the table instead of failing with "database locked". This never happened during my tests as the queries are usually not longer than 1-2 msec, however, as it can (== will at some point) happen, we should properly deal with it.

@mabrafooMSFT
Copy link

Why not use 2 sqlite databases? Or will maintaining that be a pain in the code?

@AzureMarker
Copy link
Contributor

We are already using multiple SQLite databases, but in this case the data is closely related. The core issue is SQLite's concurrency, and it will probably show up again even if we try to avoid it now.

@AzureMarker
Copy link
Contributor

My only concern before approving this PR is the behavior when the database is locked. Could we put a short timeout on it, so there is a possibility that we can avoid that edge case? Perhaps something around 100-500ms. Even if gravity takes longer, this would avoid cases where there may be small updates to the database (taking only milliseconds) but FTL wants to read the whitelist.

@DL6ER
Copy link
Member Author

DL6ER commented Jun 28, 2019

I do not feel comfortable with adding a delay so far as this might severely slow down FTL when users use massive lists (6M+) on a low end device where gravity might need up to twenty (!) minutes to store the domains. If we'd add only 100msec of possible delay this can reduce the response rate to at most 10/s.

Im really going back and forth on this, if you still find this a good compromise, you might be able to convince me to add some small delay. However, I consider 100ms to be really already at the upper limit of what should be added.

Copy link
Contributor

@AzureMarker AzureMarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The response rate issue is big enough that I'm willing to live with this compromise of assuming the domain is not on the whitelist. I can't think of any way to get around this in our current setup, as we need fast (no waiting) access to the whitelist, but SQLite will lock the whole database during the write. The only way I can think of getting around this would be to implement our own hashed whitelist so we would always have it accessible, but that is out of the question and would make this PR redundant.

@AzureMarker AzureMarker merged commit 822fb88 into development Jun 29, 2019
@AzureMarker AzureMarker deleted the new/gravity_db_use_whitelist_table_directly branch June 29, 2019 03:27
@mabrafooMSFT
Copy link

I see the commit is in the development branch. I am really grateful you fine people are working on adding support for huge whitelists. Thank you!
Is this something that you recommend trying out now or are there additional changes needed before trying it out?

@DL6ER
Copy link
Member Author

DL6ER commented Jul 2, 2019

You should wait for pi-hole/pi-hole#2803 getting merged.

This PR will change the schema of the gravity database and the test would be much smoother if you start only after we finally agreed on how this should look like.

The current development branches are preparing the v5.0 release and heavily depend on each other as much of Pi-hole's internal machinery is moved into the database. Make sure to run pihole checkout dev to move your entire installation to the dev branches once you want to do this.

Also note our usual disclaimer that the development branches bring in completely new features which are well tested by the author and also independently review by at least one other developer, however, there can still be bugs like unexpected interactions that only appear once used on real systems. We can surely give support for single users trying the development branches, however, due to the points mentioned, the development branches are only meant for those who are technically versed and first try to help themselves. We would just be overrun with support questions if we'd handle it differently.

@DL6ER DL6ER mentioned this pull request May 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants