Improving data source synchronization performance #1186

Closed · Jylhis opened this issue Jun 7, 2021 · 19 comments
Jylhis (Contributor) commented Jun 7, 2021

The member synchronization process is currently quite slow for our use case, and I would like to improve it.

Expected Behavior

Synchronization should be as fast as possible, with a maximum wait time of preferably one hour in the worst-case scenario.

Current Behavior

Currently, synchronizing 100 000 members (60k adds and 40k updates) takes ~10 min.

Possible Solution

Running a profiler on the sync process shows that the majority of the sync time is spent executing database queries. The sync process seems to issue 4 queries per member for updates and 7 queries per member for removals and additions.

I pushed a draft PR (#1184) where I moved a couple of queries out of loops, did some of the checks in code instead, and grouped some of the queries into transactions. Here are the benchmarks I ran with the changes (a sketch of the approach follows the tables below):

add 20 000, remove 20 000, update 20 000 members:

                              Before    After
Time (sec)                    263       67.2
Calls to do_prepared_query    240 032   120 032

add 60 000, update 40 000 members:

                              Before    After
Time (sec)                    569       141
Calls to do_prepared_query    580 032   260 032
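
To illustrate the approach, here is a minimal sketch, not the actual PR code: the SELECT is issued once outside the loop so that per-member existence checks happen in code, and the per-member UPDATEs run inside a single transaction. Column names follow Sympa's subscriber_table conventions; the connection parameters, list name, and domain are placeholder assumptions.

    use strict;
    use warnings;
    use DBI;

    # Hypothetical stand-ins for the real list context.
    my ($list_name, $robot) = ('testlist', 'example.org');

    my $dbh = DBI->connect('dbi:Pg:dbname=sympa', undef, undef,
        {AutoCommit => 1, RaiseError => 1});

    # One query up front: load existing members into a hash so the
    # per-member existence checks are done in code, not in the database.
    my %users;
    my $sth = $dbh->prepare(
        q{SELECT user_subscriber, inclusion_subscriber
          FROM subscriber_table
          WHERE list_subscriber = ? AND robot_subscriber = ?});
    $sth->execute($list_name, $robot);
    while (my ($email, $inclusion) = $sth->fetchrow_array) {
        $users{$email} = $inclusion;
    }

    # One transaction around all per-member UPDATEs instead of
    # autocommitting each statement separately.
    $dbh->begin_work;
    eval {
        my $upd = $dbh->prepare(
            q{UPDATE subscriber_table SET inclusion_subscriber = ?
              WHERE user_subscriber = ?
                AND list_subscriber = ? AND robot_subscriber = ?});
        $upd->execute(time, $_, $list_name, $robot) for keys %users;
        $dbh->commit;
    };
    if ($@) {
        $dbh->rollback;    # abort the whole batch on unrecoverable error
        die "sync aborted: $@";
    }

Loading the members once accounts for the roughly halved do_prepared_query counts in the tables above; the transaction additionally removes a per-statement commit round-trip.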

Context

We need to synchronize around 4 million memberships in total across all lists, and we would like to keep our list memberships in sync with our data source as closely as possible while minimizing the wait time for synchronization.

ikedas (Member) commented Jun 8, 2021

Just a thought...

  • In your PR:

    my $user_query;
    $user_query = $sdm->do_prepared_query(
        qq{SELECT user_$t, inclusion_$t, inclusion_ext_$t
           FROM ${t}_table
           WHERE list_$t = ? AND robot_$t = ?$r},
        $list->{'name'}, $list->{'domain'});
    my %users = ();
    while (my @row = $user_query->fetchrow_array) {
        my $user_email         = $row[0];
        my $user_inclusion     = $row[1];
        my $user_inclusion_ext = $row[2];
        $users{$user_email} = [$user_inclusion, $user_inclusion_ext];
    }

    How much memory will %users actually use? Is the amount acceptable to us? (In the past, the code also loaded all the users in a list into memory, which resulted in a sudden increase in memory usage.)

  • To reduce the number of queries, using transactions makes sense. However, are there problems with transaction interference? (Synchronization can be kicked off by multiple processes at the same time.)

Jylhis (Contributor, Author) commented Jun 8, 2021

I hadn't measured the memory usage before, but I ran the following command before and after the changes with 100k members (updates only):

valgrind sympa.pl -d --sync_include=testlist

Before:

total heap usage: 16,167,047 allocs, 15,425,081 frees, 1,718,573,828 bytes allocated

After:

total heap usage: 10,976,807 allocs, 10,230,023 frees, 837,249,270 bytes allocated

It would seem that the change lowers at least the amount of memory allocated, though I'm not sure whether bytes allocated is the stat that matters in this case. I ran my test in a virtual machine with 1 CPU and 512 MB of memory, and just looking at the process (without Valgrind) in htop I get the following numbers:

Before:

VIRT RES SHR CPU% MEM%
362M 82816 7480 22.0 16.6

After:

VIRT RES SHR CPU% MEM%
397M 110M 2576 100. 22.7

So memory utilization is a little higher, and CPU utilization is much higher, since the sync process no longer waits for each request to finish and does some of the checks in code.

Regarding the problems with transaction interference: I triggered two synchronizations in parallel from the command line and both finished successfully, but I'm not familiar enough with the Sympa codebase to know whether there are other possible failure cases. At least PostgreSQL does not seem to support parallel transactions on the same connection, but with each transaction in its own separate connection it should work. I used PostgreSQL in my tests.
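
As a small illustration of that last point (connection settings are placeholders): two handles, one per sync process, can hold independent transactions, which a single shared handle could not.

    use strict;
    use warnings;
    use DBI;

    # Each sync process opens its own connection; PostgreSQL then runs
    # their transactions independently of each other.
    my $dbh_a = DBI->connect('dbi:Pg:dbname=sympa', undef, undef,
        {AutoCommit => 1, RaiseError => 1});
    my $dbh_b = DBI->connect('dbi:Pg:dbname=sympa', undef, undef,
        {AutoCommit => 1, RaiseError => 1});

    $dbh_a->begin_work;    # transaction on connection A
    $dbh_b->begin_work;    # independent transaction on connection B

    # ... per-member updates on each handle ...

    $dbh_a->commit;
    $dbh_b->commit;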

ikedas (Member) commented Jun 9, 2021

What architecture are you using? For example, the x86-64 architecture consumes more memory than 32-bit x86.
Could you find out the peak value of the virtual memory usage? (If no swap-out occurred,) I think the peak value is what matters to system administrators.

Jylhis (Contributor, Author) commented Jun 9, 2021

I'm using x86-64. Here are some memory stats from /proc/{id}/status for a run with 30k members (10k add, 10k remove, 10k update).

          Before (kB)   After (kB)
VmPeak    368084        386616
VmSize    367824        386616
VmHWM     79660         98156
VmRSS     79660         98156
VmData    71260         90052
VmPTE     628           668

ikedas (Member) commented Jun 10, 2021

Thank you for the information! At least the current results don't show a significant increase in memory usage.

If possible, could you please also check the results with task_manager.pl? It repeatedly executes the sync_include task (i.e. memory allocation/deallocation) on multiple lists.

Jylhis (Contributor, Author) commented Jun 10, 2021

I got measurements from task_manager.pl. I ran it for approximately the same amount of time, and it synchronized 2 lists (10k add, 10k delete). I accidentally ran it with only 20k members, but I think the results are still comparable.

          Before (kB)   After (kB)
VmPeak    364340        384376
VmSize    364336        384376
VmHWM     77936         96092
VmRSS     77936         96092
VmData    71900         89876

And from Valgrind

Before:

total heap usage: 11,470,155 allocs, 10,749,443 frees, 2,150,722,477 bytes allocated

After:

total heap usage: 8,996,962 allocs, 8,274,111 frees, 2,015,373,534 bytes allocated

ikedas (Member) commented Jun 11, 2021

Thank you! Things look good so far. Later I'll check memory usage and performance with my VM, too.

ikedas (Member) commented Jun 11, 2021

@Jylhis, a few things about your PR are a little unclear to me.

First:

The current code works as follows:

  • I. Start.

  • II. Include new entries.

    for each entry given by a data source:

    1. If the role of the data source is 'member' and the user is excluded:
       Do nothing.

    2. If the user has already been updated by the other data sources:
       Keep the user.

    3. If the user (has not been updated by the other data sources and) exists:
       UPDATE inclusion.

    4. Otherwise, i.e. a new user:
       INSERT new user.

  • III. Expire outdated entries.

  • IV. Update custom attributes.

  • V. Finish.

As far as I understood, your changes will make it:

  • I. Start.

  • II. Include new entries.

    Begin transaction.

    for each entry given by a data source:

    1. If the role of the data source is 'member' and the user is excluded:
       Do nothing.

    2. If the user has already been updated by the other data sources:
       Keep the user.

    3. If the user (has not been updated by the other data sources and) exists:
       UPDATE inclusion.

       However, roll back the transaction and abort in case of an unrecoverable error.

    Commit transaction.

    4. Otherwise, i.e. a new user:
       INSERT new user.

  • III. Expire outdated entries.

  • IV. Update custom attributes.

  • V. Finish.

My question is: May we leave 4. (INSERT new user) and III. (Expire outdated entries) out of the transaction?

Second:

Is the magic with AutoCommit useful? If SQLite's poor support for transactions is the problem, why not give up on using transactions with SQLite?

Jylhis (Contributor, Author) commented Jun 11, 2021

That process looks correct. delete_list_member and add_list_member also use transactions with rollbacks in case of error, so for the whole process there will be 3 transactions (update, insert & delete).

For SQLite, it's really just a workaround for a workaround. The problem is that the SQLite do_prepared_query and do_query already use a transaction for each query to do this locking:

    # Note:
    # To prevent "database is locked" error, acquire "immediate" lock
    # by each query. Most queries excluding "SELECT" need to lock in this
    # manner.

So if we start a transaction before calling these functions, they will fail to acquire the lock and return undef.
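
A standalone illustration of the conflict (not Sympa code; the database file name is a placeholder): once an outer transaction is open on a SQLite handle, the per-query BEGIN IMMEDIATE fails, which is why the query function ends up returning undef.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=test.db', '', '',
        {AutoCommit => 1, RaiseError => 0, PrintError => 0});

    $dbh->do('BEGIN');    # outer transaction, as the new sync code would start

    # The per-query "immediate" lock that the SQLite driver takes:
    my $rc = $dbh->do('BEGIN IMMEDIATE');

    # $rc is undef: SQLite cannot start a transaction within a transaction.
    print defined $rc ? "locked\n" : 'failed: ' . $dbh->errstr . "\n";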

The AutoCommit check before commit is just to suppress warning messages saying that commit has no effect when AutoCommit is enabled:

sympa/src/lib/Sympa/List.pm, lines 1605 to 1611 in aa1b3ad:

    unless ($sdm->__dbh->{AutoCommit}) {
        my $rc = $sdm->__dbh->commit;
        unless ($rc) {
            $log->syslog('err', 'Error at delete member commit: %s',
                $sdm->errstr);
            $sdm->__dbh->rollback;
        }
    }

ikedas (Member) commented Jun 11, 2021

> That process looks correct. delete_list_member and add_list_member also use transactions with rollbacks in case of error, so for the whole process there will be 3 transactions (update, insert & delete).

OK. I understood!

> For SQLite, it's really just a workaround for a workaround. The problem is that the SQLite do_prepared_query and do_query already use a transaction for each query to do this locking:
>
>     # Note:
>     # To prevent "database is locked" error, acquire "immediate" lock
>     # by each query. Most queries excluding "SELECT" need to lock in this
>     # manner.
>
> So if we start a transaction before calling these functions, they will fail to acquire the lock and return undef.
>
> The AutoCommit check before commit is just to suppress warning messages saying that commit has no effect when AutoCommit is enabled:
>
> (sympa/src/lib/Sympa/List.pm, lines 1605 to 1611, as quoted above)

I added that first workaround for SQLite: without it, the database would occasionally become locked and couldn't be recovered without manually recreating the database file. SQLite's implementation of exclusive locking is inadequate.
AutoCommit is enabled by default (as far as I remember) at least with the MySQL/MariaDB and PostgreSQL drivers.

That's why I personally thought that, to keep things simple, we'd be better off not supporting transactions with SQLite. I'll think about it some more.

ikedas (Member) commented Jun 13, 2021

According to the manual, SQLite transactions via BEGIN...COMMIT cannot be nested. I wrote methods wrapping the DBI methods to deal with this. If possible, can you check ikedas/sympa@4e3446c in your spare time?
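
For reference, the gist of such a wrapper as a minimal sketch (the package and method names here are hypothetical, not the ones in that commit): track the nesting depth so that only the outermost begin/commit pair reaches the database.

    package NestedTx;    # hypothetical name

    sub new {
        my ($class, $dbh) = @_;
        return bless {dbh => $dbh, depth => 0}, $class;
    }

    # Only the outermost begin issues a real BEGIN.
    sub begin {
        my $self = shift;
        $self->{dbh}->begin_work unless $self->{depth}++;
    }

    # Only the outermost commit actually commits.
    sub commit {
        my $self = shift;
        $self->{dbh}->commit unless --$self->{depth};
    }

    # Rollback aborts the whole flattened transaction at any depth.
    sub rollback {
        my $self = shift;
        $self->{dbh}->rollback;
        $self->{depth} = 0;
    }

    1;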

Jylhis (Contributor, Author) commented Jun 14, 2021

I checked your commit, and it is a better solution than my workaround. How do you want to handle merging it? Do you want to merge my changes into your branch, or should I cherry-pick your commit into my branch?

ikedas (Member) commented Jun 15, 2021

Please feel free to do it whichever way is comfortable for you, either merging or cherry-picking!

Jylhis (Contributor, Author) commented Jun 15, 2021

I cherry-picked your commit and pushed the changes to my branch. The latest version should now be in #1184.

ikedas (Member) commented Jun 30, 2021

@Jylhis, I checked your code and found a few things to be corrected. That's all I can point out. When they are fixed, I'll add some refactoring and cosmetic changes and merge it into the base branch.

ikedas added this to the 6.2.66 milestone Jun 30, 2021
Jylhis (Contributor, Author) commented Jun 30, 2021

Thanks! I will look into those fixes.

ikedas (Member) commented Jul 16, 2021

Hi, I pushed a PR against your PR. If there aren't any problems, please add it to your PR.

The most important change is commit 0b19264: the persistent connection (automatic reconnection) feature should be disabled during a transaction. Otherwise, if a transaction is aborted, e.g. due to a temporary server connection failure, the connection is automatically recovered and subsequent updates will not be rolled back.
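
A sketch of the failure mode being guarded against (the helper below is hypothetical, not the actual code in 0b19264): a ping-and-reconnect wrapper must refuse to reconnect while a transaction is open, because the fresh connection comes up in AutoCommit mode and everything written through it afterwards can no longer be rolled back.

    # Hypothetical illustration of the guard, not Sympa's actual method.
    sub do_query_guarded {
        my ($self, $sql, @bind) = @_;

        unless ($self->{dbh} and $self->{dbh}->ping) {
            if ($self->{in_transaction}) {
                # Reconnecting here would silently drop the open transaction
                # on the server; later statements would autocommit and could
                # never be rolled back.
                die 'connection lost during transaction; aborting sync';
            }
            $self->{dbh} = $self->_connect;    # safe outside a transaction
        }
        return $self->{dbh}->do($sql, undef, @bind);
    }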

Jylhis (Contributor, Author) commented Jul 21, 2021

PR merged 👍🏻

ikedas added a commit that referenced this issue Jul 26, 2021: Data source synchronization performance improvement (#1186)
ikedas (Member) commented Jul 26, 2021

@Jylhis, your PR has been merged now. Thank you for the valuable improvement!

Your changes will be included in the next beta, to be released this week. Please let us know if you notice anything.

ikedas added a commit that referenced this issue May 23, 2022: Revert changes for #1186 "Improving data source synchronization performance"