Improving data source synchronization performance #1186

Closed · Jylhis opened this issue Jun 7, 2021 · 19 comments
Jylhis (Contributor) commented Jun 7, 2021

The member synchronization process is currently quite slow for our use case, and I would like to improve it.

Expected Behavior

Synchronization should be as fast as possible, with a maximum wait time of preferably one hour in the worst-case scenario.

Current Behavior

Currently, synchronizing 100 000 members (60k adds and 40k updates) takes ~10 min.

Possible Solution

Running a profiler on the sync process shows that the majority of the sync time is spent executing database queries. The sync process seems to issue 4 queries per member for updates and 7 queries per member for removals and additions.

I pushed a draft PR (#1184) where I moved a couple of queries out of loops, did some of the checks in code instead, and grouped some of the queries into transactions. Here are the benchmarks I ran with the changes (a sketch of the approach follows the tables below):

add 20 000, remove 20 000, update 20 000 members:

                              Before    After
Time (sec)                    263       67.2
Calls to do_prepared_query    240 032   120 032

add 60 000, update 40 000 members:

                              Before    After
Time (sec)                    569       141
Calls to do_prepared_query    580 032   260 032
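
To illustrate the approach, here is a minimal sketch, not the actual PR code: the SELECT is issued once outside the loop so that per-member existence checks happen in code, and the per-member UPDATEs run inside a single transaction. Column names follow Sympa's subscriber_table conventions; the connection parameters, list name, and domain are placeholder assumptions.

    use strict;
    use warnings;
    use DBI;

    # Hypothetical stand-ins for the real list context.
    my ($list_name, $robot) = ('testlist', 'example.org');

    my $dbh = DBI->connect('dbi:Pg:dbname=sympa', undef, undef,
        {AutoCommit => 1, RaiseError => 1});

    # One query up front: load existing members into a hash so the
    # per-member existence checks are done in code, not in the database.
    my %users;
    my $sth = $dbh->prepare(
        q{SELECT user_subscriber, inclusion_subscriber
          FROM subscriber_table
          WHERE list_subscriber = ? AND robot_subscriber = ?});
    $sth->execute($list_name, $robot);
    while (my ($email, $inclusion) = $sth->fetchrow_array) {
        $users{$email} = $inclusion;
    }

    # One transaction around all per-member UPDATEs instead of
    # autocommitting each statement separately.
    $dbh->begin_work;
    eval {
        my $upd = $dbh->prepare(
            q{UPDATE subscriber_table SET inclusion_subscriber = ?
              WHERE user_subscriber = ?
                AND list_subscriber = ? AND robot_subscriber = ?});
        $upd->execute(time, $_, $list_name, $robot) for keys %users;
        $dbh->commit;
    };
    if ($@) {
        $dbh->rollback;    # abort the whole batch on unrecoverable error
        die "sync aborted: $@";
    }

Loading the members once accounts for the roughly halved do_prepared_query counts in the tables above; the transaction additionally removes a per-statement commit round-trip.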

Context

We need to synchronize around 4 million memberships in total across all lists, and we would like to keep our list memberships in sync with our data source as closely as possible while minimizing the wait time for synchronization.

ikedas (Member) commented Jun 8, 2021

Just a thought...

  • In your PR:

    my $user_query;
    $user_query = $sdm->do_prepared_query(
        qq{SELECT user_$t, inclusion_$t, inclusion_ext_$t
           FROM ${t}_table
           WHERE list_$t = ? AND robot_$t = ?$r},
        $list->{'name'}, $list->{'domain'});
    my %users = ();
    while (my @row = $user_query->fetchrow_array) {
        my $user_email         = $row[0];
        my $user_inclusion     = $row[1];
        my $user_inclusion_ext = $row[2];
        $users{$user_email} = [$user_inclusion, $user_inclusion_ext];
    }

    How much memory will %users actually use? Is the amount acceptable to us? (In the past, the code also loaded all the users in a list into memory, which resulted in a sudden increase in memory usage.)

  • To reduce the number of queries, using transactions makes sense. However, are there problems with transaction interference? (Synchronization can be kicked off by multiple processes at the same time.)

Jylhis (Contributor, Author) commented Jun 8, 2021

I hadn't measured the memory usage before, but I ran the following command before and after the changes with 100k members (updates only):

valgrind sympa.pl -d --sync_include=testlist

Before:

total heap usage: 16,167,047 allocs, 15,425,081 frees, 1,718,573,828 bytes allocated

After:

total heap usage: 10,976,807 allocs, 10,230,023 frees, 837,249,270 bytes allocated

It would seem that the change lowers at least the amount of memory allocated, though I'm not sure whether bytes allocated is the stat that matters in this case. I ran my test in a virtual machine with 1 CPU and 512 MB of memory, and just looking at the process (without Valgrind) in htop I get the following numbers:

Before:

VIRT RES SHR CPU% MEM%
362M 82816 7480 22.0 16.6

After:

VIRT RES SHR CPU% MEM%
397M 110M 2576 100. 22.7

So memory utilization is a little higher, and CPU utilization is much higher, since the sync process no longer waits for each request to finish and does some of the checks in code.

Regarding the problems with transaction interference: I triggered two synchronizations in parallel from the command line and both finished successfully, but I'm not familiar enough with the Sympa codebase to know whether there are other possible failure cases. At least PostgreSQL does not seem to support parallel transactions on the same connection, but with each transaction in its own separate connection it should work. I used PostgreSQL in my tests.
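
As a small illustration of that last point (connection settings are placeholders): two handles, one per sync process, can hold independent transactions, which a single shared handle could not.

    use strict;
    use warnings;
    use DBI;

    # Each sync process opens its own connection; PostgreSQL then runs
    # their transactions independently of each other.
    my $dbh_a = DBI->connect('dbi:Pg:dbname=sympa', undef, undef,
        {AutoCommit => 1, RaiseError => 1});
    my $dbh_b = DBI->connect('dbi:Pg:dbname=sympa', undef, undef,
        {AutoCommit => 1, RaiseError => 1});

    $dbh_a->begin_work;    # transaction on connection A
    $dbh_b->begin_work;    # independent transaction on connection B

    # ... per-member updates on each handle ...

    $dbh_a->commit;
    $dbh_b->commit;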

ikedas (Member) commented Jun 9, 2021

What architecture are you using? For example, the x86-64 architecture consumes more memory than 32-bit x86.
Could you find out the peak value of the virtual memory usage? (If no swap-out occurred,) I think the peak value is what matters to system administrators.

Jylhis (Contributor, Author) commented Jun 9, 2021

I'm using x86-64. Here are some memory stats from /proc/{id}/status for a run with 30k members (10k add, 10k remove, 10k update).

          Before (kB)   After (kB)
VmPeak    368084        386616
VmSize    367824        386616
VmHWM     79660         98156
VmRSS     79660         98156
VmData    71260         90052
VmPTE     628           668

ikedas (Member) commented Jun 10, 2021

Thank you for the information! At least the current results don't show a significant increase in memory usage.

If possible, could you please also check the results with task_manager.pl? It repeatedly executes the sync_include task (i.e. memory allocation/deallocation) on multiple lists.

Jylhis (Contributor, Author) commented Jun 10, 2021

I got measurements from task_manager.pl. I ran it for approximately the same amount of time, and it synchronized 2 lists (10k add, 10k delete). I accidentally ran it with only 20k members, but I think the results are still comparable.

          Before (kB)   After (kB)
VmPeak    364340        384376
VmSize    364336        384376
VmHWM     77936         96092
VmRSS     77936         96092
VmData    71900         89876

And from Valgrind

Before:

total heap usage: 11,470,155 allocs, 10,749,443 frees, 2,150,722,477 bytes allocated

After:

total heap usage: 8,996,962 allocs, 8,274,111 frees, 2,015,373,534 bytes allocated

ikedas (Member) commented Jun 11, 2021

Thank you! Things look good so far. Later I'll check memory usage and performance with my VM, too.

ikedas (Member) commented Jun 11, 2021

@Jylhis, a few things about your PR are a little unclear to me.

First:

The current code works as follows:

  • I. Start.

  • II. Include new entries.

    for each entry given by a data source:

    1. If the role of the data source is 'member' and the user is excluded:
       Do nothing.

    2. If the user has already been updated by the other data sources:
       Keep the user.

    3. If the user (has not been updated by the other data sources and) exists:
       UPDATE inclusion.

    4. Otherwise, i.e. a new user:
       INSERT new user.

  • III. Expire outdated entries.

  • IV. Update custom attributes.

  • V. Finish.

As far as I understood, your changes will make it:

  • I. Start.

  • II. Include new entries.

    Begin transaction.

    for each entry given by a data source:

    1. If the role of the data source is 'member' and the user is excluded:
       Do nothing.

    2. If the user has already been updated by the other data sources:
       Keep the user.

    3. If the user (has not been updated by the other data sources and) exists:
       UPDATE inclusion.

       However, roll back the transaction and abort in case of an unrecoverable error.

    Commit transaction.

    4. Otherwise, i.e. a new user:
       INSERT new user.

  • III. Expire outdated entries.

  • IV. Update custom attributes.

  • V. Finish.

My question is: May we leave 4. (INSERT new user) and III. (Expire outdated entries) out of the transaction?

Second:

Is the magic with AutoCommit useful? If SQLite's poor support for transactions is the problem, why not give up on using transactions with SQLite?

Jylhis (Contributor, Author) commented Jun 11, 2021

That process looks correct. delete_list_member and add_list_member also use transactions with rollbacks in case of error, so for the whole process there will be 3 transactions (update, insert & delete).

For SQLite, it's really just a workaround for a workaround. The problem is that the SQLite do_prepared_query and do_query already use a transaction for each query to do this locking:

    # Note:
    # To prevent "database is locked" error, acquire "immediate" lock
    # by each query. Most queries excluding "SELECT" need to lock in this
    # manner.

So if we start a transaction before calling these functions, they will fail to acquire the lock and return undef.
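
A standalone illustration of the conflict (not Sympa code; the database file name is a placeholder): once an outer transaction is open on a SQLite handle, the per-query BEGIN IMMEDIATE fails, which is why the query function ends up returning undef.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=test.db', '', '',
        {AutoCommit => 1, RaiseError => 0, PrintError => 0});

    $dbh->do('BEGIN');    # outer transaction, as the new sync code would start

    # The per-query "immediate" lock that the SQLite driver takes:
    my $rc = $dbh->do('BEGIN IMMEDIATE');

    # $rc is undef: SQLite cannot start a transaction within a transaction.
    print defined $rc ? "locked\n" : 'failed: ' . $dbh->errstr . "\n";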

The AutoCommit check before commit is just to suppress warning messages saying that commit has no effect when AutoCommit is enabled:

sympa/src/lib/Sympa/List.pm, lines 1605 to 1611 in aa1b3ad:

    unless ($sdm->__dbh->{AutoCommit}) {
        my $rc = $sdm->__dbh->commit;
        unless ($rc) {
            $log->syslog('err', 'Error at delete member commit: %s',
                $sdm->errstr);
            $sdm->__dbh->rollback;
        }
    }

ikedas (Member) commented Jun 11, 2021

> That process looks correct. delete_list_member and add_list_member also use transactions with rollbacks in case of error, so for the whole process there will be 3 transactions (update, insert & delete).

OK. I understood!

> For SQLite, it's really just a workaround for a workaround. The problem is that the SQLite do_prepared_query and do_query already use a transaction for each query to do this locking:
>
>     # Note:
>     # To prevent "database is locked" error, acquire "immediate" lock
>     # by each query. Most queries excluding "SELECT" need to lock in this
>     # manner.
>
> So if we start a transaction before calling these functions, they will fail to acquire the lock and return undef.
>
> The AutoCommit check before commit is just to suppress warning messages saying that commit has no effect when AutoCommit is enabled:
>
> (sympa/src/lib/Sympa/List.pm, lines 1605 to 1611, as quoted above)

I added that first workaround for SQLite: without it, the database would occasionally become locked and couldn't be recovered without manually recreating the database file. SQLite's implementation of exclusive locking is inadequate.
AutoCommit is enabled by default (as far as I remember) at least with the MySQL/MariaDB and PostgreSQL drivers.

That's why I personally thought that, to keep things simple, we'd be better off not supporting transactions with SQLite. I'll think about it some more.

ikedas (Member) commented Jun 13, 2021

According to the manual, SQLite transactions via BEGIN...COMMIT cannot be nested. I wrote methods wrapping the DBI methods to deal with this. If possible, can you check ikedas/sympa@4e3446c in your spare time?
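
For reference, the gist of such a wrapper as a minimal sketch (the package and method names here are hypothetical, not the ones in that commit): track the nesting depth so that only the outermost begin/commit pair reaches the database.

    package NestedTx;    # hypothetical name

    sub new {
        my ($class, $dbh) = @_;
        return bless {dbh => $dbh, depth => 0}, $class;
    }

    # Only the outermost begin issues a real BEGIN.
    sub begin {
        my $self = shift;
        $self->{dbh}->begin_work unless $self->{depth}++;
    }

    # Only the outermost commit actually commits.
    sub commit {
        my $self = shift;
        $self->{dbh}->commit unless --$self->{depth};
    }

    # Rollback aborts the whole flattened transaction at any depth.
    sub rollback {
        my $self = shift;
        $self->{dbh}->rollback;
        $self->{depth} = 0;
    }

    1;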

Jylhis (Contributor, Author) commented Jun 14, 2021

I checked your commit, and it is a better solution than my workaround. How do you want to handle merging it? Do you want to merge my changes into your branch, or should I cherry-pick your commit into my branch?

ikedas (Member) commented Jun 15, 2021

Please feel free to do it whichever way is comfortable for you, either merging or cherry-picking!

Jylhis (Contributor, Author) commented Jun 15, 2021

I cherry-picked your commit and pushed the changes to my branch. The latest version should now be in #1184.

ikedas (Member) commented Jun 30, 2021

@Jylhis, I checked your code and found a few things to be corrected. That's all I can point out. When they are fixed, I'll add some refactoring and cosmetic changes and merge it into the base branch.

ikedas added this to the 6.2.66 milestone Jun 30, 2021
Jylhis (Contributor, Author) commented Jun 30, 2021

Thanks! I will look into those fixes.

ikedas (Member) commented Jul 16, 2021

Hi, I pushed a PR against your PR. If there aren't any problems, please add it to your PR.

The most important change is commit 0b19264: the persistent connection (automatic reconnection) feature should be disabled during a transaction. Otherwise, if a transaction is aborted, e.g. due to a temporary server connection failure, the connection is automatically recovered and subsequent updates will not be rolled back.
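
A sketch of the failure mode being guarded against (the helper below is hypothetical, not the actual code in 0b19264): a ping-and-reconnect wrapper must refuse to reconnect while a transaction is open, because the fresh connection comes up in AutoCommit mode and everything written through it afterwards can no longer be rolled back.

    # Hypothetical illustration of the guard, not Sympa's actual method.
    sub do_query_guarded {
        my ($self, $sql, @bind) = @_;

        unless ($self->{dbh} and $self->{dbh}->ping) {
            if ($self->{in_transaction}) {
                # Reconnecting here would silently drop the open transaction
                # on the server; later statements would autocommit and could
                # never be rolled back.
                die 'connection lost during transaction; aborting sync';
            }
            $self->{dbh} = $self->_connect;    # safe outside a transaction
        }
        return $self->{dbh}->do($sql, undef, @bind);
    }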

Jylhis (Contributor, Author) commented Jul 21, 2021

PR merged 👍🏻

ikedas added a commit that referenced this issue Jul 26, 2021: Data source synchronization performance improvement (#1186)
ikedas (Member) commented Jul 26, 2021

@Jylhis, your PR has been merged now. Thank you for the valuable improvement!

Your changes will be included in the next beta, to be released this week. Please let us know if you notice anything.

ikedas added a commit that referenced this issue May 23, 2022: Revert changes for #1186 "Improving data source synchronization performance"