
Long running php processes: LDAP timeout #4831

Closed
linuxrrze opened this issue May 12, 2017 · 12 comments

Comments

@linuxrrze

Bug initially reported against nextant (nextant issue #175), but it could be caused by a more general user_ldap problem.

For long-running jobs, some LDAP servers (a Samba domain controller in my case) seem to kill the LDAP connection after some time. This connection is not re-established by nextcloud/user_ldap when required.

Possible explanation: during normal operation, most of Nextcloud's PHP processes run under the Apache PHP configuration with a relatively short runtime, typically 30 s. That is most likely not enough to run into any LDAP timeouts. However, Nextcloud's cron.php (when started by system cron) runs with the PHP CLI configuration and unlimited runtime; in this case a reconnect to LDAP may be required.
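For context, the long-running CLI case here is Nextcloud's background job entry in the system crontab, along these lines (an illustrative entry; the installation path is an assumption):

# crontab of the web server user: cron.php runs with the php-cli
# configuration, i.e. without Apache's execution time limits
*/5 * * * * php -f /var/www/nextcloud/cron.php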

As apps are (most likely?) unaware of the actual user backend, the problem cannot be solved by the app developer, but needs to be handled in the user backend itself.

Steps to reproduce

  1. Use a Samba domain controller as LDAP server (Samba 4.6.3 in my case)
  2. Run a long-lasting PHP job (e.g. "nextant:index" with lots of files to index) with LDAP user authentication enabled

Expected behaviour

Job should finish normally.

Actual behaviour

Job breaks after indexing all files (which takes days to finish!).
On the next run, the indexing starts all over again ...

# sudo -u www-data php occ nextant:index
nextant v1.0.8

* Extracting files:

/Administrator/files            12/    12 [============================] 100%
/dani/files                      7/     7 [============================] 100%
/flo/files                       7/     7 [============================] 100%
/frank/files                     7/     7 [============================] 100%
/ijp/files                       7/     7 [============================] 100%
/marcel/files               145150/145150 [============================] 100%


                                    
  [OC\ServerNotAvailableException]  
  Lost connection to LDAP server.   
                                    

nextant:index [--output [OUTPUT]] [--debug] [--debugall [DEBUGALL]] [-k|--unlock] [-f|--force] [-u|--user [USER]] [-c|--background] [-b|--bookmarks] [-i|--files] [-x|--files_extract] [-r|--files_update]

Server configuration

Operating system: Ubuntu 16.04

Web server: Apache 2.4.18

Database: PostgreSQL 9.5

PHP version: 7.0

Nextcloud version (see Nextcloud admin page): 11.0.3

Updated from an older Nextcloud/ownCloud or fresh install: Updated

@MorrisJobke
Member

cc @daita @blizzz

@MorrisJobke added the labels "0. Needs triage", "bug" and "feature: ldap" on May 12, 2017
@ArtificialOwl
Member

I agree with the possible explanation; I already had some doubts when people reported issues with the nextant background process not ending while using LDAP.

@linuxrrze
Author

I did some code research, and the cause of this seems to be the LDAP error handling here:

apps/user_ldap/lib/LDAP.php:

private function processLDAPError($resource)
...
} else if ($errorCode === -1) {
    throw new ServerNotAvailableException('Lost connection to LDAP server.');
...

Looks like an exception is thrown, but nobody is willing to handle it.
Trying to reconnect to the LDAP server in case of this exception could do the trick.

The LDAP code at this point is a little hard to read and understand for someone not working with it on a daily basis. I had to match numeric error codes like "-1":

#define LDAP_SERVER_DOWN				(-1)

against the LDAP header file:
https://github.com/openldap/openldap/blob/master/include/ldap.h

As a side note: it'd be nice to have those constants mentioned in the source for better readability.
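To illustrate the reconnect idea, here is a minimal hypothetical sketch (not actual user_ldap code; the class and method names are made up, and the constant mirrors OpenLDAP's ldap.h):

<?php
// Sketch: retry an ldap_* call once after re-establishing a dropped
// connection, instead of throwing immediately.
const LDAP_SERVER_DOWN = -1; // #define LDAP_SERVER_DOWN (-1) in ldap.h

class ReconnectingLdap {
    private $uri;
    private $bindDn;
    private $bindPw;
    private $resource;

    public function __construct($uri, $bindDn, $bindPw) {
        $this->uri = $uri;
        $this->bindDn = $bindDn;
        $this->bindPw = $bindPw;
        $this->connect();
    }

    private function connect() {
        $this->resource = ldap_connect($this->uri);
        ldap_set_option($this->resource, LDAP_OPT_PROTOCOL_VERSION, 3);
        ldap_bind($this->resource, $this->bindDn, $this->bindPw);
    }

    // Run an ldap_* call; if the server reports "server down" (-1),
    // reconnect once and retry the call on the fresh resource.
    public function call(callable $fn) {
        $result = $fn($this->resource);
        if ($result === false && ldap_errno($this->resource) === LDAP_SERVER_DOWN) {
            $this->connect();
            $result = $fn($this->resource);
        }
        return $result;
    }
}

Paged results would still need extra care, since a reconnect loses the paging cookie and the search would have to be restarted.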

@blizzz
Member

blizzz commented May 15, 2017

in this case a reconnect to LDAP may be required.

This could potentially result in other bugs when paged results are being used: if those are restarted, it's possible to run into a timeout again, and the fun starts over.

Perhaps it makes more sense to set/request a higher timeout from the LDAP server when a background task is running (keyword: LDAP_OPT_NETWORK_TIMEOUT), if we can figure that out. It still might not help in the long run.
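For reference, such a timeout can be requested through PHP's LDAP extension roughly like this (a minimal sketch with placeholder host and credentials; LDAP_OPT_NETWORK_TIMEOUT only covers client-side socket timeouts, so the server may still drop an idle connection on its own):

<?php
// Sketch: raise the client-side network timeout before binding.
$conn = ldap_connect('ldap://dc.example.com');            // placeholder host
ldap_set_option($conn, LDAP_OPT_PROTOCOL_VERSION, 3);
ldap_set_option($conn, LDAP_OPT_NETWORK_TIMEOUT, 300);    // seconds
ldap_bind($conn, 'cn=admin,dc=example,dc=com', 'secret'); // placeholder bind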

Alternatively, can the background job be split into smaller chunks?

@linuxrrze
Author

The handling of long-running paged results may require special treatment, that's right.

I'm not sure whether splitting a background job into smaller chunks will help: as far as I observed, the LDAP connection during "nextant:index" is opened (and used?) at the very beginning, then a long time with no LDAP activity follows, and I guess that's what makes the LDAP server close the connection.

So splitting the background job into smaller chunks would only improve the situation if every chunk were processed by a new process (with a new LDAP connection that won't hit the timeout).

I also thought about changing some LDAP connection related settings, but most of the options I found seem to be OpenLDAP-specific (and therefore won't help with other LDAP implementations like my Samba DC LDAP):

/* private and experimental options */
/* OpenLDAP specific options */
...
#define LDAP_OPT_NETWORK_TIMEOUT	0x5005	/* socket level timeout */

...

/*
 * OpenLDAP per connection tcp-keepalive settings
 * (Linux only, ignored where unsupported)
 */
#define LDAP_OPT_X_KEEPALIVE_IDLE		0x6300
#define LDAP_OPT_X_KEEPALIVE_PROBES		0x6301
#define LDAP_OPT_X_KEEPALIVE_INTERVAL	0x6302

In the end I still think re-connecting is the best solution, even if it may require some extra handling (e.g. for paged results).

How do other user backends handle situations like this?
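For completeness, the keepalive options quoted above are also exposed as constants by PHP's LDAP extension (since PHP 7.1, if I recall correctly, so not on the PHP 7.0 setup used here); a sketch with placeholder values:

<?php
// Sketch: enable TCP keepalives on the LDAP client socket so that long
// idle phases keep the connection alive. OpenLDAP client on Linux only;
// other implementations ignore these options.
$conn = ldap_connect('ldap://dc.example.com');              // placeholder host
ldap_set_option($conn, LDAP_OPT_X_KEEPALIVE_IDLE, 60);      // idle seconds before first probe
ldap_set_option($conn, LDAP_OPT_X_KEEPALIVE_PROBES, 5);     // failed probes before dropping
ldap_set_option($conn, LDAP_OPT_X_KEEPALIVE_INTERVAL, 10);  // seconds between probes
ldap_bind($conn);                                           // anonymous bind for the example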

@ArtificialOwl
Member

I could totally catch the ServerNotAvailableException, but because I do not work with LDAP directly (I think it is managed by the UserManager), how can I quickly reconnect to LDAP from my script?

@blizzz
Member

blizzz commented May 16, 2017

@daita you don't. Try to set $resource to null before https://github.com/nextcloud/server/blob/master/apps/user_ldap/lib/LDAP.php#L333. Not sure whether it works, and as stated above this might create more issues without further adjustments. For you it's probably okay, since you don't interact much with LDAP, as I interpret it.

@linuxrrze
Author

I've been running a similar test overnight.

Instead of setting $resource to null I replaced the exception with a simple debug warning:

apps/user_ldap/lib/LDAP.php:

private function processLDAPError($resource)
...
} else if ($errorCode === -1) {
    # throw new ServerNotAvailableException('Lost connection to LDAP server.');
    \OCP\Util::writeLog('user_ldap',
        'LDAP error Lost connection to LDAP server.'.$errorMsg.' ('.
        $errorCode.') after calling '.
        $this->curFunc,
        \OCP\Util::DEBUG);

"occ nextant:index" now finishes without (obvious) error.
However this seems to be only true for the "Extracting files" section, which is run for all local and LDAP user accounts.
Sections "Updating files" and "Indexing bookmarks" are only executed for local user accounts (=Administrator). The nextcloud.log reports "Connection lost" errors in these two sections. So I guess they started after the LDAP connection got lost.

# sudo -u www-data php occ nextant:index
nextant v1.0.8

* Extracting files:

/Administrator/files             7/     7 [============================] 100%
/dani/files                      1/     1 [============================] 100%
/flo/files                       7/     7 [============================] 100%
/frank/files                     7/     7 [============================] 100%
/ijp/files                       7/     7 [============================] 100%
/marcel/files                35132/ 35132 [============================] 100%
/martin/files                    7/     7 [============================] 100%
/moni/files                      7/     7 [============================] 100%
/sebbi/files                     7/     7 [============================] 100%

  33942 file(s) processed ; 0 orphan(s) removed
  33942 documents indexed ; 33942 fully extracted
  82 file(s) were not processed (failure)

* Updating files:

/Administrator/files             7/     7 [============================] 100%

  0 document(s) updated ; 0 failure(s)

* Indexing bookmarks:

/Administrator                   0/     0 [>---------------------------]   0%

  0 bookmark(s) processed ; 0 orphan(s) removed
  0 document indexed ; 0 fully extracted

Time spent: 00:10:26
Your index now contains 18 segments

@blizzz
Member

blizzz commented May 17, 2017

Instead of setting $resource to null I replaced the exception with a simple debug warning:

This would fail really, really hard in several other places. Been there, not pleasant. ;)

@linuxrrze
Author

I'm totally aware of that.

That's why I hope for a clean solution ;-)

@blizzz
Member

blizzz commented May 24, 2017

Potential fix in #5104

@blizzz removed the "0. Needs triage" label on May 24, 2017
@blizzz added this to the Nextcloud 12.0.1 milestone on May 24, 2017
@MorrisJobke
Member

backported to stable12 in #5210
