Possible bug in AD sync connector

I think there may be a bug in the connection to a 2008 AD.

We are running a mirror off the main AD, as a test measure.
yesterday we had a non-graceful C&B (crash & burn) loss of a UPS, this took the 2008 AD server down hard.

After the system was brought back up and functioning, we noticed the Uni. could not re-connect and was filling the error logs

it seems that once a connection is made to a MS AD , there is not any real checking to see if the connection goes down, instead the log files just fill with PY errors, even after the remote system comes backup.

a stopping of the AD connection at univention & a restarting, fixes the log errors and catches up on the domain syncs.

it seems the error routines need to be made a bit more robust & try re-forming the connection if the code is producing connection errors.

then after re-connecting:

[quote] File “/usr/lib/pymodules/python2.7/univention/connector/init.py”, line 1326, in sync_to_ucs
f(self, property_type, object)
File “/usr/lib/pymodules/python2.7/univention/connector/ad/password.py”, line 381, in password_sync
res = get_password_from_ad(connector, univention.connector.ad.compatible_modstring(object[‘dn’]))
File “/usr/lib/pymodules/python2.7/univention/connector/ad/password.py”, line 180, in get_password_from_ad
(level, ctr) = connector.drs.DsGetNCChanges(connector.drsuapi_handle, 8, req8)
NTSTATUSError: (-1073741300, ‘The transport connection is now disconnected.’)

26.01.2017 09:45:18,218 MAIN (------ ): DEBUG_INIT
26.01.2017 09:45:18,237 LDAP (ERROR ): Failed to lookup AD LDAP base, using UCR value.
26.01.2017 09:45:18,270 LDAP (PROCESS): Building internal group membership cache
26.01.2017 09:45:18,411 LDAP (PROCESS): Internal group membership cache was created
26.01.2017 09:45:18,449 LDAP (PROCESS): Using GP01 as AD Netbios domain name
26.01.2017 09:45:18,521 LDAP (PROCESS): sync from ucs: Resync rejected file: /var/lib/univention-connector/ad/1485393377.000030
26.01.2017 09:45:18,548 LDAP (PROCESS): sync from ucs: [ user] [ modify] cn=xxxxxx,ou=hk office,DC=xx,DC=xx,DC=xxx,DC=xx[/quote]

and everything is fine with the world until next time

Hi, can you provide us with the logfiles that show this behaviour? I can then try to reproduce it and create an issue entry in our tracker.

Thanks in advance,
Jens

yep, if you tell me exactly what you want and where to send them.

I’m not going to upload them to the forum , since they have user and domain info.

I built a testingenvironment with a membermode and killed the ad-master hard. My UCS complained at first but re-established a connection as the AD-Master was back online by itself. So I cannot reproduce your issue - I used UCS 4.1-4 in the latest Errata.

latest version 4.1-4 errata
AD server 2008 SP2, fully patched
WAN connection latency ~10-25ms, possible packet drops under heavy load.

system memory is 4GB , with <1GB used
system swap =0
CPU is not particularly loaded

Once the connection is manually restarted, everything syncs up again with no errors, just massive log files of failed connections.

Really it could do with some sort of trapping to capture this situation or deeper logging.

Ok,

Today i did a completely new install from a brand new downloaded 4.1 iso into a clean VM machine.

Updated to the latest errata,
then did a bi-directional mirror connection to a fully patched 2008 SP2 AD server.

it did a full sync with a few AD object errors, which resolved.

Then it ran for several hours,
finally exactly the same error on the connection occurred.
Restarted the server and it fully synced again on the out of date records, (if it is the records causing the problem , it should not restart with no errors)

So clearly something is going on with the connections, even though the WAN is showing no disconnects or package loss latency?

Apart from a flaky connection, I just tried again to reproduce the issue. I setup a UCS Server as a member to an existing Windows Server, configured the password sync, waited some time and then killed the AD. That is what I see in the log:

08.02.2017 10:44:57,503 MAIN (------ ): DEBUG_INIT 08.02.2017 10:44:58,209 LDAP (PROCESS): Using HEL as AD Netbios domain name 08.02.2017 10:45:08,974 LDAP (PROCESS): sync to ucs: [ user] [ modify] uid=tmtt,cn=users,dc=hel,dc=underworld
Modified a user to test the sync and some seconds afterwards killed the AD hard.

08.02.2017 10:52:12,775 MAIN (------ ): DEBUG_INIT 08.02.2017 10:52:44,860 LDAP (PROCESS): Building internal group membership cache 08.02.2017 10:52:44,883 LDAP (PROCESS): Internal group membership cache was created 08.02.2017 10:52:44,893 LDAP (PROCESS): Using HEL as AD Netbios domain name
After around 5 minutes I rebooted the AD again. As you can see my log went silent for about 5 - 7 minutes and then resumed connection by its own.

I have a question:

[quote]Updated to the latest errata,
then did a bi-directional mirror connection to a fully patched 2008 SP2 AD server.[/quote]

If you join an UCS via AD-Connector to a Windows Domain, that is not bi-directional. The AD is always the defining DC and changes are “mirrored” from the AD to the UCS. So if you have another setting you should let me know.

CLARIFICATION:
Our difference seems we are not making it part of an existing system.
Can you clarify which ports are required to be open to sync data & password hashes? (we want to double check this)

Should we perhaps be looking at the windows AD blocking the password request for some reason (but why a restart of the UCS fixes it)?

OUR TEST SYSTEM SETUP:
1.Create a NEW UCS domain NOT part of current schema, (finish procedure, update)
2.load in Active directory connection
3. Set UCS debugging to 3 (connector/debug/level)
4. “synchronisation of account data between an Active Directory and this UCS Domain”
5. allow to synchronise.
6. user usage via connection to the AD server (MS)

TESTING:
We now find it is not necessary to loose power to the systems, the systems randomly disconnects.
sometimes an hour, sometimes 3 hours, yesterday it was 16 hours.

FURTHER VALIDATION TESTING:

  1. we rebuilt the windows AD server on a new machine, rebuilt a new UCS system
  2. turned on UCS debugging =3

FURTHER SCOPE testing:

  1. Check ports/ protocols (TCP/UDP)
  2. Check python resources, connection limits etc
  3. Check MTU/ Packet Fragmenting/Timeout
  4. Check if it is OSX computer based objects triggering crash.

OUTCOME:
The UCS server then ran for 16 hours without erroring,
but finally the connection crashed again (logging an OSX client) & the first computer in the morning.

log was clean until this:

Our Setup is the same, we just achieved via different means (I set it up at Installation, you set it up later). We both have a UCS joined in a AD Domain and control the domain via the AD DC. The open Ports needed you can find here: http://sdb.univention.de/content/2/281/en/which-tcp-_-udp-ports-on-the-dc-master-must-be-accessable-by-other-ucs-systems.html

NTSTATUSError: (-1073741299, 'The transport connection has been reset.')

Unfortunatly, this just tells you, that the connection is lost. Did you already check network, switches, router, etc.? In my case the connector just resumed its duty at reestablishing the connection, I would expect that to happen in your case also.

This is beginning to annoy me…

It just does not make any sense, the thing has been running fine since version 4.0, and even reinstalling 4.1 on a separate system does not work.

cannot find anything wrong with network or switches, everything else works

Clearly something has changed in the connection routines.
it can spend hours just throwing this error message, but as soon as i restart the AD connection on the univention, it works first time… until it breaks again.

So clearly something on the uinvention is blocking it from remaking the connection…?

where to start looking?

On startup I have the following (stripped of other connections)

when the connection goes bad i have this:

and after re-starting the AD-connector, and the AD re-syncs.

Clearly we can see a double connection: (over a port that is not listed in your reference list 49155)
The question is, why the LDAP connection appeared to be able to maintain without falling over, but the other one cannot

Think i have found it.

for some reason the TCP connection session protocol is showing as ‘active’ at the firewall, even though it is not as far as the application is concerened. (error messages)

Expiring the connection at the firewall causes the connection to re-form and the AD to re-sync ,as you would expect and without the need to restart the actual AD-connector/univention instance.

As far as the “netstat” in the VM instance is concerened the connection:

Has “gone away”

but as far as the firewall is concerned “41482-49155” , has NOT gone away.

So what it looks like is as far as the host VM instance is concerned:

The python code is attempting to use a connection it cannot , because of some sort of error, between the KVM, hosted Kernel & Python.

Maybe this will be why you cannot duplicate it, because it is not actually a network/ high level python code problem.
but rather a kernel process/python Asynchronous socket failure that the application does not handle by resetting the socket handler.

So when this error is thrown, there needs to be some sort of shutdown & reforming of the underlying connection

What I did not understand is: The error is

NTSTATUSError: (-1073741300, 'The transport connection is now disconnected.')

so clearly, the UCS recognizes that the connection is lost (same as in my testing environment - I just do not cranked up the debuglevel). In my environment, the connection was reestablished as soon the AD was reachable again. So this actually happend:

Fine,
but clearly the systems are not synchronised.
Because if I re-set the connection (by deleting it at the firewall), then there are literally 10’s of updates to the UCS system.

you can see here… a tail of the log… with continual errors.

after clearing the connection at the firewall, you can see the log file changes. (the debug-init is where i did a forced firewall deletion of the connection)

This is on a network that does not fail nor does it drop packets.

the longer I leave the connection without a hard reset, the longer the re-sync list is.

If i don’t reset the connection, then i just continually get the “disconnected error”

So…
its recovery not likely to be related to this in /init.py 1323, which does not fully check the return values after the connection

HI,

if i need to programatically close the connection to a MS AD server

is there a python library routine for it?

for example i might see this in the python code:

[quote] if retry:
self.open_ucs()
return self.sync_to_ucs(property_type, object, premapped_ad_dn, False)
else:
raise
[/quote]

is there a comparable routine to close the connection?

If you mean the LDAP connection you can use this:

open_ad() sets self.lo_ad and in ‘lo’ an LDAP Object will be saved, that you can close explicitly.

ad_obj = ad(...) ad_obj.open_ad(...) ad_obj.lo_ad.lo.unbind_ext() ad_obj.open_ad(...)

But we think it should be sufficient to just reopen .open_ad() all old objects will be closed and a new connection will be established.

I will try that and check the results

yep that fixed part of the bug!!!

but

does NOT work

it has to be:

[url]Possible bug in AD sync connector]

but not the best solution since i will loose the fix on your software update.
also this is clearly not clearing the connection completely, since other errors are thrown

[quote] self.open_ucs()[/quote] does not force a physical disconnection between the sites as far as this bug is concerned.

it seems if we do this:

then after the connection errors it is reformed.

[quote] self.open_ucs()
[/quote]
DOES NOT work if used here

Mastodon