Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reworks access control API #136

Merged
merged 17 commits into from
May 12, 2019
Merged

Reworks access control API #136

merged 17 commits into from
May 12, 2019

Conversation

mdavidsaver
Copy link
Member

Still a work in progress, but I think far enough along for a first look.

This is not a protocol change. It should not be a change to the "ca" style authentication mechanism. It is an incompatible change in pv/security.h. The only breakage I anticipate is pvaSrv.

There is a lot here. The part I'd most like to have feedback on is the new PeerInfo struct, and associated new method ChannelRequester::getPeerInfo(). PeerInfo is intended to be the common piece which joins together three stages: Authentication (who are you?), Authorization (what can you do?), and access control decisions (do I let you do it?). The last part is no longer handled implicitly. Rather, every ChannelProvider is responsible to inspecting the PeerInfo provided (or not) by each requester.

The core information which a ChannelProvider will use is PeerInfo::account (aka username, aka principle name, ...), and PeerInfo::roles (aka. groups, aka attributes, ...). The later being an unordered set of strings which can be tested (any other meaning is left to a ChannelProvider).

/** @brief Information provded by a client to a server-type ChannelProvider.
*
* All peers must be identified by a peer name, which may be a network address (IP+port#) and transport.
* Peer names must be unique for a given transport.
*
* Transport names include:
*
* # "local" in-process client. Peer name is optional and arbitrary. Must set local flag.
* # "pva" PVA over TCP. Used by PVA client provider. Peer name is IP address and TCP port number as "XXX.XXX.XXX.XXX:YYYYY".
*
* Authority names include:
*
* "anonymous" - No credentials provided. Must not set identified flag.
* "plain" - Unauthenticated credential.
*/
struct epicsShareClass PeerInfo {
POINTER_DEFINITIONS(PeerInfo);
static size_t num_instances;
std::string peer; //!< network address of remote peer. eg. "192.168.1.1:5075".
std::string transport; //!< transport protocol used eg. "pva". Must not be empty.
std::string authority; //!< authentication mechanism used. eg. "anonymous" or "gssapi". Must not be empty.
std::string realm; //!< scope of authority. eg. "mylab.gov"
std::string account; //!< aka. user name
//! NULL or extra authority specific information.
pvData::PVStructure::const_shared_pointer aux;
typedef std::set<std::string> roles_t;
//! Set of strings which may be used to modify access control decisions.
roles_t roles;
unsigned transportVersion; //!< If applicable, the protocol minor version number
// attributes for programatic consumption
bool local; //!< Short-hand for transport=="local"
bool identified; //!< Short-hand for authority!="anonymous"
PeerInfo();
virtual ~PeerInfo();
};

/**
* @brief Return information on connected peer if applicable.
*
* A server-type ChannelProvider will use this method to discover if a remote client
* has provided credentials which may be used in access control decisions.
*
* Default implementation returns NULL.
*
* isConnected()==true and getPeerInfo()==NULL when the ChannelProvider does not provide
* information about the peer. This should be treated as an unauthenticated, anonymous,
* peer.
*
* The returned instance must not change, and a different instance should be returned
* if/when peer information changes (eg. after reconnect).
*
* May return !NULL when !isConnected(). getPeerInfo() must be called _before_
* testing isConnected() in situations where connection state is being polled.
*/
virtual std::tr1::shared_ptr<const PeerInfo> getPeerInfo();

The mechanics of this process is:

  1. Server proposes a list of AuthenticationPlugin names.
  2. Client selects one.
  3. An exchange results in the server having a PeerInfo.
  4. The server passes this PeerInfo through a chain of AuthorizationPlugins which may modify it.

So far there is only one AuthorizationPlugin which populates PeerInfo::roles with the list of *NIX group names, or Windows group names (tested with cross mingw only so far). vxWorks and RTEMS leave this empty. This happens only on the server side.

The credentials (account name and roles) of connect clients is printed by the pvasr iocsh command. A concurrently running pvmonitor cnt shows.

epics> pvasr 2
Clients:
  tcp://192.168.210.1:34792 ver=1  1 channels user: ca/mdavidsaver groups:audio,bluetooth,cdrom,dip,floppy,lp,lpadmin,mdavidsaver,netdev,plugdev,sudo,uml-net,video,wireshark
  cnt

This interface will be removed.
Break up monolithic SecurityPlugin API into three parts.

1. Authentication.
2. Authorization.
3. Access Control decision.  Left to ChannelProviders
@mdavidsaver
Copy link
Member Author

While there is nothing here which must be replicated on the java side (it isn't a protocol change) I'd like to solicit opinions from @shroffk and/or @georgweiss on whether this new abstraction makes sense.

The default value of _verificationStatus is reported
on auth timeout.  Make sure it isn't success.
@mdavidsaver
Copy link
Member Author

mdavidsaver commented Jan 14, 2019

I'm not planning to merge this until I've successfully used it with, at least, the PVA gateway access controls. There are a few remaining questions which I have:

  1. Configuration! Control which AuthenticationPlugin s are a advertised, and in what order. Configure said plugins. Similarly for AuthorizationPlugins s. So far this is hard coded. The idea of doing this through environment variables frightens me.
  2. Is the roles (aka. groups) idea sufficiently expressive?
  3. How/if groups populated on RTEMS/vxWorks? Trust client to provide? What about RTEMS -> RTEMS situations?
  4. What else should go into PeerInfo? sockaddr in addition to string? SASL-like SSF metric?
  5. Expose server PeerInfo? That is, allow client to inquire about the server. Most strong auth. protocols work in both directions. Would detect search based MitM in many cases.
  6. Does NetUserGetLocalGroups() give meaningful results in real Windows environments?
  7. What to do about netapi32.dll? More broadly, how to avoid having to hard code system libraries in all dependent/downstream modules?
  8. How to handle expiring credentials?

@ralphlange
Copy link
Contributor

@shroffk / @georgweiss: I remember using the JAAS API for Java AuthN/AuthZ. Would that be a reasonable thing to use, or are the scopes too different?

@georgweiss
Copy link

Not sure I completely understand why this is not a protocol change. How would a client present roles (or rather: list of groups)?
As for list of groups, I think @mdavidsaver has suggested a limit of 16, but this might be a bit too conservative in my opinion.
As for JAAS, I think we would need to use it, at least to extract information like user groups.

@ralphlange
Copy link
Contributor

Michael wrote earlier:

So far there is only one AuthorizationPlugin which populates PeerInfo::roles with the list of *NIX group names, or Windows group names (tested with cross mingw only so far). vxWorks and RTEMS leave this empty. This happens only on the server side.

So my understanding is that lists of groups are not sent by the client.

@mdavidsaver
Copy link
Member Author

limit of 16, but this might be a bit too conservative in my opinion.

Fair enough. I was mostly looking to test the idea of expressing groups using the existing UAG syntax.

https://code.launchpad.net/~epics-core/epics-base/+git/asLib/+merge/358821/comments/941273

@mdavidsaver
Copy link
Member Author

So my understanding is that lists of groups are not sent by the client.

This understanding is correct. The structure send by clients has, and had, only two string fields: "host" and "user".

@georgweiss
Copy link

So the suggested procedure to pass group (role) names is through the exchange of PeerInfo, right?

@bhill-slac
Copy link
Contributor

bhill-slac commented Jan 16, 2019

  1. Configuration! Control which AuthenticationPlugin s are a advertised, and in what order. Configure said plugins. Similarly for AuthorizationPlugins s. So far this is hard coded. The idea of doing this through environment variables frightens me.
    

Why not use a config file as we currently do for the gateway and IOC CA Access Security configuration?
2. Is the roles (aka. groups) idea sufficiently expressive?
This works for SLAC. We're interested in creating groups such as operator, staff, sxr-opr, cxi-opr, cxi-staff, etc and looking for whether a user belongs to a group w/ write access.
3. How/if groups populated on RTEMS/vxWorks? Trust client to provide? What about RTEMS -> RTEMS situations?
This is a tough one. Ideally this would work w/o IOC needing to be reconfigured or restarted as our RTEMS IOCs often run nonstop for months.
4. What else should go into PeerInfo? sockaddr in addition to string? SASL-like SSF metric?
I don't see a current need for other fields.
5. Expose server PeerInfo? That is, allow client to inquire about the server. Most strong auth. protocols work in both directions. Would detect search based MitM in many cases.
Interesting suggestion, but don't see this as a current need.
6. Does NetUserGetLocalGroups() give meaningful results in real Windows environments?
7. What to do about netapi32.dll? More broadly, how to avoid having to hard code system libraries in all dependent/downstream modules?
8. How to handle expiring credentials?
I see a need to configure a timeout on our EPICS CA or PVA credentials. Potentially this could be handled by using kerberos tickets w/ custom timeouts for different groups.

Re this not being a protocol change, isn't that only true for CA, and only so long as CA only implements the current client supplied username authentication mechanism? Does pvAccess have a protocol version, and wouldn't we need to update that so a pvAccess server can be backward compatible w/ an older pvAccess client? Would these older clients just ignore the list of AuthenticationPlugin names?

@bhill-slac
Copy link
Contributor

I agree re needing more than 16 groups. If we populate the groups a user belongs to through POSIX, many of our users belong to way more than 16 groups as each experiment gets a numbered group. Do we need to limit these?

@mdavidsaver
Copy link
Member Author

Re this not being a protocol change, isn't that only true for CA, and only so long as CA

As clarification, any mention of "ca" here refers only to what was the SecurityPlugin, and will be AuthenticationPlugin by that name. This has no real connection with the Channel Access protocol . See the "The mechanics of this process is:" mention above.

@mdavidsaver
Copy link
Member Author

@bhill-slac

Why not use a config file as we currently do for the gateway and IOC CA Access Security configuration?

For a gateway, sure. This change is more general, and applies to all PVA clients and servers. A configuration file is still a possibility in the same way that all targets load .acf files.

  1. How to handle expiring credentials?

I see a need to configure a timeout ...

The question is then what to do. Simply start denying puts? Force a disconnect? Somehow fallback to anonymous? I see potential problems with all of these options. And of course there is the mechanical question of how, and where, a timeout would be expressed. Neither the existing or proposed APIs provide a way to do this.

Absent any better ideas my inclination is the force disconnect option as this error case is handled. This leaves the usual issue of avoiding a reset loop though.

@mdavidsaver
Copy link
Member Author

I've been able to use this successfully so far with my second PVA gateway prototype. I also think I've sorted out the mingw/windows linking error.

I still haven't had any feedback on whether NetUserGetLocalGroups() gives useful output on windows. Also, the question of groups on RTEMS/vxWorks is still unanswered. However, I don't think resolving these would result in API changes.

Also, the larger question of how to handle runtime configuration probably can't be resolved without input from someone who has actually tried to deploy it.

So I think this change is now merge-able.

Ignore NGROUPS_MAX.  On Darwin NGROUPS_MAX==16.
The actual limit is higher.

getgrouplist() is weird...
@mdavidsaver
Copy link
Member Author

@georgweiss encountered some portability issues with osdGetRoles() on Mac. This raises a larger point that there are several relevant scenarios which I'm not able to easily test.

To assist with this I've added the showauth executable. This isn't installed, but can be found under testApp. eg.

$ ./testApp/O.linux-x86_64/showauth
User: mdavidsaver
Groups: 
  audio
  bluetooth
...

Some test results I'd like to have.

  1. Windows standalone
  2. Windows domain member
  3. Linux with NIS or LDAP
  4. Mac

In all cases, the first question is whether it builds and runs at all.

With Linux and Mac, I'm interested to know the execution time of showauth vs. groups $USER.

With Windows, I'm interested what is actually reported, and whether it would be useful in an ACF file.

@georgweiss
Copy link

On a Macbook pro, in an LDAP environment, I have:

time showauth
real 21-28 ms
user 6-7 ms
sys 7-9 ms

time groups
real 12-15 ms
user 3-4 ms
sys 3-4 ms

It seems that the OS occasionally needs to refresh LDAP information. In those cases execution time doubles or triples (at least real and sys).

@hartmansm
Copy link

Mac 10.13.6 (office machine, with AD, member of ~90 groups)

$ time ./showauth 
. . . 
real	0m0.020s
user	0m0.006s
sys	0m0.005s
$ time groups
 . . .
real	0m0.014s
user	0m0.003s
sys	0m0.003s

@hartmansm
Copy link

redhat 7.6 (LDAP, member of ~30 groups)

$ time ./showauth
 . . .
real	0m0.022s
user	0m0.008s
sys	0m0.004s
$ time groups 
 . . .
real	0m0.005s
user	0m0.001s
sys	0m0.004s

@mdavidsaver
Copy link
Member Author

@georgweiss @hartmansm Thanks for the feedback.

execution time of showauth vs. groups $USER

wrt. execution time, I think that giving $USER is important (or picking a different user). At least for groups from GNU coreutils this selects a different code path. If $USER is omitted, then getgroups() is used, which asks the OS to list the groups IDs of the running process. If $USER is given, then getgrouplist() is used (as with showauth), which parses /etc/groups or consults LDAP (but not the OS kernel).

In both cases it is then necessary to consult /etc/groups, LDAP, etc. to find names for the numeric group IDs. So the performance shouldn't be dramatically different. One less potential I/O operation.

As an aside, getgroups() and getgrouplist() can disagree if games are played with setgroups() (eg. with Linux user namespaces). The PVA auth checks will never compare with the running process, even if the remote username happens to match the local process username.

@hartmansm
Copy link

@mdavidsaver sorry I missed $USER in your instructions. Typical results below . . .

Mac 10.13.6 (office machine, with AD, member of ~90 groups)

$ time groups $USER
 . . . 
real	0m0.014s
user	0m0.003s
sys	0m0.003s

redhat 7.6 (LDAP, member of ~30 groups)

$ time groups $USER
 . . .
real	0m0.007s
user	0m0.004s
sys	0m0.004s

@ralphlange
Copy link
Contributor

Do we need to test with an LDAP server not answering, i.e. no connection / time out?
I know that dysfunctional DNS is hitting Linux machines (and IOCs) painfully, and I would hate to add yet another vulnerability.

@georgweiss
Copy link

georgweiss commented Apr 30, 2019

time groups $USER on Macbook pro (10.13.6), AD connected (not LDAP as stated previously), 30 groups: results are a wee bit faster (1-2 ms) compared to "time groups georgweiss". However, this is tested over VPN, will provide results from office network on Thursday.

@mdavidsaver
Copy link
Member Author

Do we need to test with an LDAP server not answering, i.e. no connection / time out?

I would like to know how this fails. getgrouplist() on Linux doesn't document an error other than buffer too small. Given the need to handle the unhelpful behavior of Mac (and maybe other *NIX), there is a reasonable chance that the current osdGetRoles() will loop through multiple timeouts while trying to resize the gid array until it gets a std::bad_alloc. getgrouplist() is an ugly interface.

... I would hate to add yet another vulnerability.

imo. this is a situation which sites using centralized authentication systems should already be mitigating by eg. having redundant LDAP servers and/or local caching with nscd.

@hartmansm
Copy link

Do we need to test with an LDAP server not answering, i.e. no connection / time out?

A quick check on Mac (pulled the network cable) was inconclusive. Group information is cached and it replied as before (and as quickly) for both groups and ./showauth.

@mdavidsaver
Copy link
Member Author

Does OSX use nscd? nscd -i passwd and nscd -i group will clear the user and group databases (aka. cache). This database names may be glibc specific. Or try searching for a random non-existent user.

@georgweiss
Copy link

In the ESS environment Mac does not use nscd.

Guess it depends on client configuration, but I observe the same thing as Steven: there is obviously a cache, and on Linux we will be able to tweak the cache expire time. What I do observe is a longer (~100-200 ms) response time for command "groups" when the host has been off-line and goes back on-line. Suppose this is caused by an AD request.

@hartmansm
Copy link

Does OSX use [nscd]

No and I have been unable to determine what service is doing the cacheing. Some online docs gave information relevant to earlier versions of MacOS but implied that 10.13 only uses the cache for offline mode which doesn't match my observations.

Our linux hosts do not use nscd either. I think the cacheing is provided by sssd. The linux hosts I have access to are shared systems so I haven't tried disabling this for testing yet. I do observe that the first call to groups someuser has a noticeable delay but subsequent calls are quicker. (The times I provided above are for cached results.)

@georgweiss
Copy link

georgweiss commented May 7, 2019

ESS is using sssd and we can set the cache expire time based on whatever we will conclude from future tests.
Are tests on Windows needed to get pv access control moving forward? I can do the tests unless anyone else has results available already.

@mdavidsaver
Copy link
Member Author

I was planning to merge this past weekend. But I decided to do PVD merges first, and then epics-base/pvDataCPP#62 blew up in my face. So I don't see any blockers.

@mdavidsaver
Copy link
Member Author

So I don't see any blockers.

... so I'll merge when I have time to deal with any CI failures.

@mdavidsaver mdavidsaver merged commit 523cb46 into epics-base:master May 12, 2019
@mdavidsaver
Copy link
Member Author

The added requirement of linking against netapi32.dll has triggered downstream build failures.

@anjohnson Can we avoid patching every downstream module Makefile? I'm wondering about adding PROD_SYS_LIBS_WIN32 += netapi32 in cfg/CONFIG_PVACCESS_MODULE?

@georgweiss
Copy link

georgweiss commented May 14, 2019

Building on Windows for Windows has mixed results.

EPICS_HOST_ARCH=windows-x64-mingw builds fine.

EPICS_HOST_ARCH=windows-x64 does not build due to:

  1. In getgroups.cpp inclusion of <windows.h> triggers multiple inclusions of winsock. A workaround is to precede include statement with #define _WINSOCKAPI_, or to include winsock2.h before windows.h. This works also with mingw.
  2. HexDump.cpp does not build due to:
    a) hexDump.h(53): error C2375: 'epics::pvAccess::operator <<': redefinition; different linkage. This can be resolved by adding epicsShareFunc to the prototype inside the class definition.
    b) hexDump.cpp(71): error C2039: 'min': is not a member of 'std'. Solution is to include algorithm, according to MS. Do not know how this comes out on other platforms, but algorithm seems to be std C++.
    Once a) and b) are fixed, the code builds on windows-x64 with SDK 10 and Visual Studio 2019 (Community Edition).

@anjohnson
Copy link
Member

Adding PROD_SYS_LIBS_WIN32 += netapi32 to CONFIG_PVACCESS_MODULE would appear to be the cleanest solution to the Windows static build failures at the moment.

Contrary to what we implied in the meeting, I see no need to set the parallel LIB_SYS_LIBS_WIN32 variable because the link errors are only occurring on static builds, and in that configuration any library that the downstream code creates is still going to need to be linked into a PROD anyway, which will use the first variable to add the necessary system library.

The other solution might be to conditionally (when $(OS_CLASS)==WIN32) add netapi32.lib to STATIC_LDLIBS_YES but I don't know if the link order matters there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants