Database To AD Group Mapping In Apache Sentry

This post is for Big Data cluster admins who want to take a stock of which active directory group has access to which hive/impala database in the cluster.

In Hadoop Clusters the most common and recommended way of access control is via groups wherein central active directory groups are created (on windows environment) with Unix attributes (like group id, home directory, and login shell, etc) set properly so that they are relevant on unix environment as well. Once these groups are created the new/existing users who need access to the Hadoop environment or hive tables are provided the required access by adding these users to the identified groups.

Of course, they should also have the corresponding unix attributes set as well like userid, groupid (of which the user is a member), home directory, and the login shell. This is required because unix doesn’t work based on names (like cn or common name in active directory) but it is all based on ids in Unix. So even if you define groups and add users to those groups you can’t use them for access control in unix until unix attributes are defined for those groups or users in the active directory itself. Then it is the job of unix admins to have the SSSD + PAM module installed on the Hadoop cluster which helps in integrating these two disjoint operating systems.

So far so good? The above description is required to help us understand how and from where these group definitions come from in most of the live clusters.

Now with HiveServer2 we can use HQL (Hive SQL or Hive Query Language) to GRANT or REVOKE privileges to a database or a table or even a column of the given table too. Sentry basically ties an active directory to a hive database (or table or a column) via a role. You can understand a role in sentry as a bridge between two disjoint entries. Imagine a pub-sub model of message broker where there is a topic to which a publisher publishes the message and is not worried about when and who would read the message (i.e. fire and forget) and at the other end, there are subscribers of the same topic which get the message as per their own convenience with an additional benefit of adding more subscribers later or remove the existing subscriptions.

Sentry architecture bears a remarkable similarity (to a certain extent) to this pub-sub model wherein access levels (all, select, etc) on a database is assigned to a role and the role is assigned to one or more different active directory groups. Obviously, you can change the access levels assigned to a role which means all groups with which this role is associated would automatically have changed permissions. And at the same time, you can add/delete groups to this role enabling/disabling all the users in that group. Pictorially it is something like

Sometimes based on the requirement we multitask a single sentry role using which we provide multiple groups with access to multiple databases. In this, the sentry configuration could be as complex as:

The HQL commands for these actions look like

Creating a sentry role:
```
CREATE ROLE [role_name];
```

Providing database access to a role:

GRANT SELECT ON DATABASE database_name TO ROLE role_name;

Associating a group with the role:

GRANT ROLE role_name TO GROUP <groupName>;

The overall summary of the entire process is:

Have the required groups created in the central active directory (with unix attributes)
Get the appropriate users added to the relevant groups (with unix attributes) like RiskBusinessSelect, RiskBusinessAll
Create Sentry role(s)
Provide relevant access to sentry role(s) on database(s) (or its resources)
Associate a Sentry role with an active directory group (obviously as per the access requirement)

Till this point, it seems like everything is streamlined right from creating the group in AD to their association with an appropriate role in Apache Sentry. This process gives an impression wherein we can easily find out which all roles are associated with which databases and groups to roles association as well. The commands we can use are:

List all the active roles for currently logged in user (only admin can see all the roles):
```
SHOW CURRENT ROLES;
```

Role to database association:

SHOW GRANT ROLE <roleName> on <DATABASE|TABLE><db | table Name>;

Group to role association:
```
SHOW ROLE GRANT GROUP <groupName>;
```

In case you have observed carefully, you can get the database to role mapping and role to group mapping but not something like the database to group mapping. Imagine a scenario, maybe for audit reports, where you have to get details on which all user groups have active access to hive databases along with the access level. You might think of using subqueries and get this associative data back using relevant joins but unfortunately, it doesn’t work that way as subqueries are not select-based (see they all start with Show) so trying something like

select rln.role_name from ( SHOW CURRENT ROLES ) rln;

will not work here. So what do we do here? Is there any way out?

Yes, we have a solution where we can get details from the sentry database itself. You can get the details of the database and its credentials from Ambari (if you are using Hortonworks) or Cloudera Manager (if you are using Cloudera Distribution of Hadoop). These are the main Sentry Tables:

SENTRY_GROUP: Stores the group names with local sequential ids for the groups used in sentry privileges
SENTRY_ROLE: Stores the name and id of the roles created.
SENTRY_ROLE_GROUP_MAP: Stores the mapping between Sentry roles and Sentry groups using the ids defined in the above two tables
SENTRY_DB_PRIVILEGE: Stores the hive object details (database or table) with the assigned access level.
SENTRY_ROLE_DB_PRIVILEGE_MAP: It is a mapping between the Sentry role (#2 above) and the provided access to database objects (#4 above)

You can simply write a Unix script to connect to this database and produce the desired output. The SQL query for the desired information could look like (output would be a single column with comma-separated values so that it can be saved in CSV directly):

SELECT DB_PRV.DB_NAME || ',' || GRP.GROUP_NAME || ',' || REPLACE(DB_PRV.ACTION,'*','all')
from SENTRY_DB_PRIVILEGE DB_PRV
INNER JOIN SENTRY_ROLE_DB_PRIVILEGE_MAP ROLE_TO_DB_PRV ON
DB_PRV.DB_PRIVILEGE_ID = ROLE_TO_DB_PRV.DB_PRIVILEGE_ID
AND upper(PRIVILEGE_SCOPE) = 'DATABASE'
INNER join SENTRY_ROLE_GROUP_MAP ROLE_TO_GROUP ON
ROLE_TO_GROUP.ROLE_ID = ROLE_TO_DB_PRV.ROLE_ID
INNER join SENTRY_GROUP GRP ON
GRP.GROUP_ID = ROLE_TO_GROUP.GROUP_ID
ORDER BY DB_NAME

A couple of points:

In case you want to search for only tables then use value for PRIVILEGE_SCOPE column as TABLE (AND upper(PRIVILEGE_SCOPE) = ‘TABLE’ ).
In case you want to limit the search and run the script with a predefined set of databases or tables then you can add a where clause before ORDER BY as: WHERE trim(DB_PRV.DB_NAME) IN (@replacewithvalues@)

I hope you would find this article useful. Do let me know your feedback or comments.

Leave a Reply Cancel reply