September 8, 2022

Managing IAM more sanely

AWS IAM is the global permissions system that AWS uses. People are often mystified by how it works and often end up just giving everything *.* permissions.

In this post I'll discuss things I learned while managing the IAM configuration at Narrative Science, and hopefully you can learn from our failures so you can manage your IAM setup in a much more sane and scalable way.

IAM Primer

IAM is composed of a few primitives:

Roles
Policies
Groups
Users

Roles, Groups and Users can all have policies attached to them. Groups are composed of multiple Users. For example, you could have a db-admin group that has the common permissions needed for that task, and users can be added or removed easily from it.

Users and Roles are the entities that authenticate with AWS. There are a lot of ways this can be done, but most commonly it's either an EC2 instance with a Role applied or an employee with an access key.

Policies are where the rubber hits the road. When you author a policy, you can say to allow or deny access to a resource (though not allowing something is an implicit deny).

Here is AWS's official documentation on the subject:
https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_access-management.html

It will also be important to understand ARNs (Amazon Resource Names), as they are used extensively in IAM policy documents:
https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html

Security Objectives

Least privilege access: only grant permissions for the resources and actions an entity actually needs. The idea here is that compromised credentials shouldn't allow an attacker to do anything they want.
Easy to administer in a fast-changing environment: employee turnover, new projects coming online, etc.
In a multiple-account setup, have a single user account for each employee.

Naïve Solution #1

This was one of the first solutions we used at Narrative Science: you go through all the resources an entity needs access to and the actions it needs on them, and grant exactly those. This is a "grant-only" solution, so you never use a deny in these policies.

Seems pretty straightforward, so why is this naïve?

Well AWS has some annoying limitations:

Users cannot have more than 10 managed policies attached (by default).
Users cannot be a member of more than 10 groups.
User inline policy size cannot exceed 2,048 characters.
Role inline policy size cannot exceed 10,240 characters.
Group inline policy size cannot exceed 5,120 characters.

These might seem like a lot of wiggle room at first but you'll find very quickly that you start running out and have to be overly permissive to fit within these bounds, especially if your team has generalists that perform multiple job functions.
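A quick way to see how fast you burn through the size limits listed above is to measure a policy document before you try to attach it. This is a minimal sketch, assuming IAM ignores whitespace when counting characters (my understanding of the quota docs) and that minified JSON is a close enough approximation. The example policy and the function names are made up for illustration.

```python
import json

# Inline policy size limits from the list above, in characters.
INLINE_POLICY_LIMITS = {"user": 2048, "role": 10240, "group": 5120}


def policy_size(policy: dict) -> int:
    """Approximate the counted size by serializing without any whitespace."""
    return len(json.dumps(policy, separators=(",", ":")))


def remaining_budget(policy: dict, entity_type: str) -> int:
    """Characters left before the policy hits the limit for this entity type."""
    return INLINE_POLICY_LIMITS[entity_type] - policy_size(policy)


# Hypothetical policy, for illustration only.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
        }
    ],
}

print("size:", policy_size(example_policy))
print("left for a user:", remaining_budget(example_policy, "user"))
```

Even a single small statement like this one eats a noticeable slice of a user's 2,048 characters, which is why enumerating every resource by ARN gets painful so quickly.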

If a user is already maxed out on groups and policies but needs permission to perform a certain function for a short period of time (privilege escalation), what do you do? Well, I guess we just drop one they don't need at that time and revert it later. You'll need some way of tracking that escalation and a reminder to de-escalate when they're done, which will almost certainly be required for compliance anyway.

At best this is an annoying burden on the admin team managing the user permission granting. At worst this will just lead to that team getting fed up and granting way overly permissive permissions.

Naïve Solution #2

Let's try the opposite: being overly permissive and having explicit denies on certain resources. On the surface this seems reasonable, since it's probably easier to state what you don't want to grant access to than what you do.

Say, for example, you want engineers to have free access to a lower environment but want to deny all of production. Well, if you have good resource tagging, you can mostly deny anything with a production tag or with prod somewhere in the ARN.
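To make the "allow everything, deny prod" idea concrete, here is a rough sketch of how such a policy evaluates. It is a toy evaluator, not AWS's real logic: it only handles Action and Resource wildcards, ignores conditions and tags, and uses Python's fnmatch as a stand-in for IAM wildcard matching. The policy and the bucket names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical "allow everything, deny production" policy.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Deny", "Action": "*", "Resource": "*prod*"},
    ],
}


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def _matches(statement, action, resource):
    """True if the statement covers this action on this resource."""
    action_hit = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_hit = any(
        fnmatchcase(resource, pattern)
        for pattern in _as_list(statement["Resource"])
    )
    return action_hit and resource_hit


def is_allowed(policy, action, resource):
    """An explicit Deny always wins; no matching Allow is an implicit deny."""
    effects = [
        s["Effect"] for s in policy["Statement"] if _matches(s, action, resource)
    ]
    if "Deny" in effects:
        return False
    return "Allow" in effects


for bucket in ("dev-data", "prod-data", "dev-product-images", "customer-exports"):
    arn = f"arn:aws:s3:::{bucket}/report.csv"
    print(bucket, is_allowed(POLICY, "s3:GetObject", arn))
```

Note the two awkward cases: a dev bucket that happens to contain "prod" in its name gets denied, and a production bucket that nobody named with "prod" is wide open. The deny approach is only as good as your naming discipline.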

This also very quickly becomes a mess and doesn't save that many characters in your policy documents after it's all said and done. Some resources require you to have read / list access, otherwise the UI is unusable. One example is CloudFormation, where you need read / list permissions on all resources but also have to deny any update access to production stacks specifically. But do you then deny CreateStack access in CloudFormation too?

There's also a huge wrinkle here: you cannot allow users to create IAM policies, because they can just create a role that grants higher access and allow themselves to assume that role. But then how do you allow someone to create IAM policies that are deployed with a CloudFormation stack?

A Better Solution

First of all, one of the biggest issues we had was that we were running both our lower and production environments in the same AWS account. We thought it was good to be diligent with tagging and resource naming, but that didn't really help us here for the reasons listed in the last section, even though it's still really great to be diligent with resource naming and tagging for billing and clarity of communication.

AWS allows an Organization to have multiple accounts associated with it. IAM doesn't do well at isolating resources within the same AWS account, so the first order of business is to isolate your production resources in their own AWS account. Then you can focus on access control to that account specifically.

In our Organization, we set up a security account where all the IAM user accounts live, a non-production account where users were granted almost complete admin access, and a production account where access was extremely locked down.

Before we go further, we need to understand two advanced AWS IAM features:

Permissions Boundaries
You can grant a wider array of access, then use a boundary policy to restrict the effective access to only what falls inside that boundary. Think of it like a Venn diagram of permissions: the effective permissions are the overlap between the role's policies and the boundary.
Assume Role
An IAM user can inhabit the permissions of a Role, but only the permissions applied to that role itself.
Both the UI and CLI can perform these role assumptions, and they can be done in a way that requires MFA (big plus for compliance). This feature is what will enable us to have a single IAM account per employee while allowing them to have permissions in multiple AWS accounts.

So the idea here is: you create roles with open permissions that you then restrict down with a permissions boundary, creating much stricter effective permissions, and then users can inhabit those roles.

You can define these roles any way you want. It will vary greatly depending on your organization, and this can obviously be a bit subjective. We decided to define our roles roughly along discrete job functions, which allows an employee to inhabit the role needed to complete a certain task without having to be granted full admin (unless that is what is needed).

For our production account, we created the following roles:

Admin - Complete god permissions only granted to a few, mostly just myself.
ProdAdmin - Almost the same as god permissions, except that it can't change IAM roles, view billing or make account-wide changes. Used for team leads / architects.
OnCall - Basically a copy of ProdAdmin, except that we wanted to keep this distinct in case these 2 roles diverged. Also, it's more obvious what it is.
ProdEngineer - Allows engineers to view the status of deployments, read log files, view alarm states.
CustomerSuccess - Allow our customer success team to manage our Serverless SFTP and handle issues relating to customer data delivery. They basically only had access to a few S3 buckets.

If a new job function should arise, then you can of course make new roles to suit those needs. We were working on adding a new one right before we were acquired by Salesforce that would allow some of our engineers to train and maintain production ML models; just as an example.

The advantage of this setup is that not only do you not run into annoying AWS limitations, but also you can grant or remove access to individual users as needed. This can get out of hand to manage manually, so we codified all of our IAM using Git and Terraform. This had the side effect of also allowing engineers to request a role escalation via Git that we could gate with CodeOwners, and have an audit trail for compliance.

Side note: CloudTrail does log when a user assumes a role, and will show that role's session id in any events logged while that user is in that role, so you can audit who did what and when.
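As a rough illustration of that audit trail, the sketch below links later events back to the IAM user who assumed the role. The two records are hypothetical and heavily trimmed. I'm assuming the usual CloudTrail field names (userIdentity, requestParameters, responseElements.assumedRoleUser) and that events are processed in chronological order. Account IDs and names are invented.

```python
# Hypothetical, trimmed CloudTrail records (real ones carry many more fields).
EVENTS = [
    {
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRole",
        "userIdentity": {
            "type": "IAMUser",
            "arn": "arn:aws:iam::111111111111:user/user1",
        },
        "requestParameters": {
            "roleArn": "arn:aws:iam::222222222222:role/humans/app_admin",
            "roleSessionName": "user1-session",
        },
        "responseElements": {
            "assumedRoleUser": {
                "arn": "arn:aws:sts::222222222222:assumed-role/app_admin/user1-session"
            }
        },
    },
    {
        "eventSource": "cloudformation.amazonaws.com",
        "eventName": "DescribeStacks",
        "userIdentity": {
            "type": "AssumedRole",
            "arn": "arn:aws:sts::222222222222:assumed-role/app_admin/user1-session",
        },
    },
]


def attribute_events(events):
    """Return (original user ARN, event name) pairs for role-session activity."""
    session_owner = {}  # assumed-role session ARN -> ARN of whoever assumed it
    attributed = []
    for event in events:
        identity = event["userIdentity"]
        if event["eventName"] == "AssumeRole":
            session_arn = event["responseElements"]["assumedRoleUser"]["arn"]
            session_owner[session_arn] = identity["arn"]
        elif identity.get("type") == "AssumedRole":
            owner = session_owner.get(identity["arn"], "unknown")
            attributed.append((owner, event["eventName"]))
    return attributed


for owner, action in attribute_events(EVENTS):
    print(owner, "->", action)
```

In practice you would feed this from wherever your CloudTrail logs land rather than from an inline list.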

Example Code

Enough of me describing the theory of this; here's some actual code. I'm not allowed to share the exact code (for obvious reasons), however here is an example to demonstrate how we used permission boundaries. Basically think of these 2 policy documents in a Venn diagram.

Role Permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:Describe*",
                "cloudformation:EstimateTemplateCost",
                "cloudformation:Get*",
                "cloudformation:List*",
                "cloudformation:ValidateTemplate",
                "cloudformation:Detect*"
            ],
            "Resource": "*"
        }
    ]
}

Role Boundary:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:*"
            ],
            "Resource": [
                "arn:aws:cloudformation:*:ACCOUNT_ID_REDACTED:stack/prod-app-*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:Describe*",
                "cloudformation:List*",
                "cloudformation:Get*",
                "cloudformation:GetTemplateSummary",
                "cloudformation:CreateChangeSet",
                "cloudformation:ExecuteChangeSet"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
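To see the Venn diagram in action, here is a small sketch that checks a request against both documents and only allows it when both agree. It is a simplification of the real evaluation logic: Allow statements only, no conditions, and Python's fnmatch standing in for IAM wildcard matching. The stack names and region in the sample ARNs are hypothetical.

```python
from fnmatch import fnmatchcase

# The role permissions document from above.
ROLE_POLICY = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:Describe*",
                "cloudformation:EstimateTemplateCost",
                "cloudformation:Get*",
                "cloudformation:List*",
                "cloudformation:ValidateTemplate",
                "cloudformation:Detect*",
            ],
            "Resource": "*",
        }
    ]
}

# The role boundary document from above.
BOUNDARY = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["cloudformation:*"],
            "Resource": [
                "arn:aws:cloudformation:*:ACCOUNT_ID_REDACTED:stack/prod-app-*/*"
            ],
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:Describe*",
                "cloudformation:List*",
                "cloudformation:Get*",
                "cloudformation:GetTemplateSummary",
                "cloudformation:CreateChangeSet",
                "cloudformation:ExecuteChangeSet",
            ],
            "Resource": ["*"],
        },
    ]
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def allows(policy, action, resource):
    """True if any Allow statement matches both the action and the resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_hit = any(
            fnmatchcase(action.lower(), p.lower())
            for p in _as_list(statement["Action"])
        )
        resource_hit = any(
            fnmatchcase(resource, p) for p in _as_list(statement["Resource"])
        )
        if action_hit and resource_hit:
            return True
    return False


def effectively_allowed(action, resource):
    """Effective permission is the overlap of the role policy and the boundary."""
    return allows(ROLE_POLICY, action, resource) and allows(BOUNDARY, action, resource)


# Hypothetical stack ARNs for illustration.
PROD = "arn:aws:cloudformation:us-east-1:ACCOUNT_ID_REDACTED:stack/prod-app-api/abc123"
OTHER = "arn:aws:cloudformation:us-east-1:ACCOUNT_ID_REDACTED:stack/some-other-stack/def456"

print(effectively_allowed("cloudformation:DescribeStacks", OTHER))
print(effectively_allowed("cloudformation:DetectStackDrift", PROD))
print(effectively_allowed("cloudformation:DetectStackDrift", OTHER))
print(effectively_allowed("cloudformation:ExecuteChangeSet", PROD))
```

Describe is allowed everywhere because both documents allow it. Drift detection only works on prod-app stacks, because the boundary only allows it there. ExecuteChangeSet is never allowed, because the boundary permits it but the role policy never grants it. A boundary restricts; it never grants anything on its own.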

Terraform Role Creation Module:

variable "app_admins" {
    type = list(string)
    default = [
        "user1",
        "user2",
        "user3",
        "user4",
        "user5"
    ]
}

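resource "aws_iam_role" "app_admin" {
    name = "app_admin"
    path = "/humans/"
    # Sessions last at most one hour (value is in seconds).
    max_session_duration = 3600

    # Trust policy: who may assume this role, and under what conditions.
    assume_role_policy = data.aws_iam_policy_document.app_admin_trust_policy.json
    # Caps the role's effective permissions regardless of attached policies.
    permissions_boundary = aws_iam_policy.app_admin_boundry.arn
}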

data "aws_iam_policy_document" "app_admin_trust_policy" {
    statement {
        principals {
            type = "AWS"
            identifiers = [
                for user in var.app_admins :
                "arn:aws:iam::ACCOUNT_ID_REDACTED:user/${user}"
            ]
        }

        actions = [
            "sts:AssumeRole",
            "sts:AssumeRoleWithWebIdentity"
        ]

        condition {
            test = "Bool"
            variable = "aws:MultiFactorAuthPresent"
            values = ["true"]
        }
    }
}

resource "aws_iam_policy" "app_admin" {
    name = "app_admin"
    policy = file("${path.module}/../policies/app_admin.json")
}

resource "aws_iam_policy" "app_admin_boundry" {
    name = "app_admin_boundry"
    policy = file("${path.module}/../permission_boundries/app_boundry.json")
}

resource "aws_iam_role_policy_attachment" "app_admin_attach" {
    role = aws_iam_role.app_admin.name
    policy_arn = aws_iam_policy.app_admin.arn
}

IAM Scopedown for Service Roles

A distinction that should be made is that so far we've focused on IAM for Humans: permissions that apply to employee IAM accounts. We haven't really touched on IAM for Services: permissions granted to EC2 instances and the like.

We never ended up putting this into action at Narrative Science but I stand by the design and idea.

With services, it can be quite tedious to know exhaustively all the permissions a service will need ahead of time. Even in earnest, most humans will miss a lot because AWS has just so many services and caveats. But we also need to reach least privilege access for services. Is there a way to do this that takes the burden off the developer and still lets you achieve least privilege?

The solution here I think is quite elegant. I call it IAM Scopedown.

Basically, a new project that is not yet in production is allowed to run for a time, say 1 month (but the duration is somewhat arbitrary). Initially you grant a sane but knowingly over-permissive set of permissions to the app. CloudTrail will log the actions that project performed during that time. Then you can view those logs, see which permissions the app actually needed, and edit your IAM policies based on that.

This is ripe for a script, isn't it? Well, good news: one exists for this very thing! Enter CloudTracker.

To take this idea further, I would have automation run this script and send IAM scopedown suggestions to the team managing the project, ideally with the exact output needed for an IAM policy document. This is more of an "eventually least privilege" solution, which is far better than what usually happens, where people give up and grant way too much.
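The core of the scopedown idea can be sketched in a few lines: compare what was granted against what CloudTrail saw, and emit a suggested policy. This is not how CloudTracker is implemented, just a toy version of the concept. It assumes the granted actions are listed explicitly (no wildcards) and that an IAM action can be derived as the eventSource prefix plus eventName, which is not true for every API call. The granted set and the events are invented.

```python
import json

# Hypothetical over-permissive grant for a new service.
GRANTED_ACTIONS = {
    "ec2:DescribeInstances",
    "ec2:TerminateInstances",
    "sqs:CreateQueue",
    "sqs:DeleteQueue",
    "logs:CreateLogGroup",
}

# Hypothetical, trimmed CloudTrail records from the observation window.
EVENTS = [
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"eventSource": "sqs.amazonaws.com", "eventName": "CreateQueue"},
    {"eventSource": "logs.amazonaws.com", "eventName": "CreateLogGroup"},
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
]


def used_actions(events):
    """Derive 'service:Action' strings from CloudTrail records (naive mapping)."""
    actions = set()
    for event in events:
        service = event["eventSource"].split(".")[0]
        actions.add(f"{service}:{event['eventName']}")
    return actions


def scopedown(granted, events):
    """Split the granted actions into ones that were used and ones that were not."""
    used = used_actions(events)
    return sorted(granted & used), sorted(granted - used)


def suggested_policy(actions):
    """Policy document containing only the observed actions.

    Resource is left as "*" here; a human still needs to scope that down.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": "*"}],
    }


keep, drop = scopedown(GRANTED_ACTIONS, EVENTS)
print("unused, candidates for removal:", drop)
print(json.dumps(suggested_policy(keep), indent=4))
```

The suggested policy is the kind of output I'd want automation to send to the owning team as a proposal, not apply on its own.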

It is possible to have this tool automatically scope down service policies live in production, but I'm wary of not having a human involved in something that is likely to brick production in unexpected ways. With time and further development of this idea, though, I think that is feasible.

This scopedown procedure can probably also be applied to IAM for Humans, but just because something goes unused doesn't necessarily mean it won't be needed in the future. This is why I make the distinction between Services and Humans, since they tend to have big differences in their security needs.

Conclusion

Using a few advanced IAM features and a little bit of clever design, you can make your IAM setup a LOT less tedious to deal with, and reasonably achieve least privilege.

I have to give some credit to Netflix engineers on their "security pizza" talk. I built on a lot of ideas from this talk: