Jason Chan is Cloud Security Architect at Netflix. In this Security > 140 conversation, we discuss some of the innovations that Netflix has applied to its security in AWS and what other enterprises can learn from their pioneering experiences.
GP: Jason, your practitioner's perspective on AWS in particular, and your overall approach to security at Netflix in general, is something I think the industry can benefit from. You use an evocative image to describe the shared security responsibility for AWS users: a rock climber way up high (you), and on the ground holding the belay rope, Amazon. It shows in a nutshell where the risk lies. I am curious, though, how well defined the interface between the AWS user and Amazon is. Do the carabiner and harness give you a simple way to engage with Amazon security services? What problems do you face here with integration?
Jason Chan: I do like that rock climbing analogy - I think anyone that's been paying attention to cloud security these past few years is familiar with the notion of shared responsibility for security, but it's critical that the true nature of that relationship be understood. Cloud service providers have a vested interest in meeting their customers' security requirements, but ultimately the customer holds the responsibility.
Regarding AWS' approach and interface for security - I think their Security and Compliance Center and product documentation are good places for customers and prospective customers to start. The tricky part about AWS security is really related to common adoption patterns. AWS use within an organization can spread quickly and organically, and before you know it non-trivial apps and architectures are deployed. However, it is hard to grok all the details of AWS security (including both recommendations and gotchas) until you have some experience under your belt - which is of course too late if you're deploying quickly and widely.
With regard to overall security integration with AWS, the biggest thing we're missing is access to a general purpose event notification system for AWS audit and logging events. For example - when are users added, when are firewall rules changed, when is an access control policy deleted. This missing link has driven much of our security approach, especially with regard to our Security Monkey framework, which relies heavily on ongoing monitoring and analysis to fill these gaps.
On the positive side, of course everything is API-driven, which makes integration with new and existing tools much more straightforward.
GP: That quick, organic spread is the bane of most Infosec groups. Wherever you go, it seems like security is always lagging behind the development process and playing catch-up. I guess the Cloud makes this even more challenging because it speeds up development, so security needs to move faster too.
One approach can be to use patterns and frameworks so that when you get the old "...and we are going live in a month" in your first conversation with a team, at least there are proven building blocks (patterns) to help move forward. It seems like AWS has a lot of these to build on for access keys, certificates, MFA, and some basic IAM functionality, but what about authorization? In my experience, this can be challenging because it's domain or app specific. Have you seen cases where more dynamic and granular attribute-based access control is required, rather than a static ACL kind of mapping?
JC: First - to tackle the building blocks and patterns component - that's absolutely a key element of how we operate. We are a very loosely coupled organization, but we share platform components, lessons learned, tested patterns, etc. across the engineering organization. We use the Simian Army to validate that those patterns and techniques have been implemented. This allows us to move fast and ensure quality without potentially heavyweight and burdensome constructs like architectural review boards and change approval meetings.
Regarding AWS and IAM, I like the direction they are moving and the building blocks they are making available. They have a rich and comprehensible policy language, and the newly released "EC2 Roles" feature provides a means of delivering short-lived credentials for access to services from running instances (which is a core part of how our service operates). We had solved that problem separately prior to the AWS release, but we are looking to migrate to their offering to allow us to focus on other unsolved problems. I would like to see AWS standardize more on their access model, though, with regard to controlling service and resource access. Some services (e.g. SQS, S3) have a pretty full-featured access control model that applies to the service and API as well as the resources involved (e.g. I can control who lists a storage container, as well as separately protect the objects in the container). Other services like EC2 have a much coarser model. You can restrict access to APIs, but you can't restrict which resources those restrictions apply to. So, if I grant a user rights to terminate an EC2 instance, they can terminate any EC2 instance. This makes segregation a bit more of a challenge.
The way I look at it, the more AWS and other cloud vendors can provide usable, scalable, dynamic and fine-grained access control mechanisms, the faster their customers will be able to innovate.
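To make the contrast between the two access models concrete, here is a minimal sketch in Python. The policy documents follow the general shape of AWS IAM policy JSON, but the bucket name is hypothetical and `is_allowed` is a toy matcher written for this illustration, not AWS's actual evaluation logic (it ignores Deny statements, conditions, and list-valued fields). The EC2 policy uses `"Resource": "*"` to mirror the coarse model described above, where a grant to terminate instances applies to every instance.

```python
from fnmatch import fnmatchcase

# Fine-grained, S3-style: listing the container and reading its objects
# are granted separately, and object access is scoped to one prefix.
# "example-bucket" is a hypothetical name.
S3_READ_REPORTS = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": "arn:aws:s3:::example-bucket"},
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/reports/*"},
    ],
}

# Coarse, EC2-style as described in the interview: the API call can be
# restricted, but not the set of instances it applies to.
EC2_TERMINATE_ANY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:TerminateInstances",
         "Resource": "*"},
    ],
}


def is_allowed(policy, action, resource):
    """Toy check: True if any Allow statement matches action and resource."""
    for stmt in policy["Statement"]:
        if (stmt["Effect"] == "Allow"
                and fnmatchcase(action, stmt["Action"])
                and fnmatchcase(resource, stmt["Resource"])):
            return True
    return False
```

With these definitions, `is_allowed(S3_READ_REPORTS, "s3:GetObject", "arn:aws:s3:::example-bucket/private/key.pem")` is false because the object falls outside the `reports/` prefix, while `is_allowed(EC2_TERMINATE_ANY, "ec2:TerminateInstances", ...)` is true for any instance identifier, which is the segregation problem in miniature.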
GP: I assume that the Simian Army and Security Monkey follow the guidance I have heard out of Netflix's Chaos Monkey - "the best way to avoid failure is to fail constantly." Availability is an important part of security in most companies. Are there any security-specific aspects to the "fail constantly" testing regime? What things have worked well that companies can utilize to improve their own vulnerability assessment processes?
JC: To give a little more info on the Simian Army - there are a few approaches that they take. The approach that the Chaos Family (including Monkey and Gorilla) and Latency Monkey use is to inject faults into the production environment to see whether deployed applications and systems can robustly handle and recover from these issues.
Some of the other members (namely Security, Conformity, and Janitor) take some different approaches. They are looking for adherence to known good practices and architectures for secure, distributed, cloud-based systems. One of the key cloud tenets we embrace is self-service - we want to make it as easy as possible for engineers to leverage the cloud and the platform we've provided. This freedom sometimes leads to suboptimal deployment, communication, and security implementations, so we use these Monkeys to seek out these anomalies (and in many cases, make automatic corrections).
In terms of what others can leverage, we are open sourcing many of the Monkeys (in fact, Chaos Monkey was just open sourced on 7/30), so others are free to reuse and extend what we've created. At a more abstract level, I think it's really about minimizing any friction you create with the inbuilt advantages that the cloud provides (e.g. broad accessibility, self-service, elasticity). The more you put in between cloud users and the benefits that the cloud provides, the less value you'll derive from cloud deployment. To do this safely, you have to refresh your thinking around monitoring, architectural pattern compliance, and security assessment.
GP: Can you share an illustrative end-to-end example of a vulnerability or class of vulns that Security Monkey looks for, how you identify whether it's worked or not, and how remediation fits in?
JC: A good example would be around how we use Security Monkey to evaluate AWS security groups (the Amazon implementation of a hypervisor-level firewall, used to control network access between instances). Security Monkey monitors our configuration of security groups, and will alert on simple add/change/delete events. These types of events are standard in our environment, so they are not necessarily a call to action.
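In the absence of a native event feed, add/change/delete alerts of this kind can be derived by comparing successive configuration snapshots. The following is a minimal sketch of that idea, not Security Monkey's actual implementation: the snapshot format (group ID mapped to a set of `(protocol, port, source)` tuples), the group names, and the function name are all assumptions made for illustration.

```python
def diff_security_groups(previous, current):
    """Compare two snapshots and return (event, group_id) pairs.

    Each snapshot maps a group ID to a frozenset of rule tuples.
    This layout is hypothetical, chosen to keep the sketch small.
    """
    events = []
    for group_id in sorted(previous.keys() | current.keys()):
        if group_id not in previous:
            events.append(("added", group_id))
        elif group_id not in current:
            events.append(("deleted", group_id))
        elif previous[group_id] != current[group_id]:
            events.append(("changed", group_id))
    return events


# Two hypothetical snapshots taken on consecutive polling runs.
previous = {
    "sg-web": frozenset({("tcp", 443, "0.0.0.0/0")}),
    "sg-db": frozenset({("tcp", 3306, "10.0.0.0/8")}),
    "sg-old": frozenset(),
}
current = {
    "sg-web": frozenset({("tcp", 443, "0.0.0.0/0"), ("tcp", 23, "0.0.0.0/0")}),
    "sg-db": frozenset({("tcp", 3306, "10.0.0.0/8")}),
    "sg-new": frozenset({("tcp", 22, "10.0.0.0/8")}),
}

events = diff_security_groups(previous, current)
```

Here `events` reports `sg-new` as added, `sg-old` as deleted, and `sg-web` as changed, while the untouched `sg-db` produces nothing. A polling approach like this only sees the net difference between runs, so its resolution depends on how often snapshots are taken.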
Security Monkey also looks for potential misconfiguration of security group rules. For example, a security group being opened to the Internet, using an unsafe port/service (e.g. Telnet, FTP), providing access across AWS accounts, etc. These kinds of configurations can signify a serious vulnerability, so we also alert on them and aggregate them up to a sort of "Exposures" report that is easy to evaluate and use for additional investigation (if needed). We don't take automatic action on these, as they often require some human analysis to establish validity. We also have the ability to whitelist and document exceptions for valid cases that would otherwise trip these alarms.
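The checks described here (Internet exposure, unsafe services, cross-account access, and documented exceptions) can be sketched as a small rule auditor. This is an illustrative sketch only, not Security Monkey's code: the `Rule` fields, the account IDs, the port list, and `find_exposures` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

UNSAFE_PORTS = {21: "FTP", 23: "Telnet"}  # the examples named in the interview
OWN_ACCOUNT = "111111111111"              # hypothetical account ID


@dataclass(frozen=True)
class Rule:
    """One inbound security group rule (simplified, hypothetical layout)."""
    group: str
    port: int
    source_cidr: Optional[str] = None
    source_account: Optional[str] = None


def find_exposures(rules, whitelist=frozenset()):
    """Return (rule, reasons) pairs for rules that look risky.

    Whitelisted rules are skipped, modelling documented exceptions.
    Nothing is auto-remediated; the result is meant for human review.
    """
    exposures = []
    for rule in rules:
        if rule in whitelist:
            continue
        reasons = []
        if rule.source_cidr == "0.0.0.0/0":
            reasons.append("open to the Internet")
        if rule.port in UNSAFE_PORTS:
            reasons.append(f"unsafe service ({UNSAFE_PORTS[rule.port]})")
        if rule.source_account is not None and rule.source_account != OWN_ACCOUNT:
            reasons.append("cross-account access")
        if reasons:
            exposures.append((rule, reasons))
    return exposures


public_https = Rule("sg-web", 443, source_cidr="0.0.0.0/0")
internal_telnet = Rule("sg-legacy", 23, source_cidr="10.0.0.0/8")
partner_access = Rule("sg-partner", 8080, source_account="222222222222")
internal_db = Rule("sg-db", 3306, source_cidr="10.0.0.0/8")

report = find_exposures(
    [public_https, internal_telnet, partner_access, internal_db],
    whitelist=frozenset({public_https}),  # a public HTTPS front end is expected
)
```

In this example the public HTTPS rule is suppressed by the whitelist, the internal database rule raises nothing, and the Telnet and cross-account rules land in `report` with their reasons attached, which is roughly the shape an "Exposures" report would need for follow-up investigation.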
On a somewhat related front, several of us at Netflix are working with some folks from the open source community on a new security assessment tool called Gauntlt. It's a framework intended to mesh behavior-driven development with security and security testing. I think it's got a lot of potential, especially in shops like ours where developers are more involved with deployment and operational support and continuous integration and deployment are used quite broadly.
GP: Do you have anything on the back end reporting that discerns between Security Monkey and an attack, so that you know - "the calls are coming from inside the house"?
JC: We do have a means of differentiating, and this actually brings up another good point about security and operating in the cloud. We can write to our production timeline system that tracks important events (e.g. code deployments, configuration changes, etc.). That way, our SRE team and other interested parties can easily tell if scanning is occurring, etc. Also, we plug into our other security tools (e.g. web application firewalls), to register/whitelist hosts.
The key here is that at scale (and with the nature of cloud elasticity and ephemerality), you need to do this via API and automation, not via someone manually updating an administrative UI. This is a point that I'm finding some security vendors have not picked up on yet. "Cloud Security" is a popular term, but in terms of tools, controls, etc. - if it does not have an API, it is difficult or impossible to use and integrate in our environment. This has become a real differentiator for security vendors, in my opinion.
GP: I see the same thing. Many security tools have some good raw capabilities, but they require so much manual effort. I tend to think of it as an integration gap - better/simpler APIs and automation would help close the gap and help security teams.
When I looked at Gauntlt, it seemed like a concrete implementation of the objectives of Security Monkey, where your web app is constantly attacked by Metasploit, fuzzers, Nessus, and other tools. It's also interesting to see a DSL emerging here. Can you talk a little about the Gauntlt project and where you see it going?
JC: Sure, I don't want to speak too officially for Gauntlt; I am working on it but I defer to James Wickett for the most authoritative perspective on roadmap and progression. As I see it, it can become a great framework for packaging, executing, and normalizing security testing tools and resulting output. There are so many great security tools out there, but it's non-trivial (for experts or non-experts) to select a decent toolset, determine how they should be configured and run for your environment, and intelligently evaluate the output of the chosen toolset. I think Gauntlt can help facilitate this process, and hence provide an easier mechanism for integrating security testing into development lifecycles of all shapes and sizes.