SRE: An incomplete guide to cultural Narnia
SRE has recently become a popular topic of discussion at many of the companies I have interacted with. What is SRE? Who are SREs? How do we get it? I, just like everyone else, have opinions on this topic. However, there is common ground between us all: SRE is not only about tooling and tech; it is primarily a CULTURAL shift within companies. As a disclaimer, these are just my opinions and experiences from building organizations like this and talking to other companies that have implemented or are implementing it. There is no prescription for building SRE. Everyone will find a way to make it fit their company and build it according to their current operating model. Forcing it in because it's the new fad is not the best approach, but that's up to you...
Throughout this post I am going to use a few terms that I would rather define here so you don't have to look them up. These will be short and sweet definitions, and we will dive into them in more depth later on.
SRE - Site Reliability Engineer, an individual who possesses an interest in infrastructure and operations, with a software engineering background.
SWE - Software Engineer.
Infrastructure SRE - A variation of SRE that works on infrastructure-related projects (monitoring, provisioning, IaaS, etc.).
Application/Service SRE - A variation of SRE that is dedicated to a specific Application/Service group. These are the groups building the products that generate revenue for your company (hopefully).
Now let's get to it...
What/Who is SRE?
The definition of SRE is quite scattered, and I don't think anyone truly knows what it means. And no, unicorns will not show up below. This is what SRE means to me:
SRE is a core group of individuals who have a wide array of skill sets. These skill sets range from operations, networking, development, hardware, distributed systems, monitoring, stability, capacity planning, etc.
SREs are responsible for building out and scaling all aspects of infrastructure for any team looking for support.
SRE is not just about the technology. SRE is a mindset, thought process, and cultural shift. These are traits that seasoned individuals will bring to the organization to help drive it forward.
SRE shouldn't be made mandatory for everyone in your company. Teams will have the choice to fund SRE support if they need/want it.
Areas of Responsibility
Technology has so many different aspects that it may seem hard to narrow down what SREs should actually be spending their time and energy on. Since these individuals will not be directly building public-facing products or features (depending on your company, of course), what exactly will they be covering and helping with? The list below is by no means exhaustive, but it covers the general areas where you should start to focus your SRE organization. Without this underlying infrastructure plumbing being properly built, monitored, scaled, etc., it will be difficult to preach this new cultural mindset to outside team members. The items below may seem very infrastructure heavy. They are and they aren't. Many of them can be split between Inf and App SRE: one is the primary producer while the other is the primary consumer.
- Configuration Management & Automation
- Big Data / Data Warehousing
- Core Infrastructure Services, Tooling and Provisioning
- Capacity Planning & System Performance
- Documentation and Runbooks
- Incident Response
Is SRE a single team or multiple teams?
This is an area where lots of people have questions, concerns, and comments. I highly recommend splitting SRE logically into multiple teams (infrastructure/monitoring/tooling/apps/services/etc.) and SRE types. Even though they are split in some sense, the reporting chain must remain centralized throughout the organization. Breaking this centralized hierarchy will, in my opinion, result in an unsuccessful SRE implementation. The reason I think this needs to be centralized is that SREs should report to like-minded individuals. These individuals (individuals/leads) will assist in building the tools the SREs work with, which is crucial to the success of the team. They will also provide information about common tool sets, best practices, and architectural guidance. One thing worth considering is that your initial Application/Service SREs could be members of your Application/Service team who have an interest in operations and infrastructure. You can use them initially, since they are already closest to the code, and eventually either convert them or let them return to the team as SWEs.
There will be different levels of support that you offer to your application SWEs and to infrastructure. The Application SREs will spend 70% of their time with the Application/Service team, while the other 30% will be spent assisting the other areas of SRE (mostly infrastructure-related items). The Infrastructure SREs will focus primarily on infrastructure-related products, services, etc. and will interact less directly with the application team members developing the features. The two don't necessarily work separately, but they complement each other heavily.
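As a rough illustration of the 70/30 split described above, here is a minimal sketch that turns it into hours. The 40-hour week and the names `WeeklySplit` and `split_week` are my own assumptions for the example, not a rule; adjust the share to whatever fits your organization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklySplit:
    app_team_hours: float   # time spent with the Application/Service team
    other_sre_hours: float  # time spent helping other areas of SRE


def split_week(total_hours: float, app_share: float = 0.70) -> WeeklySplit:
    """Split an Application SRE's week using the 70/30 guideline.

    total_hours is an assumption (e.g. a 40-hour week); app_share is the
    fraction of time dedicated to the Application/Service team.
    """
    if not 0.0 <= app_share <= 1.0:
        raise ValueError("app_share must be between 0 and 1")
    app_hours = round(total_hours * app_share, 2)
    return WeeklySplit(app_hours, round(total_hours - app_hours, 2))


print(split_week(40))
```

In a 40-hour week that works out to 28 hours with the application team and 12 hours on everything else, which is a useful sanity check when the "other 30%" starts quietly disappearing.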
You also want to make sure that your SREs do not become a dumping ground for work that the application teams do not want to do. This is something I have seen in the past, and it quickly breaks the relationship between the embedded team and the application/service team. You should be offloading some of the operations-related work to the application/service team (5-10%) so they stay aware of their surroundings and understand what is happening under the hood. This also enables individuals to move teams once the project can run autonomously and no longer needs full-time SRE support.
Application/Service SRE vs Infrastructure SRE
Application/Service SRE:
- Embedded for Application/Service Teams
- Support within the Application/Service Team
- Automation for apps and services for the team
- Monitoring / Metrics for Application/Service Team
- Benchmarking and performance assistance for code
- Infrastructure deployment and architecture for applications and services
- Write documentation & runbooks for alerts issued by application/service stack
Infrastructure SRE:
- Build & Management of core Tech Infrastructure (provisioning, OS, DNS, DHCP, networking, etc.)
- Automation of infrastructure services (telemetry, log aggregation, Chef, etc.)
- Build and Document tools/services for outside teams to consume
- Support for Application SRE teams
- Building Infrastructure as Code
- Architectural guidelines and assistance for Application SRE teams
- Institute best practices and documentation
Organizational structure is something that should be addressed early on when building SRE. As I said previously, this organization must be centralized to be successful. I have created a very high-level chart that shows the logical line of separation between App SRE and Inf SRE. The leads can handle multiple App / Infrastructure teams depending on the type of project. You can scale the leads / managers / directors / etc. as you see fit within your organization, but this is how I like to chop things up.
Example SRE Interaction
Below is an example interaction between your typical Ruby application team and an SRE team. You will notice I added an additional level of support at the bottom. This piece isn't necessary, but it is 100% nice to have so nobody gets burnt out. The duties of frontline support are to follow runbooks and handle the initial brunt of the alerts coming in from your monitoring system. This staff will get the first wave of pages if an incident occurs. There are many companies that you can outsource this to, or you can hire people in-house to handle the job. It is a great starting ground for junior-level team members who are just getting into operations and want to gain as much knowledge of the stack as possible. At some past companies we used to throw the new SREs on call after the first week or so and buddy them up with a more seasoned member. This allowed them to learn by fire, and I have found it quite useful when onboarding a new team member.
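To make the frontline idea a bit more concrete, here is a minimal sketch of how the first wave of pages could be routed. Everything named here (`Alert`, `route_page`, the tier names) is hypothetical, and the rule that an alert escalates to the on-call SRE when there is no runbook, or the runbook did not fix it, is my assumption about how you might wire this up rather than a description of any particular paging tool.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tier names, used only for illustration.
FRONTLINE = "frontline"
ON_CALL_SRE = "on_call_sre"


@dataclass
class Alert:
    name: str
    has_runbook: bool               # is there a runbook for this alert?
    runbook_resolved: bool = False  # did following the runbook fix it?


def route_page(alert: Alert) -> List[str]:
    """Return the tiers that get paged, in order.

    Frontline always takes the first wave. The alert is escalated to the
    on-call SRE only if there is no runbook or the runbook did not help.
    """
    paged = [FRONTLINE]
    if not (alert.has_runbook and alert.runbook_resolved):
        paged.append(ON_CALL_SRE)
    return paged


print(route_page(Alert("disk_full", has_runbook=True, runbook_resolved=True)))
print(route_page(Alert("mystery_latency", has_runbook=False)))
```

The point of the sketch is the shape, not the code: good runbooks keep pages at the frontline tier, and every alert that escalates without a runbook is a hint that documentation is missing.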
There is much more information that I am leaving out of this blog post. I just wanted to touch on some very high-level opinions that I have and give everyone some initial insight into what they should be thinking about when they want to join the SRE journey. There are plenty of examples out there of companies that have implemented this culture within their organization. Just remember that there is no formula, and just because it worked there doesn't mean it will work for you. You have to really push the understanding of the cultural shift within your organization and make sure that everyone is ready for it. As long as everyone works together and collaborates, the possibilities are endless. I am also more than happy to come share my ideas and thoughts with anyone, and all my contact information is on the main page. Thanks for reading!