How Google Is Changing the Way We Approach SRE
Highlights:
With new code and updates shipping all the time, software developers find themselves chasing bugs and putting out production fires a bit too often. Any web application that enjoys decent traffic eventually runs into challenges with overseeing deployments, monitoring performance, and reviewing error logs.
While development teams want to move fast, operations teams are cautious, fearing that things might blow up in production. This is where site reliability engineering, or SRE, comes into play.
SRE empowers software developers to take ownership of the day-to-day operation of the application in production. In doing so, it takes a considerable share of the application monitoring load off the shoulders of operations teams.
As Ben Treynor Sloss, who founded Google's SRE team, puts it, SRE is “what happens when you ask a software engineer to design an operations function.”
Endowed with a deep understanding of the application, the code and how it’s configured, site reliability engineers know exactly how it runs and scales.
SRE at Google
At Google, SRE is an integral part of engineering, understood as what happens when a software engineer is asked to solve an operational problem. Google therefore treats SRE as a mindset: a set of metrics, practices, and means to ensure system reliability.
Oftentimes, it is unclear exactly what a successful SRE implementation looks like. Google offers resources ranging from workbooks and tips to non-exhaustive checklists that teams can adapt to their own needs and priorities.
SRE is not an exact science, which means challenges will vary and continue to crop up along the way. In that sense, SRE is an ongoing journey perfected with experience and sincere efforts.
Google aims to keep critical systems up and running despite natural calamities, bandwidth outages, and configuration errors. It uses its own platforms to manage, maintain, and monitor these systems, and to repair, extend, or scale the code that keeps them working.
For the same reason, Google’s SRE teams comprise people from both systems and software backgrounds. This informed mix has been helping Google address mammoth tasks such as developing large systems ranging from planet-spanning databases to near real-time scalable data warehousing.
Managing a range of systems and catering to a user population measured in billions, Google drives reliability and performance by mastering the full depth of the stack.
Automating jobs is key to SRE
Google sets an explicit limit on the amount of time a team member may spend on toil, meaning manual, repetitive operational work.
While some take this limit as a cap, Google encourages its customers to look at it as a guarantee: a means of cultivating an engineering-based approach to problems instead of toiling at them aimlessly and laboriously.
In a typical Google environment, teams see reduced mean time to repair (MTTR) and greater agility for developers, since detecting problems early means less time and fewer challenges in fixing them. With Google-style SRE, slow recovery from problems becomes far less of a concern.
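To make the MTTR metric concrete, here is a minimal sketch in Python. It assumes each incident is recorded as a pair of timestamps and that repair time is measured from detection to resolution, which is one common convention rather than anything this article prescribes. The helper name and the incident data are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average time from detection to resolution (hypothetical helper).

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the timedeltas can be summed
    )
    return total / len(incidents)


# Made-up incident log: (detected, resolved)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),    # 30 min
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),  # 90 min
    (datetime(2024, 1, 20, 9, 0), datetime(2024, 1, 20, 9, 15)),    # 15 min
]

mttr = mean_time_to_repair(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number over time is what shows whether earlier detection is actually shortening repairs.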
SRE the Google way
Google’s SRE team is a mix of academic and intellectual backgrounds. While doing work that has been historically done by the operations team, the SREs have software expertise with a predisposition and ability to design and implement automation to replace human labor.
While doing so, they stay focused on their core: engineering. Without engineering, it is impossible to keep pace with the growing workload, and a conventional ops-focused group ends up scaling linearly with service size.
Google places a 50% cap on the average ‘ops’ work (on-call, tickets, manual tasks, and so on) for all SREs. The cap keeps the workload manageable and ensures that the SRE team has enough time to make the service stable and operable.
Over time, the SRE team is expected to carry very little operational work and to engage actively in development tasks. The goal is an ‘automatic’ environment, not just an automated one, in which systems run and repair themselves.
Google expects SRE teams to spend the remaining 50% of their time on development, so the way SRE time is spent is closely monitored. When ops work exceeds the cap, restoring the balance could require shifting some of the work back to the development team, or adding staff without assigning the team additional operational responsibilities. Either way, the aim is a balance between development and ops tasks that gives SREs greater bandwidth for autonomous engineering.
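Monitoring how SRE time is spent can be as simple as comparing logged hours against the 50% cap. The sketch below is a hypothetical illustration, not Google's tooling: the category names, the weekly hours, and the `ops_fraction` helper are all assumptions.

```python
OPS_CAP = 0.50  # the 50% cap on ops work described above

# Assumed labels for work that counts as 'ops'
OPS_CATEGORIES = ("on-call", "tickets", "manual")


def ops_fraction(hours_by_category, ops_categories=OPS_CATEGORIES):
    """Return the share of logged hours spent on ops work (hypothetical helper)."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for cat, h in hours_by_category.items() if cat in ops_categories)
    return ops / total


# Made-up timesheet for one engineer's week, in hours
week = {"on-call": 10, "tickets": 8, "manual": 6, "development": 16}

fraction = ops_fraction(week)   # 24 ops hours out of 40 logged
over_cap = fraction > OPS_CAP

print(f"ops share: {fraction:.0%}, over cap: {over_cap}")
```

A team that stays over the cap for several periods in a row is the signal to push work back to developers or to add staff.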
This approach has many advantages. These include:
- Bridging the gap between ops and development teams
- Constant monitoring and analysis of application performance
- Effective planning and maintenance of operational runbooks
- Meaningful contribution towards overall product roadmap
- Managing on-call and emergency support
- Ensuring good logging and diagnostics for software
Our approach to SRE
While Google continues to offer unmatched capabilities with SRE, we take responsibility for offering viable, customizable SRE to our customers while keeping the signature benefits intact. Our SRE offering is backed by our NexGen platform.
The SRE team at Parkar is responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
We ensure a durable focus on engineering, enabling teams to move fast without breaching any service level objective (SLO).
At Parkar, the SRE team has two goals:
- A short-term goal to fulfill the product’s business needs by providing an operationally stable system that is available and scales with demand, with an eye on maintainability, and
- A long-term goal to optimize service operations to a level where ongoing human work is no longer needed, so the SRE team can move on to the next high-value engagement.
Proactive planning and coordinated execution ensure that the SRE team meets expectations and product goals while optimizing operations and reducing operational costs.
The planning is done at two connected levels:
- With developer leadership, priorities are set for products and services and yearly roadmaps are published.
- The roadmaps are reviewed and updated on a regular basis, and goals, quarterly or otherwise, are derived that line up with the roadmap.
Some of our key SRE aspects include:
- Reliability – Maintaining a high level of network and application availability
- Monitoring – Implementing performance metrics and establishing benchmarks for better monitoring
- Alerting – Promptly identifying issues and ensuring that there is a closed loop support process in place to solve them
- Infrastructure – Understanding cloud and physical infrastructure scalability and limitations
- Application Engineering – Understanding application requirements including testing and readiness needs
- Debugging – Taking into account specifics pertaining to systems, log files, code, use case and troubleshooting to debug as required
- Security – Understanding common security issues, as well as tracking and addressing vulnerabilities, to ensure systems are properly secured
- Documentation – Prescribing solutions, production support playbooks, etc. keeping in line with best practices
- Best Practice Training – Promoting and evangelizing SRE best practices through production readiness reviews, blameless post-mortems, technical talks, and tooling
The Parkar SRE team enabled a leading US retail organization to make its monitoring and alerting more efficient. As a result, the organization attained very high site availability and vastly improved performance, with less manual effort needed to manage the overall site.
The early wins include:
- Achieved a 90% rate of fast identification and removal of production issues.
- Achieved 99.99% reliability and availability.
- Achieved an 85% improvement in the efficiency of monitoring and alerts.
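To put the 99.99% availability figure in perspective, the arithmetic below converts an availability percentage into the downtime it allows. This is a sketch only: the 30-day and 365-day measurement windows are assumptions, since the article does not say over what period availability was measured, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


monthly = allowed_downtime_minutes(99.99)        # assumed 30-day window
yearly = allowed_downtime_minutes(99.99, 365)    # assumed 365-day window

print(f"30 days:  {monthly:.2f} minutes")   # about 4.32 minutes
print(f"365 days: {yearly:.2f} minutes")    # about 52.56 minutes
```

At this target there are only a few minutes of downtime to spare each month, which is why fast detection and automated recovery matter so much.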
SRE onboarding
While there are a few basic things to consider, SRE onboarding rules are not written in stone; they vary from one organization to another. Organizations need to understand how they can benefit from embracing SRE, and identifying implementation and operational deficiencies can go a long way toward effective adoption. Once the decision to embrace SRE is made, it becomes necessary to identify bug fixes and process changes, and to determine the required service behavior, before onboarding the service.
Let us talk to assess your environment and discover a whole new world of possibilities.
About Parkar Digital
Parkar Digital, a Gold Certified Microsoft Azure partner, provides technology solutions for Digital Healthcare, Digital Retail & CPG. Our solutions are powered by the Parkar platforms built using Cloud, Opensource, and Customer experience technologies. Our goal is to empower a customer-first approach with digital technologies to deliver human-centric solutions for the clients.
THE AUTHOR
Amit Gandhi
As the Co-Founder and CTO for Parkar Digital, Amit leads the Technology and Engineering teams and is responsible for designing and implementing innovative technology solutions for clients across various industries.