SRE Thinking and Practice-睿象云平台

SRE Thinking and Practice

本站部分文章、图片属于网络上可搜索到的公开信息，均用于学习和交流用途，不能代表睿象云的观点、立场或意见。我们接受网民的监督，如发现任何违法内容或侵犯了您的权益，请第一时间联系小编邮箱jiasou666@gmail.com 处理。

SRE Thinking and Practice

Written by Ruilong Deng, An Nie

Key Points Overview

1. Operations Idea: Specify the operations objects, start from the service life cycle, divide the stages, extract the key points, build the operations scenarios, and precipitate the operations platforms.

2. Reliability System: It advocates solving problems from the "Trinity" system, and fault prevention is far more effective than fault emergency. By doing standardized construction, container deployment, service governance, change management, active-active architecture, anti-architecture fission, etc., the MTBF time is extended to meet the demand of high reliability. If the prevention doesn’t work, then embrace the fault handling methodology, and use emergency plan, monitoring and observability tools, and mechanism to maximally shorten the MTTR time, etc.

Technology Overview

Team Model

Operations Practice

Google SRE divides the service life cycle into five stages. It believes that the service is born at the moment when the idea is determined, then the architecture design is beginning, then the development is active, then the gray scale / full scale release is carried out, and finally the service is abandoned, ending its life. From the perspective of operations, standardization construction is kick-off in the architecture design stage; In the development stage, service access process is open; In the gray/full scale release stage, continuous delivery and continuous operations keep going; Finally, the service goes offline. How do we get involved in operations in several stages of the service life cycle?

Standardization Construction

The key ideas are as follows:

2. Architecture Re-Modeling: Practice the idea of cloud native architecture, focus on business services, focus on value and agile delivery, smooth the differences of IaaS resources through container technology, and smooth the differences of multi clouds environment through service governance.

Key practices are as follows:

1. To Operations:

Services Unification, mainly for the removal of mixed deployment and domain name splitting. In the virtual machine era, in order to improve resource utilization, a large number of services are mixed together, which brings disastrous troubles to SRE. For example, in the mixed deployment state, ODP services interfere with each other, so it is impossible to accurately control deployment and control permissions according to the service dimension, and it is impossible to conduct accurate capacity evaluation and so on. Domain name is split from the WWW public one, so that each business has an independent domain name, facilitating north-south traffic scheduling and container migration.

Network Unification, mainly refers to the unification of fiber lines access points to IDC POP points, the unification of physical fiber lines to transmission lines, the normalization of network detection protocols to BFD protocols, and the normalization of QoS control to CPE.

2. To Architecture:

Service Discovery Unification, mainly normalizes East-West service discovery to SVC and North-South service discovery to DNS. In the early years, BNS was used for service discovery. Later, ZnS was evolved to support multi clouds deployment. Both of them belong to the central architecture. Later, they evolved to the current SVC mode, changing the previous style from the central architecture mode to the autonomous service discovery mode within the cluster, so as to avoid mutual influence between the clouds.

Framework Unification, mainly converges the PHP framework to ODP, and the Go framework is converged to the ZGIN framework.

Service Access

In the service access stage, the main idea is to let the business users do multiple-choice questions instead of filling in the blanks. At the same time, we will instill the concept of standardization into the users. Only by following our rules can they enjoy the full operations service when opening box.

Continuous Delivery

The concept of continuous delivery is slightly generalized. It covers all change delivery scenarios, including business changes and operations changes. Main ideas are as follows:

1. Embrace Methodology. For the direction of the change, the methodology in the industry is relatively mature. It is OK to copy them strictly according to the five military rules of the change.

2. Reshape the Change Plane. Introduce the main responsibility idea, clear the work boundary, change the co-operation mode, move the changes with business logic down to the business, and move the changes of operations attributes up to SRE.

3. Delivery Platform. Operations change is also a (global) variable that causes failure. It is better not to rely on the professionalism of people in the change method, but to apply military regulations / SOP into platforms, so that SRE can easily click and mutual backup at low cost.

Key practices are mainly reflected in R&D changes and SRE changes.

Continuous Operations

Reliability construction can be broadly divided into three parts: business quality, architecture unification and operations control. Of course, behind each dimension, they also have many corresponding services.

Generally, if a business wants to iterate faster, it actually needs a good service governance by the architecture; At the same time, if the business wants to iterate more stably, it actually needs elegant management and control of operations; Operations should also focus on the architecture measurement through technical means to prevent reliability problems caused by architecture fission; The architecture also needs more consideration, how to design and evolve will not lead operations to out of control. Additionally, whether it is business, operations, or architecture, if you want to extremely improve the reliability, three of them need standardization construction at the same time. Of course, these are not enough. Thinking from the bottom logic, let the business focus on value. We also need to change the new cooperation mode. From the perspective of operations, things with business logic are moved down to the business, such as configuration publishing, schedules task publishing, and business logic mixed by the access layer. From the business perspective, move the non-business logic things to the architecture, such as service discovery, unified authentication, traffic scheduling, etc. From the perspective of architecture, moving the ToC(operations) control plane up to the operations, such as capacity management and traffic scheduling, can't let the business directly penetrate the operations control and directly hit our containers. Finally, in addition to the three, we have also have a powerful organization, which can ensure the above-mentioned things running well.

This is the “Trinity” reliability system of ZUOYEBANG, as shown in the figure below.

After running in the above system, we naturally have such operations ideas.

1. Strengthen the Sense of Responsibility, clarify the responsibility of the service provider, return the business problems to the business team, return the operations problems to the operations team, and return the architecture problem to the architecture team.

2. Strengthen the Concept of the Platform, avoid repeated toils, and pay attention to the precipitation of sustainable operations ability. We believe that the platform is the most understandable and inheritable way.

3. Strengthen the Practice of Data Driving, establish multi-dimensional measurement platform, make the data analysis capability open, specify the responsibility for different problems, and make good use of TC organization power to drive the continuous optimization and evolution of the whole system.

Continuous Operations – Active-Active Architecture

2. Build Observability for Cross Clouds, go deep into the business, find out the business code, configuration characteristics and service dependency behavior, and establish cross cloud observability capability and perceive the source & target details of cross cloud traffic using the CPE tuple data from fiber lines and the eBPF data from the kernel ConnTracks.

3. Disconnect Network Between Clouds, bravely moving forward into the industry no man's land, and forcefully landing the network disconnection between the clouds, strive to solve the problems at the root, prevent the new non-standard problems or the fission of active-active architectures, and establish an active-active cloud measurement perspective to constrain the architecture behavior and guide the business update.

4. Approval from an Independent Perspective, introducing the chaos engineering idea, safely breaking the specification of its explosion radius, conducting the global single cloud fault injection drill from the online, and introducing the QA perspective, so that they can objectively accept the active-active capability from the user's perspective.

Continuous Operations – Emergency Plan Platform

As mentioned above, the standardization construction, service access, continuous delivery and active-active architecture mainly focus on fault prevention and strive to extend MTBF time to achieve high reliability. If the prevention doesn’t work, we will try our best to shorten the MTTR time as more as possible.

2. Emergency Plan Boundary: the emergency plan should not have business logic. If there is, it should be moved down to the business layer for implementation.

3. Emergency Plan Reliability: the emergency plan belongs to the strong control plane, and its reliability needs to be higher than the reliability of the business applications. The data storage needs to be deployed in a modular multi master manner, and circular dependency is not allowed, especially the strong dependency on the portal account system.

Continuous Operations – Traffic Scheduling

The SRE team focuses on the overall traffic scheduling in the north-south direction, the architecture team focuses on the long tail traffic scheduling in the east-west direction, and the business team focuses on the characteristic traffic scheduling, such as sharding traffic to different clouds by lesson. The scheduling object is the domain name. The scheduling starts from the public network entrance. Once the traffic enters the cloud, the single cloud is closed-loop, except for the underlying data storage, cross cloud SaaS traffic requests are not allowed. There are two scheduling methods, one is precise scheduling by DOH according to weight, and the other is non precise shunting scheduling by DNS according to resolution line.

Continuous Operations – Scope & Analysis

1. Build the Foundation. According to the three observable gold indicators, at the beginning of standardization, define the log specifications, and divide them into 8 types of logs for 4 scenarios, which were solidified into the application framework. Indicator collection mainly includes basic indicators, service indicators and business indicators. The tracing registration mainly generates or transparently transmits the trace ID at the portal, and uniformly registers the trace related information in the mesh sidecar, so that the service can be accessed at a low cost and can be used when opening box.

3. Create Scenarios. First establish a service perspective and let the business focus on service value. Second, establish an Ops view for fault scope and troubleshooting perspective, which enables fault scope at the top and troubleshooting at the bottom. At the same time, thousands of service monitoring or alarms are easy to lose focus if the eyebrows and whiskers are grasped. At this time, we should treat them differently and use the service level metadata to achieve convergence monitoring and alarm focus. Finally, the traffic light modeling and scoring are designed to build a global SRE perspective view.

Continuous Operations – Capacity Monitoring

Continuous Operations – Others

At the alarming view, build a global general service alarming system, strive to play the role of the first jump of the fault, and make a quick query of the unrecovered and recent alarms to facilitate the support of the global decision-making. Permission operations mainly senses the changes of key permission points to prevent cross-border authorization and changes due to out-of-control Ops. Change control is mainly based on the five military regulations.

Environment Changes

Why can such a system be built in a short time (about 2+ years)? There are four major environmental changes, which are interrelated and drive each other.

1. Field Changes: The current cloud native technology is highly mature and the industry practice is in full swing.

2. Internal changes: The technical team is pragmatic, and the architecture is remodeled according to the physiological concept of Cloud Native, which has been tried and tested.

3. Industry Changes: The Internet dividend is gradually fading. Because of this, businesses team can have more time to calm down and cultivate their internal skills.

4. Organization changes: A strong boss leads the team, from the top to the bottom, from the outside to the inside, praising technology belief and practicing the culture of responsibility.

It is precisely because of these four changes that SRE has also brought unprecedented impact, forcing us to think deeply and explore transformation.

Transformation Thinking

Role Thinking

It can be seen from the team model that each operations role has its own main services. Only SRE is not the provider of business services, but depends on the business, infinitely extending and amplifying the value of the business. In the long run, it is easy to evolve into a horizontal support team, which greatly weakens the professionalism of SRE. Therefore, service-oriented operations are imminent.

Returning to the original intention, let's take a look at the definition of SRE. It is the acronym of site reliability engineering. The last word emphasizes engineering rather than engineers. The key point is to define and reconstruct the operations with engineering ideas and methods, so as to make the operations more automatic and more service-oriented, instead of hunting more engineers to dig holes repeatedly.

The traditional operations tend to do thing passively. It is better if there is no problem. The style is carefully and rigorous, and focus on the process specification and SOP. The operations in the cloud native era puts more emphasis on building business data, measuring business risks, giving suggestions on standardized solutions, and taking the initiative. It is good at solidifying process specifications and SOPs into the architecture or platform.

Finally, we should clearly understand the responsibilities of SRE role, that it is, to make the business iterate efficiently and make the business run reliably.

Ability Exploration

After having a new understanding of the role, let's look at what kind of ability model SRE needs in the cloud native era. I think there are four main points:

2. Development Ability: SRE is also a technical type of work, and obviously needs development ability. Relying solely on DEV / architecture is unreliable, and the distant water can't put out the near fire. Without development ability, we can't do things well in detail. If everyone does development, how to make difference with DEV? The boundary is that the DEV build the middle platform, and SRE is built on the middle platform.

3. Operations Ability: It is required to consciously use the existing tool system or create new tools to conduct all-round technical operations and make good use of data.

4. Embrace Open Source: In the cloud native era, open source may do wells after the above three abilities are ready. At this time, we need to open our eyes, sincerely embrace open source, integrate open-source products, and apply them to operations practice at low cost.

Future Outlook

There are two key points. The first is ability upgrading, and the second is to make operations service-oriented. In order to better adapt to the cloud native era, we always adhere to and deepen our technical belief, upgrade our team's technical abilities, and achieve business through technical means. The second is to deepen the concept of operations servicization. The value of operations will eventually be reflected in the business. However, the best way to reflect this value is to make operations service-oriented, and maximize empower the business. Ability upgrading is to make operations servicization better. The operations servicization will open a window and give SRE more room for growth, so that they can upgrade and iterate, just like the concept of Chinese Taiji, so that the team can keep growing.

告警通知变得轻松便捷——微信告警接口指南

577 2023-03-26

SRE Thinking and Practice

实时警报通知：微信告警通知的重要性解析

告警通知变得轻松便捷——微信告警接口指南

睿象云智能告警平台的分派策略