Introduction: On January 15th, the first OCP China Technology Seminar was held in Shenzhen, jointly organized by Tencent Cloud and the OCP international community. At the seminar, Tencent expert engineer Yang Xiaoying delivered a talk entitled “Tencent Cloud DCOS Technology Sharing”. The following is the full text of the talk. Yang Xiaoying holds a master’s degree from Sun Yat-sen University and is a server management and control architect at Tencent, mainly responsible for Tencent’s server automation operations platform and the monitoring and control solutions for private cloud infrastructure.

Outline of this talk:

1. DCOS Concept & Advantages.

2. Introduction to DCOS Solutions.

3. Demystifying DCOS Modules.

4. DCOS Project Application & Open Plan.

Hello everyone, I am very happy to have the opportunity to discuss and learn with you. My topic today is the DCOS project.

DCOS Concept & Advantages.

1. DCOS Concept.

DCOS, short for Data Center Operating System, is a management engine for private cloud infrastructure. It provides services such as monitoring and control of servers and network devices, configuration management, and alarm management.

2. DCOS Advantages.

DCOS is relatively comprehensive, which is a benefit of Tencent’s many years of experience operating infrastructure. Tencent runs millions of servers and tens of thousands of network devices under a large and complex business ecosystem, and has accumulated a great deal of valuable operational experience. We also took the uncertainty of private cloud environments into account: the types of devices users run, their actual business needs, and so on are all unpredictable. DCOS therefore combines Tencent’s operational experience with a focus on customization.

In terms of architecture, DCOS adopts a modular, layered design. Modules are divided by function, and users can install only the ones they need. The layered design supports both centralized and distributed deployment: centralized deployment is simple, and a single machine can control the whole network; distributed deployment is more flexible and can adapt to complex network environments. In addition, DCOS provides many open APIs so that users can do secondary development and build their own operations systems.

DCOS Solutions.

1. DCOS’s Role in the Private Cloud.

What role does DCOS play in private cloud management? It mainly provides four categories of services: CMDB (configuration management), BME (bare metal management), OneMonitor (monitoring), and OneAlert (alarms). To some extent, it fills the gap that cloud solutions such as OpenStack leave in monitoring and controlling servers and network devices. Other OSS systems in the cloud and users’ own systems interact with these services through the DCOS API to build the complete management platform.

2. DCOS Function List.

Next, let’s take a look at the specific features DCOS provides in these four categories of services.

On the far left is the configuration management system, CMDB, which manages the physical information of the infrastructure. This is the first step in infrastructure lifecycle management: users import data into the CMDB, and it becomes the data source for the other modules.

Servers have to be installed before the business goes online, so we developed the second module, the out-of-band deployment module. It provides out-of-band server operations (such as powering machines on and off), OS installation (including PXE installation and fast reinstallation), and out-of-band password management.

After servers are deployed with this module, we may need to change the OS or release business systems during operation, so we built a third module, the server management module, which supports remote control of servers, such as file transfer and script execution.

During operation we also care about how the servers are running and whether they have any faults, so we have the server monitoring module. It collects basic OS data, including OS status and performance, and also monitors the processes and ports of business applications. We have also added hardware monitoring, which gives users a fuller picture of how their machines are running.

Besides server failures, we also care about the status of network devices. The fifth module handles data collection and monitoring for network devices, including SNMP traffic collection, log collection, session traffic, and network quality detection.

The last one is the alarm module, which is responsible for alarm policy configuration and alarm management, such as alarm judgment, deduplication, and masking.

Demystifying DCOS Modules.

Next, let’s take a look at exactly what each module looks like.

1. CMDB (Configuration Management).

As just mentioned, the CMDB stores the physical information for all infrastructure. Drawing on Tencent’s years of IDC operations experience, it abstracts multiple managed objects, including IDC lines/egresses, network devices, servers, IDC rack positions, and IP resources. We can manage the basic information of these physical objects and the associations between them. We also provide component data, such as server hard disks, and port information for network devices, from which we can draw the physical topology. This is the first step in the entire infrastructure lifecycle management.
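
To make the idea of drawing a physical topology from port information concrete, here is a minimal sketch. The talk does not describe the real CMDB schema, so the `Port` and `Cmdb` types, their fields, and the asset names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


# Hypothetical record types; the real CMDB schema is not described in the talk.
@dataclass(frozen=True)
class Port:
    device: str  # name of the server or network device that owns the port
    name: str    # port name, e.g. "eth0"


@dataclass
class Cmdb:
    racks: dict = field(default_factory=dict)  # asset name -> rack position
    links: list = field(default_factory=list)  # (Port, Port) physical cables

    def add_asset(self, name, rack):
        self.racks[name] = rack

    def connect(self, a, b):
        self.links.append((a, b))

    def topology(self):
        """Build a device adjacency map from the recorded port links."""
        adj = {name: set() for name in self.racks}
        for a, b in self.links:
            adj.setdefault(a.device, set()).add(b.device)
            adj.setdefault(b.device, set()).add(a.device)
        return adj


cmdb = Cmdb()
cmdb.add_asset("switch-1", "IDC-A/rack-01/U40")
cmdb.add_asset("server-1", "IDC-A/rack-01/U10")
cmdb.add_asset("server-2", "IDC-A/rack-01/U12")
cmdb.connect(Port("server-1", "eth0"), Port("switch-1", "port-1"))
cmdb.connect(Port("server-2", "eth0"), Port("switch-1", "port-2"))

topo = cmdb.topology()
print(sorted(topo["switch-1"]))
```

The point of the sketch is only that once assets and port-to-port links are stored as data, the topology is a derived view rather than something maintained by hand.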

2. Server Management.

The second is server management. It draws on Tencent’s internal experience deploying hundreds of server models and managing servers at massive scale. We implemented automatic discovery of server resources, out-of-band management, OS deployment, and remote control.

After a machine is powered on, we assign it an out-of-band IP through the DHCP service, which gives us automatic resource discovery, and then take over its out-of-band management. After that, we can install the OS through PXE. Our deployment module also supports fast reinstallation. Because we can’t predict the shape of a user’s business, we have opened up many customization capabilities, such as custom OS installation, custom RAID configuration, custom partitioning, and custom post-deployment operations. On the far right is the remote control module, which provides stable and efficient channels for file transfer and script execution. Users can build their own operations platform on top of this module.
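
The flow described above (power on, DHCP discovery, out-of-band takeover, OS installation, custom post-deployment steps) can be sketched as a small state machine. This is an illustration under assumptions, not DCOS code: the stage names, the `DeploySpec` fields, and the hook mechanism are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical stage names, in the order the talk describes.
STAGES = ["powered_on", "discovered", "oob_managed", "os_installed", "ready"]


@dataclass
class DeploySpec:
    """User-supplied customization (all field names are assumptions)."""
    os_image: str = "default-os"
    raid_level: str = "raid1"
    partitions: dict = field(default_factory=lambda: {"/": "50G"})
    post_install: list = field(default_factory=list)  # callables run after install


@dataclass
class Server:
    serial: str
    stage: str = "powered_on"
    oob_ip: str = ""
    applied: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


def advance(server):
    """Move the server to the next stage, refusing to go past the last one."""
    nxt = STAGES.index(server.stage) + 1
    if nxt >= len(STAGES):
        raise ValueError(f"{server.serial} is already {server.stage}")
    server.stage = STAGES[nxt]
    server.history.append(server.stage)


def deploy(server, oob_ip, spec):
    server.oob_ip = oob_ip  # address handed out by DHCP
    advance(server)         # discovered
    advance(server)         # oob_managed
    server.applied = {"os": spec.os_image, "raid": spec.raid_level,
                      "partitions": dict(spec.partitions)}
    advance(server)         # os_installed
    for hook in spec.post_install:
        hook(server)        # custom post-deployment operations
    advance(server)         # ready
    return server


spec = DeploySpec(os_image="custom-linux", raid_level="raid10",
                  post_install=[lambda s: s.history.append("hook:install-agent")])
srv = deploy(Server("SN0001"), "10.0.0.21", spec)
print(srv.stage, srv.history)
```

Keeping the customization in a separate spec object mirrors the point made in the talk: the pipeline stays fixed while the OS image, RAID layout, partitions, and post-install steps vary per user.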

3. Server Monitoring.

Next, let’s see how we monitor servers and what our monitoring module can do. It covers the collection and monitoring of both software and hardware, as well as the monitoring of third-party components. It also provides a channel for users to report monitoring data they collect themselves.

For basic OS monitoring, we collect CPU utilization, memory usage, disk IO, and network card status. For hardware, we collect configuration information for power supplies, fans, hard disks, RAID cards, and so on, and generate alarms such as missing memory, missing fans, power failures, and hard disk failures. For business applications, we provide process and port monitoring, and we also let users import DataDog open source scripts to monitor third-party components. Finally, users may want to monitor the status of their own systems, which means collecting data on the local machine and needing somewhere to store it. We therefore also provide a custom reporting channel: the user reports self-collected business data, and DCOS stores and forwards it.
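
Alarms such as “missing memory” or “missing fans” imply a comparison between the configuration a machine is supposed to have and what was actually detected. The sketch below shows one simple way to do that comparison; the component names and the `hardware_alarms` function are hypothetical, and the talk does not say how DCOS implements this check.

```python
def hardware_alarms(expected, detected):
    """Compare expected component counts with what was detected on the machine.

    Both arguments map a component name to a count. Any component with fewer
    units detected than expected produces one alarm message.
    """
    alarms = []
    for component, want in sorted(expected.items()):
        have = detected.get(component, 0)
        if have < want:
            alarms.append(f"missing {component}: expected {want}, found {have}")
    return alarms


# Hypothetical inventory for one server model and one collection run.
expected = {"memory_dimm": 8, "fan": 6, "psu": 2, "disk": 12}
detected = {"memory_dimm": 7, "fan": 6, "psu": 2, "disk": 11}

alarms = hardware_alarms(expected, detected)
for line in alarms:
    print(line)
```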

4. Network Monitoring.

Next, let’s take a look at network monitoring, which we split into four sub-modules. The first is the SNMP module. It is mainly responsible for collecting network device port information, including port configuration, inbound and outbound port traffic, and the overall operating status of the device, and for detecting abnormalities such as interruptions and loss of contact. Because we can’t cover every type of network device on the market, we designed a custom collection mechanism: the user writes a collection template that meets their requirements, following the established syntax and the default template, then binds the device to the template and imports it into the system. We can then collect from and monitor the device automatically.
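
Port traffic collected over SNMP usually arrives as ever-increasing octet counters, so a traffic rate has to be derived from two samples. The sketch below shows that calculation together with a toy version of the template-to-device binding. The template layout and device names are hypothetical (the talk does not show the DCOS template syntax); `ifInOctets`, `ifOutOctets`, `ifHCInOctets`, and `ifHCOutOctets` are the standard IF-MIB counter objects.

```python
def counter_rate_bps(prev, curr, interval_s, counter_bits=64):
    """Turn two octet-counter samples into bits per second.

    A negative delta is treated as a single counter wrap, which matters
    mostly for 32-bit counters on fast links.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev
    if delta < 0:
        delta += 2 ** counter_bits
    return delta * 8 / interval_s


# Hypothetical templates: which IF-MIB counters to poll and their width.
TEMPLATES = {
    "default": {"in": "ifHCInOctets", "out": "ifHCOutOctets", "bits": 64},
    "legacy-32bit": {"in": "ifInOctets", "out": "ifOutOctets", "bits": 32},
}

# Hypothetical binding of devices to templates.
BINDINGS = {"sw-core-1": "default", "sw-old-7": "legacy-32bit"}


def inbound_rate(device, prev, curr, interval_s):
    template = TEMPLATES[BINDINGS[device]]
    return counter_rate_bps(prev, curr, interval_s, template["bits"])


print(inbound_rate("sw-core-1", 1_000, 76_000, 60))
print(inbound_rate("sw-old-7", 2 ** 32 - 500, 250, 60))
```

Binding the counter width to the template rather than hard-coding it is one way a user-written template could adapt collection to a device the defaults do not cover.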

The second is log collection for network devices. We collect the device logs, parse the data and check its validity, and then perform keyword matching to decide whether to generate an alarm. We support user-defined matching rules: users can write their own rules for which alarms they want to see and which levels they care about, and import them into the system.
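
The parse, validate, and match steps described above can be sketched with regular expressions. The line format, the rules, and the sample messages here are hypothetical; real device logs and the DCOS rule syntax will differ.

```python
import re

# Assumed line layout: "<timestamp> <host> <message>".
LINE = re.compile(r"^(?P<ts>\S+) (?P<host>\S+) (?P<msg>.+)$")

# Hypothetical user-defined rules: (alarm level, keyword pattern).
RULES = [
    ("critical", re.compile(r"changed state to down", re.IGNORECASE)),
    ("warning", re.compile(r"temperature|fan", re.IGNORECASE)),
]


def match_log(line):
    """Return an alarm dict for the first matching rule, else None."""
    m = LINE.match(line.strip())
    if m is None:
        return None  # malformed line fails the validity check
    for level, pattern in RULES:
        if pattern.search(m.group("msg")):
            return {"host": m.group("host"), "level": level,
                    "msg": m.group("msg")}
    return None


samples = [
    "2019-01-15T10:00:00Z sw-core-1 Interface port-1 changed state to down",
    "2019-01-15T10:00:05Z sw-core-1 Fan tray 2 speed abnormal",
    "2019-01-15T10:00:09Z sw-core-1 User admin logged in",
    "garbage",
]
results = [match_log(s) for s in samples]
for r in results:
    print(r)
```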

The third is network quality detection. We can find out whether the network is reachable by ping and whether there is packet loss or latency. Users can deploy the DCOS probe client as needed and define probe tasks in the backend, and the system will automatically probe the network conditions they care about.
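
Whatever the probe client does on the wire, the quality figures come down to summarizing a batch of probe results. The sketch below assumes each probe yields either a round-trip time in milliseconds or `None` for no reply; that representation and the `summarize_probe` name are illustrative assumptions.

```python
def summarize_probe(rtts_ms):
    """Summarize one probe batch.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if no reply arrived.
    """
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "sent": sent,
        "received": len(replies),
        "loss_pct": loss_pct,
        "avg_ms": sum(replies) / len(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
        "reachable": bool(replies),
    }


summary = summarize_probe([10.0, 12.0, None, 14.0])
print(summary)
```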

Finally, there is session traffic monitoring. We support data collection and parsing for the netflow/sflow/netstream protocols, extract the source IP, destination IP, source port, destination port, access direction, protocol, and so on from the session data, and then summarize and store it according to rules. Based on this information, users can analyze the traffic usage of their business and then make business adjustments and optimize costs.
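
Once flow records have been parsed into fields, summarizing them “according to rules” amounts to grouping by a chosen key and adding up the traffic. The sketch below assumes records are already decoded into dictionaries; the field names and the sample addresses are hypothetical, and protocol decoding itself is out of scope.

```python
from collections import defaultdict


def summarize_flows(records, key_fields=("src_ip", "dst_ip", "protocol")):
    """Sum bytes per group, where the group is defined by key_fields."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        totals[key] += rec["bytes"]
    return dict(totals)


# Hypothetical decoded session records.
records = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.9", "src_port": 40001,
     "dst_port": 443, "protocol": "tcp", "bytes": 1200},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.9", "src_port": 40002,
     "dst_port": 443, "protocol": "tcp", "bytes": 800},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.1.9", "src_port": 53000,
     "dst_port": 53, "protocol": "udp", "bytes": 90},
]

by_pair = summarize_flows(records)
by_source = summarize_flows(records, key_fields=("src_ip",))
print(by_pair)
print(by_source)
```

Changing `key_fields` is the “rule” in this sketch: the same records can be rolled up per host pair for troubleshooting or per source for cost analysis.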

5. Alarm Management.

We have just covered monitoring for servers and network devices. When a device behaves abnormally, we can detect the exception and raise an alarm. However, users may not care much about some of these alarms, or may need them handled specially. For example, a user may want to be notified only after an alarm has occurred several times, or may need repeated alarms filtered out. So we introduced the DCOS alarm module, which provides configuration management of alarm policies as well as alarm judgment, deduplication, masking, and notification.

Its data comes from the server and network device collection modules, and it also supports alarms reported by users. The API can be called to configure alarm policies and query alarms. The alarm policy determines how an alarm is handled, such as how many occurrences it takes before a notification is sent, whether the alarm is masked, or which conditions must be met before it is forwarded to the message center. Based on the alarm policy and the received data, an alarm goes through judgment, deduplication, and masking, and then notification. Finally, we also determine whether the alarm has recovered and notify the user after recovery. This is the complete alarm management mechanism, and it meets the user customization requirements just mentioned.
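
The judgment, deduplication, masking, notification, and recovery steps can be sketched as a small engine driven by a policy object. This is a simplified illustration, not the DCOS implementation: the `Policy` fields, the threshold-style judgment, and the in-memory notification list are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    threshold: float        # a value above this counts as a breach
    trigger_count: int = 3  # consecutive breaches required before notifying
    masked: bool = False    # masked alarms are tracked but never notified


@dataclass
class AlarmState:
    breaches: int = 0
    active: bool = False


class AlarmEngine:
    def __init__(self, policy):
        self.policy = policy
        self.states = {}
        self.notifications = []  # stands in for a message center

    def feed(self, key, value):
        state = self.states.setdefault(key, AlarmState())
        if value > self.policy.threshold:      # judgment
            state.breaches += 1
            if state.breaches < self.policy.trigger_count:
                return
            if state.active:                   # deduplication
                return
            state.active = True
            if not self.policy.masked:         # masking
                self.notifications.append(("alarm", key))
        else:
            state.breaches = 0
            if state.active:                   # recovery
                state.active = False
                if not self.policy.masked:
                    self.notifications.append(("recovered", key))


engine = AlarmEngine(Policy(threshold=90, trigger_count=3))
for cpu in [95, 96, 97, 98, 50]:
    engine.feed("server-1/cpu", cpu)
print(engine.notifications)
```

In this run the first two breaches are held back by the trigger count, the fourth is dropped as a duplicate of an already active alarm, and the drop to 50 produces a recovery notification.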

DCOS Project Application & Open Plan.

So far, we have introduced most of the functions and modules of DCOS. We will continue to improve the existing modules and will introduce more monitoring and control services, such as fault prediction, to enrich the platform’s capabilities.

Next, let’s take a look at where DCOS is deployed today and our plan for opening it up.

1. Project application.

DCOS has been delivered to more than 15 medium and large enterprises along with Tencent Financial Cloud and proprietary cloud. The scale at these enterprises ranges from several hundred to tens of thousands of devices, and the enterprises are of many kinds, such as banks, supermarkets, and exchanges. Clients include CCB Head Office, HKEx, Yonghui Supermarket, and Weizhong Bank.

2. Patent & Open Source.

On patents, DCOS holds many domestic and foreign patents. On open source, our configuration management module (CMDB) has been open sourced within Tencent, and the other modules are in progress. We are also actively promoting external open source.

3. Open plan.

Finally, we plan to contribute the DCOS software to the OCP open source project. Initially, we will open the CMDB module, the server-related modules (including out-of-band deployment and remote control), and the alarm module. Other modules will be opened as they mature. We hope these measures will contribute to the OCP ecosystem and to cloud solutions as a whole!

That is all I have to share today. Thank you all!