Brief Introduction of Data Center Technologies

12 Apr 2017

I have been working in the data center industry for years. I felt the entry barrier of data center technologies is high. At the beginning of my career, it was a challenge for me to understand so many technologies and how they work together without a systematic guideline. After several years of experience, I thought maybe I could do something to help people who have the same problem I used to have. So, I started this series on data center technologies. This article is the first of the Data Center Technologies series, and it gives a general idea of data center technologies. I hope that by summarizing my experience, newcomers to this industry, or anyone interested in data center technologies, can benefit.

What is a Data Center? What does it do?

A data center can be viewed as a facility (or facilities) whose duty is to manage a large amount of data. A data center is logically composed of application software, software infrastructure (system software), and hardware infrastructure, as shown in the picture below.

data_center_abstract_view

Hardware infrastructure refers to physical components such as servers, storage systems, network components, power supplies, cooling systems, and anything else needed to support data center operations. Software infrastructure, on the other hand, refers to the software running on the hardware infrastructure, including low-level programs (e.g. operating systems) that interact with the hardware. Software infrastructure also offers many data center features for serving a large amount of data, such as data protection, backup, and virtualization. In addition, software infrastructure provides management functionality so that IT staff can manage not only the data but also the hardware resources.

The top level of the data center is the application software, which provides services to end users. Application software varies based on usage. The applications can be databases, email services, web services, and enterprise applications such as ERP and billing systems.

What Abilities Should a Data Center Have?

Data is the most precious asset in a data center. Data centers require certain abilities to ensure data services work properly, and many technologies are used in data centers to achieve this goal.

High Availability

Data centers are supposed to run 24/7/365 without interruption. Planned or unplanned downtime can cause business users serious damage. High availability and disaster recovery are critical capabilities that data centers require to ensure business continuity.

Interruptions to data center continuity can result from various causes.

System Failure and Data Center Outage

Murphy's Law states that "anything that can go wrong will go wrong", and it describes the data center case well: any component in a data center can go wrong, including processors, memory, servers, disks, networks, and power supplies. To minimize the effect of failures, keeping multiple copies of data in different power domains is one common way to protect data. For example, hard drives have a surprisingly high failure rate, based on Backblaze's Hard Drive Report. Even if a storage medium never failed, it would not last forever. When one drive stops working, there is at least one additional copy of the data that was on this drive on another drive. Different power domains can also mean that backups, i.e. copies, are kept in different geographic locations to prevent the loss of an entire data center due to physical disasters such as earthquakes, fires, or floods.

Virtualization is one key solution for preventing data center outages and recovering from system failures because it decouples software from the physical hardware components. With virtualization, software and services can keep running without interruption when the underlying physical components fail. (Refer to Software Defined Storage and Software Defined Data Center below.)

Software Bug and Logical Error

Humans make mistakes, software may have bugs, and malware may be injected, any of which could result in data corruption, service stoppage, performance drops, or more serious damage. When logical errors happen, "turning back the clock", i.e. reverting to a good version of the software or data, is a solution that returns the system to normal. Taking snapshots or full clones periodically is required to turn back the clock when an error happens.
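
To make the idea concrete, here is a minimal Python sketch (the data structure and names are hypothetical and stand in for real storage) that keeps periodic snapshots of a data set and rolls back to the last good one after a logical error:

    import copy

    data = {"config": "v1", "records": [1, 2, 3]}
    snapshots = []

    def take_snapshot(state):
        # A full copy of the data at a point in time; "turning back the clock"
        # later means restoring one of these copies.
        snapshots.append(copy.deepcopy(state))

    take_snapshot(data)                   # periodic snapshot while data is healthy
    data["records"].append(999999)        # a buggy application corrupts the data

    data = copy.deepcopy(snapshots[-1])   # roll back to the last good snapshot
    print(data)                           # {'config': 'v1', 'records': [1, 2, 3]}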

Measuring Availability

Availability indicates how long components, applications, and services function properly over a certain period. Although the ability to recover from a disaster can only be known for sure after a disaster happens, there are ways to measure this ability from either historical records or estimates.

The simplest form of measuring availability is in absolute values (e.g. 990 out of 1000 hours) or in percentage terms (e.g. 99.9%) by applying the equation:

availability = operating time / (operating time + outage time)
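
For example, a short Python function (an illustrative sketch, not tied to any particular monitoring tool) that applies this equation:

    def availability(operating_hours, outage_hours):
        # Availability is the fraction of time the service was operating.
        return operating_hours / (operating_hours + outage_hours)

    # 990 operating hours and 10 hours of outage over a 1000-hour period
    print("{:.2%}".format(availability(990, 10)))   # 99.00%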

MTBF, MTTR, and MTTF

More precise measurements can also be used: Mean Time Between Failures (MTBF), Mean Time to Recovery (MTTR), and Mean Time to Failure (MTTF).

mtbf_mttr_mttf

  • MTBF is the average time between two successive failures of a specific component or a specific system.
  • MTTR refers to the period it takes to recover a component or a system after a failure.
  • MTTF gives the average period between the recovery of a component or a system and a new outage.

Based on these measurements, availability can be written as:

availability = MTTF / (MTTF + MTTR)

This equation illustrates that availability increases substantially as the MTTR is reduced, i.e. the faster services recover to normal, the higher availability data centers have.
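
A minimal sketch of the same calculation (the numbers are made up for illustration) shows that halving MTTR, with MTTF unchanged, raises availability:

    def availability_from_mttf(mttf_hours, mttr_hours):
        # Availability expressed with MTTF and MTTR.
        return mttf_hours / (mttf_hours + mttr_hours)

    print("{:.2%}".format(availability_from_mttf(720, 8)))   # 98.90%
    print("{:.2%}".format(availability_from_mttf(720, 4)))   # 99.45%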

RTO and RPO

RTO and RPO are used to measure the ability to recover after a disaster.

  • The Recovery Time Objective (RTO) indicates the maximum allowable period for resuming services after a failure. For instance, an RTO of 60 minutes means that the maximum restart time is 60 minutes.

rto

  • The Recovery Point Objective (RPO) specifies the maximum tolerated level of data loss in a business continuity solution. For example, if the RPO is set to 5 hours, it means the backups must be made at intervals of 5 hours or less.

rpo
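
As a concrete example, the snippet below checks whether the newest backup still satisfies an RPO of 5 hours; the timestamps are invented for illustration:

    from datetime import datetime, timedelta

    RPO = timedelta(hours=5)   # maximum tolerated data loss

    def rpo_satisfied(last_backup, now):
        # True if the age of the newest backup is within the RPO.
        return (now - last_backup) <= RPO

    now = datetime(2017, 4, 12, 12, 0)
    print(rpo_satisfied(datetime(2017, 4, 12, 8, 30), now))   # True  (3.5 hours old)
    print(rpo_satisfied(datetime(2017, 4, 12, 5, 0), now))    # False (7 hours old)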

Ability to Deliver High Performance

Data centers are supposed to serve a large number of clients and massive amounts of data, and they are built with premium hardware and the latest technologies. As a result, high performance is expected of a data center.

Simplicity (Easy to Use)

A traditional data center may take the IT department hours to days to deploy before it is fully operational. An all-in-one solution that we can just turn on and that handles anything we throw at it is in demand. Also, a data center can be huge and complicated; it requires a significant amount of resources to maintain and operate. Using software to automate data center management not only lowers cost but also reduces the risk of mistakes resulting from human operations.

On the other hand, data centers deal with gigantic amounts of data, e.g. millions and millions of files, folders, and user records. When the scale of the data becomes huge, it is impractical to manage it by humans alone. Simple management tools are desired.

The goal is to make a data center able to manage both data and itself automatically, i.e. no IT involved (Virtualization is a solution trying to achieve this goal).

In the end, data center management must provide features that simplify management. For instance:

  • Trivial to deploy, such as plug-and-play
  • Self-healing, such as automatically rebalancing workloads to utilize resources
  • Trivial to scale up and scale out, i.e. adding or replacing nodes, drives, or other components
  • Graphical UI and step-by-step instructions for repairs such as drive or fan replacement
  • Automatic patch maintenance
  • Rolling, in-place software upgrades, so no service is interrupted
  • Remote management, so IT can manage and monitor remotely

Scalability in both Performance and Capacity

Resources are limited, but data grows faster every year. Traditionally, the IT department estimated data growth for the next few years, purchased data center equipment based on the estimation, and then hoped it would be enough for those years. Unfortunately, this method does not work in an era when the data growth rate is so high; purchasing data center components a few years ahead is not practical. In addition to capacity, performance also needs to be upgraded when its limit is exceeded. When the systems in a data center need to expand, there are normally two ways to do it: scale up and scale out.

Scale Out and Scale Up

Scale up refers to systems that expand capacity by adding more drives and increase computing power by adding more processors or more memory.

Scale up has some advantages: there is no network (communication) cost, it is cheap (only the needed components are purchased instead of a whole node), and scale-up solutions are easy to design.

However, scale up also has some disadvantages. The major disadvantage is the scale limitation. A single server has limited space, e.g. a limited number of PCIe slots and RAID controllers, and scale up cannot go beyond that limit. Second, adding only one type of component can cause a performance imbalance. For instance, if many drives are added but the processors and memory remain the same, the processors and memory can become the new bottleneck.

scale_up_and_scale_out_1

Scale out refers to systems that do not scale by adding individual components such as memory, drives, and processors. Instead, scale-out systems expand by adding new nodes (a new node could be a server that includes processors, memory, and drives).

Scale out is the direction that data center technologies are currently moving toward. Scale out solves the scale limitation issue that scale up has. Besides, a scale-out architecture can be designed to support hundreds of nodes or more, generating millions of IOPS and providing petabytes of capacity. The major challenge of the scale-out approach is that it is difficult to design: it requires heavy network traffic for communication and needs to solve load-balancing and migration problems. These issues are what Software Defined Storage and Software Defined Data Center solutions aim to solve.
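
To make the load-balancing challenge concrete, here is a toy Python sketch (not any vendor's algorithm) that decides which node owns a piece of data by hashing its key; real scale-out systems typically use consistent hashing so that adding a node moves only a fraction of the data:

    import hashlib

    NODES = ["node-1", "node-2", "node-3"]   # hypothetical cluster members

    def owner(key, nodes):
        # Hash the key and map it to one of the nodes.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return nodes[digest % len(nodes)]

    for k in ("vm-042.vmdk", "invoice-2017.pdf", "mail/user7/inbox"):
        print(k, "->", owner(k, NODES))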

Ability to Provide Data Insight and Data Analytics

No doubt, data is valuable. However, without data analysis, data itself is meaningless and useless. An ideal data center should be data-aware and provide data analytics to help enterprises discover insights in the data and generate business value. Visibility into system behavior and system-level real-time data analytics help us understand what is going on inside the system. They also create chances to improve overall system performance, such as making better use of files and storage, and to locate the root cause of issues easily. Discovering hidden value in data is in demand, but the bottom line is that data centers should, at least, be able to collect and provide the data.

Storage Technologies Overview

In today's big data era, data is growing at an unprecedented speed. Data is also an enterprise's most valuable asset. Therefore, storage systems take on the heavy responsibility of storing, protecting, and managing data. This section covers basic storage system concepts and technologies, including storage system architectures, the connectivity for storing data, and the presentations of storage systems.

Storage System Architectures

Storage systems can be built with the following architectures: Storage Area Network (SAN), Network Attached Storage (NAS), and Direct Attached Storage (DAS).

Storage Area Network (SAN)

A SAN is a dedicated network that allows users to share storage. A SAN consists of storage devices, interconnecting network components such as switches, routers, and adapters, and the host servers connected to this network.

san

The characteristics of a SAN can be summarized as follows:

  • Dedicated network; not accessible by other devices through the LAN
  • Uses protocols including Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), and Internet Small Computer Systems Interface (iSCSI)
  • Provides block-level access only, but file systems can be built on top of a SAN
  • Additional capabilities include disk zoning, disk mapping, LUN masking, and fault management

Network Attached Storage (NAS)

NAS is a file-level storage device that connects to a network and is accessed using file-sharing protocols.

nas

NAS has the following characteristics:

  • File-level access storage architecture with storage attached directly to the LAN
  • Supports Ethernet connections and allows administrators to manage disk space, set disk quotas, provide security, and utilize snapshots
  • Supports multiple file-level protocols: NFS, SMB, and CIFS

Direct Attached Storage (DAS)

DAS is block-level storage directly attached to the host machine. The storage can be internal or external disk enclosures, e.g. JBOD or RAID, with interfaces such as SATA, eSATA, SAS, and FC. No network is required.

das

Presentations of Storage

To allow hosts, operating systems, and applications to communicate with storage systems over a network, they must agree on how they communicate. Storage systems provide three types of interfaces for communication: block, file, and object. If a storage system provides a block interface, it is usually called a block storage system. Similarly, if a storage system offers a file interface, it is called a file storage system; if a storage system has an object interface, it is called an object storage system. However, a storage system can present more than one interface. The picture below shows the three types of storage presentation.

storage_presentation

Block

A block device, e.g. a hard drive, allows data to be read or written only in blocks of a fixed size (e.g. 512 bytes or 4 KB). Block storage presents storage as logical block devices by using industry-standard iSCSI or FC connectivity. In a Storage Area Network, a LUN (Logical Unit Number) represents a logical abstraction (virtualization) layer between the physical disk devices and the host machines. The host OS sees the logical unit as a block device that can be formatted with a file system. A block device requires a block driver (on the host machine) that offers the kernel a block-oriented interface, which is invisible to the users or applications that open the /dev entry points (in Linux). Block storage allows data to be manipulated at a fine-grained level.

block_storage
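
The sketch below reads one fixed-size block from a block device in Python; the device path and block size are assumptions (a LUN presented over iSCSI or FC might appear as /dev/sdb and requires root privileges to open):

    import os

    BLOCK_SIZE = 4096       # assume 4 KB blocks
    DEVICE = "/dev/sdb"     # hypothetical block device presented by the SAN

    def read_block(device, block_number):
        # Read one fixed-size block at the given block address.
        fd = os.open(device, os.O_RDONLY)
        try:
            os.lseek(fd, block_number * BLOCK_SIZE, os.SEEK_SET)
            return os.read(fd, BLOCK_SIZE)
        finally:
            os.close(fd)

    print(len(read_block(DEVICE, 0)), "bytes read")   # first block of the device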

File

As its name implies, a file-based storage system provides shared access to entire file systems down to the individual file. Files are structured in a filesystem and organized hierarchically, so each file can be located by its path. Each file also has metadata that contains attributes associated with it, including owner, permissions, group, and size. The filesystems are network attached and require file-level protocols such as NFS (Linux) and SMB (Windows). In file storage systems, data is written to and read from variable-length files. A file storage system can be built on top of a block storage system.

file_storage
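
A short Python example of file-level access; the path is hypothetical and could just as well sit on an NFS or SMB mount:

    import os
    import stat

    path = "/mnt/nas/reports/q1.txt"   # hypothetical file on a NAS share

    # Files are located by path and written with variable length.
    with open(path, "w") as f:
        f.write("quarterly report draft\n")

    # Each file carries metadata: owner, permissions, size, timestamps.
    info = os.stat(path)
    print("size       :", info.st_size)
    print("owner uid  :", info.st_uid)
    print("permissions:", stat.filemode(info.st_mode))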

Usually, file-based storage has the following characteristics:

  • Standard file system interface
  • Support the POSIX interface
  • Location and path aware
  • Target massive workloads
  • Target workloads with lots of reads and writes
  • Target structured data sets

Object

Object storage provides access to whole objects, or blobs of data, through an API (e.g. a RESTful API) specific to the system. Each object usually includes data, metadata for internal management, and a UUID: the data can be anything, e.g. photos, videos, or PDF files; the metadata contains information describing what the data is about; and a unique ID is assigned to the object when it is created, so the object can be retrieved by that ID. Cloud-based storage systems such as Amazon S3 and OpenStack Swift are known for offering object-based storage access. There is no limit on the type or number of metadata entries, which makes object storage flexible and powerful; anything can be included in the metadata.

object_storage
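
As an illustrative sketch (the endpoint, bucket, and key below are assumptions), the boto3 library can talk to Amazon S3 or any S3-compatible object store:

    import boto3

    # Credentials are read from the environment or ~/.aws/credentials.
    s3 = boto3.client("s3", endpoint_url="https://objects.example.com")
    bucket, key = "backups", "2017/04/db-dump.tar.gz"

    # PUT: the whole object is written at once, with user-defined metadata.
    with open("db-dump.tar.gz", "rb") as body:
        s3.put_object(Bucket=bucket, Key=key, Body=body,
                      Metadata={"source": "billing-db", "retention": "7-years"})

    # GET: the object is retrieved as a whole by its unique key.
    obj = s3.get_object(Bucket=bucket, Key=key)
    print(obj["Metadata"], len(obj["Body"].read()))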

Object-based storage has the following characteristics:

  • Target cold data, e.g. backup, read-only, archive
  • Target unstructured data set because anything can be included in metadata
  • Write-once (immutable)
  • Data access through RESTful API, e.g. S3 and SWIFT
  • Immutable objects and versioning
  • Location unknown

Storage Media

Several types of non-volatile media are used in data centers to persist data. The table below shows the brief comparison between four popular storage media.

                        Capacity    Latency     Throughput
HDD (SATA/SAS)          Large       High        Low
SSD (SATA/SAS)          Medium      Medium      Medium
PCIe SSD (AHCI/NVMe)    Medium      Low         High
NVDIMM                  Small       Very Low    Very High

HDDs have lasted for decades. They offer huge capacity at a low price. However, their latency and IOPS are relatively poor. SSDs (flash) are taking over storage media market share as their capacity increases and their price declines. SSDs offer much better performance than HDDs.

NVDIMM stands for Non-Volatile Dual In-line Memory Module. It is the next generation of storage media. It delivers extremely fast performance at a very high cost.

Summary

The picture below illustrates the layout of storage system architectures.

san_nas_das_2

The table below summarizes the difference between NAS, SAN, and DAS.

                    NAS                    SAN                  DAS
Presentation        File, Object, Block    Block                File
Protocol            Ethernet               FC, FCoE, iSCSI      SAS, SCSI, FC
Separate Network    No                     Yes                  No

The comparison below covers the different presentations of storage systems.

Use Cases
  • File: structured data sets; data with heavy read/write
  • Object: unstructured data sets; large data sets; immutable, backup, archive, and read-only data
  • Block: virtual machines; databases; email servers; RAID

Data Size
  • File: variable size
  • Object: variable size
  • Block: fixed size

Identified by
  • File: file name and file path
  • Object: unique ID
  • Block: address (LBA)

Composed of
  • File: file and metadata
  • Object: data, metadata, and unique ID
  • Block: fixed-size raw data

Access through
  • File: standard file access, POSIX interface
  • Object: REST API
  • Block: block device

Data granularity
  • File: directory, mount point, namespace
  • Object: object
  • Block: LUN

Performance (normal case)
  • File: medium to high
  • Object: low to medium
  • Block: high

Data organization
  • File: hierarchy
  • Object: flat namespace

Software Defined Storage and Software Defined Data Center

Software Defined Storage (SDS) is an approach to data storage in which software provides abstraction layers that decouple the storage services from the underlying hardware. SDS aims to run on any commodity hardware and to be a solution for scale-out storage systems.

Software-Defined Data Center (SDDC) is a concept in which the data center infrastructure is virtualized and delivered as software functions. SDDC usually includes Software Defined Networking (SDN), which manages network traffic; Software Defined Compute (SDC), which manages workloads; and SDS, which manages data. The combination of SDN, SDC, and SDS is also referred to as Software Defined Infrastructure (SDI), which is an important component of a cloud environment.

This section mainly focuses on SDS.

Why Software Defined (Virtualization)?

The key to SDS and SDDC is virtualization. Virtualization means using software to provide a computer-generated simulation of the real hardware resources. The basic idea is to use software to manage the hardware resources, so clients or users see only the software layer without knowing the underlying hardware.

basic_virtualization

Virtualization has many benefits and provides many abilities that traditional storage systems do not have.

  • High Availability: with the separation of software and hardware that virtualization provides, the goal of high availability can be achieved. The virtualization layer hides the underlying hardware components, so software running on top of it does not stop when an underlying hardware component fails, e.g. a drive failure or a node failure. Although a performance downgrade may be observed, the overall service continues working. Therefore, an ideal virtualization environment can accomplish the zero-downtime requirement. Besides, the virtualization layer also replicates data to separate power domains to avoid a single point of failure. Even if a failure occurs, a replica is available to recover from.

virtualization_ha

  • Hardware Flexibility: because virtualization separates the underlying hardware from the software, the software can run on any commodity hardware. Using commodity hardware to build a data center usually means a lower cost than proprietary hardware. Besides, hardware upgrades do not affect the software services.
  • Scale Out: when hardware resource usage is approaching its limit, virtualization affords a way to add new nodes, drives, or other hardware resources without stopping services. Virtualization makes scale out easier. Besides, IT departments no longer need to purchase extra hardware components (especially drives) ahead of time. Thin provisioning is one method that uses virtualization technologies to optimize the utilization of available storage, so IT departments can buy drives only when they need them (see the sketch after this list).
  • Simplified Management: because the management layer is software defined, the management UI can be designed to be as simple as possible. The software can provide policy-based management regardless of the hardware being used. Management automation is also feasible with policy-based management. Therefore, the overall IT effort required can be reduced, and IT operations can be simplified.
  • Advanced Features: virtualization usually comes with advanced features such as snapshots, compression, data deduplication, encryption, and intelligence.
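
The sketch below illustrates the idea behind thin provisioning with a sparse file: the volume advertises a large logical size, but physical blocks are only allocated when data is actually written (the file name and size are arbitrary):

    import os

    path = "thin-volume.img"
    logical_size = 100 * 1024**3       # present a 100 GiB volume

    with open(path, "wb") as f:
        f.truncate(logical_size)       # logical size set, but no data written yet

    st = os.stat(path)
    print("logical size:", st.st_size)            # 100 GiB
    print("allocated   :", st.st_blocks * 512)    # close to zero on most filesystems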

Cloud and Edge Computing

The cloud (or cloud computing) is the delivery of on-demand computing resources over a network. In other words, a cloud is a logical group of many servers, storage systems, networking components, and software connected to the internet.

cloud_computing

Types of Cloud Services

The cloud is a service-oriented model: "everything as a service" (XaaS). Based on different requirements, there are many types of services. The three most common types of service are Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

  • Infrastructure as a Service (IaaS)

In the IaaS model, the service providers deliver raw, not pre-configured, resources. The service providers may offer tools to manage the resources, but the users are meant to be the ones who configure the resources and install the software.

Amazon EC2 is an example of IaaS.

  • Platform as a Service (PaaS)

The PaaS model usually provides a ready environment that comes with a pre-configured setup (e.g. LAMP and MEAN stacks). In this model, the users can focus on their main business without spending time configuring their environment.

Microsoft Azure, Amazon AWS, and Google Cloud Platform are examples of PaaS.

  • Software as a Service (SaaS)

SaaS is software that users can access and use without downloading and installing it. Many online services people use daily are SaaS, for instance Microsoft Office 365, Gmail, Xbox Live, and Google Docs.

Deployment Models

The cloud can be categorized as follows: Public, Private, and Hybrid.

  • Public

A public cloud indicates that the cloud provider owns and maintains the resources and offers the services for public use. In other words, the public cloud is meant to be used by anybody. Public cloud services can be free or paid. Microsoft Azure and AWS are the two biggest public cloud providers.

  • Private

A private cloud is privately owned by a single organization for its internal use. It means only this organization or its customers can use the cloud resources. There is no architectural difference between a public and a private cloud.

  • Hybrid

Hybrid cloud is the combination of the public and private cloud. This model not only offers cloud resources for public use but also provides an on-prem environment for exclusive usage.

Why and why not (Public) Cloud?

One major benefit of using the cloud is being hardware free. The users neither need to purchase any hardware nor maintain the physical components. In many cases, using the cloud also means lower costs than managing a dedicated data center, i.e. a private cloud. Using the cloud eliminates the costs of building and operating a data center and reduces the costs of the IT department. However, in some other cases, using the cloud does not save money. For big corporations with heavy network traffic and huge storage demand, chances are the charges for using the public cloud exceed the cost of building a dedicated data center.

Although it seems everything is moving to the cloud, in some cases people move to the cloud slowly or refuse to do so. The main reasons are security concerns and control. When using the public cloud, data is stored in the cloud provider's data centers. If the cloud is out of service (e.g. AWS has an outage), the users can do nothing but wait until it is recovered. Besides, having data held somewhere else cannot be tolerated by customers who have very sensitive data; for instance, the health care and financial industries hold sensitive information, and the entertainment industry is very sensitive to leaks. The other reason not to use the public cloud is network connectivity. In areas that do not have adequate network bandwidth or stable network connectivity, users do not gain benefits from using the public cloud.

What is Edge Computing?

In contrast to cloud computing, edge computing pushes computing power to the edges of a network. Edge devices, i.e. Internet of Things (IoT) devices, have limited computing power and storage space. Therefore, the most common model today is that IoT devices are only responsible for collecting data and sending it back to the cloud. Then, compute-intensive tasks, e.g. data analytics and prediction, occur in the cloud. A typical example is shown in the picture below.

iot_today

Nevertheless, in some situations the computation must take place at the edge devices, because sending data back to the cloud and waiting for the computed results takes too long. In the case of an auto-driving car, the car needs to respond within a very short period under any traffic conditions. If the auto-driving car is not able to respond in a very short amount of time, accidents may happen.

With the growth of IoT and the increasing availability of low-cost, power-saving processors and storage, the prospect for edge computing is that more computation will be pushed to the edge. Eventually, direct communication can happen between devices. Every use case is possible.

iot_future

Trending

It is the beginning of 2017, and data center technologies keep evolving. Expectations for data center technologies include the following:

Solid-State Drive (SSD) is taking over Hard-Disk Drive

With the decrease in price and the increase in capacity, performance, and reliability of SSDs, SSDs are taking over from HDDs in both the enterprise storage and personal computer markets. First, SSD prices are closing in on HDD prices very quickly. HDD prices are dropping only slightly, and an HDD still has a minimum cost to build. Therefore, price is no longer an obvious reason to choose HDD over SSD. Second, HDD performance is limited by physics: an HDD is a mechanical spinning drive, and with that physical limitation it is getting harder and harder to increase its performance. In short, SSDs will soon replace HDDs as the main non-volatile storage media in data centers.

One opportunity here is that many existing storage systems were designed for HDDs. Therefore, these storage systems are not optimized for SSDs, and when they switch from HDDs to SSDs, they do not gain a significant performance increase. This fact opens a new opportunity for vendors to build new storage systems that utilize the full power of SSDs.

Cloud Computing and Edge Computing are not competing against each other

Amazon AWS, Microsoft Azure, and Google Cloud Platform continue leading and gaining cloud market share, and many customers have started moving to cloud solutions. On the other hand, the IoT ecosystem is also growing very fast. With more powerful SoCs, processors, and other components, IoT devices can do more compute-intensive tasks at the edge. As mentioned in "What is Edge Computing?", in some cases it is critical to do computation jobs, e.g. analytics and prediction, at the edge. The cloud and the edge are not taking over each other. Instead, the cloud and IoT devices coordinate to build a more mature and comprehensive ecosystem.

Data analytics and intelligence are important

Data matters; even machine-generated data matters. Big data, data analytics, and machine learning are buzzwords that everybody is talking about nowadays. However, most discussions focus on human-generated data. This area will remain hot and important. On the other hand, people have realized that machine-generated data has value as well. Providing comprehensive and abundant system-level data helps people remove performance bottlenecks, increase utilization, locate the root cause of issues, and discover hidden value that they did not know about before. Besides, with the power of machine learning, it is possible to take one more step and build data centers that can manage themselves. Soon, more system-level data will be pulled out and analyzed, and more intelligence will be built into data center technologies.


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States
My name is Shun. I am a software engineer and a Christian. I currently work at a startup company.
My Website: https://formosa1544.com
Email: shun@formosa1544.com
