Advanced Server Technologies

1. Introduction

This report identifies current techniques for implementing scalable and highly available servers, describes and critically evaluates a particular advanced server, and justifies the use of advanced servers in business.

2. Current Techniques

The techniques identified for implementing scalable and highly available servers are load balancing clusters, high performance clusters and high availability clusters.

2.1 Computer Clusters

A computer cluster is a group of linked computers, normally connected to one another by a fast local area network or a virtual local area network (VLAN). The computers work so closely together that they behave as a single computer. A simple example is a pair of machines working together as a search engine: one machine retrieves data and the other hosts the front-end website. The user sends requests from the site and receives the desired results. To the user it appears that they are accessing just one machine; this illusion is achieved using the techniques described later.

Clusters are used to increase the availability and in some cases the performance of applications and data. They are typically more cost effective than a single computer of similar speed or availability. This lower price for comparable performance enables companies to get a greater return on investment (ROI).

A further difference between single computers and clusters is scalability. Clusters are easily expanded and upgraded as market requirements change by adding nodes to the network; in theory there is no limit to this expansion. As well as being cheaper and more scalable, a cluster offers higher availability than a single computer, as mentioned previously. This is because when a node in the cluster fails, its operations can be transferred to a different node with little or no interruption to the service. (Bourke 2001)

Clusters can also be found working on complicated computational problems such as simulations. The computers in this type of cluster are tightly coupled and perform work that was originally referred to as supercomputing. In this instance the cluster is not working on similar data sets; it is computing in parallel. This type of cluster would use a dedicated network because of the frequent communication among nodes, would be densely located, and would be built from computers of similar specification (homogeneous nodes). At the opposite end of cluster design are nodes that require little or no communication with one another. This is loose coupling and is sometimes referred to as grid computing.

2.1.1 Load Balancing Cluster

Load balancing is the process of distributing network traffic among two or more servers, and it has the effect of making several servers appear as one. The computers are linked together to share the computational workload or to function as a single virtual computer. This balances the work among the different machines and improves the performance of the cluster.

Networks that frequently receive high volumes of traffic can use dedicated hardware to balance the load, ensuring that no single machine is overloaded and making the most of the available bandwidth. This dedicated hardware is often referred to as the load balancer. Fig 1.0 depicts a simple server load balanced (SLB) network. The example uses one load balancer, but in normal setups load balancers are deployed in a redundant array of two or more to increase availability. (Bourke 2001)

Because a load balancer distributes load among multiple servers, all that is needed to increase the serving power of a cluster is more servers. This makes for an easy solution when the network load increases, as servers can be added quickly to handle the extra traffic.

Software can also be used to divide incoming traffic among the available servers according to an algorithm. Common algorithms include:

  • Round robin, which distributes requests to each server in turn, regardless of the current number of connections or the response time.
  • Weighted round robin, which is similar to round robin but also takes account of each server's different processing capability.
  • Least connection, which sends a request to the server with the fewest connections.
  • Load based, which sends a request to the server with the lowest load.
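
The four selection policies listed above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular load balancer is implemented; the Server class, its fields and the example values are all hypothetical.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int = 1        # relative processing capability (hypothetical)
    connections: int = 0   # currently open connections
    load: float = 0.0      # e.g. CPU utilisation between 0.0 and 1.0


def round_robin(servers):
    """Return an iterator that visits servers in a fixed rotation."""
    return itertools.cycle(servers)


def weighted_round_robin(servers):
    """Return an iterator that visits each server `weight` times per rotation."""
    expanded = [s for s in servers for _ in range(s.weight)]
    return itertools.cycle(expanded)


def least_connection(servers):
    """Pick the server with the fewest open connections."""
    return min(servers, key=lambda s: s.connections)


def load_based(servers):
    """Pick the server reporting the lowest load."""
    return min(servers, key=lambda s: s.load)


servers = [
    Server("web1", weight=3, connections=12, load=0.70),
    Server("web2", weight=1, connections=4, load=0.35),
    Server("web3", weight=1, connections=9, load=0.10),
]

rr = round_robin(servers)
rr_order = [next(rr).name for _ in range(4)]

wrr = weighted_round_robin(servers)
wrr_order = [next(wrr).name for _ in range(5)]

print(rr_order, wrr_order)
print(least_connection(servers).name, load_based(servers).name)
```

Round robin ignores the state of each server, so it only balances well when servers and requests are similar; the other three policies trade that simplicity for the need to track weights, connections or load.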

Put simply, server load balancing works by presenting one URL and one IP address for a site, identifying the traffic intended for that site, and manipulating packets before and after they reach a server. This is achieved by rewriting a packet's source or destination IP address in a process known as Network Address Translation (NAT). (Bourke 2001)
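
As a rough illustration of the address rewriting described above, the following Python sketch models a load balancer that applies NAT in both directions. The Packet class, the virtual IP and the server addresses are invented for the example, and real load balancers track connections in far more detail (ports, protocols, timeouts).

```python
from dataclasses import dataclass, replace
from itertools import cycle


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    payload: str


VIP = "192.0.2.10"                          # hypothetical virtual IP of the site
REAL_SERVERS = ["10.0.0.11", "10.0.0.12"]   # hypothetical private addresses
_next_server = cycle(REAL_SERVERS)
_connections = {}   # client IP -> real server chosen for that client


def inbound(packet):
    """Rewrite the destination from the virtual IP to a real server."""
    if packet.dst_ip != VIP:
        return packet   # not traffic for the balanced site
    if packet.src_ip not in _connections:
        _connections[packet.src_ip] = next(_next_server)
    return replace(packet, dst_ip=_connections[packet.src_ip])


def outbound(packet):
    """Rewrite the source of a reply so it appears to come from the VIP."""
    if packet.src_ip in REAL_SERVERS:
        return replace(packet, src_ip=VIP)
    return packet


request = Packet("203.0.113.5", VIP, "GET /")
forwarded = inbound(request)
reply = outbound(Packet(forwarded.dst_ip, request.src_ip, "200 OK"))

print(forwarded)
print(reply)
```

The client only ever sees the virtual IP in both directions, which is what makes several servers appear as one.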

Clusters, or multiple computers that work together, also use load balancing to spread processing jobs among the available systems.

Fig 1.0 Server load balanced network (Bourke 2001)

2.1.2 High Performance Clusters

A high performance cluster (HPC) is designed specifically to combine the processing power of its nodes, increasing computing performance by sharing the workload between them.

HPCs are used when maximum performance is needed to solve large and complex problems. Their popularity and power are ever growing, and they currently make up 80% of the 500 fastest computers in the world. (Advanced Cluster Technologies Inc. 2011)

The performance achieved by clustering computers has led scientists and engineers to adopt HPCs for complex problems where expensive supercomputers had previously been used. The reasons behind this trend are the increasing performance of commercial off-the-shelf hardware and its relatively low cost. This makes HPCs a scalable and cost effective business solution when there is a need to solve complex problems quickly.

An example of an HPC can be seen in Fig 2.0. This example shows nodes connected to each other via a high speed interconnect. Other clusters may also interconnect their nodes, but in an HPC the importance of coordinating and synchronising results calls for a high performance connection, e.g. Myrinet, InfiniBand or SCI. This type of HPC can be considered a Beowulf cluster.

Beowulf clusters are built from commercially available hardware, which is normally identical across nodes, and run open source software such as the Linux or FreeBSD operating systems. This can result in an inexpensive high performance parallel computing cluster.

The primary drawback of a Beowulf cluster is that software must be specifically designed to take advantage of its resources. Programmers write this software using clustering libraries or middleware, the most popular being PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). For suitable problems, such software allows computing problems to be solved at a rate that scales almost linearly with the number of processors in the cluster.
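
The general pattern that such libraries support, splitting a problem into pieces, computing each piece independently and combining the partial results, can be sketched as follows. This is not MPI or PVM code: Python threads stand in for cluster nodes purely to show the structure, they do not give true parallel speed-up for CPU-bound work, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Work done independently by one (simulated) node."""
    return sum(x * x for x in chunk)


def cluster_sum_of_squares(data, nodes):
    """Scatter chunks to workers, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(partial_sum_of_squares, split(data, nodes)))
    return sum(partials)


numbers = list(range(1, 101))
total = cluster_sum_of_squares(numbers, nodes=4)
print(total)
```

Scaling is close to linear only while the per-node work dominates the cost of distributing chunks and collecting results, which is why the speed of the interconnect matters so much in a real cluster.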

Fig 2.0 High performance cluster

A well known system that performs similarly to an HPC is the SETI@home project. The SETI cluster consists of over 5 million home computers that join the cluster through the internet. This type of HPC is a form of distributed computing and allows the cluster to benefit from processing power that is spread geographically around the world. The nodes in the SETI cluster are sent whole calculations rather than pieces of a problem and are tasked with analysing data from the Arecibo Observatory radio telescope. (Taylor 2009)

A recent hardware trend in high performance clusters is the inclusion of graphics processing units (GPUs) for general purpose computing. Modern GPUs are included in HPC nodes because of their architecture: each GPU contains hundreds of processing cores and can execute a large number of threads in parallel. This makes the GPU better suited to parallel problems than a traditional CPU, which performs better on single threaded workloads, e.g. operating systems.

An additional reason for using GPUs as general purpose processors in HPCs is that CPU performance is not scaling quickly enough to meet demand. GPUs, by contrast, continue to develop and scale well because a growing games industry forces constant innovation. Equally, the demand for new graphics cards from computer enthusiasts and ‘gamers’ keeps the cost of these cards low. Recent developments include the Tesla cards from Nvidia, which have been designed from the ground up for parallel computing. (Vasiliadis 2010)

Fig 2.1 shows the benefit of using Tesla GPUs in just one machine to solve complex problems. The performance continues to scale when these computers are clustered as nodes in an HPC. (Nvidia 2011)

The use of GPUs in cluster nodes results in a hybrid high performance cluster in which GPUs and CPUs work together on general purpose computing tasks. Such nodes are referred to as heterogeneous, as opposed to the homogeneous nodes mentioned previously. The first hybrid system to use GPUs and CPUs together and make it into the top 500 supercomputers was Tsubame 1.2, from Tokyo, in November 2008; a picture of the cluster can be seen in Fig 2.2. Tsubame demonstrated to the world that a hybrid cluster can offer high performance at reduced energy consumption, which appeals to the green culture of the world today.

Fig 2.1 Benchmark - performing protein sequence analysis (Nvidia 2011)

Fig 2.2 Tsubame 1.2, Tokyo Institute of Technology (Nvidia 2011)

2.1.3 High Availability Clusters

High availability clusters, also known as failover clusters, are implemented to improve availability. A high availability cluster operates using redundant nodes and a private network connection. The private network connection is called the heartbeat and is used to monitor the health and status of each node in the cluster. The most common size for a high availability cluster is two nodes, the minimum needed for redundancy, although some clusters use significantly more. Fig 3.0 shows a high availability cluster with just two nodes that serves a database from a shared disk.

Businesses often use this approach when supplying databases, file sharing, business applications and websites. This is mainly due to the cluster's ability to detect hardware and software faults using the heartbeat and immediately restart an application on another node. This process does not require intervention from technical support and is known as failover. By contrast, a server that crashes outside a high availability cluster, or in a cluster without a heartbeat connection, cannot continue to supply its services until someone fixes the problem. Users are then unable to connect to those services, with a possible loss of income. The heartbeat design is not without its own problems: if the private network goes down, each node stops receiving responses from the others, and the result can be every node trying to supply the requested service at once. This risk can be mitigated with redundant networking equipment (e.g. cabling), the removal of single points of failure and good network traffic design. (Best Price Computers 2005)
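
The heartbeat and failover behaviour described above can be sketched as a small monitor. This is a simplified model under stated assumptions: the Node and FailoverMonitor classes, the three second timeout and the node names are hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float = 0.0   # time the last heartbeat was received


class FailoverMonitor:
    def __init__(self, nodes, timeout=3.0):
        self.nodes = nodes
        self.timeout = timeout      # seconds of silence before a node is presumed failed
        self.active = nodes[0]      # node currently supplying the service

    def heartbeat(self, node, now):
        """Record that a heartbeat arrived from `node` at time `now`."""
        node.last_heartbeat = now

    def is_alive(self, node, now):
        return now - node.last_heartbeat <= self.timeout

    def check(self, now):
        """Move the service to a healthy standby if the active node is silent."""
        if self.is_alive(self.active, now):
            return self.active
        for candidate in self.nodes:
            if candidate is not self.active and self.is_alive(candidate, now):
                self.active = candidate   # failover: no operator involved
                break
        return self.active


primary, standby = Node("db1"), Node("db2")
monitor = FailoverMonitor([primary, standby], timeout=3.0)

monitor.heartbeat(primary, now=10.0)
monitor.heartbeat(standby, now=10.0)
before = monitor.check(now=11.0).name    # primary is still healthy

monitor.heartbeat(standby, now=14.0)     # primary has gone silent
after = monitor.check(now=15.0).name     # service moves to the standby

print(before, after)
```

A monitor like this cannot tell a failed node from a failed heartbeat link, which is why a loss of the private network can leave more than one node believing it should be active.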

Fig 3.0 High availability cluster, where one node is active and the other is redundant and used as a backup. (Oracle 2011)

For an example of how a large public website might typically combine techniques to achieve high availability, see Fig 3.1. This example uses redundant front-end servers that are load balanced, so that users should still be able to connect to the site if one server fails. The server cluster ensures that requests continue to be answered, and to protect the availability of data the cluster uses a redundant fibre connection to the storage area.

Fig 3.1 The combination of techniques to achieve high availability. (Microsoft 2011)

3. The Mole 8.5 Cluster

The Mole 8.5 is a high performance cluster built by the Institute of Process Engineering (IPE) in 2010 and is the thirty-third fastest supercomputer in the world as of July 2011. (Top500 2011)

The IPE used Nvidia’s Fermi GPUs to build the heterogeneous supercomputer, which contains 1920 GPUs. The Fermi GPU combines features of the previously mentioned Tesla card and the Nvidia Quadro. The Fermi GPUs make use of the visualisation capabilities of the Quadro and the parallelisation of the Tesla, making them very effective at computing and displaying the results of simulation applications, e.g. a molecular dynamics simulation. (Zhang and Yunquan 2009)

Supercomputers like the Mole are ranked by running the Linpack performance benchmark, which measures a system’s floating point computing power. Results are reported as the number of floating point operations the system can perform per second, or Flops. The Mole 8.5 achieves 207.3 teraflops (TFlops), or roughly 207 trillion calculations per second, using all 320 of its nodes. To put this in perspective, a human performing one calculation per second on a calculator would need about 6.6 million years to complete what the Mole 8.5 performs in just one second. Countries such as America, Russia and China are trying to reach speeds of 1000 petaflops (one exaflop) and plan to achieve this within the decade. Funding for these projects can be compared to that of previous space and arms races, demonstrating the desire for higher speed computing and its benefits. (Zhang and Yunquan 2009)
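
The comparison with a human operator is simple arithmetic and can be checked directly. The sketch below assumes a year of 365.25 days and uses the 207.3 TFlops figure quoted above; the exact number of years shifts slightly with the rounding of the TFlops figure and the year length chosen.

```python
LINPACK_TFLOPS = 207.3                       # figure quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60     # assumption: 365.25-day year

ops_per_second = LINPACK_TFLOPS * 1e12

# A person doing one calculation per second needs one second per operation.
years_by_hand = ops_per_second / SECONDS_PER_YEAR

# How far the Mole 8.5 is from the 1000 petaflop target mentioned above.
exascale_ratio = 1000e15 / ops_per_second

print(f"{years_by_hand:,.0f} years by hand for one second of Mole 8.5 work")
print(f"1000 PFlops is about {exascale_ratio:,.0f} times the Mole 8.5")
```

Under these assumptions the result is a little under 6.6 million years, and the 1000 petaflop target is several thousand times the performance of the Mole 8.5.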

3.1 Implementation

The Chinese Academy of Sciences, which owns the Mole 8.5, has recently displayed the capabilities of a hybrid cluster to the world. Chinese researchers simulated a whole H1N1 influenza virus at the atomic level, using the cluster as a computational microscope to verify theoretical understanding. This type of research leads to more effective ways of developing drugs, and the speed at which the Mole works means it can also help control epidemics by enabling scientists to develop antiviral drugs more quickly than before. To achieve this, the Mole had to simulate billions of particles behaving under accurate environmental conditions. This was made possible by the development of a specialist application that utilised Fermi GPU acceleration. (Nvidia 2011)

3.2 Architecture of the Mole 8.5 Cluster

Each node in the cluster comprises two Intel quad core CPUs and two Intel south bridge chips called input/output hubs (IOHs), which offer 10 gigabytes of bidirectional direct media interface (DMI) bandwidth for communication between the north and south bridges. The two CPUs and the IOHs are connected by Intel’s QuickPath Interconnect (QPI), a point to point processor connection that replaces the front side bus (FSB) so that the CPUs, IO hubs and routing hubs can access other components using full duplex integrated point to point connections. Two PCIe switches are attached to each IOH, and each switch enables two PCIe x16 slots. This results in support for 4 PCIe slots in total on each IOH, meaning each node can support 8 GPUs. However, due to current restrictions in the PCIe switching bus, fitting more than 6 GPUs results in non-linear scalability. Each IOH is therefore equipped with three Fermi GPUs, giving six per node, and to ensure low latency network communication an InfiniBand host channel adapter (HCA) is also included. (Zhang and Yunquan 2009)
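
The slot and GPU counts in this description can be cross-checked against the cluster totals given in section 3. The sketch below assumes that each PCIe switch provides two x16 slots, which is the reading that yields four slots per IOH; the constant names are illustrative only.

```python
NODES = 320                  # nodes in the cluster, as stated in section 3
IOH_PER_NODE = 2
SWITCHES_PER_IOH = 2
SLOTS_PER_SWITCH = 2         # assumption: two PCIe x16 slots behind each switch
GPUS_FITTED_PER_IOH = 3      # fitted in practice to keep scaling linear

slots_per_ioh = SWITCHES_PER_IOH * SLOTS_PER_SWITCH
max_gpus_per_node = IOH_PER_NODE * slots_per_ioh
fitted_gpus_per_node = IOH_PER_NODE * GPUS_FITTED_PER_IOH
total_gpus = NODES * fitted_gpus_per_node

print(slots_per_ioh, max_gpus_per_node, fitted_gpus_per_node, total_gpus)
```

Six GPUs in each of 320 nodes gives 1920 GPUs, which matches the total quoted for the cluster.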

4. Justification of Clusters in Business

A recent study reported that system downtime causes businesses to lose on average £160,000 a year and reduces potential revenue generation by 17% (Patel 2011). I feel this statistic highlights the importance of businesses implementing fault tolerant systems, and I would recommend a highly available server cluster. As mentioned previously, this type of advanced server technique provides an architecture that improves system durability and reduces the time needed to restore data. Keeping a system operational in the event of a failure should help to mitigate the loss of revenue and ensure that a business’s services remain available.

Additionally, using clusters as an advanced server technique can save a business further money. Where companies previously needed expensive mainframes or minicomputers to process data, clustering makes it possible for businesses to use inexpensive computer components to the same effect. In the past a company using a mainframe type IT solution could only buy IT equipment from a limited selection of suppliers. This kept prices high, and small to medium businesses were unable to own such systems because of the price. This is not the case if a company decides to implement clusters. The use of commodity hardware on mainstream platforms keeps the hardware inexpensive and freely available. The operating systems and other types of software can be found under open source licensing, further reducing cost.

Without clustering as an advanced server technique, a successful system can be as much of a burden to an IT department as a failed one. The problem with a successful system is that the more popular it becomes, the more its overall load increases, which in turn reduces its performance. A clustered system reduces the need to anticipate this demand, which would otherwise require a business to commit up front to expensive and possibly unnecessary hardware. As demand is difficult for companies to predict, non-clustered systems risk being underutilised, which reduces the business’s ROI. The scalability of clusters makes it possible to adjust a clustered system when performance and capacity call for it, meaning the popularity of a system does not have to be predicted. This results in an adequate level of hardware and an improved ROI. Achieving this is relatively easy and cheap through the addition of more nodes. It is also worth noting that the system can remain operational during such an upgrade, maintaining its uptime.
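
The idea of sizing a cluster to current demand rather than to a forecast can be shown with a small calculation. All figures here are hypothetical: the per-node capacity, the demand in each quarter and the single spare node kept for failover are assumptions made only for illustration.

```python
import math


def nodes_needed(requests_per_second, capacity_per_node, spare_nodes=1):
    """Nodes required to serve the load, plus spares kept for failover."""
    return math.ceil(requests_per_second / capacity_per_node) + spare_nodes


# Hypothetical demand, in requests per second, over four quarters.
demand_by_quarter = [400, 900, 1600, 2100]

plan = [nodes_needed(d, capacity_per_node=500) for d in demand_by_quarter]
print(plan)
```

A single large server would have to be bought for the final quarter's demand on day one, whereas the cluster grows a node or two at a time as the load actually arrives.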

To further justify the use of advanced server techniques, namely high availability, it is worth noting that a correctly implemented server cluster can be self-configuring, self-monitoring and self-healing. A server cluster correctly configured to self-heal achieves this through automatic backup (replication), ensuring all files are the same across multiple machines (synchronisation), and automated failover switching. This in turn frees up an IT department to perform other tasks, enhancing productivity.

Advanced server techniques are not only helping businesses get the most from their IT budgets but are also enabling ground breaking research that is helping mankind. As mentioned in section 3, Chinese researchers simulated the whole H1N1 influenza virus at speeds only made possible by clusters. Research carried out at this speed can help in the development of antiviral drugs and may ultimately save lives. These types of simulations are happening all over the world and are made possible by one or more of the techniques described in this report.

5. References

Advanced Cluster Technologies Inc. (2011). Types of clusters. Available: Last accessed 25th Oct 2011.

Best Price Computers. (2005). Cluster Computing / Computer Clusters. Available: Last accessed 3rd Nov 2011.

Bourke, T. (2001). Server Load Balancing. United States of America: O'Reilly & Associates. p03-22.

Kuzmin, A. (2003). Cluster Approach To High Performance Computing. Computer Modelling & New Technologies. 7 (2), p07-15.

Meuer, H and Strohmaier, E. (2011). Top500 Super computers June 2011. Available: Last accessed 25th Oct 2011.

Microsoft. (2010). High Availability Solutions Overview. Available: Last accessed 1st Nov 2011.

Nvidia. (2011). Bio-Informatics and Life Sciences. Available: Last accessed 5th Nov 2011.

Nvidia. (2011). Chinese Researchers Tap GPU Supercomputer for World's First Simulation of Complete H1N1 Virus. Available: Last accessed 5th Nov 2011.

Oracle. (2010). Fail Safe and Third Party Cold Failover and Clusters. Available: Last accessed 20th Oct 2011

Patel, M. (2011). Understanding the True Cost of Downtime for London Businesses. Available: Last accessed 19th

Taylor, M. (2009). About . Available: Last accessed 1th Nov 2011.

Vasiliadis, G. (2010). Parallelization and Characterization. Pattern Matching using GPUs. 1 (4), p01-05.

6. Bibliography

Murph, D. (2011). IBM's Mira supercomputer does ten petaflops with ease, inches us closer to exascale-class computing. Available: Last accessed 10th Nov 2011

Randybias. (2010). Grid, Cloud, HPC … What's the Diff?. Available: Last accessed 23rd Oct 2011.

TechTarget. (2001). Failover. Available: Last accessed 29th Oct 2011

UKFast.Net Limited. (2001). Benefits of load balancing. Available: Last accessed 25th Oct 2011.

Webopedia. (2011). load balancing. Available: Last accessed 23rd Oct 2011.

Wikipedia. (2011). Load balancing (computing). Available: Last accessed 20th Oct 2011.

Created: 2014-09-17 13:27:27 Updated: 2014-10-06 12:59:30