At JCC, the computing infrastructure has been built using the highest-quality, industrial-strength components, with the expertise and planning needed to tolerate the failure of any one component. This is done in several ways:
Virtual servers: Our virtual machines (see Virtual Server in the FAQ) are implemented across several blade servers. Blade servers are physically small machines that have very large amounts of computing power. We currently use a number of blade servers for hosting and will increase that number as required. Our standard is to be able to support the entire workload on one fewer blade than is actually deployed. The blades work together with the hypervisor (see Virtual Server in the FAQ) to support all hosted virtual servers. Each blade has multiple CPUs with 12 cores and 72 gigabytes of memory, and together the blades can host a very large number of virtual servers.
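The "one fewer blade" standard can be illustrated with a small capacity check. This is only a sketch under stated assumptions: it treats memory as the limiting resource, ignores CPU and how individual virtual servers are placed on blades, and uses a made-up blade count and made-up virtual server sizes. Only the 72 gigabytes per blade comes from the description above. The function name is hypothetical.

```python
def survives_one_blade_failure(blade_count, per_blade_memory_gb, vm_memory_gb):
    """Return True if all virtual servers still fit in memory with one blade down.

    Simplification: only total memory is compared; CPU and per-blade
    placement of individual virtual servers are ignored.
    """
    if blade_count < 2:
        return False  # no other blade available to absorb the workload
    usable_gb = (blade_count - 1) * per_blade_memory_gb
    return sum(vm_memory_gb) <= usable_gb


# Hypothetical example: 4 blades of 72 GB each, 20 virtual servers of 8 GB each.
# 160 GB is needed and 3 * 72 = 216 GB remains with one blade out of service.
print(survives_one_blade_failure(4, 72, [8] * 20))
```

When a check of this kind starts to fail, that is the signal to deploy another blade before accepting more workload.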
Server Uptime: It is possible that a blade may fail. In this case, the workload operating on that blade is migrated to one of the other available blades.
If a blade shuts down unexpectedly, there will certainly be a brief blip in service. If, however, a blade has to be taken out of service deliberately, e.g. to repair a component in the blade itself, the virtual servers are moved from that blade to the other blades transparently to those using them.
Blade Enclosure: Blades are housed in something called a blade enclosure. The enclosure provides power and cooling to the blades themselves. The enclosure contains sufficient redundancy in its power supplies that if one or more supplies fail the remaining supplies can provide plenty of power to the remaining blades.
Power: There are sufficient power supplies in the blade enclosure to allow for at least three concurrent failures. If more supplies than that fail simultaneously, JCC will shut down other blades in the enclosure not participating in the hosting work to allow the blades doing the hosting to continue to work without degradation. The enclosure can support up to 16 blades of the type that JCC is using for the hosting work.
Disk Storage: The disk storage used for customer data is provided by something called a SAN. This device is a computer system itself and currently contains 72 large fast physical disk drives. The drives are industrial strength and are designed to work continuously. They have much higher reliability characteristics than those disks sold for normal servers. The SAN presents virtual disks to each of its client servers. Customer data is stored on virtual disks.
A physical disk drive may fail, and with a large number of drives the chance that some physical disk fails is larger than for a single disk. Accordingly, the physical disks are managed by the SAN as a group and all virtual (customer) disks are managed with redundancy. That is, if a physical disk fails, customer data is not lost but continues to be available, because the SAN is specifically designed to provide continuous access to data. When a physical disk supporting some portion of a customer’s virtual disk does fail, the SAN automatically rebuilds the redundancy of the customer’s virtual disk on the remaining physical disks. When a new physical disk replaces the failed one, customer data is likewise moved to make use of the new capacity.
There are sufficient disk resources in the SAN to provide for the highly unlikely possibility of several disks failing.
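The point that many disks together are more likely to see a failure than a single disk can be made concrete with a short calculation. This sketch assumes that disks fail independently and with the same probability. The 72 disks come from the description above, but the 1% failure probability is an invented figure for illustration only, not a measured rate for these drives.

```python
def prob_at_least_one_failure(disk_count, per_disk_failure_prob):
    """Probability that at least one of disk_count disks fails.

    Assumes independent failures with an identical probability per disk.
    """
    return 1.0 - (1.0 - per_disk_failure_prob) ** disk_count


# Illustrative only: a 1% chance per disk over some period (assumed value).
single = prob_at_least_one_failure(1, 0.01)
group = prob_at_least_one_failure(72, 0.01)
print(f"one disk: {single:.3f}, 72 disks: {group:.3f}")
```

Under these assumed numbers the chance of seeing at least one failure among 72 disks is a little over one in two, which is why the redundancy is managed at the level of the whole group rather than disk by disk.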
Everything in the SAN is redundant. There are two fibre-optic paths to the SAN from the clients. There are two fibre switches, which allow up to four simultaneous paths between the SAN and the clients. There are two controllers in the SAN that duplicate each other’s work; if one fails, the remaining controller takes up the entire workload transparently. By monitoring the resources consumed by the controllers, JCC ensures that a single remaining controller can support the entire workload. There are two paths from each of the controllers in the SAN to the disks. Everything is duplicated.
Network: The internal network is also redundant. Each blade has multiple network connections and the network is connected via a set of network switches that operate in parallel. This architecture allows continued operation should any network device fail.
Monitoring: Since there is so much redundancy in this architecture, it is perfectly possible that something can fail and nobody will notice. This is an obviously risky situation since a second failure can render something inoperable. To protect against such a hidden failure, we maintain a continuous automated survey of the entire infrastructure. Should something unexpected happen this monitoring tool will send internal E-mail to those responsible at JCC for systems operation so that the condition can be rectified. If the failure is serious enough, our maintenance provider will also be notified automatically so that they can pull together the parts necessary to effect repairs and arrive on-site in a timely manner.
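The notification rule described above can be sketched as a small routing function. This is not the monitoring tool JCC uses. The severity levels, class names, and recipient labels are all hypothetical and serve only to show the two-tier escalation: staff are always notified, and the maintenance provider is added for serious failures.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = 1  # redundancy reduced, service unaffected (assumed level)
    SERIOUS = 2  # needs parts or an on-site visit (assumed level)


@dataclass
class Event:
    component: str
    severity: Severity


def recipients_for(event):
    """Return the list of parties to notify for a monitoring event."""
    recipients = ["operations staff"]  # always notified by internal e-mail
    if event.severity is Severity.SERIOUS:
        recipients.append("maintenance provider")
    return recipients


print(recipients_for(Event("power supply 3", Severity.WARNING)))
print(recipients_for(Event("SAN controller A", Severity.SERIOUS)))
```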
Reliable Power: JCC has only one power line entering the building. As you may know, the AC power delivered by the utility is a shared resource used by all utility customers. It experiences the effects caused by everybody using that power simultaneously. For instance, starting any motor such as a refrigerator compressor will cause large “spikes” on the line. Utilities are not required to even out those spikes because the cost of doing so would be prohibitive.
These spikes can and do damage power supplies. JCC uses a device called a UPS that accepts the utility power, converts it to direct current (DC) and uses that to charge a large bank of batteries (the cabinet is about 5 feet high, 3 feet wide and 4 feet deep). The batteries then power all computer equipment in the building by running the DC through an inverter to provide AC. The result is a very clean power supply to all computers, which means much higher reliability for power supplies and other delicate electronic circuits.
In the graph on this page, the erratic blue line displays power coming into the UPS. The green line shows the much cleaner AC power coming out of the UPS and being delivered to our data center. The data in this graph is over a 24-hour period and represents a typical day. It is easy to see the benefit of extended lifespan for our electronic equipment.
The batteries can supply sufficient power to keep all computers running for about 15 minutes. The battery bank is sized to allow one or more batteries to fail; the remaining batteries can do the work, albeit for a shorter time. Battery health is monitored continuously.
Should utility power be interrupted, this is detected by something called a transfer switch. After the interruption the switch waits for a minute or two and then activates a generator that will power the entire office. This generator runs on natural gas which is supplied by the gas utility and we therefore consider it essentially inexhaustible. The generator takes the place of utility power until power is restored.
During the most recent extended power outage in late June through early July, 2012, JCC’s systems remained running and our office was even air-conditioned as the generator is sized for powering the entire office. JCC was up and our internet connections were secure and running for the entire outage.
Internet Connection: At this writing, JCC has a single high-speed (fractional Ethernet) line from the internet coming into our office. We are currently exploring the addition of a separate connection from a second internet supplier and will install that as soon as we can reliably do so. The issues involved are not simple, as our providers will have to deal with the fact that we are also routing our packets.
There will be further benefits to this second line. Customers will have multiple network routes to JCC. Of course, customers don’t really choose routes, but their providers do, and having a choice will allow connections that are faster for their particular networks. This should give all JCC customers the fastest available connections to their hosted servers.