At JCC, the computing infrastructure is built from high-quality, industrial-strength components, and it has been planned with the expertise needed to tolerate the failure of any one component. This is done in several ways:
Virtual servers: Our virtual machines (see Virtual Server in the FAQ) are implemented across several blade servers. Blade servers are physically small machines with a great deal of computing power. We currently use a number of blade servers for hosting and will increase that number as required. Our standard is that the entire workload must fit on one fewer blade than is actually deployed. The blades work together with the hypervisor (see Virtual Server in the FAQ) to support all hosted virtual servers. Each blade has multiple CPUs with 12 cores and 72 gigabytes of memory, and together the blades can host a very large number of virtual servers.
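The "one fewer blade" standard can be illustrated with a small capacity check. This is only a sketch, not JCC's actual tooling: the names `Blade` and `fits_on_one_fewer_blade` are hypothetical, only memory is considered, and the check compares totals rather than placing individual virtual servers.

```python
from dataclasses import dataclass


@dataclass
class Blade:
    name: str
    memory_gb: int = 72  # per-blade memory figure quoted in the text


def fits_on_one_fewer_blade(blades, vm_memory_gb):
    """Return True if the total VM memory still fits after losing one blade.

    The largest blade is assumed to be the one lost, which is the worst case.
    This is an aggregate check only (an assumption of this sketch).
    """
    if len(blades) < 2:
        return False  # a single blade leaves nothing to fail over to
    capacities = sorted(b.memory_gb for b in blades)
    surviving_capacity = sum(capacities[:-1])  # drop the largest blade
    return sum(vm_memory_gb) <= surviving_capacity


blades = [Blade(f"blade{i}") for i in range(4)]  # hypothetical blade count
print(fits_on_one_fewer_blade(blades, [64, 64, 48, 24]))
```

With four 72 GB blades, the workload must stay at or below 216 GB of memory to meet the standard.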
Server Uptime: It is possible that a blade may fail. In this case, the workload running on that blade is migrated to the other available blades.
If a blade shuts down unexpectedly, there will be a brief interruption in service. However, when a blade must be taken out of service for a planned reason, e.g. repair of a component in the blade itself, the virtual servers are moved from that blade to the other blades transparently to those using them.
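The migration described above can be sketched as a simple re-placement step. The hypervisor's real placement logic is not described here, so everything in this example is an assumption: the function name `evacuate`, the use of memory as the only resource, and the greedy "most free memory first" rule.

```python
def evacuate(failed, blade_capacity_gb, vm_size_gb, placement):
    """Move every VM off the failed blade onto the surviving blades.

    placement maps VM name -> blade name. Returns a new placement dict.
    Greedy rule (an assumption): largest VM first, onto the blade with
    the most free memory.
    """
    free = {b: cap for b, cap in blade_capacity_gb.items() if b != failed}
    if not free:
        raise RuntimeError("no surviving blade to migrate to")
    for vm, blade in placement.items():
        if blade != failed:
            free[blade] -= vm_size_gb[vm]

    new_placement = dict(placement)
    displaced = sorted(
        (vm for vm, blade in placement.items() if blade == failed),
        key=lambda vm: vm_size_gb[vm],
        reverse=True,
    )
    for vm in displaced:
        target = max(free, key=free.get)
        if free[target] < vm_size_gb[vm]:
            raise RuntimeError(f"not enough free memory for {vm}")
        free[target] -= vm_size_gb[vm]
        new_placement[vm] = target
    return new_placement


capacity = {"b1": 72, "b2": 72, "b3": 72}              # hypothetical blades
sizes = {"web": 16, "db": 32, "mail": 8, "app": 24}    # hypothetical VMs
before = {"web": "b1", "db": "b1", "mail": "b2", "app": "b3"}
print(evacuate("b1", capacity, sizes, before))
```

The same routine covers both cases in the text; the difference in practice is that a planned move happens while the source blade is still running, so users do not see an interruption.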
Blade Enclosure: Blades are housed in a blade enclosure, which provides power and cooling to the blades. The enclosure has enough redundancy in its power supplies that if one or more supplies fail, the remaining supplies can still provide ample power to the blades.
Power: There are sufficient power supplies in the blade enclosure to allow for at least three concurrent failures. If more supplies than that fail simultaneously, JCC will shut down other blades in the enclosure not participating in the hosting work to allow the blades doing the hosting to continue to work without degradation. The enclosure can support up to 16 blades of the type that JCC is using for the hosting work.
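The load-shedding rule above amounts to a simple power budget. The sketch below is illustrative only: the supply count, wattages, and the function name `non_hosting_blades_to_shut_down` are hypothetical, and it assumes every blade draws the same power.

```python
def non_hosting_blades_to_shut_down(
    working_supplies, watts_per_supply, hosting_blades, other_blades, watts_per_blade
):
    """Return how many non-hosting blades must be shut down to stay in budget."""
    available = working_supplies * watts_per_supply
    demand = (hosting_blades + other_blades) * watts_per_blade
    if demand <= available:
        return 0
    deficit = demand - available
    to_shed = -(-deficit // watts_per_blade)  # integer ceiling division
    if to_shed > other_blades:
        raise RuntimeError("hosting blades would be affected")
    return to_shed


# Hypothetical figures: 6 supplies of 2000 W, 16 blades drawing 500 W each.
print(non_hosting_blades_to_shut_down(6, 2000, 8, 8, 500))  # all supplies working
print(non_hosting_blades_to_shut_down(3, 2000, 8, 8, 500))  # three supplies lost
```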
Disk Storage: The disk storage used for customer data is provided by a storage area network (SAN). This device is itself a computer system and currently contains 72 large, fast physical disk drives. The drives are industrial strength and are designed to work continuously. They are considerably more reliable than the disks sold for normal servers. The SAN presents virtual disks to each of its client servers. Customer data is stored on virtual disks.
A physical disk drive may fail, and with a large number of drives the chance that some physical disk fails is greater than for a single disk. Accordingly, the physical disks are managed by the SAN as a group, and all virtual (customer) disks are managed with redundancy. That is, if a physical disk fails, customer data is not lost but continues to be available, because the SAN is specifically designed to provide continuous access to data. When a physical disk supporting some portion of a customer’s virtual disk does fail, the SAN automatically rebuilds the redundancy of that virtual disk on the remaining physical disks. When a new physical disk replaces the failed one, customer data is likewise moved to make use of the new capacity.
There are sufficient disk resources in the SAN to provide for the highly unlikely possibility of several disks failing.
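The rebuild step can be pictured with a toy model. The text does not say which redundancy scheme the SAN uses, so this sketch assumes the simplest one, two copies of each chunk on different physical disks. The function name `rebuild` and the disk and chunk names are hypothetical.

```python
def rebuild(chunk_copies, failed_disk, healthy_disks):
    """Restore two copies of every chunk after one physical disk fails.

    chunk_copies maps chunk name -> list of disks holding a copy.
    Each lost copy is re-created on the least-loaded healthy disk
    that does not already hold that chunk.
    """
    load = {d: 0 for d in healthy_disks}
    for disks in chunk_copies.values():
        for d in disks:
            if d in load:
                load[d] += 1

    rebuilt = {}
    for chunk, disks in chunk_copies.items():
        survivors = [d for d in disks if d != failed_disk]
        if len(survivors) == len(disks):
            rebuilt[chunk] = list(disks)  # chunk was not on the failed disk
            continue
        if not survivors:
            raise RuntimeError(f"all copies of {chunk} were lost")
        candidates = [d for d in healthy_disks if d not in survivors]
        if not candidates:
            raise RuntimeError(f"no disk available to re-protect {chunk}")
        target = min(candidates, key=lambda d: load[d])
        load[target] += 1
        rebuilt[chunk] = survivors + [target]
    return rebuilt


copies = {"c0": ["d0", "d1"], "c1": ["d1", "d2"], "c2": ["d2", "d3"]}
print(rebuild(copies, "d1", ["d0", "d2", "d3"]))
```

Data stays readable throughout because one copy of each chunk always survives a single disk failure; the rebuild only restores the second copy.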
Everything in the SAN is redundant. There are two fibre-optic paths from the clients to the SAN, and two fibre switches that allow up to four simultaneous paths between the SAN and the clients. There are two controllers in the SAN that duplicate each other’s work; if one fails, the remaining controller takes up the entire workload transparently. By monitoring the resources consumed by the controllers, JCC ensures that a single remaining controller can support the entire workload. There are also two paths from each controller to the disks in the SAN.
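The controller headroom check can be written as a one-line rule. This is a sketch under a stated assumption: utilisation is measured as a fraction of one controller's capacity, and the surviving controller must absorb the sum of both loads. The function name and the `limit` parameter are hypothetical.

```python
def survives_controller_failure(util_a, util_b, limit=1.0):
    """Return True if one controller could carry both controllers' load.

    util_a and util_b are fractions of a single controller's capacity.
    limit is the highest combined utilisation considered acceptable.
    """
    return util_a + util_b <= limit


print(survives_controller_failure(0.35, 0.40))  # enough headroom
print(survives_controller_failure(0.60, 0.55))  # combined load too high
```

A lower `limit`, such as 0.8, would leave a safety margin for peaks.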
Network: The internal network is also redundant. Each blade has multiple network connections and the network is connected via a set of network switches that operate in parallel. This architecture allows continued operation should any network device fail.
Monitoring: Since there is so much redundancy in this architecture, it is perfectly possible that something can fail and nobody will notice. This is an obviously risky situation since a second failure can render something inoperable. To protect against such a hidden failure, we maintain a continuous automated survey of the entire infrastructure. Should something unexpected happen this monitoring tool will send internal E-mail to those responsible at JCC for systems operation so that the condition can be rectified. If the failure is serious enough, our maintenance provider will also be notified automatically so that they can pull together the parts necessary to effect repairs and arrive on-site in a timely manner.
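The idea of catching a hidden failure can be sketched as follows. The component names, the three status levels, and the notification rule are all assumptions made for illustration; the actual monitoring tool and its thresholds are not described here.

```python
def assess(healthy, total, required):
    """Classify a redundant group of components.

    healthy:  units currently working
    total:    units installed
    required: units needed to carry the workload
    """
    if healthy < required:
        return "down"
    if healthy < total:
        return "degraded"  # still working, but redundancy has been lost
    return "ok"


def recipients(status):
    """Decide who is notified (assumed rule: escalate only when down)."""
    if status == "ok":
        return []
    if status == "degraded":
        return ["operations"]
    return ["operations", "maintenance-provider"]


# Hypothetical survey results: (healthy, total, required) per group.
survey = {
    "san-controllers": (1, 2, 1),
    "fibre-switches": (2, 2, 1),
    "network-switches": (0, 2, 1),
}
for name, counts in survey.items():
    status = assess(*counts)
    print(name, status, recipients(status))
```

The "degraded" state is the one that would otherwise go unnoticed: service is unaffected, yet one more failure would cause an outage.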