Friday, March 15, 2013

Server System Stability

We have been asked recently how VisionLink maintains a stable server platform. Here is a summary of what we do, working from real-time responses to longer-term planning.


We know that our customers cannot help their clients effectively if the technology they depend on is not available. It is not a pleasant experience to be working with a person or family in the middle of a crisis and be unable to connect them to the assistance they need.

1. Stable Technology

Our servers often exceed 1.5 million hits per day. Our uptime has averaged better than 99.9% for the past decade and 99.97% for the past 12 months. We continue to look for new ways to improve on these numbers.

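To put the uptime percentages above in concrete terms, the short sketch below converts an uptime figure into the downtime it allows over one year. The function name and the 8,760-hour year (leap days ignored) are assumptions made for illustration, not part of any VisionLink tooling.

```python
# Convert an uptime percentage into the downtime it allows over a period.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap days ignored for simplicity


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for uptime in (99.9, 99.97):
        hours = allowed_downtime_hours(uptime)
        print(f"{uptime}% uptime -> about {hours:.2f} hours of downtime per year")
```

By this arithmetic, 99.9% uptime corresponds to at most about 8.76 hours of downtime in a year, and 99.97% to roughly 2.63 hours.
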
2. Real-Time Monitoring

Our IT team has created monitors for many hundreds of data points across the CommunityOS server and network platform, including thermal sensors, data transfer rates, server request completion rates, storage capacities and much more.  These monitors let our staff respond to issues before service is degraded or interrupted.
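As a rough illustration of how a monitor can prompt a response before service is affected, here is a minimal threshold check with a warning level below the critical level. The metric names, limits, and function names are hypothetical and are not taken from the CommunityOS monitoring setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one monitored data point."""
    warning: float
    critical: float


# Illustrative limits only; the metric names and numbers are assumptions.
THRESHOLDS = {
    "cpu_temp_celsius": Threshold(warning=70.0, critical=85.0),
    "disk_used_percent": Threshold(warning=80.0, critical=95.0),
    "failed_request_percent": Threshold(warning=1.0, critical=5.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"
    if value >= limits.warning:
        return "warning"
    return "ok"


def alerts(readings: dict[str, float]) -> dict[str, str]:
    """Return only the readings that need attention."""
    statuses = {metric: classify(metric, value) for metric, value in readings.items()}
    return {metric: status for metric, status in statuses.items() if status != "ok"}
```

The warning level is the part that matters here: it fires while the system is still healthy, which gives staff time to act before the critical limit is reached.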

3. 24-Hour Response

CommunityOS systems are busy around the clock, which requires staff who are able and willing to respond quickly regardless of the time of day. We have paging systems, on-call calendars, and other procedures in place to provide round-the-clock support.

4. System Maintenance

Regular maintenance is required to keep the systems operating at peak efficiency. We typically schedule maintenance windows on Wednesday evenings so that we can install fixes, security updates, and enhancements, and carry out other necessary upkeep. Doing this well requires many hours of preparation and testing to ensure that a short maintenance window can be used quickly and safely.

5. Demand Forecasting

We continuously monitor average and peak levels of demand so that we can make good business decisions about when and how to expand capacity. During the Joplin tornado, for example, we experienced demand levels 400% beyond requirements (and have since expanded capacity). This is challenging work; the larger the system, the more difficult it is to scale quickly.
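One simple way to compare average and peak demand against capacity is sketched below. The function names, the hourly sampling, and the 70% utilization target are assumptions chosen for illustration; they do not describe VisionLink's actual forecasting method.

```python
from statistics import mean


def demand_summary(hourly_requests: list[int], capacity_per_hour: int) -> dict[str, float]:
    """Summarize average and peak demand relative to capacity."""
    if not hourly_requests:
        raise ValueError("hourly_requests must not be empty")
    if capacity_per_hour <= 0:
        raise ValueError("capacity_per_hour must be positive")
    average = mean(hourly_requests)
    peak = max(hourly_requests)
    return {
        "average": average,
        "peak": peak,
        # How spiky the load is: 1.0 means perfectly flat demand.
        "peak_to_average": peak / average if average else 0.0,
        # Fraction of capacity consumed at the busiest hour.
        "peak_utilization": peak / capacity_per_hour,
    }


def needs_more_capacity(summary: dict[str, float],
                        max_peak_utilization: float = 0.7) -> bool:
    """Flag when the busiest hour uses more capacity than the chosen target."""
    return summary["peak_utilization"] > max_peak_utilization
```

Tracking the peak separately from the average matters because a disaster-driven surge can overwhelm a system whose average load looks comfortable.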

6. Redundant Server Facilities

Our server systems are redundant within their own facilities, and then redundant across multiple server site locations. The primary and fallback sites run continuously and continuously distribute data among the primary and fallback systems. Equipment fails all the time; the point is to be sure that redundant systems are in place, configured correctly, and ready to take up the load.

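The benefit of redundancy can be estimated with a standard reliability formula: if sites fail independently, the service is down only when every site is down at once. The sketch below applies that formula. The independence assumption is a simplification (real failures are often correlated), and the function name and example figures are illustrative rather than VisionLink measurements.

```python
from math import prod


def combined_availability(site_availabilities: list[float]) -> float:
    """Availability of a service that stays up while at least one site is up.

    Assumes site failures are statistically independent.
    """
    if not site_availabilities:
        raise ValueError("at least one site is required")
    for availability in site_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availabilities must be between 0 and 1")
    # Probability that every site is unavailable at the same moment.
    all_down = prod(1.0 - availability for availability in site_availabilities)
    return 1.0 - all_down


if __name__ == "__main__":
    # Two hypothetical sites, each available 99% of the time.
    print(f"{combined_availability([0.99, 0.99]):.4%}")
```

Under these assumptions, two sites that are each available 99% of the time give a combined availability of 99.99%, provided the fallback is configured correctly and ready to take the load.
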
7. System Architecture

The most important, and yet most invisible, part of server stability is the set of decisions made by our IT professionals. We insist on industry-standard solutions so that fixes are easily acquired, and we carefully build internal redundancies into everything from how fiber enters a server facility to the redundant machines, cooling, and power. At a more technical level, very specific decisions affect not only the efficiency and effectiveness of day-to-day operation but also how easily future changes can be implemented. It takes years of domain-specific technical expertise to make these decisions correctly.

8. Investment in the Infrastructure

Maintaining server systems that are responsive, redundant, and deployable from multiple locations is expensive. VisionLink as a company, and our customers across the nation, recognize the need for this kind of investment. It is about choices: invest in more features or in a more stable server platform? The same dollar cannot do both.

9. The Art & Science of Compromise

The truth is that making decisions about server system priorities is part art and part science. If budgets were unlimited, the answers would be easy. We rely on nearly 15 years of experience and a highly qualified staff to make critical decisions about how to resource each part of the server and network infrastructure. We do not always get it right, but running at better than 99.9% uptime for more than a decade suggests we do so more often than not.

10. Thanks to Staff & Partners

Behind the scenes are professionals running these systems and making the kinds of decisions that can have critical consequences at any time. Great people working with great customers make it possible to deploy stable servers and to solve problems very quickly when they do arise. Clearly, however, the time, effort, and money spent up front reduce the likelihood of failure and make recovery that much faster.

Douglas Zimmerman
VisionLink, Inc.