By Ravi Prakash, Product Manager
As a healthcare CIO charged with the transition to electronic healthcare records, you’ve done your due diligence, evaluated various EHR vendors and most likely settled on Epic, a software vendor with over 25% of the US acute care hospital market share. You’ve followed Epic guidelines for minimum hardware requirements and acquired what you believe is the right workstation, server, networking, and storage infrastructure for Epic. How do you take precautions to ensure that you don’t have outages due to underlying shared infrastructure – comparable to the six-day outage reported at Boston Children’s Hospital in 2015?
There are many factors unrelated to Epic itself which could cause latency and even downtime:
If you are a large organization, Epic recommends a tiered database architecture based on InterSystems’ Enterprise Caché Protocol (ECP) technology. Caché is a commercialized version of a 53-year-old database called MUMPS (Massachusetts General Hospital Utility Multi-Programming System). A unique characteristic of Caché is its I/O access pattern: continuous, random database file reads, interspersed every 80 seconds with a large burst of writes. This means the shared SAN-attached storage array needs enough free cache to absorb each 80-second cycle’s worth of writes and de-stage them before the next write burst from Caché hits the array. If the storage array doesn’t have enough write cache available, it will hold off acknowledging write requests until its own cache is freed up. If Epic Caché cannot finish its writes within 80 seconds, database access could exhibit latency. This may be no fault of Epic’s, but a consequence of the underlying shared hardware infrastructure.
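To make the write-cache reasoning above concrete, here is a minimal back-of-the-envelope sketch. It is a deliberately simplified model, not a sizing tool: it assumes the whole burst lands at once, that the cache starts each cycle with a known amount of free space, and that the array de-stages at a steady rate. The function name, parameters, and all figures are hypothetical; only the 80-second cycle comes from the discussion above.

```python
WRITE_CYCLE_SECONDS = 80  # interval between Caché write bursts, as described above


def can_absorb_burst(burst_mb, free_write_cache_mb, destage_mb_per_s,
                     cycle_s=WRITE_CYCLE_SECONDS):
    """Return True if the array can take one write burst without stalling.

    Simplified assumptions (illustrative only):
      * the burst must fit entirely in the free write cache, and
      * the cache must be de-staged to disk before the next burst arrives.
    """
    fits_in_cache = burst_mb <= free_write_cache_mb
    destage_seconds = burst_mb / destage_mb_per_s
    drains_before_next_burst = destage_seconds <= cycle_s
    return fits_in_cache and drains_before_next_burst


# Hypothetical figures: a 4,000 MB burst every 80 seconds.
healthy = can_absorb_burst(4000, free_write_cache_mb=8000, destage_mb_per_s=100)
cache_too_small = can_absorb_burst(4000, free_write_cache_mb=2000, destage_mb_per_s=100)
destage_too_slow = can_absorb_burst(4000, free_write_cache_mb=8000, destage_mb_per_s=40)
```

In the second and third cases the array would be holding off write acknowledgements, either because the burst does not fit or because the previous burst has not drained in time. On shared storage, the free cache figure also depends on what every other workload on the array is doing at that moment.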
Why can’t you just rely on the monitoring tools provided by your storage vendor? For the simple reason that arrays usually offer tracing with 60-second summaries. As you can imagine, a 60-second summary of an 80-second process isn’t terribly useful. VirtualWisdom has wire-level visibility into Fibre Channel (or NAS protocol) traffic – every single conversation at line rate – and offers second-by-second summaries, which can make a world of difference. In addition, 99.9th and 99.99th percentiles are shown by timing every single exchange on the wire and placing each timed result into histogram buckets, ranging from sub-millisecond upwards.
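The idea of timing every exchange and reading percentiles off histogram buckets can be sketched in a few lines. This is a generic illustration of the technique, not VirtualWisdom’s implementation: the bucket boundaries, function names, and sample latencies are all assumptions made up for the example.

```python
import bisect
import math

# Hypothetical bucket upper bounds in milliseconds, from sub-millisecond upwards.
BUCKET_UPPER_BOUNDS_MS = [0.5, 1, 2, 5, 10, 20, 50, 100, float("inf")]


def build_histogram(latencies_ms):
    """Count each timed exchange in the first bucket whose bound is >= its latency."""
    counts = [0] * len(BUCKET_UPPER_BOUNDS_MS)
    for latency in latencies_ms:
        counts[bisect.bisect_left(BUCKET_UPPER_BOUNDS_MS, latency)] += 1
    return counts


def percentile_bucket(counts, percentile):
    """Return the upper bound (ms) of the bucket containing the given percentile."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no exchanges recorded")
    target = math.ceil(total * percentile / 100)
    running = 0
    for bound, count in zip(BUCKET_UPPER_BOUNDS_MS, counts):
        running += count
        if running >= target:
            return bound
    return BUCKET_UPPER_BOUNDS_MS[-1]


# Made-up sample: 10,000 exchanges, almost all fast, a handful slow.
sample_ms = [0.8] * 9980 + [8.0] * 15 + [60.0] * 5
histogram = build_histogram(sample_ms)
mean_ms = sum(sample_ms) / len(sample_ms)
p99_9 = percentile_bucket(histogram, 99.9)
p99_99 = percentile_bucket(histogram, 99.99)
```

With this sample the mean stays below 1 ms, while the 99.9th and 99.99th percentiles fall in the 10 ms and 100 ms buckets respectively. That tail is exactly what an average-only summary hides.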
You may be thinking: But I’ve followed all the guidelines sent to me by Epic… For instance, Epic might tell you “all average read latencies must be 12 msec or less for ECP configurations”. However, your datacenter monitoring expert might turn around and ask you “Noted, but at what granularity? Is it to be every 5 min, every 1 min or every 1 sec?” Guidelines from a software vendor are good, but the devil is in the detail of how customers interpret them.
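The granularity question is easy to demonstrate with a small sketch. The per-second latency series below is invented for illustration (a steady 2 ms with one 20-second stall at 100 ms), and the helper names are hypothetical; only the 12 msec figure comes from the guideline quoted above.

```python
from statistics import mean

THRESHOLD_MS = 12  # average read latency limit quoted above


def window_averages(per_second_ms, window_s):
    """Average the per-second samples over consecutive windows of window_s seconds."""
    return [mean(per_second_ms[i:i + window_s])
            for i in range(0, len(per_second_ms), window_s)]


def worst_average(per_second_ms, window_s):
    """Return the highest windowed average, i.e. the figure checked against the limit."""
    return max(window_averages(per_second_ms, window_s))


# Invented 5-minute trace: 2 ms reads with a 20-second stall at 100 ms.
per_second_ms = [2.0] * 130 + [100.0] * 20 + [2.0] * 150

worst_5min = worst_average(per_second_ms, 300)
worst_1min = worst_average(per_second_ms, 60)
worst_1sec = worst_average(per_second_ms, 1)
```

The same trace passes the 12 msec test when averaged over 5 minutes and fails it at 1-minute and 1-second granularity, so "compliant" depends entirely on the window you choose.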
You may well ask: Why can’t I just rely on Epic System Pulse to monitor Epic? For the simple reason that while System Pulse does a good job of collecting performance and health metrics across the Epic service, it has no visibility into anything like shared SAN or shared network storage – one of the primary causes of Epic application slowdowns.
What else could go wrong? If your Clarity reporting application is running on a host connected via Fibre Channel Host Bus Adapters (HBAs) to a Brocade SAN switch which uses Virtual Channels, you might have a scenario where all existing HBAs share the same Virtual Channel, resulting in significant congestion. With help from VirtualWisdom, and by manually mapping FCIDs, you could determine which Virtual Channels are being used by specific device ports. This will help you determine whether ports should be reallocated and additional HBAs deployed so that they can use other Virtual Channels. Alternatively, you might have incorrect queue depth settings on the HBAs in the hosts running Clarity, which may cause latency in the SAN fabric. VirtualWisdom can help you detect this and recommend the right queue depth settings to use.
What could go wrong outside the server (virtualized or physical), HBA and storage? Third-party backup software (for backup of Epic, including the Caché database) may end up taking hours longer than expected to complete. VirtualWisdom could use time comparison charts to show you when there is a gap in the backup read workloads. Using this trend data, your backup manager could identify what may turn out to be a timeout in the dedupe engine of the backup software.
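Spotting a gap in a backup read workload amounts to looking for sustained stretches where read throughput drops to near zero. The sketch below shows one simple way to do that on a per-second throughput series. The thresholds, function name, and data are hypothetical, and this is not how any particular product does it.

```python
def find_read_gaps(read_mb_per_s, idle_below=1.0, min_gap_s=30):
    """Return (start_second, end_second) pairs where reads stayed below
    idle_below MB/s for at least min_gap_s consecutive seconds."""
    gaps = []
    gap_start = None
    for second, rate in enumerate(read_mb_per_s):
        if rate < idle_below:
            if gap_start is None:
                gap_start = second  # a quiet stretch begins
        elif gap_start is not None:
            if second - gap_start >= min_gap_s:
                gaps.append((gap_start, second))
            gap_start = None
    # Handle a quiet stretch that runs to the end of the series.
    if gap_start is not None and len(read_mb_per_s) - gap_start >= min_gap_s:
        gaps.append((gap_start, len(read_mb_per_s)))
    return gaps


# Invented series: steady reads, a 45-second stall, then a brief 5-second dip.
backup_reads = [200.0] * 100 + [0.0] * 45 + [200.0] * 100 + [0.0] * 5 + [200.0] * 50
gaps = find_read_gaps(backup_reads)
```

Only the 45-second stall is reported; the 5-second dip is shorter than the minimum gap and is ignored. A stall that recurs at the same point across nightly runs is the kind of pattern that would point your backup manager towards a timeout rather than a slow array.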
The recurring theme in this article is that there are many variables in the underlying infrastructure (servers, HBAs, SAN switches, networked storage, backup software), all outside the control of the Epic application, which often contribute to latency and downtime in Epic. Your goal, then, should be to catch any deviations and take proactive action before they spiral into downtime. That is the motivation behind so many healthcare firms using VirtualWisdom to monitor Epic and Cerner infrastructure. Like to learn more? Give us a call!