Expecting The Unexpected

Disaster recovery strategies & considerations for hedge fund firms

Bob Guilbert, Managing Director, Eze Castle Integration
Originally published in the July/August 2006 issue

In the high-stakes world of hedge funds, a business interruption can cost a firm significantly. Dependent on a sophisticated foundation of powerful computers and applications, data networks and voice communications, hedge funds cannot tolerate even the slightest disruption in their IT service.

Natural disasters, such as floods or earthquakes, are what most people envision when considering a disaster recovery plan. However, any event that prevents you from accessing the data and systems necessary to conduct business should be taken into account. A regional power failure, a rapidly spreading computer virus, employee sabotage, external data fraud, devastating terrorist attacks or even an influenza pandemic should all be covered in a disaster recovery (DR) plan.

Outages are unacceptable for today’s businesses. Consider the impact if your trading systems were to go down, or your voice communications were interrupted, during peak trading hours. Such disruptions are extremely costly for hedge funds pursuing sophisticated strategies that rely on the ability to detect and exploit short-lived inefficiencies and opportunities. They result in lost revenue, damaged relationships with clients, and a reduction in overall productivity.

Disaster recovery plans are also being evaluated by another critical audience: your prospective investors. Investors are adding DR plans to their due-diligence checklists and requesting that a comprehensive, tested disaster recovery plan be in place before they invest their money. DR plans are also closely tied to compliance and governance requirements, because investment firms are required to maintain and back up their data for regulatory reasons. A comprehensive DR and compliance plan is crucial to maintaining everyday operations and reporting activities. To monitor your fund activity at all levels and to be prepared, you need a disaster recovery plan that delivers in-depth transparency into all of your systems.

With so much at stake, your company has no option but to implement a well-planned DR strategy.

The disaster recovery considerations

Disaster recovery planning, implementation and management is a discipline requiring specialized talents that are often difficult to recruit and retain. Following are some of the initial considerations when looking to implement a DR system.

The reality of tape

Data is a firm’s most crucial asset, and protecting it is one of the most important issues in maintaining continuous business operations. Relying strictly on unstructured backup and archiving processes with unreliable media is unacceptable when dealing with your valuable data. Tape is an appropriate choice for day-to-day restoration, archiving or longer-term storage, but it is completely unsuited to the critical tasks involved in disaster recovery and business continuity.

Following are some of the uncertainties you have to consider when using tape backup:

  • Have you captured a good, valid backup? Is there data on the tape?
  • Where are you storing the data? If it is not offsite, the backup may not be helpful if your data center is destroyed.
  • Are the drives and equipment at the offsite location compatible with your tape format?
  • Assuming you have compatible systems, will the tape’s index restore correctly to achieve a successful recovery?
  • How quickly can you access the data on the tape and become operational?
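
The first of these questions, whether a backup is valid and actually contains data, can be checked mechanically rather than taken on trust. The following is a minimal Python sketch, assuming a backup is verified by restoring a file and comparing its SHA-256 checksum with the source. The file names and function names are hypothetical and serve only as an illustration; this is not a description of any particular backup product.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, restored: Path) -> bool:
    """A restore counts as valid only if it is non-empty and matches the source."""
    if not restored.exists() or restored.stat().st_size == 0:
        return False
    return sha256_of(original) == sha256_of(restored)


def demo() -> tuple:
    """Compare one good restore and one empty restore against the same source."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "positions.csv"        # hypothetical source file
        good = Path(tmp) / "restored_good.csv"      # restore that succeeded
        empty = Path(tmp) / "restored_empty.csv"    # restore that produced no data
        source.write_bytes(b"fund,position\nA,100\n")
        good.write_bytes(source.read_bytes())
        empty.write_bytes(b"")
        return verify_restore(source, good), verify_restore(source, empty)
```

A check of this kind answers only the validity question. Offsite storage, format compatibility and restore speed still have to be confirmed by an actual restore at the recovery location.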

Upfront and ongoing costs

Business-specific requirements for hedge fund firms will vary. Firms that adopt buy-and-hold “long” strategies have fewer trading requirements. Firms that pursue technical and sophisticated strategies to exploit inefficiencies are very sensitive to downtime, as their strategies require the ability to execute fast, high-volume trades. A firm’s disaster recovery preparations must reflect its underlying business strategy. These underlying requirements in turn directly shape capital-budget decisions.

Universal upfront costs include server hardware, software, connectivity and other resources, such as staff training. Collectively, these represent major investments of capital. More broadly, firms must consider whether outsourcing disaster recovery to a service provider or keeping it in-house is right for their business. Questions to consider when evaluating in-house versus outsourcing include “Should you lease the real estate and procure, install and maintain all of that equipment yourself?” and “What are the capital-budget implications of outsourcing DR versus handling it in-house?”

The upfront capital costs of the two approaches, in-house and outsourced, are generally the same, but the ongoing maintenance and management costs vary between them and should be given careful consideration.

Outsourcing

Understandably, many firms are unenthusiastic about investing their valuable time in understanding, executing and managing a thoughtful, comprehensive disaster recovery plan. Most firms prefer instead to devote their time to the revenue-generating activities of the firm, and want to focus on trading strategies and investment opportunities. However, having a disaster recovery system is a crucial business operations component of any responsible investment firm.

Therefore, firms are increasingly outsourcing appropriate portions of the disaster recovery plan, or the entire plan, to qualified service providers who can bring infrastructure, expertise and focus to their DR requirements and challenges.

Determining the right disaster recovery approach for your fund

Assessing all of your critical systems and deciding which data, application and voice systems matter most are the first steps in formulating a disaster recovery strategy. A key objective in prioritizing your various applications, systems and data sources is determining their Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs).

An RPO is the targeted point in time to which systems and data must be recovered after an outage, and represents the maximum amount of data loss a business can incur in an outage. Organizations must first determine their RPO and then build a DR solution that meets it. For example, a trading application might have an RPO of 30 seconds. In the event of an outage and recovery, only the latest 30 seconds of data would be lost; everything up until those last 30 seconds would be available.

The goal for the amount of time it takes to actually recover that lost data or service is represented by the RTO. In other words, how long are you willing to wait for recovery of your data? The RTO for mission-critical systems – such as trading or voice systems – might be extremely short while the RTO for a general ledger system might be several hours. These choices carry significant implications in terms of the investments they require, so to make the right choices for your firm you need to carefully analyze the various tradeoffs.
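
The two objectives can be made concrete with a small calculation. The following Python sketch is illustrative only: it reuses the 30-second trading RPO from the example above, while the 15-minute RTO, the timestamps and all names (`RecoveryTarget`, `meets_targets`) are assumptions introduced for the illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTarget:
    system: str
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def meets_targets(target: RecoveryTarget,
                  last_replicated: datetime,
                  outage_start: datetime,
                  service_restored: datetime) -> dict:
    """Compare an actual (or rehearsed) outage against the RPO and RTO."""
    # Data written after the last replication point is lost.
    data_loss = max(outage_start - last_replicated, timedelta(0))
    # Downtime runs from the start of the outage until service returns.
    downtime = service_restored - outage_start
    return {"rpo_met": data_loss <= target.rpo,
            "rto_met": downtime <= target.rto}


def example() -> dict:
    # Hypothetical figures: 30-second RPO, assumed 15-minute RTO.
    trading = RecoveryTarget("trading", timedelta(seconds=30), timedelta(minutes=15))
    outage = datetime(2006, 7, 3, 10, 0, 0)
    return meets_targets(
        trading,
        last_replicated=outage - timedelta(seconds=20),  # 20 seconds of data lost
        outage_start=outage,
        service_restored=outage + timedelta(minutes=20),  # 20 minutes of downtime
    )
```

In this hypothetical outage the RPO is met, because 20 seconds of data were lost against a 30-second allowance, but the RTO is missed, because service took 20 minutes to return against a 15-minute target. Running the same comparison against DR test results shows where additional investment would be needed.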

A firm’s trading strategy affects its RPOs and RTOs. If your firm, for example, is primarily engaged in high-complexity arbitrage or sophisticated quant strategies, your RTOs might be shorter than those of a firm that primarily pursues buy-and-hold strategies.

The “key contributor” dimension is another consideration. Ensure that employees whose knowledge of real-time data is most crucial receive added emphasis and attention in the DR plan. It is key that the biggest revenue producers (or, perhaps, portfolio managers) receive higher priority for service recovery.

Hot sites vs. remote sites: the trade-offs

A hot site is an offsite physical location where copies of a business’ critical systems, such as trading applications and data, are maintained. A hot site also includes an office space from which employees can work during an outage. The office space can include real estate with separate offices, cubes, desks, workstations, phones and additional office resources and infrastructure.

A hot site must be located within a reasonable distance of a firm’s primary location so employees can reach it quickly. If it is too close, however, an earthquake or hurricane that takes a wide path could make both locations inaccessible. If it is too far away, employees may not be willing to travel the 50-100 miles to reach it, particularly at a time of natural disaster or unrest that leaves their home or family vulnerable.

It is imperative to understand that hot site operators “overbook” their facilities, much as airlines do. These facilities charge on a per-seat basis, so they regularly overbook their seats to maximize their profit. In the event of a far-reaching crisis, firms may end up competing with other hot site customers for the same facilities. It is important to understand your rights and access privileges.

A remote site, by contrast, provides a more efficient and concentrated set of services that are often more suitable for a hedge fund. Without physical desks and office infrastructure, a remote site provides a replica of a firm’s IT environment that employees can securely access through standard Internet connections. In most cases, this model provides several advantages:

  • Convenience for employees – An effective disaster recovery plan includes contingencies for varying types of outages. Locking employees into meeting at, or working from, one location can reduce the plan’s effectiveness. In addition, employees need not be concerned with traveling great distances under trying circumstances.
  • Lower cost – Real-estate, office equipment and telecommunications costs are cut because maintaining a duplicate physical location is not required. Employees can easily access key business systems and data via the Internet from virtually anywhere.
  • Dedicated IT Resources – Dedicated IT resources are housed and professionally managed at the remote site, which eliminates competition with other hot site clients for limited resources and space.

Additional evaluation criteria

Three additional important factors to consider when evaluating and selecting a disaster recovery system are infrastructure, security and testing.

Redundant infrastructure

The infrastructure of the remote site or hot site must have multiple levels of redundancy designed and built into each of the following aspects of the facility.

  • Network – A DR facility must have redundant network equipment and multiple service providers.
  • Power – Multiple sources of power are key, as well as backup power generators and on-site fuel to run the generators.
  • Air Conditioning – Backup cooling systems are a key component of a DR facility because servers and other systems generate a significant amount of heat.
  • Security – Uncompromisingly high levels of security are essential. Key security technologies include virtual private networks (VPNs), virtual LANs, firewalls, and more.
  • Redundant Systems – Whether it is routers, servers or T1 lines, a remote site or hot site provider should always have an extra unit ready for deployment if a primary unit fails.
  • Storage – The best deployments use RAID methodologies to “stripe” data across systems for performance, and mirror that data for improved protection and availability.
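
The stripe-and-mirror idea in the storage point can be illustrated with a toy model. The following Python sketch is not a real RAID implementation: it treats each “disk” as an in-memory list, and the chunk size and function names are assumptions chosen for the illustration.

```python
def stripe(data: bytes, disks: int, chunk: int = 4) -> list:
    """Split data into chunks and deal them round-robin across the disks."""
    layout = [[] for _ in range(disks)]
    for offset in range(0, len(data), chunk):
        layout[(offset // chunk) % disks].append(data[offset:offset + chunk])
    return layout


def mirror(layout: list) -> list:
    """Keep two identical copies of every striped disk."""
    return [[list(disk), list(disk)] for disk in layout]


def read_back(mirrored: list, failed: set) -> bytes:
    """Rebuild the data, skipping any (disk, copy) pair listed as failed."""
    survivors = []
    for disk_index, copies in enumerate(mirrored):
        alive = [copy for copy_index, copy in enumerate(copies)
                 if (disk_index, copy_index) not in failed]
        if not alive:
            raise OSError(f"both copies of disk {disk_index} are lost")
        survivors.append(alive[0])
    # Re-interleave the chunks in the order in which they were dealt out.
    out = bytearray()
    for row in range(max(len(disk) for disk in survivors)):
        for disk in survivors:
            if row < len(disk):
                out += disk[row]
    return bytes(out)
```

In this model the data survives the loss of either copy of a striped disk, and is lost only when both copies of the same disk fail, which is the protection the mirroring is meant to provide.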

Security

Since remote and hot sites have a constant flow of people, most of them unaffiliated with your company, these sites should have an even higher standard of physical security than a firm’s primary location. Important must-haves include:

  • Locked cabinets, cages and rooms housing equipment
  • Human security, including guards monitoring video cameras, patrolling and managing visitor logs
  • Biometric security
  • Perimeter/monitoring security

Testing

Only disaster recovery plans that are regularly and rigorously tested are considered useful. When an outage occurs, firms do not want to rely on an untested DR plan where gaps, mistakes and failures could be encountered and leave employees without service. Regular testing allows a firm to find and amend gaps caused by technology changes or upgrades, and also trains employees so they are comfortable when the DR plan is actually executed.

The best testing technique is to start small and build up to a full, comprehensive test that includes an unannounced exercise. By starting small, employees can become familiar with the resources available to them during an outage. As testing requires the shutdown of various systems and components to ensure appropriate failovers occur, experienced individuals with training in DR solutions should lead these tests. Essential plan testing guidelines include:

  • Provide detailed procedures to employees and closely follow them during a test
  • Verify the backup data and telephone trees
  • Use actual data for testing
  • Change the scenario from test to test

Conclusion

In most instances, business continuity events stem from ordinary occurrences such as a local power failure or a water-main break. However, even mundane matters such as these can do serious damage to an investment firm’s operations. Careful analysis and planning can prepare a firm to “expect the unexpected” and help ensure business-as-usual operations in the event of an outage.

In closing, here is a brief checklist for building a disaster recovery plan:

  • Analyze voice and data systems and determine RPOs and RTOs for each system
  • Classify each into prioritized categories
  • Identify key performers requiring earliest access
  • Identify remote-site provider candidates
  • Establish stringent service level agreements with service providers
  • Evaluate the quality of the provider’s infrastructure
  • Assess the provider’s data/voice and physical facility security
  • Assess the provider’s testing plans