Hidden threats to data

Inadequacies in storage governance and weaknesses in data management may pose far less-visible risks to a company's data. To mitigate these threats, you must be aware of the impact and probability of these risks to reduce or eliminate them.


Many organizations have a good handle on external risk. They've implemented disaster recovery (DR), business continuance and security measures to protect their data and applications. On the internal security front, companies have instituted systems that limit physical and digital access to critical systems to reduce the likelihood of a disgruntled or unauthorized employee purposely or accidentally damaging/absconding with crucial data.

But while focusing on these obvious perils, firms may overlook the seemingly mundane--but potentially more damaging--dangers that can arise due to lax administration and procedures.

Inadequacies in storage governance and weaknesses in data management are often subtle and may pose far less-visible risks to a company's data. To mitigate these threats, you must be aware of the impact and probability of these risks so you can take pre-emptive action to reduce or eliminate them.

Internal risks stem from two broad exposure areas:

  • Governance exposures: weaknesses in management practices (policy, procedure and control infrastructure)
  • Data exposures: weaknesses and inadequacies in data protection

By consciously evaluating and addressing these areas, you can substantially reduce threats to your data, lower costs and improve business-unit relations.

Alignment: When IT and business units have common goals, a partnership of enablement (and even appreciation) supplants the old view of IT as a necessary evil or even an impediment. Lack of alignment can result in inadequate or poorly communicated policies that can cause data to be inappropriately handled and exposed to undue risk. You can test for alignment using soft or hard measures. Soft measures include an assessment of your relationship with the CIO, as well as assessments by managers and key business analysts of their relationships with their counterparts in the business community. Issues to consider include how often the groups meet, whether they converse on an ad hoc basis or only at scheduled meetings, and so forth.

More empirical measures include defining policies for interaction between IT and business units. For example, company policy might require IT to provide services in tiered offerings with the business units responsible solely for choosing (and paying for) those services. In such a case, you may consider tracking the following measures, as sketched in the example after this list:

  • The percentage of the IT budget related directly to business unit-initiated projects
  • The percentage of the IT budget spent on maintenance vs. development
  • On-time delivery of commitments, service levels, problem resolutions and projects
  • Business unit satisfaction
  • A defined process to regularly ensure continuous alignment
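
As a rough illustration, the following sketch computes the measures listed above from quarterly budget and delivery data. The function and field names, and all of the figures, are hypothetical assumptions rather than a prescribed reporting format.

```python
# A minimal sketch, with hypothetical input names, for reporting the
# IT/business alignment measures listed above. All figures are illustrative.
def alignment_report(budget_total, budget_bu_projects, budget_maintenance,
                     budget_development, commitments_due, commitments_on_time,
                     satisfaction_scores):
    return {
        "pct_budget_bu_initiated": 100.0 * budget_bu_projects / budget_total,
        "pct_budget_maintenance":  100.0 * budget_maintenance / budget_total,
        "pct_budget_development":  100.0 * budget_development / budget_total,
        "pct_on_time_delivery":    100.0 * commitments_on_time / commitments_due,
        "avg_bu_satisfaction":     sum(satisfaction_scores) / len(satisfaction_scores),
    }

# Illustrative quarter: 40% of spend is business unit-initiated, 85% on-time delivery.
print(alignment_report(budget_total=10_000_000, budget_bu_projects=4_000_000,
                       budget_maintenance=5_500_000, budget_development=4_500_000,
                       commitments_due=120, commitments_on_time=102,
                       satisfaction_scores=[4, 5, 3, 4]))
```

Reviewing these numbers with business units at the regularly scheduled alignment meetings keeps the measures from becoming an IT-only exercise.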

Cost management: Data management costs may be another indicator, as cost overruns reflect badly on how efficiently storage is organized and managed. Gartner Inc. and other analyst firms say that 70% of a storage organization's costs are for administration, not hardware. Besides knowing where your budget dollars go, you should consider the following, as sketched in the cost-model example after this list:

  • How costs are tracked
  • If a formal cost model identifies realistic costs to provide specific services to business units
  • The ability to correlate operational metrics to costs
  • Whether staffing levels are built on an empirical basis of a known transaction handling capability (e.g., number of alerts or number of tape movements)
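
A formal cost model need not be elaborate to be useful. The sketch below ties the cost of a storage service tier back to its drivers so operational metrics can be correlated with spend; every figure in it is an illustrative assumption, not a benchmark.

```python
# A minimal sketch, with illustrative figures, of a cost model that relates
# the cost of providing a storage tier to its drivers (hardware plus
# administration effort) so costs can be charged back per usable gigabyte.
def tier_cost_per_gb(hardware_cost, admin_hours_per_month, admin_rate_per_hour,
                     usable_gb, months=12):
    admin_cost = admin_hours_per_month * admin_rate_per_hour * months
    return (hardware_cost + admin_cost) / usable_gb

# Illustrative tier-1 figures; note that administration accounts for more than
# half of the annual total, in line with the analyst observation cited above.
print(round(tier_cost_per_gb(hardware_cost=120_000, admin_hours_per_month=160,
                             admin_rate_per_hour=75, usable_gb=20_000), 2))
```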

Asset inventory: It's difficult to manage something you don't know about. If storage assets at the component, connection and dependency level aren't documented, inadequate change management can open the door to risk. The interdependencies of all hardware and software components in the environment must also be documented, or unwelcome consequences can occur. For example, connecting another server to an available port can impact interswitch links and increase latency to the point where a key database application is disabled, perhaps losing data until the problem is fixed.
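
As an illustration of why dependency documentation matters, the sketch below (hypothetical asset names, not a real tool) shows how an inventory that records components, connections and dependencies lets you trace which downstream systems would be affected before a change such as attaching a server to a free port is made.

```python
# A minimal sketch of an asset inventory with dependency tracking.
from collections import defaultdict

class AssetInventory:
    def __init__(self):
        self.assets = {}                    # asset id -> attributes
        self.depends_on = defaultdict(set)  # asset id -> upstream asset ids

    def add_asset(self, asset_id, **attrs):
        self.assets[asset_id] = attrs

    def add_dependency(self, dependent, provider):
        self.depends_on[dependent].add(provider)

    def impacted_by(self, asset_id):
        """Return every asset that directly or indirectly depends on asset_id."""
        impacted, frontier = set(), [asset_id]
        while frontier:
            current = frontier.pop()
            for dependent, providers in self.depends_on.items():
                if current in providers and dependent not in impacted:
                    impacted.add(dependent)
                    frontier.append(dependent)
        return impacted

inv = AssetInventory()
inv.add_asset("switch01-port12", type="FC port")
inv.add_asset("isl-switch01-switch02", type="interswitch link")
inv.add_asset("db-server-07", type="server")
inv.add_dependency("isl-switch01-switch02", "switch01-port12")
inv.add_dependency("db-server-07", "isl-switch01-switch02")
print(inv.impacted_by("switch01-port12"))  # the ISL and the database server
```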

Roles and responsibilities: Well-defined roles and responsibilities are essential, but lines of demarcation between job functions must be clearly drawn. In many organizations, these lines are often vague and responsibilities seem to overlap. For example, who's responsible for host bus adapter installations--the server group or storage staff? Another example may involve interaction between backup architects and operations staff regarding ownership of backup servers and the backup LAN. Poorly defined responsibilities could result in important activities being overlooked, leaving data insufficiently protected.

In most organizations, IT services are requested in a variety of ways and in a timeframe that inevitably requires an understanding of cross-functional workflow, inputs, outputs, handoffs and control points. A classic example of such a cross-functional requirement is the process governing change control and provisioning. The interactions, roles, responsibilities and cross-functional handoffs involved in these processes must be documented and understood, and buy-in by all those participating in the process is required. If roles, responsibilities and demarcation lines aren't clearly understood, there will be no accountability.

Staffing and organization: Adequate staffing levels ensure that defined responsibilities can be met, but staffing is often a cause of friction between IT and those who control the budget. This contention may be exacerbated because IT often has difficulty making an empirical case for staffing (see "Building a staffing model," below). The old adage of "X storage administrators per so many terabytes" is too broad to be useful. A method that calculates the workload for each task based on tangible entities such as alerts, provisioning requests and changes is essential to making accurate staffing decisions. It ensures that workloads can be staffed without compromising service levels or exposing the firm to risk by assigning underskilled people to critical storage tasks.

Procedures: Standard operating procedures (SOPs) are a key element to mitigating risk. Procedures need to be in place to ensure data consistency and quality. SOPs provide a baseline, demonstrate to auditors that a defined process has been executed, and show that compliance, completion and quality metrics have been produced. SOPs allow a consistently repeatable process with lower-level skills. Without SOPs, consistent results can't be guaranteed.

Building a staffing model
An understaffed storage department contributes to risk when mistakes are made or staffers take risky shortcuts. Each company's unique blend of skills and capabilities will influence its staffing model, but the following assumptions can serve as a baseline development model.

The first step is to list the number of major technologies used in storage and backup. Typically, an individual can competently master three significant technologies. You should also consider the requirements for daily operational activities such as provisioning, tape handling and responding to alerts. Finally, complexity factors are used to weight the calculation. These factors might include the physical complexity of the environment, average skill levels and perhaps the maturity of standard operating procedures.

An example of a baseline staffing model considers the impact of heterogeneous technologies, operational transaction volumes and other complexity factors within a storage environment. Each of the three areas (technology, transaction and complexity) has its own weighting/assumption criteria, and the raw counts in each area are weighted by those criteria (e.g., 5,000 tape ejects at 2,500 ejects/person = 2 people).

Assumptions for each parameter are general-purpose values based on experience within a number of environments, but can be fine-tuned as appropriate to fit the needs of a specific situation.
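
The sketch below illustrates the calculation. The three-technologies-per-person and 2,500-ejects-per-person figures come from the model above; the alerts-per-person rate and the complexity weighting are illustrative assumptions that should be tuned to the environment.

```python
# A minimal sketch of the baseline staffing calculation described above.
import math

def baseline_staff(num_technologies, tape_ejects_per_month, alerts_per_month,
                   techs_per_person=3, ejects_per_person=2500,
                   alerts_per_person=1000, complexity_factor=1.2):
    technology_staff = num_technologies / techs_per_person
    transaction_staff = (tape_ejects_per_month / ejects_per_person
                         + alerts_per_month / alerts_per_person)
    return math.ceil((technology_staff + transaction_staff) * complexity_factor)

# Example: six major technologies, 5,000 tape ejects (= 2 people at
# 2,500 ejects/person) and 2,000 alerts per month, weighted for complexity.
print(baseline_staff(num_technologies=6, tape_ejects_per_month=5000,
                     alerts_per_month=2000))
```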

Value of data: Another key risk is being unable to ascertain the value of data under your management. If data hasn't been valued, it's unlikely to be managed appropriately and may not be available when needed. Ultimately, the value of data is equivalent to the value of the business application that accesses it.

In many organizations, once an application has been implemented, its associated data is identified and placed into complex command lines of various backup engines. Over time, all trace of ownership becomes lost and no one knows what application owns a particular file. It's critical to be able to tie data to an application and to tie that application to the business unit it services.

The key identifier of data--its application and related business owner--must be maintained in a manner similar to that of an asset register. This register should include data interdependencies, both parent and sibling. Without an understanding of where data comes from and where it goes, application interdependency can't be determined. Without application interdependency, recovery of a logically verifiable point becomes extremely difficult, if not impossible. This means that while applications and data may be recoverable, the combined application functionality may fail because dependencies weren't synchronized and managed during the recovery process.
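
A simple register along these lines might look like the following sketch (hypothetical names throughout); the dependency check at the end mirrors the recovery-synchronization problem described above.

```python
# A minimal sketch of a data asset register that ties each data set to its
# owning application and business unit and records parent/sibling
# interdependencies for recovery sequencing.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataAsset:
    name: str                 # e.g., a database, file system or backup selection
    application: str          # the application that owns the data
    business_unit: str        # the business unit the application services
    parents: List[str] = field(default_factory=list)   # data this asset is derived from
    siblings: List[str] = field(default_factory=list)  # data that must be recovered in step

register = {
    "ar_invoices_db": DataAsset("ar_invoices_db", "AccountsReceivable", "Finance",
                                parents=["order_entry_db"],
                                siblings=["ar_invoice_images"]),
}

def unresolved_dependencies(asset_name, planned):
    """List dependencies of an asset that are missing from a recovery plan."""
    asset = register[asset_name]
    return [d for d in asset.parents + asset.siblings if d not in planned]

print(unresolved_dependencies("ar_invoices_db", planned={"ar_invoices_db"}))
```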

Archiving awareness: With the emphasis on information lifecycle management and compliance, many organizations are migrating data from high-cost primary storage tiers to lower-cost tiers or even to offline storage. If the migration involves moving rather than copying data, applications may need to be archive-aware. If data is moved from the application into a structure the application can't access, retrieving data archived for compliance activities in a timely manner may be difficult.

Data retention: Retention policies are often created without close consultation with business units or are based on rudimentary compliance requirements. Organizations often institute an across-the-board retention policy. In years past, data volumes weren't high enough to warrant much differentiation and there were few compliance requirements beyond IRS retention rules. Today, data is growing at an annual rate of 50%-plus and company lawyers are increasingly tempted to mandate keeping everything forever to avoid the consequences of non-compliance. In this environment, it's critical to develop a retention class of service with attributes identifying retention periods for various legislative initiatives, as well as required immutability, rendering, integrity and security attributes. Addressing these issues outside the framework of a class of service adds significant complexity, which can drive up administration costs and create legal exposure in retention, retrieval and security compliance.
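
A retention class of service can be expressed quite compactly, as the sketch below shows. The class names, periods and attribute values are illustrative assumptions, not legal guidance; actual values must come from counsel and the applicable regulations.

```python
# A minimal sketch of a retention class of service with its key attributes.
from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionClass:
    name: str
    retention_years: int
    immutable: bool          # must the record be tamper-proof (e.g., WORM)?
    renderable: bool         # must it be reproducible as human-readable information?
    integrity_checked: bool
    encryption_required: bool

RETENTION_CLASSES = {
    "financial-records": RetentionClass("financial-records", 7, True, True, True, True),
    "email-regulated":   RetentionClass("email-regulated", 5, True, True, True, True),
    "general-business":  RetentionClass("general-business", 3, False, False, False, False),
}

def classify(record_type):
    """Map a record type to its retention class; default to the most conservative."""
    mapping = {"invoice": "financial-records", "broker-email": "email-regulated"}
    return RETENTION_CLASSES[mapping.get(record_type, "financial-records")]

print(classify("invoice").retention_years)   # 7
```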

Recovery objectives: Unrealistic recovery point objectives (RPOs) and recovery time objectives (RTOs) are a major risk exposure. In an attempt to respond to business needs, RPO and RTO commitments may be made that don't adequately consider the realities imposed by logistics and technologies. From a logistics perspective, any RTO of less than 12 hours will probably require an automated failover. It's impractical to expect people to evaluate and declare a disaster, initiate DR at alternate points, notify the DR team and sequence the recovery, resynchronization and restart of applications in less than 12 hours.

On the technical side, the infrastructure needed to support a one-hour RTO is the same as for a four-hour or eight-hour RTO. It's only when the delta hits 24 hours that significant differentiation in the support infrastructure is required, except perhaps in very small organizations with a limited number of servers. Getting the DR team to the alternate site is one major challenge; sequencing the multitude of servers that need to be brought back is another. Once physical recovery has been completed, additional effort is required for logical synchronization that can blow out the most practical recovery objectives. DR tests that use a Dev/Test/QA infrastructure to bring applications into actual production mode (i.e., operated by users) for a 24-hour test period will reveal any exposures in this area.
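
A sanity check on RTO commitments can be reduced to a couple of rules of thumb drawn from the discussion above: anything under 12 hours effectively requires automated failover, and logical resynchronization time must be added on top of physical recovery. The thresholds and inputs in this sketch are illustrative.

```python
# A minimal sketch of an RTO feasibility check based on the rules of thumb above.
def rto_is_realistic(rto_hours, automated_failover, physical_recovery_hours,
                     logical_sync_hours):
    if rto_hours < 12 and not automated_failover:
        return False  # manual declare/notify/sequence won't fit in under 12 hours
    return physical_recovery_hours + logical_sync_hours <= rto_hours

print(rto_is_realistic(rto_hours=8, automated_failover=False,
                       physical_recovery_hours=5, logical_sync_hours=4))   # False
print(rto_is_realistic(rto_hours=24, automated_failover=False,
                       physical_recovery_hours=12, logical_sync_hours=8))  # True
```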

Data integrity: At some point in any compliance investigation, you must prove that your data hasn't been changed by unauthorized people/functions and hasn't been corrupted by intent/malfunction. The policies and SOPs supporting the protection of data integrity must include audit-agreed checkpoints and controls. It's not enough to have data archived on WORM media. Your SOPs and related controls must demonstrate a chain of custody that protected data from change the moment it reached a status that demanded immutability. A good example is deciding when to capture immutable copies of received and sent e-mails: the capture must precede any capability for deletion or modification, and the SOP must demonstrate--through completion, compliance and quality metrics--the consistent accomplishment of its stated goals.
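
One way to implement such a checkpoint is to record a cryptographic digest at the moment a record must become immutable and verify that digest at later control points. The sketch below (hypothetical field names) illustrates the idea; it is a control mechanism, not a substitute for the surrounding SOPs.

```python
# A minimal sketch of an audit-agreed integrity checkpoint: capture a SHA-256
# digest when a record becomes immutable, then verify it later.
import hashlib, json, time

def capture_checkpoint(record_id, content, log):
    """Record an integrity checkpoint for a piece of data."""
    digest = hashlib.sha256(content).hexdigest()
    log.append({"record": record_id, "sha256": digest, "captured_at": time.time()})
    return digest

def verify_checkpoint(record_id, content, log):
    """Confirm the content still matches the digest captured earlier."""
    latest = next(e for e in reversed(log) if e["record"] == record_id)
    return hashlib.sha256(content).hexdigest() == latest["sha256"]

custody_log = []
msg = b"From: trader@example.com\nSubject: Q3 forecast\n..."
capture_checkpoint("msg-000123", msg, custody_log)
print(verify_checkpoint("msg-000123", msg, custody_log))   # True
print(json.dumps(custody_log, indent=2))                   # evidence for auditors
```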

Getting started
It may seem as if there are overwhelming opportunities to mitigate risk. To start, list the risk areas in a matrix and, for each one, describe the negative outcome that could result from inattention to that area. Then rank the following attributes of each risk as high, medium or low:
  1. What's the probability of a negative outcome occurring?
  2. What's the impact on your operation if the negative outcome occurs?
  3. How difficult is it to avoid this risk?
A score indicating remediation priorities can be derived using a numeric scale of five, three and one to denote high, medium and low risks, respectively, and then applying some weighting criteria. This is a subjective assessment, but it provides an audit trail of your rationales and can be used to seek consensus or convey priorities.
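
The sketch below illustrates the scoring. The 5/3/1 scale comes from the approach described above; the weighting values and the sample risk areas are illustrative assumptions to be replaced with your own.

```python
# A minimal sketch of the risk-scoring matrix: high/medium/low ratings map to
# 5/3/1, and each attribute carries an assumed weight.
SCORE = {"high": 5, "medium": 3, "low": 1}
WEIGHTS = {"probability": 0.4, "impact": 0.4, "difficulty": 0.2}  # assumed weighting

def risk_score(probability, impact, difficulty):
    ratings = {"probability": probability, "impact": impact, "difficulty": difficulty}
    return sum(SCORE[ratings[attr]] * w for attr, w in WEIGHTS.items())

risk_matrix = {
    "Unrealistic RTO/RPO commitments":  risk_score("medium", "high", "medium"),
    "Undocumented asset dependencies":  risk_score("high", "high", "low"),
    "Across-the-board retention policy": risk_score("high", "medium", "medium"),
}

# Rank remediation priorities, highest score first.
for area, score in sorted(risk_matrix.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:4.1f}  {area}")
```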

The matrix can help to quantify the risks in each category. You should develop a project plan to address the top five areas where the risk appears likely to occur and where the impact could be significant. The project plan will provide mitigation costs and define the value of the risk.

You can turn a subjective impact description into a dollar amount by identifying the business functions that would be impacted. The risk exposure can be calculated using "Value of data at risk" to better understand the impact on the organization.

The value of data is calculated by determining the short-, medium- and long-term impact on the organization if it's unable to conduct business because the data and infrastructure aren't available. This includes hard dollar amounts, as well as "softer" dollar amounts resulting from loss of reputation, customer dissatisfaction and lost opportunity. These numbers are often already available, having formed the basis for the priorities in the organization's formal disaster recovery/business-continuance plan. Understanding the value at risk allows you to make more intelligent investment decisions for risk mitigation strategies.
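
The arithmetic itself is straightforward, as the following sketch shows; every number in it is an illustrative assumption standing in for figures from your DR/business-continuance plan.

```python
# A minimal sketch of turning a subjective impact into a dollar figure by
# combining hard losses (revenue lost per hour of outage) with softer losses
# (reputation, customer dissatisfaction, lost opportunity).
def value_at_risk(revenue_per_hour, outage_hours, soft_losses):
    hard = revenue_per_hour * outage_hours
    return hard + sum(soft_losses.values())

exposure = value_at_risk(
    revenue_per_hour=50_000,
    outage_hours=24,
    soft_losses={"reputation": 250_000, "customer_churn": 400_000,
                 "lost_opportunity": 150_000},
)
print(f"Estimated value of data at risk: ${exposure:,.0f}")   # $2,000,000
```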

Data rendering: When data is archived, it may no longer be possible to render it back into usable information. If data requires rendering to be meaningful, risk is incurred if the original application platform is unavailable. For example, if invoice data is archived, can the application still be used to subsequently render that data back into information, i.e., the invoice? This is a critical issue that an organization's legal team needs to address. If data can't be rendered, it must be stored as information using an interchangeable format such as XML.
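
The sketch below illustrates the idea of storing archived data as information: a hypothetical invoice record is written out as self-describing XML so it can be read later even if the original billing application is gone. The element names and values are illustrative.

```python
# A minimal sketch of archiving data as self-describing information (XML).
import xml.etree.ElementTree as ET

def invoice_to_xml(invoice):
    root = ET.Element("invoice", id=str(invoice["id"]))
    for field in ("customer", "date", "amount", "currency"):
        ET.SubElement(root, field).text = str(invoice[field])
    return ET.tostring(root, encoding="unicode")

archived = invoice_to_xml({"id": 10472, "customer": "Acme Corp",
                           "date": "2005-09-30", "amount": "1845.00",
                           "currency": "USD"})
print(archived)   # human-readable without the original billing application
```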

Data security: Security is a major issue in every organization, but most of the focus has been on access control, intrusion detection and containment. While controlling access to servers limits access to data, there are many other paths to this data. Any management device in the Fibre Channel or Ethernet fabric provides a potential entry point for an intruder. Data in production can be at risk if these exposures aren't carefully managed. Data at rest is also significantly exposed; this has been dramatically demonstrated by recent reports of lost backup tapes containing sensitive data. Encryption techniques are touted as risk mitigation, but encryption raises its own risks related to retaining, securing and accessing the encryption key when needed. In storage environments, attention should be paid to securing data moving over the desktop LAN, the WAN, backup-based networks and specialized high-speed, point-to-point networks. The obvious issue is whether the data can be read as it travels over the link. Additional exposure comes from allowing development and test staff to have free access to live data that may include sensitive information.

Awareness is the first step in reducing data risk. By considering the internal risks outlined here, you can develop an appropriate risk profile and mitigation plan (see "Getting started," above). Sharing your risk analysis and mitigation plans (including business impact issues) spreads the responsibility around. It will also provide an empirical basis for CFO and CEO support for any necessary investments.

This was first published in October 2005