Backup Systems Performance & Capacity Design

Source: Hewlett-Packard Company
INTRODUCTION
The following paper outlines Hewlett-Packard Information Storage America's recommendations for constructing the optimal backup system for midrange to enterprise-level customers. This paper will not examine the portable or desktop customer, other than as part of a larger, networked system. Given these criteria, this paper focuses on Hewlett-Packard's DDS (DAT) and DLT products. The intent of this document is to educate System Engineers (SEs) on factors to keep in mind when encountering customers with data protection opportunities. This document should be used as a tool by SEs developing the information required to generate backup system proposals. The areas to be explored include:

  1. Definition of Terms
  2. Key Initial Questions for Customers
  3. Backup Data Path: Performance Definition
  4. Scoping Customer System Capacity Requirements
  5. Software Considerations, Features, & Benefits
  6. The Ideal System: Local & Networked

DEFINITION OF TERMS
Backup Window - Time allocated for a backup to occur, both duration and start/stop time.

Bandwidth - Term describing performance capability.

CD-R / CD-Recordable - A process to write data to CD media. Data recorded on CD-R media cannot be erased and can be read in most CD-ROM drives.

CD-RW / CD-ReWritable - A process to write data to a special CD media such that the entire media can be erased and new data can be written to it.

Concurrency - Term describing two or more processes, usually backup jobs, running simultaneously.

Data Retention Period - Duration during which data cannot be erased from tape.

DDS - Digital Data Storage; the data standard for storage devices using 4mm digital audio tape cartridges.

DLT - Digital Linear Tape; a high-performance, high-capacity, half-inch tape cartridge format.

Differential Backup - Data that has changed since the last Full Backup is backed up.

Disaster Recovery - Term used to designate the steps required to fully recreate a system.

Fibre Channel - A network, cable, hub, and adapter technology using optical fiber or copper wire to gain speed and distance. Fibre Channel is the network structure that will probably be used for SAN implementation.

File-by-File - Backup method, in which each file must be opened, read from disk, written to tape, and closed.

Full-Only Backup - 100% of the data is backed up during each backup session.

Image - Backup method in which a device (e.g. hard drive) is locked, written to tape, and unlocked; also known as "Snapshot."

Incremental Backup - Data that has changed since the last Backup (full or incremental) is backed up.

Interleave - Mixing data from multiple sources to a single tape device.

LCP - Lowest Common Performer; term used to describe the prime limiting component within a backup system.

Local Device - Tape drive is connected directly to the computer to be protected.

LTO - Linear Tape Open; a new technology standard developed by HP, IBM and Seagate. Initial capacity is targeted at 100 GB per cartridge.

NAS - Network Attached Storage; a type of storage made possible by a thin server that provides network communication to a storage device as its only function.

Network Device - Tape drive is connected to a computer, which is connected to a network, serving multiple networked computers.

OBDR - One Button Disaster Recovery; an exclusive feature of HP DAT drives that allows any backup tape containing the system directory to be used to recover the server directly from tape.

Online Retention Period - Duration data must be readily available (without operator intervention).

Parallel Streams - Backup method whereby multiple backup jobs (or operations) can be sent to multiple drives concurrently.

RAIT - Redundant Array of Inexpensive/Independent Tape, method to employ multiple tape drives to improve performance and/or redundancy.

RAM - Random Access Memory, electrical based storage.

Restore Window - Time allocated to recreate data.

SAN - Storage Area Network; a network dedicated to storage devices allowing a group of servers to share storage products on a network.

SCSI - Small Computer System Interface; a protocol used by tape drives and other computer devices.

Snapshot - See Image.

Staging - Backup method whereby data is copied to a large storage device (usually RAID) before being copied to tape.

Start/Stop - Interrupted operation of a tape drive (opposite of Streaming).

Streaming - Uninterrupted operation of a tape drive (ideal condition).

KEY INITIAL QUESTIONS FOR CUSTOMERS
One of the most difficult aspects of designing a backup system is gathering enough information to make an educated assessment of a customer's data protection situation. The primary means for gathering this information is by asking detailed questions regarding a customer's computer systems infrastructure. The questions can be divided into two sections: Basic and Extended.

Basic questions enable System Engineers to gain a fundamental understanding of the customer's configuration. This allows the SE to provide a generic Backup System design. The three basic questions are:
  1. Where does the data to be backed up reside (local, network, combination)?
  2. How much data is there to be backed up daily and weekly?
  3. What is the Backup/Restore window? Are they different?
With this information, it is possible to determine if the backup is within the scope of our product offering and gain a general idea of which technology is most likely to solve the customer's situation. It is also possible to determine how many drives and tapes will be needed to provide basic data protection, as the sketch below illustrates.
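To illustrate how far the basic questions alone can go, the following minimal sketch (the daily capacity and window figures are hypothetical answers to questions 2 and 3) converts them into the sustained throughput a proposal must deliver:

```python
def required_throughput(capacity_gb: float, window_hr: float) -> float:
    """Minimum sustained rate (GB/hr) needed to move capacity_gb in window_hr."""
    return capacity_gb / window_hr

# Hypothetical customer answers to basic questions 2 and 3:
daily_gb = 120    # data to be backed up each night
window_hr = 3     # nightly backup window
print(required_throughput(daily_gb, window_hr))  # 40.0 GB/hr
```

Any technology whose typical transfer rate falls well below this figure can be ruled out immediately.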

Extended questions are necessary to identify the specific product or products that will actually need to be ordered to solve the customer's data protection needs. The following questions must be asked to actually design a realistic data protection proposal:
  1. What is the customer problem we are trying to solve and what is the timeframe for deployment?
  2. What is the present Network Topology (Ethernet 10baseT, 100baseT, or Gigabit vs. Token Ring)? What are the network interconnections (hubs or switches)? How many clients are there, and what are their capacities?
  3. What are the data types to be protected (Oracle, Exchange, etc.)? How many GBytes are there of each data type? List the data type and capacity for each server on the network.
  4. Is there an existing Backup Server? If so, describe its configuration (processor, RAM, disk space, SCSI cards, slots available, etc.)
  5. What Backup software is currently being used, if any (include versions)? Is the customer open to changing the existing package?
  6. What Backup methods are currently being used (full always, full/incremental, full/differential)?
  7. What tape devices are currently being used, if any? Is there a need to stay with any media type due to archived data?
  8. Is the Backup System intended for a single workgroup, department or entire enterprise?
  9. Is there a Backup System budget already defined? What time frame is planned for the implementation?
  10. Is there a Disaster Recovery plan in place?
  11. What is the off-site storage strategy?
  12. Does the customer require high availability, i.e., able to continue backup after device/media/backup server failure?
  13. Ask for a complete network schematic as this will be essential for the actual implementation.
BACKUP DATA PATH: PERFORMANCE DEFINITION
To understand why the SE needs the information requested, it is important to describe the components of a Backup System. As with any system, the overall performance is determined by the lowest performer, or Lowest Common Performer (LCP). By surveying the elements that make up the Backup System, performance can be readily estimated.

There are primarily two basic backup models: local and remote. A local backup model need only concern itself with components within a single computer. A remote backup model must take into account not only its own components, but those of the remote devices as well, especially the network. Below is a graphic representation of the components that must be evaluated to determine the system's performance:

[Figure: backup system performance flowchart]

With the exception of the Switch and NIC (Network Interface Card), the components of a local and a network backup server are the same. Again, the performance of either system will be dictated by the slowest component of the system, the Lowest Common Performer (LCP).

Tape Device
At the heart of each system is the tape device. The tape device will be limited in performance by its native transfer rate, its embedded SCSI controller, and the number of devices per bus (in the case of libraries). To put this into perspective, the following table shows the optimum and typical data rates for all tape drives, autoloaders, and libraries (rates are per device, not per drive, in the case of multi-drive devices), assuming streaming tape devices:

[Table: tape device optimum and typical transfer rates and capacities]

1 Multiple drive performance assumes independent busses, best case configurations (local, dedicated SCSI, etc.)
2 All autoloader/library capacity figures allow for one cleaning cartridge
3 Typical transfer and capacity rates assume 1.5:1 compression ratio
4 Determined by device duty cycle and typical performance numbers (in GB)


The table illustrates the difference between quoted "maximum" transfer rates and what is typically seen. The "Typical Transfer Rate" column is the one to keep in mind when determining a system's overall performance. All library performance numbers assume dedicated, wide SCSI controllers; tape devices are highly dependent on the SCSI controller to which they are connected.
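Although the table itself is not reproduced here, the typical figures used later in this paper follow directly from each drive's native transfer rate and the assumed 1.5:1 compression ratio (footnote 3). A minimal sketch; the native rates below are the commonly published figures for these mechanisms and should be treated as assumptions:

```python
# Native (uncompressed) drive rates in MB/s -- commonly published figures,
# stated here as assumptions since the original table is not reproduced.
native_mb_s = {"DAT 24": 1.0, "DAT 40": 3.0, "DLT 40": 1.5, "DLT 80": 6.0}
COMPRESSION = 1.5  # typical compression ratio assumed throughout this paper

for drive, rate in native_mb_s.items():
    gb_per_hr = rate * COMPRESSION * 3600 / 1000  # MB/s -> GB per hour
    print(f"{drive}: {gb_per_hr:.1f} GB/hr typical")
# DAT 24: 5.4, DAT 40: 16.2, DLT 40: 8.1, DLT 80: 32.4 GB/hr typical
```

These are the same per-drive figures used in the drive-count calculations later in this paper.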

SCSI Controller
A SCSI controller acts as the intermediary between a host computer and a tape device. As with tape technology, there are multiple variations of SCSI controllers, and performance will vary with the controller type. Below is a table showing maximum and typical transfer rates for various SCSI controllers:

[Table: SCSI controller maximum and typical transfer rates]
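As with the tape devices, a controller's hourly ceiling is simply its rated bus speed converted to GB per hour. The bus speeds below are published figures for common SCSI variants, stated here as assumptions since the original table is not reproduced:

```python
# Rated bus speeds in MB/s for common SCSI variants (published figures,
# treated as assumptions here).
scsi_mb_s = {
    "Fast SCSI-2 (narrow)": 10,
    "Fast Wide SCSI": 20,
    "Ultra Wide SCSI": 40,
    "Ultra2 LVD SCSI": 80,
}
for bus, rate in scsi_mb_s.items():
    print(f"{bus}: {rate * 3.6:.0f} GB/hr maximum")
# Ultra Wide SCSI at 40 MB/s yields the 144 GB/hr figure used later on.
```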


Hard Drive
The hard drive will come into play if it is used as a staging area or if it is the object to be backed up. Staging is a method employed by some backup software applications in which backup data is copied from networked computers down to a large local hard drive (usually RAID) and held for future transfer to the tape device. The intent is to shrink the backup window, since it is usually quicker to copy data to hard drive than to tape (no controller/tape overhead). Once the transfer is complete, the network is free to handle normal business operations, while the tape backup can happen during normal business hours. This technique is difficult to maintain, as restores could require two steps: restore from tape to local drives, then copy data from the local drive to the network drive.

Given the wide variety of hard drive and RAID technology, the hard drive is rarely the cause of any performance problems. Since some software applications do keep critical information (e.g., database information) on the hard disk, it would be prudent to have at least 4 GB of RAID 5 storage on the Backup Server.

CPU
The Central Processing Unit is the key to managing all processes within a computer system. The CPU is responsible for providing the horsepower for all applications. For a Backup Server, a minimum CPU rating of 200 Megahertz is recommended. No application other than the backup application should be running on the Backup Server.

A critical area that the CPU must control is network traffic. While adding multiple network interface cards (NICs) to a Backup Server theoretically can enhance performance, no more than two NICs per processor should be added. Ideally a one-to-one ratio of CPUs to NICs should be configured.

RAM
Random Access Memory is the area in which applications run. RAM requirements will vary depending upon the operating system and the backup application. For a Backup Server, 64 MB of RAM is a good baseline.

NIC, Switch, Hub
The Network Interface Card controls the flow of network traffic to and from the Backup Server. A Switch or Hub connects these NICs to the network. Two of the most common types of networks in industry today are 10baseT and 100baseT. The "10" and "100" refer to megabits per second rather than megabytes per second. The following table shows performance expectations for these two types of network technologies:

[Table: 10baseT and 100baseT network performance expectations]


Again, limit the number of NICs to two per CPU. The ideal configuration is a one-to-one ratio of NICs to CPUs. If the entire network is made up of 100baseT components, transfer rates of up to 8 MB/s are possible. A switched hub provides better performance than a basic hub. If the network is mixed, assume the transfer rate of the Lowest Common Performer (i.e., 10baseT, or 800 KB/s). The exception is backup applications using controlled interleaving, where multiple network computers can send data to a single backup server simultaneously, effectively multiplying performance (see Software Considerations, Features & Benefits).
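Since the "10" and "100" are megabits, divide by eight and apply a real-world efficiency factor to get usable megabytes per second. A minimal sketch; the 64% efficiency factor is an assumption chosen so the results agree with the 800 KB/s and 8 MB/s figures quoted above:

```python
def effective_mb_s(link_mbit: float, efficiency: float = 0.64) -> float:
    """Usable MB/s from a link's rated megabits per second.

    The 0.64 efficiency factor is an assumption chosen to match the
    800 KB/s (10baseT) and 8 MB/s (100baseT) figures quoted above.
    """
    return link_mbit / 8 * efficiency

print(effective_mb_s(10))   # ~0.8 MB/s, i.e. ~2.8 GB per hour
print(effective_mb_s(100))  # ~8.0 MB/s, i.e. ~28 GB per hour
```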

Performance Definition
Given the various components of a Backup System, it is easy to understand the difficulty and confusion Backup Systems can produce. The most important concept to remember is that the entire system's performance can be no greater than the slowest system element. Consider the following system description:
  • DLT 80 Drive (32.4 GBytes per hour)
  • Adaptec 2940UW Ultra Wide SCSI Controller (144 GBytes per hour)
  • Pentium II 400 Processor
  • 512 MB RAM
  • 10baseT Network Card (2.8 GBytes per hour)
The best performance that can be seen for this Backup System will be no more than 0.8 MB/s, or 2.8 GBytes per hour, due to the presence of the 10baseT network interface card.

Identifying the LCP is the key to estimating a Backup System's performance potential.
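In code, the LCP rule is nothing more than a minimum taken across the data path. A minimal sketch using the example system above:

```python
# Component throughputs in GB/hr for the example Backup System above.
components = {
    "DLT 80 drive": 32.4,
    "Adaptec 2940UW SCSI controller": 144.0,
    "10baseT NIC": 2.8,
}

lcp = min(components, key=components.get)
print(f"LCP: {lcp} at {components[lcp]} GB/hr")
# LCP: 10baseT NIC at 2.8 GB/hr -- the system can do no better than this
```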

SCOPING CUSTOMER SYSTEM CAPACITY REQUIREMENTS
In order to deploy a backup solution, an essential area to examine is the average target server size. To gauge the number and type of storage devices to be deployed, the following must be known:
  1.  Backup Window and Restore Window
  2.  Backup Model (Full Only, Full-Incremental, Full-Differential)
  3.  Total capacity of each target client server, and its physical location in relation to the backup server
  4.  Software preference (include version and all optional modules)
  5.  Desired Online Retention Period
These factors will determine the size and amount of storage products for the customer.

The backup model primarily establishes the speed at which the customer requires a Restore to occur. Restores are fastest with Full-Only backups, followed by Full-Differential backups, and finally Full-Incremental backups. The faster the desired restore, the greater the online retention capacity becomes. The backup window will dictate the speed at which the backups must occur, and what level of (software-driven) concurrency is required.

While the total capacity of all target backup client servers is valuable, knowing the mixture of capacities enables System Engineers to fine-tune solutions. For example, if the only information given is that there are 20 target backup client servers with a total capacity of 700 GB, we would have to assume an average client size of 35 GB. If this were true, designing a backup system would be simple. What is more common, however, is a mixture of systems, or weighted averages: in the system just described, perhaps 55% of the systems are 35 GB, 30% are 15 GB, and the remainder (15%) are 120 GB+. This type of information is invaluable in designing an optimal system, as the sketch below shows.
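The arithmetic behind such a mixture is simple to model. The sketch below uses the illustrative class sizes from the example (which, being rough percentages, need not reconcile exactly with the 700 GB total):

```python
# Illustrative mixture: GB per client -> number of clients (20 total).
clients = {35: 11, 15: 6, 120: 3}   # 55%, 30%, and 15% of 20 servers

total_gb = sum(size * count for size, count in clients.items())
average_gb = total_gb / sum(clients.values())
print(total_gb, round(average_gb, 1))  # 835 GB total, ~41.8 GB weighted average
```

Knowing, for example, that three clients hold 120 GB+ each argues for placing faster drives or dedicated network paths near those servers.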

Hewlett-Packard would obviously prefer specifying a total HP solution, including software. The reality is some customers will already have a preference. For backup system capacity sizing purposes, only the number of simultaneous clients to each backup device needs to be known. With a limited backup window, concurrent operations will greatly decrease the need for additional storage devices.

Finally, determining how long data must be readily available (online) must be taken into consideration. With the advent of tape libraries, customers can now keep backup data online, decreasing the need for operator intervention during both backups and restores. Even smaller systems equipped with a single DAT 40X6 Autochanger can hold a substantial amount of online information for an extended period depending on the backup model and total capacity.

An Example
As an example, let us use the aforementioned system: 20 target backup clients, a total capacity of 700 GB, a backup window of four hours on Saturday and two hours on the remaining days, and no software preference. The backup model is further defined to be one full backup per week followed by six differential backups. The first order of business is to determine the relative amount of data to be backed up. This can be modeled mathematically by examining the percentage of data backed up per day. For example, if the system usage is the following:

Full Backup + 6 Differential Backups = Effective Weekly Backup Capacity

100% + 10% + 17% + 23% + 30% + 37% + 43% = 260% or a 2.6 Capacity Multiplier

The resultant figure, 2.6, can now be used as the "Capacity Multiplier." By multiplying the total capacity of the network by this number, the "Effective Weekly Backup Capacity" can be determined. In this example, the effective weekly backup capacity would be

700 GB x 2.6 = 1.820 TB of data per week for the weekly online capacity.

Therefore, for each week a customer would like to have backup data readily available, they would need 1.820 TeraBytes.
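The same arithmetic in a short sketch, using the daily change rates from the model above:

```python
# Daily change rates from the example: one full plus six differentials.
full = 1.00
differentials = [0.10, 0.17, 0.23, 0.30, 0.37, 0.43]

capacity_multiplier = full + sum(differentials)        # 2.6
weekly_online_tb = 700 * capacity_multiplier / 1000    # GB -> TB
print(round(capacity_multiplier, 2), round(weekly_online_tb, 2))  # 2.6, 1.82
```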

Next, the backup window must be used to determine the amount of data that must be processed per hour. This will help us determine the technology to use as well as the number of drives that will be required to back up the total capacity in the time allotted.

Each Full Backup will be 700 GB and must be done in 4 hours:

 700 GB / 4 hours = 175 GB per hour for the Full Backup.

Each Differential Backup will range from 70 GB to 300 GB. We will use the largest for the calculation:

 300 GB / 2 hours = 150 GB per hour for the Differential Backups.

Now it is possible to use the various drive performance factors to determine the number of drives that will be required to get the job done.

Full Backup
       DAT 24 = 175 GB per hr. / 5.4 GB per hr. = 32.4 drives
       DAT 40 = 175 GB per hr. / 16.2 GB per hr. = 10.8 drives
       DLT 40 = 175 GB per hr. / 8.1 GB per hr. = 21.6 drives
       DLT 80 = 175 GB per hr. / 32.4 GB per hr. = 5.4 drives

Differential Backup
       DAT 24 = 150 GB per hr. / 5.4 GB per hr. = 27.8 drives
       DAT 40 = 150 GB per hr. / 16.2 GB per hr. = 9.3 drives
       DLT 40 = 150 GB per hr. / 8.1 GB per hr. = 18.5 drives
       DLT 80 = 150 GB per hr. / 32.4 GB per hr. = 4.6 drives
From these calculations it is clear that DLT 80 drives are the best choice for this application.
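In practice these figures round up, since only whole drives can be configured. A minimal sketch of the drive-count calculation:

```python
import math

# Typical transfer rates in GB/hr from the tape device discussion earlier.
rates_gb_hr = {"DAT 24": 5.4, "DAT 40": 16.2, "DLT 40": 8.1, "DLT 80": 32.4}

def drives_needed(load_gb: float, window_hr: float, rate_gb_hr: float) -> int:
    """Round up: a fractional drive must become a whole one."""
    return math.ceil(load_gb / window_hr / rate_gb_hr)

for name, rate in rates_gb_hr.items():
    full = drives_needed(700, 4, rate)   # weekly full backup
    diff = drives_needed(300, 2, rate)   # worst-case differential
    print(f"{name}: {full} drives (full), {diff} drives (differential)")
# DLT 80: 6 drives (full), 5 drives (differential)
```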

HP SureStore DLT Libraries range from 1-drive, 20-slot units up to 6-drive, 60-slot units. Using both the total number of drives required and the weekly online data requirement, it is easy to see that a 6-drive library with at least 1.8 TB of capacity is needed. Since the only library with 6 drives is the 60-slot unit, the likely choice is a SureStore DLT 6/60 Library. This unit can hold two weeks of data online if needed (the slot arithmetic is sketched below) and provides the required 6 drives to meet the full backup time window.
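The slot arithmetic behind the two-week claim, assuming a 40 GB native DLT 80 cartridge (an assumption) at the paper's typical 1.5:1 compression, with one slot reserved for a cleaning cartridge per footnote 2:

```python
SLOTS = 60
CLEANING = 1                  # one slot reserved for a cleaning cartridge
GB_PER_CARTRIDGE = 40 * 1.5   # assumed 40 GB native DLT 80 media at 1.5:1

online_tb = (SLOTS - CLEANING) * GB_PER_CARTRIDGE / 1000
print(round(online_tb, 2), round(online_tb / 1.82, 1))
# ~3.54 TB online, or roughly 1.9 weeks of the 1.82 TB weekly requirement
```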

Other Considerations
Now that the general need has been scoped and a match found that meets the requirement, the final proposal must use the rest of the collected information to fine-tune the solution. The network type is critical here: to transfer the required data in the time available, the single library will need at least a 100BaseT network with switches connecting all servers to the backup server, with an individual NIC/SCSI card pair for each DLT 80 drive in the library.

If all of the servers are not on a switched 100BaseT or better network, it may be necessary to use three SureStore DLT 2/20 libraries attached to three backup servers located on the data side of the network bottleneck, or LCP factor. It is even possible to use six 818 Autochangers if the network topology is such that only three or four servers can be streamed to each DLT drive. Other combinations can also be used to fine-tune each portion of the network based on the capacity and throughput of individual server groups, such as one 818, one 2/20, and one 4/40 library, as long as the total number of drives equals or exceeds the number calculated above.

Many times it will be less expensive to place more backup units around a network than to upgrade all of the components of a large network. The key to a successful implementation of a proposal is to thoroughly understand the entire network and the trade-offs the customer is willing to make based on budget or the physical constraints of the network.

SOFTWARE CONSIDERATIONS, FEATURES, & BENEFITS
Hewlett-Packard recognizes three main software packages to be used as backup applications: HP Omniback, CA ArcServe, and Veritas Backup Exec. All of these packages have strengths and weaknesses to consider when designing a backup system. Below is a feature-set comparison of the three packages:

[Table: backup software feature comparison]


File-By-File
This feature describes how data is physically backed up. Using this method, the file system is employed to open each file to be backed up, read it, write it to the backup media, and close it. This technique has a high overhead cost, especially if thousands of small files need to be backed up. The advantage of this method is that it is frequently the lowest-cost software solution, though also the poorest performer.

Image
An image backup operates at the device level rather than the file system level. This has the advantage of much lower overhead resulting in much higher performance. Instead of opening each file sequentially, an image backup opens an entire volume or drive. Most software packages recommend (or require) the target drive to be "offline" or inaccessible during an image backup.

Central Control
Each software package has its own unique interface to manage its respective backup system. Presently, only HP Omniback offers centralized control. This means that a system administrator can oversee and control multiple backup systems from a single console without having to log in to each individual Backup Server. This is a distinct advantage where exceptionally large or enterprise-level systems are concerned.

Push Agent
A push agent is a piece of software that prepares data on a networked computer before it is sent to the Backup Server. Typically, this software is responsible for determining which files on its system need to be backed up based on the system administrator's criteria, creating a file list, possibly compressing the data, and then sending the data. A push agent's role is to relieve the Backup Server of some of its duties, increasing overall performance.

Disaster Recovery
Disaster recovery is defined as the set of steps required for a system to be recreated following a catastrophe. In the software application context, it refers to restoring a server to its former state using a set of floppy disks (usually three), a tape device, and a tape containing the server's data. This process is a vast improvement over previous procedures, which involved re-installing the operating system, then the applications, and then the data. Recently, CA ArcServe has announced the ability to restore remote servers in the same fashion.

RAIT
RAIT stands for "Redundant Array of Independent (or Inexpensive) Tape." The concept is not new; hard drives have had this technology for years (RAID). Essentially, RAIT borrows RAID technology to increase speed and provide redundancy. When using RAIT 0 (striping), it is important to remember the Lowest Common Performer rule when specifying a system. RAIT 1 (mirroring) is a valuable feature, especially when planning offsite storage. RAIT 5 (striping with parity), while technically intriguing, has limited practical use.

Parallel Streaming
Parallel streaming is the ability to have multiple data backup streams (or jobs) going to multiple backup drives simultaneously. This can potentially increase performance, provided it does not run into a Lowest Common Performer. This technology does not provide any redundancy.

Interleaving
Interleaving, like parallel streaming, relies on multiple concurrent operations to increase performance. However, while parallel streaming directs single data streams to multiple drives, interleaving gathers multiple data streams and directs them to a single drive. This has the advantage of potentially multiplying the backup system's throughput (again, subject to the LCP!). The following depiction shows how this can be done:

[Figure: interleave network connection flowchart]

Effectively, the five 100baseT clients combine their throughput via the switch, and present the 100baseT Backup Server with a single data stream. Additional data streams (or jobs) can be directed at this backup server, provided it has additional tape drives to handle the requests. Currently, only HP Omniback supplies this powerful feature.
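A simplified sketch of the idea: blocks from several slow client streams are multiplexed round-robin into one stream fast enough to keep a single drive streaming. Names and figures here are purely illustrative:

```python
from itertools import zip_longest

def interleave(*client_streams):
    """Round-robin data blocks from several clients into one tape stream."""
    for blocks in zip_longest(*client_streams):
        for block in blocks:
            if block is not None:
                yield block

print(list(interleave(["a1", "a2"], ["b1"], ["c1", "c2", "c3"])))
# ['a1', 'b1', 'c1', 'a2', 'c2', 'c3'] -- one merged stream for one drive

# Aggregate throughput is the sum of the client streams, still capped by
# the drive and the LCP rule (MB/s figures are illustrative):
clients_mb_s = [2.0] * 5    # five clients feeding the switch
drive_mb_s = 9.0            # e.g. a DLT 80 streaming with compression
print(min(sum(clients_mb_s), drive_mb_s))  # 9.0 -> the drive stays streaming
```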

[Figure: ranking chart for backup software]



[This concludes Part One of this white paper.]