Performance Analysis of Multiprocessor Disk Array Systems Using Colored Petri Nets

Due to the increasing gap between the performance of processors and secondary storage systems, the design of storage systems has become increasingly important. Arrays of interleaved disks are a popular method of increasing the performance of secondary storage systems. In order to optimize the performance and configuration of disk arrays, performance evaluations are required. This paper presents a Colored Petri Net simulation model which can represent various configurations of systems containing multiple processors connected to a disk array system across a single-stage interconnection network. This flexible model allows many system parameters to be varied, such as the number of processors, buses and disks in the array and the delay distributions associated with each. The performance estimates produced by this model are validated against those found in other models and are found to be in good agreement. This paper shows that the CPN model presented here is flexible and accurate enough to estimate the performance of many widely varying system configurations.


Introduction
The performance of processors and semiconductor memories is increasing at a much greater rate than that of I/O systems such as magnetic memories. Therefore, the performance of the I/O system increasingly impacts the total system's performance, to the point where it can become the source of a performance bottleneck. The throughput of the I/O system can be increased by replacing a single-disk I/O system with a disk array, in which data may be placed on different disks so that it can be accessed concurrently [1,2,3,4].
Many different organizations of disk arrays have been proposed in the current literature [2,3,8]. In order to understand the benefits and costs of each disk array configuration, it is important to have a method for estimating the whole system's performance. This allows the system designer to understand the effects of various system elements upon the system's performance.
There are two types of models that are generally used for the performance analysis of systems. The first is an analytical model, which reduces the system's functionality to a set of equations. The equations are then used to estimate the system's performance. The second is a simulation model, which generally encapsulates the system's functionality in a more direct manner. The simulation model is then executed to emulate the system's performance.
Several analytical models have been developed which are based upon many simplifying assumptions to allow the system to be described by a usable set of equations. While these equations allow the quick generation of results, they can describe only a limited or unrealistic set of system configurations. One such example is in a paper by Lee and Katz, where an analytical model is developed which assumes that each processor issues a new request for a block of data whenever any of the subblock data requests from the previous request is finished [3]. This assumption implies that all the subblock data requests generated from a request for a block of data finish their disk accesses at the same time, and that each processor spends no time processing the data which it has just received. This is not a realistic assumption, because in a real system each disk request may have a different service time due to the starting position of the head on each disk or a different number of requests pending at each disk.
In a paper by Yang, Hu and Yang, a more realistic set of assumptions about the disk array and how it processes requests is presented [1]. However, this model can neither address the relationships associated with the interconnection network (IN) which connects the processors and the I/O system, nor can it handle different size data accesses within the same run. As shown above, a common problem associated with existing models is that the assumptions which are made to enable the system to be characterized by a set of equations also limit the model's ability to handle all the different parameters which are important in a system.

Chapter 2:

Guiding Assumptions and System Description
The model presented in this paper tries to describe a real system more accurately by expanding upon the system assumptions described in reference [1]. The assumptions are as follows:

1. Each processor generates a request for a block of data stored in the disk system. The request for a block of data, called a logical disk request or an array request, is replaced by several subblock requests, called disk requests. The disk requests are then transferred to the appropriate disk where the subblock is stored. The separate disks can then service the disk requests in parallel.

2. The array request size, which is the number of disks accessed by a single array request, can change depending upon various attributes of the disk array such as the subblock size, the parity scheme, the parity group size, and the request type. Therefore, the array requests cannot be guaranteed to access either only one or all of the disks.

3. The individual disk requests of an array request may finish at different times, due to both the interference between disk requests at each of the disks and the different seek times on each disk caused by the random starting position of each disk's head.

4. It cannot be guaranteed that a new array request is always issued upon the completion of a disk request. This depends upon the workload of the I/O system and the frequency at which the processor generates requests.

5. Each processor is capable of multiprocessing. Therefore, more than one array request generated by the same processor may exist at the same time.

6. The size of the traffic transferred across the interconnection network, either the data requests or the data blocks, should be allowed to be variable within a single simulation run. It cannot always be assumed that each data block is the same size for all processors in the system.

7. The interconnection network is made of one or more buses which connect the processors to the disks in the disk array. The number of buses in the system cannot always be assumed to be enough to support the workload of the system.
These assumptions accurately describe a real system containing a disk array I/O system. In the following, a model based on the above assumptions is presented.
The model is a simulation model which was created using Colored Petri Nets (CPN). CPNs, as most simulation modeling tools do, allow the user the flexibility to model in detail whatever area of the system is deemed of interest.
The model consists of several independent processors connected to a single disk array I/O system via an interconnection network (IN), as shown in Figure 1. Figure 2 shows the points of resource contention, which will be described in the following paragraphs.

Figure 1 System Configuration

A disk array is an I/O system which replaces a single disk with a collection of disks. In a single-disk I/O system a block of data is usually stored together on the disk. In contrast, in a disk array system this block of data can be broken into one or more subblocks which are then stored on separate disks. Because each of the disks in the disk array can be accessed concurrently, the block of data can be accessed more rapidly than on a single-disk system.
Each processor can generate a logical disk request, hereafter called an array request, for a block of data from the disk array, which in turn is broken into several disk requests. The number of disk requests per array request varies depending upon several parameters such as the size of the data requested, the amount of interleaving between the disks and the parity scheme of the disks.
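For example, with purely illustrative values: if the subblock size is 4 kbytes, an array request for a 16 kbyte block is broken into N = 4 disk requests placed on consecutive disks, while a request for a single 4 kbyte block produces only one disk request. N can therefore range from 1 up to the number of disks in the array, as discussed in section 3.2.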

"-
The CPN model can predict the response time of an array request, and can analyze the disk, interconnection network and processor utilization under various system configurations and workloads. This model is validated through a series of measurements compared with the findings presented in [1]. The model is then used to perform a quantitative evaluation of the disk array's performance for different IN and disk data integrity configurations. The model presented is fairly general and could be used by disk array or system designers to study the effects of various system parameters and configurations.

Chapter 3:

The Colored Petri Net Disk Array Model
This chapter describes the Colored Petri Net model of a system which contains a disk array I/O subsystem. The chapter is broken into two parts: the first describes the functionality of the system and the second describes the parameters used in the model.

A Functional Description of the Disk Array Model
The following is a functional description of how a disk request is generated and handled in the system which is modeled. The limits and derivation of the model's variables, which are capitalized, are described in section 3.2. Figure 3 shows a simplified version of the Colored Petri Net model which will be used for discussion purposes. The actual CPN model is included in Appendix A.
There are P independent processors that generate array requests.
The processors are represented by CPU tokens which reside in the CPU Processing Data node of Figure 3. One of the attributes associated with each token is the time at which it becomes available for use. When the simulated time reaches the time at which a processor token is enabled, that token moves to the Generate Disk Requests node, where a set of N disk access request (DAR) tokens is made. The set of disk requests generated at the same time represents an array request. Another attribute of each array request is the assignment of disks.

The disk associated with the first disk request is chosen at random from the disk array. Thereafter, the disks associated with the remaining disk requests of the array request are assigned sequentially.
The size of the data subblock requested is also an attribute of the DAR. Thus, data blocks of different sizes may be accessed from the disk array within the same simulation. The size of the data block accessed affects both the disk's Service Time and the bus's Transfer Time.
When the disk requests have been generated, the CPU token returns to the CPU Processing Data node, and the time at which the token will next be enabled is updated by an amount calculated by the ThinkTime function. The ThinkTime represents the amount of time during which all the processes for that processor are busy performing internal operations which do not require the disk array.
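The generation step just described can be summarized in code form. The following Python sketch follows the description above and the DAR fields listed in section 3.2; the wrap-around of the sequential disk assignment is an assumption for illustration, and the parameter values are those used later in the validation runs.

    import random
    from collections import namedtuple

    # DAR fields as listed in section 3.2: originating processor, which
    # element of the array request it is, the disk to be accessed, the
    # number of elements, and the size of the data block requested.
    DAR = namedtuple("DAR", "cpu element disk n_elements block_kb")

    NUM_DISKS = 50    # disks in the array
    Z = 100.0         # mean ThinkTime in msec

    def think_time():
        # ThinkTime: independent, exponentially distributed with mean Z.
        return random.expovariate(1.0 / Z)

    def generate_array_request(cpu_id, n, block_kb):
        # First disk chosen at random; the rest assigned sequentially
        # (wrapping around the array is assumed here).
        first = random.randrange(NUM_DISKS)
        return [DAR(cpu_id, i, (first + i) % NUM_DISKS, n, block_kb)
                for i in range(n)]

    dars = generate_array_request(cpu_id=0, n=10, block_kb=4)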
In order to simulate a multiprocessing environment, each processor will generate another array request after a specified ThinkTime, regardless of whether the other array requests made by that processor have completed. The disk request tokens, meanwhile, enter the CPU-to-Disk IN queue, where each waits for a bus resource token to become available. The bus resource will remain busy for an amount of time, called the Transfer Time, which is related to the size of the data being transferred and the data transfer rate of the bus. The Disk-to-CPU queue, described below, likewise waits until there is a bus resource available before proceeding.
Once a disk request token passes across the IN, it enters the disk array. There are NumDisks disks in the disk array. The disk request token will wait until the disk resource token it requires is available.
When the required disk is available, the disk request is granted access to the disk. The disk is then unable to process another request until this access is complete. The amount of time the access takes is called the disk's Service Time, which is a function of the disk's Seek Time, the Rotational Latency and the Data Access Time.
The disk request token is replaced by a data token, which can be a different size than the disk request. The data token is also not available until the disk access is completed. If two DARs are waiting for the same disk, then one is chosen at random to be serviced. The other DAR must wait for the disk to become available again before it can be serviced. Disk accesses to different disks can be performed in parallel. After the disk access has completed, the data enters the Disk-to-CPU IN queue to be transferred to the processor. As with the disk requests, this queue is served internally in a FIFO fashion and externally contends for bus resources with the CPU-to-Disk IN queue.
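To give a feel for the two points of bus contention and the per-disk queueing just described, the following is a minimal sketch using the SimPy Python library rather than a CPN. The parameter values are illustrative (the transfer time corresponds to a 4 kbyte subblock at the 0.6023 msec/kbyte rate used later); note also that SimPy resources serve waiting requests in FIFO order, whereas the CPN model picks among waiting DARs at random.

    import simpy

    NUM_BUSES = 4
    NUM_DISKS = 8
    TRANSFER_MS = 2.41    # bus Transfer Time for one 4 kbyte subblock
    SERVICE_MS = 12.0     # illustrative disk Service Time

    def disk_request(env, buses, disks, disk_id):
        # CPU-to-Disk IN queue: wait for any free bus, hold it for the transfer.
        with buses.request() as bus:
            yield bus
            yield env.timeout(TRANSFER_MS)
        # Disk access: DARs aimed at the same disk are serialized here.
        with disks[disk_id].request() as disk:
            yield disk
            yield env.timeout(SERVICE_MS)
        # Disk-to-CPU IN queue: the returned data contends for the same bus pool.
        with buses.request() as bus:
            yield bus
            yield env.timeout(TRANSFER_MS)

    env = simpy.Environment()
    buses = simpy.Resource(env, capacity=NUM_BUSES)
    disks = [simpy.Resource(env, capacity=1) for _ in range(NUM_DISKS)]
    for i in range(20):    # 20 DARs spread over the disks
        env.process(disk_request(env, buses, disks, disk_id=i % NUM_DISKS))
    env.run()
    print("all disk requests completed at t =", env.now, "msec")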
Once across the IN, the data subblock waits at the processor for all other data subblocks in its array request to arrive. Once all arrive, the array access is complete. Therefore, in contrast with reference [3], the array request processing does not complete when one of the disk requests is finished. In addition, as in reference [1], the array request processing as a whole is not completed in a FIFO fashion, due to the handling of the various disk requests at each disk.
Although it may appear that this model only simulates reads from a disk, it also accurately describes the case of a write to a disk in which the write has a completion handshake that is the same size as a read disk request. This is true because, in a system which has such handshaking, the amount of time the IN and the disk array are busy is the same whether the piece of data is passing to or from the disk.
The main disadvantage of a CPN is that, if the modeler is not careful, the model can become too complex to be analyzed. This is due to the direct relationship between the CPN model's complexity and the size of the state matrix related to the model. In addition, as the state matrix gets larger, the simulation model executes more slowly. It was found that performing the simulations on a higher-performance platform with more RAM made it possible to simulate more complex models, and to simulate the current models more quickly. Therefore, the modeler must balance the amount of detail in the model against the host computer's ability to handle the complexity contained in the model.
In order to extract data from the model's outputs, a C program was written which extracts the CPU, bus and disk utilization data from the raw data produced in the simulation. This program is shown in Appendix C. If different information were required by the modeler, the program could easily be altered to extract it.
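The extraction program itself is reproduced only in Appendix C. As a rough illustration of the kind of post-processing involved, the following sketch (in Python rather than the C of the appendix) assumes a hypothetical raw-output format in which each line carries a resource name, its accumulated busy time and the elapsed simulation time; the actual output format of the CPN tool is not described in this section.

    # Hypothetical raw-output line format: "<resource> <busy_time> <elapsed_time>"
    def utilizations(path):
        util = {}
        with open(path) as f:
            for line in f:
                name, busy, elapsed = line.split()
                # Utilization = fraction of elapsed time the resource was busy.
                util[name] = float(busy) / float(elapsed)
        return util

    # e.g. utilizations("run1.txt") -> {"CPU0": 0.42, "Bus0": 0.17, "Disk7": 0.88}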
The system modeled has several irregular queue characteristics which would make the development of analytical queueing models difficult. The CPN model developed here emulates this behavior, allowing a performance analysis of this system to be performed using various system parameters and configurations.

Description of System Parameters
This section describes the formulas and limits of the parameters which were referenced in the previous sections.
- The ThinkTime function is user definable; for this model it has been set to an independent, exponentially distributed random variable with mean Z, as it was in [1].
- N is the number of disk requests in an array request. Its value is determined by several factors such as the amount of declustering between disks and the parity scheme used. The value of N can range from 1 to the number of disks in the disk array. The number of disk requests generated by each processor, N, can be either constant for all processors or variable, based upon the system being studied.
- A DAR is a disk access request. There are N DARs generated to represent each array request. The information stored in a DAR for this model is: the originating processor, which element of the array request it is, the disk to be accessed, the number of elements to be accessed, and the size of the data block requested, as defined in references [1,2,3,4].
- Seek Time (Ts) = the time to move the head to the correct track of the disk:

Ts = Ta * sqrt(X) + Tb * X + Tc

as defined in reference [4], where, as defined in reference [1], X = |t1 - t2| and t1 and t2 are random numbers between 0 and the number of tracks on a disk, T. This makes the model more realistic by giving X a mean of T/3.
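In code form, the Service Time calculation of this section can be sketched as follows. The values of Ta, Tb, Tc and the track count are those used in the validation runs of the next chapter; the full-rotation time is an assumed value, since it is not given in this section.

    import math
    import random

    TA, TB, TC = 0.4632, 0.0092, 2.0   # msec; seek-time coefficients (Chapter 4)
    NUM_TRACKS = 949                   # T, the number of tracks
    ROTATION_MS = 13.9                 # assumed full-rotation time (not in the text)

    def seek_time():
        # X = |t1 - t2| with t1, t2 uniform over the tracks, so E[X] = T/3.
        x = abs(random.uniform(0, NUM_TRACKS) - random.uniform(0, NUM_TRACKS))
        return TA * math.sqrt(x) + TB * x + TC

    def service_time():
        # Rotational latency: uniform over one revolution.
        # The Data Access Time is omitted, as it was in the validation runs.
        return seek_time() + random.uniform(0, ROTATION_MS)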
- Rotational Latency (Tr) = the time to move the head to the correct data block or sector within the track, as defined in [1,4].

Chapter 4:

Validation of the CPN Model

The method used to validate the CPN model described in the previous chapter was to compare the results of the CPN model to those found in models of similar systems presented in other papers.
In particular, the results of the analytical model developed in reference [1] were compared to those of the CPN model for the same values of the system parameters. The analytical model presented in reference [1] was chosen because the assumptions made in developing that model were very similar to those of the CPN model.
The assumptions made in reference [1] were the same as those listed in Chapter 2 for the CPN model, with the following exceptions. In reference [1] it was assumed that the number of buses is always adequate to support the system's load. To comply with this in the CPN model, the number of buses in the IN was specified to be large enough that the IN imposed no limitations on the rest of the model.
Another simplifying assumption made in reference [1] was that all array requests made in a particular simulation were the same size.
This means that the size of the data accesses per disk and the array request size N are both constant across all the processors for all array requests made in a particular system configuration. This was not difficult to comply with, as the CPN model was designed to allow these parameters to be either constant or varied. Finally, in reference [1] the individual disk requests of each array request were assumed to be independent of each other. Complying with this would require the method of disk request generation to be altered in the CPN model. Because this was determined to be a weak relationship in reference [1], the method of disk request generation in the CPN model was not altered. Thus, in the CPN model the disk requests which originate from the same array request access disks sequentially from some arbitrary first disk, as described in Chapter 3. Therefore, it was possible to satisfy all the assumptions made in reference [1], with the exception of the independence of disk requests accessed by the same array request, which was considered a weak assumption.
Because of the complexity of the systems modeled in reference [1], the length of time required for each simulation run using the CPN model was quite long. Therefore, the length of the simulation runs had to be limited. On average, for a system of the complexity of the ones presented in this section, the amount of time to perform the simulation runs for a range of values was around 24 hours. Limiting the length of the simulation runs can lead to significant errors when the data varies a great deal, such as at high system load. Therefore, it cannot be guaranteed that the simulation results produced are within the guidelines normally used for determining when to end a simulation run. However, the simulation runs were extended as long as time and RAM allowed in order to minimize these errors.
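One common stopping guideline of the kind referred to above is to require a sufficiently narrow confidence interval on the mean response time, computed by the method of batch means. The text does not state which guideline was used; the following is a generic sketch of that approach.

    import statistics

    def batch_means_ci(samples, n_batches=10, z=1.96):
        # Split the observed response times into batches; if the run is long
        # enough, the batch means are roughly independent and their spread
        # gives a confidence interval on the overall mean.
        size = len(samples) // n_batches    # assumes len(samples) >= n_batches
        means = [statistics.mean(samples[i * size:(i + 1) * size])
                 for i in range(n_batches)]
        half = z * statistics.stdev(means) / n_batches ** 0.5
        return statistics.mean(means), half    # point estimate, CI half-width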
As defined in reference [1], the utilization of the disk array system is a function of the rate at which requests arrive at the disk array and the service rate of the disk array. The service rate of the disk array is generally constant and independent of the arrival rate.
Thus, variations in the disk array's utilization are induced mainly by variations in the disk array's arrival rate, called Lambda. Lambda is defined as:

Lambda = (N * P) / (Z * NumDisks)

where N, P, NumDisks and Z are defined in section 3.2. In order to exercise the CPN model over its range of disk utilizations, Lambda will be varied in two different sets of simulation runs. In the first run the number of disks in the array is varied, and in the second run the number of elements in an array request is varied.
As in reference [1], the other system parameters, as defined in section 3.2, were set to typical values as follows: Ta = 0.4632 msec, Tb = 0.0092 msec, Tc = 2 msec, NumCyl = 949, the size of the data accessed from each disk (the subblock size) = 4 kbytes, and the average transfer rate across the IN = 0.6023 msec/kbyte. As in reference [1], the disk's Data Access Time, which is the amount of time to actually read the data from the disk, was not included in the disk's Service Time calculation. In addition, the number of processors P was set to 10 and the mean think time Z was set to 100 msec.
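As a worked instance of the expression for Lambda given above, using values drawn from the runs below: with P = 10, N = 10, Z = 100 msec and NumDisks = 50, Lambda = (10 x 10) / (100 x 50) = 0.02 disk requests per msec at each disk; that is, on average a new disk request arrives at each disk roughly every 50 msec.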
In the first comparison, shown in Figure 4, the disk utilization was varied by altering the number of disks from 30 to 100. The value of N was set to 10. It can be observed from this figure that the CPN model's average disk array processing times closely match those found in the analytical model of reference [1].

Figure 4 Average array response time vs. disk utilization (reference [1] results compared with CPN results)
The main reasons for the discrepancies which do exist are discussed in the conclusion portion of this section.

Results
In the second comparison, shown in Figure 5, the disk array request size N was varied from 3 to 15, as was done in reference [1]. The number of disks in the disk array was set to 50.
As in the previous figure, the results of the CPN model closely match those produced in reference [1].

Conclusions:
For the most part, the results produced by the CPN model closely match those produced by the analytical model in reference [1], especially at low system utilization. It is generally accepted that a model should estimate response times at low to medium loads to within 15%. The main reasons for the remaining discrepancies are as follows:

1. Disk requests are assigned to disks randomly in the analytical model of reference [1], while in the CPN model they are assigned sequentially from an arbitrary first disk. When disk requests collide, their array response time can increase greatly, as one of the disk requests must wait until the other request completes before it can access the disk. While the effects of this are minimal at low to medium system loads, where few disk collisions occur, at high system loads the analytical model will have many more collisions than the CPN model. This problem was noted in reference [1].
2. When one or more of the subsystems is highly utilized, there is more chance of error in the CPN model's results. When one of the subsystems becomes a bottleneck, it can cause the array response time to vary greatly from one array request to the next. In the CPN model it would take significant simulation time for these varied response times to average out to a consistent value. Since the amount of time for simulation was limited, the areas of high system utilization will have a greater amount of error in the CPN model's results than the areas of low system utilization. This is most apparent in Figure 5 when N is greater than 11: at this point the array response times do not form the smooth curve shape that is desired. Therefore, this portion of the CPN data is the most suspect.
Together, these are the reasons for the differences between the results of the CPN model and those of the analytical model in reference [1]. Overall, the CPN model appears to adequately model the operation of the system of interest, especially at low to medium load.

Chapter 5:

Analysis Using the CPN Model

The last chapter showed that the CPN model accurately estimates the response time of a disk array under various system loads. In this chapter, some of the assumptions made in reference [1] will be investigated and a performance evaluation will be done using the CPN model.

5.1 A Study of the Effects of Varying the Number of Buses in the Interconnection Network
In the other disk array models studied, the effects of the interconnection network (IN) on the overall system performance were ignored. Therefore, the first assumption investigated will be the number of buses in a single-stage IN, which will be varied to determine how it affects the system's performance. The second assumption investigated will be the size of the data accessed by each disk request. This will be used to study the effects of various methods of ensuring data integrity in a disk array upon the system's performance.
To make the CPN model consistent with those used in reference [4], the following assumptions were made. A full configuration, where there is a bus present to connect each processor to each disk, would require 500 buses, but simulation showed that the array response time was stable when the number of buses was greater than 10. Therefore, the addition of more buses would not bring any value to the study.

Disk Array Response Time vs Number of Buses
In conclusion, the bus system can severely limit the performance of the disk array when the number of buses is small. In contrast, once the number of buses reaches a certain point, more than 4 in this case, adding more buses does not significantly alter the I/O subsystem's performance. Therefore, a system designer must ensure that there are enough buses to prevent the IN from limiting the system performance, while not including so many buses that the cost of the system is unnecessarily increased.

5.2 A Study of the Effects of Data Integrity Methods and Subblock Size on System Performance
In the second analysis problem, the effects on the overall disk array response time of the overhead induced by various Redundant Array of Inexpensive Disks (RAID) data integrity schemes, as well as a new method proposed in reference [4], will be studied. The RAID configurations are used to ensure that the disk array is fault tolerant. If a fault causes a disk to lose its data, these methods allow the data to be fully reconstructed.
Each RAID configuration has different costs. These costs come in terms of the performance impact that the overhead of the RAID processing imposes upon the total disk array performance. The costs are also monetary, as each RAID configuration requires additional disks in order to perform the specific RAID algorithm. Thus the goal of the RAID disk array designer is to minimize both of these costs while maintaining the disk array's fault tolerance.
In reference [4], it was noted that the overhead caused by the RAID configurations has the most impact when the disk accesses are for small-sized data. This is because for small accesses the amount of time used to transfer the data across the IN is much less than the amount of time required to access the data on the disk. This imbalance results in a bottleneck in the disk array. Because the disk array is already much slower than the rest of the system, the impact of this bottleneck can be great.

The case which best exemplifies the overhead induced by small disk accesses is the one in which the small accesses are Read-Modify-Write accesses. This type of access requires more accesses between the disk and disk controller than a simple Read or Write access. The transfers between the disk controller and the disk do not use the IN, but instead are handled by a bus which is inside the disk array. It is assumed that there is only one bus between the disk controller and the disks. Therefore, each of the transfers between the disk controller and a disk must occur sequentially.
This is the worst-case scenario possible, because the service time for a Read-Modify-Write request will be the sum of the service times for each of the several accesses required between the disk controller and the disks. This assumption is consistent with reference [4].
The overhead incurred is different for each RAID method, because each method requires a different number of additional Reads and Writes between the disk controller and the disk to perform the actions which ensure data consistency. In reference [4], four different data integrity configurations were presented: a non-redundant disk array, RAID Level 1, RAID Level 5 and a new scheme called Parity Logging. The details of each configuration are discussed in the following paragraphs.
The first configuration was the standard, non-redundant disk array configuration in which no data backup occurs. Each Read-Modify-Write operation requires a Read from the disk; the data is then updated by the disk controller and the new data is written back to the disk. There is no additional overhead associated with Read-Modify-Write operations. This configuration was included to provide a point of reference for comparison with the other disk array configurations.
The second method was RAID Level 1, in which a second disk array is added which contains a copy of all the data sent to the first array. This method is often called "disk mirroring". For each Read-Modify-Write operation the data is read from the primary disk, updated, and then written to both the primary disk and its mirror disk. Therefore the performance overhead incurred is an additional write to the second disk. Because the performance overhead is not great, the main disadvantage of this method is the cost of a complete second disk array.
The next configuration is RAID Level 5. In this method, a single disk is added to the primary disk array. This disk maintains parity information about the data on the primary array to ensure that data can be reconstructed. For each Read-Modify-Write access, the data must be read from and written to the disk array, and in addition the matching data on the parity disk must be read and then written back. Thus the overhead incurred is an additional read and write for the updating of the parity disk. In the case of small accesses to a disk array, the overhead of a RAID Level 5 system can impose a significant system impact.

The final method is the one proposed in reference [4], Parity Logging, in which the small parity updates are logged and later applied to the parity disk in larger transfers rather than as individual parity reads and writes.

In reference [4], it is stated that the impact of small accesses is greatest on Read-Modify-Write accesses to the disk, and the worst-case scenario, in which all the accesses are small, is then presented. To do a performance analysis of a system it would be helpful to see the overhead caused by each data integrity method for more than one data access size. In this study it was assumed that the data accesses are of two possible types: a small access, which performs a Read-Modify-Write on a single data block, and a large access, which reads a whole track from a single disk. The proportion of large to small accesses was altered to study the effects of the large and small accesses on the system performance.
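To make the relative overheads concrete, the following sketch counts the sequential disk-controller accesses implied by the descriptions above for a single small Read-Modify-Write. Parity Logging is omitted because its per-access behavior is not fully specified in this section, and treating every access as taking one mean Service Time (an illustrative value here) is a simplification.

    # Sequential accesses on the internal disk-controller bus per small
    # Read-Modify-Write, per the scheme descriptions above.
    ACCESSES_PER_RMW = {
        "non-redundant": 2,   # read data, write data
        "RAID 1":        3,   # read data, write data, write mirror copy
        "RAID 5":        4,   # read data, write data, read parity, write parity
    }

    MEAN_SERVICE_MS = 12.0    # illustrative mean disk Service Time

    for scheme, n in ACCESSES_PER_RMW.items():
        # Accesses occur sequentially, so the request's service time is the sum.
        print(f"{scheme}: ~{n * MEAN_SERVICE_MS} msec per small Read-Modify-Write")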
The disk array configuration was the same as those used in references [3] and [4] for an IBM Lightning drive.

RAID Level 1 imposes much less overhead than RAID Level 5 while still maintaining complete data integrity. The main problem with RAID Level 1 is that two complete disk arrays are required, which can be costly.
Parity Logging appears to impose almost no overhead upon the system while providing data integrity protection similar to that of RAID Level 5. From this data it appears that the CPN model underestimates the overhead incurred by the Parity Logging methodology. As stated in reference [4], the expected overhead was around 25% of the disk response time. It did not appear from the simulation data that the parity full-disk transfers occurred. These parity disk updates account for a large portion of the overhead in this scheme; therefore, unless this transfer occurs, the CPN model will underestimate the response time for this configuration. Because the model itself appears to be correct, the way to increase the likelihood of these disk accesses occurring is to run the simulation for longer periods.

The complexity of these system configurations was less than the complexity of the models presented earlier in this paper. This was mainly due to reducing the number of disks and setting the array size N to 1. Because the model is simpler, the simulation could proceed much more quickly than the earlier models, and the simulations for this particular performance analysis were run for twice as much simulation time as the runs in the validation section of this paper. This produces more simulation data, which in turn produces more reliable results. This can be observed in Figure 8, where the data series for each disk data integrity configuration appears to be nearly linear, as expected. However, the fact that the Parity Logging results are less than expected indicates that there is still some error in the results. Therefore, the longer the simulation run and the less complex the model, the more accurate the results of the simulation.
Small writes are prevalent in many applications. Small accesses can impose a severe performance penalty for certain disk array data integrity configurations, in particular RAID Level 5. Therefore, a system designer must balance the performance degradation brought on by the data integrity configuration, the proportion of small accesses to large ones, the system's data integrity needs and the cost constraints of the system. This model can be used to perform performance analyses of systems which conform to the basic system architecture and can be characterized in a functional or procedural fashion. It can be used to validate analytical models such as the ones presented in references [1] and [4].
While this model can estimate the performance of systems which have much larger state spaces than is generally possible with Petri Nets, its main limitation is still the complexity of the state space. If the model is very complex then, if it can be simulated at all, the simulation must be performed on a high-performance computer platform. To ensure that the model will run, the model's complexity must be minimized.
In addition, the amount of time it takes to perform a simulation is a function of the complexity of the system's state space. To ensure accurate simulation results, the simulation time must be maximized.
In conclusion, the model developed here is flexible and accurate enough to estimate the performance of widely varying configurations of systems containing a disk array I/O subsystem. Petri Nets have been used in performance studies of systems in many cases, and the results show that Petri Nets are useful for systems that are not too complex [5,6,7].

The average value from each simulation output chart is used as a data point on one of the charts which characterize the system's performance over a range of system parameters. For example, one such chart's data produces the D=80 data point in Figure 4 of this report.