Multics Technical Bulletin                                       MTB-673
Disk DIM

To:       Distribution

From:     Tom Oke

Date:     08/16/84

Subject:  Adaptive Disk Optimization

Comments should be sent to the author:

via Multics Mail:
     Oke.Multics on System M or CISL.

via Mail:
     Tom Oke
     Advanced Computing Technology Centre
     Foothills Professional Building
     #301, 1620, 29th Street, N.W.
     Calgary, Alberta, Canada  T2N 4L7
     (403) 270-5414

1 ABSTRACT

This MTB outlines modifications to the existing disk control software to produce an optimization of system performance.  These modifications produce a disk driver (disk DIM) which adaptively optimizes disk operations to attain the best possible system responsiveness and system throughput, as opposed to the best possible disk throughput.  The resulting disk DIM permits better resource utilization, is site tunable and provides additional metering support over the existing software.

________________________________________
Multics project internal working documentation.  Not to be reproduced or distributed outside the Multics project.

2 INTRODUCTION

The modifications described in this MTB are aimed at producing a MULTICS disk driver (disk DIM) which adapts to the changing paging needs of the system to optimize system responsiveness and performance.

Traditional disk optimization schemes have been aimed at optimizing the throughput of the disk system itself, and have not accounted for the management characteristics of secondary storage inherent in a Virtual Memory system, particularly one in which all storage access is essentially done through a paging mechanism rather than distinct file IO.  Such approaches have meant that system responsiveness and system performance were adversely affected by the delays imposed to service disk IO.  In a traditional IO system, in which processes explicitly wait (block) until IO completion, these delays may be necessary.  In a system which only blocks processes reading information from secondary storage, and which buffers data destined for secondary storage, such service characteristics are detrimental to system operation: they increase average process blocking delays and reduce the level of multi-programming.

MULTICS has attempted to solve some of these problems by a prioritization and buffering of IO, such that blocking types of IO, page reads and VTOC operations, are high priority and will be done before non-blocking operations, such as page writes.  This noticeably increases system throughput and responsiveness.  It is, however, still too static an optimization technique to serve well through a wide range of system loading.

The new DIM described in this MTB is an attempt to answer the dynamic optimization requirements of a system such as MULTICS.  It provides acceptable system performance over a wide range of load levels, and permits site tuning of the disk system to individual site needs.  Tuning is done at the level of individual IO types, for each drive, providing a great deal of flexibility.  Ballpark system defaults attempt to provide a system which does not require a great deal of tuning to produce acceptable results.  Additional metering information gives a site as much information as possible to determine hardware and tuning needs and their effectiveness.

The resultant DIM provides a wide range of tuning parameters, and responds well to good ballpark tuning values.  It has a fairly broad tuning range and does not appear to demonstrate tuning instabilities.
Thus it should be difficult to provide BAD tuning parameters when following the guidelines mentioned in this MTB.

3 MULTICS DISK MANAGEMENT

The following sections outline the physical characteristics of the MULTICS disk system, and the rationale behind its current management.  MULTICS disk management is done through disk sub-systems, composed of channels, Micro-Programmed Disk Controllers (disk MPC's) and disk drives.

3.1 Disk Drives

Each disk drive can be dual-ported, meaning that it can be controlled through either of two sets of cables, each typically connected to a different MPC.  This provides redundant physical paths to each disk drive, and permits continued drive service if one physical path is down.

3.2 MPC's

Each MPC has physical connections to one or more IOM's and is accessed through the use of Logical Channels.  A 451 MPC (used for MSU451 disk drives) can have two LA's (Link Adaptors) and thus can be connected to two different IOM's, providing redundant paths to the MPC.  Since MSU500/501 disk drives require a faster transfer, each 607 MPC can have only a single LA.  A 612 MPC can handle both 451 and 501 drives, and has a single LA connection per MPC (two separate MPC's exist in the same cabinet).  Path redundancy for MSU500/501 is provided through redundant MPC's, each typically cabled to a different IOM.

3.3 Sub-systems

A disk sub-system is a reflection of the physical connectivity characteristics of the MULTICS disk hardware.  A disk sub-system is granted a string of disk drives and a set of Logical Channels.  Each Logical Channel must be able to access all drives of the string; thus all disk drives have the same physical connectivity and the logical paths from all drives of a sub-system are identical.  The disk software uses any of the channels granted to it to access any of the drives granted to it.  If a channel cannot access a drive, an error situation results and the channel will be removed from active service.

4 MR10.2 SYSTEM LIMITS

The following section details the characteristics of the disk control system, as seen in MR10.2 and earlier releases.

4.1 Channels

As of MR10.2 a disk sub-system has a maximum software configuration limit of 8 channels.  The channels specified to the sub-system are presumed to be specified in terms of an LA base channel and the number of Logical Channels configured on the LA.  This presumes specification in terms of physical paths.  This specification is used to order channel use to evenly load all available physical paths to the disks of the sub-system.

The following example defines two 451 MPC's, each of which has two LA's.  Each MPC is connected to two IOM's.  The PRPH and CHNL cards supply information to the system about the sub-system definition for DSKA and describe the channel connections to it.  It would be specified in the config_file as:

     PRPH  DSKA  A  8. 4  451. 6
     CHNL  DSKA  A 12. 4  B  8. 4  B 12. 4
     MPC   MSPA  451.  A  8. 4  B  8. 4
     MPC   MSPB  451.  A 12. 4  B 12. 4

4.2 Channel Use Ordering

The algorithm used to allocate channels to the sub-system permutes channel numbers to produce an ordering which utilizes the MPC's LA base channels first, thus distributing the IO load across the maximum number of separate physical paths.  It does this by taking the index into an array of channels and reversing the bits of this index to alter the selection order.
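As a concrete illustration, the bit reversal can be sketched as follows (a minimal Python sketch with invented names; the actual implementation is part of the disk software listed in section 9 and is not reproduced here):

     # Sketch of the MR10.2 channel ordering: channels are selected in the
     # order given by bit-reversing the index into the channel array.
     def bit_reverse(index, width):
         """Reverse the low 'width' bits of 'index'."""
         result = 0
         for _ in range(width):
             result = (result << 1) | (index & 1)
             index >>= 1
         return result

     def permute_channels(channels):
         """Reorder a channel list whose length is a power of two."""
         n = len(channels)
         width = n.bit_length() - 1        # log2(n), valid only for powers of two
         assert n == 1 << width, "channel count must be a power of two"
         return [channels[bit_reverse(i, width)] for i in range(n)]

     # Four LA's of two channels each, as in the prph/chnl example below:
     print(permute_channels(["A08", "A09", "B08", "B09", "A12", "A13", "B12", "B13"]))
     # -> ['A08', 'A12', 'B08', 'B12', 'A09', 'A13', 'B09', 'B13']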
The resulting permutation of array indices looks like:

     Array Index:  000 001 010 011 100 101 110 111
     Permutes to:  000 100 010 110 001 101 011 111
                     0   4   2   6   1   5   3   7

For this permutation algorithm to function correctly, the number of channels allocated to a sub-system must be a power of two, and the number of channels on each base channel must also be a power of two.

Thus for the following prph and chnl examples you get:

     prph xxx  a  8. 2
     chnl xxx  b  8. 2  a 12. 2  b 12. 2

                  0   1   2   3   4   5   6   7
     original:  A08 A09 B08 B09 A12 A13 B12 B13
     permuted:  A08 A12 B08 B12 A09 A13 B09 B13

If channel specification is not done in powers of two then incorrect initialization occurs, which may not crash a system until a heavy disk load requires use of the incorrectly initialized channel entries.

     prph xxx  a  8. 3
     chnl xxx  b  8. 3

                  0   1   2   3   4   5   6   7
     original:  A08 A09 A10 B08 B09 B10 *** ***
     permuted:  A08 B09 A10 *** A09 B10 B08 ***

In this example, if we try to run four drives simultaneously we will use an uninitialized channel entry and the system will crash.  This same situation will probably recur if an ESD is attempted, since we heavily load the IO system at that time too.

This algorithm is supposed to permute channel use so as to split the loading of a sub-system between 2 or 4 Link Adaptors.  If three link adaptors are specified, or a non-power-of-two channel count, the algorithm malfunctions as above, by creating uninitialized channel table entries or by incorrectly assigning load across the Link Adaptors.

4.3 Seek Overlap

Multics uses Logical Channels to hold complete disk requests (both seek and IO).  If a drive is requested to perform some action, the channel upon which this request is made is busy until the request is completed.  This gives a direct correlation between the number of Logical Channels available to a disk sub-system and the number of drives which can simultaneously be doing IO.  (In fact it is only the SEEK portion of the IO which can be overlapped, but this is a major portion of the delays involved.)

A disk sub-system with fewer channels than drives can only simultaneously move heads on as many drives as there are channels.  In effect a disk sub-system with more drives than channels looks, for the purposes of head movement and seek delays, as if it had only as many drives as there are channels, with the drives being proportionally bigger.  This reduces total system throughput.

This makes the limited number of channels which can be allocated to a disk sub-system, due to software implementation limits, significant and a point of inefficient service.  It tends to limit the size of a disk string which is allocated to a sub-system and may needlessly cause additional hardware to be required to support seek overlap.  In addition, one wants more Logical Channels allocated to a sub-system than there are drives, to reduce system degradation in situations where an MPC or a Link Adaptor is lost.

4.4 Queue Limits

Disk sub-systems are currently limited to an allocation of 64 queue slots which are accessible only to that sub-system.  This limits each sub-system to no more than a total of 64 requests queued for all its drives at any point in time.  Due to the page write burst characteristics of MULTICS page management, it is rather easy to load individual drives to more than this level, and quite trivial to load an entire sub-system to more than this level.
When a sub-system runs out of queue elements, page control must wait in a spin-lock situation termed an ALLOCATION LOCK.  MULTICS IO burst characteristics tend to poorly distribute the instantaneous load across the disk sub-systems, thus poorly utilizing the queue resources allocated to each sub-system.

Optimization of head motion is done using a nearest-seek-first algorithm.  The efficiency of this algorithm is rather dependent upon the number of requests available to be optimized; more requests will produce a shorter average seek length.  If only a shallow queue depth is available, then optimization is not as efficient and produces longer seek lengths and lower throughput rates.  If a deep queue is available, then shorter seeks will occur.

4.5 Per Sub-System Work Queue Management

The MR10.2 implementation allocates work queues on a per sub-system basis.  Each sub-system has two separate queues, a high priority queue and a low priority queue.  On-cylinder requests are not optimized between the two queues.

Having combined queues for all drives on a sub-system produces queues with more requests in them, and higher overheads for the nearest-seek algorithm, but does not improve optimization, since no more requests per drive are queued.  Thus there is no optimization improvement from combined queues.

Having complete separation between the high and low priority queues prevents any possible optimization between these request types and tends to create de-optimization of all requests, due to the pre-emptive nature of its operation.

4.6 Request Service Selection

One of the aspects of optimization is how drives are selected for service in situations where more than one drive could become active.  Current operation has request selection made to keep channels busy by examining the sub-system work queue to pick the drive to service next.  The work queue of each priority holds all the requests of that priority for all drives of the sub-system.  The oldest requests are at the head of the queue, the youngest at the tail.

Selecting drives for service from the oldest unserviced requests tends to place emphasis on the busiest drives.  It does not prevent stagnation of requests, since the nearest-seek algorithm will pick the seek of the current priority closest to the current cylinder position.  It does, however, tend to provide unequal service according to the current queue loading at the head of the queue.  It is dependent for its operation upon the availability of a sub-system wide queue to determine the ordering of requests for all drives of the sub-system.

5 OPTIMIZATION TECHNIQUES

Two common disk optimizing strategies are nearest-seek and combing.

Nearest-Seek
     This technique attempts to produce the highest possible disk throughput rate by presuming that the best request to service next is the one nearest to the current head position.  This algorithm attempts to produce the smallest average seek length, but suffers from request stagnation problems if requests within a certain band of the disk surface arrive faster than they can be processed.  At this arrival rate requests outside the band stagnate, since there is always a closer candidate for processing.

Combing
     This technique is named for the head motion it produces.  It attempts to guarantee request service and totally eliminates request stagnation.
     It starts the heads at one edge of the disk surface and sweeps them to the other edge, servicing all requests in cylinder order.  Requests which are behind the current head position, and which arrive after the head has passed their location, wait only until the reverse sweep.

The optimization algorithm used in the MR10.2 software is the nearest-seek algorithm, and it does in fact suffer from request stagnation at extremely high IO rates.  It is possible to hang normal system operation through this stagnation.

6 ADAPTIVE OPTIMIZER

The following section details the modifications made for the Adaptive Optimizer to alleviate the restrictions of the MR10.2 disk control software and to provide adaptive optimization of disk requests to better service the MULTICS system.

6.1 New Channel Allocation

The new channel allocation algorithm presumes that the channel specifications supplied for a sub-system each indicate a real Link Adaptor base channel and the number of Logical Channels physically allocated on it.  It considers these channels as unique resources and allocates these resources to the sub-system until they are exhausted.  For example:

     prph xxx  a  8. 3
     chnl xxx  b  8. 2  a 12. 4  b 12. 1

has the following channel resources:

     A08 A09 A10          on one link adaptor   LA0
     B08 B09              on one link adaptor   LA1
     A12 A13 A14 A15      on one link adaptor   LA2
     B12                  on one link adaptor   LA3

Resource allocation takes the first channel from all link adaptors, then the next, and so on, and produces an allocation of:

     A08 B08 A12 B12 A09 B09 A13 A10 A14 A15
     LA0 LA1 LA2 LA3 LA0 LA1 LA2 LA0 LA2 LA2

This produces the best possible loading situation irrespective of the number or grouping of channels.  It also permits intuitive understanding of the order in which channels will be used and of the loading on the Link Adaptors.

The new resource allocator will allocate up to 32 channels to a sub-system, the current hardware maximum of 8 channels for each of four possible LA's (only with 451 MPC's).  If desired, this channel limit can be trivially increased as future hardware capacity is increased.  This greatly increases the previous limit of 8 channels per sub-system.

With these two modifications the channel bottleneck and channel ordering limitations are alleviated.  In addition the channel ordering has been rationalized to permit an easy understanding of what will actually occur and how channels will be used.

6.2 New Free Queue Allocation

The FREE QUEUE is treated as a system-wide resource and its members may be committed to any disk drive regardless of sub-system.  The free queue modifications also permit the FREE QUEUE size to be specified in the config_file on the PARM card.  For example:

     PARM  xxx xxxx xxxx  DSKQ n

where n can be either octal or decimal (terminated with a period) and indicates the number of elements in the FREE QUEUE.  Limits are enforced to average no fewer than 5 nor more than 200 elements per drive.  Thus a 10-drive system may have between 50 and 2000 queue elements.  If no parameter is provided, the default is 20 elements per drive.

Observation of running systems indicates that most of the excessive overhead spent in paging is due to ALLOCATION LOCK waiting, which is a SPIN LOCK until a free queue element is freed by the completion of a physical IO.
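The sizing rule above can be sketched as follows (a minimal Python illustration with invented names; the actual computation is done by get_io_segs, as described in section 10, and the rounding to an even slot count is an assumption drawn from that section):

     # Sketch of the DSKQ free queue sizing: clamp the PARM value to an
     # average of 5..200 elements per drive, default to 20 per drive.
     def freeq_size(n_drives, dskq_parm=None):
         if dskq_parm is None:
             size = 20 * n_drives                 # no DSKQ parameter given
         else:
             size = max(5 * n_drives, min(dskq_parm, 200 * n_drives))
         if size % 2:                             # an even slot count keeps the
             size += 1                            # following structures aligned
         return size

     print(freeq_size(10))         # 200   (default of 20 per drive)
     print(freeq_size(10, 30))     # 50    (raised to the 5 per drive minimum)
     print(freeq_size(10, 5000))   # 2000  (clamped to the 200 per drive maximum)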
On a system overloaded with disk requests, either by bursts longer than the freeq depth per sub-system or by an average request load greater than the freeq depth available, this allocation-lock paging overhead can mount to the region of 25% or more of the total system CPU time.  This SPIN-LOCK is done with the PAGE TABLE LOCK held, which prevents all other paging until the completion of an IO in the sub-system causing the ALLOCATION LOCK.

Testing with a very large freeq, with the FREEQ modifications, has indicated that the deeper queues available do in fact produce noticeably better optimization and system throughput and can easily eliminate ALLOCATION LOCKS.  In tests of a highly loaded system, much more loaded than is possible with the existing freeq depth available, it has been shown that paging overheads stay in the region of 6% or less on a single processor system doing about 75 IO/sec on three 451 disk drives.  If the system is provided a normal size freeq (64 elements) it uses about 35-45% paging, due to ALLOCATION LOCKS, and has a significantly reduced disk throughput.

6.3 Per Drive Work Queue Management

The new work queue management has a single queue per drive, which holds only the requests for that drive.  This will tend to reduce optimization overheads and speed servicing time.  It will also make full optimization of requests possible, since all requests will be considered in the optimization algorithm, and on-cylinder optimization occurs.

Queue elements have been extended from 4 words to 6 words to permit the addition of forward and backward linking in the queues and expanded fields for sector, cylinder and queue_time.  In addition, the pvtx (physical volume table index) has been added to the queue entry to permit easier debugging.

Since there is a single work queue per drive, and all additions are made to the tail of the work queue, the oldest requests are at the head of the work queue.  This makes it easy to determine the oldest request, to prevent request stagnation.

6.4 Request Service Selection

As previously seen, the MR10.2 system selects drives for service by viewing the collection of requests for all drives represented in the two work queues as an indication of drive loading, and uses this collection to determine the best drive to service.

The new optimizer cannot use this selection mechanism, since work queues are now on a per-drive basis and do not represent sub-system loading.  Instead it will round-robin through the drives to select the next requests to keep a channel busy.  This will provide continuous and equal service to drives.

A point to remember is that one should be striving for more channels than drives.  In such a situation there will be no advantage to either method, since there will always be a channel available and all drives will be kept maximally busy.  The only time that selection order will make an efficiency difference is in busy sub-systems which have insufficient channel allocation, or which are suffering a hardware channel degradation.

6.5 Optimization - Concepts

The disk system is an extension of the real memory of the MULTICS system to provide its VIRTUAL MEMORY characteristics.  As such we are not, per se, looking for an optimization which provides the maximum throughput for a disk system; instead we are looking for VIRTUAL MEMORY service characteristics which elicit the best response and throughput from the entire system.
Nearest-seek and disk combing techniques will provide high drive throughput, with disk combing preventing stagnation and nearest-seek perhaps having the advantage at higher IO rates.  But good system performance, as viewed by the users, is whether they get good response and whether their individual process throughput is acceptably high.

Two major types of IO occur within the MULTICS system:

Blocking IO
     IO which causes a process to wait for IO completion.  This IO prevents a process from executing (blocks the process) and reduces the number of currently executable processes.  An example is a page read sponsored by a page fault.  The process must stop execution until the data it requests is available to permit it to continue.

Non-blocking IO
     IO which is essentially a background task, and which does not cause the system to wait for its completion.  An example is a page write caused by the need to evict a page from memory to permit the page frame to be used for another page.  A user process does not need to be suspended until the write is complete; in fact the user process may well have logged out by this time.

There are many shades of these IO forms.  VTOCE reads are blocking IO's, since a page cannot be read until the VTOCE has been read to fill in the ASTE information.  A VTOCE write is non-blocking, since it does not cause an executing process to be suspended.  But since there are a limited number of VTOCE buffers available, a large number of non-blocking VTOCE writes can cause other processes to be suspended until sufficient VTOCE buffer space has been recovered to read a blocked VTOCE in.

In the same way, the disk queue can be fully allocated to requests to write pages, and the core map could be saturated by pages waiting to be written.  Either situation will cause processes to block until queue or core map space can be recovered by the completion of the non-blocking IO.  In this way even non-blocking IO can cause blocking to occur when the applicable resource is totally consumed (over-loading of the resource).

6.6 New Adaptive Optimization

This paper introduces a new optimization strategy which aims at idealized service to the current demands of disk operation.  The simple desire of this system is to provide the lowest possible service times for blocking IO, and to ignore the existence of non-blocking IO.  However, it also recognizes the loading of the various queue resources and takes this into account in determining optimization.

This optimization identifies a number of different IO types, such as VTOCE READ, VTOCE WRITE, PAGE READ, PAGE WRITE, TEST READ and TEST WRITE.  Each IO type is individually optimized according to its current disk loading, with higher optimization for IO types which are more heavily loaded, on a per-drive basis.  Thus response is degraded only on drives needing higher throughput of lower priority IO, and only for the type of IO which is reaching load saturation.  It does this through an optimization strategy specified for each IO type for each drive.

Optimization is done utilizing a Nearest-Logical-Seek algorithm, rather than a Nearest-Physical-Seek algorithm.  A Logical Seek is determined by a combination of the optimization priority of the request's IO type and the Physical Seek length of the request.  The optimizing algorithm uses a MULTIPLIER to convert a Physical Seek Length into a Logical Seek Length.  This multiplier is determined by the queue management routines each time they queue or de-queue a request.
Each drive has a table of IO types which holds the optimizing information used to derive the multiplier, and holds the multiplier itself.

Optimization is determined according to a site specified optimization policy, which states two points on a straight line, the Response Point and the Load Point.

Response Point
     The Response Point is the point at which the system will optimize for best system response, and is the point, for each IO type, when there is only a single request of that type in the queue.  The value stated is given in tracks and is essentially the multiplier to be used for an IO type in this single request case, to convert a Physical Seek into a Logical Seek.

     For example, to mirror the current High/Low queue priority separation, one would state the Response Point value for High Priority IO as 1, for Physical = Logical, and would state the Response Point value for Low Priority IO as the number of tracks on the disk.  Thus a single track move for a Low Priority request looks longer than the longest possible move for a High Priority request, and nearest-seek optimization produces complete separation of all but on-cylinder requests.

Load Point
     The Load Point is the point in queue loading for each IO type at which it should become fully optimized.  For Blocking IO, this would typically state a point which preserves sufficient multiprogramming.  For Non-Blocking IO, this would typically state a point before resource saturation of that IO type would occur and cause the IO type to become Blocking.  This point is stated in terms of queue elements.

Between these two straight line end-points, the optimizing algorithm derives an optimizing multiplier which will be a combination of the current load of the IO type and the stated policy for the IO type.  There will be a continual merging of Logical Seek lengths such that there will no longer be complete separation, and short seeks of low Response priority will start to be taken in preference to long seeks of high Response priority.  Thus the simple policy statement will cause interaction in a complex manner, dependent upon the relative loadings of the IO types at any particular point in time.  Beyond the Load Point limit, the multiplier will remain at 1, to continue optimization at maximum throughput.

The multiplier is basically determined by the straight line formula:

     multiplier = max (1.0, intercept - slope * load_level)

Given that:

     At Response Point:  load_level = 1,     multiplier = Response
     At Load Point:      load_level = Load,  multiplier = 1

one can trivially derive from the two equations:

     Response = intercept - slope * 1
     1        = intercept - slope * Load

that:

     Response - 1 = -slope * (1 - Load)

and therefore:

     slope     = -(Response - 1)/(1 - Load)
     intercept = Response + slope

This results in the restriction:

     Load Point > 1

The straight-line formula uses a subtraction of the slope product to permit a positive slope value to be used, since this will be printed in the meter.  The optimization algorithm uses a nearest-seek algorithm on Logical Seek Lengths, derived from the formula:

     Logical Seek Length = Physical Seek Length * multiplier

This calculation is performed within the innermost loop of the nearest-seek algorithm.  Thus the use of a simple calculation is highly desirable.
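To make the mechanism concrete, the following minimal Python sketch (invented names; the production code is the PL/1 and ALM disk software listed in section 9) shows the straight-line multiplier and its use in a nearest-logical-seek selection:

     # Straight-line interpolation between the Response Point (load 1) and
     # the Load Point (full optimization, multiplier 1).
     def multiplier(response, load_point, load_level):
         slope = -(response - 1.0) / (1.0 - load_point)    # requires load_point > 1
         intercept = response + slope
         return max(1.0, intercept - slope * load_level)

     # Pick the queued request with the smallest Logical Seek Length.
     # Each request is (cylinder, io_type); 'mult' maps an IO type to the
     # multiplier currently in effect for it on this drive.
     def nearest_logical_seek(requests, current_cyl, mult):
         return min(requests, key=lambda r: abs(r[0] - current_cyl) * mult[r[1]])

     # Using the example policy of section 6.7 (Page Read: 200/5,
     # Page Write: 80000/100), with the heads at cylinder 100:
     m_read  = multiplier(200.0, 5, 1)       # 200.0   - a single Page Read queued
     m_wt_1  = multiplier(80000.0, 100, 1)   # 80000.0 - a single Page Write queued
     m_wt_99 = multiplier(80000.0, 100, 99)  # about 809 - writes near their Load Point

     reqs = [(150, "PageRd"), (101, "PageWt")]
     print(nearest_logical_seek(reqs, 100, {"PageRd": m_read, "PageWt": m_wt_1}))
     # -> (150, 'PageRd')   the read wins while writes are lightly loaded
     print(nearest_logical_seek(reqs, 100, {"PageRd": m_read, "PageWt": m_wt_99}))
     # -> (101, 'PageWt')   the nearby write wins as writes approach their Load Point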
6.7 Optimization Policies

A typical policy statement might be:

                    Response Point    Load Point
     VTOC READ:          100               3
     VTOC WRITE:         150               3
     PAGE READ:          200               5
     PAGE WRITE:       80000             100

The multiplier figures are chosen with some head room on the highest priority IO type, to let other IO's become even more optimized than the stated highest priority IO type as they become saturated.  Essentially Page Write is multiplied such that at the low load point even a single track move looks longer than the longest possible VTOC head move, or even the longest possible Page Read head move.  Thus there is no interaction between the two until Page Writes start to become heavily loaded.

This policy indicates that VTOCE Read is quite important, and will typically be done with highest priority.  It is possible (since the multiplier is 100) that other IO could be even more highly optimized at some point in time.  Note that VTOCE Read will be very highly optimized if 3 are queued on this drive, since this may be blocking a number of eligibility slots.

VTOCE Write is less important, since it is non-blocking, but becomes important if 3 become queued, since this starts to bottleneck VTOCE operations.

Page Read is quite important, but less so than VTOCE Read.  If we have 5 Page Reads outstanding we will be getting out of a multi-programming situation and thus wish to increase the priority of Page Reads.

Page Write is least important, and since we typically have a large core map and reasonably sized disk queues the optimization of Page Write does not need to occur until the queue fills significantly.  Thus we only reach full optimization at 100 outstanding IO's per drive.

Thus, as long as the disk system adequately keeps up with the IO load placed upon it, it will be as responsive as possible.  As loading starts to approach the stated thresholds the system moves towards throughput optimization, as needed to prevent blocking system operation through loss of queue resources.

6.8 Systemic Optimization Criteria

The optimizations as stated above are done on a per-drive basis, but simply specifying a number of per-drive Load Point limits will probably result in system-wide FREE QUEUE over-commitment and not produce the desired effect of reducing queue saturation.

Additional system-wide optimization factors are therefore maintained, one for each IO type, which track the queue commitment to each IO type, along with a specified Max Depth tuning factor.  The current system-wide commitment and the Max Depth factor are combined to produce a system-wide Load Correction Factor to be applied to the multiplier derived from the drive loading.

The system-wide Load Correction Factor is a fraction <= 1.0.  It is determined by the formula:

     Factor = (Max_depth - Depth)/Max_depth

This factor is calculated for each request queued or de-queued for each IO type.  It is applied to the Multiplier determined from the straight-line optimizing formula as follows:

     Final_Multiplier = max (1.0, Factor * Multiplier)

A system load optimizer depth counter may be mapped into another system load optimizer depth counter, to let them accumulate together.  Thus the system wide loading will be the sum of the two (or more) loads, while the individual drive's IO type optimization factor will be calculated from the loading of only the one IO type on that drive.
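The interaction between the drive-level multiplier and the system-wide factor can be sketched as follows (Python, invented names; the Max Depth of 210 for Page Write is taken from the sample meter output in Appendix A):

     # System-wide Load Correction Factor and its application to the
     # per-drive multiplier of an IO type.
     def load_correction_factor(depth, max_depth):
         return max(0.0, (max_depth - depth) / max_depth)

     def final_multiplier(drive_multiplier, depth, max_depth):
         return max(1.0, load_correction_factor(depth, max_depth) * drive_multiplier)

     # A Page Write drive multiplier of about 48485 (40 writes queued on the
     # drive under the section 6.7 policy), against a Max Depth of 210:
     print(final_multiplier(48485.0, 0, 210))     # 48485.0 - no system-wide pressure
     print(final_multiplier(48485.0, 105, 210))   # 24242.5 - half the allowance committed
     print(final_multiplier(48485.0, 210, 210))   # 1.0     - full throughput optimization

Mapping depth counters together simply means that several IO types increment and decrement the same depth value, while each retains its own max_depth.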
The effect and flexibility of the system load optimization factor, as applied to the individual IO type multipliers, further enhances the load recovery characteristics of the Adaptive Optimizer without significantly obscuring its purpose and methods.

Since system load depth counters can be remapped, the depth is calculated such that it can never decrement below 0, and it may be reset to 0, to be corrected by the normal operation of the system.  Thus re-mapping of depth counters does not significantly disrupt normal system operation and can be done on the fly.

6.9 Stagnation Management

The Nearest-Seek algorithm suffers from the flaw that request arrival rates higher than the maximum drive service rate can result in request stagnation, which can deadlock the system.  Throughput testing on a stock MR10.2 system has already produced this deadlock several times, requiring an Execute Fault to recover the system.  Since the Adaptive Optimizer utilizes an Adaptive Nearest-Seek algorithm it could also suffer from this problem.

The problem of stagnation is dealt with by altering the optimization strategy utilized, switching to a Disk Combing algorithm when stagnation situations are detected.  The Combing algorithm is stagnation free, but does produce a degraded request response time, since no optimization differentiation is made to speed Blocking IO service.  The stagnation time period is a site tunable parameter.

The system manages stagnation situations by examining the first (oldest) element of the work queue for a device it will operate upon.  If the first request has been in the queue for longer than a stagnation time period (default 5 seconds), then the device will be managed with a disk-combing technique, rather than the ADAPTIVE OPTIMIZED NEAREST-SEEK.

Thus when a drive becomes so clogged with requests that it cannot service some of them within an acceptable time period, we switch optimization strategies to correctly manage this overloaded situation in the most efficient manner possible.  This provides a guarantee of service and continued system operation.  Combing continues until no requests in the work queue are older than the stagnation time period.

Load testing to levels which would previously have rendered the system essentially catatonic merely produces very slow service, but system operation continues.

If extreme IO rates are maintained over a long time period another syndrome occurs, even with stagnation management.  In this case IO rates are such that extremely long disk queues occur and service times become longer than the memory lap time.  In such a situation a secondary stagnation can occur in which processes thrash, even though they are guaranteed to get the pages they request.  By the time the page being referenced has been moved to memory, the page holding the instruction which took the page fault has been written to disk or discarded.

During testing, a situation of this sort occurred with the Initializer such that a console command to kill the loading jobs could not be completed.  The system was recovered by spinning down a drive holding process directories, causing the test processes to be suspended long enough for the Initializer to get its command working set into memory.  It is not expected that this level of system loading can be approached, or maintained, in normal service.

7 METERING DATA

A performance dependent system such as paging IO requires good metering to permit good tuning.
The Adaptive Optimizing system will create additional queue management metering, providing information on average queue depth, allocation counts and the high water mark of queue depth.  This will be maintained for the system wide FREEQ and for the individual drive QUEUES.  This will permit sufficient queue depth statistics to be kept to provide the necessary tuning information for freeq allocation.

Additional metering information is maintained on channel use, indicating system utilization of channels and channel service information.

The metering system also provides a number of service statistics for each IO type, for each drive, indicating total seek length and seek count, total queue wait time and queue allocation count, total channel wait time and channel use count, and comb count.

The metering tools utilize this information to provide extensive meter output from which a number of observations can be directly made, and further operational characteristics inferred.

8 SEEK LENGTH AVERAGING

The MR10.2 SEEK LENGTH meter is wildly inaccurate, due to its decaying average algorithm.  The current algorithm is:

     ave = ave + divide (seek - ave, 256, 35, 18);

This will produce a long term average which will approach the correct value, but takes hundreds of iterations to do so.  Individual long seeks are inconsequential and the result is an average which is always too high or too low, and usually quite a bit off.

Adaptive Optimizer seek metering is done per device per IO type by dividing the integer summation of the physical seek lengths performed by the count of the number of seeks done.

The accumulator for total seek length per IO type per drive is only a fixed bin (35) variable, and could potentially cause overflow faults which would result in system crashes.  Using a maximum average seek length of 800 cylinders, a seek and service time of 0.050 sec and the accumulator size results in a worst case overflow time of:

     9942 days = (2**35/800 tracks)/(.050 * 3600 * 24)

Even if the service time were taken to be 1 millisecond, the overflow time would be 198 days.

9 AFFECTED PROGRAMS

These modifications affect a fairly small set of sources, essentially those routines which reference dskdcl.incl.pl1 or dskdcl.incl.alm:

References to dskdcl.incl.pl1:  (09/07/82 1300.0 mdt Tue)
     azm_display_fdump_events.pl1, device_meters.pl1, disk_control.pl1,
     disk_init.pl1, disk_meters.pl1, disk_queue.pl1, get_io_segs.pl1,
     ioi_init.pl1, process_dump_segments.pl1, spg_fs_info_.pl1,
     structure_library_2_.cds

References to dskdcl.incl.alm:  (09/07/82 1300.0 mdt Tue)
     dctl.alm

10 DYNAMIC TABLE ALLOCATION

Storage allocation is done through the routine get_io_segs, which looks for a DSKQ entry on the PARM card.  If one is not found, then an allocation of 20 entries per drive is done.

The FREEQ is allocated with an even number of slots, to ensure double word alignment for the structures which follow it.  The size of the FREEQ is passed in the disk_data structure by get_io_segs and is used by disk_init to create and initialize the free queue.

Channel counting is also done by get_io_segs, and space is allocated for channel entries.  These entries must be individually double word aligned, and a multiple of double words in length.

11 FREEQ ISSUES

Several issues occur with a system wide free queue.
1. Locking

   Since the freeq becomes a system wide resource, and exists outside the current sub-system lock, it must have its own lock to protect against multiple access in a multi-processor environment.  This lock will be held for a very short time period, since it only protects on the order of 20 lines of straight-line code.  Thus it is a spin-lock and is un-metered.

2. Run Polling

   During times that the DIM has been called from Page Control and must check for the completion of IO's, it calls a Disk RUN routine, which scans the channels to determine completion of IO's.  This is necessary since Page Control runs in a masked environment and cannot receive interrupts.

   With the MR10.2 software, Run Polling was performed only upon the channels of the current sub-system and could only detect IO completion for requests on that sub-system.  With the FREE QUEUE modifications, Run Polling has been extended to scan all channels of the system, rather than just those of the sub-system.  This is necessary to handle ALLOCATION LOCK situations, in which the completion of the IO which will unlock the Allocation Lock occurs on a sub-system other than that which sponsored the lock.  If only the single sub-system were tested for IO completion, the Allocation Lock would be a deadly embrace.

3. Debugging

   Since the individual free queue elements are not tied to a particular sub-system, the ol_dump and AZM event display functions cannot find the drive reference left in a free queue entry and turn it back into a pvtx reference to determine the drive name.  Thus a pvtx field has been added to the freeq entry.  In addition the time field has been expanded from 36 bits to a full 71 bits, so no modulus operations are performed and it is much easier to determine request stagnation.

12 METERING TOOLS

The new modifications have required the complete re-write of the disk_meters command, and the obsolescence of the device_meters command.  The re-written disk_meters will provide a great deal more metering information, and will default to an output almost identical to the old disk_meters.  However control options will permit a greatly extended metering output.  A synopsis of this command is in the disk_meters.info appendix.

12.1 Meter Interpretation

Interpretation of this metering output provides considerable information on the internal operation of the disk system, though much of it is determined by inference.  Information is presented for system-wide data, sub-system data and individual drive data.  If the long form of the command is selected, detailed information is provided down to the level of individual IO types, drive load characteristics and channel specific information.

12.2 Queue Information

For all queues the average depth is determined by summing the depth of the queue prior to insertion of each request, and counting the number of requests which needed to be queued.  Since much of the time drives are idle and do not have a queue, it is typical to see a much smaller drive queue allocation count than the number of IO's which the drive has actually done.

The ratio of IO's to allocations, in comparison to the average IO rate of the drive, can give an indication of whether a drive is being supplied with infrequent large bursts of IO, or is running in a situation of steady requests.  If infrequent large bursts are typical, a drive will have a relatively high allocation count but a relatively low IO rate, and may have a high maximum queue depth.
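The queue statistics described above can be sketched as follows (a Python illustration with invented names; it is assumed that the enqueue path is taken only for requests which actually need to be queued, as stated under -queues in Appendix A):

     # Sketch of the per-queue metering: average depth is the mean depth
     # seen just before each insertion, and max depth is a high-water mark.
     class QueueMeter:
         def __init__(self):
             self.depth = 0           # current number of queued requests
             self.depth_sum = 0       # sum of depths seen before each insertion
             self.allocations = 0     # requests which needed to be queued
             self.max_depth = 0       # high-water mark (resettable by the tools)

         def enqueue(self):
             self.depth_sum += self.depth
             self.allocations += 1
             self.depth += 1
             self.max_depth = max(self.max_depth, self.depth)

         def dequeue(self):
             self.depth -= 1

         def average_depth(self):
             return self.depth_sum / self.allocations if self.allocations else 0.0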
The maximum queue depth of any queue is the peak number of requests which were outstanding in that queue, or, for the system FREEQ, the peak number of requests which had been allocated at one time.

12.3 Stagnation

The system information area also indicates the stagnation time limit.  This is a factor governing how long a request may remain in a drive's work queue before it is considered to have stagnated.  Stagnation of a disk request can result in situations where there is enough disk traffic in a small region of the disk to prevent requests on outlying areas from ever being serviced.  This can result in a locked-up system and very poor user response.

The optimizing driver uses the stagnation time period it is supplied with to alter its optimization strategy from one in which fast service for the majority of requests is maintained, to one in which guaranteed service for all requests is maintained.  The nearest-seek algorithm will sustain higher peak IO rates, but disk combing will guarantee service to all disk requests, thus assuring continued system operation.

Stagnation is handled per disk drive by the simple expedient of comparing the oldest disk request in each drive's queue with the current calendar clock reading.  If the request is more than the stagnation time old, then a disk combing algorithm is used; otherwise an optimized nearest-logical-seek algorithm is used.  When the queue has been reduced such that the oldest request is no longer older than the stagnation time, the drive returns to an optimized nearest-logical seek.  Each request processed while combing is counted for that drive, producing the COMB count figure in the detailed drive statistics.

12.4 System and Drive Load Optimization

Two forms of IO optimization are in effect.  Each IO type has a per-drive optimization algorithm and a system-wide optimization algorithm.  The system information area of the meter output indicates the current status of the system-wide optimization.  For each IO type it indicates a maximum IO depth value, a current depth and depth mapping, and the optimization fraction.

The maximum depth value is a tuning factor which indicates the maximum system-wide allocation of queue elements to each IO type before all drives are forced to maximum throughput optimization.  It is used in conjunction with the current depth, which is the current sum of all requests of that IO type, to produce a fractional multiplier by which to reduce the ADAPTIVE OPTIMIZER's conversion multiplier.  The smaller the system-wide fraction, the higher the throughput optimization afforded the IO type system-wide.

The algorithm permits a tuner to merge system-wide counters to produce different optimization effects.  This is done by MAPPING a number of counters into a single accumulator.  Initially each IO type is given its own depth accumulator, but each type is also permitted to indicate which of the system-wide accumulators is to be used to accumulate its current depth information.  This lets a number of IO types share a single counter, each with their own max_depth values.

The result is to artificially accentuate loading for an IO type which is a prime system response factor by the loading of a different IO type.  For example, one could map VTOC WRITES and PAGE READS into the same system wide counter, and supply a high response multiplier point for the VTOC WRITES, with a small throughput optimization load point.
Then VTOC WRITES will be optimized if either they themselves are loaded, or there is a high combined total of outstanding PAGE READS or VTOC WRITES.

12.5 Bailouts

The system information area of the meter also has a figure termed "bailouts".  This is a count of the number of times that the fast ALM disk interrupt path called the slower PL1 disk interrupt path to process an interrupt requiring some complex processing.  The fast path only deals with completely straight-forward IO completions.

12.6 Run Polling

Two forms of IO completion can occur within the DIM, which will result in meter output indicating: Run Polls, Get IO w/o Term, etc.

If the system receives an IO completion, it will post an interrupt, and the interrupt path of the DIM will be called to process the completed IO.  This is the typical path for IO completion and handles most of the traffic.

However, if the DIM has already been called, either through the CALL side or because a DISK RUN has been called, the processor will be running masked, and cannot receive the interrupt indicating completion of an IO.  So the RUN code polls all channels of all sub-systems to determine if IO's have completed, and posts them.  It will also initiate a new IO on the channel.

The number of times that an IO completion is seen and processed by RUN is indicated in the "run polls" figure for each channel.  However, if an IO completion is processed by RUN, there may still be a resultant interrupt posted, which will probably show in the value "get_io w/o term", indicating that in processing an interrupt there was no terminate bit on for the channel, and the interrupt was ignored.  Two other terminate stealing situations may occur, but these usually require a multi-processor system to produce the appropriate race conditions.

13 TEST PLAN

Testing will be a multi-stage affair.  There are several distinct areas of test:

Test Free Queue modifications.  This means storage allocation, queue linking and addressing, ESD reset of queues, locking and statistics.  This also means modifications to the necessary meters and dump analysis tools.

Test Channel modifications.  This means all forms of channel addressing and error recovery, allocation of weird channel groups and numbers, cross-sub-system channel use (since addressing has changed), ESD resetting, and channel breaking and recovery (which requires ceding and recovering a channel to IOI for polts testing and firmware load).  This also means modifications to the necessary meters and dump analysis tools.

Test queue field extensions.  Test modification of queue field definitions for increased addressing ranges for FIPS drives, and modifications for queue-per-drive and the associated definition changes.

Test adaptive optimization modifications and associated tools.  Perform loading tests to determine response and throughput measures.

Test disk formatting, firmware loading, error recovery for channels and drives, and ESD shutdowns.

Test locking in a multi-processor environment.  This is necessary since the new disk_data areas require separate lock management from that of the sub-system.

Test channel recovery and ESD.

14 TESTING ACCOMPLISHED

1. Have run highly loaded tests, including crash recovery and ESD in highly loaded situations, with no errors.

2. Have run disk formatting on a drive while dropping one of two MPC's on the sub-system.  Dropped channels (on the crashed MPC) broke correctly, no IO was lost and formatting continued with no problems.
   This was against a background of drive loading jobs.  When formatting finished, and channel activity dropped to normal, we were able to reload firmware in the crashed MPC and recover the channels with no problems.

3. Have run basic throughput tests with the FREEQ modifications.  These clearly demonstrate increased performance, and no ALLOCATION LOCKS are possible with adequate queue allocation.  In any situation of high IO but no allocation locks, paging overhead was quite low.  In any situation with even a low percentage of allocation locks, paging overhead was high.  Thus we can assure sites which are experiencing loaded systems with allocation lock problems of a greatly lowered paging overhead as a result of these modifications.

4. Have run ADAPTIVE OPTIMIZATION with complete success, both single and multi-processor.  The system loads well and does function even in extremely high loading situations.  We have seen one effect of stagnation where the system could lock up.  This is due to page replacement occurring on all of a process's pages before a faulted page could be brought in.  In this case we recovered the Initializer by spinning down some PDIR drives and permitting IO to complete for the Initializer in its PDIR's.  The problem here was that we were paging out the Initializer's pages by the time we could bring in a faulted page (drive service time in combing was longer than the memory lap time), such that we reached a situation of deadly embrace.  Such a situation is not amenable to a disk control solution; only more memory, to increase the lap time, could solve it.

GLOSSARY

IOM
     Input-Output Multiplexor.  It is a semi-intelligent processor connected to ALL the SCU's of the MULTICS system.  It uses an assigned area of memory to communicate commands and status information with the MULTICS system, and an area of memory through which to vector faults.  The fault and command mailbox addresses are defined on the IOM panel by switch settings; each IOM on a system is assigned a different area for faults and mailboxes.  An IOM takes an SCU port, thus the mix of processors and IOM's on a system cannot total more than 8 (4MW SCU).

LA
     Link Adaptor.  It is the physical connection between an IOM and an MPC and consists of a set of boards in the MPC cabinet.  The IOM has another set of boards (PSIA) in its cabinet to connect the other end of the cable joining the LA and the IOM.

Logical Channel
     Unlike a Physical Channel, which is a physical/electrical object, a Logical Channel is not a physical entity.  It is a logical sub-division of a Physical Channel which can hold a request.  The IOM micro-program scans all its mailboxes for requests for Logical Channels and posts these down the Physical Channel.  Replies from the MPC will be tagged with a Logical Channel number and the results posted back to the IOM.  Thus a Logical Channel holds requests to be performed by an MPC and acts like a buffer area for a Physical Channel.

MPC
     Micro-Programmed Controller.  It is a semi-intelligent processor connected to an IOM which controls the operation of the disk drives cabled to it.  It can accept multiple requests, one per Logical Channel, and perform full seek overlap on all its connected disk drives.  A 451 MPC can have two LA's; a 607 MPC (used for MSU500/501 disk drives) has a single LA.

Physical Channel
     A Physical Channel is the physical connection between an MPC or FNP and an IOM.  It is a set of two cables terminated at one end in the MPC and at the other end in the IOM.
     It carries all electrical signals between the two devices.

PSIA
     Peripheral Serial Interface Adaptor.  This is the name for a high speed connection in the IOM, typically used for disk or tape MPC's.  Each PSIA transforms the physical channel it terminates into a set of Logical Channels.  The channel numbers and the number of channels addressable on the single physical channel are determined by resistor packs or switches on the PSIA boards.  These define the Base Channel number and indicate the number of Logical Channels which follow it.  Base Channel numbers must be a multiple of the number of channels defined on the Base Channel, in power-of-two multiples.  Thus if a physical channel has one logical channel, it can occur on any boundary.  A physical channel with two logicals must occur on an even address, one with 3 or 4 channels must be on a quad boundary, and one with 5, 6, 7 or 8 channels must be on an octal boundary.

SCU
     System Control Unit.  This is a system and memory controller.  The current 4MW controller also houses up to 4 million words of memory in the same cabinet.  An SCU provides multi-ported access to the memory it controls to all ports connected on it.  Each SCU has 8 port slots; each processor takes a slot and each IOM takes a slot.  Each SCU is cabled to ALL processors and IOM's of the system, and they should all have the same port numbers on all SCU's.

Appendix A - disk_meters.info

06/08/84  disk_meters, dskm

Syntax:  dskm sub_sys1 sub_sys2 {-control_args}

Function:  Prints metering output from MULTICS disk management.

Arguments:
sub_sysn
   Multiple position-independent sub-system identifiers may be specified to select specific sub-system information.  If no sub-system identifiers are supplied, all sub-systems are listed.  A sub-system identifier takes the form of:

      dskm dska

Control Arguments:
-channels, -chn
   Requests sub-system channel information.  This will be of the form:

      dska: Channel information.
      A8 :  42530 connects,  163 int w/o term,  385 run polls
      A12:   4559 connects,   13 int w/o term,  129 run polls
      A9 :   1307 connects,   15 int w/o term,  225 run polls
      A13:     86 connects

      connects        - Number of channel connections made.
      int w/o term    - Number of interrupts without terminate status.
      run polls       - Number of IO's seen by RUN polling.
      get_io w/o term - Number of io_manager calls not returning terminate.
      term not active - Number of interrupts with terminate on an inactive
                        channel.
      IOI             - The channel has been released to IOI use.
      INOP            - The channel is deemed to be inoperative.
      BROKEN          - The channel is deemed to be broken.

-detail, -dtl
   Requests detailed printout of drive information.  It is of the form:

      dska_04  #Seeks AveSeek Queue-wait   Channel-wait   Queued Multiplier
      PageRd    11338   70.38     44.5     0.8%    27.3        0      119.8
      PageWt     1482   51.98   1153.7     0.1%    32.4        0    50239.2
      VtocRd     1712   90.71     26.3     0.1%    24.4        0       23.8
      VtocWt     1518   89.38     26.4     0.1%    20.8        0       49.9
      TEST          0 UNLOADs,     39 TESTs
      Channels   1.05% busy,      212 Combs.

-long, -lg
   Requests all of -dtl, -chn, -q, -sys.

-queues, -q
   Requests inclusion of drive queue information, of the form:

      dska_04 Queue: Ave 16.1, Alloc 99, Max Depth 50/280, Cur Depth 0

   This indicates the average queue depth for the specified number of queue allocations, the maximum depth since max_depth meters were last reset, and the current depth in the queue.  Requests are only queued if a drive is busy and/or it already has requests queued.
-report_reset, -rr
   Requests normal statistics to be printed, according to the other control arguments, and then meters to be reset to this point in time (see -reset).

-unreset, -urs
   Requests that disk_meters reset its meters to boot time, by releasing its temporary meters segment.

-reset, -rs
   Requests that disk_meters reset its meters to this point in time, and not print statistics.  A reset is accomplished by making a copy of the statistics as of the reset time; future invocations of the command will display the difference between the current statistics and the copy.

-system, -sys
   Requests that system statistics and optimizing information be printed, in the form:

      FREE Queue: Ave 6.2, Alloc 60237, Max Depth 99/280, Cur Depth 1
      System Factors: stagnate time 5.000 seconds, 378 bail outs.
      Maximum Depth Meters reset at: 06/07/84  2207.7 mdt Thu
      PageRd Max Load   6, Depth 0 (PageRd), Fraction 1.0000
      PageWt Max Load 210, Depth 0 (PageWt), Fraction 1.0000
      VtocRd Max Load   6, Depth 0 (VtocRd), Fraction 1.0000
      VtocWt Max Load  12, Depth 0 (VtocWt), Fraction 1.0000

   This indicates FREE Queue use, the stagnation time beyond which the system does disk combing, and the number of times that the ALM driver had to call the PL1 driver to process complex interrupt information.  The time at which max_depth meters were last reset is given, as is the current status of the system-wide load optimization algorithm.

Default Information:
The default invocation of disk_meters will provide information like:

      Metering Time: 11:23:01

      Subsystem dski:
                    Locks    Waits    %Calls   Average      %CPU
      Call Lock:    11768        0   0.0000%     0.000   0.00000%
      Int Lock:     11764        0   0.0000%     0.000   0.00000%
      Run Lock:      3091        0   0.0000%     0.000   0.00000%
      Alloc Lock:   11755        3   0.0255%    12.223   0.00009%

      Drive    Reads   Writes      Seek      ATB       ATB      ATB
                                Distance    Reads    Writes     I/O
        1       3064     1831        68     13375     22382     8372
        2        729       36         9     56216   1138375    53570
        3       3065     1934        77     13370     21190     8197
        4        922       26        11     44448   1576212    43229

This indicates the metering period, the sub-system and lock information for the sub-system, and individual drive IO information for all drives which have performed IO in the metering period.  Typically 0 counts are suppressed to highlight useful information.

Appendix B - tune_disk.info

08/16/84  tune_disk, td

Syntax:  tune_disk drive_name io_type {-load | -ld} n {-response | -rsp} m
         tune_disk reset_max
         tune_disk reset_sys
         tune_disk stagnate seconds
         tune_disk system io_type {-max n} {-map io_type}

Function:  Permits a user with hphcs_ access to alter disk tuning parameters.

Arguments:
io_type
   An io_type is the name of a type of IO tunable by tune_disk.  If tune_disk is invoked without arguments it will print a usage message which includes the valid io_type names.

drive_name
   Is the name of a disk drive to be tuned.  Drive names must begin with the three characters "dsk", followed by a letter, an underline and one or two numeric digits.

-load n, -ld n
   This argument pair defines the optimization maximum queue load point for the specified drive.  It is one of the two points which define the optimization line.  If -load 1 is specified, the initial response value is the optimizing multiplier and no load optimization is performed.

-response m, -rsp m
   This argument pair defines the optimization maximum response value, which is the multiplier to be used for an IO type queue load of a single request.

reset_max
   This argument requests that all queue maximum depth meters be reset in the disk_seg database.
   The time and date at which the meters were last reset is also maintained in the database.  This argument is useful to permit a new/lower max depth to be seen after altering tuning parameters, or after an Allocation Lock has occurred.

reset_sys
   This argument requests that all system depth counters be reset to 0.  This is useful after altering system depth counter mapping.  If counter mapping has been changed while requests were in the queue, the counter which had been used may be left artificially high.  Resetting back to 0 lets the system correct the value.

stagnate seconds
   This argument pair specifies a change of the system wide stagnation time period to the specified number of seconds.  Tune_disk sets a maximum stagnation time period of 6 minutes.

system
   This argument indicates modification of a system-wide optimization factor.  The maximum depth and/or mapping for the specified io_type will be altered.  If neither a maximum depth value nor a mapping is altered, an error message is issued.

-map io_type
   This argument specifies that the current depth counting for the specified system-wide optimization entry should be done using the counter for io_type.  For example:

      tune_disk system PageRead -map PageWrite

   would have the depth counter for PageWrite used to accumulate the number of PageRead IO's currently outstanding.

-max n
   This argument pair indicates that the maximum depth for the specified system-wide optimization entry should be set to n.  If this depth is reached then full optimization of this IO type will be done system wide for all drives.

Notes:
Optimization is performed by determining a multiplier to be used to convert a Physical Seek Length into a Logical Seek Length, for the purposes of determining the Nearest Logical Seek to perform on a disk drive.  The Response Point determines what this multiplier is for a situation with a single request of that IO type in the queue, and is the multiplier required to produce the best system response.  The Load Point specifies the number of requests permitted in the queue of the specified IO type before full optimization occurs (Logical Seek Length = Physical Seek Length).  These two values define the two endpoints of a straight line.  The optimization multiplier is determined by the current load of the queue and its corresponding position on the straight line.

System-wide queue loading optimization is determined by looking at the system-wide load of an IO type and the maximum depth it should be permitted before becoming fully optimized.  The fraction produced by:

      fraction = max (0.0, (max_depth - depth)/max_depth)

is used to alter the individual drive's IO type multiplier to determine the system-wide queue loading effect on individual drive optimization.

The system-wide optimization utilizes a max_depth specified for the IO type, and a counter of the current depth, to determine the system-wide loading optimization.  Depth counters can be mapped together to form an aggregate system-wide queue loading effect.  When decrementing, counters are not permitted to become negative, but if re-mapped while non-zero they may remain > 0 with no load.  The tuning tools permit resetting the current depth counters for system-wide optimization back to 0, to let the system correct them to a true load indication.

All queues have a high-water-mark accumulator.  This can be reset through the tuning tools to permit a new high-water-mark to be determined.
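As an illustration only (the io_type spellings shown are assumed; invoke tune_disk with no arguments to list the valid names), the example policy of section 6.7 could be applied to drive dska_04 with commands such as:

      tune_disk dska_04 VtocRead  -load 3 -response 100
      tune_disk dska_04 VtocWrite -load 3 -response 150
      tune_disk dska_04 PageRead  -load 5 -response 200
      tune_disk dska_04 PageWrite -load 100 -response 80000

together with a system-wide limit such as:

      tune_disk system PageWrite -max 210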