Multics Technical Bulletin                                MTB-639
DM: dm_error_util_

To:       Distribution
From:     Matthew C. Pierret
Date:     11/15/83
Subject:  Data Management: Error Handling

1 ABSTRACT

Modules of the Data Management External and Collection Access Layers use a new approach to handling errors. The new approach, based on the dm_error_util_ module and the dm_sub_error_ condition, is primarily aimed at improving the maintainability of the code which uses it without incurring a performance penalty.

Comments should be sent to the author:

via Forum:         >udd>m>lls>mtg>DMS_Development.
via Multics Mail:  Pierret.Multics on either MIT Multics or System M.
via telephone:     (HVN) 261-9338 or (617) 492-9338

_________________________________________________________________

Multics project internal working documentation. Not to be reproduced or distributed outside the Multics project without the consent of the author or the author's management.

CONTENTS

1 Abstract
2 Introduction
   2.1 Information loss
   2.2 Incomplete information
   2.3 Lack of ancillary error detection aids
   2.4 The status code parameter
3 Overview of the dm_error_util_ mechanism
   3.1 The signalling mechanism
   3.2 Error objects
   3.3 dm_error_util_
   3.4 Protocol
4 Detailed proposal
   4.1 The basic model
      4.1.1 Modules which detect errors
      4.1.2 Modules which handle errors
      4.1.3 The default_error_handler_ and error handling commands
   4.2 The dm_sub_error_ condition
   4.3 The dm_error_object structure
5 Performance implications
6 Description of the operations
   dm_error_util_
      $signal
      $continue_to_signal
      $handle
      $display

2 INTRODUCTION

The Data Storage and Retrieval (DS&R) subsystem of the Data Management System (DMS) needs a more powerful error handling mechanism than the standard status code-based mechanism. The current status code mechanism is inadequate for the DS&R in four basic ways:

- information loss;
- incomplete information;
- lack of debugging and error detection aids;
- possible performance degradation incurred by carrying the "code" parameter from module to module to module.

The DS&R uses an error handling mechanism based on the signalling mechanism to overcome the limitations of the status code mechanism. In fact, only the user ring portion of the DS&R (everything except the file_manager_) uses this new mechanism. The file_manager_ and the Integrity Services subsystem are excluded from this discussion to avoid the issues related to running in an inner ring. This is not to say that including all of DMS is undesirable. Comments on how to modify the error mechanism described in this MTB so that it is usable by all of DMS are welcome.

2.1 Information loss

Due to the layered design of the DS&R subsystem, errors must be reported in a manner meaningful to the caller at each layer. As a result, errors encountered at low levels are potentially translated at each layer's interface, each time losing information about the actual error.
The following call/return sequence demonstrates such a loss of information through code translation:

   Call relation_manager_$get_tuple
      Call index_manager_$get_key
         Call collection_manager_$get_header
            Call collection_manager_$get_element
            Return with error code no_element
         Return with error code collection_not_found
      Return with error code collection_not_found
   Return with error code index_not_in_relation

The final code, index_not_in_relation, alerts the caller to the problem on the caller's level, i.e., the index specified in the call is not in the relation. However, the actual error, no element was found, is lost. The DS&R needs a way to report errors without losing such information.

2.2 Incomplete information

This example also illustrates the lack of information reported by the status code mechanism. The routine collection_manager_$get_header knows the element for which collection_manager_$get_element was looking, but it has no convenient way to convey that information to its caller or to a user investigating the problem. Information such as that which can be provided via sub_err_ would be very helpful in detecting errors, especially if there were information associated with each interesting level on the stack. The following is an example of the type of information desired:

   (at relation_manager_$get_tuple):
      The specified index does not exist in the relation. The index with the identifier of 16o could not be found in the relation with opening identifier of 340561o.

   (at collection_manager_$get_header):
      The specified collection could not be found. There is no collection_header at control interval 0, slot index 14.

   (at collection_manager_$get_element):
      The specified element was not found. The element at control interval 0, slot 16 has been freed.
This information quickly points out the existence of a programming error, as collection_manager_$get_header and collection_manager_$get_element think they are looking at different locations (slots 14 and 16 respectively).

2.3 Lack of ancillary error detection aids

The manner in which the DS&R is used, both in production and in debug, lends itself to some additional debugging and error detection aids. Two such aids are the ability to log errors and the ability to maintain and report information about the process state at the time an error occurred. Logging is desirable during debug to spot all occurrences of certain errors, and in production to track commonly encountered errors.

Debugging the DS&R has required heavy use of long absentee processes. Errors encountered in an absentee process are very difficult to investigate, since developers cannot examine the process which encountered the problem. A trace of the stack taken at the time of the error provides a good deal of helpful and timely information about the process state. Logging and stack traces would help both developers debugging problems and Beta test sites reporting problems accurately.

2.4 The status code parameter

The DS&R contains a great many modules, and a typical call to the relation_manager_ produces a large number of calls to lower level modules. Currently each call includes a code parameter and most calls are followed by the standard cliche:

   if code ^= 0 then call ERROR_RETURN;

The large majority of modules do not care what the code is, other than that it is zero or non-zero, and simply pass the code on to the caller. Many modules therefore incur an unnecessary expense, first by passing an argument of little use on each call, and then by checking the value of that argument after the call. Most DS&R modules effectively want only to be unwound past if an error of any type has occurred, and returned to if no error has occurred.
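The contrast drawn in section 2.4 can be illustrated with a small model. The following Python sketch is not DS&R code: Python exceptions stand in for the dm_sub_error_ condition (they cannot model restarting), and every function name in it is hypothetical. It shows only that the intermediate layers lose their code parameter and their code check once the error is signalled rather than returned.

```python
# Model only: a Python exception stands in for the dm_sub_error_ condition.
# All function names below are hypothetical.

class DmSubError(Exception):
    """Stand-in for an error reported by signalling rather than by code."""
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


# --- Status-code style: every layer carries and tests the code. ---
def get_element_sc(slot, store):
    if slot not in store:
        return None, "no_element"
    return store[slot], 0

def get_header_sc(slot, store):
    element, code = get_element_sc(slot, store)
    if code != 0:                  # the "if code ^= 0" cliche
        return None, code
    return element, 0

def get_key_sc(slot, store):
    header, code = get_header_sc(slot, store)
    if code != 0:                  # repeated at every layer
        return None, code
    return header, 0


# --- Signalling style: only the detector and the handler mention errors. ---
def get_element_sig(slot, store):
    if slot not in store:
        raise DmSubError("no_element", f"The element at slot {slot} was not found.")
    return store[slot]

def get_header_sig(slot, store):
    return get_element_sig(slot, store)    # no code parameter, no check

def get_key_sig(slot, store):
    return get_header_sig(slot, store)     # simply unwound past on error


def lookup(slot, store):
    """The one caller in this model that actually handles the error."""
    try:
        return get_key_sig(slot, store)
    except DmSubError as error:
        return f"failed: {error.code}"
```

In this model the two middle layers of the signalling version contain no error-related code at all, which is the maintainability gain the bulletin is after.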
3 OVERVIEW OF THE DM_ERROR_UTIL_ MECHANISM

The dm_error_util_ mechanism is designed to meet some of the needs of the DS&R which standard status code mechanisms fail to satisfy. The major components of the mechanism are the signalling mechanism, dm_error_objects, the dm_error_util_ module and a protocol by which modules use the mechanism.

3.1 The signalling mechanism

Modules are alerted to errors via the Multics signalling mechanism. The majority of DS&R modules do not handle errors. Currently most modules that receive a non-zero status code from a called module simply pass the status code on to their callers. The signalling approach frees these modules from dealing with status codes, removes code-checking cliches and removes status codes from their calling sequences.

Those modules which actually handle errors set up "on units" for the dm_sub_error_ condition. The dm_error_util_ mechanism uses the dm_sub_error_ condition in much the same way as the sub_err_ system routine uses the sub_error_ condition. The dm_sub_error_ condition is used instead of the sub_error_ condition to avoid possible confusion between dm_error_util_ and sub_err_.

The performance penalty paid by those modules that must set up dm_sub_error_ on units is not expected to be very high. Because the number of modules needing such an on unit is considered to be low, the relative price should be nominal. The savings gained by freeing all modules from processing status codes may even be greater than the performance penalty. The reader is reminded that this analysis is not based on actual measurements.

3.2 Error objects

Associated with each instance of the dm_sub_error_ condition is a linked list of dm_error_objects. These dm_error_objects are structures which contain information about the error which resulted in the signalling of dm_sub_error_.
The linked list contains one dm_error_object created as a result of signalling the error and potentially one dm_error_object for each module which handles the error. This allows information about an error to be described in high-level terms (say, at the level of relation_manager_$get_tuple) without discarding information about the error described in terms of lower levels (say, record_manager_$get_record and collection_manager_$get_element).

The primary pieces of information in a dm_error_object are an error code and a message string. The message describes the error in the context of the module handling the error. The message is just like the kind of message supported by the sub_err_ system routine. Each module which handles errors uses the error code to determine whether the error is the kind of error that module wants to handle.

The error codes in all dm_error_objects of a single list need not be the same. A module may translate the code to a more appropriate code, such as the translation of no_element to collection_not_found to index_not_in_relation shown in an earlier example. The code is translated not by changing the value of the error code in the dm_error_object, but by creating a new dm_error_object with the new code. This behavior prevents the loss of information by code translation.

3.3 dm_error_util_

The DS&R modules use dm_error_util_ operations to deal with errors. The dm_error_util_ module contains four entries, as follows:

$signal
   creates a dm_error_object and signals the dm_sub_error_ condition.

$continue_to_signal
   creates a dm_error_object and calls the system routine continue_to_signal_. This entry is called from a dm_sub_error_ handler.

$handle
   handles a specified error. This entry is called from a dm_sub_error_ on unit.

$display
   displays information about dm_error_objects. This entry is called by the default_error_handler_ and by error reporting commands.
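The way an on unit might use $handle to select the errors it cares about can be modelled roughly as follows. This Python sketch is an illustration under stated assumptions, not the dm_error_util_ implementation: the list of error objects is a plain Python list with the most recent object first, error types are bare strings, the handlers are hypothetical, and the match is made against the most recent object only (the reading given in section 4.1.2).

```python
# Model only. error_objects is a list of dicts, most recent first.
# handle() and on_unit() are hypothetical stand-ins for $handle and for
# a dm_sub_error_ on unit containing several $handle calls.

def handle(error_type, handler, handler_info, error_objects):
    """Invoke handler if the most recent object matches; return handled_sw."""
    if not error_objects:
        return False
    most_recent = error_objects[0]
    if most_recent["code"] != error_type:
        return False
    handler(error_type, most_recent, handler_info)
    return True


def on_unit(error_objects, log):
    """Try each error type in turn and stop after the first one handled."""
    handlers = [
        ("no_element", lambda etype, obj, info: info.append("retry lookup")),
        ("long_element", lambda etype, obj, info: info.append("shift keys")),
    ]
    for error_type, handler in handlers:
        if handle(error_type, handler, log, error_objects):
            return True
    return False    # not one of ours; a real on unit would continue the signal
```

The boolean returned by handle plays the part of the handled_sw output argument: it lets an on unit with several calls stop as soon as one of them has dealt with the error.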
3.4 Protocol

The power of the signalling mechanism is great enough to allow for many complex situations. To simplify the dm_error_util_ approach, restrictions are placed on the use of the signalling mechanism and a strict protocol is defined for proper use of the dm_error_util_ operations. These restrictions do not actually prevent the use of any aspect of the signalling mechanism; rather they spell out those uses which may produce non-intuitive or problematic results that are not under the control of dm_error_util_.

4 DETAILED PROPOSAL

4.1 The basic model

The modules which use dm_error_util_ are easily classified into three groups: modules that detect an error and wish to report it; DS&R modules that wish to handle errors; and the default_error_handler_ and commands to examine errors. The dm_error_util_ mechanism is most easily discussed by describing how each group uses dm_error_util_.

4.1.1 MODULES WHICH DETECT ERRORS

Any DS&R module which detects an error reports that error via dm_error_util_$signal. The module supplies in the call an error code, the module's name, an error message and action flags, as when calling sub_err_. dm_error_util_$signal creates a dm_sub_error_info condition info structure and a dm_error_object structure in the "dm free area" (the area returned from get_dm_free_area_) and signals dm_sub_error_ (via the signal_ system routine).

If the caller of dm_error_util_$signal has an enabled dm_sub_error_ on unit, that on unit will catch the condition, so the module must understand that it should continue the signal without handling it.

Before returning, or if unwound, dm_error_util_$signal frees all of the dm_error_objects allocated as a result of the error. This is necessary because the objects are not allocated in the stack, and so are not automatically released when the stack is unwound.
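The cleanup obligation described above can be sketched in a few lines. The following Python model is an assumption-laden illustration, not the actual entry: a module-level list stands in for the "dm free area", the on unit is passed in as a callable, an exception named Unwind stands in for a non-local transfer of control, and all names are hypothetical.

```python
# Model only. free_area stands in for the "dm free area"; signal, Unwind
# and the two on-unit callables are hypothetical names.

free_area = []


class Unwind(Exception):
    """Stands in for a non-local transfer of control out of an on unit."""


def signal(code, program_name, message, on_unit):
    """Allocate an error object, invoke the on unit, then free everything."""
    mark = len(free_area)                  # where this error's objects begin
    free_area.append({"code": code, "program": program_name, "message": message})
    try:
        on_unit(free_area[mark])           # may add objects, return, or unwind
    finally:
        del free_area[mark:]               # runs on return and on unwinding


def continuing_on_unit(error_object):
    """Adds a second object for the same error, then lets signal return."""
    free_area.append({"code": "collection_not_found",
                      "program": "cm_get_header",
                      "message": "The specified collection could not be found."})


def unwinding_on_unit(error_object):
    """Stops the signal by transferring control past signal's caller."""
    raise Unwind()
```

The try/finally pair plays the role that a cleanup handler would play in the real entry: the objects are released whether control comes back normally or the frame is unwound.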
4.1.2 MODULES WHICH HANDLE ERRORS

Any module which wishes to handle errors must have a dm_sub_error_ on unit enabled. The on unit should have at least one call to dm_error_util_$handle, passing an error code and an entry variable to a handler routine for the error. If the error code matches the code in the most recent dm_error_object, dm_error_util_$handle invokes the handler with a standard calling sequence.

The handler can, in fact, do anything it wants to do, but some restrictions are necessary to guarantee well-defined behavior. The following four types of action can be taken, in the manner described:

- Continue the signal without adding any information. The handler should call the continue_to_signal_ system subroutine and return.

- Continue the signal after adding information. The handler can add a dm_error_object to the list of dm_error_objects by calling dm_error_util_$continue_to_signal. This entry creates a dm_error_object, fills it with the information supplied in the parameters, links the dm_error_object to the previous dm_error_object, and calls continue_to_signal_. The handler should return after calling dm_error_util_$continue_to_signal.

- Stop the signal. The handler can stop the signal via a non-local transfer of control or via a simple return without having called continue_to_signal_. In the former case, all stack frames more recent than the one into which control is transferred are unwound from the stack, causing cleanup handlers to be invoked. In the latter case, execution continues from the point of the original signal, i.e., from the statement after the call to dm_error_util_$signal.

- Re-signal. Any action which could cause dm_sub_error_ to be signalled should be avoided unless the on unit has a dm_sub_error_ on unit of its own enabled.
  This is because the signalling mechanism will search all stack frames for on units, including those that have already handled the prior instance of the dm_sub_error_ condition. Such actions include calling dm_error_util_$signal, calling signal_ with the dm_sub_error_ condition, or calling any module which might directly or indirectly signal dm_sub_error_.

4.1.3 THE DEFAULT_ERROR_HANDLER_ AND ERROR HANDLING COMMANDS

The default_error_handler_ will handle the dm_sub_error_ condition by first calling dm_error_util_$display to display information about the error, then getting to a new command level. dm_error_util_$display finds the last (most recent) dm_error_object in the list of dm_error_objects associated with the condition and displays the information in that dm_error_object.

Existing error reporting commands, such as reprint_error, can be changed or new ones written to exploit the ability of dm_error_util_$display to display optionally several dm_error_objects.

4.2 The dm_sub_error_ condition

Following is a Reference Guide-style description of the dm_sub_error_ condition:

dm_sub_error_

Cause:
   a Data Management subroutine has detected an error situation for which it wants to signal a condition, often with the possibility of continuing, rather than returning a status code. The dm_error_util_$signal subroutine signals this condition.

Default action:
   prints a message and returns to command level; however, the condition name printed is not dm_sub_error_ but the module name from the dm_error_object in the data structure.

Restrictions:
   none.

Restartability:
   immediately restartable, conditionally restartable, or not restartable depending on the particular situation and how the action flags in the data structure are set.
Data structure:

   dcl 1 dm_sub_error_info       aligned,
         2 header                like condition_info_header,
         2 dm_error_object_ptr   ptr;

where:

dm_error_object_ptr
   points to a dm_error_object structure created by dm_error_util_$signal.

4.3 The dm_error_object structure

The dm_error_object structure, found in dm_error_object.incl.pl1, has the following format and meaning:

   dcl 1 dm_error_object aligned based (dm_error_object_ptr),
         2 version                  char (8) init (ERROR_OBJECT_VERSION_1),
         2 next_error_object_ptr    ptr init (null),
         2 prev_error_object_ptr    ptr init (null),
         2 dm_sub_error_info_ptr    ptr init (null),
         2 flags,
           3 begins_new_error       bit (1) unal init ("0"b),
           3 mbz1                   bit (33) unal init ("0"b),
         2 signalling_program_name  char (32) varying init (""),
         2 message                  char (256) varying init ("");

where:

version
   is equal to ERROR_OBJECT_VERSION_1 in dm_error_object.incl.pl1.

next_error_object_ptr
   points to the next most recent dm_error_object in this chain of dm_error_objects.

prev_error_object_ptr
   points to the next least recent dm_error_object in this chain of dm_error_objects.

dm_sub_error_info_ptr
   points to the dm_sub_error_info condition info data structure for the instance of dm_sub_error_ with which this dm_error_object is associated.

flags.begins_new_error
   if on, indicates that this is the first dm_error_object associated with an instance of the dm_sub_error_ condition. If dm_error_util_$signal is called when there is already an instance of dm_sub_error_, and hence already a chain of dm_error_objects, the new dm_error_object created by dm_error_util_$signal is added to the chain with this flag on to show that a new error has occurred.

flags.mbz1
   must be zero ("0"b).

signalling_program_name
   is the name of the module which created this dm_error_object, i.e., the last module to signal or continue to signal dm_sub_error_.

message
   is a message describing the error.
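The chaining behaviour of the structure can be modelled outside PL/I. The Python sketch below is illustrative only and makes several assumptions: the class and helper names are hypothetical, a code field is included because section 3.2 says each object carries an error code (the declaration above does not show one), and the two links are named newer and older rather than being mapped onto next_error_object_ptr and prev_error_object_ptr.

```python
from dataclasses import dataclass
from typing import List, Optional


# Model only. eq=False keeps comparisons by identity, since the two links
# make the objects refer to each other.
@dataclass(eq=False)
class ErrorObject:
    code: str                              # assumed field, see the lead-in
    signalling_program_name: str
    message: str
    begins_new_error: bool = False
    newer: Optional["ErrorObject"] = None
    older: Optional["ErrorObject"] = None


def add_object(head: Optional[ErrorObject], code: str, program_name: str,
               message: str, begins_new_error: bool = False) -> ErrorObject:
    """Link a new object in as the most recent one and return the new head."""
    new = ErrorObject(code, program_name, message, begins_new_error, older=head)
    if head is not None:
        head.newer = new
    return new


def codes_most_recent_first(head: Optional[ErrorObject]) -> List[str]:
    """Walk the chain from the most recent object back to the first."""
    codes = []
    while head is not None:
        codes.append(head.code)
        head = head.older
    return codes
```

Because a translation adds an object instead of overwriting a code, walking the chain for the example of section 2.1 would recover index_not_in_relation, collection_not_found and no_element, in that order.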
5 PERFORMANCE IMPLICATIONS

Although the main reason for adopting the dm_error_util_ model of error handling is maintainability, it is expected that a performance enhancement may be a welcome side-effect. If it becomes clear that a performance degradation will result, dm_error_util_ will not be used.

Performance degradations could result in two ways: from the added expense of setting up a dm_sub_error_ on unit, and from the added expense of signalling and handling the dm_sub_error_ condition.

It is speculated that the savings recouped by the removal of code parameters and code checking will offset any increase in time spent enabling on units. The rationale for this argument lies in the belief that very few modules will need to enable the on units. A cursory look at the index_manager_ modules revealed that three of the thirty-three modules would require a dm_sub_error_ on unit if the index_manager_ were converted directly to using dm_error_util_.

The cost of signalling and handling conditions is a problem only if the error is ultimately found not to be an error. An example of an error that occurs in the normal and common course of events is the index_manager_'s reliance on dm_error_$long_element. In some cases, index_manager_ determines whether a key will fit in a control interval by attempting to put it in the control interval. If dm_error_$long_element is returned, index_manager_ shifts keys around until a space is found for the new key. It would be very expensive for collection_manager_ to signal dm_sub_error_ simply to give index_manager_ a small piece of information, especially since this is a very common occurrence. This expense can be bypassed in this case by changing the relevant collection_manager_ entry to return a failure indicator if there is not room for the element and to report all other errors via dm_error_util_.
In fact, such a scheme would eliminate all requirements for error handling in the index_manager_.

In short, there are known performance penalties for using a signalling-based model, some of which are offset by performance gains and some, possibly the rest, of which are eliminated by minor interface changes to a few select DS&R modules.

6 DESCRIPTION OF THE OPERATIONS

______________                                     ______________
dm_error_util_                                     dm_error_util_
______________                                     ______________

Name: dm_error_util_

This module is for reporting, handling and displaying errors in the Data Management System. The report of an error is made by calling the $signal entry. These error signals can be selectively caught and handled by using the $handle entry.

Entry: dm_error_util_$signal

This entry is for creating and signalling error objects. Signalling an error object means signalling the dm_sub_error_ condition where the condition info structure points to a dm_error_object structure.

An error object can be caught using the $handle entry from inside a dm_sub_error_ on unit. If more than one error object has been signalled, they are all chained together in a single list, with the most recently signalled at the head of the list.

The default_error_handler_ can be directed to display any number of the error objects in such a list. How much of each error object the default_error_handler_ displays by default can also be specified. The $display entry can be used directly to display the current error object list.

Usage

   dcl dm_error_util_$signal entry options (variable);

   call dm_error_util_$signal (code, signalling_program_name, control_flags, message, message_args);

where:

code (Input)
   is a standard system error code, declared fixed bin (35).

signalling_program_name (Input)
   is the name of the program signalling the error object, declared char (*).
control_flags (Input)
   is a set of flags controlling how the signalling of the error object is to be handled (e.g., whether to log the error object in the DM system log, whether to create a trace_stack, what ACTION flags to set defining the restartability of the condition). This is declared bit (36) aligned, and is interpreted according to the dm_error_flags structure in the dm_error_flags.incl.pl1 include file. The flags can be set by or-ing the DM_ACTION constants in dm_error_flags.incl.pl1, as in:

      DM_ACTION_CANT_RESTART | DM_ACTION_TRACE
      DM_ACTION_QUIET_RESTART | DM_ACTION_LOG

message (Input)
   is an ioa_ control string for a message to be associated with the error object being signalled.

message_args (Input)
   is any number of arguments for the message ioa_ control string.

Examples

   call dm_error_util_$signal (
        dm_error_$ci_already_allocated,
        cm_allocate_ci,
        (DM_ACTION_QUIET_RESTART | DM_ACTION_LOG),
        "^/Control interval ^d in file ^3bo, collection ^3bo, was marked as free in the file_reservation_map, but was already allocated.",
        control_interval_id, file_opening_id, collection_id);

   call dm_error_util_$signal (
        dm_error_$key_out_of_order,
        im_rotate_insert,
        (DM_ACTION_CANT_RESTART | DM_ACTION_TRACE),
        "^/The key in node ^d, slot ^d has a value less than the key in node ^d, slot ^d. The former should be greater than the latter.",
        new_key_id.control_interval_id, new_key_id.index,
        old_key_id.control_interval_id, old_key_id.index);

Entry: dm_error_util_$continue_to_signal

This entry is for adding an error object to a list of error objects and continuing to signal the most recent error object. Continuing to signal an error object means calling continue_to_signal_ from inside a handler invoked by the $handle entry, where the condition info structure points to a dm_error_object structure.
An error object can be caught using the $handle entry from inside a dm_sub_error_ on unit.

The default_error_handler_ can be directed to display any number of the error objects in such a list. How much of each error object the default_error_handler_ displays by default can also be specified. The $display entry can be used directly to display the current error object list.

Usage

   dcl dm_error_util_$continue_to_signal entry options (variable);

   call dm_error_util_$continue_to_signal (code, signalling_program_name, message, message_args);

where:

code (Input)
   is a standard system error code, declared fixed bin (35).

signalling_program_name (Input)
   is the name of the program signalling the error object, declared char (*).

message (Input)
   is an ioa_ control string for a message to be associated with the error object being signalled.

message_args (Input)
   is any number of arguments for the message ioa_ control string.

Entry: dm_error_util_$handle

This entry is used to invoke error handlers when the current dm_error_object contains an error of some particular type. The error handler invoked is a program with a particular calling sequence, which can do anything the caller of $handle desires. However, the handler should obey the restrictions cited in "Notes". The call to the $handle entry is made from the on unit for the dm_sub_error_ condition.

Usage

   dcl dm_error_util_$handle entry (char (*), entry variable, ptr, bit (1) aligned);

   call dm_error_util_$handle (error_type, handler_entry, handler_info_ptr, handled_sw);

where:

error_type (Input)
   is the name of an error type. Currently this must be the same as the name of an error code, and it matches only error objects with that error code.

handler_entry (Input)
   is an entry to be invoked if there is an error of the specified type in the current dm_error_object list.
   The syntax of the handler is:

      dcl handler entry (char (*), ptr, ptr);

      call handler (error_type, dm_error_object_ptr, handler_info_ptr);

handler_info_ptr (Input)
   is a pointer to a caller-defined info structure for use by the caller-specified handler_entry.

handled_sw (Output)
   is an output flag which indicates, if on, that the error was handled. This is useful if an on unit has multiple calls to $handle and wants to stop after one such call handles the error.

Examples

The following code fragment illustrates a use of the $handle entry to catch the dm_error_$no_element error:

      my_no_element_handler_info.return_label = EXIT;

      on dm_sub_error_
         call dm_error_util_$handle ("dm_error_$no_element", MY_NO_ELEMENT_HANDLER, my_no_element_handler_info_ptr, handled_sw);

      call foo;

   EXIT:
      return;

   MY_NO_ELEMENT_HANDLER:
      proc (p_error_type, p_dm_error_object_ptr, p_my_no_element_handler_info_ptr);

      goto p_my_no_element_handler_info_ptr -> my_no_element_handler_info.return_label;

      end MY_NO_ELEMENT_HANDLER;

Entry: dm_error_util_$display

This entry displays information from the current list of error objects.

Usage

   dcl dm_error_util_$display entry (fixed bin (17) aligned);

   call dm_error_util_$display (depth);

where:

depth (Input)
   is the number of error objects in the current error list (counting from the "top", or most recently signalled) which are displayed.
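The effect of the depth argument can be shown with a short model. This Python sketch rests on assumptions and is not the $display entry itself: the error list is a plain Python list with the most recent object first, the function returns lines rather than printing them, and the "program: message" layout is invented for the illustration.

```python
# Model only. error_objects is a list of dicts, most recent first; the
# output layout "program: message" is an assumption of this sketch.

def display(error_objects, depth=1):
    """Return one line for each of the `depth` most recent error objects."""
    lines = []
    for error_object in error_objects[:max(depth, 0)]:
        lines.append(f"{error_object['program']}: {error_object['message']}")
    return lines


# The three levels of the example in section 2.2, most recent first.
EXAMPLE_LIST = [
    {"program": "relation_manager_$get_tuple",
     "message": "The specified index does not exist in the relation."},
    {"program": "collection_manager_$get_header",
     "message": "The specified collection could not be found."},
    {"program": "collection_manager_$get_element",
     "message": "The specified element was not found."},
]
```

With a depth of 1 the model shows only the relation_manager_ message, matching the default behaviour described in section 4.1.3; a depth of 3 shows every level of the example, which is what an error reporting command might request.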