Multics Technical Bulletin                                        MTB-564

To:       Distribution
From:     André Bensoussan
Date:     02/04/83
Subject:  Phasing Page Control and Before Journal

ABSTRACT

This MTB describes how Page Control and Data Management cooperate in implementing the protocol known as the "Write Ahead Log" (WAL) protocol. When a data management file is modified, a "Before Image" is logged in a Before Journal; that is, the portion of the file about to be modified is saved in the journal and used later to undo the modification if a rollback is requested. For the rollback to operate properly even after an emergency shutdown failure, the data base modification must be held in main memory until its associated before image has been physically written to disk. This is the essence of the WAL protocol. Since the first implementation of data management files will use Multi Segment Files, whose pages are moved to disk by Page Control, the protocol cannot be enforced without Page Control's participation. This MTB describes the respective responsibilities of Page Control, File Manager and Before Journal Manager in their contract to enforce the WAL protocol.

_________________________________________________________________

Multics project internal working documentation. Not to be reproduced or distributed outside the Multics project.

Comments should be sent to the author:

via Multics Mail:
   Bensoussan.Multics on System M.

via US Mail:
   André Bensoussan
   Honeywell Information Systems, inc.
   575 Tech Square
   Cambridge, Massachusetts 02139

via telephone:
   (HVN) 261-9334, or (617) 492-9334

CONTENTS

Abstract
1 Introduction
2 Abbreviations
3 Background information
4 Description of the protocol
   4.1 Before Journal Manager protocol
   4.2 File Manager protocol
   4.3 Page Control protocol
5 Extension to several Before Journals

1 INTRODUCTION

In the first release of the new Data Management, files will still be implemented as MSF's and their pages will be written out at Page Control's discretion. To be able to undo a set of modifications done by a transaction, Data Management uses the "Before Journal" technique: before any portion of a file is modified, its original value is recorded in a so-called "Before Image" (BI), appended as a logical record to a sequential file called the "Before Journal". If a modified page is written out to disk before its before image is safe on disk, the rollback mechanism becomes vulnerable to a system crash with ESD failure.

This MTB describes the method used to make Page Control cooperate with Data Management so that Page Control writes data pages out to disk only after their before images are safe on disk. If this can be achieved, it gives the recovery mechanism of Data Management an enormous advantage: it can roll back all unfinished transactions EVEN AFTER A SYSTEM CRASH WITH ESD FAILURE. If it could not be achieved, recovery after ESD failure would require reloading the files that were open at the time of the crash from their last dumps and applying all after images recorded in the after journal(s). This is a very expensive procedure compared to rolling back unfinished transactions.

A different proposal to achieve the same goal has been described in MTB-563, "Data Management: Ordering of disk I/O's", but has not been implemented. The method implemented in the Data Management System for MR10 is the one explained in this memo.
2 ABBREVIATIONS

The following abbreviations are used in this document:

   BJM = Before Journal Manager
   BI  = Before Image
   CI  = Control Interval
   FM  = File Manager
   ESD = Emergency Shut Down
   MSF = Multi Segment File

3 BACKGROUND INFORMATION

When the before journal manager is called to journalize a before image, it enters the before image information in the current CI of the journal, but it does not write the BI out to disk at that time. The CI is, in fact, a page of an MSF, and it will be written out to disk by Page Control. However, when a transaction commits, the before journal manager causes all CI's of the before journal to be flushed (written to disk) up to the CI containing the last BI generated by the committing transaction, and waits for these I/O's to complete.

The BJM is not informed each time a CI (page) of the before journal has been written to disk; the interrupt is handled by Page Control. It can, however, keep track of the control interval up to which the journal is completely on disk each time it requests that the journal be flushed.

4 DESCRIPTION OF THE PROTOCOL

Let us assume that there is only one before journal in the system; the extension to several journals is simple and is discussed at the end of this document.

It is convenient, for the description of the protocol, to use the following definitions:

o  A BI is "safe" if it is completely on disk, and all previous BI's are also safe. A BI is "unsafe" if it is not safe.

o  A CI of the before journal is "safe" if it is completely on disk, and all previous CI's of the journal are also safe. A journal CI is "unsafe" if it is not safe.

Conceptually, the journal can be broken up into two contiguous parts: a safe part, which contains all the safe BI's, followed by an unsafe part, which contains all the other BI's, still unsafe.
The line that separates the two parts may very well fall in the middle of a safe CI, if it happens that this CI contains a portion of a still unsafe BI.

If each BI were time stamped when it is entered in the journal, the time stamp of the last safe BI would always be higher than the time stamp of any other safe BI, and always lower than the time stamp of any unsafe BI. If, in addition, each modified data page in main memory carried the time stamp of the last BI associated with its modification, it would be possible to determine whether the data page could be written out to disk or had to be held in main memory until its BI becomes safe.

The proposed method can be sketched as follows:

o  The BJM maintains the time stamp of the last safe BI in a wired down location available for Page Control to examine.

o  The FM stores in the standard header of each file CI the time stamp of the BI produced the last time the CI was modified.

o  Page Control writes out a file CI only if the time stamp in the CI header is smaller than or equal to the time stamp of the last safe BI maintained by the BJM.

4.1 Before Journal Manager protocol

a. When recording a BI:

   o  Record the BI, starting at the current position in the before journal; the BI may span several CI's.

   o  Generate a time stamp for this BI (the time stamp need not be recorded in the BI).

   o  For each unsafe CI, the BJM remembers the time stamp of the last BI that will become safe when the CI becomes safe. To do so, the BJM associates the time stamp of this BI with the CI that contains the end of the BI.

   o  Return the time stamp of the BI to the caller, i.e., the FM.

b. When committing:

   o  The BJM remembers the last safe CI from the last commit. It knows the CI number n in which the committing transaction produced its last BI. It causes the journal to be flushed up to CI n, and waits for completion of all I/O's.
      When all I/O's are completed, CI n becomes safe, as well as all BI's entirely contained in the flushed CI's.

   o  The BJM has kept track of the time stamp of the last BI that becomes safe when CI n becomes safe. It stores this time stamp in the wired down location containing the time stamp of the last safe BI of the journal, to be used by Page Control.

4.2 File Manager protocol

o  Before modifying a CI of a protected file, the FM calls the BJM to record the necessary BI information and gets back the time stamp of the BI generated by the BJM.

o  It then stores this time stamp in the standard header of the CI about to be modified.

o  Only then can it start modifying the control interval.

Note -- The standard CI header contains the time the CI was last modified. The BI time stamp can also serve as the time last modified.

4.3 Page Control protocol

Page Control must be able to know that a page is a CI of a protected file. The FM, when creating an MSF component for a protected file, will set the "protected file switch" (a new switch) in the VTOC entry. At segment activation, this switch is copied into the ASTE. With this assumption, Page Control would have to do the following:

o  When Page Control decides to write out a page, it should now check in the ASTE whether the page is part of a protected file. If not, it proceeds as it does today.

o  If the page does belong to a protected file, it compares the time stamp stored in the CI with the highest safe time stamp. If the CI time stamp is greater, the page must not be written out because its BI is not safe yet; if it is not greater, the page may be written out, but first its PTW must be faulted to prevent any new modification of the page while it is being written out.

This protocol must be followed by all programs that write out pages to disk, that is:

-  by Page Control in the normal case,
-  by the ESD procedure, and
-  by the program that flushes memory every 15 minutes.
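
As an illustration, the protocols of sections 4.1 through 4.3 can be sketched in Python for a single journal. This is a minimal sketch under stated assumptions, not the MR10 implementation: all names (BeforeJournal, record_bi, commit, Page, fm_modify, may_write_out), the integer clock, the tiny CI size, and the representation of a BI by its length alone are hypothetical. The flush I/O itself and the faulting of the PTW are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class BeforeJournal:
    """Toy single before journal; every name here is illustrative."""
    ci_size: int = 4            # bytes per journal CI (tiny, for the demo)
    clock: int = 0              # stand-in for the system clock
    write_pos: int = 0          # next free byte in the journal
    last_safe_ci: int = -1      # highest CI known to be on disk
    safe_time_stamp: int = 0    # "wired down" value read by Page Control
    ci_last_bi_stamp: dict = field(default_factory=dict)  # unsafe CI -> stamp

    def record_bi(self, length: int) -> int:
        """4.1 a: append a BI of `length` bytes and return its time stamp."""
        self.clock += 1
        stamp = self.clock
        self.write_pos += length
        end_ci = (self.write_pos - 1) // self.ci_size
        # Associate the stamp with the CI that contains the end of the BI.
        self.ci_last_bi_stamp[end_ci] = stamp
        return stamp

    def commit(self, n: int) -> None:
        """4.1 b: after CI's up to n are flushed, publish the safe stamp."""
        # The real flush and the wait for I/O completion would happen here.
        for ci in range(self.last_safe_ci + 1, n + 1):
            stamp = self.ci_last_bi_stamp.pop(ci, None)
            if stamp is not None:
                self.safe_time_stamp = stamp
        self.last_safe_ci = max(self.last_safe_ci, n)


@dataclass
class Page:
    protected: bool             # copy of the "protected file switch"
    ci_time_stamp: int = 0      # BI time stamp kept in the CI header
    data: str = ""


def fm_modify(journal: BeforeJournal, page: Page, new_data: str) -> int:
    """4.2: log the before image, stamp the CI header, then modify."""
    stamp = journal.record_bi(max(len(page.data), 1))
    page.ci_time_stamp = stamp
    page.data = new_data
    return stamp


def may_write_out(journal: BeforeJournal, page: Page) -> bool:
    """4.3: True if Page Control may write the page out now."""
    if not page.protected:
        return True
    return page.ci_time_stamp <= journal.safe_time_stamp


journal = BeforeJournal()
page = Page(protected=True, data="old")
fm_modify(journal, page, "new")       # 3-byte BI, ends in journal CI 0
print(may_write_out(journal, page))   # False: the BI is not safe yet
journal.commit(0)                     # journal flushed up to CI 0
print(may_write_out(journal, page))   # True: the BI is now safe
```

Because a BI is associated with the CI containing its end, a BI that spans CI's 0 and 1 in this sketch does not become safe when only CI 0 has been flushed.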
Since Page Control decides to defer writing out a page on the basis of non ring zero information, it must rely on some safety valves to prevent the pressure on main memory from becoming too high.

o  First, it could validate the time stamps found in data pages as well as the time stamp associated with the before journal; all time stamps must be smaller than the current time.

o  Next, Page Control could inform the BJM each time it has to skip a page by adding 1 to a count associated with the before journal. This causes the BJM to flush the journal when the count becomes "too high," instead of waiting until a transaction commits to do it.

o  Finally, if the BJM has not been invoked for a long time, the count may increase beyond its threshold value without triggering any corrective action. In this case, Page Control should have a way to force the invocation of the BJM to flush the journal.

5 EXTENSION TO SEVERAL BEFORE JOURNALS

If there is more than one before journal, the BJM maintains an array of safe time stamps, one for each journal. When returning the time stamp of the BI, it also returns the index of the journal, which the FM stores in the CI header together with the time stamp; Page Control then uses this index to access the appropriate time stamp in the array.
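
The extension to several journals, combined with the skip count used as a safety valve, might look as follows in Python. This is only a sketch under assumptions: the names JournalTable, ProtectedPage, may_write_out, needs_flush and journal_flushed are hypothetical, the threshold value is invented (the memo only says "too high"), and resetting the count after a flush is an assumption of this sketch rather than something the memo specifies.

```python
from dataclasses import dataclass

FLUSH_THRESHOLD = 3   # hypothetical value; the memo only says "too high"


@dataclass
class ProtectedPage:
    journal_index: int    # stored in the CI header by the FM
    ci_time_stamp: int    # BI time stamp stored in the CI header


class JournalTable:
    """Per-journal state shared by the BJM and Page Control (illustrative)."""

    def __init__(self, n_journals: int) -> None:
        self.safe_time_stamps = [0] * n_journals   # one entry per journal
        self.skip_counts = [0] * n_journals        # pages deferred per journal

    def may_write_out(self, page: ProtectedPage) -> bool:
        """Page Control check, indexed by the journal named in the CI header."""
        j = page.journal_index
        if page.ci_time_stamp <= self.safe_time_stamps[j]:
            return True
        self.skip_counts[j] += 1    # record that a page had to be skipped
        return False

    def needs_flush(self, j: int) -> bool:
        """True when the skip count of journal j has reached the threshold."""
        return self.skip_counts[j] >= FLUSH_THRESHOLD

    def journal_flushed(self, j: int, new_safe_stamp: int) -> None:
        """Publish the new safe stamp of journal j; reset its skip count
        (the reset is an assumption of this sketch)."""
        self.safe_time_stamps[j] = new_safe_stamp
        self.skip_counts[j] = 0


table = JournalTable(2)
table.journal_flushed(0, 10)                      # journal 0 is safe up to 10
print(table.may_write_out(ProtectedPage(0, 7)))   # True
print(table.may_write_out(ProtectedPage(1, 7)))   # False: journal 1 unsafe
```

Keeping the count per journal lets the BJM flush only the journal that is actually holding pages in main memory.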