Multics Technical Bulletin MTB-550 DM: index_manager_ design To: Distribution From: Lindsey Leroy Spratt Date: 01/12/83 Subject: Data Managment: Index Manager Design 1 ABSTRACT This document describes various facets of the implementation of the index manager. _________________________________________________________________ Multics project internal working documentation. Not to be reproduced or distributed outside the Multics project. MTB-550 Multics Technical Bulletin DM: index_manager_ design Comments should be sent to the author: via Multics Mail: Spratt.Multics on either MIT Multics or System M. via US Mail: Lindsey Spratt Honeywell Information Systems, inc. 575 Tech Square Cambridge, Massachusetts 02139 via telephone: (HVN) 261-9321, or (617) 492-9321 CONTENTS Page 1 Abstract . . . . . . . . . . . . . . i 2 Introduction . . . . . . . . . . . . 1 3 B-Trees . . . . . . . . . . . . . . 1 4 Index manager utilities . . . . . . 2 5 Storage techniques . . . . . . . . . 2 5.1 Index control intervals . . . . 3 5.2 Key structure . . . . . . . . . 3 5.3 Key prefixes . . . . . . . . . 4 5.4 Determining an appropriate set of prefixes . . . . . . . . . . . 6 6 Interface for manipulating keys in control intervals . . . . . . . . . . 6 7 Key layout . . . . . . . . . . . . . 6 7.1 The field_table structure . . . 7 8 Allocation of keys . . . . . . . . . 9 8.1 Allocation of keys to control intervals . . . . . . . . . . . . 9 8.1.1 Rotating keys . . . . . . 9 8.1.2 Splitting control intervals . . . . . . . . . . . 10 8.1.3 Overlength keys . . . . . 10 8.2 Allocation of a key within a control interval . . . . . . . . . 10 9 The search algorithm . . . . . . . . 11 9.1 Basic approach . . . . . . . . 11 Multics Technical Bulletin MTB-550 DM: index_manager_ design 2 INTRODUCTION The index_manager_ subsystem maintains a "sorting" index, where the keys of which the index is composed may be composed, in turn, of several fields of differing data types. Strategies are implemented which reduce the number of pages touched in a search of the index, and which reduce the overall storage requirements. The algorithm for building and maintaining the index is a modified form of the B-tree algorithm. The storage reduction is due to the use of a "prefix extraction" algorithm. The basic concept behind this is to store data only once which is at the beginning of several keys. The data which is stored only once in this fashion is a "prefix". 3 B-TREES In a survey of B-tree algorithms, "The Ubiquitous B-Tree" by Douglas Comer in Computing Surveys, Vol. 11, No. 2, June 1979, the various B-tree algorithms are divided into several major categories; the basic B-tree, the B*-tree and the B+-tree. The technique employed by index_manager_ is in the B+-tree category. There are two basic ideas in the B-tree. First, keys are allocated in "control intervals" or "pages". A control interval is a piece of storage which can be read from or written to disk in a single operation. Consecutive keys are placed in the same control interval. If a key wont fit in the control interval of the keys which are to either side of it, then it is placed in another control interval which is pointed to by the "branch" pointer which lies between the lesser and greater keys which this key was to be placed between. "Nodes" in a B-tree are the control intervals in which keys are placed. The second idea is to allocate the storage of keys in such a way as to keep the tree of control intervals, or nodes, balanced. The nodes at the bottom of the tree are the leaf nodes. In the basic B-tree, a key can be stored in a branch node or a leaf node, depending on where there happens to be room and where the key falls in the sort of the keys already in the index. The B+-tree has the primary difference that all user-specified keys are stored in the leaf nodes. The keys in the branch nodes serve only to provide a roadmap to the keys in the leaves. The leaf nodes are linked together to speed sequential access of the keys. Since a key in a branch node may have any value as long as its value discriminates between the keys which are its children, the primary criterion for choosing a value for a branch key is to choose the shortest value. A convenient technique for determing the shortest value is to find MTB-550 Multics Technical Bulletin DM: index_manager_ design the common prefix which the two keys being discriminated between have (which may well be nothing) and add one more piece of data beyond the prefix from the higher of the two keys. Depending on the data type involved, this one more piece of data may be a bit, a character, or an entire field. Comer refers to this as a "Prefix B+-tree". ("Prefix extraction", as used elsewhere in this MTB, refers to a technique of key compression and the term "prefix" will henceforth only be used to refer to this key compression technique and not in the sense which Comer uses it.) 4 INDEX MANAGER UTILITIES The index manager uses three utilities, collection_manager_, data_mgmt_util_, and vector_util_. The storage management utility, collection_manager_, deals directly with the page file manager and talks in terms of "elements" of the index collection. These elements are undifferentiated bit strings, they are allocated in "control intervals". A control interval is 1024 words long, one Multics page. At some point in the future this size will become settable on a per-page file basis, approximately in the range of 512 to 4096 words. The collection_manager_ is documented in MTBs 551, "Data Management: Collection Manager Functional Specification.", and 552, "Data Management: Collection Manager Design.". The key data is presented to the caller of the index_manager_ as a typed_vector structure (see MTB-541 "Toward a unification of data manipulation on Multics." for a brief discussion of data vectors). The vector_util_ module provides tools for creating and manipulating vectors of data. The internal representation of keys is as a continuous, peculiarly formatted, bit string. This representation is sometimes referred to as the "string" representation, in contrast to the "vector" representation used to communicate with the caller of the index_manager_. Converting between the "vector" representation and the "string" representation, and manipulating and interpreting the contents of the "string" representation are all functions performed by data_mgmt_util_ (which is also used for this purpose by the record_manager_). 5 STORAGE TECHNIQUES Keys are stored as elements in collections using collection_manager_. The elements are a storage abstraction managed as undifferentiated bit strings. The control intervals in an index collection are either branch control intervals or leaf control intervals. The branching keys stored in the same Multics Technical Bulletin MTB-550 DM: index_manager_ design control interval may have some amount of data in common in their "most significant bits". The optimum set of common prefixes between keys in a control interval is used to minimize storage. 5.1 Index control intervals Each of the two kinds of index control intervals has an associated data structure. The branch control interval uses the branch_ci_header structure and the leaf control intervals uses the leaf_ci_header structure. These control intervals are threaded together in a tree. Each level of the tree, in addition to being threaded to its parents, is threaded in a doubly-linked list to its immediate siblings to aid sequential access. The headers are stored as elements with slot number 1. Every control interval in an index collection has either a leaf or a branch header. The headers are declared in dm_im_ci_header.incl.pl1 as shown below: dcl 1 common_ci_header based (common_ci_header_ptr), 2 flags unaligned, 3 is_leaf bit (1) unaligned, /* ON for leaf_ci. */ 3 pad bit (17) unaligned, /* Must be zero. */ 2 key_tail_space_used_since_last_prefix_compaction fixed bin (18) unsigned unal, 2 key_range unaligned, 3 first fixed bin (18) unsigned, 3 last fixed bin (18) unsigned, 2 parent_id_string bit (36) aligned, 2 previous_id fixed bin (24) unsigned unaligned, 2 next_id fixed bin (24) unsigned unaligned, 2 pad bit (24) unaligned; dcl 1 leaf_ci_header based (leaf_ci_header_ptr), 2 common like common_ci_header; dcl 1 branch_ci_header based (branch_ci_header_ptr), 2 common like common_ci_header, 2 low_branch_id fixed bin (24) unsigned unaligned, 2 pad bit (12) unaligned; 5.2 Key structure The keys are allocated within the control interval structures. Every key which is inserted into the index appears in its entirety in the leaf control intervals. The data in the branch control intervals is only present to guide the searches against the keys. The values of the branching keys are chosen to MTB-550 Multics Technical Bulletin DM: index_manager_ design be the minimum necessary length to differentiate between leaf keys. Every branching key has a branch control interval id associated with it. This identifies the control interval which contains keys which are between the branching key and the next higher branching key in the same control interval. There are different key structures for the branch and the leaf control intervals. The keys in the branch control intervals are formatted according to the branch_key structure, the keys in the leaf control intervals are formatted according to the leaf_key structure. In both cases, the field_table must be used to decode the "contents" of the key. The key structures are declared in dm_im_key.incl.pl1 as shown below: dcl 1 leaf_key based (leaf_key_ptr) unaligned, 2 string bit (lk_string_length) unal; dcl 1 branch_key based (branch_key_ptr) unaligned, 2 branch_id fixed bin (24) unsigned unaligned, 2 last_field_idx fixed bin (12) unaligned unsigned, 2 string bit (bk_string_length) unal; 5.3 Key prefixes There may be several prefix groupings of keys, where a "prefix grouping" is all of the keys in a control interval with the same prefix. This prefix is stored as an element of the branch_index_control_interval, with an element_position_table slot less than the value of first_key_in_element_position_table. If there aren't enough slots, the keys have to be shifted down one in the element_position_table to make room for the new prefix. The prefixes are placed first in the element_position_table because they are less volatile than keys. dcl 1 prefix based (prefix_ptr) aligned, 2 first_slot fixed bin(12) unsigned unaligned, 2 last_slot fixed bin(12) unsigned unaligned, 2 number_of_fields fixed bin (17) unaligned, 2 final_field_is_fragment bit(1) unaligned, 2 pad bit(5) unaligned, 2 contents bit (prefix_contents_length); Only the "tails" of the branching keys need be stored, now, Multics Technical Bulletin MTB-550 DM: index_manager_ design since the entire branching key can be reconstructed at need by prefixing the "tail" portion with its prefix. It is possible to break a prefix and a tail across a field of the key if the field is of a string data-type. Since the prefix may include any number of fields, from none to all, the number of fields actually present in the prefix must be stored with the prefix. If there are N fields present in the prefix, they must be the fields with field ids of 1 through N. Also, a bit is set to indicate whether the Nth field of the prefix is complete, or a fragment completed in each of the key tails. Since only the Nth field can be fragmented, there only needs to be one "field_is_fragmented" bit. The worst case possible for a set of keys, with respect to number of keys having a common prefix, is 0. However, a subset may exist for which there is a common prefix. This prefix is "factored" out of the keys This is done for each subset of keys with a common prefix in the control interval. In this fashion it is possible to have several prefixes for the keys in a single control interval (each key having only one prefix, of course). Since the intent of the prefixing technique is to save storage space by not replicating data, it is appropriate to only do prefix-extraction when space will actually be saved. The space saved is the amount of space the replication of the data would take minus the amount of space for storing the prefix, or: prefix_space_saving = replication_space_use - prefix_space_use, = (prefix_length * number_of_keys_with_prefix) - (prefix_overhead + prefix_length), = prefix_length * (number_of_keys_with_prefix - 1) - prefix_overhead. Prefixes are only extracted if the prefix_space_saving is great enough to make it worthwhile. This means not only that the prefix_space_saving must be greater than 0, but it must be great enough to compensate for the extra indirection (to construct the whole key). Prefixes can provide a small savings in computation, when two successive comparisons are against keys with the same prefix. In this situation, the result of the previous comparison can be re-used. In general, however, the use of prefixes will increase cpu cost. Prefixes are determined under two circumstances. When a key is being inserted, it is known (from the search process used to determine where to insert the key) what, if any, of the existing MTB-550 Multics Technical Bulletin DM: index_manager_ design prefixes apply to it. The appropriate value is recorded in the previous_prefix_index variable. The other case is when a compaction is being done of the entire control interval. Compaction is done when there is insufficient room in the free space pool to do an insertion. In the simplest situation compaction is only used to recover scatterred free space left by deleting key contents. Prefix extraction is done during compaction, but only if scattered free space compaction is insufficient and the key_tail_space_used_since_last_prefix_extraction is greater than or equal to the reqired amount of space to successfully insert the new key. This will require rebuilding the set of prefixes and the key "tails". 5.4 Determining an appropriate set of prefixes The algorithm for selecting the set of prefixes to be used within a given control interval starts with a sorted list of all of the keys in the control interval. This list is passed over, from first key to last, building a list of prefixes which are applicable to the keys seen so far, testing the prefixes of the list against the current key to keep the current prefix list up to date. When done with this set of comparisons, there is a list of all of the prefixes which are sufficiently common to bear extraction. The mapping of keys-to-prefixes is used to get a set of key tails. Finally, the whole mess of key tails and prefixes is put into the control interval using collection_manager_$put_element. 6 INTERFACE FOR MANIPULATING KEYS IN CONTROL INTERVALS When a key is stored in the index collection, it is broken up into as many datums as necessary, where each datum is contained within a control interval. All of the datums together make a single "element" of the index collection. A set of primitives exists for allocating keys, freeing them, rewriting and retrieving. The key is known to these primitives as only a _________________________________________________________________ The key_tail_space_used_since_last_prefix_extraction is an approximate value, which is only useful as an optimization aid in determining when to do a prefix analysis. Multics Technical Bulletin MTB-550 DM: index_manager_ design string of bits, an "element". The structure of the key as a set of fields is known and used by the data_mgmt_util_ utility. 7 KEY LAYOUT The key is layed out within the element. The layout of the key is the either the branch_key or the leaf_key structure, as appropriate. The layout of the contents is described in the "field_table" structure, described below. The layout of the key is divided into the fixed portion of the key and the variable portion, this being divided yet again among prefixes and the "tail". The fixed portion is composed of all of the fixed-length fields, and the length data for the variable fields. This minimizes the expense of accessing the fixed fields, as intermingling fixed-length and varying-length data requires calculating each field's position for each key and separating the fixed and varying field data eliminates the need for calculating the location of the fixed fields. The variable portion of the key consists only of the contents of the variable length fields. The field table describes the beginning location for the fixed length fields and the location of the length value for the variable length fields. It also describes the type of each field, and the location where the first variable field's contents start. 7.1 The field_table structure The field_table structure is stored in the index_manager_'s per-index-collection header information, along with the location of the root node of the index and a set of counts of the keys in the index. The field_table is declared in dm_field_table.incl.pl1 as shown below: dcl 1 field_table aligned based (field_table_ptr), 2 version fixed bin (17) unal, 2 number_of_fields fixed bin (17) unal, 2 maximum_field_name_length fixed bin (17) unal, 2 location_of_first_varying_field fixed bin (17) unal, 2 field (ft_number_of_fields refer (field_table.number_of_fields)), 3 name char (ft_maximum_field_name_length refer (field_table .maximum_field_name_length)) var, 3 flags unal, 4 descriptor_is_varying MTB-550 Multics Technical Bulletin DM: index_manager_ design bit (1) unal, 4 length_is_in_characters bit (1), 4 must_be_zero bit (7) unal, 3 location fixed bin (35), 3 descriptor bit (36) aligned, 3 length_in_bits fixed bin (35), 2 varying_field_map (ft_number_of_fields refer (field_table .number_of_fields)), 3 field_id fixed bin (17) unal, 3 varying_field_index fixed bin (17) unal; The field_table.field array has one entry for each field in the key, varying and non-varying. The varying and non-varying fields may be intermingled in this array. The field_table.field.location variable, for a non-varying field, identifies the starting offset in the key_string of the data for the field. field_table.field.type identifes the data type of the field which is sufficient to know the size of the value for most types. For some types, e.g. char_nonvar, the field_table.field.size value is used to determine the length of the value. For a varying field, field_table.field.location identifies the location of the length value for the field (in the key_string). The actual value of the field is found (for the first varying field) at field_table.location_of_first_varying_field in the key_string. The second varying field is found at field_table.location_of_first_varying_field + (length of first varying field). The length and contents of varying fields are separated out in this fashion to help speed references to specific varying fields, by allowing quicker references to the lengths of the preceding varying fields than that allowed by the obvious alternative of storing: length1 + contents1 + length2 + contents2 + length3 + contents3 + ... where: to pick up contents3 one positions to length1, reads it, adds it in and positions to length2, reads it, skips over contents2, reads length3, reads contents3. Also, this allocation technique allows storing the fields in the same order as their definition by the caller at index-creation time. Multics Technical Bulletin MTB-550 DM: index_manager_ design The varying_field_map array speeds the location of any given field, which is of a varying type. The varying_field_map (ordinality_of_varying_field).field_id is used to determine the index into field_table.field array for the Nth varying field. The varying_field_map (field_array_index).varying_field_index is used to determine the ordinality among varying fields of a varying field given it's field_id, it's position in the field_table.field array. 8 ALLOCATION OF KEYS The technique for allocating keys has two parts; finding an appropriate control interval and allocating keys within control intervals. 8.1 Allocation of keys to control intervals The basic scheme for determining which control interval in which to allocate a key follows: The tree is searched to determine the correct position for the new key. This results in a control interval id and a slot being selected for the new key. If there is sufficient room for the new key in the control interval selected, a single element containing the key is allocated in this control interval at the selected slot. There are two cases which arise when the key doesn't fit in the selected control interval: there is enough room in the control interval on the same level of the index to either the right or the left of the selected control interval, or there isn't. In the first case, a rotation of keys is used to make room in the selected control interval. In the second case the selected control interval is split into two. 8.1.1 ROTATING KEYS If room exists in the preceding or following "sibling" control intervals, then keys are "rotated" out of the target control interval and into the one with room. This rotation requires changing the value of the parent key which was previously used to discriminate the control intervals involved in the rotation, which is dealt with as a new "insertion" in the parent's level. The key being rotated into the parent control interval is placed in the same slot as the old key it is replacing. MTB-550 Multics Technical Bulletin DM: index_manager_ design 8.1.2 SPLITTING CONTROL INTERVALS If there is no room for a rotation, then the target control interval must be "split". This consists of a new control interval being allocated, half of the keys of the target control interval being placed in the new control interval, and a new key being inserted in the parent control interval to discriminate between the target control interval and the new control interval. 8.1.3 OVERLENGTH KEYS The above scenario is somewhat complicated by the possibility that a key will be too long to fit in one control interval. To store an overlength key it is necessary to break it into multiple elements. The number of elements required to store a key is equal to the least greater integer of the length of the key divided by the maximum allowed element size. (i.e., number_of_elements = ceil (key_length / maximum_allowed_element_size) If key_length is not evenly divisible by the maximum_allowed_element_size, then one of the elements of the key is less than maximum_allowed_element_size in length. Otherwise, all of the elements are maximum sized. If there is a less-than-maximum sized element, it is allocated first, and in the same fashion as an element which is the entire key as discussed above. For the maximum-sized elements, which necessarily occupy entire control intervals, a set of empty control intervals is used. 8.2 Allocation of a key within a control interval Once the control interval has been selected in which the key is to be inserted, it may be necessary to alter the construction of the control interval to give the key the correct element_position_table index, and to make room for the key contents. The first situation arises because the key order is kept by storing keys within a control interval in element_position_table index order. If a key is being inserted which must have an index already in use, then the existing keys with that index and higher must be shifted down. The second situation, making room for the key's contents, arises when there is insufficient room in the "unused" space of the control interval for its contents. This is solved by "compacting" the control interval. There are two kinds of compaction, used according to circumstances; scatterred free space compaction and common prefix compaction. Scatterred free space compaction is accomplished by making a "logical" copy of the control interval by inserting each of the Multics Technical Bulletin MTB-550 DM: index_manager_ design keys in the control interval in a temp control interval, then physically copying this "compacted" version of the control interval back into the original control interval. This increases the unused space of the control interval because there was scattered free space in the original copy. The scattered free space comes from deleting (or rewriting larger) keys and simply leaving the storage their contents occupy behind to be recovered in this fashion. Common prefix compaction is extracting from the keys in the control interval all of the common prefixes, for storage as mentioned above. The basic problem in this kind of compaction is to determine what the common prefixes are which, when extracted, will realize a savings in space. 9 THE SEARCH ALGORITHM The algorithm for searching the index has several determining factors; the clustering of keys in control intervals to reduce paging, the presence of multiple fields, the use of ranges in constraining the search, the use of search constraints which require sequential comparisons of keys, and the possiblity of more than one set of constraints (the results of which are to be or'ed together) identifying the same key. An example of the next to last determining factor is: "last name ends in ""s""". An example of the last determining factor is: "(name = Daniel & age > 15) or (age < 30)". The algorithm for satisfying a search specification must take into account all of these things. 9.1 Basic approach The search_specification provided by the caller of index_manager_ is analyzed to produce an "interval_specification". The interval_specification consists of an array of "intervals" (about one per and_group in the search_specification). Each interval has a low value and a high value. An interval is specified to be all of the keys greater than the low value and less than the high value (keys equal to the end points are optionally included in an interval, depending on the and_group(s) associated with the interval). The key closest to each end point (and on the appropriate "side") is found using the basic search algorithm (implemented by im_basic_search). It may be necessary to complete the determination of which keys satisfy the search_specification by doing a sequential search over one or more of the intervals of keys found by the above application of the interval_specification.