NASD Programmer's Documentation
Mapping

The mapping module performs the task of storing and retrieving the physical block numbers which correspond to the logical blocks of an object. It is this module which understands how to parse the ptrs field of a nasd_od_node_t, as well as how to parse indirect blocks.

Block pointers

Much like the Berkeley Fast Filesystem, the NASD drive's filesystem uses varying levels of indirection to keep track of the actual data blocks of each NASD object. Within the inode structure, there is a field called ptrs, which holds a series of pointers to blocks on disk. Some of these pointers are direct pointers: they point to blocks whose contents are the contents of a logical block of a NASD object. A direct pointer has type nasd_od_direct_ptr_t. The blkno field of this structure is the physical block number of the pointed-to block. The remaining pointers are indirect pointers, which point to blocks full of other pointers. Indirect pointers have type nasd_od_indirect_ptr_t, which likewise contains a blkno field that is the physical block number of the pointed-to block.

The NASD drive supports multiple levels of indirection. The level of indirection to point to a data block is defined as the number of indirect blocks one must read to determine the block number of that data block, given the inode block. A block with level zero is pointed to by a direct pointer stored within the inode itself. The ptrs field of the inode is divided into a number of regions equal to the number of levels of indirection (including zero) supported by the drive. Each region contains a number of pointers. The first region is a set of direct pointers to data blocks. The remaining regions are indirect pointers which ultimately lead to blocks at a level of indirection corresponding to that region minus one. That is, the first region points directly to blocks, so that is level zero. The second region points to indirect blocks which in turn point to data blocks - that is level one. The third region points to indirect blocks which point to more indirect blocks which in turn point to data blocks - that is level two, and so forth.
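The region scheme described above can be illustrated with a small sketch that finds the level of indirection for a given logical block. All names and sizes here (EX_ILVLS, EX_PTRS_PER_IBLK, ex_region_ptrs, ex_level_of) are hypothetical and chosen only for illustration; the real values come from nasd_od.h, and this is not the drive's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout parameters; the real ones are defined in nasd_od.h. */
#define EX_ILVLS          3   /* levels of indirection: 0, 1 and 2 */
#define EX_PTRS_PER_IBLK  4   /* pointers held by one indirect block */

/* Assumed number of inode pointers in each region, indexed by level. */
static const uint64_t ex_region_ptrs[EX_ILVLS] = { 2, 2, 1 };

/* Number of data blocks reachable through ONE inode pointer at 'level'. */
static uint64_t ex_blocks_per_ptr(int level)
{
  uint64_t n = 1;
  assert(level >= 0 && level < EX_ILVLS);
  while (level-- > 0)
    n *= EX_PTRS_PER_IBLK;
  return n;
}

/*
 * Return the level of indirection needed to reach logical block 'lblkno',
 * or -1 if the block lies beyond what the inode can address.
 */
static int ex_level_of(uint64_t lblkno)
{
  uint64_t first = 0;  /* first logical block covered by the current region */
  int level;

  for (level = 0; level < EX_ILVLS; level++) {
    uint64_t span = ex_region_ptrs[level] * ex_blocks_per_ptr(level);
    if (lblkno < first + span)
      return level;
    first += span;
  }
  return -1;
}
```

With these assumed sizes, logical blocks 0-1 are at level zero, blocks 2-9 at level one, and blocks 10-25 at level two.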

The number of pointers in each of these regions is defined in nasd_od.h. The portion of the code which defines these values, as well as many others derived from them, is generated by a program called blockparam, which may be found in the drive/ subdirectory of the NASD tree. The number of levels of indirection is NASD_OD_ILVLS. The ptrs field of the inode is defined as unsigned char. The first portion of this is coerced to nasd_od_direct_ptr_t structures. The remaining portion is coerced to nasd_od_indirect_ptr_t structures. The number of pointers at each level of indirection is defined by the output of blockparam (the sizes of ptrs, nasd_od_direct_ptr_t, and nasd_od_indirect_ptr_t, together with the number of pointers in the inode at each level of indirection, form the input to blockparam).

Any time a blkno valued at zero is encountered, it is treated as a pointer to a virtual block whose contents are all zeroes. This enables zero-fill for sparsely-written objects.
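A minimal sketch of the zero-fill rule, using a toy in-memory "disk". The names ex_disk and ex_read_block and the sizes are assumptions made for this example; the drive's real read path goes through its block cache and is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_BLOCK_SIZE 16  /* hypothetical block size in bytes */
#define EX_NBLOCKS    4   /* hypothetical number of blocks on the toy disk */

/* Toy stand-in for the physical disk. */
static unsigned char ex_disk[EX_NBLOCKS][EX_BLOCK_SIZE];

/*
 * Fill 'buf' with the contents of the block that 'blkno' points to.
 * A blkno of zero never touches the disk: it denotes a virtual block
 * whose contents are all zeroes.
 */
static void ex_read_block(uint32_t blkno, unsigned char *buf)
{
  assert(blkno < EX_NBLOCKS);
  if (blkno == 0) {
    memset(buf, 0, EX_BLOCK_SIZE);
    return;
  }
  memcpy(buf, ex_disk[blkno], EX_BLOCK_SIZE);
}
```

Because unwritten regions of a sparse object simply keep zero-valued pointers, they consume no physical blocks until something is written to them.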

The mapping module itself is implemented in drive/nasd_bmap.c. This module provides three key interfaces:

nasd_status_t nasd_od_bmap(
  nasd_odc_ent_t  *ne,
  nasd_oblkno_t    in_lblkno,
  nasd_oblkcnt_t   in_lblkcnt,
  nasd_blkcnt_t    in_beforemax,
  nasd_blkcnt_t    in_aftermax,
  int              partnum,
  int              flags,
  nasd_blkrec_t   *blkrecp,
  nasd_blkcnt_t   *blocks_beforep,
  nasd_blkcnt_t   *blocks_afterp,
  int             *blocks_to_allocp);

nasd_status_t nasd_od_bunmap(
  nasd_odc_ent_t  *ne,
  nasd_oblkno_t    in_lblkno,
  nasd_oblkcnt_t   in_lblkcnt,
  int              partnum);

nasd_status_t nasd_od_bfind_last_block(
  nasd_odc_ent_t  *ne,
  int              partnum,
  nasd_uint64      object_len);


nasd_od_bmap()

The primary routine for obtaining and altering mappings is nasd_od_bmap(). The first argument to this function, ne, is a pointer to the cache block that holds the inode block of the object on which the mapping operation is to be performed. The caller should hold both a reference on this cache block and its mutex. The caller should not hold a lock on the object's partition. The second argument of nasd_od_bmap(), in_lblkno, is the logical block number (within the object) of the first block whose mapping is to be obtained. The third argument, in_lblkcnt, is the number of blocks for which to obtain this mapping.

The next arguments specify how far beyond the requested range the bmap module is permitted to look to locate physically contiguous blocks within the object, both before the beginning of the range (in_beforemax) and after its end (in_aftermax). The number of blocks found will be returned in *blocks_beforep and *blocks_afterp, respectively. Physically contiguous blocks are blocks whose physical and logical block numbers are adjacent and sequential with corresponding ordering. For example, assume the caller makes a request with in_lblkno=N, in_lblkcnt=4, in_beforemax=16, and in_aftermax=16. This is a request for the mapping of logical blocks {N, N+1, N+2, N+3}. Further assume that this mapping is to blocks {M, M+1, K-1, K}. If block N-1 maps to block M-1, then block N-1 is physically contiguous to block N. Note that the ordering requirement specifies that if block N maps to block J-1 and block N-1 maps to block J, these are not considered physically contiguous for the purposes of this mapping discovery. In our example, *blocks_beforep may be 1, indicating that there is one block before N which is physically contiguous. The value returned might be larger, up to in_beforemax, depending upon how many such physically contiguous blocks there are. The mapping module will not perform blocking operations to retrieve this contiguity information, so the values returned in *blocks_beforep and *blocks_afterp may be lower than the actual values. The intention of these values is to simplify sequential readahead (or readbehind) within objects.
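The contiguity rule can be sketched over a flat logical-to-physical table. The functions ex_count_before and ex_count_after and the flat map array are hypothetical simplifications: the real module walks the inode and indirect blocks, and may stop early rather than block on I/O.

```c
#include <assert.h>
#include <stdint.h>

/*
 * map[i] is the physical block number backing logical block i (0 = none).
 * Count the logical blocks immediately before 'lblkno' that are physically
 * contiguous with it, looking back at most 'beforemax' blocks.
 */
static uint32_t ex_count_before(const uint32_t *map, uint32_t lblkno,
                                uint32_t beforemax)
{
  uint32_t n = 0;

  while (n < beforemax && n < lblkno) {
    uint32_t cur  = map[lblkno - n];
    uint32_t prev = map[lblkno - n - 1];
    /* Ordering matters: the earlier logical block must be one lower. */
    if (cur == 0 || prev == 0 || prev + 1 != cur)
      break;
    n++;
  }
  return n;
}

/*
 * Count the logical blocks immediately after 'last_lblkno' that are
 * physically contiguous with it, looking ahead at most 'aftermax' blocks.
 */
static uint32_t ex_count_after(const uint32_t *map, uint32_t nblocks,
                               uint32_t last_lblkno, uint32_t aftermax)
{
  uint32_t n = 0;

  assert(last_lblkno < nblocks);
  while (n < aftermax && last_lblkno + n + 1 < nblocks) {
    uint32_t cur  = map[last_lblkno + n];
    uint32_t next = map[last_lblkno + n + 1];
    if (cur == 0 || next == 0 || cur + 1 != next)
      break;
    n++;
  }
  return n;
}
```

For a table of {7, 99, 100, 101, 50, 51, 52, 53, 9} and a request covering logical blocks 2 through 5, this corresponds to the example above with M=100 and K=51: one contiguous block is found before the range and two after it.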

The next argument of nasd_od_bmap(), partnum, is the partition number that the object represented by ne is a member of.

The next argument is a flags word indicating how the mapping operation should behave. If NASD_ODC_B_ALIGN is set in this word, in_beforemax and in_aftermax are treated as alignment masks, not counts of blocks. That is, if in_aftermax is K and NASD_ODC_B_ALIGN is specified, nasd_od_bmap() will search past the last mapped block for contiguous blocks until the physical block pointed to crosses a K-aligned boundary (or, of course, a non-contiguous block is found). If NASD_ODC_B_ALLOC is specified, nasd_od_bmap() must ensure that the target mapping block exists (that is, does not have a block number of zero, indicating a zero-fill block). If necessary, more blocks will be allocated, and their block numbers will be stored in the inode or various indirect blocks as is appropriate. The mapping module calls upon the layout module to determine which physical blocks will be assigned to the task.

The final argument, *blocks_to_allocp, is used to return the number of new blocks which need to be allocated to ensure that every mapped block exists. By calling nasd_od_bmap() with a non-NULL blocks_to_allocp and not specifying NASD_ODC_B_ALLOC, a caller may determine how many blocks would be needed to complete a write request without actually allocating any blocks.

Mapping operations are performed by recursively descending through the indirect blocks until the direct block numbers are located. This recursive descent is performed by the internal function nasd_od_ibmap(), which is called both by nasd_od_bmap() and by itself. Each call to nasd_od_ibmap() represents a level of block pointers. This operation keeps track of how many zero-valued block pointers it has encountered so that nasd_od_bmap() may return this value in *blocks_to_allocp, and so it may know how many blocks must be allocated to satisfy a NASD_ODC_B_ALLOC request. After completing this initial mapping operation, if NASD_ODC_B_ALLOC is specified and not all target blocks existed, nasd_od_bmap() will call upon nasd_od_layout_get_prealloc() to obtain blocks from the range of blocks preallocated to this object. If not enough blocks could be obtained, nasd_od_bmap() will then call nasd_od_layout_alloc_blocks() to obtain enough blocks to complete the request. After that, it will call nasd_odc_ref_ranges() to indicate that these blocks are now in-use. Once this is done, nasd_od_bmap() will call upon nasd_od_fbmap() to perform the actual mapping operation. nasd_od_fbmap() performs a recursive operation similar to nasd_od_ibmap(), except that it takes those newly-allocated blocks and inserts them in the object mapping as indirect or direct blocks as necessary to ensure that all mapped blocks exist.
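The recursive descent can be illustrated with a toy counter that walks a pointer tree and reports how many blocks would have to be allocated to back a logical range. This is only a sketch in the spirit of nasd_od_ibmap(), not its implementation: ex_disk, ex_span and ex_ibmap_count are hypothetical names, the sizes are invented, and counting missing indirect blocks alongside missing data blocks is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PTRS_PER_IBLK 4   /* hypothetical pointers per indirect block */
#define EX_NBLOCKS       16  /* hypothetical size of the toy disk */

/* Toy disk: every block is viewed as an array of block pointers. */
static uint32_t ex_disk[EX_NBLOCKS][EX_PTRS_PER_IBLK];

/* Number of data blocks covered by one pointer at 'level'. */
static uint64_t ex_span(int level)
{
  uint64_t n = 1;
  while (level-- > 0)
    n *= EX_PTRS_PER_IBLK;
  return n;
}

/*
 * Count the blocks that must be allocated so that logical blocks [lo, hi),
 * numbered relative to the subtree under pointer 'blkno', all exist.
 * Level 0 means 'blkno' is a direct pointer to a data block.
 */
static int ex_ibmap_count(uint32_t blkno, int level, uint64_t lo, uint64_t hi)
{
  int missing = 0;
  uint64_t child_span;
  int i;

  assert(level >= 0 && blkno < EX_NBLOCKS && lo < hi && hi <= ex_span(level));

  if (level == 0)
    return (blkno == 0) ? 1 : 0;

  if (blkno == 0)
    missing++;  /* the indirect block itself does not exist yet */

  child_span = ex_span(level - 1);
  for (i = 0; i < EX_PTRS_PER_IBLK; i++) {
    uint64_t c_lo = (uint64_t)i * child_span;
    uint64_t c_hi = c_lo + child_span;
    uint64_t s = (lo > c_lo) ? lo : c_lo;  /* overlap of request and child */
    uint64_t e = (hi < c_hi) ? hi : c_hi;
    uint32_t child;

    if (s >= e)
      continue;  /* this child covers none of the requested range */
    child = (blkno != 0) ? ex_disk[blkno][i] : 0;
    missing += ex_ibmap_count(child, level - 1, s - c_lo, e - c_lo);
  }
  return missing;
}
```

A caller that only wants the count, as in the dry-run use of *blocks_to_allocp, would stop here; an allocating caller would then obtain that many blocks and make a second pass to install them.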


nasd_od_bunmap()

When an object is deleted or truncated, nasd_od_bunmap() is used to deallocate the blocks that were formerly used to store this object. This function takes ne as a pointer to a cache entry representing the inode block. As with nasd_od_bmap(), the caller should hold a reference on and the mutex of this cache block, but not the partition lock. The next two arguments, in_lblkno and in_lblkcnt, specify the first logical block to deallocate and how many blocks to deallocate, respectively. The final argument, partnum, is the number of the partition of which the object is a member. nasd_od_bunmap() uses nasd_od_ibunmap() to recursively iterate through the levels of indirection and zero out the pointers to deallocated blocks. It also aggressively deallocates indirect blocks whose contents are entirely zero. nasd_od_ibunmap() assembles a list of extents to deallocate, which is then passed to nasd_odc_ref_ranges() with a delta of (-1), which releases physical references on these blocks and deallocates them if their refcount goes to zero.
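The pointer-zeroing and extent-gathering steps can be sketched for a single array of pointers. The type ex_extent_t and the functions ex_unmap_range and ex_all_zero are hypothetical; the real nasd_od_ibunmap() recurses through indirect blocks and hands its extent list to nasd_odc_ref_ranges(), neither of which is modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extent: an inclusive run of physical block numbers. */
typedef struct {
  uint32_t first;
  uint32_t last;
} ex_extent_t;

/*
 * Zero ptrs[first .. first+count-1] and record the physical blocks they
 * pointed to in 'ext', merging runs of ascending adjacent block numbers.
 * Returns the number of extents written.
 */
static int ex_unmap_range(uint32_t *ptrs, int nptrs, int first, int count,
                          ex_extent_t *ext, int max_ext)
{
  int n = 0;
  int i;

  assert(first >= 0 && count >= 0 && first + count <= nptrs);
  for (i = first; i < first + count; i++) {
    uint32_t b = ptrs[i];

    if (b == 0)
      continue;            /* zero-fill block: nothing to release */
    ptrs[i] = 0;
    if (n > 0 && ext[n - 1].last + 1 == b) {
      ext[n - 1].last = b; /* extends the previous extent */
    } else {
      assert(n < max_ext);
      ext[n].first = b;
      ext[n].last = b;
      n++;
    }
  }
  return n;
}

/* Nonzero if every pointer is zero, i.e. the block holding them is unused. */
static int ex_all_zero(const uint32_t *ptrs, int nptrs)
{
  int i;
  for (i = 0; i < nptrs; i++)
    if (ptrs[i] != 0)
      return 0;
  return 1;
}
```

An indirect block for which ex_all_zero() holds after unmapping no longer maps anything, which is the condition under which the text above says it is itself deallocated.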


nasd_od_bfind_last_block()

To assist the layout engine, various components of the drive keep track of the physical block number of the last block in each NASD object. At times, it is necessary to rediscover this block number (currently, this only occurs when an object is truncated). When this is necessary, nasd_od_bfind_last_block() may be called on a cache entry representing the inode of the object in question (again, the caller should hold a reference on and the mutex of this cache block; the state of the partition lock is don't-care, however). This function also takes the number of the partition to which this object belongs as partnum, and the new logical length of the object as object_len. The new last block number is stored in the last_block field of the inode structure.
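As a rough illustration of rediscovering the last block, the sketch below derives the last logical block from the object length and looks up its physical block in a flat table. EX_BLOCK_SIZE, the flat map, and the convention of returning zero for an empty object are all assumptions of this example, not details taken from nasd_od_bfind_last_block().

```c
#include <assert.h>
#include <stdint.h>

#define EX_BLOCK_SIZE 8192u  /* hypothetical block size in bytes */

/*
 * map[i] is the physical block number backing logical block i (0 = none).
 * Return the physical block number of the last block of an object that is
 * 'object_len' bytes long, or 0 if the object is empty or its last logical
 * block has no physical block assigned.
 */
static uint32_t ex_find_last_block(const uint32_t *map, uint64_t nmapped,
                                   uint64_t object_len)
{
  uint64_t last_lblkno;

  if (object_len == 0)
    return 0;
  /* Byte object_len-1 is the last byte, so it lives in the last block. */
  last_lblkno = (object_len - 1) / EX_BLOCK_SIZE;
  assert(last_lblkno < nmapped);
  return map[last_lblkno];
}
```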

