The mapping module stores and retrieves the physical block numbers that
correspond to the logical blocks of an object. This module understands how to
parse the ptrs field of a nasd_od_node_t, as well as how to parse indirect
blocks.

Much like the Berkeley Fast Filesystem, the NASD drive's filesystem uses
varying levels of indirection to keep track of the actual data blocks of each
NASD object. Within the inode structure, there is a field called ptrs. This
field holds a number of pointers to blocks on disk. Some of these pointers are
direct pointers, that is, pointers to blocks whose contents are the contents
of a logical block of a NASD object. A direct pointer has type
nasd_od_direct_ptr_t. The blkno field of this structure is the physical block
number of the pointed-to block. Indirect pointers are pointers to blocks full
of other pointers. Indirect pointers have type nasd_od_indirect_ptr_t, which
likewise contains a blkno field that is the physical block number of the
pointed-to block.

The NASD drive supports multiple levels of indirection. The level of
indirection of a data block is the number of indirect blocks one must read,
starting from the inode block, to determine the block number of that data
block. A block at level zero is pointed to by a direct pointer stored within
the inode itself. The ptrs field of the inode is divided into a number of
regions equal to the number of levels of indirection (including zero)
supported by the drive. Each region contains a number of pointers. The first
region is a set of direct pointers to data blocks. The remaining regions hold
indirect pointers, which ultimately lead to data blocks at a level of
indirection equal to the region's position minus one. That is, the first
region points directly to data blocks, so it is level zero. The second region
points to indirect blocks which in turn point to data blocks: that is level
one. The third region points to indirect blocks which point to more indirect
blocks which in turn point to data blocks: that is level two, and so forth.

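
The region scheme above can be sketched as a small lookup that tells which
level of indirection covers a given logical block. This is only an
illustration: the names (ex_level_of, EX_ILVLS, EX_PTRS_PER_IBLOCK,
ex_region_ptrs) and all of the numbers are hypothetical, and it assumes that
logical blocks are assigned to the regions in order, lowest level first, which
the text implies but does not state.

```c
#include <stdint.h>

/* Hypothetical parameters, chosen only for illustration; the real values
 * come from nasd_od.h as generated by blockparam. */
#define EX_ILVLS 3            /* levels of indirection, including level zero */
#define EX_PTRS_PER_IBLOCK 4  /* pointers held by one indirect block */

/* Number of pointers the inode holds in each region (levels 0, 1, 2). */
static const uint64_t ex_region_ptrs[EX_ILVLS] = { 2, 2, 1 };

/*
 * Return the level of indirection that covers logical block lblkno,
 * or -1 if the block lies beyond what the inode can address.
 */
static int ex_level_of(uint64_t lblkno)
{
    uint64_t per_ptr = 1; /* data blocks reachable through one pointer */
    int lvl;

    for (lvl = 0; lvl < EX_ILVLS; lvl++) {
        uint64_t span = ex_region_ptrs[lvl] * per_ptr;

        if (lblkno < span)
            return lvl;
        lblkno -= span;                 /* skip past this region */
        per_ptr *= EX_PTRS_PER_IBLOCK;  /* next level fans out further */
    }
    return -1;
}
```

With these made-up numbers, level zero covers logical blocks 0 and 1, level
one covers blocks 2 through 9, and level two covers blocks 10 through 25.
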
The number of pointers in each of these regions is defined in nasd_od.h. The
portion of the code which defines these values, as well as many others derived
from them, is generated by a program called blockparam, which may be found in
the drive/ subdirectory of the NASD tree. The number of levels of indirection
is NASD_OD_ILVLS. The ptrs field of the inode is declared as an array of
unsigned char. The first portion of this array is coerced to
nasd_od_direct_ptr_t structures. The remaining portion is coerced to
nasd_od_indirect_ptr_t structures. The number of pointers at each level of
indirection is defined by the output of blockparam (the sizes of ptrs,
nasd_od_direct_ptr_t, and nasd_od_indirect_ptr_t, together with the number of
pointers in the inode at each level of indirection, form the input to
blockparam).

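
As a rough picture of how a raw byte array can be carved into a run of direct
pointers followed by a run of indirect pointers, consider the sketch below.
Every type, constant, and function name in it is a hypothetical stand-in, not
the definitions in nasd_od.h. It also copies each pointer out with memcpy()
rather than casting the array, which is a substitution made here to sidestep
alignment questions; the text above says the drive code coerces the array
directly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the real pointer types and counts. */
typedef struct { uint32_t blkno; } ex_direct_ptr_t;
typedef struct { uint32_t blkno; } ex_indirect_ptr_t;

#define EX_NDIRECT   2  /* direct pointers in the inode (made up) */
#define EX_NINDIRECT 3  /* indirect pointers in the inode (made up) */
#define EX_PTRS_SIZE (EX_NDIRECT * sizeof(ex_direct_ptr_t) + \
                      EX_NINDIRECT * sizeof(ex_indirect_ptr_t))

/* Hypothetical inode holding only the raw pointer bytes. */
typedef struct { unsigned char ptrs[EX_PTRS_SIZE]; } ex_node_t;

/* Read the i-th direct pointer out of the raw byte array. */
static ex_direct_ptr_t ex_get_direct(const ex_node_t *np, size_t i)
{
    ex_direct_ptr_t p;

    memcpy(&p, np->ptrs + i * sizeof(p), sizeof(p));
    return p;
}

/* Read the i-th indirect pointer, which follows all direct pointers. */
static ex_indirect_ptr_t ex_get_indirect(const ex_node_t *np, size_t i)
{
    ex_indirect_ptr_t p;
    size_t off = EX_NDIRECT * sizeof(ex_direct_ptr_t) + i * sizeof(p);

    memcpy(&p, np->ptrs + off, sizeof(p));
    return p;
}
```
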
Any time a blkno with a value of zero is encountered, it is treated as a
pointer to a virtual block whose contents are all zeroes. This enables
zero-fill for sparsely-written objects.

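
The zero-fill convention can be pictured with a toy block reader. The backing
array, the block size, and the function name below are all hypothetical; the
only behavior taken from the text is that a block number of zero yields a
block of zeroes instead of a disk access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_BLOCK_SIZE 16 /* hypothetical; far smaller than a real block */
#define EX_NBLOCKS 4     /* hypothetical size of the toy disk */

/* Hypothetical backing store: ex_disk[b] holds physical block b. */
static unsigned char ex_disk[EX_NBLOCKS][EX_BLOCK_SIZE];

/* Copy the block addressed by blkno into buf, zero-filling holes. */
static void ex_read_block(uint32_t blkno, unsigned char *buf)
{
    assert(blkno < EX_NBLOCKS);
    if (blkno == 0) {
        /* A zero block number names the virtual all-zero block. */
        memset(buf, 0, EX_BLOCK_SIZE);
        return;
    }
    memcpy(buf, ex_disk[blkno], EX_BLOCK_SIZE);
}
```
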
The mapping module itself is implemented in drive/nasd_bmap.c. This module
provides three key interfaces:

nasd_od_bmap()
The first argument to nasd_od_bmap(), ne, is a pointer to a cache block that
is the inode block of the object on which to perform the mapping operation.
The caller should hold both a reference on this cache block and its mutex. The
caller should not hold a lock on the object's partition. The second argument,
in_lblkno, is the logical block number (within the object) of the first block
to map. The third argument, in_lblkcnt, is the number of blocks to map.

The next arguments specify how far the bmap module is permitted to look for
physically contiguous blocks within the object before the beginning
(in_beforemax) and after the end (in_aftermax) of the requested range. The
number of blocks found is returned in *blocks_beforep and *blocks_afterp,
respectively. Physically contiguous blocks are blocks whose physical and
logical block numbers are adjacent and sequential with corresponding ordering.
For example, assume the caller makes a request with in_lblkno=N, in_lblkcnt=4,
in_beforemax=16, and in_aftermax=16. This is a request for the mapping of
logical blocks {N, N+1, N+2, N+3}. Further assume that these map to physical
blocks {M, M+1, K-1, K}. If block N-1 maps to block M-1, then block N-1 is
physically contiguous to block N. Note that the ordering requirement means
that if block N maps to block J-1 and block N-1 maps to block J, these are not
considered physically contiguous for the purposes of this mapping discovery.
In our example, *blocks_beforep may be 1, indicating that there is one block
before N which is physically contiguous. The value returned might be larger,
up to in_beforemax, depending upon how many such physically contiguous blocks
there are. The mapping module will not perform blocking operations to retrieve
this contiguity information, so the values returned in *blocks_beforep and
*blocks_afterp may be lower than the actual values. The intention of these
values is to simplify sequential readahead (or readbehind) within objects.

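
The contiguity rule can be made concrete with a sketch that works on a flat,
fully resolved table of logical-to-physical mappings. The table and the
function names are hypothetical, and the sketch assumes a zero entry (an
unallocated block) ends a contiguous run. The real module walks the inode and
indirect blocks and may stop early rather than block, so it can report fewer
blocks than this would.

```c
#include <stdint.h>

/*
 * map[i] is the physical block number of logical block i (0 = unallocated).
 * Count how many blocks immediately before logical block first are
 * physically contiguous with it, looking back at most beforemax blocks.
 */
static uint32_t ex_blocks_before(const uint32_t *map, uint32_t first,
                                 uint32_t beforemax)
{
    uint32_t n = 0;

    while (n < beforemax && n < first) {
        uint32_t cur = map[first - n];       /* block already accepted */
        uint32_t prev = map[first - n - 1];  /* candidate predecessor */

        /* Ordering matters: prev must sit exactly one block below cur. */
        if (prev == 0 || cur == 0 || prev + 1 != cur)
            break;
        n++;
    }
    return n;
}

/*
 * Count how many blocks immediately after logical block last are
 * physically contiguous with it, looking ahead at most aftermax blocks.
 * nblocks is the number of entries in map.
 */
static uint32_t ex_blocks_after(const uint32_t *map, uint32_t nblocks,
                                uint32_t last, uint32_t aftermax)
{
    uint32_t n = 0;

    while (n < aftermax && last + n + 1 < nblocks) {
        uint32_t cur = map[last + n];
        uint32_t next = map[last + n + 1];

        if (cur == 0 || next == 0 || cur + 1 != next)
            break;
        n++;
    }
    return n;
}
```

For a table in which logical blocks 1, 2, and 3 map to physical blocks 99,
100, and 101, a request starting at logical block 2 finds one contiguous block
before it. A pair mapped in reverse order, such as logical blocks 0 and 1 on
physical blocks 10 and 9, counts as not contiguous.
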
The next argument of nasd_od_bmap(), partnum, is the number of the partition
that the object represented by ne is a member of.

The next argument is a flags word indicating how the mapping operation should
behave. If NASD_ODC_B_ALIGN is set in this word, in_beforemax and in_aftermax
are treated as alignment masks, not counts of blocks. That is, if in_aftermax
is K and NASD_ODC_B_ALIGN is specified, nasd_od_bmap() will search past the
last mapped block for contiguous blocks until the physical block pointed to
crosses a K-aligned boundary (or, of course, a non-contiguous block is found).
If NASD_ODC_B_ALLOC is specified, nasd_od_bmap() must ensure that every target
block of the mapping exists (that is, does not have a block number of zero,
indicating a zero-fill block). If necessary, more blocks will be allocated,
and their block numbers will be stored in the inode or in indirect blocks as
appropriate. The mapping module calls upon the layout module to determine
which physical blocks will be assigned to the task.

The final argument, blocks_to_allocp, is used to return the number of new
blocks which need to be allocated to ensure that every mapped block exists.
By calling nasd_od_bmap() with a non-NULL blocks_to_allocp and without
specifying NASD_ODC_B_ALLOC, a caller may determine how many blocks would be
needed to complete a write request without actually allocating any blocks.

Mapping operations are performed by recursively descending through the
indirect blocks until the direct block numbers are located. This recursive
descent is performed by the internal function nasd_od_ibmap(), which is called
both by nasd_od_bmap() and by itself. Each call to nasd_od_ibmap() represents
a level of block pointers. This operation keeps track of how many zero-valued
block pointers it has encountered, so that nasd_od_bmap() may return this
value in *blocks_to_allocp and may know how many blocks must be allocated to
satisfy a NASD_ODC_B_ALLOC request. After completing this initial mapping
operation, if NASD_ODC_B_ALLOC is specified and not all target blocks exist,
nasd_od_bmap() calls nasd_od_layout_get_prealloc() to obtain blocks from the
range of blocks preallocated to this object. If not enough blocks can be
obtained that way, nasd_od_bmap() then calls nasd_od_layout_alloc_blocks() to
obtain enough blocks to complete the request. After that, it calls
nasd_odc_ref_ranges() to indicate that these blocks are now in use. Once this
is done, nasd_od_bmap() calls nasd_od_fbmap() to perform the actual mapping
operation. nasd_od_fbmap() performs a recursive operation similar to
nasd_od_ibmap(), except that it takes the newly-allocated blocks and inserts
them in the object mapping as indirect or direct blocks as necessary to ensure
that all mapped blocks exist.

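
The counting pass of the recursive descent can be sketched on a toy in-memory
"disk" of indirect blocks. Everything here is hypothetical: the fanout, the
array standing in for the disk, and the function names. It also assumes that a
missing indirect block counts as one block to allocate in addition to the data
blocks beneath it, which the text does not spell out. The sketch is meant only
to show the shape of the recursion, not the behavior of nasd_od_ibmap().

```c
#include <stdint.h>

#define EX_FANOUT 2   /* hypothetical pointers per indirect block */
#define EX_NBLOCKS 8  /* hypothetical size of the toy disk */

/* Toy disk: ex_iblk[b] is the pointer array stored in indirect block b. */
static uint32_t ex_iblk[EX_NBLOCKS][EX_FANOUT];

/* Data blocks reachable through one pointer at the given level. */
static uint64_t ex_span(int lvl)
{
    uint64_t s = 1;

    while (lvl-- > 0)
        s *= EX_FANOUT;
    return s;
}

/*
 * Count the blocks (data and indirect) that would have to be allocated so
 * that logical blocks [lo, hi) beneath pointer blkno all exist. Offsets are
 * relative to the first block this pointer covers. lvl is the pointer's
 * level of indirection; level zero means blkno names a data block.
 */
static uint64_t ex_count_missing(uint32_t blkno, int lvl,
                                 uint64_t lo, uint64_t hi)
{
    uint64_t child_span, missing = 0;
    int i;

    if (lo >= hi)
        return 0; /* nothing requested beneath this pointer */
    if (lvl == 0)
        return blkno == 0 ? 1 : 0;

    if (blkno == 0)
        missing = 1; /* the indirect block itself is absent */
    child_span = ex_span(lvl - 1);
    for (i = 0; i < EX_FANOUT; i++) {
        uint64_t base = (uint64_t)i * child_span;
        uint64_t clo = lo > base ? lo - base : 0;
        uint64_t chi = hi > base ? hi - base : 0;

        if (chi > child_span)
            chi = child_span;
        /* An absent indirect block has only zero-valued children. */
        missing += ex_count_missing(blkno ? ex_iblk[blkno][i] : 0,
                                    lvl - 1, clo, chi);
    }
    return missing;
}
```

A caller that only wants the count, in the spirit of passing blocks_to_allocp
without NASD_ODC_B_ALLOC, would run a pass like this and stop. An allocating
caller would follow it with a second descent that fills in the zero-valued
pointers.
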
nasd_od_bunmap()
nasd_od_bunmap() is used to deallocate blocks that were formerly used to store
an object. This function takes ne as a pointer to a cache entry representing
the inode block. As with nasd_od_bmap(), the caller should hold both a
reference on this cache block and its mutex, but not the partition lock. The
next two arguments, in_lblkno and in_lblkcnt, specify the first logical block
to deallocate and how many blocks to deallocate, respectively. The final
argument, partnum, is the number of the partition of which the object is a
member. nasd_od_bunmap() uses nasd_od_ibunmap() to recursively iterate through
the levels of indirection, zeroing out the pointers to deallocated blocks. It
also aggressively deallocates indirect blocks whose contents are entirely
zero. nasd_od_ibunmap() assembles a list of extents to deallocate, which is
then passed to nasd_odc_ref_ranges() with a delta of -1. This releases the
physical references on these blocks and deallocates them if their refcount
goes to zero.

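
The extent-and-delta idea can be illustrated with a toy version: gather the
freed block numbers into extents, then apply a delta to a per-block reference
count. The extent type (an inclusive range), the refcount array, and both
function names are hypothetical and are not the signatures of
nasd_od_ibunmap() or nasd_odc_ref_ranges().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent type: the inclusive range [first, last]. */
typedef struct { uint32_t first, last; } ex_extent_t;

/*
 * Coalesce freed block numbers into extents. Runs of consecutive numbers
 * become a single extent. Returns the number of extents written to out,
 * which must have room for nblk entries.
 */
static size_t ex_build_extents(const uint32_t *blk, size_t nblk,
                               ex_extent_t *out)
{
    size_t i, n = 0;

    for (i = 0; i < nblk; i++) {
        if (n > 0 && out[n - 1].last + 1 == blk[i]) {
            out[n - 1].last = blk[i];  /* extend the current run */
        } else {
            out[n].first = out[n].last = blk[i];
            n++;
        }
    }
    return n;
}

/*
 * Add delta to the refcount of every block in every extent. Returns how
 * many blocks reached a refcount of zero and so became free.
 */
static size_t ex_ref_ranges(int *refcnt, const ex_extent_t *ext,
                            size_t n_ext, int delta)
{
    size_t i, freed = 0;
    uint32_t b;

    for (i = 0; i < n_ext; i++) {
        for (b = ext[i].first; b <= ext[i].last; b++) {
            refcnt[b] += delta;
            if (refcnt[b] == 0)
                freed++;
        }
    }
    return freed;
}
```

Passing a delta of -1 drops one reference per block. A block that still has
other references keeps a nonzero count and is not freed.
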
nasd_od_bfind_last_block()
nasd_od_bfind_last_block() may be called on a cache entry representing the
inode of the object in question. Again, the caller should hold both a
reference on this cache block and its mutex; the state of the partition lock
does not matter. This function also takes the number of the partition to which
this object belongs as partnum, and the new logical length of the object as
object_len. The new last block number is stored in the last_block field of the
inode structure.