INTERNET-DRAFT                                             Brent Welch
Document: draft-welch-pnfs-ops-00.txt                      Panasas Inc.
Expires: April 2005                                        Benny Halevy
                                                           Panasas Inc.
                                                           David Black
                                                           EMC Corporation
                                                           Andy Adamson
                                                           CITI, University of Michigan
                                                           Dave Noveck
                                                           Network Appliance
                                                           October 2004

                         pNFS Operations Summary
Status of this Memo
By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed,
or will be disclosed, and any of which I become aware will be
disclosed, in accordance with RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note
that other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at
any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
This Internet-Draft provides a description of the pNFS extension
for NFSv4.
The key feature of the protocol extension is the ability for clients
to perform read and write operations that go directly from the
client to individual storage system elements without funneling
all such accesses through a single file server. Of course, the
file server must coordinate the client I/O so that the file system
retains its integrity.
The extension adds operations that query and manage layout
information that allows parallel I/O between clients and storage
system elements. Layouts are managed in a way similar to
delegations in that they have leases and can be recalled by the
server, but layout information is independent of delegations.
Table of Contents

1. Introduction
2. General Definitions
   2.1 Metadata
   2.2 Storage Device
   2.3 Storage Protocol
   2.4 Management Protocol
   2.5 Layout
3. Layouts and Aggregation
4. Security Information
   4.1 Object Storage Security
   4.2 File Security
   4.3 Block Security
5. pNFS Typed data structures
   5.1 pnfs_layoutclass4
   5.2 pnfs_deviceid4
   5.3 pnfs_devaddr4
   5.4 pnfs_devlist_item4
   5.5 pnfs_layouttype4
   5.6 pnfs_layout4
6. pNFS File Attributes
   6.1 pnfs_layoutclass4<> LAYOUT_CLASSES
   6.2 pnfs_layouttype4 LAYOUT_TYPE
   6.3 pnfs_layouttype4 LAYOUT_HINT
7. pNFS Error Definitions
8. pNFS Operations
   8.1 LAYOUTGET - Get Layout Information
   8.2 LAYOUTCOMMIT - Commit writes made using a layout
   8.3 LAYOUTRETURN - Release Layout Information
   8.4 GETDEVICEINFO - Get Device Information
   8.5 GETDEVICELIST - Get List of Devices
9. Callback Operations
   9.1 CB_LAYOUTRECALL
10. Usage Scenarios
   10.1 Basic Read Scenario
   10.2 Multiple Reads to a File
   10.3 Multiple Reads to a File with Delegations
   10.4 Read with existing writers
   10.5 Read with later conflict
   10.6 Basic Write Case
   10.7 Large Write Case
   10.8 Create with special layout
11. Layouts and Aggregation
   11.1 Simple Map
   11.2 Block Map
   11.3 Striped Map (RAID 0)
   11.4 Replicated Map
   11.5 Concatenated Map
   11.6 Nested Map
12. Issues
   12.1 Storage Protocol Negotiation
   12.2 Crash recovery
   12.3 Storage Errors
13. References
14. Acknowledgments
15. Authors' Addresses
16. Full Copyright Notice
1. Introduction
The pNFS extension to NFSv4 takes the form of new operations that
return data location information called a "layout". The layout
is protected by layout delegations. When a client has a layout
delegation, it has rights to access the data directly using
the location information in the layout. There are both read and
write layouts, and a layout may apply to only a sub-range of the
file's contents.
The layout delegations are managed in a similar fashion as NFSv4
data delegations (e.g., they are recallable and revocable), but they
are distinct abstractions and are manipulated with new operations
as described below. To avoid any confusion between the existing
NFSv4 data delegations and layout delegations, the term "layout"
implies "layout delegation".
There are new attributes that describe general layout
characteristics. However, attributes do not provide all we need
to support layouts, hence the use of operations instead.
Finally, there are issues about how layout delegations interact
with the existing NFSv4 abstractions of data delegations and byte
range locking. These issues (and more) are also discussed here.
2. General Definitions
This protocol extension partitions the file system protocol into
two parts, the control path and the data path. The control path is
implemented by the extended (p)NFSv4 file server, while the data
path may be implemented by direct communication between the file
system client and the storage devices. This leads to a few new
terms used to describe the protocol extension.
2.1 Metadata
This is information about a file, such as its name, owner, and where
it is stored. The information is managed by the File server
(sometimes called the metadata manager). Metadata also includes
lower-level information like block addresses and indirect block
pointers. Depending on the storage protocol, block-level metadata
may be managed by the File server itself, or instead by Object
Storage Devices or other File servers acting as Storage Devices.
2.2 Storage Device
This is a device, or server, that controls the file's data, but
leaves other metadata management up to the file server (i.e.,
metadata manager). A Storage Device could be another NFS server,
or an Object Storage Device (OSD) or a block device accessed over a
SAN (either Fibre Channel or iSCSI). The goal of this extension
is to allow direct communication between clients and storage devices.
2.3 Storage Protocol
This is the protocol between the client and the storage device
used to access the file data. There are three primary types:
file protocols (such as NFSv4 or NFSv3), object protocols (OSD),
and block protocols (SCSI-block commands, or "SBC"). These protocols
are in turn layered over transport protocols such as RPC/TCP/IP or
iSCSI/TCP/IP or FC/SCSI. We anticipate there will be variations on
these storage protocols, including new protocols that are unknown
at this time or experimental in nature. The details of the storage
protocols will be described in other documents so that pNFS clients
can be written to use these storage protocols.
2.4 Management Protocol
This is the protocol between the File server and the Storage devices.
This protocol is outside the scope of this draft, and is used
for various management activities that include storage allocation
and deallocation. For example, the regular NFSv4 OPEN operation
is used to create a new file. This is applied to the File Server,
which in turn uses the management protocol to allocate storage on
the storage devices. The file server returns a layout for the
new file that the client uses to access the new file directly.
The management protocol could be entirely private to the File server
and Storage devices, and need not be published in order to implement
a pNFS client that uses the associated Storage protocol.
2.5 Layout
(Also, "map") A layout defines how a file's data is organized on one
or more storage devices. There are many possible layout types. They
vary in the storage protocol used to access the data, and in the
aggregation scheme that lays out the file data on the underlying
storage devices. Layouts are described in more detail below.
3. Layouts and Aggregation
The layout, or "map", is a typed data structure that has variants
to handle different storage protocols (block, object, and file).
A layout describes a range of a file's contents. For example,
a block layout might be an array of tuples that store (deviceID,
block_number, block count) along with information about block size
and the file offset of the first block. An object layout is an
array of tuples (deviceID, objectID) and an additional structure
(i.e., the aggregation map) that defines how the logical byte
sequence of the file data is serialized into the different objects.
A file layout is an array of tuples (deviceID, file_handle), along
with a similar aggregation map.
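
To make those shapes concrete, the layout bodies for the three
classes might be declared along the following lines. This is a
sketch only: the type and field names are illustrative, and the
normative definitions belong to the adjunct layout specifications.

   #include <stdint.h>

   typedef uint32_t pnfs_deviceid4;               /* see section 5.2 */
   typedef struct { uint8_t data[128]; } nfs_fh4; /* stand-in filehandle */

   /* Block layout entry: a run of contiguous blocks on one device. */
   struct example_block_extent {
       pnfs_deviceid4 dev;
       uint64_t       block_number;  /* first block of the extent */
       uint32_t       block_count;   /* consecutive blocks in the run */
   };

   /* Object layout entry: one component object; an aggregation map
    * (not shown) says how file bytes are serialized into objects. */
   struct example_object_entry {
       pnfs_deviceid4 dev;
       uint64_t       object_id;
   };

   /* File layout entry: a filehandle on a data server, again paired
    * with an aggregation map. */
   struct example_file_entry {
       pnfs_deviceid4 dev;
       nfs_fh4        fh;
   };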
The deviceID is a short name for a storage device. In practice, a
significant amount of information may be required to fully identify
a storage device. Instead of embedding all that information in
a layout, a level of indirection is used. Layouts embed device
Ids, and a new op (GETDEVICEINFO) is used to retrieve the complete
identity information about the storage device. For example, the
identity of a file server or object server could be an IP address
and port. The identity of a block device could be a volume label.
Due to multipath connectivity in a SAN environment, agreement on a
volume label is considered the reliable way to locate a particular
storage device.
Aggregation schemes can describe layouts like simple one-to-one
mapping, concatenation, and striping. A general aggregation
scheme allows nested maps so that more complex layouts can be
compactly described. The canonical aggregation type for this
extension is striping, which allows a client to access storage
devices in parallel. Even a one-to-one mapping is useful for
a file server that wishes to distribute its load among a set of
other file servers. There are also experimental aggregation types
such as writable mirrors and RAID; however, these are outside the
scope of this document.
The file server is in control of the layout for a file, but the
client can provide hints to the server when a file is opened or
created about preferred layout parameters. The pNFS extension
introduces a LAYOUT_HINT attribute that the client can query at
any time, and can set with a compound SETATTR after OPEN to provide
a hint to the server for new files.
While not completely specified in this summary, there must be
adjunct specifications that precisely define layout formats to allow
interoperability among clients and metadata servers. The point is
that the metadata server will give out layouts of a particular class
(block, object, or file) and aggregation, and the client needs to
select a "layout driver" that understands how to use that layout.
The API used by the client to talk to its drivers is outside the
scope of the pNFS extension, but is an important notion to keep in
mind when thinking about this work. The storage protocol between
the client's layout driver and the actual storage is covered by
other protocols such as SBC (block storage), OSD (object storage)
or NFS (file storage).
4. Security Information
All existing NFS security mechanisms apply to the operations added by
this extension. However, this extension is used in conjunction with
other storage protocols for client to storage access. Each storage
protocol introduces its own security constraints. Clients may need
security information in order to complete direct data access. The
rest of this section gives an overview of the security schemes used
by different storage protocols. However, the details are outside the
scope of this protocol extension and private to the storage protocol.
We assume only that the file server returns security tokens to the
client, which uses them when accessing storage. The file server
performs permission checking before issuing the security tokens.
4.1 Object Storage Security
The object storage protocol relies on a cryptographically secure
capability to control accesses at the object storage devices.
Capabilities are generated by the metadata server, returned to the
client, and passed to the object storage device, which verifies
that the capability allows the requested operation.
Each capability is specific to a particular object, an operation
on that object, and a byte range within the object, and it has an
explicit expiration time. The capabilities are signed with a secret key
that is shared by the object storage devices (OSD) and the metadata
managers. Typically each OSD has a set of master keys and working
keys, and the working keys are rotated periodically under the
control of the metadata manager. Clients do not have device keys
so they are unable to forge capabilities. Capabilities need to
be protected from snooping, which can be done by using facilities
such as IPsec to create a secure VPN that contains the clients,
the file server, and the storage devices.
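
As a sketch of this capability scheme, assuming an HMAC-SHA1
construction over an illustrative field layout (the real OSD
security protocol defines its own encoding, and a real
implementation would MAC a canonical serialization rather than a
raw in-memory struct):

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>
   #include <openssl/evp.h>
   #include <openssl/hmac.h>

   /* Illustrative capability layout. */
   struct osd_cap {
       uint64_t object_id;      /* object the capability covers */
       uint32_t allowed_ops;    /* bitmask of permitted operations */
       uint64_t offset, length; /* byte range within the object */
       uint64_t expiry;         /* explicit expiration time */
       uint8_t  mac[20];        /* HMAC-SHA1 over the fields above */
   };

   /* Metadata server: sign with the working key shared with the OSD. */
   static void cap_sign(struct osd_cap *cap, const uint8_t *key, int keylen)
   {
       unsigned int maclen = sizeof cap->mac;
       HMAC(EVP_sha1(), key, keylen, (const unsigned char *)cap,
            offsetof(struct osd_cap, mac), cap->mac, &maclen);
   }

   /* OSD: recompute the MAC; clients lack the key, so they cannot
    * forge capabilities.  Returns nonzero iff the MAC is authentic. */
   static int cap_verify(const struct osd_cap *cap,
                         const uint8_t *key, int keylen)
   {
       uint8_t expect[20];
       unsigned int maclen = sizeof expect;
       HMAC(EVP_sha1(), key, keylen, (const unsigned char *)cap,
            offsetof(struct osd_cap, mac), expect, &maclen);
       return memcmp(expect, cap->mac, sizeof expect) == 0;
   }

Before executing a request, the OSD would additionally check
allowed_ops, the byte range, and expiry against the request.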
4.2 File Security
The file storage protocol has the same security mechanism between
the client and metadata server as between the client and data server.
This implies that the files that store the data need the same ACL as
the metadata file that represents the "control point" for the file.
This ensures that access control decisions are consistent between
the metadata server and the data server.
One alternative that was briefly discussed was the introduction
of special file handles that essentially have the properties of
capabilities so they can be generated by the metadata servers and
checked by the data servers. (Peter Corbett described "one shot"
file handles.) To be effective, these need all the properties of a
capability so the data server can efficiently and securely enforce
the access control decisions made by the metadata manager.
[We need to elaborate on this section. We should be able to
leverage the NFSv4 GSS context between the client and the NFSv4
"Storage Devices".]
4.3 Block Security
The block model relies on SAN-based security, and trusts that
clients will only access the blocks they have been directed to use.
In these systems, there may not need to be any additional security
information returned with the map. There are LUN masking/unmapping
and zone-based security schemes that can be manipulated to fence
clients from each other's data. These are fairly heavyweight
operations that are not expected to be part of the normal execution
path for pNFS, but a metadata server can always fall back to these
mechanisms if it needs to prevent a client from accessing storage
(i.e., "fence the client").
5. pNFS Typed data structures
5.1 pnfs_layoutclass4
typedef uint16_t pnfs_layoutclass4;
A layout class specifies a family of layout types. The implication
is that clients have "layout drivers" for one or more layout classes.
The file server advertises the layout classes it supports through
the LAYOUT_CLASSES file system attribute. A client asks for layouts
of a particular class in LAYOUTGET, and passes those layouts to its
layout driver. A layout is further typed by a pnfs_layouttype4
that identifies a particular layout in the family of layouts of
that class. Custom installations should be allowed to introduce
new layout classes.
[There is an IANA issue here for the initial set of well known
layout classes. There should also be a reserved range for custom
layout classes used in local installations.]
5.2 pnfs_deviceid4
typedef uint32_t pnfs_deviceid4; /* 32-bit device ID */
Layout information includes device IDs that specify a data server
with a compact handle. Addressing and type information is obtained
with the GETDEVICEINFO operation.
5.3 pnfs_devaddr4
struct pnfs_devaddr4 {
uint16_t type;
string r_netid<>; /* network ID */
string r_addr<>; /* Universal address */
};
This value is used to set up a communication channel with the
storage device. For now we borrow the structure of a clientaddr4,
and assume we will be able to specify SAN devices as well as TCP/IP
devices using this format. The type is used to distinguish between
known types.
[TODO: we need an enum of known device address types. These include
IP+port for file servers and object storage devices. There may be
several types for different variants on SAN volume labels.
Do we need a concrete definition of volume labels for
SAN block devices? We have discussed a scheme where the volume
label is defined as a set of tuples <offset, length, value> that
allow matching on the initial contents of a SAN volume in order to
determine equality. If we do this, is this type a discriminated
union with a fixed number of branches? One type would be an IP/port
combination for an NFS or iSCSI device. Another type would be this
volume label specification.]
5.4 pnfs_devlist_item4
struct pnfs_devlist_item4 {
pnfs_deviceid4 id;
pnfs_devaddr4 addr;
};
An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system.
5.5 pnfs_layouttype4
struct pnfs_layouttype4 {
pnfs_layoutclass4 class;
uint16_t type;
};
The protocol extension enumerates known layout types and their
structure. Additional layout types may be added later. To allow
for graceful extension of layout types, the type is broken into
two fields.
[TODO: We should chart out the major layout classes and
representative instances of them, then indicate how new layout
classes can be introduced. Alternatively, we can put these
definitions into the document that specifies the storage protocol.]
5.6 pnfs_layout4
union pnfs_layout4 switch (pnfs_layouttype4 type) {
default:
opaque layout_data<>;
};
This opaque type defines a layout. As noted, we need to flesh out
this union with a number of "blessed" layouts for different storage
protocols and aggregation types.
6. pNFS File Attributes
6.1 pnfs_layoutclass4<> LAYOUT_CLASSES
This attribute applies to a file system and indicates what layout
classes are supported by the file system. We expect this attribute
to be queried when a client encounters a new fsid. This attribute is
used by the client to determine if it has applicable layout drivers.
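
For instance, a client might intersect that attribute with its
registered drivers roughly as follows. This is a sketch: the driver
table is hypothetical, and the class numbers are invented for the
example, since the well-known class values remain an open IANA issue.

   #include <stddef.h>
   #include <stdint.h>

   typedef uint16_t pnfs_layoutclass4;

   struct layout_driver {
       pnfs_layoutclass4 class;  /* layout class the driver understands */
       const char       *name;
   };

   /* Hypothetical drivers compiled into this client. */
   static const struct layout_driver drivers[] = {
       { 1, "file" }, { 2, "object" }, { 3, "block" },
   };

   /* Return the first local driver whose class appears in the
    * LAYOUT_CLASSES attribute fetched for a new fsid, or NULL. */
   static const struct layout_driver *
   select_driver(const pnfs_layoutclass4 *classes, size_t nclasses)
   {
       for (size_t i = 0; i < sizeof drivers / sizeof drivers[0]; i++)
           for (size_t j = 0; j < nclasses; j++)
               if (drivers[i].class == classes[j])
                   return &drivers[i];
       return NULL;
   }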
6.2 pnfs_layouttype4 LAYOUT_TYPE
This attribute indicates the particular layout type used for a file.
This is for informational purposes only. The client needs to use
the LAYOUTGET operation to get enough information (e.g., specific
device information) to perform I/O.
6.3 pnfs_layouttype4 LAYOUT_HINT
This attribute is set on newly created files to influence the file
server's choice for the file's layout.
7. pNFS Error Definitions
NFS4ERR_LAYOUTUNAVAILABLE Layouts are not available
for the file or its containing file system.
NFS4ERR_LAYOUTTRYLATER Layouts are temporarily
unavailable for the file, client should retry later.
8. pNFS Operations
8.1 LAYOUTGET - Get Layout Information
SYNOPSIS
(cfh), layout_class, iomode, sharemode, offset, length ->
layout_stateid, layout
ARGUMENT
enum layoutget_iomode4 {
LAYOUTGET_READ = 1,
LAYOUTGET_WRITE = 2,
LAYOUTGET_RW = 3
};
enum layoutget_sharemode4 {
LAYOUTGET_SHARED = 1,
LAYOUTGET_EXCLUSIVE = 2
};
struct LAYOUTGET4args {
/* CURRENT_FH: file */
pnfs_layoutclass4 layout_class;
layoutget_iomode4 iomode;
layoutget_sharemode4 sharemode;
offset4 offset;
length4 length;
};
RESULT
struct LAYOUTGET4resok {
stateid4 layout_stateid;
pnfs_layout4 layout;
};
union LAYOUTGET4res switch (nfsstat4 status) {
case NFS4_OK:
LAYOUTGET4resok resok4;
default:
void;
};
DESCRIPTION
Requests a layout for reading or writing the file given by the
filehandle at the byte range given by offset and length. The client
requests either a shared or exclusive sharing mode for the layout
to indicate whether it provides its own synchronization mechanism.
A shared layout allows cooperating clients to perform direct I/O
using a layout that potentially conflicts with other clients.
The clients are asserting that they are aware of this issue and
can coordinate via an external mechanism (e.g., NFSv4 advisory
locks or an MPI-IO toolkit). An exclusive layout means that
the client wants the server to prevent other clients from making
conflicting changes to the part of the file covered by the layout.
An exclusive read layout, for example, would not be granted while
there was an outstanding write layout overlapping the range.
Multiple exclusive read layouts can be given out for the
same file range. An exclusive write layout can only be given out
if there are no other outstanding layouts for the specified range.
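
A server's grant decision can be pictured as a scan of the layouts
already outstanding on the file. This is a sketch: the bookkeeping
structure is hypothetical, and a real server must also handle the
all-ones "to EOF" length described below.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   enum iomode    { LAYOUTGET_READ = 1, LAYOUTGET_WRITE = 2,
                    LAYOUTGET_RW = 3 };
   enum sharemode { LAYOUTGET_SHARED = 1, LAYOUTGET_EXCLUSIVE = 2 };

   struct granted_layout {               /* one outstanding layout */
       uint64_t       offset, length;
       enum iomode    iomode;
       enum sharemode sharemode;
   };

   static bool overlaps(const struct granted_layout *g,
                        uint64_t offset, uint64_t length)
   {
       return offset < g->offset + g->length &&
              g->offset < offset + length;
   }

   /* Shared requests are always grantable (clients coordinate
    * externally).  An exclusive read is refused if an overlapping
    * writer exists; an exclusive write is refused if any layout
    * overlaps the range. */
   static bool grantable(const struct granted_layout *g, size_t n,
                         uint64_t offset, uint64_t length,
                         enum iomode iomode, enum sharemode sharemode)
   {
       if (sharemode == LAYOUTGET_SHARED)
           return true;
       for (size_t i = 0; i < n; i++) {
           if (!overlaps(&g[i], offset, length))
               continue;
           if (iomode != LAYOUTGET_READ ||
               g[i].iomode != LAYOUTGET_READ)
               return false;
       }
       return true;
   }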
Issue - there is some debate about the default value for sharemode
in client implementations. One view is that the safest scheme is
to require applications to request shared layouts explicitly via,
e.g., an ioctl() operation. Another view is that shared layouts
during concurrent access carry the same risks and guarantees that
NFS does today (i.e., only open-to-close sharing semantics)
and that applications "know" they should use advisory locking to
serialize access when they anticipate sharing. By specifying the
sharemode in the protocol, we support both points of view.
The LAYOUTGET operation returns layout information for the specified
byte range. To get a layout from a specific offset through the
end-of-file (no matter how long the file actually is), use a length
field with all bits set to 1 (one). If the length is zero, or if
a length that is not all ones is specified and, when added to the
offset, exceeds the maximum 64-bit unsigned integer value, the
error NFS4ERR_INVAL will result.
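
Stated as code, with a length of all ones standing for "through
end-of-file" (a sketch; NFS4ERR_INVAL carries its NFSv4 value of 22):

   #include <stdint.h>

   enum { NFS4_OK = 0, NFS4ERR_INVAL = 22 };  /* values from NFSv4 */

   /* Validate a LAYOUTGET (offset, length) pair. */
   static int check_layout_range(uint64_t offset, uint64_t length)
   {
       if (length == UINT64_MAX)
           return NFS4_OK;                /* offset through end-of-file */
       if (length == 0)
           return NFS4ERR_INVAL;
       if (offset > UINT64_MAX - length)  /* offset + length overflows */
           return NFS4ERR_INVAL;
       return NFS4_OK;
   }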
The format of the returned layout is specific to the underlying
file system and is specified outside of this document.
If layouts are not supported for the requested file or its containing
filesystem the server should return NFS4ERR_LAYOUTUNAVAILABLE.
If layout for the file is unavailable due to transient conditions,
e.g. file sharing prohibits layouts, the server should return
NFS4ERR_LAYOUTTRYLATER.
On success, the current filehandle retains its value.
IMPLEMENTATION
Typically, LAYOUTGET will be called as part of a compound RPC
after an OPEN operation and results in the client having location
information for the file. The client specifies a layout class that
limits what kind of layout the server will return. This prevents
servers from issuing layouts that are unusable by the client.
ERRORS
NFS4ERR_INVAL
NFS4ERR_NOTSUPP
NFS4ERR_LAYOUTUNAVAILABLE
NFS4ERR_LAYOUTTRYLATER
TBD
8.2 LAYOUTCOMMIT - Commit writes made using a layout
SYNOPSIS
(cfh), layout_stateid, offset, length, neweof, newlayout ->
layout_stateid
ARGUMENT
union neweof4 switch (bool eofchanged) {
case TRUE:
length4 eof;
case FALSE:
void;
};
struct LAYOUTCOMMIT4args {
/* CURRENT_FH: file */
stateid4 layout_stateid;
neweof4 neweof;
offset4 offset;
length4 length;
opaque newlayout<>;
};
RESULT
struct LAYOUTCOMMIT4resok {
stateid4 layout_stateid;
};
union LAYOUTCOMMIT4res switch (nfsstat4 status) {
case NFS4_OK:
LAYOUTCOMMIT4resok resok4;
default:
void;
};
DESCRIPTION
Commit changes in the layout represented by the current filehandle
and stateid.
The LAYOUTCOMMIT operation indicates that the client has completed
writes using a layout obtained by a previous LAYOUTGET. The client
may have only written a subset of the data range it previously
requested. LAYOUTCOMMIT allows it to commit or discard provisionally
allocated space and to update the server with a new end of file.
The layout argument to LAYOUTCOMMIT describes what regions have been
used and what regions can be deallocated. The resulting layout is
still valid after LAYOUTCOMMIT and can be referenced by the returned
stateid for future operations.
The layout information is more verbose for block devices than
for objects and files because the latter hide the details of block
allocation behind their storage protocols. At a minimum, the client
needs to communicate changes to the end-of-file location back to
the server, along with its view of the file modify and access times.
For blocks, it needs to specify precisely which blocks have been used.
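
For example, a client that may have extended the file could build
the neweof argument this way (a sketch mirroring the XDR union above):

   #include <stdbool.h>
   #include <stdint.h>

   struct neweof4 {            /* C rendering of the XDR union */
       bool     eofchanged;
       uint64_t eof;           /* meaningful only when eofchanged */
   };

   /* After writing [offset, offset+length) through the layout, report
    * a new end-of-file only if the writes extended the file. */
   static struct neweof4
   make_neweof(uint64_t offset, uint64_t length, uint64_t old_eof)
   {
       struct neweof4 ne = { false, 0 };
       if (offset + length > old_eof) {   /* range validated earlier */
           ne.eofchanged = true;
           ne.eof = offset + length;
       }
       return ne;
   }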
The client may use a SETATTR operation in a compound right after
LAYOUTCOMMIT in order to set the access and modify times of the file.
Alternatively, the server could use the time of the LAYOUTCOMMIT
operation as the file modify time.
On success, the current filehandle retains its value.
ERRORS
TBD
8.3 LAYOUTRETURN - Release Layout Information
SYNOPSIS
(cfh), layout_stateid ->
ARGUMENT
struct LAYOUTRETURN4args {
/* CURRENT_FH: file */
stateid4 layout_stateid;
};
RESULT
struct LAYOUTRETURN4res {
nfsstat4 status;
};
DESCRIPTION
Returns the layout represented by the current filehandle and
layout_stateid. After this call, the client must not use the layout
and the associated storage protocol to access the file data. Before
it can do that, it must get a new layout delegation with LAYOUTGET.
Layouts may be returned when recalled, or voluntarily (i.e.,
before the server has recalled them). In either case the client
must properly propagate any state changed under the layout to
storage or to the server before returning the layout.
On success, the current filehandle retains its value.
If a client fails to return a layout in a timely manner, then the
File server should use its management protocol with the storage
devices to fence the client from accessing the data referenced by
the layout.
[TODO: We need to work out how clients return error information if
they encounter problems with storage. We could return a single
OK bit, or we could return more extensive information from the
layout driver that describes the error condition in more detail.
It seems like we need an opaque "layout_error" type that is defined
by the storage protocol along with its layout types.]
ERRORS
TBD
8.4 GETDEVICEINFO - Get Device Information
SYNOPSIS
(cfh), device_id -> device_addr
ARGUMENT
struct GETDEVICEINFO4args {
pnfs_deviceid4 device_id;
};
RESULT
struct GETDEVICEINFO4resok {
pnfs_devaddr4 device_addr;
};
union GETDEVICEINFO4res switch (nfsstat4 status) {
case NFS4_OK:
GETDEVICEINFO4resok resok4;
default:
void;
};
DESCRIPTION
Returns device type and device address information for a specified
device. The returned device_addr includes a type that indicates
how to interpret the addressing information for that device. [TODO:
or, it is a discriminated union.] At this time we expect two main
kinds of device addresses, either IP address and port numbers,
or SCSI volume identifiers. The final protocol specification will
detail the allowed values for device_type and the format of their
associated location information.
Note that address information for a deviceID may change dynamically
due to various system reconfiguration events. Clients may get
errors on their storage protocol that cause them to query the
metadata server with GETDEVICEINFO and refresh their information
about a device.
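
A client-side device cache might handle this as follows (a sketch;
the cache entry and the GETDEVICEINFO stub are hypothetical):

   #include <stdbool.h>
   #include <stdint.h>

   typedef uint32_t pnfs_deviceid4;

   struct dev_entry {
       pnfs_deviceid4 id;          /* short device name from the layout */
       bool           valid;
       char           r_netid[16]; /* e.g., "tcp" */
       char           r_addr[64];  /* universal address */
   };

   /* Hypothetical stub: one GETDEVICEINFO round trip to the server. */
   extern int getdeviceinfo(pnfs_deviceid4 id, struct dev_entry *out);

   /* On a storage-protocol addressing error, drop the cached address
    * and re-fetch it; the server may have reconfigured its paths. */
   static int refresh_device(struct dev_entry *e)
   {
       e->valid = false;
       int status = getdeviceinfo(e->id, e);
       if (status == 0)
           e->valid = true;
       return status;
   }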
8.5 GETDEVICELIST - Get List of Devices
SYNOPSIS
(cfh) -> device_addr_list<>
ARGUMENT
/* Current file handle */
RESULT
struct GETDEVICELIST4resok {
pnfs_devlist_item4 device_addr_list<>;
};
union GETDEVICELIST4res switch (nfsstat4 status) {
case NFS4_OK:
GETDEVICELIST4resok resok4;
default:
void;
};
DESCRIPTION
In some applications, especially SAN environments, it is convenient
to find out about all the devices associated with a file system.
This lets a client determine if it has access to these devices,
e.g., at mount time. This operation returns a list of items that
establish the association between the short pnfs_deviceid4 and the
addressing information for that device.
9. Callback Operations
9.1 CB_LAYOUTRECALL
SYNOPSIS
stateid, fh ->
ARGUMENT
struct CB_LAYOUTRECALLargs {
stateid4 stateid;
nfs_fh4 fh;
};
RESULT
struct CB_LAYOUTRECALLres {
nfsstat4 status;
};
DESCRIPTION
The CB_LAYOUTRECALL operation is used to begin the process of
recalling a layout and returning it to the server.
If the handle specified is not one for which the client holds a
layout, an NFS4ERR_BADHANDLE error is returned.
If the stateid specified is not one corresponding to a valid layout
for the file specified by the filehandle, an NFS4ERR_BAD_STATEID
is returned.
Issue: We have debated adding another kind of callback to push new
EOF information to the client. It may not be necessary; the client
could discover that by polling for attributes.
IMPLEMENTATION
The client should reply to the callback immediately. Replying does
not complete the recall except when an error was returned. The recall
is not complete until the layout is returned using a LAYOUTRETURN.
The client should complete any in-flight I/O operations using
the recalled layout before returning it via LAYOUTRETURN. If the
client has buffered dirty data, it may choose to write it directly
to storage before calling LAYOUTRETURN, or to write it later using
normal NFSv4 WRITE operations.
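
As client pseudocode, the sequence above might look like the
following sketch; the helpers are hypothetical and hide the actual
RPC and I/O machinery.

   #include <stdbool.h>

   extern void reply_to_callback(void);         /* answer CB_LAYOUTRECALL */
   extern void drain_inflight_io(void);         /* finish direct I/O      */
   extern void flush_dirty_direct(void);        /* write via the layout   */
   extern void queue_dirty_for_nfs_write(void); /* or normal NFSv4 WRITEs */
   extern int  layoutreturn(void);              /* LAYOUTRETURN to server */

   static void handle_cb_layoutrecall(bool have_dirty, bool write_direct)
   {
       reply_to_callback();       /* reply immediately; the recall is
                                     not complete until LAYOUTRETURN */
       drain_inflight_io();       /* complete I/O using the old layout */
       if (have_dirty) {
           if (write_direct)
               flush_dirty_direct();
           else
               queue_dirty_for_nfs_write();
       }
       layoutreturn();            /* completes the recall */
   }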
ERRORS
NFS4ERR_BADHANDLE
NFS4ERR_BAD_STATEID
TBD
10. Usage Scenarios
This section has a description of common open, close, read, write
interactions and how those work with layout delegations. [TODO:
this section feels rough and I'm not sure it adds value in its
present form.]
10.1 Basic Read Scenario
Client does an OPEN to get a file handle.
Client does a LAYOUTGET for a range of the file, gets back a layout.
Client uses the storage protocol and the layout to access the file.
Client returns the layout with LAYOUTRETURN
Client closes the open stateID with CLOSE.
This is rather boring as the client is careful to clean up all server
state after only a single use of the file.
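
Rendered as client pseudocode (the stubs are hypothetical; in
practice OPEN+LAYOUTGET and LAYOUTRETURN+CLOSE would each travel in
a single COMPOUND):

   #include <stdint.h>

   extern int open_file(const char *name);           /* OPEN         */
   extern int layoutget(uint64_t off, uint64_t len); /* LAYOUTGET    */
   extern int storage_read(uint64_t off, void *buf,
                           uint64_t len);            /* direct I/O   */
   extern int layoutreturn(void);                    /* LAYOUTRETURN */
   extern int close_file(void);                      /* CLOSE        */

   static int basic_read(const char *name, void *buf, uint64_t len)
   {
       if (open_file(name) != 0)   return -1;
       if (layoutget(0, len) != 0) return -1;  /* layout for [0, len) */
       storage_read(0, buf, len);  /* bypasses the file server */
       layoutreturn();             /* return the layout...     */
       return close_file();        /* ...and clean up all state */
   }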
10.2 Multiple Reads to a File
Client does an OPEN to get a file handle.
Client does a LAYOUTGET for a range of the file, gets back a layout.
Client uses the storage protocol and the layout to access the file.
Client closes the stateID with CLOSE.
Client does an OPEN to get a file handle.
Client finds cached layout associated with file handle.
Client uses the storage protocol and the layout to access the file.
Client closes the stateID with CLOSE.
A bit more interesting as we've saved the LAYOUTGET operation, but
we are still doing server round-trips.
10.3 Multiple Reads to a File with Delegations
Client does an OPEN to get a file handle and an open delegation.
Client does a LAYOUTGET for a range of the file, gets back a layout.
Client uses the storage protocol and the layout to access the file.
Application does a close(), but client keeps state under the
delegation.
(time passes)
Application does another open(), which client handles under the
delegation.
Client finds cached layout associated with file handle.
Client uses the storage protocol and the layout to access the file.
(pattern continues until open delegation and/or layout is recalled)
This illustrates the efficiency of combining open delegations and
layouts to eliminate interactions with the file server altogether.
Of course, we assume the client's operating system is only allowing
the local open() to succeed based on the file permissions. The use
of layouts does not change anything about the semantics of open
delegations.
10.4 Read with existing writers
NOTE: This scenario was under some debate, but we have resolved
that the server is able to give out overlapping/conflicting layout
information to different clients. In these cases we assume
that clients are using an external mechanism such as MPI-IO to
synchronize and serialize access to shared data. One can argue that
even unsynchronized clients get the same open-to-close consistency
semantics as NFS already provides, even when going direct to storage.
Client does an OPEN to get an open stateID. The file is open for
writing elsewhere by different clients, so no open delegation is
returned.
Client does a LAYOUTGET and gets a layout from the server.
Client either synchronizes with the writers, or not, and accesses data
via the layout and storage protocol. There are no guarantees about
when data that is written by the writer is visible to the reader.
Once the writer has closed the file and flushed updates to storage,
then they are visible to the client.
[TODO: we really aren't explaining the sharemode field here.]
10.5 Read with later conflict
ClientA does an OPEN to get an open stateID and open delegation.
ClientA does a LAYOUTGET for a range of the file, gets back a map
and layout stateid.
ClientA uses the storage protocol to access the file data.
ClientB opens the file for WRITE
File server issues CB_RECALL to ClientA
ClientA issues DELEGRETURN
ClientA continues to use the storage protocol to access file data.
If it is accessing data from its cache, it will periodically
check that its data is still up-to-date because it has no open
delegation. [This is an odd scenario that mixes in open delegations
for no real value. Basically this is a "regular writer" being mixed
with a pNFS reader. I guess this example shows that no particular
semantics are provided during the simultaneous access. If the server
so chose, it could also recall the layout with CB_LAYOUTRECALL to
force the different clients to serialize at the file server.]
10.6 Basic Write Case
Client does an OPEN to get a file handle.
Client does a LAYOUTGET for a range of the file, gets back a layout
and layout stateid.
Client writes to the file using the storage protocol.
Client uses LAYOUTCOMMIT to communicate new EOF position.
Client does SETATTR to update timestamps
Client does a LAYOUTRETURN
Client does a CLOSE
Again, the boring case where the client cleans up all of its server
state by returning the layout.
10.7 Large Write Case
Client does an OPEN to get a file handle.
(loop)
Client does a LAYOUTGET for a range of the file, gets back a layout
and layout stateid.
Client writes to the file using the storage protocol.
Client fills up the range covered by the layout.
Client updates the server with LAYOUTCOMMIT, communicating the new
EOF position.
Client does SETATTR to update timestamps.
Client releases the layout with LAYOUTRETURN
(end loop)
Client does a CLOSE
10.8 Create with special layout
Client does an OPEN and a SETATTR that specifies a particular layout
type using the LAYOUT_HINT attribute.
Client gets back an open stateID and file handle.
(etc)
11. Layouts and Aggregation
This section describes several layout formats in a semi-formal way
to provide context for the layout delegations. These definitions
will be formalized in other protocols. However, the set of
understood types is part of this protocol in order to provide for
basic interoperability.
The layout descriptions include <deviceID, objectID> tuples
that identify some storage object on some storage device.
The addressing information associated with the deviceID is obtained
with GETDEVICEINFO. The interpretation of the objectID depends on
the storage protocol. The objectID could be a filehandle for an
NFSv4 data server, or an OSD object ID for an object server.
The layout for a block device generally includes additional block
map information to enumerate blocks or extents that are part of
the layout.
11.1 Simple Map
The data is located on a single storage device. In this case the
file server can act as the front end for several storage devices
and distribute files among them. Each file is limited in its size
and performance characteristics by a single storage device. The
simple map consists of <deviceID, objectID>.
11.2 Block Map
The data is located on a LUN in the SAN. The layout consists of
an array of <deviceID, blockID, blocksize> tuples. Alternatively,
the blocksize could be specified once to apply to all entries in
the layout.
11.3 Striped Map (RAID 0)
The data is striped across storage devices. The parameters of the
stripe include the number of storage devices (N) and the size of
each stripe unit (U). A full stripe of data is N * U bytes. The
stripe map consists of an ordered list of <deviceID, objectID>
tuples and the parameter value for U. The first stripe unit (the
first U bytes) is stored on the first <deviceID, objectID>, the
second stripe unit on the second <deviceID, objectID>, and so forth
until the first complete stripe. The data layout then wraps around
so that byte (N*U) of the file is stored on the first <deviceID,
objectID> in the list, but starting at offset U within that object.
The striped layout allows a client to read or write to the component
objects in parallel to achieve high bandwidth.
The striped map for a block device would be slightly different.
The map is an ordered list of <deviceID, blockID, blocksize>, where
the deviceID is rotated among a set of devices to achieve striping.
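
The offset arithmetic implied by this striping can be written down
directly (a sketch):

   #include <stdint.h>

   /* Map a file offset onto a RAID-0 stripe over N devices with
    * stripe unit U: which device, and at what offset in its object. */
   struct stripe_loc { uint32_t dev_index; uint64_t obj_offset; };

   static struct stripe_loc
   stripe_map(uint64_t file_offset, uint32_t N, uint64_t U)
   {
       uint64_t unit = file_offset / U;       /* stripe unit number */
       struct stripe_loc loc;
       loc.dev_index  = (uint32_t)(unit % N); /* round-robin device */
       loc.obj_offset = (unit / N) * U + file_offset % U;
       return loc;
   }

For example, with N = 4 and U = 64 KB, byte N*U = 256 KB of the
file maps to the first device at object offset U, matching the
wrap-around described above.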
11.4 Replicated Map
The file data is replicated on N data servers. The map consists of
N <deviceID, objectID> tuples. When data is written using this map,
it should be written to N objects in parallel. When data is read,
any component object can be used.
This map type is controversial because it highlights the issues with
error recovery. Those issues get interesting with any scheme that
employs redundancy. The handling of errors (e.g., only a subset
of replicas get updated) is outside the scope of this protocol
extension. Instead, it is a function of the storage protocol and
the metadata management protocol.
11.5 Concatenated Map
The map consists of an ordered set of N <deviceID, objectID,
size> tuples. Each successive tuple describes the next segment of
the file.
11.6 Nested Map
The nested map is used to compose more complex maps out of simpler
ones. The map format is an ordered set of M sub-maps; each sub-map
applies to a byte range within the file and has its own type, such
as the ones introduced above. Any level of nesting is allowed in
order to build up complex aggregation schemes.
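
A recursive declaration captures the idea (a sketch; the names are
illustrative):

   #include <stddef.h>
   #include <stdint.h>

   struct example_map;                    /* forward declaration */

   /* One sub-map: the map governing one byte range of the file. */
   struct example_submap {
       uint64_t            offset, length;
       struct example_map *map;
   };

   /* A map is either a leaf (simple, block, striped, ...) or a
    * nested map holding an ordered list of sub-maps. */
   struct example_map {
       uint16_t               type;
       size_t                 nsubmaps;   /* nonzero only when nested */
       struct example_submap *submaps;
   };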
12. Issues
12.1 Storage Protocol Negotiation
Clients may want to negotiate with the metadata server about
their preferred storage protocol, and to find out what storage
protocols the server offers. Clients can do this by querying the
LAYOUT_CLASSES file system attribute; they respond by specifying
a particular layout class in their LAYOUTGET operations.
12.2 Crash recovery
We use the existing client crash recovery and server state recovery
mechanisms in NFSv4. This includes that layouts have associated
layout stateids that "expire" along with the rest of the client
state. The main new issue introduced by pNFS is that the client
may have to do a lot of I/O in response to a layout recall.
The client may need to send RENEW operations to the server
during this period if it would otherwise do nothing within
the lease time. Of course, the client should only reply with its
LAYOUTRETURN after it knows its I/O has completed.
12.3 Storage Errors
As noted under LAYOUTRETURN, there is a need for the client to
communicate about errors it has when accessing storage directly.
13. References
1 Gibson et al., "pNFS Problem Statement", ftp://www.ietf.org/
  internet-drafts/draft-gibson-pnfs-problem-statement-01.txt,
  July 2004.
14. Acknowledgments
Many members of the pNFS informal working group have helped
considerably. The authors would like to thank Gary Grider, Peter
Corbett, Dave Noveck, and Peter Honeyman. This work is inspired
by the NASD and OSD work done by Garth Gibson. Gary Grider of
the national labs (LANL) has been a champion of high-performance
parallel I/O.
15. Authors' Addresses
Brent Welch
Panasas, Inc.
6520 Kaiser Drive
Fremont, CA 94555 USA
Phone: +1 (510) 608 7770
Email: welch@panasas.com
Benny Halevy
Panasas, Inc.
1501 Reedsdale St., #400
Pittsburgh, PA 15233 USA
Phone: +1 (412) 323 3500
Email: bhalevy@panasas.com
David L. Black
EMC Corporation
176 South Street
Hopkinton, MA 01748
Phone: +1 (508) 293-7953
Email: black_david@emc.com
Andy Adamson
CITI University of Michigan
519 W. William
Ann Arbor, MI 48103-4943 USA
Phone: +1 (734) 764-9465
Email: andros@umich.edu
David Noveck
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 (781) 768 5347
Email: dnoveck@netapp.com
16. Full Copyright Notice
Copyright (C) The Internet Society (2004). This document is subject
to the rights, licenses and restrictions contained in BCP 78,
and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided
on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license
under such rights might or might not be available; nor does it
represent that it has made any independent effort to identify any
such rights. Information on the procedures with respect to rights in
RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf-
ipr@ietf.org.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.