As storage capacity continues to increase, users find it increasingly
difficult to manage their files using traditional directory hierarchies.
Attribute-based naming enables powerful search and organization tools
for ever-increasing user data sets. However, such tools are only useful
in combination with accurate attribute assignment. Existing systems
rely on user input and content analysis, but they have enjoyed minimal
success. We propose several new approaches to automatically assigning
attributes to files through context analysis, a technique that has been
successful in the Google web search engine. With extensions like application
hints (e.g., web links for downloaded files) and inter-file relationships,
it should be possible to infer useful attributes for many files, making
attribute-based search tools more effective.
As storage capacity increases, the amount of data belonging to an individual user increases accordingly. Soon, storage capacity will reach a point where there will be no reason for a user to ever delete old content; in fact, the time required to do so would be wasted. The challenge has shifted from deciding what to keep to finding particular information when it is desired. To meet this challenge, we need to improve our approach to personal data organization.
Today, most systems provide a tree-like directory hierarchy to organize files. Although this is easy for most users to understand, it does not provide the flexibility required to scale to large numbers of files. In particular, the strict hierarchy provides only a single categorization with no cross-referenced information.
Alternatives to the standard directory hierarchy systems generally assign attributes to files, providing the ability to cluster and search for files by their attributes. An attribute can be any metadata that describes the file, although most systems use keywords or <category, value> pairs. The key challenge is assigning useful, meaningful attributes to files.
Unfortunately, the two most prevalent methods of attribute assignment,
user input and content analysis, have been largely unsuccessful. Although
users often have a good understanding of the files they create, it can
be time-consuming and unpleasant to distill that information into the
right set of keywords. As a result, users are understandably reluctant
to do so. On the other hand, content analysis takes none of the user's
time, and can be performed entirely in the background to eliminate any
potential performance penalty. However, the complexity of language parsing,
combined with the large number of proprietary file formats and non-textual
data types, restricts the effectiveness of content analysis.
Early web search engines (e.g., Lycos) relied upon user input (user-submitted web pages) and content analysis (word counts, word proximity, etc.). Although these systems were valuable, their success has been eclipsed by that of Google.
To provide better search results, Google utilizes two forms of context analysis. First, it uses the text associated with a link to determine attributes for the linked site. This text gives the context of both the creator of the linking site and the user who clicks on the link at that site. The more times that a particular word links to a site, the higher that word is ranked for that site. Second, Google uses the actions of a user after a search to decide what the user wanted from that search. For example, if a user clicks on the first four links of a given search, and then does not return, it is likely that the fourth link was the best match, providing the user's context for those search terms.
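The first form of context analysis can be illustrated with a minimal sketch. It only counts how often each word appears in the link text pointing at a site, as described above; it is a simplification for illustration, not Google's actual ranking algorithm. The function name `rank_anchor_words` and the example URLs are hypothetical.

```python
from collections import Counter, defaultdict


def rank_anchor_words(links):
    """Count link-text words per target site.

    links: iterable of (anchor_text, target_url) pairs (hypothetical input format).
    Returns a dict mapping each target to a Counter of words that link to it.
    """
    words_by_target = defaultdict(Counter)
    for anchor_text, target in links:
        # Every word in the link text counts as one vote for the target.
        words_by_target[target].update(anchor_text.lower().split())
    return words_by_target


# Made-up links used purely for illustration.
links = [
    ("attribute naming paper", "http://example.org/a"),
    ("semantic file systems paper", "http://example.org/a"),
    ("file systems course", "http://example.org/b"),
]
ranks = rank_anchor_words(links)
top = ranks["http://example.org/a"].most_common(1)
print(top)  # [('paper', 2)]
```

Here "paper" ranks highest for the first site because two independent links use it, which matches the intuition that repeated link words are stronger attributes.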
Unfortunately, Google's approach to indexing does not translate
directly into the realm of file systems. Much of the information that
Google relies on does not exist within a file system. Also, Google's
query feedback mechanism relies on two properties: users are normally
looking for the most popular sites when they perform a query, and Google
has a large user base that will repeat the same query many times. Conversely,
in file systems, users usually search for files that have not been accessed
in a long time, because they usually remember where recently accessed
files reside, and there is generally only a single user for each set
of files, making it unlikely that frequent queries will be generated
for any given file.
We are investigating four approaches to automatically gathering context information for use in file systems. The first two focus on gathering attributes when a file is created or accessed. The second two focus on propagating attributes among related files to increase the coverage of attribute assignment. Together, these techniques should categorize a much broader set of files than creation-based attribute assignment alone.
Application assistance: Although computers provide a vast array of functionality, most people use their computer for a limited set of tasks using a small set of applications that, in turn, access and create most of the user's files. Modifying these applications to provide hints about the user's context could provide invaluable attribute information.
Existing user input: Although most users are not willing to input additional information, they are willing to choose a directory and name for their files. Each of the sub-directories along the path, and the file name itself, probably contains context information that can be used to assign attributes. For example, if the user stores a file in /home/papers/FS/Attribute-based/Semantic91.ps, then it is likely that they believe the file is a paper having to do with FS, attribute-based, and semantic.
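A minimal sketch of how a path could be turned into candidate attributes follows. The function `attributes_from_path`, the ignore list, and the rule of stripping trailing digits from the file name are illustrative assumptions, not part of the proposed system.

```python
import re
from pathlib import PurePosixPath


def attributes_from_path(path, ignore=frozenset({"home"})):
    """Derive candidate keyword attributes from a file's directory path and name.

    The `ignore` set (components assumed to carry no context) is hypothetical.
    """
    p = PurePosixPath(path)
    # Directory components, excluding the root anchor ("/").
    components = [c for c in p.parts[:-1] if c != p.anchor]
    # File name without its extension and without trailing digits (e.g. a year).
    components.append(re.sub(r"\d+$", "", p.stem))
    attrs = []
    for component in components:
        word = component.lower()
        if word and word not in ignore and word not in attrs:
            attrs.append(word)
    return attrs


attrs = attributes_from_path("/home/papers/FS/Attribute-based/Semantic91.ps")
print(attrs)  # ['papers', 'fs', 'attribute-based', 'semantic']
```

No stemming is applied here, so the directory name "papers" is kept as is rather than reduced to "paper".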
User access patterns: As users access their files, the pattern of their accesses provides a set of temporal relationships between files. A possible use of this information is to help propagate information between related files. For example, accessing SemanticFS.ps and Gopal.ps followed by updating related.tex may indicate a relationship between the three files. Subsequently, accessing related.tex and creating FindingFiles.ps may indicate a transitive relationship.
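The example above can be sketched as follows, under loudly stated assumptions: the trace format, the 60-second window, the rule "files read shortly before a write are related to the written file", and the two-hop propagation limit are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def relate_by_access(trace, window=60.0):
    """Relate each written file to files read within `window` seconds before it.

    trace: list of (timestamp_seconds, op, filename) sorted by time,
    where op is "read" or "write" (an assumed trace format).
    """
    related = defaultdict(set)
    for i, (t, op, name) in enumerate(trace):
        if op != "write":
            continue
        for t_prev, op_prev, prev in trace[:i]:
            if op_prev == "read" and t - t_prev <= window and prev != name:
                related[name].add(prev)
                related[prev].add(name)
    return related


def propagate(attrs, related, max_hops=2):
    """Copy attributes between related files, one hop per iteration."""
    result = {f: set(a) for f, a in attrs.items()}
    for _ in range(max_hops):
        snapshot = {f: set(a) for f, a in result.items()}
        for f, neighbours in related.items():
            for n in neighbours:
                result.setdefault(f, set()).update(snapshot.get(n, set()))
    return result


trace = [
    (0, "read", "SemanticFS.ps"),
    (5, "read", "Gopal.ps"),
    (20, "write", "related.tex"),
    (500, "read", "related.tex"),
    (520, "write", "FindingFiles.ps"),
]
related = relate_by_access(trace)
known = {"SemanticFS.ps": {"semantic"}, "Gopal.ps": {"naming"}}
result = propagate(known, related)
print(sorted(result["FindingFiles.ps"]))  # ['naming', 'semantic']
```

FindingFiles.ps was never accessed together with SemanticFS.ps or Gopal.ps, yet it inherits their attributes through related.tex, which is the transitive relationship described above. The relation is treated as symmetric and unweighted here; a real scheme would likely weight or decay propagated attributes.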
Inter-file content analysis: Content analysis will continue
to be an important part of automatically assigning attributes. In addition
to existing per-file analysis techniques, our focus on creating context-based
connections between files suggests another source of attributes: content-based
relationships. For example, some current file systems use hashing to
eliminate duplicate blocks within a file system, or even locate similarities
on non-block aligned boundaries. Such content overlap could also be
used to identify related files, by treating files with large matching
data sets as related. Similarly, users (or the system) will often keep
several slightly different versions of a file. Although these files
generally contain differences, often the inherent information contained
within does not change (e.g., a user may keep three instances of their
resume, each tailored to a different type of job application). This
gives the system two opportunities for content analysis. First, content
comparison can identify related files. Second, by performing content
analysis solely on the differences between versions, it may be possible
to determine version-specific attributes, making it easier for users
to locate individual version instances.
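Both opportunities can be sketched in a few lines. This is a simplified illustration, assuming fixed-size blocks (so it detects only block-aligned overlap, not the non-aligned similarities mentioned above) and whitespace-separated words; the function names and sample data are hypothetical.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 4096) -> set:
    """Hash each fixed-size block of a file's contents."""
    return {
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }


def overlap(a: bytes, b: bytes, block_size: int = 4096) -> float:
    """Fraction of distinct blocks shared by two files (Jaccard similarity)."""
    ha, hb = block_hashes(a, block_size), block_hashes(b, block_size)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)


def version_specific_words(base: str, version: str) -> set:
    """Words that appear in one version but not in the base version."""
    return set(version.lower().split()) - set(base.lower().split())


# Tiny blocks and made-up contents, purely for illustration.
file_a = b"A" * 8 + b"B" * 8 + b"C" * 8
file_b = b"A" * 8 + b"B" * 8 + b"D" * 8
score = overlap(file_a, file_b, block_size=8)
print(score)  # 0.5

base = "Jane Doe experience: storage systems research"
teaching = "Jane Doe experience: storage systems research teaching"
extra = version_specific_words(base, teaching)
print(extra)  # {'teaching'}
```

Files whose overlap score exceeds some threshold could be treated as related, while the words unique to each version are candidates for version-specific attributes.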
Figure: A prototype system for evaluating context-based attribute assignment schemes.
This figure shows an overview of a prototype system for evaluating context-based
attribute assignment schemes. The system is composed of four main parts:
the tracer, the application interface, the analyzer, and the database. The tracer keeps a trace of all file system activity in
the system. Any file system calls made by applications are tracked and
stored in a file for later offline analysis. This allows a single system
to employ a variety of different analysis techniques. The application
interface allows applications to pass context information into the system,
such as email header information or link information from a web browser.
The analyzer combines this application information with offline trace
analysis to generate attributes for files. All updated attribute information
is passed to the database, which provides the search interface to the
application. It allows applications to locate files using the file attributes
assigned by the analyzer. Feedback from the search results is pushed
to the analyzer for further attribute refinement.
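The data flow among the four parts can be sketched as below. This is not the prototype's implementation: the class and method names, the event format, and the toy analysis rule (tagging each file with the name of the application that touched it) are all assumptions made for illustration, and search feedback is omitted.

```python
from collections import defaultdict


class Tracer:
    """Records file system activity for later offline analysis."""
    def __init__(self):
        self.events = []

    def record(self, app, op, filename):
        self.events.append((app, op, filename))


class ApplicationInterface:
    """Lets applications pass context hints (e.g. a download link)."""
    def __init__(self):
        self.hints = defaultdict(set)

    def hint(self, filename, *attributes):
        self.hints[filename].update(attributes)


class Database:
    """Stores attributes and answers attribute searches."""
    def __init__(self):
        self._attrs = defaultdict(set)

    def update(self, filename, attributes):
        self._attrs[filename].update(attributes)

    def search(self, attribute):
        return sorted(f for f, a in self._attrs.items() if attribute in a)


class Analyzer:
    """Combines application hints with trace analysis."""
    def __init__(self, database):
        self.database = database

    def analyze(self, events, hints):
        for filename, attributes in hints.items():
            self.database.update(filename, attributes)
        # Toy trace-analysis rule (hypothetical): tag files with the accessing app.
        for app, _op, filename in events:
            self.database.update(filename, {app})


tracer, api, db = Tracer(), ApplicationInterface(), Database()
tracer.record("browser", "write", "paper.ps")
api.hint("paper.ps", "http://example.org/papers")
Analyzer(db).analyze(tracer.events, api.hints)
found = db.search("browser")
print(found)  # ['paper.ps']
```

Keeping the analyzer separate from the tracer and the database is what would let different analysis techniques be run over the same recorded trace.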
This design could include multiple databases. In order to compare the
results of different trace analysis algorithms, the analyzer could maintain
a database for each, and users could compare the results of the different
approaches. For more information on this project, see Soules [1].