Let it Be(OS)

The BeOS file system, an OS geek retrospective

From the archives: In the '90s, OS geeks adored BFS's ahead-of-its-time feature set.

Andrew Hudson – Jul 5, 2018 3:22 PM | 150

HD, so like... a high-definition floppy?

It's the day after Independence Day in the US, and much of our staff is just returning to their preferred work machines. If this was 1997 instead of 2018, that would mean booting up BeOS for some. The future-of-operating-systems-that-never-was arrived just over 20 years ago, so in light of the holiday, we're resurfacing this geek's guide. The piece originally ran on June 2, 2010; it appears unchanged below.

The Be operating system file system, known simply as BFS, is the file system for the Haiku, BeOS, and SkyOS operating systems. When it was created in the late '90s as part of the ill-fated BeOS project, BFS's ahead-of-its-time feature set immediately struck the fancy OS geeks. That feature set includes:

A 64-bit address space
Use of journaling
Highly multithreaded reading
Support of database-like extended file attributes
Optimization for streaming file access

A dozen years later, the legendary BFS still merits exploration—so we're diving in today, starting with some filesystem basics and moving on to a discussion of the above features. We also chatted with two people intimately familiar with the OS: the person who developed BFS for Be and the developer behind the open-source version of BFS.

A little history

BFS was created in 1997 by Dominic Giampaolo and Cyril Meurillon, both of whom worked at Be. It was designed to be multi-threaded and lightweight, and to support high-volume, streaming multimedia. It was also designed to support the database features of the previous Be file system. Even though it was written at a time when systems typically had only 8MB of RAM and a mere 9GB of disk storage, many of the forward-thinking design decisions made then are still valid today.

BFS didn't quite end when Be shut its doors after failing to get bought by Apple. In 2002, Axel Dörfler re-implemented BFS for Haiku as an open-source project. The last part of this article features an interview with Axel.

Before we can talk about what made BFS so special, we first have to cover some file system basics.

A screenshot of BeOS — Behold, BeOS in all its glory. Credit: Wikimedia Commons

File system basics

At the basic level, a file system exists to manage the data on permanent storage devices. Functions common to most file systems include:

Ars Video

Creating files and directories
Opening, reading, writing, deleting, and renaming files
Reading, writing, and updating file metadata or attributes

Additional features include symbolic links, access control lists, and memory mapping.

For a high-level overview of numerous file systems, refer to the article From BFS to ZFS: Past, Present and Future File Systems. For a more in-depth look at HPFS, NTFS, EXT2, and XFS circa 2000, refer to chapter 3 of Practical File System Design.

The following terms are common in file system discussions, so you should glance through the list and familiarize yourself with any you don't already know:

Disk: A permanent storage medium of a certain size. A disk also has a sector or block size, which is the minimum unit that the disk can read or write. The traditional block size for disks has been 512 bytes, but it now approaches 4096 bytes for newer disks.

Block: The smallest unit writable by a disk or ﬁle system. Everything a ﬁle system does is composed of operations done on blocks. A ﬁle system block is always the same size as or larger (in integer multiples) than the disk block size.

Bitmap: not a picture, but a data structure that determines which blocks on a disk are free or used.

Partition: A subset of all the blocks on a disk. A disk can have several partitions.

Volume: The name given to a collection of blocks on some storage medium (i.e., a disk). That is, a volume may be all of the blocks on a single disk, some portion of the total number of blocks on a disk, or it may even span multiple disks and be all the blocks on several disks. The term “volume” is used to refer to a disk or partition that has been initialized with a ﬁle system.

Superblock: The area of a volume where a ﬁle system stores its critical, volume-wide information. A superblock usually contains information such as how large a volume is, the name of a volume, and so on

Metadata: A general term referring to information that is about something but not directly part of it. For example, the size of a ﬁle is very important information, but it is not part of the data that's stored in the file itself. So file metadata is data about a file, and not in a file.

Journaling: A method of insuring the correctness of ﬁle system metadata even in the presence of power failures or unexpected reboots.

I-node: The place where a ﬁle system stores all the necessary metadata about a ﬁle. The i-node also provides the connection to the contents of the ﬁle and any other data associated with the ﬁle. The term “i-node” (which we will use in this article) is historical and originated in Unix. An i-node is also known as a ﬁle control block (FCB) or ﬁle record.

Attribute: A name (as a text string) and value associated with the name. The value may have a deﬁned type (string, integer, etc.), or it may just be arbitrary data. (1)

BFS features

Now that we've got our file system basics down, let’s look at some of the features that make BFS unique.

First off, BFS's 64-bit addressing means that no matter how large disks get in the future, you will be able to format the entire disk with BFS. You can create partitions in excess of 8 exabytes and, depending on the block size used, you can create files that are greater than 30 GB in size.

One of BFS's most important and widely touted features is its support for extended attributes. An example of the importance of attributes is illustrated with an example of MP3 files. Information fields important to an MP3 file would be: song title, band, album, release date, encoding rate, length, number of times played. If you want to associate this information with each MP3 file using a conventional file system, you might have to create your own database to support searching, creating, updating, or deleting these attributes as your music collection grows and changes. With BFS, in contrast, these attributes, or any other attributes, can be added to the file system itself. This means that a program for editing or playing MP3s does not need to create or maintain a database, because the file system will handle these functions for you. BFS supports associating attributes with a file, either under program control or from the command line. Attributes can be searched and sorted by the file system, as an extension of any application. How this is done will be discussed in detail later.

BFS supports the ability to create a persistent or ‘live’ query that watches for file changes. This is a query that hooks into the file system, checking for files that match search criteria. Under Haiku, these queries are easy to create and surprisingly light on system resources.

BFS is journaled, which means that it keeps track of certain file system consistencies on-the-go, and does not need file system consistency tools like fsck or chkdsk. Journaling also contributes to faster system booting after an unexpected shutdown.

Internally, BFS uses UTF-8 characters for directory and file names. This means you can use virtually any language, natively, in Haiku. With no extra effort you can localize file names to Chinese, German characters with umlauts, or cursive Arabic.

BFS gives special performance consideration to large file access. Creating and reading large video, audio, or image files are optimized operations under BFS.

BFS architecture: the Superblock

The superblock is typically the highest-level data structure for a file system. The BFS superblock describes the physical disk, the journal area, and indexing. For obvious performance reasons, the superblock is kept in RAM after the system boots.

 
typedef struct disk_super_block { 
	char name[B_OS_NAME_LENGTH]; 
	int32 magic1; 
	int32 fs_byte_order; 
	uint32 block_size; 
	uint32 block_shift; 
	off_t num_blocks; 
	off_t used_blocks; 
	int32 inode_size; 
	int32 magic2; 
	int32 blocks_per_ag; 
	int32 ag_shift; 
	int32 num_ags; 
	int32 flags; 
	block_run log_blocks; 
	off_t log_start; 
	off_t log_end; 
	int32 magic3; 
	inode_addr root_dir; 
	inode_addr indices; 
    int32 pad[8]; 
} disk_super_block;

The name struct holds the file system name. Three magic numbers are used for consistency checking as well as version numbering. Fs_byte_order holds the byte ordering, and block_size holds the explicit byte count; block_shift, used as an exponent of 2, will also calculate the block size. This is a purposeful redundancy used for consistency checking of a file system. Num_blocks holds the number of available blocks for the file system, and used_blocks holds the number currently in use. The flags field determines whether the state of the superblock is clean or dirty. Root_dir points to the root of all files and directories. Indices points to the beginning of the index section of extended attributes. The bfsinfo utility can be used to dump the superblock for a system’s superblock.

I-nodes

Like most UNIX-derived file systems, BFS uses a node structure to keep track of disk allocations. The bfs_inode structure is defined below. The i-node handles basic file metadata including file creation time, owner, size of file, group ownership, where the file data is stored on the disk, etc. BFS does not update file size until a file is closed. In testing it was found that this gave a substantial performance gain.

 
typedef struct bfs_inode { int32 magic1;
    inode_addr inode_num;
    int32 uid;
    int32 gid;
    int32 mode;
    int32 flags;
    bigtime_t create_time;
    bigtime_t last_modified_time;
    inode_addr parent;
    inode_addr attributes;
    uint32 type;
    int32 inode_size;
    binode_etc *etc;
    data_stream data;
    int32 pad[4];
    int32 small_data[1];
} bfs_inode;

A magic number is used again in bfs_inode for sanity checking and error recovery. The magic numbers for BeOS and Haiku are the same for compatibility, but different for the SkyOS implementation. BFS uses the file’s disk sectors as the i-node value, making sector mappings a straight lookup. UID, GID, and mode are used to maintain POSIX compliance.

Data_streams

The i-node structure holds the basic attributes of a file but not the actual file data itself. This is done through the data member structure. The data member is defined by the data_stream struct:

 
typedef struct data_stream {
	block_run direct[NUM_DIRECT_BLOCKS]; 
	off_t max_direct_range; block_run indirect; 
	off_t max_indirect_range; 
	block_run double_indirect; 
	off_t max_double_indirect_range; 
	off_t size; 
} data_stream;

The data_stream struct maps data from the physical disk to the file stream API. Access using data_streams is optimized for throughput, bypassing the system cache, and using DMA into and out of user memory. In benchmark tests, BFS is able to achieve high throughput that approaches peak disk transfer rates.

Database functions using extended attributes

As mentioned earlier, handling attributes is an important file system function. The Mac HFS was the first file system to extensively use file attributes in support of GUIs. Consider that a windowed OS needs to persist and manage many GUI attributes such as frame size, location, coloring, text, etc., and needs to optimize access for a quick response time.

BFS supports extended attributes in the form of key/value associations with files. Keys have a fixed type and can be added at any time. Valid types are string, time, double, float, int, boolean, raw, and image. If a key is indexed, basic searching on a key is greatly optimized. The following are command line tools for managing extended file attributes.

addattr key value filename: adds the key/value pair to the specified file
catattr key filename: displays the specified fieldname value.
listattr filename: lists a file’s associated attributes, their types, and their sizes, in bytes.
rmattr key filename: removes an attribute from the specified file.

New fields are created globally with mkindex. For instance,

mkindex --type=indextype index: creates a new index, of type long, on the current volume.
reindex sourcefile filename: adds a file’s key to an index if the index is created after the file’s attributes.
rmindex index: removes an attribute from the current volume, making it unavailable for use.
lsindex: lists all attributes

In Haiku, access to file attributes is supported graphically through the Tracker, as well as with keyboard shortcuts. Any object in Haiku that has a graphical representation has the _trk/pinfo_le attribute set with its file icon position. The BEOS:TYPE attribute holds the application association for a file. More attribute usage detail can be found here.

Additional information and examples of using file attributes are provided here and here.

Searching with Query

Query is the command line tool used to search for matching attributes. It is easier to use than the "find" UNIX command line utility.

Here are some examples of the query syntax:

query "((name="**")&&(BEOS:TYPE=="audio/x-wav"))" - finds all files with MIME type of "wav". Useful if you have wav files that are missing the .wav extension.

query "(last_modified >= %yesterday%)" - finds files older than yesterday

The output from query can be used with scripting tools from the command line as follows:

touch 'query ((last_modified< %yesterday%)&&(BEOS:TYPE=="audio/mpeg"))'

This will update the last modified time on all MP3 files with a last modified time older than yesterday.

Additional information on scripting with query is provided here.

Its Haiku GUI counterpart is "find," found in the Deskbar and documented here. All find queries are cached for 7 days and appear in the drop-down list.

The BeBox was a dual-processor PC made to run BeOS (this one is running with an aftermarket monitor). Credit: Wikimedia Commons

Live queries

An especially nice feature of BFS is the Live Query feature. When using Find, drag a query name from the select/pick list and drag it to the desktop or to Tracker. This hooks the query into the file system and saves it. Any time a file matching the query criteria is created or deleted, the query list is updated. Live queries are supported natively in BFS and use surprisingly little resources.

Applications use attributes

The Haiku Mail Kit is an example of an application that makes extensive use of attributes. Haiku mail does not have its own database for storing and managing email files. Instead, it stores each email directly into the BFS file system and uses more than 15 email-specific attributes for searching and sorting. Can you imagine having 8 exabytes worth of email? Haiku makes this theoretically possible.

This is a great example of abstracting functionality out of individual applications and locating it in the operating system, or in this case the file system. Because BFS supports extended attributes, any application can use the powerful database capability of the file system without having to reinvent the wheel.

Optimizing attribute lookups

Since every file on BFS could potentially have many extended attributes, and there could potentially be hundreds of thousands of files, there is a great need to optimize the access of attributes. In BFS, each index is implemented as a B+tree data structure. The B+tree is a balanced, sorted tree data structure that provides very fast lookup and scales up very well. Not surprisingly it's also used to manage directory structures and it is widely used in other file systems, such as NTFS. BFS indices are very optimized for lookups of the form:

query “(name==”temp”)” or query “(name >”111”)”

BFS indices are not optimized for regular expression searches of the form:

query “(name==[aA][bB][cC])”

These searches degrade from a binary lookup to a sequential lookup and could potentially take much longer in a large file system.

Queries that use a regular expression but start with a fixed expression are optimized in Haiku, e.g., query “(name==temp[1234])” will run faster than query “(name==[tT]emp[1234])”.

Journaling

Computers with disk storage can experience many kinds of failure modes. The magnetic material of a disk sector can go bad, the servo mechanism that moves the disk head can fail, the electronics that interface with the computer can die, or the user can reboot the machine in the middle of a disk write operation. If you run a disk long enough, eventually the electronics or mechanics of a disk will fail. You generally don’t have to wait very long for an impatient user to reboot during a file write operation. This is unfortunately a fairly common occurrence and it can have devastating results on a file system.

Consider the following situation. A user has created a file and it is the process of saving it to disk. Maybe it’s a developer working on an overdue project. The file system must make several updates before this file is completely saved. It must first save the contents of the file. It must save the metadata for the file, creation date, owner, file size, etc. It must also update the superblock. Consider what happens when a system powers down before these operations complete. In addition to losing your valuable work, the data structures can become corrupted and point to i-nodes and blocks that don’t exist. In addition to being even more late, your file system may subsequently fail to boot, losing all of your other files.

Journaling, or file system logging, alleviates some of these problems. While journaling doesn’t guarantee that a premature reboot won’t lose your file, it does guarantee that it won’t corrupt your file system.

Consider the following example of how journaling works. Let’s say a user creates a new file and saves it. The data must be saved to disk, and a new directory entry with correct metadata must be saved as well. Before these disk operations occur, BFS locks the unwritten blocks into RAM, and makes log entries in the file system journal for each block about to be written. After each block is successfully flushed to disk, the journal entries are marked as completed. If the system shuts down before the blocks are successfully flushed, the journal entries will list the incomplete operations. When the system restarts, it inspects the journal for entries that are not completed. The remaining entries can be ‘replayed’ when the system reboots to successfully complete the write operation, or the whole thing can be "rolled back," and the operation aborted. Either way, the file system is left in a consistent state where the directory structure and metadata accurately match the file data.

In BFS, the journal logs any changes made to the directory, the bitmap block, i-nodes, and extended attributes. It does not journal the actual user data. In this way, journaling protects file system consistency but does not provide data recovery features like a redundant disk array, or RAID.

While not immune to all disk errors, BFS is resistant to the common failure mode of a premature system shutdown. File systems without journaling, such as Linux EXT2, are vulnerable to file system inconsistencies and rely on lengthy scanning processes for recovery. BFS does not need disk scanning, and Haiku can start up quickly after a premature shutdown.

Improvements of Haiku BFS over BeOS

Haiku’s version of BFS has a number of improvements over the original BeOS BFS implementation. The B+tree is more robust. Haiku BFS uses a file cache for file data in addition to a block cache. This resulted in a factor of 10 speed improvement. Haiku’s BFS implements status changed time for files, and also has more fine-grained file status capability. The POSIX atime file was omitted from BeOS BFS for performance's sake. Haiku BFS includes a query optimized for hybrid regex that allows mixing of a static string with a regular expression. New inspection tools bfsinfo, bfswhich, chkindex, and recover were added for Haiku BFS. The reindex command was added to improve indexing of extended attributes.

A quick interview with a Senior Developer at BeOS, who worked on BFS

(He asked that his name not be used, to comply with the wishes of his current employer)

The book Practical File System Design describes the BeOS development environment as being short on time and scarce on developers. What were some of the positive aspects of that environment?

It was just a fun place to work at the time. We all got along really well, loved what we were doing and all fed off of each other's energy. Everyone was top-notch and it was just non-stop. Usually you get one or two of those aspects in a company, but when you have them all it's a very intoxicating environment: change is rapid, progress is good, you feel like you're really doing something important and it's just a lot of fun.

From conception to release, how long did it take to code and debug BFS?

Ten or eleven months. I had help from one other engineer to write the code that handled writing to the indirect blocks in a file.

What tools did you use to develop and debug BFS?

Emacs, make, and gcc. Initially I did some prototyping in user space and thus was able to use gdb but once it went into the kernel it was all "Welcome to Kernel Debugging Land" from there on out.

What was the biggest challenge in developing BFS?

Trying to get it done in such a short timeframe while supporting all the features we wanted.

Did BFS influence the design of any subsequent file systems?

Unlikely.

What are some hot topics of file systems today?

Data integrity and data de-duplication are probably the biggest areas of interest right now. People are also spending a lot of time trying to figure out how to provide reliable storage in the face of unreliable components. RAID-5/6 are OK, but as the size of drives go up, a lot of people are worried that when a failure happens, they're vulnerable to a total failure if another drive fails before they're done with reconstruction.

What new features or activities will the next generation of file systems support?

I'm not sure. I think it will be a little while yet before we sort out which layer is the right place for all the different parts of the problem of "storage." ZFS has gone down the path of putting everything in the FS. Other folks are putting in a different set of things. It's pretty clear that some of the functionality that existed in BFS does *not* belong in the FS.

In your opinion, what science fiction movie has the best use of a computer?

Hmmm, not sure. Star Trek? Their computers do everything and have a multitouch interface and they don't spend a lot of time futzing around with them.

A Quick Interview With Axel Dörfler, Developer of Open Source BFS

Before BFS, did you have previous experience working with file systems?

I had to write an application to recover some data from a BFS partitioned hard disk. That gave me all the knowledge I needed to write BFS.

What was the hardest part of rewriting BFS?

Making sure the B+tree implementation behaved correctly, as the one used in the original BFS (as mentioned in Dominic's book) was quite unstable, and inefficient.

What was the easiest part?

The actual BFS read-only implementation, as I could reuse most of the code I had written for recovery.

Did you find any surprises when rewriting BFS?

Yes, there are quite some illogical things like the log using block_runs, but requires them to have a length of one, or the almost superfluous double indirect block implementation.

Did you ever have any discussions with Dominic about BFS?

Yes, but they didn't really detail its flaws, rather my implementation.

What would you change in BFS 2.0?

A lot.

How long did the coding take?

I honestly don't remember. I think the read-only part took only two weeks or so, while the write part took a lot longer, and it took still longer to make it fully compliant to Be's BFS.

What was the biggest bug and how did you solve it?

The most annoying bug was the "vnode ID already exists" bug - there were like a dozen or so reasons why this one could happen, and that kind of minimized the joy of having found another issue with it (only a part of them were in BFS itself, though).

It's still there in a new incarnation as bug #5262, BTW, although this could have completely unrelated reasons (like memory corruption).

What lessons did you learn from rewriting BFS?

That it's much easier to have a working VFS when writing a file system.

Lastly, what is your current development PC?

Depending on where I am it's either a ThinkPad T60, or two year old Core 2 Quad with onboard Intel graphics.

Acknowledgements

This article relied heavily on the book Practical File System Design with the Be File System. It’s seldom that a file system is documented in such detail and in such a readable fashion. Also thanks to Axel Dörfler for fact checking.

References

About the Author

Andrew Hudson is a freelance technical project manager living in Florida, USA. In his abundant spare time he enjoys exploring caves and restoring antique cars.

150 Comments