Week 5 (Ch.4 Notes)


Learning Objectives

By the end of this week, students should be able to:
Understand the basic terms: file, directory, file attributes, i-node, block, superblock, soft link, hard link, etc.;
Understand the fundamentals of file system implementation;
Understand example file systems such as NFS and the CD-ROM file system.

Chapter Notes/Readings

Topics / Concepts

Files and Directories


Prof Notes
Chapter 4: File Systems (Page 263)
Processes store data in memory, but since memory is volatile and [relatively] small, a persistent storage system is needed. Units of storage are known as files; the files, together with their method of organization, are known as the file system. Some UNIX-specific information from Chapter 10 is also considered here.
4.1 Files (Page 265)
4.1.1 File Naming (Page 265)
Names are associated with files so that they can be accessed by processes other than the process that originally created the data. Different file systems place different restrictions on the naming conventions, depending on the scheme. In file systems with directories [i.e., most systems], the directory-separation character ['/' in UNIX, '\' in Windows] cannot be part of a file name, or the path names could not be parsed reliably. Some systems insist on extensions [MS-DOS, for example, requires a '.' near the end of the name, followed by up to 3 more characters], and others [UNIX] do not. There isn't much that is disallowed in a UNIX filename beyond '/' and the null character; it is legal to put a '<' in a filename, but most users avoid this, since '<' means redirection to the shell, and so users must type '\<' each time they want to specify a 'real' '<'. For similar reasons, '!', spaces, and wildcards are generally avoided in UNIX -- even though they are legal. Windows attaches meaning to file extensions; the extension determines which program is launched when a file is double-clicked. Relatively few UNIX applications are picky about the extension [or lack thereof]: rcs [the Revision Control System] requires repositories to end in ',v' and C compilers generally insist that source code be in a file that ends in '.c'. In UNIX, multiple extensions are allowed, and often give a hint as to the nature of the file: The file various.h.tar.gz,v indicates an rcs repository (,v) for a series of C header files (.h) that have been rolled into a tape archive (.tar), and then compressed with gzip (.gz).
Instead of relying on extensions, UNIX files tend to have a 'magic number' in the first two bytes of most files (postscript files, for example, begin with the ASCII codes for the characters "%!", and shell scripts begin with "#!"). Windows depends on the file having .mpg as the extension (see Figure 4-1 on Page 266).
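Here is a minimal sketch (mine, not the book's) of peeking at the first two bytes of a file the way the UNIX 'file' utility does; the two magic values checked are just the ones mentioned above, and everything else about the program is illustrative.

```c
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s filename\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL) {
        perror("fopen");
        return 2;
    }
    unsigned char magic[2];
    if (fread(magic, 1, 2, fp) == 2) {
        if (magic[0] == '%' && magic[1] == '!')
            printf("looks like PostScript\n");         /* "%!" magic bytes */
        else if (magic[0] == '#' && magic[1] == '!')
            printf("looks like a script (shebang)\n"); /* "#!" magic bytes */
        else
            printf("magic bytes: 0x%02x 0x%02x\n", magic[0], magic[1]);
    }
    fclose(fp);
    return 0;
}
```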
4.1.2 File Structure (Page 267)
Both UNIX and Windows regard files as merely a sequence of bytes, but older systems often organized data into records (see Figure 4-2 on Page 267).
4.1.3 File Types (Page 268)
A file system may in fact support many different kinds of nodes [for example, directories, which organize the file system structure]. Data files are termed 'regular files', to distinguish them from more specialized files. UNIX also has named pipes, block-special files and character-special files. A named pipe is a FIFO queue that multiple processes can use (that is, one process can write to the pipe and another can read from it, much like the unnamed pipes that UNIX shells use). Character-special files are essentially portals to various hardware: if you send data to /dev/tty [redirect output there, or open() it and write() to it], it is the job of the UNIX device driver that monitors this filesystem node to ensure that the data appears on your screen. Block-special files are nodes in the file system that serve as portals to block devices, such as the physical disk drives.
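A quick hedged sketch of a named pipe in action, using mkfifo() to create the FIFO node and a fork()ed child as the writer; the path /tmp/demo_fifo is just an illustrative choice.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/tmp/demo_fifo";     /* illustrative path */
    if (mkfifo(path, 0666) == -1) {          /* create the FIFO node */
        perror("mkfifo");
        return 1;
    }
    if (fork() == 0) {                       /* child: the writer */
        int fd = open(path, O_WRONLY);
        const char *msg = "hello through the FIFO\n";
        write(fd, msg, strlen(msg));
        close(fd);
        _exit(0);
    }
    int fd = open(path, O_RDONLY);           /* parent: the reader */
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("read: %s", buf);
    }
    close(fd);
    unlink(path);                            /* remove the FIFO node when done */
    return 0;
}
```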
'Normal' text files used to come in two flavors, encoded in EBCDIC [IBM's format, based on their punched-card encodings] or ASCII [the standard format used by everyone else]. Punched cards had 80 columns with 10 or so potential holes in each column. The card would be very flimsy if it contained too many holes, so the most-used characters were represented by punched patterns that consisted of only one or two holes. A blank was represented by not having any holes in the column [so a pristine card was essentially 80 blanks], the digits 0-9 were the one-hole patterns, and the upper and lowercase alphabetic characters were various two-hole combinations. Unfortunately, this meant that the binary codes for the characters were not very rational. If you increased the code for the letter 'A' by one, you did get the code for 'B', but increasing the code for 'I' does NOT result in the code for 'J'.
ASCII is a more rational scheme. The letters A-Z are assigned to 26 contiguous binary codes, and a-z are in another contiguous range, so converting uppercase to lowercase is simply a matter of adding a constant [namely 32] to any uppercase letter. ASCII also was designed with lexicographical order in mind, so the code for a space is smaller than any code for a letter. That way, when words are ordered based on their character encodings, 'to' will be earlier than 'top' like it is supposed to be in 'dictionary order'.
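A tiny demonstration of the 'add 32' property and the space-sorts-first property described above (real code should use tolower() from <ctype.h>, which also handles non-letters safely):

```c
#include <stdio.h>

int main(void)
{
    char c = 'G';
    char lower = c + 32;         /* 'A'..'Z' + 32 gives 'a'..'z' in ASCII */
    printf("%c -> %c\n", c, lower);
    printf("space sorts before letters: %d\n", ' ' < 'a');  /* prints 1 */
    return 0;
}
```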
Once punched cards were obsolete, there was no reason to use the inconvenient EBCDIC encoding any more. Nowadays, a text file comes in only one flavor: ASCII. Line termination is really the only aspect that is not completely standardized. Some systems use the linefeed character, some use carriage return, and some use both.
Having an agreed-upon universal type has many advantages: all editors can handle this file, programs can pass data between them without worrying about encoding issues, etc. (Contrast this to a .doc file, which contains all sorts of formatting gibberish, making it hard to extract the actual text without special processing.)
Files which have a complex internal structure are known as 'binary' files [to distinguish them from text files]. Archives and machine-executable files also fall into this category. File-transfer protocols that transport files between hosts often have a 'friendly' feature that will, for example, turn the code for every linefeed [decimal 10] into the code for carriage return [decimal 13], to allow ASCII text files to be read correctly on a host that uses different conventions for line termination. However, using this feature on a binary file, such as an executable, would be disastrous: you may find that your 'constant' 10 has been turned into a 13, or [perhaps worse] some machine-code instruction has been transformed into an entirely different command. Thus, programs such as ftp have an ASCII and a binary transfer mode, and it is important to choose the correct one.
Figure 4-3 on Page 270 illustrates the structure of some binary files. On the left is the layout of a UNIX executable. It begins with a magic number; were you to transfer an executable from an Intel PC to a Sun Sparc machine [which use different types of CPUs], and try to execute it, the operating system would note that the magic number was inappropriate for this architecture and refuse to run it. Other important header data includes the entry point, which [when combined with the text size information] specifies the location of the first machine-code instruction to begin executing.
The second illustration (Figure 4-3(b) on Page 270) shows the layout of an archive, which is essentially a collection of .o files. These archives are wrapped up with some header information, saying where each module starts and ends, and are generally stored under /usr/lib. Once the compiler has taken your main.c and created a main.o from it, the linker takes the various .o files it needs from the archive, and combines them to form an executable a.out file. (For example, your programs usually use printf(), but you have not written the printf.c module yourself -- instead, this has been pre-compiled into a printf.o and stored in an archive.) These library archives conventionally use the .a extension; the more general archives made with tar use .tar (or .tgz, if compressed) and can be a compendium of any set of files (not just .o files).
The text discusses the strategy of using strongly-typed files on Page 269, wherein file extensions are mandatory and applications pathologically insist on having the proper extensions. Worse, the operating system will not allow you to change the type of extension (study the text examples on Page 269).
4.1.4 File Access (Page 270)
Initially, computers had to access information on storage devices by starting at the beginning and reading sequentially, even if the information you really wanted was near the end of the file. This reflected the physical reality of the way tape drives had to work. If you suddenly needed information that came before what you were currently reading, the only option was to 'rewind' and start reading from the beginning again. Once there were drums and disk drives, it became possible to go directly to the portion of the data that was desired [with a 'seek' operation that went to a specified place in the file]. This was termed random access [as opposed to sequential access]. Nowadays sequential access has fallen by the wayside, since random access allows both methods. The most common strategy is to by default read sequentially from wherever the current position pointer is. The current pointer position is automatically updated as you read, and can be explicitly changed via the seek command to 'jump' to a non-sequential location within the file.
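A small sketch of random access with lseek(): jump the current position pointer to an arbitrary offset and read from there instead of from the beginning. The filename is just an example.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);    /* illustrative filename */
    if (fd == -1) {
        perror("open");
        return 1;
    }
    /* Move the current position pointer 1000 bytes past the beginning... */
    if (lseek(fd, 1000, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        close(fd);
        return 2;
    }
    /* ...and reads now continue sequentially from that point. */
    char buf[16];
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes starting at offset 1000\n", n);
    close(fd);
    return 0;
}
```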
4.1.5 File Attributes (Page 271)
Besides the data that comprise a file, the file system records various attributes. Many of the possibilities are shown in the figure on Page 272. In UNIX, the name of the file is kept [only] in a directory, and additional attributes are all stored in the file's inode [short for "index node"]. The structures are described in Figure 10-33 on Page 788 and Figure 10-34 on Page 790, which we will consider next.
A few of the fields in Figure 10-33 require some explanation. The "Nlinks" field counts the number of times this inode is associated with a directory name. It is possible to have different directory names referring to the same inode, and hence to the same file. This is illustrated in Figure 10-24 on Page 777. Directory entries are essentially ordered pairs of names and inode numbers, and it is perfectly legal to associate several different names [perhaps in different directories] with a single inode number. This is done via the ln command [which uses the link() system call]. Directory entries such as these are known as 'hard links', because they create a new directory pair, effectively cloning an existing pair, keeping the same inode but potentially using a different name (though Figure 10-24 happens to use the name 'x' in both directory entries). The soft link is just a special type of file, and the data it contains is merely the path to the file to which you are linking. This is illustrated in Figure 10-24 on Page 777; note that this implies that the file system tree is not technically a tree. [Thus, a soft link is rather like a WinBlows 'shortcut' on your desktop.]
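Here is a hedged sketch using the same calls the text mentions: link() adds a second (name, inode) pair, symlink() creates a little file whose data is just a path, and stat()/lstat() let you see the difference. The file names are illustrative, and 'original.txt' is assumed to already exist.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    if (link("original.txt", "hardlink.txt") == -1)    /* new (name, inode) pair */
        perror("link");
    if (symlink("original.txt", "softlink.txt") == -1) /* file whose data is a path */
        perror("symlink");

    struct stat st;
    if (stat("original.txt", &st) == 0)     /* stat() reports the inode's attributes */
        printf("inode %lu now has %lu hard links\n",
               (unsigned long)st.st_ino, (unsigned long)st.st_nlink);

    if (lstat("softlink.txt", &st) == 0)    /* lstat() examines the link itself */
        printf("softlink.txt is %s\n",
               S_ISLNK(st.st_mode) ? "a symbolic link" : "not a link");
    return 0;
}
```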
Most importantly, the inode lists the locations of the data blocks that comprise the actual file. If files were stored contiguously, then only the starting block and the size would be necessary to give the full location of the file. This, however, is a very bad idea on a file system in which files are created and deleted, leading to many small useless 'holes', and problems when a file needs to expand beyond its current size. (One of the places where contiguous file storage works well is CD-ROMs, since nothing is ever deleted, and hence no 'holes' can form.)
UNIX files do not need to be contiguous, so a list of individual blocks is needed. It is important for inodes to all be the same size, so that they can be efficiently indexed as an array of records. This rules out the option of making only some inodes large enough to point to thousands of blocks; every inode would have to have 'room' for thousands of addresses.
The UNIX solution is very elegant. The first ten pointers in the inode point directly to blocks of file data. The eleventh pointer points to a 'single indirect' disk block that does not contain file data; it contains a list of the next 1024 data blocks (indirect addressing, as shown in Figure 10-34 on Page 790; a clearer picture of this portion of the inode is shown in the figure on Page 325).
This is sufficient to handle files up to a size of 10+1024 blocks. If this is not enough, the 12th pointer is utilized, which points to a 'double indirect' block. The 1024 pointers in this block do not point to file data; each one of them points to a single indirect block, which in turn points to 1024 data blocks, giving a total capacity of over a million blocks. If this is still insufficient, the thirteenth pointer points to a triple indirect block, which allows for a billion more blocks. (The actual number of pointers in a data block depends on the size of the disk blocks; the examples discussed in the text assume 256 (not 1024) addresses can fit in one block.) These days, Linux tends to have 12 direct pointers instead of 10, so the math can be a little different on different flavors of UNIX.
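A quick back-of-the-envelope check of those capacities, assuming 10 direct pointers and 1024 addresses per indirect block (the exact numbers vary with block size and UNIX flavor):

```c
#include <stdio.h>

int main(void)
{
    long direct  = 10;                            /* pointers 1-10: direct          */
    long per_blk = 1024;                          /* addresses per indirect block   */
    long single  = per_blk;                       /* pointer 11: single indirect    */
    long dbl     = per_blk * per_blk;             /* pointer 12: ~1 million blocks  */
    long triple  = per_blk * per_blk * per_blk;   /* pointer 13: ~1 billion blocks  */
    printf("max blocks addressable: %ld\n", direct + single + dbl + triple);
    /* With 4 KB blocks, that works out to roughly 4 TB of file data. */
    return 0;
}
```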
Research has shown that 'most' files are 'small', and hence 'most' files are accessed rapidly; this indirect addressing only comes into play for very large files. As a consequence, inode sizes remain small, most files are accessed very efficiently, and yet an extraordinary range of files are easily accommodated. Elegant!
Figure 10-34 on Page 790 also shows the data structures the kernel maintains when open() is called. An open file needs to have a file pointer to indicate where the 'next byte' to read is, but the obvious places to store this pointer do not work. It cannot be put in the inode itself, or weird things will happen if two processes have the file open() at the same time: each process will move the file pointer on each read, and hence each process will only retrieve part of the file.
The other obvious place for a file pointer is the file descriptor table that is created by open() for each process. This avoids the interference issue, but too effectively: if you wish to have cooperating processes all writing to the same file [as our threads do in program 3], we definitely do NOT want every process/thread to write their first line of output at file location 0 -- they would each write to the same place, and overwrite each other.
The solution is illustrated in Figure 10-34 on Page 790. Besides maintaining file descriptor tables for each process, the kernel also keeps a list of the attributes of each open file (essentially the information in the inode, flags that specify whether reading or writing is allowed, plus the current file pointer, and indications of which inode, on which partition, corresponds to this structure). As shown in Figure 10-34, this allows cooperating processes to share file pointers if that is wanted, and allows unrelated processes to maintain their own independent pointers. In the picture, for example, the parent and child processes might be cooperating in writing data to the file, while the third 'unrelated process' is reading the same file at its own pace.
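A minimal sketch of the sharing shown in Figure 10-34: a descriptor inherited across fork() shares one open-file-table entry, so parent and child advance the same file pointer and their writes land one after the other rather than on top of each other. The filename is illustrative.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd = open("shared.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    if (fork() == 0) {                   /* child writes first...               */
        write(fd, "child line\n", 11);
        _exit(0);
    }
    wait(NULL);                          /* ...and the parent appends after it, */
    write(fd, "parent line\n", 12);      /* because the file offset is shared   */
    close(fd);
    return 0;
}
```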
Returning to the concept of the number of links to an object in a file system: Removing an entry in a directory does not necessarily mean a file is deleted. If a file has 3 different names, then the data in that file is still present until all three references are deleted (a given reference is removed via the unlink() system call). Hence, the rm command removes the directory entry, but does not necessarily move the file's data blocks to the free list for reuse; it only does so if the link count of the file drops to zero. (Actually, it is a bit more complicated than that: if the particular file has been open()ed by one or more processes, then the data blocks are not freed until each of these processes has close()d the file.)
Every directory contains a '.' entry, which means 'this directory'. The directory has a (name, inode) pair in the directory 'above' it, and hence each directory will always have a link count of at least 2. Every directory also contains a '..' entry which points back to its parent, so if a directory contains lots of subdirectories, its link count may be very large. (But a directory cannot be hard-linked with 'ln' -- doing that might cause a conflict in what its '..' entry means.)
4.1.6 File Operations (Page 272)
Pages 272-273 outline 11 common system calls for dealing with a typical file system. The names for accomplishing the basic tasks outlined will vary depending on the operating system. For example, UNIX uses the stat() system call for retrieving attributes of a file (the information in the inode). Modifying these values uses system calls such as chmod(), chown(), and utime(). utime() lets you lie about the last access or modification time of a file; however, UNIX maintains a third time stamp, specifying the last time the inode was modified [a change in inode attributes, not file data!], and this time stamp is not something that can be easily faked. (This is the Ctime attribute at the bottom of the table in Figure 10-33 on Page 788.) As mentioned before, deleting a file is done with unlink(). The renaming system call is, naturally enough, called rename(). Typically, this involves simply removing one entry from a directory and inserting another entry in a [possibly different] directory. Unless a file is being moved from one disk partition to another, no file data needs to be moved or copied; we just associate the inode with a new name in some directory [and therefore rename() is pretty much just a link() followed by an unlink()].
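A hedged sketch exercising those attribute calls on an illustrative file: stat() reads the inode, chmod() changes the permission bits, utime() rewrites the access/modification times (the 'lie' mentioned above), and rename() just re-links the name.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <utime.h>

int main(void)
{
    struct stat st;
    if (stat("notes.txt", &st) == -1) { perror("stat"); return 1; }
    printf("size %lld bytes, %lu links, mode %o\n",
           (long long)st.st_size, (unsigned long)st.st_nlink,
           st.st_mode & 07777);

    chmod("notes.txt", 0600);                  /* owner read/write only           */

    struct utimbuf times = { 0, 0 };           /* claim 1970 access/modify times; */
    utime("notes.txt", &times);                /* Ctime still records the truth   */

    rename("notes.txt", "notes.old");          /* new directory entry, same inode */
    return 0;
}
```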
4.1.7 An Example Program Using File System Calls (Page 273)
The exposition in Pages 273-276, and the sample code in Figure 4-5 on Page 274 are well worth studying. The program follows the standard UNIX conventions, exiting with 0 upon success, and with different positive numbers for each possible type of failure. A more user-friendly version might also send helpful error messages to stderr, but the exit codes are sufficient to determine what went wrong [assuming you know what the source code was]. Note also that the read() command will regularly return the exact buffer size, but will likely return a smaller number when EOF is reached, and then return zero when there is nothing but EOF left to read... because read() always returns the actual number of characters that have been read.
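This is not the book's code, but a stripped-down program in the same spirit as Figure 4-5: copy one file to another, exit with a distinct positive code for each failure, and rely on read() returning the actual byte count (0 at EOF).

```c
#include <fcntl.h>
#include <unistd.h>

#define BUF_SIZE 4096

int main(int argc, char *argv[])
{
    if (argc != 3)
        return 1;                               /* wrong argument count */
    int in = open(argv[1], O_RDONLY);
    if (in == -1)
        return 2;                               /* cannot open source   */
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1)
        return 3;                               /* cannot create target */

    char buf[BUF_SIZE];
    ssize_t n;
    while ((n = read(in, buf, BUF_SIZE)) > 0)   /* n < BUF_SIZE near EOF */
        if (write(out, buf, n) != n)
            return 4;                           /* write error          */
    close(in);
    close(out);
    return (n == 0) ? 0 : 5;                    /* 0 on clean EOF       */
}
```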
4.2 Directories (Page 276)
You should already be familiar with the way directories are laid out, so we won't spend much time on this...

Book Notes
Ch 4.1 -4.2 (page 263 – 281)


Video Notes

File System Implementation

4.3 File System Implementation (Page 281)
Having discussed things from the user's point of view, we switch to system issues.
4.3.1 File System Layout (Page 281)
Most storage devices allow the computer to boot from them, via the MBR [Master Boot Record], stored in the first 512 bytes of the device. A typical motherboard has enough smarts to read from the beginning of the device, put the data into memory, and begin executing this tiny little program. The job of the little program [the bootstrap loader, hence the name] is to bring in a larger program (perhaps the operating system, or perhaps a more sophisticated program that can load the operating system, or perhaps a program that allows you to choose which operating system bootstrap loader to bring in). The BIOS allows the user to specify which storage device [floppy, hard disk, CDROM, etc.] should be used to boot.
For disk drives, partition information follows the MBR. Disks can be logically carved into pieces, with each piece behaving like a separate disk drive. (So, there is a 512-byte boot block at the beginning of each partition. The MBR program of a disk drive examines the partition information, locates the partition that is marked as active, and drags this new little program into memory and control jumps to its entry point.) This new little program is then responsible for loading the operating system associated with this partition. Because they are independent, different partitions may well have very different file systems on them [e.g., linux on one and Windows on another].
IDE disks were originally set up to have four partitions; disks were so limited in space that it seemed absurd to carve it up any further. As capabilities grew and disk size expanded dramatically, the 'last' (fourth) disk partition was turned into an 'extended' enclosure that could hold many more 'logical' partitions, allowing a total of more than four.
The layout of a disk is shown in Figure 4-9 on Page 282; the disk partition with expanded details at the bottom of the figure shows typical structure. Some of the exposition below is illustrated in Figure 10-31 on Page 785, which gives the high-level details for a UNIX file system / partition.
The first block of a partition is a boot record for that partition [the Master Boot Record is for the entire disk; each partition also has its own boot record]. The second block is the 'superblock', containing essential information about the file system. (For example: what kind of system? FAT, NTFS, Linux ext2, BSD Fast File System, etc. Each type is identified by a unique one-byte number. Another parameter records how many disk blocks are in the partition, and other parameters vary by file system type.)
In Tanenbaum's Fourth edition, ONE sentence mentions GPT (Page 378); UEFI gets only half a sentence more (Page 893). We will go into quite a bit of detail, since this is practical knowledge relevant to your laptop or desktop.
The classical IDE partitioning scheme has been around for 40 years, and it is showing its age. The original cylinder-head-sector (CHS) addressing that reflected the physical locations of data blocks on a stack of spinning disk platters was limited (on IDE drives) to about 31.5GB. Using Logical Block Addressing (LBA) allowed these 26-bit specifiers to be expanded to 32 bits, which (with 512-byte sectors) accommodates drives up to about 2TB in size, but this is the best you can do with the traditional MBR partition tables. Drives have now exceeded this size, so (for multiple reasons) a new partition table standard is being adopted. This new standard can handle drives in the ZB range (around 2^73 bytes), which seems outlandishly huge, but then again, TB sounded absurdly improbable thirty years ago.
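Checking the 2TB figure quoted above (a 32-bit LBA with 512-byte sectors):

```c
#include <stdio.h>

int main(void)
{
    unsigned long long sectors = 1ULL << 32;        /* 2^32 addressable sectors */
    unsigned long long bytes   = sectors * 512;     /* 512 bytes per sector     */
    printf("max MBR-addressable size: %llu bytes (~%llu GiB)\n",
           bytes, bytes >> 30);                     /* about 2 TiB              */
    return 0;
}
```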
Recall that the MBR (Master Boot Record) is always stored in the first sector (512 bytes) of the disk, and contains machine-code instructions for getting the boot process going: the BIOS loads this into RAM, and starts executing the code that was loaded. The new GPT layout retains this feature, so that a drive formatted in this new way can still be used on older systems that don't understand the new format. (It's now called a "Protective MBR", because the new setup makes it less likely that old, dumb disk utilities will inadvertently overwrite the new format.) With GPT, the first-stage bootloader code that is stored in the MBR is smart enough to handle the new GPT partitions. However, this code is only loaded and used by systems that are still using the BIOS to boot. Modern systems are replacing the BIOS with UEFI, which doesn't rely on a boot sector (see below...)
The 'old' first-stage bootloader code in the MBR could not recognize partitions nor navigate filesystems (which is not surprising; ALL the code had to fit in 512 bytes!). The MBR had instructions like: go to cylinder X and use read head Y to get sector Z and put it in memory. This sector (or series of sectors) contained the code for the second-stage bootloader, which *could* identify partitions and navigate the relevant type of filesystem, and thus find some huge file containing the operating system and load it into memory (e.g., /boot/vmlinuz on Linux systems). [Note that if you for some reason move the location of that second-stage bootloader, you must run a utility that can rewrite the MBR so that the new cylinder/head/sector location is specified in those 512 bytes.]
In the new UEFI (Unified Extensible Firmware Interface) standard, the MBR isn't even used. The new firmware contains enough smarts to negotiate some types of filesystems (such as FAT32) and directly locate the latter-stage bootloaders by parsing directory entries until it finds the designated set of data blocks. Pathnames (such as \EFI\BOOT\BOOTx32.EFI) are stored in firmware variables, which you can view and change right after turning the computer on (in much the same way you would enter the BIOS on older systems).
The Firmware Interface includes many magical capabilities. It can be queried over the network after power-on, even if the system has not had an OS installed on it yet (or if the OS won't boot), allowing diagnostics and recovery services to be performed remotely. (This ability is turned off after control is handed over to a running OS, so it is not quite the gaping security hole you might imagine.) UEFI also includes fancy graphics support and other device drivers that an awakening OS can use until it loads its own device drivers. It also has a 'Secure Boot' capability which can ensure that only 'acceptable' OS loaders are allowed to function. (There is great controversy over this 'feature', as it can be used to ensure that a WinBlows laptop is transformed into a non-functional brick if you attempt to replace the Microsoft Abomination by a decent OS.)
UEFI is an outgrowth of Intel's EFI, and is managed by a consortium of interests, with the goal of having a common platform that all the hardware and software vendors can support (hence the "Unified" in UEFI). But vendors are free to create additional applications that can be attached to the basic firmware interface to give it additional, perhaps unique, capabilities (hence the "Extensible" in UEFI). The standard firmware chip on the motherboard has enough smarts to load these sorts of additional modules from any other non-volatile source. You can think of the OS's boot loader (such as \EFI\BOOT\BOOTx32.EFI) as fitting into this category, but the vendor can also add things like a graphical interface to present the user with various choices before booting the OS, and so on.
For a UNIX file system, the superblock specifies how many inodes are present [from which it can be calculated where the inode array begins and ends], how many data blocks there are for use, and where the list of free blocks begins. If the superblock is destroyed or is unreadable [perhaps due to a head crash], the file system becomes completely unusable, since the operating system no longer knows where to find the necessary data structures [inodes, free list, etc.] For this reason, modern UNIX file systems replicate the superblock in multiple places, so that if the primary superblock is scragged, the backup information has a chance of being accessed. To increase the chances of recovery, duplicate superblocks are typically placed on separate cylinders, since a head crash usually only kills one cylinder.
Classical UNIX filesystems followed the structure outlined above. The Linux ext2 filesystem makes use of many of the features of the Berkeley Fast Filesystem [discussed below]. In particular, it splits up the partition into several 'mini-partitions' [confusingly called 'blocks'], so that inodes can be 'closer' on the disk to their corresponding file data blocks, which should tend to reduce the average distance the read-head armature has to move to access the file. Newer Linux filesystems (ext3 and ext4) increased the allowable size of partitions, files, and directories [drastically!] to a million Terabytes (i.e., 1EB), 16TB files, and 64,000 subdirectories (in ext4), and introduced journaling (discussed later).
UNIX was originally developed at AT&T, which the U.S. government [at the time] had granted a telephone monopoly, and the government therefore prevented AT&T from entering into new lines of business [such as computers]. What they could do is make it available cheaply to universities, which used it as a basis for operating system research. The University of California at Berkeley research group made remarkable improvements, leading to BSD [Berkeley Software Distribution] UNIX, which likewise was distributed to universities for further improvements.
One improvement was a radical redesign of the file system, which we will discuss next. Continuing with the AT&T thread first: When AT&T was broken up into a long-distance company and the 'Baby Bells' and was no longer a monopoly, they 'won' the right to enter new markets, and marketed computers running their version of UNIX. Having been a monopoly, they were not particularly good at marketing to non-captive consumers, and they soon exited the computer business. Ironically, computers and the internet are now putting the long-distance phone companies out of business.
On to the Berkeley Fast File System and Linux implementation (Page 788/784): In any given filesystem, inodes are all the same size [64 bytes in System V, 85 in Linux; see the figure on Page 793/742], to allow them to be easily accessed as an array (if you want inode number 23, start looking 64*23 bytes from the beginning of the inode array in System V, or 85*23 bytes for Linux). This is important for efficiency, since we always want to look up the file data based on the inode.
System V UNIX [from AT&T] had a similar regular structure for directory entries: 14 bytes for each filename and 2 bytes for the corresponding inode number. Thus, file names were limited to 14 characters. BSD UNIX expanded this by allowing variable-length file names [up to a maximum of 255 characters, the largest that will fit in a 256-byte buffer with a trailing null character]. Figure 10-32 on Page 791/787 shows how this is accomplished: the "entry size" field says how far to jump to get to the beginning of the next entry. (There is an oddity in this figure in the Third edition: the inode number for colossal [19] is the same as the inode number for voluminous, which would indicate that these are two names for the same file; we could then infer that the link count for this file is [at least] 2. However, this is not what the author intended: the text at the bottom of Page 791/787 indicates that voluminous was supposed to refer to inode number 42, not 19. The '19' appearing just after 'colossal' should be replaced by '42'. [This has been fixed in the Fourth edition.]) The lower half of the figure shows what changes when the middle file is deleted. The attributes for this entry are not 'erased'; instead, the 'entry size' of the previous entry is just incremented so that it points to the beginning of the bigdir entry rather than pointing to the beginning of the voluminous entry. (Essentially, the middle entry is deleted from the linked list.)
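Here is a hedged sketch of walking BSD-style variable-length directory entries, where the per-entry length field says how far to jump to reach the next entry (and deleting an entry just grows the previous entry's length). The struct layout and names are illustrative, not the exact on-disk format.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct dir_entry {                 /* simplified, illustrative layout     */
    uint32_t inode;                /* 0 marks an unused slot here         */
    uint16_t rec_len;              /* bytes from this entry to the next   */
    uint8_t  name_len;
    char     name[];               /* not NUL-terminated on disk          */
};

void list_directory(const char *block, size_t block_size)
{
    size_t off = 0;
    while (off < block_size) {
        const struct dir_entry *e = (const struct dir_entry *)(block + off);
        if (e->rec_len == 0)
            break;                 /* malformed block; stop               */
        if (e->inode != 0)
            printf("inode %u  name %.*s\n", e->inode, e->name_len, e->name);
        off += e->rec_len;         /* the jump skips any deleted space    */
    }
}

int main(void)
{
    uint32_t storage[16] = {0};    /* a fake 64-byte directory "block"    */
    char *block = (char *)storage;
    struct dir_entry *e = (struct dir_entry *)block;
    e->inode = 42; e->rec_len = 64; e->name_len = 10;
    memcpy(e->name, "voluminous", 10);
    list_directory(block, 64);
    return 0;
}
```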
Searching for a filename in a directory involves going through the directory entries, comparing the names you find until there is a match. Consequently, using a linked list is not really going to slow you down: we never ask for "the third file listed in the directory", so there is no real benefit to having fixed-length directory entries [apart from saving a tiny bit of space for the links]. In contrast, we ALWAYS looked into the inode list by the inode-number, so the fixed-length array structure for the list of inodes was critical for efficiency.
You will find that if you do an 'ls' in a thousand-entry directory, the response will be sluggish, because the system has to work its way through the entire linked list. The next time you access that directory though, the response is much faster. This is because filenames are cached [much like the memory address cache we considered earlier]. Each file that is referenced is put in a cache of pathnames [kicking out some less-used name]. Every time a path is needed, the system first looks in the cache to see if it already knows the inode number, resulting in a significant speedup for cache hits. Newer filesystems, such as ext3 and ext4 used in linux, use the HTree data structure, where names are hashed to allow for much quicker lookup.
Figure 4-29 on Page 319 illustrates the placement of the inodes on a disk [apparently a disk with only one partition]. Putting all the inodes at the beginning of the disk means that the head assembly must travel to the very edge of the disk repeatedly, and since head movements are slow, this is not good for performance. An easy improvement is to put the inodes in the 'middle' so that half the disk blocks are in front of them, and half are behind them. For a slight increase in complexity in code, the average head travel has now been cut in half. Of course, an even better arrangement is to split up the inode array into many pieces, so that the inodes are more likely to be near the relevant data blocks. This improvement [from the BSD guys] is illustrated in the (b) part of Figure 4-29 on Page 319 [showing four 'cylinder groups']. The operating system tries to put the blocks of a file within the same cylinder group as the file's inode, so that the corresponding data is 'near' the inode.
BSD UNIX also implemented a [reasonably] simple way to drastically cut down on internal fragmentation [even if a file consists of only one byte, it must still take up an entire data block]. BSD UNIX has two block sizes: 'regular' data blocks can be carved up into 8 equal-sized pieces, so that a one-byte file will only waste 1/8 of the space that would otherwise be required. This turns out to be a very good idea, since the majority of files in most systems are 'small'. Large blocks are still useful, since for very large files, it is much more efficient to have a few large blocks rather than numerous small blocks.
So, in addition to the convenience of longer file names, BSD innovations made the file system much faster with two principal changes: distribution of the inodes throughout the partition, and the use of two block sizes. Just like Linux uses a penguin for its mascot, BSD used a daemon with a pitchfork; with the introduction of the BSD Fast File System, the daemon icon image included track shoes :-)
The structure of the UNIX filesystem has various implications about what happens when files are created, moved, or deleted.
4.3.5 Log-Structured Filesystems (Page 293)
The basic idea is that the file system is one big log: nothing is in a fixed place, all there is is a sort of stream-of-consciousness time-ordered list of the changes that you want to make to the filesystem. And that's all there is. If you want to update the information in a file, you leave the old information exactly where it was in the log, and then later on in the log you record what new information you are going to put in place of that old information.
To someone at my level of intelligence, it seems like only someone really stupid [or, as it turns out, someone really, really smart] could possibly think this is a viable idea. The standard UNIX filesystem is hard to improve upon, but this scheme [with a few tweaks] outperforms it.
Here's the reasoning that led to this: Nowadays, the speed at which you can read data from a disk doesn't matter very much, so related data can be spread all over the place without much of a performance penalty. We've reached the point where many read requests do not result in a disk access -- due to the various disk caches at our disposal.
On the other hand, we frequently do lots of tiny little writes in unrelated places. If we make a new file, we have to write info into an inode for the new file, we have to update things like the list of free inodes to reflect this change, as well as update the list of free data blocks, since now we have allocated some data blocks for this new file. On top of that, this file has a name, which must be inserted into a directory, which means writing onto a directory block AND updating the inode for the directory itself (we changed data in the directory, so the 'last modified' time stamp has to be updated). All those little writes [FIVE of them] had to happen, just to create a new file.
So, even though we may be able to write *contiguous* data at high speed on the disk, each one of the changes described above is happening at a different place, and so the effective rate at which we can write can drop below one percent of the optimal write rate.
In a log-structured file system, we simply save up those five changes, along with lots and lots of other changes that happen about the same time, and just blast a log entry [called a 'segment'] describing all those changes [contiguously, at 100% write efficiency] onto the disk. A little while later, we write out another log segment, just behind this one. This is [almost] the only writing we ever do to the disk, so it goes fast. Reads go slower, of course, since the data is all strung out, but since we don't do reads as often as we did 'in the old days' [due to efficient caching], this is not as important as making the writes efficient.
Still, this sounds truly crazy, and if that's all there were to the scheme, it definitely would be crazy.
Recall that one of the efficiencies of 'traditional' structure of the UNIX file system was that the inodes were truly index-nodes; if you knew the inode number [which was always directly accessible, right there in the directory listing], then you could immediately calculate the exact location on the disk of the inode, since the inodes were put in an array of fixed-size structures. With this log-stuff, the inodes are strung out all over the place, and you'd basically have to search every single log segment before you were confident you had the most up-to-date data for any given inode. That would be a complete efficiency-killer, so a separate array [an inode-map, indexed by inode number] is maintained on the disk. When you need inode 23, you go to item 23 in the inode-map, and this points you to the exact location on the disk where the most recent entry for this inode is, and you then just go there to read the inode. The inode itself has the usual data-block pointers, and these pointers point you to the blocks of data [in various parts of the log file] that contain the data for this file.
Of course, if we change something about inode 23, then that new version of the inode will be written out [along with lots of other data from recent transactions], somewhere in the log file. We then update location 23 in the inode-map to point to the new home of the 23rd node. This inode-map is kept on the disk (so we don't lose track of everything the next time we power down), but it's also cached, so that when we need to know where an inode lives, we don't need extra disk accesses to find it.
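An in-memory toy version of the inode-map idea (all names and sizes are illustrative): writing a new version of an inode appends it at the log tail and updates the map, and finding an inode is a single array lookup.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_INODES 1024

static uint64_t inode_map[MAX_INODES];   /* inode number -> address in the log */
static uint64_t log_tail = 0;            /* next free position in the log      */

/* "Write" a new version of an inode: it lands at the tail of the log, and the
 * map is updated to point at this newest copy. */
void write_inode(uint32_t ino, uint64_t inode_size)
{
    inode_map[ino] = log_tail;           /* remember where the new copy lives  */
    log_tail += inode_size;              /* the log only ever grows forward    */
}

/* Find the current location of an inode with a single array lookup. */
uint64_t locate_inode(uint32_t ino)
{
    return inode_map[ino];
}

int main(void)
{
    write_inode(23, 128);                /* first version of inode 23             */
    write_inode(23, 128);                /* updated version, further down the log */
    printf("inode 23 now lives at log offset %llu\n",
           (unsigned long long)locate_inode(23));
    return 0;
}
```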
The result is a new way of organizing the UNIX filesystem; instead of the inodes being in one fixed structure on the disk, they are scattered, and move around each time we change any field of the inode, but we have the inode-map to make them very easy to find, even though they tend to wander around! Directory data blocks, by the way, retain the same structure as in the traditional system (a bunch of inode/name pairs).
Unlike the traditional organization, writing out the changes go to contiguous blocks, so they happen very efficiently. We start at the 'beginning' of the disk and just keep writing the log, block after block. Of course, we eventually get to the end of the disk, and what do we do then? Well before we get to the end of the disk, a 'cleaner' thread is awoken periodically to do a sort of 'garbage collection', although this is better thought of as 'compaction'. Each time we write out a new version of inode 23, there is no need to keep the old version; the 'stale' inode information is just discarded rather than copied. Similarly, each time we change the data in a file [or delete a file], some old data blocks become irrelevant.
The cleaner goes through a log segment, and sees what inodes and data blocks are still being used; the stuff that's still relevant goes into RAM, and will be written out in the next log segment as though they were new 'changes'. Then the segment that was just examined by the cleaner is marked as unused space on the disk, which will at some point be written over by some new log segment. The disk is one big ring buffer, sort of like what we saw in the producer/consumer algorithm. The 'producer' is the thing writing out new log segments periodically, and the 'consumer' is the cleaner thread that reads old segments (and occasionally gives the producer some new things to write out).
There is a bunch of bookkeeping going on here, but tests show that it is definitely worth the effort. If the cleaner finds that an inode in the segment it is examining is still in use [that is, if the inode-map entry points to this segment], that inode winds up going into RAM and then gets written out as part of the next log segment, and the inode-map then gets updated to this new location, even though no actual data in the inode has changed. Similarly, if a data block in the segment the cleaner is examining is determined to still be relevant, it goes into RAM, and gets written out in the next log segment (which means that the inode for this data block has to be rewritten, to record the new location of the data).
To make it easier to recover from crashes and power outages, many types of file systems implement some sort of journaling: they first write to disk what changes they are about to make to the filesystem, then they make the changes, then [once it is clear that the data structures have retained their integrity], the journal entry is erased. If something like a power failure occurs, we don't have to check the entire file system to recover; instead, you just look at the journal entry, and check exactly those parts of the filesystem data structures that might have been left in a weird state when the disaster (kernel panic, power outage, etc.) occurred. Once any such incongruities are fixed, you're back in business.
A log-structured file system IS its own journal (we just never erase the transactions once we complete them). If there's a power failure, we just find the end of the journal, and realize that the last block or so that it tried to write might have been corrupted, but all the previous stuff should represent valid log entries.
As you might have noticed, a log-structured file system can be a good fit for flash drives! We pretty much write from one end of memory to the other, and then start over from the beginning, so wear-leveling is basically built in to this scheme. [Having that virtual layer is still a good idea, since the inode-map portion of the filesystem is going to get very heavy use.] Since we treat the drive's pages very much like a circular buffer, those gremlins won't have to shuffle things around as much as with other types of file systems.
This is not the greatest scheme to use if you have a drive that is almost full [that is, a drive for which the still-relevant data almost fills the entire drive]. In that situation, most of the stuff the cleaner finds is still in use, so each time the cleaner looks at an old segment, we effectively read almost the entire segment into RAM and then write it back out into a new log segment. In a more traditional file system, things that don't tend to change [like your operating system files, libraries, man pages, binaries, etc.] stay put on the disk. With a log-structured file system, they periodically get reshuffled. Thus, having a log-structured file system on an almost-full flash drive is going to wear out the drive much sooner than you would expect, since EVERYTHING gets rewritten frequently [because the cleaner will have trouble staying ahead of the segment-writer, and will have to 'recycle' segments a lot]. If you have a Solid-state drive on your laptop, try not to let it fill more than 75% of the space; that should ensure that the drive will last well beyond the time you're ready to discard your system.
With a log-structured filesystem, writes are now very efficient since they are all contiguous. 'Traditional' UNIX tries to arrange things so that reads are contiguous, but [if you edit portions of a file multiple times] data blocks for a file can get scattered over many log segments. On edoras, this is not much of a concern, since Ron Nash uses two huge solid-state drives to cache 'everything' that has been read [as well as a lot of stuff the OS anticipates may soon be read]. Accessing data from the solid-state memory is almost as fast as accessing it from RAM, so the fact that data [stored on the 7 terabytes of spinning disks] may not be contiguous is mostly moot. This is an example of what Tanenbaum observes on Page 293: it is now possible to satisfy a very substantial fraction of all read requests directly from the file system cache, with no disk access needed. So, a log-structured filesystem sacrifices some read efficiency [which is no longer very important] for a HUGE increase in write efficiency.
As long as we're on the subject of edoras and SSDs: modern operating systems cache the most frequently used disk blocks as well as the most recently used disk blocks in RAM, and hence many read operations can be satisfied immediately rather than waiting for a [very slow!] disk fetch. Of course, when RAM fills, some of these cached blocks have to be thrown overboard, and future reads to any such abandoned block would again have to wait to be retrieved from disk. (A block that was kicked out of the cache but later discovered to be useful is called a 'ghost'.) To minimize the number of ghosts on edoras, a pair of SSD drives are used to form a second-level cache -- blocks that have to be removed from RAM are instead transferred to these SSDs (typically displacing some really old block in the SSD). Since the SSDs are almost as fast as RAM, but hold a lot more data than RAM, the cache of immediately-available blocks is thereby expanded dramatically. Thus, when you attempt to read something on edoras, the OS first sees if what you want is already in RAM, and if not, it checks if it is already in the SSD cache, and only if both these checks fail does your process have to undergo a context switch while it awaits an actual fetch from the spinning disk drives. The result is that read latency (after the cache has 'warmed up') is an order of magnitude better (e.g., the average read time might be twenty times faster than without the cache).
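A hedged sketch of that two-level read path, with toy one-entry 'caches' standing in for the RAM cache and the SSD cache; the eviction from RAM into the SSD is the part that mirrors the edoras setup described above.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Two toy one-entry caches standing in for the RAM cache and the SSD cache. */
static long ram_block = -1, ssd_block = -1;
static char ram_data[BLOCK_SIZE], ssd_data[BLOCK_SIZE];

static void disk_fetch(long block_no, char *buf)    /* stand-in for a slow disk read */
{
    snprintf(buf, BLOCK_SIZE, "block-%ld", block_no);
}

void read_block(long block_no, char *buf)
{
    if (block_no == ram_block) {                    /* 1. RAM cache hit: fastest       */
        memcpy(buf, ram_data, BLOCK_SIZE);
        return;
    }
    if (block_no == ssd_block)                      /* 2. SSD cache hit: still fast    */
        memcpy(buf, ssd_data, BLOCK_SIZE);
    else
        disk_fetch(block_no, buf);                  /* 3. spinning disk: process waits */

    if (ram_block != -1) {                          /* evict the old RAM block...      */
        ssd_block = ram_block;                      /* ...into the SSD cache instead   */
        memcpy(ssd_data, ram_data, BLOCK_SIZE);     /* of discarding it as a 'ghost'   */
    }
    ram_block = block_no;                           /* newest block goes into RAM      */
    memcpy(ram_data, buf, BLOCK_SIZE);
}

int main(void)
{
    char buf[BLOCK_SIZE];
    read_block(7, buf);                             /* miss: fetched from 'disk'       */
    read_block(7, buf);                             /* hit in the RAM cache            */
    printf("%s\n", buf);
    return 0;
}
```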

Geeks For Geeks File Allocation
Ch 4.3 (page 281 – 299)

Video Notes

Example File Systems

NFS and CD
10.6.4 NFS: Network File System (Page 792)
Various companies and university research groups invented protocols for sharing files across separate computers, but the clear winner is Sun Microsystems' NFS protocol. It is similar in spirit to the file sharing across computers in the Microsoft Windows world, but much more robust. Clients can mount entire filesystems from other hosts [servers], so that they appear to be part of the local filesystem. If a server goes down, the files on the NFS filesystem will be temporarily unavailable, but once the server reboots, things pick up right where they left off, with no adverse consequences other than the fact that the read [or write] request was abnormally delayed. The protocol was designed to be 'stateless' -- that is, it works without the servers knowing what files are currently open, where the file pointer currently is [and so the server cannot 'lose' any such information when it crashes, since it is not responsible for maintaining that info in the first place]. Figure 10-35 on Page 793 illustrates how a portion of a filesystem can appear on multiple hosts, located along different [or in some cases the same] paths, depending on where it is mounted.
One of the motivations for NFS was the [former] high cost of disk storage: even a small disk drive could cost as much as the rest of the computer. If you had a bunch of identical hosts (all using the same type of hardware and the same type of operating system, like the ones Sun Microsystems sold), there isn't much point in having all the system binaries replicated on each host: all the executables could all be mounted read-only from a central server, saving lots of space. Indeed, Sun sold completely diskless workstations, which were entirely dependent on some external server for storage. The boot hardware for the diskless workstations began by issuing a call for help over the network to initiate a tftp [Trivial File Transfer Protocol] of the executable images it needed to begin booting. As part of the boot process, the diskless client would mount its entire file system from an NFS server (usually in at least two pieces: one mount for the common read-only binaries, and another mount for the data, such as users' home directories, which are specific to the particular diskless client).
A protocol is essentially a set of 'rules of engagement'. The NFS protocol specifies the format of the requests that clients send to the servers, and the nature of the replies that the servers send back in response. There are two main flavors: mounting requests and data requests. To request a mount, the client simply asks for some directory name on the server [using the path name where the server currently has the partition mounted]. The server will consult its database (usually stored in the file /etc/exports) to find out if this particular directory is allowed to be 'exported' [mounted on other hosts], and if this particular host is on the 'approved' list of clients it should respond to. If the request should be granted, the server returns a 'file handle' for the client to use. This file handle is essentially a magic cookie that the client will later present to the server each time it requests data from the server. Under the original protocol, a host snooping on the network could in effect intercept the magic cookie and gain unauthorized access to the server; an enhanced version called Secure NFS uses encryption to close this loophole.
Once an NFS filesystem is mounted, the client can request information from the server via Remote Procedure Calls [RPCs]. Requests for attribute information as well as read and write are supported. There are no RPC calls for open() or close(), however -- the server simply responds to requests, it purposely does not try to track state information. (To be sure, processes on the client that wish to read a file must locally request the file to be open()ed, but this request is not relayed to the server -- the client, not the server, is responsible for tracking the state information.)
Figure 10-36 on Page 796 illustrates the basic mechanism for a typical UNIX system. UNIX can mount native filesystems, Windows file systems, NFS file systems, and sundry others, and the entire menagerie is supposed to be presented as one coherent file system tree: processes do not need to know what sort of filesystem underlies the data. UNIX kernels implement what is known as a Virtual File System [VFS] layer to disguise the complexity. A typical system call [such as open()] just presents its request to the VFS, which then is responsible for sorting out any special issues for the particular file system. In the case where a request is made of an NFS file system, the NFS client code translates this request to the appropriate RPC calls to the server.
When a process open()s a file, how does the operating system know which type of system it will have to interact with? The open() call specifies a [relative or absolute] pathname, and as the path is parsed, the kernel will notice when it passes through a mount point, and switch to dealing with the type of file system that is mounted there. So, as a path descends into an NFS-mounted filesystem, the NFS client code takes over as it travels through the mount point, now requesting data from a directory on the server rather than on the local client.
If a user requests the first 100 bytes from an NFS-mounted file, the client anticipates further requests and actually asks for 8K of data from the server, since it is much more efficient to transfer large blocks. The client buffers the 'extra' data so that it can avoid some future network requests.
A similar strategy is used on NFS writes; the client buffers the writes until it has accumulated 8K of data, and then sends it to the server. Of course, if the process closes the file, the data is sent immediately, even if the buffer is not 'full'.
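A hedged, user-level sketch of that client-side buffering (the send_to_server() function is a placeholder for the WRITE RPC, not real NFS client code): small writes accumulate in an 8K buffer, which is flushed when full or when the file is closed.

```c
#include <stdio.h>
#include <string.h>

#define NFS_BUF 8192

static char   wbuf[NFS_BUF];
static size_t used = 0;

static void send_to_server(const char *data, size_t len)  /* placeholder for the WRITE RPC */
{
    printf("WRITE RPC: %zu bytes\n", len);
    (void)data;
}

void nfs_buffered_write(const char *data, size_t len)
{
    while (len > 0) {
        size_t room  = NFS_BUF - used;
        size_t chunk = len < room ? len : room;
        memcpy(wbuf + used, data, chunk);
        used += chunk;  data += chunk;  len -= chunk;
        if (used == NFS_BUF) {                  /* buffer full: one big RPC now    */
            send_to_server(wbuf, used);
            used = 0;
        }
    }
}

void nfs_close_flush(void)                      /* close(): flush even if not full */
{
    if (used > 0) {
        send_to_server(wbuf, used);
        used = 0;
    }
}

int main(void)
{
    for (int i = 0; i < 100; i++)
        nfs_buffered_write("tiny write ", 11);  /* 1100 bytes: stays buffered      */
    nfs_close_flush();                          /* a single RPC carries them all   */
    return 0;
}
```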
4.5.1/4.5.3 CD-ROM File Systems (Page 325)
Much like audio vinyl records, the original CD-ROM media was designed to be molded by a machine from a master, and hence it was a 'write-once' format. (Also like old record albums, the data follows one continuous spiral: contrast this to disk drives, which use concentric tracks, not spirals.) Recall that with the UNIX filesystem, we had to keep track of which blocks were in use, and which were not [ditto for inodes]; the original molded CDs did not have to contend with this problem, since nothing could be changed on the disk.
Injection-molded media has pits and lands which, when a laser beam is focused on them, reflect dimly or brightly, allowing the disk to be 'read' as a sequence of zero/one and one/zero transitions. The CD-Recordable [CD-R] media eventually came along, which created bright/dim via embedded ink: Initially, the blank disk was all 'bright' spots, but a high-power laser could 'darken' some areas, giving whatever pattern of zeroes and ones you wanted. There was no good way to 'erase' the dark spots, so this was effectively a write-once media. (More correctly, it is an append-only media: data could be written in multiple sessions by finding the end of the previously written data, and then laying down more data after that.) Eventually, CD-RW media was perfected, where you could indeed overwrite previously-written data multiple times.
ISO 9660 File System (Page 326)
The one standard format that every device can handle is ISO 9660, which grew out of the 'High Sierra' standard agreed on by a consortium of companies. Indeed, it was designed to be a lowest-common-denominator that every OS, even lowly MS-DOS, could handle. Each logical sector contains a 2K data portion and about 100 bytes of error correction and overhead. 75 such blocks encode one second of music, so skipping 1 minute of a track amounts to jumping over 60*75 blocks [much like picking up a record needle and moving it to the start of a new song on an old vinyl album]. Mangling a few bytes from a music file is almost undetectable to human ears, so not much error correction is needed for music CDs; we'll see that MUCH more is needed to reliably read data CDs.
The ISO 9660 standard leaves 16 'undefined' blocks at the beginning of the spiral; many software disks have a bootstrap program embedded here, and the BIOS booting sequence often looks here to boot the computer from an inserted CD-ROM. The next block is the primary volume descriptor, containing various identification strings, but most importantly containing the block address of the root directory, which is then the starting point for locating other files and directories on the volume.
The format of a CD directory entry is shown in the figure on Page 328. Unlike the complex set of data pointers in an inode in UNIX, the entire file can be described with just two numbers: the starting address and the size of the file. This works because files are stored contiguously, not in 'chunks' distributed over the disk -- an advantage of having a write-once media where you can plan the layout exactly before you begin writing anything. Because the directory must be written before the data files, this means that [for the ISO 9660 standard] you really have to 'know where everything goes' before you start writing: all the fields in the directory entries must be calculated before you write that directory. (We'll see later how this constraint is overcome in subsequent standards.)
Every parameter representing a number in the directory [such as the starting address and the size of a file] is stored twice, once in little-endian and once in big-endian, so that every architecture can read the parameters easily. This was one of the many compromises needed to get everyone to agree to the standard. The ISO 9660 standard is really three standards in one: Level 1 limits filenames to the most brain-dead level: 8 characters plus a 3-character extension. Level 1 is the only format that MS-DOS can handle. Level 2 allows filenames to be longer [up to 31 characters]. Level 3 allows files to be non-contiguous [which seems goofy, since you can easily arrange any write-once data to be contiguous]. Level 3 *is* useful, though: if you have some files that have repeated chunks of data, you only have to write that common data once, and then just arrange for several files to repeatedly reference the same place on the disk.
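A small sketch of the both-endian trick: the field holds the same 32-bit value twice, little-endian first, then big-endian, so any CPU can decode it without byte-swapping headaches. The field layout here illustrates the idea rather than parsing a full directory entry.

```c
#include <stdio.h>
#include <stdint.h>

/* Decode the little-endian half of a both-endian 32-bit field. */
uint32_t read_both_endian_32(const uint8_t *field)
{
    return  (uint32_t)field[0]        |
           ((uint32_t)field[1] << 8)  |
           ((uint32_t)field[2] << 16) |
           ((uint32_t)field[3] << 24);
    /* bytes 4..7 hold the same value again, in big-endian order */
}

int main(void)
{
    /* the value 0x00012345 stored both ways, as it would appear on the disc */
    uint8_t field[8] = { 0x45, 0x23, 0x01, 0x00,   /* little-endian copy */
                         0x00, 0x01, 0x23, 0x45 }; /* big-endian copy    */
    printf("extent starts at block %u\n", read_both_endian_32(field));
    return 0;
}
```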
Each directory entry accommodates a 'System Use' field, to allow for extensions to the basic ISO 9660 standard. The two most common extensions are Rock Ridge, which emulates the attributes of UNIX files, and Joliet, which emulates the things important to Microsoft [such as the Unicode character set, needed for internationalization].
The Rock Ridge extension allows each directory entry to have additional fields, specifying things like where the three time stamps UNIX expects are to be stored. Filesystem nodes can be specified to be soft links, and each file and directory is given rwx permissions for user, group, and other. In addition to having a short 8+3 name, each file can also have a 'long' alternate name, which can handle any legal UNIX name. Since the ISO 9660 standard limits directory depth to only 8 levels, there are also some fields to 'relocate' a directory to a different place in the filesystem, so that arbitrary structures can be expressed.
Systems and devices that have not been programmed to recognize the extensions in the 'System Use' field just see a normal CD, since all the other directory fields are unchanged. Thus, the data within a UNIX file can be faithfully copied to a CD [resulting in 8+3 filenames], but things like the file owner, permissions, and other attributes that are stored in the inode might be 'lost'. Rock-Ridge-aware systems, however, can copy a UNIX filesystem onto a CD, and the inode info will be encoded into the 'System Use' field of the directory entries. The CD can then be taken to another UNIX host, and the filesystem will be faithfully reproduced [since both the data and the inode attributes, long file names, etc. can be extracted]. Similarly, Joliet-aware systems can faithfully represent Windows filesystems on a CD.

Ch 4.5.3 (page 325 -329)
Ch 10.6.4 (page 792 – 798)

Video Notes
