Prof Notes
Chapter 4 File Systems (Page 263)
Processes store data in memory, but since memory is volatile and [relatively] small, a persistent storage system is needed. Units of storage are known as files; the files and their method of organization are known as the file system. Some UNIX-specific information from Chapter 10 is also considered here.
4.1 Files (Page 265)
4.1.1 File Naming (Page 265)
Names are associated with files so that they can be accessed by processes other than the process that originally created the data. Different file systems place different restrictions on the naming conventions, depending on the scheme. In file systems with directories [i.e., most systems], the directory-separation character ['/' in UNIX, '\' in Windows] cannot be part of a file name, or the path names could not be parsed reliably. Some systems insist on extensions [MS-DOS, for example, requires a '.' near the end of the name, followed by up to 3 more characters], and others [UNIX] do not. There isn't much that is disallowed in a UNIX filename beyond '/' and the null character; it is legal to put a '<' in a filename, but most users avoid this, since '<' means redirection to the shell, and so users must type '\<' each time they want to specify a 'real' '<'. For similar reasons, '!', spaces, and wildcards are generally avoided in UNIX -- even though they are legal. Windows attaches meaning to file extensions; the extension determines which program is launched when a file is double-clicked. Relatively few UNIX applications are picky about the extension [or lack thereof]: rcs [the Revision Control System] requires repositories to end in ',v' and C compilers generally insist that source code be in a file that ends in '.c'. In UNIX, multiple extensions are allowed, and often give a hint as to the nature of the file: The file various.h.tar.gz,v indicates an rcs repository (,v) for a series of C header files (.h) that have been rolled into a tape archive (.tar), and then compressed with gzip (.gz).
Instead of relying on extensions, UNIX tends to place a 'magic number' in the first two bytes of a file (PostScript files, for example, begin with the ASCII codes for the characters "%!", and shell scripts begin with "#!"). Windows, in contrast, depends on a movie file having .mpg as its extension (see Figure 4-1 on Page 266).
4.1.2 File Structure (Page 267) Both UNIX and Windows regard files as merely a sequence of bytes, but older systems often organized data into records (See Figure 4-2 on Page 267).
4.1.3 File Types (Page 268) A file system may in fact support many different kinds of nodes [for example, directories, which organize the file system structure]. Data files are termed 'regular files', to distinguish them from more specialized files. UNIX also has named pipes, block-special files and character-special files. A named pipe is a FIFO queue that multiple processes can use (that is, one process can write to the pipe and another can read from it, much like the unnamed pipes that UNIX shells use). Character-special files are essentially portals to various hardware: if you send data to /dev/tty [redirect output there, or open() it and write() to it], it is the job of the UNIX device driver that monitors this filesystem node to ensure that the data appears on your screen. Block-special files are nodes in the file system that serve as portals to block devices, such as the physical disk drives.
'Normal' text files used to come in two flavors, encoded in EBCDIC [IBM's format, based on their punched-card encodings] or ASCII [the standard format used by everyone else]. Punched cards had 80 columns with 10 or so potential holes in each column. The card would be very flimsy if it contained too many holes, so the most-used characters were represented by punched patterns that consisted of only one or two holes. A blank was represented by not having any holes in the column [so a pristine card was essentially 80 blanks], the digits 0-9 were the one-hole patterns, and the upper and lowercase alphabetic characters were various two-hole combinations. Unfortunately, this meant that the binary codes for the characters were not very rational. If you increased the code for the letter 'A' by one, you did get the code for 'B', but increasing the code for 'I' does NOT result in the code for 'J'.
ASCII is a more rational scheme. The letters A-Z are assigned to 26 contiguous binary codes, and a-z are in another contiguous range, so converting uppercase to lowercase is simply a matter of adding a constant [namely 32] to any uppercase letter. ASCII also was designed with lexicographical order in mind, so the code for a space is smaller than any code for a letter. That way, when words are ordered based on their character encodings, 'to' will be earlier than 'top' like it is supposed to be in 'dictionary order'.
Once punched cards were obsolete, there was no reason to use the inconvenient EBCDIC encoding any more. Nowadays, a text file comes in only one flavor: ASCII. Line termination is really the only aspect that is not completely standardized. Some systems use the linefeed character, some use carriage return, and some use both.
Having an agreed-upon universal type has many advantages: all editors can handle this file, programs can pass data between them without worrying about encoding issues, etc. (Contrast this to a .doc file, which contains all sorts of formatting gibberish, making it hard to extract the actual text without special processing.)
Files which have a complex internal structure are known as 'binary' files [to distinguish them from text files]. Archives and machine-executable files also fall into this category. File-transfer protocols that transport files between hosts often have a 'friendly' feature that will, for example, turn the code for every linefeed [decimal 10] into the code for carriage return [decimal 13], to allow ASCII text files to be read correctly on a host that uses different conventions for line termination. However, using this feature on a binary file, such as an executable, would be disastrous: you may find that your 'constant' 10 has been turned into a 13, or [perhaps worse] some machine-code instruction has been transformed into an entirely different command. Thus, programs such as ftp have an ASCII and a binary transfer mode, and it is important to choose the correct one.
Figure 4-3 on Page 270 illustrates the structure of some binary files. On the left is the layout of a UNIX executable. It begins with a magic number; were you to transfer an executable from an Intel PC to a Sun Sparc machine [which use different types of CPUs], and try to execute it, the operating system would note that the magic number was inappropriate for this architecture and refuse to run it. Other important header data includes the entry point, which [when combined with the text size information] specifies the location of the first machine-code instruction to begin executing.
The second illustration (Figure 4-3(b) on Page 270) shows the layout of an archive, which is essentially a collection of .o files. These archives are wrapped up with some header information, saying where each module starts and ends, and are generally stored under /usr/lib. Once the compiler has taken your main.c and created a main.o from it, the linker takes the various .o files it needs from the archive, and combines them to form an executable a.out file. (For example, your programs usually use printf(), but you have not written the printf.c module yourself -- instead, this has been pre-compiled into a printf.o and stored in an archive.) In UNIX, archives have the extension .tar (or .tgz, if it's a compressed archive), and could be a compendium of any set of files (not just .o files).
The text discusses the strategy of using strongly-typed files on Page 269, wherein file extensions are mandatory and applications pathologically insist on having the proper extensions. Worse, the operating system will not allow you to change the type of extension (study the text examples on Page 269).
4.1.4 File Access (Page 270)
Initially, computers had to access information on storage devices by starting at the beginning and reading sequentially, even if the information you really wanted was near the end of the file. This reflected the physical reality of the way tape drives had to work. If you suddenly needed information that came before what you were currently reading, the only option was to 'rewind' and start reading from the beginning again. Once there were drums and disk drives, it became possible to go directly to the portion of the data that was desired [with a 'seek' operation that went to a specified place in the file]. This was termed random access [as opposed to sequential access]. Nowadays sequential access has fallen by the wayside, since random access subsumes both methods. The most common strategy is to read sequentially by default from wherever the current position pointer happens to be. The pointer is automatically updated as you read, and can be explicitly changed via the seek command to 'jump' to a non-sequential location within the file.
4.1.5 File Attributes (Page 271)
Besides the data that comprise a file, the file system records various attributes. Many of the possibilities are shown in the figure on Page 272. In UNIX, the name of the file is kept [only] in a directory, and additional attributes are all stored in the file's inode [short for "index node"]. The structures are described in Figure 10-33 on Page 788 and Figure 10-34 on Page 790, which we will consider next.
A few of the fields in Figure 10-33 require some explanation. The "Nlinks" field counts the number of times this inode is associated with a directory name. It is possible to have different directory names referring to the same inode, and hence to the same file. This is illustrated in Figure 10-24 on Page 777. Directory entries are essentially ordered pairs of names and inode numbers, and it is perfectly legal to associate several different names [perhaps in different directories] with a single inode number. This is done via the ln command [which uses the link() system call]. Directory entries such as these are known as 'hard links', because they create a new directory pair, effectively cloning an existing pair, keeping the same inode but potentially using a different name (though Figure 10-24 happens to use the name 'x' in both directory entries). The soft link is just a special type of file, and the data it contains is merely the path to the file to which you are linking. This is illustrated in Figure 10-24 on Page 777; note that this implies that the file system tree is not technically a tree. [Thus, a soft link is rather like a WinBlows 'shortcut' on your desktop.]
Most importantly, the inode lists the locations of the data blocks that comprise the actual file. If files were stored contiguously, then only the starting block and the size would be necessary to give the full location of the file. This, however, is a very bad idea on a file system in which files are created and deleted, leading to many small useless 'holes', and problems when a file needs to expand beyond its current size. (One of the places where contiguous file storage works well is CDROMs, since nothing is ever deleted, and hence no 'holes' can form.)
UNIX files do not need to be contiguous, so a list of individual blocks is needed. It is important for inodes to all be the same size, so that they can be efficiently indexed as an array of records. This rules out the option of making only some inodes large enough to point to thousands of blocks; every inode would have to have 'room' for thousands of addresses.
The UNIX solution is very elegant. The first ten pointers in the inode point directly to blocks of file data. The eleventh pointer points to a 'single indirect' disk block that does not contain file data; it contains a list of the next 1024 data blocks (indirect addressing, as shown in Figure 10-34 on Page 790; a clearer picture of this portion of the inode is shown in the figure on Page 325).
This is sufficient to handle files up to a size of 10+1024 blocks. If this is not enough, the 12th pointer is utilized, which points to a 'double indirect' block. The 1024 pointers in this block do not point to file data; each one of them points to a single indirect block, which in turn points to 1024 data blocks, giving a total capacity of over a million blocks. If this is still insufficient, the thirteenth pointer points to a triple indirect block, which allows for a billion more blocks. (The actual number of pointers in a data block depends on the size of the disk blocks; the examples discussed in the text assume 256 (not 1024) addresses can fit in one block.) These days, Linux tends to have 12 direct pointers instead of 10, so the math can be a little different on different flavors of UNIX.
Research has shown that 'most' files are 'small', and hence 'most' files are accessed rapidly; this indirect addressing only comes into play for very large files. As a consequence, inode sizes remain small, most files are accessed very efficiently, and yet an extraordinary range of files are easily accommodated. Elegant!
Figure 10-34 on Page 790 also shows the data structures the kernel maintains when open() is called. An open file needs to have a file pointer to indicate where the 'next byte' to read is, but the obvious places to store this pointer do not work. It cannot be put in the inode itself, or weird things will happen if two processes have the file open() at the same time: each process will move the file pointer on each read, and hence each process will only retrieve part of the file.
The other obvious place for a file pointer is the file descriptor table that is created by open() for each process. This avoids the interference issue, but too effectively: if you wish to have cooperating processes all writing to the same file [as our threads do in program 3], we definitely do NOT want every process/thread to write their first line of output at file location 0 -- they would each write to the same place, and overwrite each other.
The solution is illustrated in Figure 10-34 on Page 790. Besides maintaining file descriptor tables for each process, the kernel also keeps a list of the attributes of each open file (essentially the information in the inode, flags that specify whether reading or writing is allowed, plus the current file pointer, and indications of which inode, on which partition, corresponds to this structure). As shown in Figure 10-34, this allows cooperating processes to share file pointers if that is wanted, and allows unrelated processes to maintain their own independent pointers. In the picture, for example, the parent and child processes might be cooperating in writing data to the file, while the third 'unrelated process' is reading the same file at its own pace.
Returning to the concept of the number of links to an object in a file system: Removing an entry in a directory does not necessarily mean a file is deleted. If a file has 3 different names, then the data in that file is still present until all three references are deleted (a given reference is removed via the unlink() system call). Hence, the rm command removes the directory entry, but does not necessarily move the file's data blocks to the free list for reuse; it only does so if the link count of the file drops to zero. (Actually, it is a bit more complicated than that: if the particular file has been open()ed by one or more processes, then the data blocks are not freed until each of these processes has close()d the file.)
Every directory contains a '.' entry, which means 'this directory'. The directory has a (name, inode) pair in the directory 'above' it, and hence each directory will always have a link count of at least 2. Every directory also contains a '..' entry which points back to its parent, so if a directory contains lots of subdirectories, its link count may be very large. (But a directory cannot be hard-linked with 'ln' -- doing that might cause a conflict in what its '..' entry means.)
4.1.6 File Operations (Page 272)
Pages 272-273 outline 11 common system calls for dealing with a typical file system. The names for accomplishing the basic tasks outlined will vary depending on the operating system. For example, UNIX uses the stat() system call for retrieving attributes of a file (the information in the inode). Modifying these values uses system calls such as chmod(), chown(), and utime(). utime() lets you lie about the last access or modification time of a file; however, UNIX maintains a third time stamp, specifying the last time the inode was modified [a change in inode attributes, not file data!], and this time stamp is not something that can be easily faked. (This is the Ctime attribute at the bottom of the table in Figure 10-33 on Page 788.) As mentioned before, deleting a file is done with unlink(). The renaming system call is, naturally enough, called rename(). Typically, this involves simply removing one entry from a directory and inserting another entry in a [possibly different] directory. Unless a file is being moved from one disk partition to another, no file data needs to be moved or copied; we just associate the inode with a new name in some directory [and therefore rename() is pretty much just a link() followed by an unlink()].
4.1.7 An Example Program Using File System Calls (Page 273)
The exposition in Pages 273-276, and the sample code in Figure 4-5 on Page 274 are well worth studying. The program follows the standard UNIX conventions, exiting with 0 upon success, and with different positive numbers for each possible type of failure. A more user-friendly version might also send helpful error messages to stderr, but the exit codes are sufficient to determine what went wrong [assuming you know what the source code was]. Note also that the read() command will regularly return the exact buffer size, but will likely return a smaller number when EOF is reached, and then return zero when there is nothing but EOF left to read... because read() always returns the actual number of characters that have been read.
4.2 Directories (Page 276) You should already be familiar with the way directories are laid out, so we won't spend much time on this...