The Unix File System

Text files are files that consist entirely of human-readable (more or less) text, while binary files are files that encode data in a fashion intended only for interpretation by a machine.

1.1 File Names

Unix file names can be almost any length and may contain almost any characters. As a practical matter, however, you should avoid using punctuation characters other than the hyphen, the underscore, and the period. Also avoid blanks and non-printable characters within file names. Many of these characters have special meanings when you are typing commands, and so would be very hard to enter within a file name.

Some things to keep in mind about Unix file names that may be different from other file systems you have used:

Unix file names are often very long so that they describe their contents.² The rather perverse exception to this rule is that program/command names are, by tradition, very short, often confusingly so.

Upper and lower case letters are distinct in Unix file names. “MyFile” and “myfile” are different names.

Periods (“.”) are not treated by Unix as a special character. “This.Is.a.legal.name” is perfectly acceptable as a Unix file name. Many programs, however, expect names of their data files to end in a period followed by a short “standard” extension indicating the type of data in that file. Thus data files with names like “arglebargle.txt” for text files or “nonsense.cpp” for C++ source code are common.

By convention, files containing executable programs (such as the clpr and psnup programs that appear in later examples) generally do not receive such an extension.

1.2 Text Files

A lot of what we store in files is just text. Text is represented in files, much like it is stored in memory, by placing one character in each successive byte of the file. Of course, bytes actually hold numbers (in the range 0..255), so we use a character set mapping to assign each character a numeric value.

Since the mid-1960s, the dominant character set has been ASCII. ASCII encodes 128 different characters. Because 128 values need only 7 bits, you can technically say that it wastes one bit of every 8-bit byte. The characters encoded are

“Control characters” (numbers 0..31), which do not print as glyphs on the screen or paper but describe some other property of location or transmission. For example, the control characters include a “line feed” character used to mark the end of a line, a “form feed” character that marks the end of a page, a “carriage return” character that indicates that we want to return to the leftmost column, and a “tab” character to move some number of spaces to the right.
“Printable characters” (numbers 32..126), each of which is rendered as a symbol on a page or screen. These include
- Blank (number 32)
- Numeric digits 0..9 (numbers 48..57)
- Upper-case alphabetic letters A..Z (numbers 65..90)
- Lower-case alphabetic letters a..z (numbers 97..122)
- Various punctuation marks fill in the remaining slots. For example, the exclamation mark (“!”) is number 33.
The “del” character at 127 is sort of the odd man out. It’s not a printable character, but it’s not positioned with the “real” control characters either.
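
If you want to spot-check a few of these numbers yourself, here is a minimal sketch. It assumes a POSIX-style shell, where a printf argument that begins with a quote character is converted to the numeric code of the character that follows. The helper name code is made up for this illustration.

```shell
# code CHAR: print the numeric (ASCII) code of a single character.
# A leading quote in a numeric printf argument means
# "use the code of the next character".
code() {
    printf '%d\n' "'$1"
}

code ' '    # blank                     -> 32
code 0      # first numeric digit       -> 48
code A      # first upper-case letter   -> 65
code a      # first lower-case letter   -> 97
code '!'    # exclamation mark          -> 33
```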

So a data file containing a single line of text with the word “Hello” would actually be encoded in a file as

72 101 108 108 111 10

i.e., 72 is the ASCII code for ‘H’, 101 is the code for ‘e’, 108 the code for ‘l’, 111 the code for ‘o’, and 10 is the line feed (end of line) control character.
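
You can check this encoding for yourself. The following is a small sketch, assuming the standard od ("octal dump") utility is available. It writes the line to a scratch file and prints each byte as an unsigned decimal number. The variable names are just for illustration.

```shell
# Write the single line "Hello" to a scratch file.
tmpfile=$(mktemp)
printf 'Hello\n' > "$tmpfile"

# -An suppresses the offset column; -tu1 prints each byte as an unsigned decimal.
# The unquoted inner $(...) lets the shell collapse od's column padding.
bytes=$(echo $(od -An -tu1 "$tmpfile"))
echo "$bytes"        # 72 101 108 108 111 10

rm -f "$tmpfile"
```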

Here is a dump of the opening bytes of the text file from which this particular document was generated.

Compare the ASCII codes you see here to the opening paragraphs of this document.

1.2.1 What Can You Do With ASCII Text Files?

Certainly we will be able to do everything to a text file that we can do with generic binary files: copying, renaming, etc.

In addition, we will learn that Linux has quite a few commands for working with text, including commands for viewing, changing, editing, and measuring properties of text.

1.2.2 Unicode

Now, the 128 characters defined in ASCII are good enough for basic purposes, particularly if you speak (and write in) English.

But before long, pressure built to expand the available characters. Some of this pressure came from specialized applications. For example, there are numerous symbols in mathematics and the sciences that aren’t in the ASCII code. Even outside of technical fields, typesetters might desire specialized symbols like ➀ or ✓.

For a while, developers tried to stem the tide by defining character sets that used all 256 possible values of a byte, but even that was little more than a temporary respite.

Even more pressure came from different languages. Anyone whose native language was Spanish, for example, would regret the absence of characters like á or ¿. But that’s only the tip of the iceberg. Greeks and Russians have their own entire alphabets. And once we get past Europe, there are entire families of alphabets for Asian, Middle Eastern, and African languages.

Unicode originated in the late 1980s as a 16-bit character set and was expanded in the mid-1990s to a 21-bit code space, which provides for over a million possible characters. Unicode now incorporates not only the traditional ASCII characters (as Unicode values 0..127, so an ASCII text file is also a valid Unicode text file) but also lots of specialized symbols, many different international alphabets, and, for better or worse, emoticons and emoji.

Complicating matters, Unicode allows for a variety of different ways to arrange the numbers within a stream of bytes. These are called character encoding schemes, or “encodings” for short. For example, one encoding, UTF-32, stores a single (21-bit) Unicode character in a block of 4 bytes (32 bits). This is fairly simple, but if 99.9% of your characters are actually drawn from the original ASCII set, then this wastes nearly 75% of the file storage. Another, more popular scheme, UTF-8, stores each character from the original ASCII set in a single byte (so a pure ASCII file is already valid UTF-8), but uses byte values outside of the ASCII 0..127 range to signal a non-ASCII Unicode character that needs 2, 3, or 4 bytes.
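
To make the variable-length idea concrete, here is a rough sketch that counts how many bytes UTF-8 needs for four sample characters. The characters are written as octal byte escapes so that the script itself stays pure ASCII. The helper name utf8_len and the choice of samples are mine, not part of any standard tool.

```shell
# utf8_len FORMAT: count the bytes that printf produces for FORMAT.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_len 'A'                   # U+0041, plain ASCII letter     -> 1
utf8_len '\303\241'            # U+00E1, a with acute accent    -> 2
utf8_len '\342\234\223'        # U+2713, check mark             -> 3
utf8_len '\360\237\230\200'    # U+1F600, an emoji              -> 4
```

In UTF-32, each of those four characters would occupy 4 bytes.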

1.2.3 What Can You Do With Unicode Text Files?

Certainly we will be able to do everything to a text file that we can do with generic binary files: copying, renaming, etc.

Some of the Linux commands for working with ASCII text files will also work with Unicode. Others will not. Some will work with Unicode files encoded in UTF-8 but not in other encodings. Presumably, more and more of the text processing commands will support Unicode in the future.

1.3 Binary Files

A binary file is a sequence of bytes that can contain almost anything. In practice, some software developer working on a program defined a file format for holding the data needed by that program. The file format was probably designed to be compact and easily processed by that program.

Here is a dump of the opening bytes of the file containing the first picture in section 1 of this document.

You’ll notice that the ASCII column on the far right contains lots of ‘.’ characters, which are actually used by the hexdump program to indicate a byte that contains a non-ASCII character value or an ASCII value that does not have a visible representation (e.g., line terminators). Where ASCII characters are displayed, they appear to be almost random. That’s because, for the most part, they pretty much are. Any binary data file is bound to contain some bytes that just happen to match an ASCII character code.
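
The ‘.’ substitution is easy to imitate. The sketch below is not how hexdump itself is implemented. It simply uses the standard tr command to replace every byte that is not a printable ASCII character with a period. The function name ascii_column and the sample bytes are made up for illustration.

```shell
# ascii_column: mimic the right-hand column of a hex dump by turning every
# byte that is not a printable ASCII character into a '.'.
ascii_column() {
    LC_ALL=C tr -c '[:print:]' '[.*]'
}

# Hypothetical sample: six text bytes followed by a control character,
# a non-ASCII byte (octal 200), and a line feed.
sample=$(printf 'GIF89a\001\200\n' | ascii_column)
echo "$sample"       # GIF89a...
```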

1.3.1 What Can You Do With Binary Files?

We can also do operations that don’t rely on interpreting or understanding the contents of the file: e.g., copying the file, renaming it, or moving it to a different directory.

Typically, though, the contents of that binary file can only be processed by that one program or by other programs written later with the specific goal of processing that same file format. So you can use binary files only as input to a program specifically designed to handle their file format.

For example, every operating system defines a file format for executable programs. (For all practical purposes, we “run” an executable program by supplying it as input to the operating system.) For the most part, a program is just a block of machine code instructions encoded in binary. Load that block into memory, point the CPU at the address where it was loaded, and it runs. But to facilitate the process of loading that block into memory, an operating system may specify that each executable program file starts with a “header” of information indicating where it can be loaded, what other resources need to be available, etc. Because these headers are operating-system specific, trying to run a program designed for one operating system on a different operating system will usually result in a quick error message, because the second operating system will immediately see that the header is not in the proper format. (This is by no means the only reason why executables for one operating system won’t run under another. It does, however, account for the fact that if you try this, you will usually get stopped before doing any real damage.)

As another example, in 1987 Steve Wilhite, a developer at CompuServe, defined a new format for holding images called the Graphics Interchange Format, or GIF for short. Originally, the only programs that could interpret the GIF format were conversion programs provided by CompuServe for converting between GIF and older existing graphics file formats. Eventually, web browsers added code designed to interpret and render GIF, and now almost every program that deals in graphics includes code designed to handle GIF.

You cannot hand a GIF file to the operating system to be executed like a program, nor can you ask a web browser or graphics viewer to render a program as if it were an image. Doing so will result in an error message at best, garbage output if you are not so lucky, a hung system if you are still less lucky, or a corrupted file system if you are really having a bad day.

1.4 What’s Text and What’s Binary?

We’ll talk more about this in a later lesson, but some examples might be useful now.

When you write a program by typing in source code, you are working with text. Program source code files (e.g., .cpp and .h files) are text, and almost always ASCII text.
On the other hand, when you run that source code through a compiler and get an executable program, that executable is binary.
When you type in a word processor, you are certainly working with text. But the wide range of formatting options, e.g., bold, italic, underlines, font selection, etc., permitted by a word processor are not, in and of themselves, part of the actual text. Something more elaborate than plain text is needed to handle all of that. And every word processor defines its own distinct format for storing that information.

So when you save the output of your favorite word processor (e.g., Word), that file is binary.
On the other hand, you may occasionally work with a simpler text editor that provides none of those fancy formatting options (e.g., Notepad in Windows). Such programs produce text files as output.
Web pages, like the one you are reading now, offer nearly as many formatting options as a typical word processor. So you might expect that they are working from a binary file. In fact, however, web pages are text files, but use special commands embedded into the text via the HyperText Markup Language (HTML) to indicate what formatting is needed.

If you are unfamiliar with HTML, try right-clicking on this page and selecting the option to “View page source”, or something similar, to see the HTML of this page. Hit Ctrl-F to open a search box and look for the word “Hyper”, and you should be able to find and recognize this paragraph.
Directories in Linux (and folders in Windows) are actually files. But they are in a binary format that is understood by the various navigation and file manipulation commands in Linux. That binary format tracks information such as the file name and, most importantly, the location of the file on the disk.

2 Directories

Files in Unix are organized by collecting them into directories. (In Windows these are more commonly known as “folders”.) Directories are themselves files, and so may appear within other directories.

All directories are files. (Binary files, to be specific).

But, obviously, not all files are directories!

The result is a tree-like hierarchy. At the root of this tree is a directory known simply as “/”.¹ This directory lists various others:

The bin directory contains many of the programs for performing common Unix commands. The usr directory contains many of the data files that are required by those and other commands. Of particular interest, however, is the home directory, which contains all of the files associated with individual users like you and me. Each individual user gets a directory within home bearing their own login name. My login name is zeil.

We can expand our view of the Unix files then as:

cd and ls are two common Unix commands, as will be explained later.

Within my own home directory, I have a directory also named “bin”, containing my own personal programs. Two of these are called “clpr” and “psnup”. So these files are arranged as:

3 Paths

3.1 How do you give someone directions?

We’ve all done this from time to time: asked someone for directions on how to get to someplace. Some people are very good at giving directions, others not so much. Some people are good at following directions, others not so much.

How do I get to the White House?

Look for the Washington Monument. It should be easy to spot.

From the Washington Monument, head north along the path until you come to a fork. Turn right, walk about 500 ft. then make a sharp left and head north towards the intersection of 15th St. and Constitution Ave. NW.

From that intersection, continue north along 15th St. until you reach Pennsylvania Ave. Turn left and proceed west along Pennsylvania Ave. until you reach the gate of the White House.

This is an example of absolute directions. They rely on your starting from a well-known, easily reached landmark and proceeding from there.

If you asked how to get to my office on the ODU campus, I might give you absolute directions by assuming that you knew how to start from Webb Center, or, more likely, from the abandoned monorail track that passes through much of the campus.

How do I get to the White House?

“Well, where are you now?”

“I’m in Lafayette Square.”

OK, walk along the southeast path until you come to Pennsylvania Ave NW.

Turn right and proceed west along Pennsylvania Ave. until you reach the gate of the White House.

This is an example of relative directions. Relative directions can often (though not always) be shorter and simpler than absolute directions.

Absolute directions remain correct no matter where you start from.

You could be starting from Norfolk, VA., and those absolute directions to the White House are still correct. It’s just more of a chore to get to the starting point.
Relative directions become useless once your starting position changes.

If you start in Norfolk (or, for that matter, anywhere south of the White House) and start walking to the southeast, you will never reach Pennsylvania Ave. (or, at least, not the Pennsylvania Ave. where the White House is located).

3.2 File Paths

The full “name” of any file is given by listing the entire path from the root of the directory tree down to the file itself, with “/” characters separating each directory from what follows.

For example, the full names (paths) of the four programs in the above diagram are

   /bin/cd
   /bin/ls
   /home/zeil/bin/clpr
   /home/zeil/bin/psnup

3.3 Paths Supply Directions

It’s important to recognize that a path is a step-by-step set of instructions on how to find a specific file. For example, /home/zeil/bin/psnup means:

Start at the root of the Unix file system, /
There you should see a directory named home. Look in that directory.
In that directory, you should see a directory named zeil. Look in that directory.
In that directory, you should see a directory named bin. Look in that directory.
In that directory, you should see a file named psnup. That’s the file you want.
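
Those steps can be acted out literally. The following is only a sketch of the idea, not how the operating system actually resolves paths. It builds a throwaway copy of the example tree in a scratch directory (so that it does not depend on what is really on your machine) and then follows the path one name at a time. The function name walk and the variable sandbox are invented for this example.

```shell
# Build a throwaway copy of the example tree; "$sandbox" stands in for the real root /.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/zeil/bin"
: > "$sandbox/home/zeil/bin/psnup"      # an empty stand-in for the real program

# walk START STEPS: follow STEPS (for example home/zeil/bin/psnup) one name at a
# time, beginning in the directory START, and report what is found at each step.
walk() {
    here=$1
    old_ifs=$IFS
    IFS=/                               # split the path at each "/"
    set -- $2
    IFS=$old_ifs
    for step in "$@"; do
        here=$here/$step
        if [ -d "$here" ]; then
            echo "directory: $step"
        elif [ -f "$here" ]; then
            echo "file: $step"
        else
            echo "not found: $step"
            return 1
        fi
    done
}

walk "$sandbox" home/zeil/bin/psnup
```

Run as written, this reports three directories (home, zeil, bin) followed by the file psnup, mirroring the list above.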

In the assignments for this course, I will often give instructions like
Copy the file /home/zeil/bin/psnup into the directory...
and then will get email from students saying something like
I can't find the file /home/zeil/bin/psnup. Where do I find it?
which is rather like asking “What’s the address of the house at 221B Baker St., London, Eng.?” or “How heavy is a 5 pound bag of flour?”

The answer is literally right there in the question!

Now, when I say that a path is a step-by-step set of directions, understand that we seldom have to follow those directions step by step. Almost any time and any place I need to name a file in Unix, I can simply give a path to it and let the operating system follow those step-by-step directions.

Example 1: Try This: Exploring Directories
Log in to your Unix account.
Upon logging in, your working directory should be your home directory. The command pwd will print the working directory. Give the command

pwd

You should see something like
/home/yourname
This is a path. What does this path tell you?

Answer

You could get to your current location by

Starting at the file system root /.

In that directory find a directory named home and descend into it.

In that home directory, find a file named with your login name.
The command file can tell you what kind of file you have. Give the command:

file /home/yourName

substituting your own login name for yourName.

Does the response you get from file make sense?
The command cd will let you change your current working directory.

Give the following commands and observe the results:
cd /
pwd
file /
cd /usr
pwd
file /usr
The ls command is used to list the contents of a directory.

Give the following commands.
ls /
ls /usr
ls /home
You should recognize one of the entries in the last of those listings as being your own login name.
Try using the up arrow and down arrow keys. You should be able to move back and forth through the history of commands you have already tried out.

Use the arrows to revisit the ls /usr command. Hit the Enter key to re-issue that command.

Many command shells will offer to auto-complete file and folder names after you type a few characters.

Type

ls /usr/lo

and then hit the Tab key instead of Enter. The command shell should guess that you were on your way to typing “local”, and fill in the remaining characters for you.

You can use the Tab completion feature with any command, not just ls.

Not only does this reduce the amount that you need to type, it reduces the opportunities to make spelling errors as you type and helps you detect spelling errors that you have already made.

Hit Enter to issue the command.

Type

ls /usr/li

and then hit the Tab key instead of Enter. The command shell will look and discover that there are several possible entries in /usr that begin with “li”. As it happens, all of them begin with “lib”, so it will add the ‘b’ and then pause. You may hear a beep that indicates that it could go no further.

Hit the Tab key twice more and the shell will show you all of the possible continuations of /usr/lib…

Hit Enter to list /usr/lib.

Give the command

exit

to close out this session.

3.4 Absolute and Relative Paths

File paths give step-by-step directions on how to reach a file.

Just like when we give directions in the real world, we can give paths that are relative or absolute.

Absolute paths start from a “landmark”, namely the file system root /.

Relative paths start from “wherever we are now”, our current working directory. In the Try This exercise earlier, you saw how to change your current working directory with the cd command and how to find out what it is with the pwd command.

If a path starts with a ‘/’, it is absolute.

Later we will see that an absolute path can also start with ‘~’.

If a path starts with anything besides ‘/’ (or ‘~’), it is relative.
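
That rule is simple enough to write down as a tiny shell function. This is just an illustration of the rule as stated here. The name classify is made up, and the ‘~’ examples are quoted so that the shell passes them along unexpanded.

```shell
# classify PATH: apply the rule above to the text of a path.
classify() {
    case $1 in
        /*|'~'*) echo absolute ;;   # starts with / (or with ~)
        *)       echo relative ;;   # anything else starts from the working directory
    esac
}

classify /usr/include        # absolute
classify '~/bin/clpr'        # absolute
classify net/ethernet.h      # relative
classify ../cs252            # relative
```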

Most Unix commands and programs will work with one or more files.
We tell them what files we want them to use by giving paths to those files.
We can give those paths as absolute paths or relative paths.
- whichever is more convenient for us to type
- because the commands really won’t care. They will simply follow the paths we give them, step by step.

Example 2: Try This

Log in to your Unix account.

Give the commands:
cd /usr
pwd
Is that cd command using a relative or an absolute path?

Answer

Absolute: it begins with a ‘/’, telling us that the first step in following this path is to start from the file system root, /.

The command ls is used to list the files contained in a directory. With no path in the command, it lists the contents of the current working directory. We can also give it one or more paths and, if those paths lead to directories, it will list the contents of those directories.

Give the commands:
ls
ls /usr
Why do these commands produce identical output?

Answer

The first command lists the contents of our current working directory. However, our current working directory is /usr, the same as the path given in the second command.

So both commands are actually being asked to list the same directory.

If we are following directions in the real world, we often walk a little way, then stop and look around to be sure that we are in the right place, then follow a little more of the directions, stop and look around, and so on.

We often use relative paths in Unix commands to accomplish much the same thing.

Try the command:
file /usr/include/net/ethernet.h
Good enough, but how did I even know that file was there? For that matter, how did I know that the directory /usr/include/net was there? And how did I know that I was typing it correctly with no misspellings or other goofs?

In many cases, I would approach it step by step, using relative paths to move one directory at a time.

Try the commands:
cd /
pwd
ls
How do I know which of these are directories and which are ordinary files? We could use the file command, but there’s a nice shortcut available in ls. The -F option will attach a punctuation character to the end of “unusual” files to tell us what they are:

a ‘/’ on the end of directories

a ‘*’ on the end of commands and programs that can be executed

a ‘@’ on the end of “symbolic links” – we won’t use these in this course, but they are a kind of shortcut tunnel from one directory to another.

Just remember if you use this option that the appended punctuation is not really part of the file name!

Continuing on, try the commands:
ls -F
cd usr
pwd
Why does usr work in the cd command above?

Answer

usr does not start with ‘/’, so it is a relative path. So, from our current working directory /, we descend one step into the directory named “usr”.

Continuing on, try the commands:
ls -F
cd include
pwd
ls -F
cd net
pwd
ls -F
file ethernet.h
file /usr/include/net/ethernet.h
Notice how each cd command adds another step to the path of your current working directory.

If we know the absolute path to a file, we can get there immediately. Otherwise we can step our way, one step at a time, until we get where we want.

Now, there are lots of intermediate possibilities in between those two extremes. For example, try these commands:
cd /usr/include
ls -F
ls -F net
file net/ethernet.h

3.5 Abbreviating Paths

There are some common abbreviations that can be used to shorten paths.

You can refer to the home directory of someone with login name name as ~name

Similar to our earlier example, we can deconstruct the path ~cs252/Assignments/Asst1/foobar.txt
1. Start at the home directory of the “cs252” account, ~cs252, also known as /home/cs252.
2. There you should see a directory named Assignments. Look in that directory.
3. In that directory, you should see a directory named Asst1. Look in that directory.
4. In that directory, you should see a file named foobar.txt. That’s the file you want.
You can refer to your own home directory simply as ~

For example, you could refer to the file containing my clpr program as either /home/zeil/bin/clpr or ~zeil/bin/clpr.

When I myself am logged in, I can refer to this program by either of those two names, or simply as ~/bin/clpr.
There is a big difference between ~jones/ and ~/jones/
- ~jones/ means “the home directory of the jones account” and is shorthand for /home/jones.
- ~/jones/ means “look in your own home directory for a directory named jones” and is an abbreviation for /home/whateverYourLoginNameIs/jones.
  - And, honestly, it’s pretty unlikely that you even have a directory named jones.
At all times when entering Unix commands, you have a “working” directory. If the file you want is within that directory (or within other directories contained in the working directory), the name of the working directory may be omitted from the start of the file name.

When you first log in, your home directory is your working directory. For example, when I have just logged in, I could refer to my program simply as bin/clpr, dropping the leading /home/zeil/ because that would be my working directory at that time.
Your working directory itself can be referred to as simply “.”.
The “parent” of your working directory (i.e., the directory containing the working directory) can be referred to as “..”.
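
Here is a small sketch of how these abbreviations combine. It works inside a scratch directory rather than your real file system, and it temporarily pretends that the home directory of a hypothetical user alice lives inside that scratch tree. The names sandbox, root, show, and alice are all assumptions made for this example.

```shell
# Scratch tree standing in for a real system; "alice" is a hypothetical login.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/usr/include" "$sandbox/home/alice/jones"
root=$(cd "$sandbox" && pwd -P)      # canonical name of the scratch tree

# show PATH: print the directory that PATH leads to, with the scratch prefix removed.
show() {
    dir=$(cd "$1" && pwd -P) && echo "${dir#"$root"}"
}

(
    HOME=$root/home/alice            # pretend this is our home, in this subshell only
    cd ~                             # "~" is our own home directory
    show .                           # /home/alice
    show ~/jones                     # /home/alice/jones
    show ..                          # /home
    show ../../usr/include/..        # /usr
)
```

Because show does its cd inside a command substitution, the working directory of the surrounding shell never changes. Each line is resolved relative to the pretend home directory.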

Example 3: Try This

Try the following commands. See if you can predict what each pwd command will print.
cd ~
pwd
ls -F
cd /usr
cd include
pwd
cd ..
pwd
ls -F
cd /usr/include/..
pwd
cd /usr/include/.
pwd
cd /usr/../home
pwd
ls -F
cd ~
pwd
cd ~cs252
pwd
cd ~
pwd
cd ../cs252
pwd
cd ../../usr/include
pwd
cd ./.
pwd
Any surprises? Any results that you just could not explain?

Some miscellaneous notes:

Most people do the bulk of their work in their own directories. That makes the ~ shortcut particularly useful.
Sometimes you will need to access files and directories belonging to someone else. That’s where the ~otherPerson shortcut comes into play.
.. can go a long way towards making relative paths shorter and easier to type than a full absolute path. But I find that many students tend to forget it.
. is not nearly as useful. It tends to crop up in a few specialized cases. For example, every now and then, we want to tell a command to do something to our current working directory, and so we give a path consisting of “.”, all by itself.

4 File Systems on Other Operating Systems

Much of what we have covered here is actually applicable to other operating systems as well. All operating systems have files, both text and binary. All operating systems have directories, though they may be called “folders” instead.

And proficient use of any operating system will eventually require you to work with paths in that operating system.

4.1 macOS File Systems

If you do not have access to an Apple computer running OS X or macOS, skip to the next section.

macOS is a Unix operating system, so it should not be surprising that almost everything we have looked at will carry over directly to Macs.

The general organization of the file system, starting from the file system root /, is the same.
The pwd, cd, and ls commands all work the same.

In fact, the main difference you will notice is that in most Unix systems, your home directory would be /home/yourLoginName/, but in macOS your home directory is /Users/yourLoginName/.

Example 4: Try This

Open a Terminal window on your Mac. In that window, try the following commands. Try to predict what each pwd and ls is going to show you.
cd /
pwd
ls -F
cd usr
pwd
ls -F
cd /Users
pwd
ls -F
cd ~
pwd
ls -F

Now, it’s true that you can usually use the Finder to examine your directories and files with less effort than working through the command line.

But this course is “…for Programmers”, and there are many tasks that programmers, unlike more casual Mac users, will need to perform that involve paths and other command line concepts.

4.2 Windows File Systems

If you do not have access to a Windows PC, skip this section.

Windows is the only commonly used operating system today that is not a Unix variant, so it will have lots of differences from Linux. Still, it has files, directories (folders), and paths.

The major differences to watch out for:

Windows separates steps in a path with the backslash \ instead of the forward slash ‘/’ used in Unix.
Instead of a single file system root, Windows has separate file trees rooted at each lettered drive: C:\, D:\, E:\, etc.
Windows ignores upper/lower case differences in file names and paths. In Unix, hello.cpp and Hello.CPP are different files and can co-exist in the same directory. In Windows, these are considered to be alternate spellings for the same file name.
Many of the commands have different names.
- cd with a directory name or path works much the same in Windows as in Unix,
- but if you give the cd command with no path, it behaves like the Unix pwd.
- The Windows command for listing the contents of a directory is dir, not ls.
Your home directory in most Windows installations is C:\users\yourLoginName\, where yourLoginName is the account name you use to log in to Windows.

Example 5: Try This

Open a CMD window on your Windows PC. (Click the Start/Windows button on the left end of the task bar and type cmd, then hit enter.)

In that window, try the following commands. Try to predict what each cd and dir is going to show you.
cd
dir
cd \
cd
dir
cd \Windows\Help
cd
dir
cd ..
cd
dir
cd \Users
cd
dir
cd yourWindowsLoginName
cd
dir

Now, it’s true that you can usually use the File Explorer to examine your directories and files with less effort than working through the command line.

But this course is “…for Programmers”, and there are many tasks that programmers, unlike more casual Windows users, will need to perform that involve paths and other command line concepts.

5 Commands Glossary

Command	Example	Explanation
`file path`	`file foo.txt`	Indicate what kind of data is stored in a file.
`cd path`	`cd ~/playing`	Change your current working directory.
`ls`	`ls`	List the files in your current working directory.
`ls path`	`ls /usr`	List the files at that path.
`ls -F path`	`ls -F ~`	List the files at that path with simple indicators of whether they are directories, links, executable commands, or just “ordinary” files.
`pwd`	`pwd`	Print your current working directory.

1 : It may be more precise to say that this directory’s name is the empty string "".

2 : As we will see, one almost never needs to type an entire file name in a Unix command, so long file names are no harder to work with than short ones.