cs24-24fa Pre-Test: disk

Introduction to Computing Systems (Fall 2024)

Introduction

The main goal of this pre-test is to orient you to what CS 24 will be like. It will also cover representation, use number bases, and review pointers. Despite being called “pre-test”, it is not easy, and you should expect to spend a reasonable amount of time on it.

Setup

First, you should register for the assignment by going to https://grinch.caltech.edu/register which will create you a repository on gitlab with the starter code. Then, you should follow the

ENVIRONMENT SETUP INSTRUCTIONS

to get ready for the course. These shouldn’t take that long, but they’re extremely important.

The Scenario

Adam accidentally deleted the contents of their USB drive (oh noes!). This USB drive was particularly important, because Adam painstakingly created a unique welcome PDF for every student. They even went to the trouble of password protecting them so that nobody could read someone else’s message. Unfortunately, they lost the passwords too, because they were stored on the USB drive as well.

Game Plan

You will recover your CS 24 “welcome PDF” and the password to open it from an image of Adam’s USB drive step-by-step. There might also be other hidden goodies that were on the drive that are worth exploring…

Part 1: Constructing and Printing A Directory Tree Data Structure

In this course, we will often find ourselves working with recursive tree structures with multiple “types” of nodes. One example of such a tree is the directory tree on your file system. A node is either a file node or a directory node. File nodes contain a name, a file size, and a byte array of contents. Directory nodes contain a name and an array of children. Notably, the name is common to both types of nodes; so, it should appear in the “super class”. Unfortunately, C doesn’t have objects; so, we have to simulate them using a “tag” to keep track of what type the node is. These tags are usually implemented as enums which are just a way of specifying that a variable must be one of a list of choices:

Node Type

1
2
3
4
typedef enum {
    FILE_TYPE,
    DIRECTORY_TYPE
} node_type_t;

Thus, our node_t struct will consist of a type and a name:

Node struct

1
2
3
4
typedef struct {
    node_type_t type;
    char *name;
} node_t;

The interesting part is how the “subclasses” are implemented. We “embed” the root struct as the first element of the subclasses. This way, they will both start with the same members. So, if we are given a node_t, we can “extend” it to the right type by casting it. See the init functions in directory_tree.c or free_directory_tree for examples of what that looks like.

File Node struct

1
2
3
4
5
typedef struct {
    node_t base;
    size_t size;
    char *contents;
} file_node_t;

Directory Node struct

1
2
3
4
5
typedef struct {
    node_t base;
    size_t num_children;
    node_t **children;
} directory_node_t;

We have provided the following functions to help initialize and free tree nodes:

Your first task is to implement the functions to manipulate, print, and create a directory tree:

Part 2: Recovering a File System

In this part, you will be working with an image of Adam’s USB drive.

Checking the Disk Image

First, download usb.dmg from the link above. Before you start trying to recover files, you should check that the files are actually deleted. To do this, you’ll have to mount the image as a disk. On OS X, you can just double-click it. On Linux, you should use mount. On Windows, you can see it using 7zip or you can download this program to actually mount it as a drive. Regardless of machine, you should notice that there aren’t any useful files on the drive. So, we’ll have to recover them the hard way!

The FAT16 File System

Your computer organizes the files on your machine in a particular format called a “file system”. There are many formats (ntfs on windows and ext3 on linux are some of the most common), but one of the older formats which is used for cross-compatibility is called FAT (which stands for “File Allocation Table”). Adam’s USB drive used FAT16 because of the relatively small size. In this section, we will describe the basics of the FAT16 file system to give you enough information to recover Adam’s files.

When a file is deleted on a FAT16 file system, the contents are actually kept in place! All that changes is the first character of the filename which is set to 0xe5. This means we can recover the files even though we can’t see them in finder/explorer!

The image begins with some special sections as pictured below: fat-layout

Since we’re not trying to boot from the disk image, we can ignore the first section. Since the files are deleted, we don’t care what the system thinks is allocated. So, the only two sections we care about in this “preamble” are: (1) the BPB and (2) the “root directory entries”. The BPB (represented by as a bios_parameter_block_t in fat16.h) stores a number of important parameters about the data that follows. For now, all you have to do is fread the bytes into a struct, and we will use the information contained inside later. The master boot record has size const size_t MASTER_BOOT_RECORD_SIZE = 0x20B as defined in recover.c.

To get around the file, you will need to use the fseek function which takes an origin as the third argument: SEEK_SET indicates the offset is relative to the beginning of the file and SEEK_CUR indicates the offset is relative to where the file currently points; you may want to use both of these during this assignment.

Immediately after the File Allocation Tables, is an “array” of 512 directory_entry_t’s which is where you should start your analysis. These directory_entry_t’s all correspond to some kind of file or directory, but we are only concerned with ones that aren’t hidden. We have provided a utility function bool is_hidden(directory_entry_t d) which determines if a proposed directory_entry_t is hidden or not.

Notably, all directory_entry_t’s have a “cluster number” which is basically a pointer to a section on disk where the contents of that entry are located. You can get an offset (in bytes) from the beginning of the file for this section by calling our utility function get_offset_from_cluster(d.first_cluster, bpb), which does a short calculation to determine where to go.

For file entries, the offset calculated will lead to the contents of that file. For directory entries, the offset calculated will lead to a block of directory_entry_t’s. which are the entries contained in that directory.

The next several steps are to write the follow function, which recursively follows all non-hidden directories and constructs node_t’s out of them.

void follow(FILE *disk, directory_node_t *node, bios_parameter_block_t bpb)

Pre-Condition

The FILE * disk points to a contiguous block of one or several directory_entry_t’s, which represent the files or directories in the directory associated with node.

For each entry, if the entry read is not hidden…

  • …and is a file: read file_size bytes from the offset location into a char *, then make a file_node_t * and attach the parent node to it.
  • …and is a directory: make a directory_node_t *, attach the parent node to it, and recursively call follow using the offset location.

You should use a loop to process directory_entry_t’s from disk until encountering an entry with a name that starts with \0. You should use fread to read the next directory_entry_t from disk, which will move the disk pointer to the next directory_entry_t in the block.

You will find the is_directory function useful in determining if an entry is a file or directory. In both cases, don’t forget to reset the location of the file to the next directory_entry_t after you are done reading from the offset location. Additionally, when you initialize the file/directory name, make sure to call the get_file_name function we’ve provided, because FAT16 stores file names weirdly.