The 2024 Wheel Reinvention Jam just concluded. See the results.

Tutorial/Parsing the MFT

Parsing the MFT

Programs like Everything are incredibly fast at searching by filename. How do they do it? Let's find out!

What is an MFT?

NTFS, the filesystem used by all modern versions of Windows, has an interesting property. Instead of storing file metadata in tables scattered across the partition (usually to optimise hard disk seek times going from file metadata to actual file data), NTFS stores all the file metadata in a few large contiguous blocks.

(Please note that I will be using "file" to refer to both files and directories, as is done in NTFS.)

An example of an NTFS partition's layout. There is a header at 0 bytes. There are a full large blocks of file metadata, with file data stored in between.

These blocks are collectively known as the MFT (master file table).

For those unaware, the metadata for a given file typically includes its filename, its size, and pointers to where the contents of the file can be found in the partition. If the file is a directory, this will also include pointers to where the metadata for its contents can be found in the MFT.

Here's why this is useful. Because file metadata is stored in only a few contiguous blocks in the partition, we can really quickly scan all the files in the partition to get each file's associated metadata. On most other filesystems, you'd have to recurse through each directory manually and enumerate each's contents. This is perfect for when you want to search through an entire partition, or compute summary statistics for directories.

In this tutorial we'll look at how to extract this information.

Raw partition access

First, let's get a simple program setup. We'll be using Sean Barrett's data structures library to keep things simple. Compile with cl /Zi mftreader.cpp.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include <windows.h>

#define STB_DS_IMPLEMENTATION
#include "stb_ds.h"

int main(int argc, char **argv) {
    return 0;
}

Instead of reading from a particular file, we need to read from arbitrary locations in the partition, so let's open a partition handle.

1
drive = CreateFile("\\\\.\\C:", GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, 0, NULL);

For this to work, you'll need to run the program as administrator. Make sure to not pass the GENERIC_WRITE flag, as you really, really, do not want to modify your filesystem's structures. I am not responsible for any data loss or damage resulting from following this tutorial.

We'll also define a function to read from a specified position in the partition.

1
2
3
4
5
6
7
8
9
HANDLE drive;

void Read(void *buffer, uint64_t from, uint64_t count) {
    DWORD bytesAccessed;
    LONG high = from >> 32;
    SetFilePointer(drive, from & 0xFFFFFFFF, &high, FILE_BEGIN);
    ReadFile(drive, buffer, count, &bytesAccessed, NULL);
    assert(bytesAccessed == count);
}

We'll be using asserts throughout this tutorial instead of proper error checking, as this is just a simple program. You may wish to replace these later, but make sure to keep in mind that directly reading from the MFT isn't something you're supposed to do. The filesystem driver could be updating the MFT as we read it, so there may be inconsistencies in the data you gather.

Finding the MFT

Before we can parse the MFT, we need to find where the first block begins. Once there, we can find pointers to the subsequent blocks.

The position of the first MFT block is stored in the very first sector of the NTFS partition. A sector is a 512-byte section of a partition. The first sector of a partition is affectionately known as the boot sector, since up until very recently it also stored code needed to boot the operating system. I've annotated the fields important to us.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#pragma pack(push,1)
struct BootSector {
    uint8_t     jump[3]; 
    char        name[8];
    uint16_t    bytesPerSector;         // The number of bytes in a sector. This should be 512.
    uint8_t     sectorsPerCluster;      // The number of sectors in a cluster. Clusters are used for less-granular access to the partition. They're usually 4KB.
    uint16_t    reservedSectors;
    uint8_t     unused0[3];
    uint16_t    unused1;
    uint8_t     media;
    uint16_t    unused2;
    uint16_t    sectorsPerTrack;
    uint16_t    headsPerCylinder;
    uint32_t    hiddenSectors;
    uint32_t    unused3;
    uint32_t    unused4;
    uint64_t    totalSectors;
    uint64_t    mftStart;           // The start of the MFT, given as a cluster index.
    uint64_t    mftMirrorStart;
    uint32_t    clustersPerFileRecord;
    uint32_t    clustersPerIndexBlock;
    uint64_t    serialNumber;
    uint32_t    checksum;
    uint8_t     bootloader[426];
    uint16_t    bootSignature;
};
#pragma pack(pop)

BootSector bootSector;

I've wrapped this structure in pack pragmas so no padding is inserted by the compiler. You'll want to put all other filesystem structures in between these pragmas too.

Now we read the boot sector:

1
Read(&bootSector, 0, 512);

And now we can read the first 1KB of the MFT (that's the bit that points to the other blocks in the MFT).

1
2
3
4
5
#define MFT_FILE_SIZE (1024)
uint8_t mftFile[MFT_FILE_SIZE];

uint64_t bytesPerCluster = bootSector.bytesPerSector * bootSector.sectorsPerCluster;
Read(&mftFile, bootSector.mftStart * bytesPerCluster, MFT_FILE_SIZE);

Finding the other MFT blocks

Finding the $DATA attribute

The MFT consists of 1KB entries, and the data describing the MFT is also 1KB. That's because the first entry of the MFT is the data describing the MFT. Since all other entries in the MFT are files, this is stored as if the MFT was a file. It actually works quite neatly. Therefore to find the other MFT blocks, we need to parse the first entry of the MFT.

The basic layout of an MFT entry - called a file record - is the following:

struct FileRecord {
    FileRecordHeader header;

    struct {
        AttributeHeader header;
        uint8_t contents[...];          // Note: the contents might be stored elsewhere in the partition (a non-resident attribute).
    } attributes[...];
};

Therefore we'll need some more structure definitions. (You haven't forgotten about the pack pragmas, right?)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
struct FileRecordHeader {
    uint32_t    magic;              // "FILE"
    uint16_t    updateSequenceOffset;
    uint16_t    updateSequenceSize;
    uint64_t    logSequence;
    uint16_t    sequenceNumber;
    uint16_t    hardLinkCount;
    uint16_t    firstAttributeOffset;       // Number of bytes between the start of the header and the first attribute header.
    uint16_t    inUse : 1;          // Whether the record is in use.
    uint16_t    isDirectory : 1;
    uint32_t    usedSize;
    uint32_t    allocatedSize;
    uint64_t    fileReference;
    uint16_t    nextAttributeID;
    uint16_t    unused;
    uint32_t    recordNumber;           // The record number. We'll need this later.
};

struct AttributeHeader {
    uint32_t    attributeType;          // The type of attribute. We'll be interested in $DATA (0x80), and $FILE_NAME (0x30).
    uint32_t    length;             // The length of the attribute in the file record.
    uint8_t     nonResident;            // false = attribute's contents is stored within the file record in the MFT; true = it's stored elsewhere
    uint8_t     nameLength;
    uint16_t    nameOffset;
    uint16_t    flags;
    uint16_t    attributeID;
};

Unfortunately, it's a bit more complicated than this. There are some additional fields that are different for non-resident and resident attributes. See the comment in AttributeHeader.nonResident for an explanation of the difference between resident and non-resident attributes.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
struct ResidentAttributeHeader : AttributeHeader {
    uint32_t    attributeLength;
    uint16_t    attributeOffset;
    uint8_t     indexed;
    uint8_t     unused;
};

struct NonResidentAttributeHeader : AttributeHeader {
    uint64_t    firstCluster;
    uint64_t    lastCluster;
    uint16_t    dataRunsOffset;         // The offset in bytes from the start of attribute header to the description of where the attribute's contents can be found.
    uint16_t    compressionUnit;
    uint32_t    unused;
    uint64_t    attributeAllocated;
    uint64_t    attributeSize;
    uint64_t    streamDataSize;
};

We're going to search for the $DATA attribute in the file record. The $DATA attribute is used to store the contents of a file, so in the case of the MFT, it stores the MFT. Since the entire MFT can't fit within a single MFT file record, the $DATA attribute will be non-resident. That means the attribute will instead list the contiguous blocks that do contain the MFT.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
FileRecordHeader *fileRecord = (FileRecordHeader *) mftFile;
AttributeHeader *attribute = (AttributeHeader *) (mftFile + fileRecord->firstAttributeOffset);
NonResidentAttributeHeader *dataAttribute = nullptr;
assert(fileRecord->magic == 0x454C4946);

while (true) {
    if (attribute->attributeType == 0x80) {
        dataAttribute = (NonResidentAttributeHeader *) attribute;
    } else if (attribute->attributeType == 0xFFFFFFFF) {
        break;
    }

    attribute = (AttributeHeader *) ((uint8_t *) attribute + attribute->length);
}

assert(dataAttribute);

Parsing the non-resident data runs

As described above, a non-resident attribute lists the contiguous blocks that make up the attribute's contents. Therefore, with the MFT's $DATA attribute, we have the list of blocks containing the MFT. (Remember - we only got the location of the first MFT block from the bootloader.)

Each block is called a data run. Data runs must begin and end at cluster boundaries (usually every 4KB - see the BootSector definition).

In NTFS, the array of data runs is stored in the following format.

struct DataRun {
    uint8_t     lengthFieldBytes : 4;
    uint8_t     offsetFieldBytes : 4;
    uint64_t    length : lengthFieldBytes * 8;  // Length of the data run in clusters.
    int64_t     offset : offsetFieldBytes * 8;  // Offset of the start of this data run in clusters from the start of the previous data run, or the start of the partition if this is the first run.
};

Variable length fields... it's complicated. We'll have to parse these carefully. First, the header structure.

1
2
3
4
struct RunHeader {
    uint8_t     lengthFieldBytes : 4;
    uint8_t     offsetFieldBytes : 4;
};

Loop through each data run in the $DATA attribute.

1
2
3
4
5
6
7
RunHeader *dataRun = (RunHeader *) ((uint8_t *) dataAttribute + dataAttribute->dataRunsOffset);
uint64_t clusterNumber = 0;

while (((uint8_t *) dataRun - (uint8_t *) dataAttribute) < dataAttribute->length && dataRun->lengthFieldBytes) {
    ...
    dataRun = (RunHeader *) ((uint8_t *) dataRun + 1 + dataRun->lengthFieldBytes + dataRun->offsetFieldBytes);
}

For each data run we need to calculate its length and absolute offset (i.e. from the start of the partition).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
uint64_t length = 0, offset = 0;

for (int i = 0; i < dataRun->lengthFieldBytes; i++) {
    length |= (uint64_t) (((uint8_t *) dataRun)[1 + i]) << (i * 8);
}

for (int i = 0; i < dataRun->offsetFieldBytes; i++) {
    offset |= (uint64_t) (((uint8_t *) dataRun)[1 + dataRun->lengthFieldBytes + i]) << (i * 8);
}

if (offset & ((uint64_t) 1 << (dataRun->offsetFieldBytes * 8 - 1))) {
    // Sign extend the offset.

    for (int i = dataRun->offsetFieldBytes; i < 8; i++) {
        offset |= (uint64_t) 0xFF << (i * 8);
    }
}

clusterNumber += offset;

Then we have clusterNumber and length calculated.

Enumerating files

We've made it! It's time to load each MFT block and scan the file records. Because an MFT block can be quite large (hundred of MBs), we'll process it in 64MB chunks.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#define MFT_FILES_PER_BUFFER (65536)
uint8_t mftBuffer[MFT_FILES_PER_BUFFER * MFT_FILE_SIZE];

uint64_t filesRemaining = length * bytesPerCluster / MFT_FILE_SIZE;
uint64_t positionInBlock = 0;

while (filesRemaining) {
    fprintf(stderr, "...\n");

    uint64_t filesToLoad = MFT_FILES_PER_BUFFER;
    if (filesRemaining < MFT_FILES_PER_BUFFER) filesToLoad = filesRemaining;
    Read(&mftBuffer, clusterNumber * bytesPerCluster + positionInBlock, filesToLoad * MFT_FILE_SIZE);
    positionInBlock += filesToLoad * MFT_FILE_SIZE;
    filesRemaining -= filesToLoad;

    for (int i = 0; i < filesToLoad; i++) {
        // Process the file record.
    }
}

Standard stuff. Now we process each file.

Let's ignore any file records that aren't in use.

1
2
3
4
FileRecordHeader *fileRecord = (FileRecordHeader *) (mftBuffer + MFT_FILE_SIZE * i);
assert(fileRecord->magic == 0x454C4946);

if (!fileRecord->inUse) continue;

Now we need to find the $FILENAME attribute. (There might be multiple $FILENAME attributes actually - that's how hard links work in NTFS.) The attribute parsing code is the same as before.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
AttributeHeader *attribute = (AttributeHeader *) ((uint8_t *) fileRecord + fileRecord->firstAttributeOffset);

while ((uint8_t *) attribute - (uint8_t *) fileRecord < MFT_FILE_SIZE) {
    if (attribute->attributeType == 0x30) {
        // Parse the $FILE_NAME attribute.
    } else if (attribute->attributeType == 0xFFFFFFFF) {
        break;
    }

    attribute = (AttributeHeader *) ((uint8_t *) attribute + attribute->length);
}

The $FILE_NAME attribute contains the file's name in its contents. The filename is unlikely to be very long, so this attribute will be resident. (If you never understood the whole non-resident stuff, this just means the filename will be within the file record - we don't have to hunt it down elsewhere.)

Here's the header for the attribute.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
struct FileNameAttributeHeader : ResidentAttributeHeader {
    uint64_t    parentRecordNumber : 48;    // The record number (see FileRecordHeader.recordNumber) of the directory containing this file.
    uint64_t    sequenceNumber : 16;
    uint64_t    creationTime;
    uint64_t    modificationTime;
    uint64_t    metadataModificationTime;
    uint64_t    readTime;
    uint64_t    allocatedSize;
    uint64_t    realSize;
    uint32_t    flags;
    uint32_t    repase;
    uint8_t     fileNameLength;         // The length of the filename in wide chars.
    uint8_t     namespaceType;          // If this = 2 then it's the DOS name of the file. We ignore these.
    wchar_t     fileName[1];            // The filename in UTF-16.
};

There's a couple other interesting fields in there you might want to grab, but we'll ignore them for now. We'll just store an array of each file's name and its parent.

1
2
3
4
5
6
struct File {
    uint64_t    parent;             // The index in the files array of the parent.
    char           *name;               // The file name. Zero-terminated; UTF-8.
};

File *files;

For every $FILE_NAME attribute, we'll add an entry to this array.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
FileNameAttributeHeader *fileNameAttribute = (FileNameAttributeHeader *) attribute;

if (fileNameAttribute->namespaceType != 2 && !fileNameAttribute->nonResident) {
    File file = {};
    file.parent = fileNameAttribute->parentRecordNumber;
    file.name = DuplicateName(fileNameAttribute->fileName, fileNameAttribute->fileNameLength); // We'll write the DuplicateName function next!

    uint64_t oldLength = arrlenu(files);

    if (fileRecord->recordNumber >= oldLength) {
        // Initialise skipped fields to 0.
        arrsetlen(files, fileRecord->recordNumber + 1);
        memset(files + oldLength, 0, sizeof(File) * (fileRecord->recordNumber - oldLength));
    }

    files[fileRecord->recordNumber] = file;
}

We need to store the filenames elsewhere, since the MFT buffer will be reused. The following function allocates spaces for the filename, and converts it to UTF-8. We group allocations together into 16MB chunks.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
char *DuplicateName(wchar_t *name, size_t nameLength) {
    static char *allocationBlock = nullptr;
    static size_t bytesRemaining = 0;

    size_t bytesNeeded = WideCharToMultiByte(CP_UTF8, 0, name, nameLength, NULL, 0, NULL, NULL) + 1;

    if (bytesRemaining < bytesNeeded) {
        allocationBlock = (char *) malloc((bytesRemaining = 16 * 1024 * 1024));
    }

    char *buffer = allocationBlock;
    buffer[bytesNeeded - 1] = 0; // Zero-terminate.
    WideCharToMultiByte(CP_UTF8, 0, name, nameLength, allocationBlock, bytesNeeded, NULL, NULL);

    bytesRemaining -= bytesNeeded;
    allocationBlock += bytesNeeded;

    return buffer;
}

...and that's it! We're done.

Complete source

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
/*
    This is free and unencumbered software released into the public domain.
    Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
    software, either in source code form or as a compiled binary, for any purpose,
    commercial or non-commercial, and by any means.
    In jurisdictions that recognize copyright laws, the author or authors of this
    software dedicate any and all copyright interest in the software to the public
    domain. We make this dedication for the benefit of the public at large and to
    the detriment of our heirs and successors. We intend this dedication to be an
    overt act of relinquishment in perpetuity of all present and future rights to
    this software under copyright law.
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
    ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
    WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include <windows.h>

#define STB_DS_IMPLEMENTATION
#include "stb_ds.h"

#pragma pack(push,1)
struct BootSector {
    uint8_t     jump[3]; 
    char        name[8];
    uint16_t    bytesPerSector;
    uint8_t     sectorsPerCluster;
    uint16_t    reservedSectors;
    uint8_t     unused0[3];
    uint16_t    unused1;
    uint8_t     media;
    uint16_t    unused2;
    uint16_t    sectorsPerTrack;
    uint16_t    headsPerCylinder;
    uint32_t    hiddenSectors;
    uint32_t    unused3;
    uint32_t    unused4;
    uint64_t    totalSectors;
    uint64_t    mftStart;
    uint64_t    mftMirrorStart;
    uint32_t    clustersPerFileRecord;
    uint32_t    clustersPerIndexBlock;
    uint64_t    serialNumber;
    uint32_t    checksum;
    uint8_t     bootloader[426];
    uint16_t    bootSignature;
};

struct FileRecordHeader {
    uint32_t    magic;
    uint16_t    updateSequenceOffset;
    uint16_t    updateSequenceSize;
    uint64_t    logSequence;
    uint16_t    sequenceNumber;
    uint16_t    hardLinkCount;
    uint16_t    firstAttributeOffset;
    uint16_t    inUse : 1;
    uint16_t    isDirectory : 1;
    uint32_t    usedSize;
    uint32_t    allocatedSize;
    uint64_t    fileReference;
    uint16_t    nextAttributeID;
    uint16_t    unused;
    uint32_t    recordNumber;
};

struct AttributeHeader {
    uint32_t    attributeType;
    uint32_t    length;
    uint8_t     nonResident;
    uint8_t     nameLength;
    uint16_t    nameOffset;
    uint16_t    flags;
    uint16_t    attributeID;
};

struct ResidentAttributeHeader : AttributeHeader {
    uint32_t    attributeLength;
    uint16_t    attributeOffset;
    uint8_t     indexed;
    uint8_t     unused;
};

struct FileNameAttributeHeader : ResidentAttributeHeader {
    uint64_t    parentRecordNumber : 48;
    uint64_t    sequenceNumber : 16;
    uint64_t    creationTime;
    uint64_t    modificationTime;
    uint64_t    metadataModificationTime;
    uint64_t    readTime;
    uint64_t    allocatedSize;
    uint64_t    realSize;
    uint32_t    flags;
    uint32_t    repase;
    uint8_t     fileNameLength;
    uint8_t     namespaceType;
    wchar_t     fileName[1];
};

struct NonResidentAttributeHeader : AttributeHeader {
    uint64_t    firstCluster;
    uint64_t    lastCluster;
    uint16_t    dataRunsOffset;
    uint16_t    compressionUnit;
    uint32_t    unused;
    uint64_t    attributeAllocated;
    uint64_t    attributeSize;
    uint64_t    streamDataSize;
};

struct RunHeader {
    uint8_t     lengthFieldBytes : 4;
    uint8_t     offsetFieldBytes : 4;
};
#pragma pack(pop)

struct File {
    uint64_t    parent;
    char           *name;
};

File *files;

DWORD bytesAccessed;
HANDLE drive;

BootSector bootSector;

#define MFT_FILE_SIZE (1024)
uint8_t mftFile[MFT_FILE_SIZE];

#define MFT_FILES_PER_BUFFER (65536)
uint8_t mftBuffer[MFT_FILES_PER_BUFFER * MFT_FILE_SIZE];

char *DuplicateName(wchar_t *name, size_t nameLength) {
    static char *allocationBlock = nullptr;
    static size_t bytesRemaining = 0;

    size_t bytesNeeded = WideCharToMultiByte(CP_UTF8, 0, name, nameLength, NULL, 0, NULL, NULL) + 1;

    if (bytesRemaining < bytesNeeded) {
        allocationBlock = (char *) malloc((bytesRemaining = 16 * 1024 * 1024));
    }

    char *buffer = allocationBlock;
    buffer[bytesNeeded - 1] = 0;
    WideCharToMultiByte(CP_UTF8, 0, name, nameLength, allocationBlock, bytesNeeded, NULL, NULL);

    bytesRemaining -= bytesNeeded;
    allocationBlock += bytesNeeded;

    return buffer;
}

void Read(void *buffer, uint64_t from, uint64_t count) {
    LONG high = from >> 32;
    SetFilePointer(drive, from & 0xFFFFFFFF, &high, FILE_BEGIN);
    ReadFile(drive, buffer, count, &bytesAccessed, NULL);
    assert(bytesAccessed == count);
}

int main(int argc, char **argv) {
    drive = CreateFile("\\\\.\\C:", GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, 0, NULL);

    Read(&bootSector, 0, 512);

    uint64_t bytesPerCluster = bootSector.bytesPerSector * bootSector.sectorsPerCluster;

    Read(&mftFile, bootSector.mftStart * bytesPerCluster, MFT_FILE_SIZE);

    FileRecordHeader *fileRecord = (FileRecordHeader *) mftFile;
    AttributeHeader *attribute = (AttributeHeader *) (mftFile + fileRecord->firstAttributeOffset);
    NonResidentAttributeHeader *dataAttribute = nullptr;
    uint64_t approximateRecordCount = 0;
    assert(fileRecord->magic == 0x454C4946);

    while (true) {
        if (attribute->attributeType == 0x80) {
            dataAttribute = (NonResidentAttributeHeader *) attribute;
        } else if (attribute->attributeType == 0xB0) {
            approximateRecordCount = ((NonResidentAttributeHeader *) attribute)->attributeSize * 8;
        } else if (attribute->attributeType == 0xFFFFFFFF) {
            break;
        }

        attribute = (AttributeHeader *) ((uint8_t *) attribute + attribute->length);
    }

    assert(dataAttribute);
    RunHeader *dataRun = (RunHeader *) ((uint8_t *) dataAttribute + dataAttribute->dataRunsOffset);
    uint64_t clusterNumber = 0, recordsProcessed = 0;

    while (((uint8_t *) dataRun - (uint8_t *) dataAttribute) < dataAttribute->length && dataRun->lengthFieldBytes) {
        uint64_t length = 0, offset = 0;

        for (int i = 0; i < dataRun->lengthFieldBytes; i++) {
            length |= (uint64_t) (((uint8_t *) dataRun)[1 + i]) << (i * 8);
        }

        for (int i = 0; i < dataRun->offsetFieldBytes; i++) {
            offset |= (uint64_t) (((uint8_t *) dataRun)[1 + dataRun->lengthFieldBytes + i]) << (i * 8);
        }

        if (offset & ((uint64_t) 1 << (dataRun->offsetFieldBytes * 8 - 1))) {
            for (int i = dataRun->offsetFieldBytes; i < 8; i++) {
                offset |= (uint64_t) 0xFF << (i * 8);
            }
        }

        clusterNumber += offset;
        dataRun = (RunHeader *) ((uint8_t *) dataRun + 1 + dataRun->lengthFieldBytes + dataRun->offsetFieldBytes);

        uint64_t filesRemaining = length * bytesPerCluster / MFT_FILE_SIZE;
        uint64_t positionInBlock = 0;

        while (filesRemaining) {
            fprintf(stderr, "%d%% ", (int) (recordsProcessed * 100 / approximateRecordCount));

            uint64_t filesToLoad = MFT_FILES_PER_BUFFER;
            if (filesRemaining < MFT_FILES_PER_BUFFER) filesToLoad = filesRemaining;
            Read(&mftBuffer, clusterNumber * bytesPerCluster + positionInBlock, filesToLoad * MFT_FILE_SIZE);
            positionInBlock += filesToLoad * MFT_FILE_SIZE;
            filesRemaining -= filesToLoad;

            for (int i = 0; i < filesToLoad; i++) {
                // Even on an SSD, processing the file records takes only a fraction of the time to read the data,
                // so there's not much point in multithreading this.

                FileRecordHeader *fileRecord = (FileRecordHeader *) (mftBuffer + MFT_FILE_SIZE * i);
                recordsProcessed++;

                if (!fileRecord->inUse) continue;

                AttributeHeader *attribute = (AttributeHeader *) ((uint8_t *) fileRecord + fileRecord->firstAttributeOffset);
                assert(fileRecord->magic == 0x454C4946);

                while ((uint8_t *) attribute - (uint8_t *) fileRecord < MFT_FILE_SIZE) {
                    if (attribute->attributeType == 0x30) {
                        FileNameAttributeHeader *fileNameAttribute = (FileNameAttributeHeader *) attribute;

                        if (fileNameAttribute->namespaceType != 2 && !fileNameAttribute->nonResident) {
                            File file = {};
                            file.parent = fileNameAttribute->parentRecordNumber;
                            file.name = DuplicateName(fileNameAttribute->fileName, fileNameAttribute->fileNameLength);

                            uint64_t oldLength = arrlenu(files);

                            if (fileRecord->recordNumber >= oldLength) {
                                arrsetlen(files, fileRecord->recordNumber + 1);
                                memset(files + oldLength, 0, sizeof(File) * (fileRecord->recordNumber - oldLength));
                            }

                            files[fileRecord->recordNumber] = file;
                        }
                    } else if (attribute->attributeType == 0xFFFFFFFF) {
                        break;
                    }

                    attribute = (AttributeHeader *) ((uint8_t *) attribute + attribute->length);
                }
            }
        }
    }

    fprintf(stderr, "\nFound %lld files.\n", arrlen(files));

    return 0;
}

Edited by nakst on