Converting files fast?

Benjamin Pedersen

#26185

March 30, 2022

I have a lot of big binary files (usually 300-400 MB). The file format has changed slightly and I would like to convert old files to the new format.

What is the fastest method to do this? It needs to work for both Windows and Linux and preferably be in C++ or C. Should I use C FILE or C++ streams or something completely different?

Specific about the file format:

It consists of blocks of data. A block consists of

a header,
a body of data, and
a footer.

Header: In the new format the header stays the same size, but the definition of one field has changed slightly, and so has its value. The conversion depends on the footers of nearby blocks, in most cases prior blocks, but the footer of the current block also needs to be checked, so I need to read a bit ahead. If those footers aren't valid in relation to the current block, I want to look ahead for valid footers.

Body: The size of the body is dependent on the header, but in virtually every case it has a fixed size close to 8 kB. It is unchanged in the new format, so it should just be copied.

Footer: The footer has a variable size in the old format (at least four bytes) and a fixed size of four bytes in the new one. Its definition is completely different. During conversion, the value may depend on nearby blocks.

Edited by Benjamin Pedersen on March 30, 2022, 12:17pm

Mārtiņš Možeiko

#26187

March 30, 2022

For each file, read it fully into memory (the C stdio API is perfectly fine for that), make all the changes you need, either in place if possible or into a new memory block (it doesn't matter which), then write it out in full to a new file.
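A minimal sketch of that read, convert, write sequence using C stdio from C++. The names `read_whole_file`, `write_whole_file` and `convert_file` are made up for illustration, and `convert` is a placeholder for the format-specific logic, which the thread does not spell out.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Read an entire file into memory. Returns false on any error.
bool read_whole_file(const std::string& path, std::vector<unsigned char>& out) {
    std::FILE* f = std::fopen(path.c_str(), "rb");  // "b" matters on Windows
    if (!f) return false;
    bool ok = false;
    if (std::fseek(f, 0, SEEK_END) == 0) {
        // ftell returns long, which is 32-bit on Windows: fine for files of
        // a few hundred MB, but not for files of 2 GiB or more.
        const long size = std::ftell(f);
        if (size >= 0 && std::fseek(f, 0, SEEK_SET) == 0) {
            out.resize(static_cast<std::size_t>(size));
            ok = size == 0 || std::fread(out.data(), 1, out.size(), f) == out.size();
        }
    }
    std::fclose(f);
    return ok;
}

// Write a whole buffer to a file in one call. Returns false on any error.
bool write_whole_file(const std::string& path, const std::vector<unsigned char>& data) {
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) return false;
    const bool ok = data.empty() || std::fwrite(data.data(), 1, data.size(), f) == data.size();
    // fclose flushes the stdio buffer, so its result counts as part of the write.
    return std::fclose(f) == 0 && ok;
}

// `convert` is a hypothetical callable: old-format bytes in, new-format bytes out.
template <class Convert>
bool convert_file(const std::string& in_path, const std::string& out_path, Convert convert) {
    assert(in_path != out_path && "write to a new file so a failed run cannot destroy the input");
    std::vector<unsigned char> old_data;
    if (!read_whole_file(in_path, old_data)) return false;
    const std::vector<unsigned char> new_data = convert(old_data);
    return write_whole_file(out_path, new_data);
}
```

A call would look like `convert_file("old.bin", "new.bin", my_converter)`, where `my_converter` takes a `const std::vector<unsigned char>&` and returns the converted bytes.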

If you have an SSD and many files, then do this in multiple threads in parallel, one per CPU core you have.
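One way to spread many files over the available cores is sketched below, assuming C++11 threads. `convert_all` and `convert_one` are hypothetical names; `convert_one` stands for whatever converts a single file and returns true on success, and it must not throw.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Calls convert_one(path) once for every path, using up to max_threads
// workers (one per hardware thread by default). Returns how many calls
// reported success.
template <class ConvertOne>
std::size_t convert_all(const std::vector<std::string>& paths, ConvertOne convert_one,
                        unsigned max_threads = std::thread::hardware_concurrency()) {
    if (max_threads == 0) max_threads = 1;  // hardware_concurrency() may return 0
    const std::size_t thread_count = std::min<std::size_t>(max_threads, paths.size());

    std::atomic<std::size_t> next{0};       // index of the next unclaimed file
    std::atomic<std::size_t> succeeded{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);  // claim one file
            if (i >= paths.size()) return;            // nothing left to do
            if (convert_one(paths[i])) succeeded.fetch_add(1);
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < thread_count; ++t) threads.emplace_back(worker);
    for (std::thread& th : threads) th.join();

    assert(next.load() >= paths.size());  // every index was claimed by some worker
    return succeeded.load();
}
```

Each worker pulls the next index from a shared counter, so a thread that happens to get small files simply picks up more of them. Passing `max_threads = 2` gives the "just two threads" experiment suggested for HDDs.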

If you do this on an HDD, then many threads may not be as efficient. Depending on how complex your data transformations are, you might want to create separate threads for IO and for the format update. Or you could just try two threads to see how well that works.

Edited by Mārtiņš Možeiko on March 30, 2022, 7:20pm

Benjamin Pedersen

#26188

March 31, 2022

Thanks for your reply. I had not really thought about the disk and threading. Can I ask a few follow-up questions? Sorry that they are pretty basic. I know very little about disks.

If you do this on an HDD, then many threads may not be as efficient. Depending on how complex your data transformations are, you might want to create separate threads for IO and for the format update. Or you could just try two threads to see how well that works.

Do you mean something like pipelining, i.e. building up output which is then handled on another thread? Are the read and write heads in an HDD independent, so that you can read on one thread without slowing down writes on another?

If you have an SSD and many files, then do this in multiple threads in parallel, one per CPU core you have.

SSDs don't have any moving parts, so writes to all kinds of places should not be a problem. Would they cache reads, though? In that case I guess reads from different threads could interfere with each other if the cache is shared.

For each file, read it fully into memory (the C stdio API is perfectly fine for that), make all the changes you need, either in place if possible or into a new memory block (it doesn't matter which), then write it out in full to a new file.

Maybe I should just test this myself, but am I correct in assuming that the latency of disk reads, even with buffering, vastly outweighs any conceivable benefit of reading and writing as you go along, e.g.

total time bounded by the slower of reading and writing, rather than by read time plus write time.
tighter memory usage, so fewer cache misses.

Replying to mmozeiko (#26187)

Mārtiņš Možeiko

#26191

March 31, 2022

By IO thread I mean a dedicated thread that accepts read or write requests and performs the read/write operation from/to disk. It then gives the result (a filled buffer in the read case) back to the thread that made the request. This way all your IO work is serialized and happens in a single place, which is what HDDs prefer. You still do your processing in multiple background threads, because it often involves a fair amount of CPU work, which you don't want on the IO thread where it would slow down IO requests.
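A rough sketch of such a dedicated IO thread, assuming C++14 and `std::future` as the way to hand results back. The class name `IoThread` and its interface are invented for illustration; a real implementation would likely also want bounded queues and reusable buffers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Owns the only thread that touches the disk. Worker threads queue requests
// and get a future that becomes ready once the IO thread has served them.
class IoThread {
public:
    IoThread() : thread_([this] { run(); }) {}
    ~IoThread() {
        { std::lock_guard<std::mutex> lock(mutex_); stopping_ = true; }
        wake_.notify_one();
        thread_.join();  // requests still queued are served before exit
    }
    // A failed read surfaces as std::runtime_error from future::get().
    std::future<Bytes> read(std::string path) {
        return submit<Bytes>([path] { return read_file(path); });
    }
    std::future<bool> write(std::string path, Bytes data) {
        return submit<bool>([path, data = std::move(data)] { return write_file(path, data); });
    }

private:
    template <class R, class F>
    std::future<R> submit(F work) {
        // packaged_task is move-only, std::function needs copyable: share it.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(work));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stopping_ && "request submitted during shutdown");
            queue_.push([task] { (*task)(); });
        }
        wake_.notify_one();
        return result;
    }
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // only possible when stopping_
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();  // the disk access itself runs outside the lock
        }
    }
    static Bytes read_file(const std::string& path) {
        std::FILE* f = std::fopen(path.c_str(), "rb");
        if (!f) throw std::runtime_error("cannot open " + path);
        Bytes data;
        unsigned char chunk[1 << 16];
        std::size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0)
            data.insert(data.end(), chunk, chunk + n);
        const bool failed = std::ferror(f) != 0;
        std::fclose(f);
        if (failed) throw std::runtime_error("read error in " + path);
        return data;
    }
    static bool write_file(const std::string& path, const Bytes& data) {
        std::FILE* f = std::fopen(path.c_str(), "wb");
        if (!f) return false;
        const bool ok = data.empty() || std::fwrite(data.data(), 1, data.size(), f) == data.size();
        return std::fclose(f) == 0 && ok;
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> queue_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so the members above exist before it starts
};
```

A worker thread would call `io.read(path).get()` to block until its buffer arrives, run the conversion on its own core, then hand the result to `io.write(out_path, std::move(bytes))`. Requests are served strictly in the order they were queued.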

On an HDD, reads and writes happen with the same heads. You don't want to issue too many requests (reads or writes, it doesn't matter which) at the same time, because then the heads will spend a lot of time seeking to physical locations on the platters. Seek time on an HDD is slow, measured in many milliseconds. This will reduce your total bandwidth. Random reads/writes are really bad.
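To get a feel for how much seeking costs, here is a back-of-envelope model in which every request pays one seek before its data is transferred. The function name and the figures used with it (10 ms per seek, 150 MB/s sequential) are illustrative assumptions, not measurements of any particular drive.

```cpp
#include <cassert>

// Effective throughput in MB/s when each request of `request_mb` megabytes
// costs one seek of `seek_ms` milliseconds plus the sequential transfer time.
double effective_mb_per_s(double request_mb, double seek_ms, double sequential_mb_per_s) {
    assert(request_mb > 0.0 && seek_ms >= 0.0 && sequential_mb_per_s > 0.0);
    const double seek_s = seek_ms / 1000.0;
    const double transfer_s = request_mb / sequential_mb_per_s;
    return request_mb / (seek_s + transfer_s);
}
```

Under those assumed figures, `effective_mb_per_s(0.008, 10.0, 150.0)` (one seek per 8 kB block) comes out below 1 MB/s, while `effective_mb_per_s(350.0, 10.0, 150.0)` (one seek per whole file) stays just under 150 MB/s. That gap is why interleaved requests from many threads can hurt so much on an HDD.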

On an SSD there is no such issue. There is no seek time, just some minor latency from reading flash cells. It can service many requests at the same time.

The modern SATA protocol has NCQ (Native Command Queuing). The OS can issue many requests to the device, and the device decides how to serve them efficiently (e.g. by starting to read the next request before it has fully sent back the current one). This increases bandwidth and reduces latency. It is especially useful for SSDs, which is why I'm telling you to use as many threads as you can if your files are on an SSD; otherwise the SSD will sit idle waiting for your CPU to do its work. But on an HDD this may lead to too much seeking, so you need to measure carefully whether threads are useful in your case. For big files, a single IO thread will most likely be enough.

Not sure which caches you are talking about. The OS does all the caching: it uses free memory that no process is using to store the file cache. But this is pretty irrelevant to your use case, as you will be reading each file just once, so the cache gives you no benefit.

Edited by Mārtiņš Možeiko on March 31, 2022, 6:17pm

Replying to Benjipede (#26188)