How Media Molecule Does Serialization

Oswald Hurlem —

Recently, I decided that Swedish Cubes for Unity would need to have a robust system for serialization and data versioning. This would allow me to avoid getting a bad rap by releasing updates that break people's save data, enable the level designer hired for my own game to start making levels before the tools are fully finished, and finally, to have a leg up on competing products which have some problems in this regard.

This is one thing that has not been covered in great detail by HMH, but thankfully I had another valuable resource. For a while now, I've been exchanging messages on many topics with Alex Evans of Media Molecule, whom I got in touch with thanks to Stephanie Hurlburt's software engineer mentor list. Media Molecule's LittleBigPlanet series features some of the strongest, easiest-to-use, and most reliable content creation tools of any game out there. The series' codebase has been updated thousands of times across sequels, DLC, and patches. Yet a level made in LittleBigPlanet 1 on a PS3 in 2008 can be opened in LittleBigPlanet 3 on a PS4 in 2017. I figured Alex would be the perfect person to ask about this.

Alex's answer (which can be read here) was thorough and informative. He described to me Media Molecule's in-house versioned serialization/deserialization system, which he calls the "LBP Method," because it was what was used across the LittleBigPlanet series. It's also being used in Media Molecule's next title, Dreams. This post will summarize and expand upon what Alex Evans told me, with a few creative liberties, most notably spelling "Serialize" with a Z.

Standard disclaimer here -- I'm not an employee or stakeholder of Media Molecule, my opinions don't reflect theirs, yada yada.

== THE LBP METHOD ==

In the LBP Method, both serialization and deserialization of data are performed via the same procedure. This procedure visits each field of each serialized struct in a recursive, field-by-field fashion. One subroutine exists for each serialized type. These subroutines are functions named Serialize, but they also perform deserialization. Each function takes in a pointer to the data value, and a pointer to a struct containing de/serialization-relevant state.

void Serialize(lbp_serializer* LbpSerializer, T* Datum)
{
    // ???
}

This de/serialization state struct doesn't need to contain more than the version of the data being de/serialized, a handle to wherever the input and output is taking place, and a bool indicating whether serialization or deserialization is being performed.

struct lbp_serializer
{
    int32_t DataVersion;
    FILE* FilePtr;
    bool IsWriting;
};

If T is a primitive type or is a struct that is very unlikely to change, then the Serialize function is simple and leafy. It checks whether the de/serialization state struct is flagged as reading or writing, then reads or writes the value accordingly.

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    }
    else
    {
        fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    }
}

Because a single function handles both directions, the read and write operations can't go out of sync.

If T is instead a struct which is expected to undergo a revision or two in the next decade, then de/serialization of its fields is delegated to other overloaded Serialize functions.

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);
}
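To see the two directions share one code path, here is a minimal round-trip sketch. It assumes an in-memory byte buffer in place of the FILE* handle (so the lbp_serializer fields differ slightly from the ones above), and RoundTrip is a hypothetical helper, not part of the method itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: an in-memory byte buffer stands in for the article's FILE*.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        const uint8_t* Src = reinterpret_cast<const uint8_t*>(Datum);
        LbpSerializer->Bytes->insert(LbpSerializer->Bytes->end(), Src, Src + sizeof(int32_t));
    }
    else
    {
        assert(LbpSerializer->Cursor + sizeof(int32_t) <= LbpSerializer->Bytes->size());
        std::memcpy(Datum, LbpSerializer->Bytes->data() + LbpSerializer->Cursor, sizeof(int32_t));
        LbpSerializer->Cursor += sizeof(int32_t);
    }
}

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);
}

// Hypothetical helper: writes State to a buffer, then reads it back
// through the very same Serialize function.
game_score_state RoundTrip(game_score_state State)
{
    std::vector<uint8_t> Bytes;
    lbp_serializer Writer = {1, &Bytes, 0, true};
    Serialize(&Writer, &State);

    game_score_state Loaded = {0, 0};
    lbp_serializer Reader = {1, &Bytes, 0, false};
    Serialize(&Reader, &Loaded);
    return Loaded;
}
```

Calling RoundTrip with scores of 7 and 3 should hand back the same two values, since the field order is written down only once.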

Now, what happens if we want to update a struct but maintain backwards-compatibility? We can use the DataVersion value held by the de/serialization state struct. Each time a change to a struct is made, we signify it with a new revision code. In this example, we'll add two fields for the number of Fouls committed by each player. If we're reading a file from a time after this change was made, then we read these values from the file. Otherwise, we leave them at their default value (zero).

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
    int32_t P1Fouls; // Added with SV_FOULS
    int32_t P2Fouls; // Added with SV_FOULS
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);

    if (LbpSerializer->DataVersion >= SV_FOULS)
    {
        Serialize(LbpSerializer, &Datum->P1Fouls);
        Serialize(LbpSerializer, &Datum->P2Fouls);
    }
}

Checking the DataVersion against a constant value to decide whether or not to visit a Serialize function is a common enough operation that we use a macro for this, named ADD.

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    ADD(SV_FOULS, P1Fouls);
    ADD(SV_FOULS, P2Fouls);
}

How do we get this DataVersion value anyway? When reading, we read the version number first from the file and use it as we traverse down the deserialized structures. When writing, we have it set to the latest (highest) value, and write it to the file.

bool SerializeIncludingVersion(lbp_serializer* LbpSerializer, game_score_state* State)
{
    if (LbpSerializer->IsWriting)
    {
        LbpSerializer->DataVersion = LATEST_VERSION;
    }

    Serialize(LbpSerializer, &LbpSerializer->DataVersion);

    // We are reading a file from a version that came after this one!
    if (LbpSerializer->DataVersion > (LATEST_VERSION))
    {
        return false;
    }
    else
    {
        Serialize(LbpSerializer, State);
        return true;
    }
}
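As a rough illustration of the version check, the sketch here loads a file written before SV_FOULS existed, and rejects one claiming a version newer than the build knows about. It assumes an in-memory byte buffer instead of a FILE*, a two-entry version enum, and two hypothetical helpers (MakeTwoScoreFile and Load) that are not part of the method.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum : int32_t { SV_INITIAL = 1, SV_FOULS, SV_LatestPlusOne };
#define LATEST_VERSION (SV_LatestPlusOne - 1)

// Assumption: an in-memory byte buffer stands in for the article's FILE*.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        const uint8_t* Src = reinterpret_cast<const uint8_t*>(Datum);
        LbpSerializer->Bytes->insert(LbpSerializer->Bytes->end(), Src, Src + sizeof(int32_t));
    }
    else
    {
        assert(LbpSerializer->Cursor + sizeof(int32_t) <= LbpSerializer->Bytes->size());
        std::memcpy(Datum, LbpSerializer->Bytes->data() + LbpSerializer->Cursor, sizeof(int32_t));
        LbpSerializer->Cursor += sizeof(int32_t);
    }
}

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
    int32_t P1Fouls; // Added with SV_FOULS
    int32_t P2Fouls; // Added with SV_FOULS
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);
    if (LbpSerializer->DataVersion >= SV_FOULS)
    {
        Serialize(LbpSerializer, &Datum->P1Fouls);
        Serialize(LbpSerializer, &Datum->P2Fouls);
    }
}

bool SerializeIncludingVersion(lbp_serializer* LbpSerializer, game_score_state* State)
{
    if (LbpSerializer->IsWriting) LbpSerializer->DataVersion = LATEST_VERSION;
    Serialize(LbpSerializer, &LbpSerializer->DataVersion);
    if (LbpSerializer->DataVersion > LATEST_VERSION) return false; // file is newer than this build
    Serialize(LbpSerializer, State);
    return true;
}

// Hypothetical helper: the bytes a two-field build would have produced
// for the given version number (version header, then the two scores).
std::vector<uint8_t> MakeTwoScoreFile(int32_t Version, int32_t P1Score, int32_t P2Score)
{
    std::vector<uint8_t> Bytes;
    lbp_serializer Writer = {Version, &Bytes, 0, true};
    Serialize(&Writer, &Version);
    Serialize(&Writer, &P1Score);
    Serialize(&Writer, &P2Score);
    return Bytes;
}

// Hypothetical helper: reads a file image with the current code.
bool Load(std::vector<uint8_t>* Bytes, game_score_state* State)
{
    lbp_serializer Reader = {0, Bytes, 0, false};
    return SerializeIncludingVersion(&Reader, State);
}
```

Note that the caller is responsible for initializing the state to its defaults before loading, because fields that are absent from an old file are simply never touched.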

The data version is monolithic -- there's one version number for everything, and it must be increased anytime a developer adds or removes a field from a de/serialized struct. A great way to enforce this is with an enum.

enum : int32_t
{
    SV_AddedPartridge = 1,
    SV_AddedTurtleDoves,
    SV_AddedFrenchHens,
    SV_AddedCallingBirds,
    SV_AddedGoldenRings,
    // Don't remove this
    SV_LatestPlusOne
};

#define SV_LATEST (SV_LatestPlusOne - 1)

Let's say we decide that we made a mistake, and that we want to get rid of the "Fouls" fields. When a field is removed, it's not enough to just remove all associated code. Since the data is still there in files on users' hard drives, we'll want to advance the cursor past it when deserializing. The easiest way to do this is simply to read the data into a local variable, then discard it. This is also a common enough pattern that a macro exists for it, called REM.

#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
    // int32_t P1Fouls; Added with SV_FOULS, later removed with SV_NOFOULS
    // int32_t P2Fouls; Added with SV_FOULS, later removed with SV_NOFOULS
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P1Fouls, 0);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P2Fouls, 0);
}

Since the removed value temporarily exists as a local variable, you can use it in the Serialize function for a given struct. Suppose we updated a game so that a player's Score and Fouls are not tracked separately, but rather, a player's score is just the difference of the two. When reading data from an old version, we could use the local variables created by REM like so:

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P1Fouls, 0);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P2Fouls, 0);
    Datum->P1Score -= P1Fouls;
    Datum->P2Score -= P2Fouls;
}
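Here is a small sketch of that migration running against an old file and a current one. It assumes an in-memory byte buffer in place of a FILE*, sets the version directly on the serializer rather than reading a header, and uses a hypothetical LoadFrom helper to fabricate the file contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum : int32_t { SV_INITIAL = 1, SV_FOULS, SV_NOFOULS };

// Assumption: an in-memory byte buffer stands in for the article's FILE*.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        const uint8_t* Src = reinterpret_cast<const uint8_t*>(Datum);
        LbpSerializer->Bytes->insert(LbpSerializer->Bytes->end(), Src, Src + sizeof(int32_t));
    }
    else
    {
        assert(LbpSerializer->Cursor + sizeof(int32_t) <= LbpSerializer->Bytes->size());
        std::memcpy(Datum, LbpSerializer->Bytes->data() + LbpSerializer->Cursor, sizeof(int32_t));
        LbpSerializer->Cursor += sizeof(int32_t);
    }
}

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P1Fouls, 0);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P2Fouls, 0);
    Datum->P1Score -= P1Fouls; // locals stay 0 unless the file had fouls
    Datum->P2Score -= P2Fouls;
}

// Hypothetical helper: pretends FileValues is the body of a file saved at
// FileVersion, then loads it with the current Serialize function.
game_score_state LoadFrom(int32_t FileVersion, std::vector<int32_t> FileValues)
{
    std::vector<uint8_t> Bytes;
    lbp_serializer Writer = {FileVersion, &Bytes, 0, true};
    for (int32_t& Value : FileValues) Serialize(&Writer, &Value);

    game_score_state State = {0, 0};
    lbp_serializer Reader = {FileVersion, &Bytes, 0, false};
    Serialize(&Reader, &State);
    return State;
}
```

An SV_FOULS-era file holding scores 10 and 8 with fouls 2 and 3 should load as scores 8 and 5, while an SV_NOFOULS file loads its scores unchanged because both locals keep their default of zero.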

We're almost done!

One useful thing to know is that for simple structs, it's possible to have your data layout defined in one place, then use that definition for both your struct's definition and the code for de/serializing it. You accomplish this via a C preprocessor trick reminiscent of the one used to print enum names to a console. Your code for defining the struct and its de/serialization looks like this:

#define ADD_TYPED(_fieldAdded, _type, _fieldName) _type _fieldName
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) // nothing!

struct player_mission
{
#include "player_mission.h"
};

#undef ADD_TYPED
#define ADD_TYPED(_fieldAdded, _type, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

#undef REM
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, player_mission* Datum)
{
#include "player_mission.h"
}

Then, in a file named player_mission.h, you can have your serialization logic, provided it exclusively uses ADD_TYPED and REM.

ADD_TYPED(SV_AddAssAndBubbleGum, int32_t, AssesToKick);
REM(SV_AddAssAndBubbleGum, SV_RemoveGum, int32_t, PiecesOfGumToChew, 0);
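For readers who want to try the trick in a single file, here is a variant sketch. It swaps the included header for a list macro (PLAYER_MISSION_FIELDS, a name made up for this example) that is expanded twice, and it assumes an in-memory byte buffer in place of a FILE*. LoadMission is likewise a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum : int32_t { SV_AddAssAndBubbleGum = 1, SV_RemoveGum };

// Assumption: an in-memory byte buffer stands in for the article's FILE*.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        const uint8_t* Src = reinterpret_cast<const uint8_t*>(Datum);
        LbpSerializer->Bytes->insert(LbpSerializer->Bytes->end(), Src, Src + sizeof(int32_t));
    }
    else
    {
        assert(LbpSerializer->Cursor + sizeof(int32_t) <= LbpSerializer->Bytes->size());
        std::memcpy(Datum, LbpSerializer->Bytes->data() + LbpSerializer->Cursor, sizeof(int32_t));
        LbpSerializer->Cursor += sizeof(int32_t);
    }
}

// Assumption: the field list lives in a macro instead of player_mission.h.
#define PLAYER_MISSION_FIELDS \
    ADD_TYPED(SV_AddAssAndBubbleGum, int32_t, AssesToKick) \
    REM(SV_AddAssAndBubbleGum, SV_RemoveGum, int32_t, PiecesOfGumToChew, 0)

// First expansion: live fields become struct members, removed ones vanish.
#define ADD_TYPED(_fieldAdded, _type, _fieldName) _type _fieldName;
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue)
struct player_mission
{
    PLAYER_MISSION_FIELDS
};
#undef ADD_TYPED
#undef REM

static_assert(sizeof(player_mission) == sizeof(int32_t), "the removed field takes no space");

// Second expansion: the same list becomes the de/serialization code.
#define ADD_TYPED(_fieldAdded, _type, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, player_mission* Datum)
{
    PLAYER_MISSION_FIELDS
}

// Hypothetical helper: loads a mission from fabricated file values and
// reports how many bytes the reader consumed.
player_mission LoadMission(int32_t FileVersion, std::vector<int32_t> FileValues, size_t* BytesConsumed)
{
    std::vector<uint8_t> Bytes;
    lbp_serializer Writer = {FileVersion, &Bytes, 0, true};
    for (int32_t& Value : FileValues) Serialize(&Writer, &Value);

    player_mission Mission = {0};
    lbp_serializer Reader = {FileVersion, &Bytes, 0, false};
    Serialize(&Reader, &Mission);
    *BytesConsumed = Reader.Cursor;
    return Mission;
}
```

The separate header has the advantage of keeping the field list readable without line continuations; the list macro is only a convenience for a single-file experiment.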

Finally, there's a way to increase your certainty that your LBP-style de/serialization code has not had any bugs recently added to it. Add a Counter value to the de/serialization struct, then use a simple macro which de/serializes Counter, compares Counter to what was de/serialized, then increments Counter. If, while reading the data, the counter read from the file does not match the reference value, the code will fire an assert.

struct lbp_serializer
{
    // ...
    int32_t Counter; // added
};

#define CHECK_INTEGRITY(_checkAdded) \
    if (LbpSerializer->DataVersion >= (_checkAdded)) \
    { \
        int32_t Check = LbpSerializer->Counter; \
        Serialize(LbpSerializer, &Check); \
        ASSERT(Check == LbpSerializer->Counter); \
        LbpSerializer->Counter++; \
    }
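To get a feel for what the check catches, here is a sketch in which a deliberately broken reader forgets a field. It makes a few assumptions: an in-memory byte buffer replaces the FILE*, a Failed flag on the serializer replaces ASSERT so the outcome can be inspected instead of aborting, and SerializeForgettingP2Score and ReadDetectsCorruption are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum : int32_t { SV_INITIAL = 1 };

// Assumptions: an in-memory buffer stands in for FILE*, and Failed
// stands in for the article's ASSERT.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
    int32_t Counter;
    bool Failed;
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        const uint8_t* Src = reinterpret_cast<const uint8_t*>(Datum);
        LbpSerializer->Bytes->insert(LbpSerializer->Bytes->end(), Src, Src + sizeof(int32_t));
    }
    else
    {
        assert(LbpSerializer->Cursor + sizeof(int32_t) <= LbpSerializer->Bytes->size());
        std::memcpy(Datum, LbpSerializer->Bytes->data() + LbpSerializer->Cursor, sizeof(int32_t));
        LbpSerializer->Cursor += sizeof(int32_t);
    }
}

#define CHECK_INTEGRITY(_checkAdded) \
    if (LbpSerializer->DataVersion >= (_checkAdded)) \
    { \
        int32_t Check = LbpSerializer->Counter; \
        Serialize(LbpSerializer, &Check); \
        if (Check != LbpSerializer->Counter) { LbpSerializer->Failed = true; } \
        LbpSerializer->Counter++; \
    }

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);
    CHECK_INTEGRITY(SV_INITIAL);
}

// Deliberately buggy variant: skips P2Score, so the reader's cursor
// lands on the wrong bytes when it reaches the check.
void SerializeForgettingP2Score(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    CHECK_INTEGRITY(SV_INITIAL);
}

// Hypothetical helper: writes with the correct code, reads with either
// the correct or the buggy code, and reports whether the check tripped.
bool ReadDetectsCorruption(bool UseBuggyReader)
{
    std::vector<uint8_t> Bytes;
    game_score_state Source = {11, 42}; // P2Score differs from the counter value 0
    lbp_serializer Writer = {SV_INITIAL, &Bytes, 0, true, 0, false};
    Serialize(&Writer, &Source);

    game_score_state Loaded = {0, 0};
    lbp_serializer Reader = {SV_INITIAL, &Bytes, 0, false, 0, false};
    if (UseBuggyReader) SerializeForgettingP2Score(&Reader, &Loaded);
    else Serialize(&Reader, &Loaded);
    return Reader.Failed;
}
```

The check is probabilistic rather than airtight: in this sketch the buggy reader would slip through if P2Score happened to equal the expected counter value.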

That's about it! Phew! If you want to cement these concepts further in your head, you can see most of them demonstrated in this piece of sample code, complete with five revisions, which demonstrates how LBP Method code changes over time. In the sample code, I begin with a very dumb imitation of a Play-By-Email Pong game (SV_Scores), but then I revise it. I make it so that players can foul (SV_Fouls), I make it so that more than two players can be in a game (SV_ExtraPlayers), I change how the players are modeled in code (SV_AllPlayersInList), then finally I remove the Fouls field in favor of a simpler game state (SV_FoulsUntracked).

== BUT WHAT DOES THE LBP METHOD MEAN FOR ME? ==

There are a number of competing concerns that go into choosing a format and method for de/serializing versioned data. Among those Alex mentioned are:

  • Backwards compatibility (being able to open file_v1.dat with program_v3.exe)
  • Forwards compatibility (being able to open file_v3.dat with program_v1.exe)
  • Self-description (tools can parse the data without special knowledge)
  • File size
  • Serialization speed
  • Reliability
  • Flexibility

And a few others I thought of are:

  • Human readability
  • Platform-independence
  • Complexity for users
  • Degree of abstraction
  • Standards compliance

Let's pretend I'm leading an MM-sized team that is making a console-exclusive game exactly like LittleBigPlanet, called GiantTinyWorld. Which of these concerns do I have, and how well does the LBP Method address them?

To begin, backwards compatibility is vital. My team does not want users or developers to have to retool old content to make it work with a new version. And it's something the LBP Method is great at. Check.

The LBP Method has no forwards compatibility, but I don't think this is a sticking point. It's fairly untroublesome to make sure that developers making content for GiantTinyWorld always have the latest build on their machines. The same is true for GiantTinyWorld's playerbase. Game consoles' online services generally prevent users from downloading the newest content for a game without first updating the software.

The LBP Method creates files which are neither self-describing nor human readable. This means that post-apocalyptic wasteland archeologists will not be able to extract GiantTinyWorld data should they find it on a radioactive thumb drive. But my development team will rarely need to open GiantTinyWorld data with anything other than our own internal tools. It's also fairly painless to code a one-off JSON exporter.

Since the LBP Method writes tagless binary files, their uncompressed size is quite small. And because the deserialization process involves no tag or metadata lookups, the LBP Method is fast -- certainly in comparison to a console's hard disk and network card. Both of these are big wins for GiantTinyWorld, because they keep load times short.

The LBP Method facilitates the creation of reliable code -- very important for GiantTinyWorld's audience, which would hate to see their Sonic The Hedgehog tributes destroyed by bugs. It's reliable because version differences are accounted for every time data is deserialized. It's much better than having dozens of isolated, specific conversion functions! And with a few well-placed CHECK_INTEGRITYs, any mistakes on the part of the programmer will cause the de/serialization procedure to fail early and hard.

Flexibility is perhaps the LBP Method's greatest strength. It allows conversions between old and new formats to be performed by unrestricted, procedural C code. One thing that Media Molecule often does with the LBP Method is replacing two or three named bool fields with an enum or bitfield. It's also easy to change the scale of values. For example, you can make a field represent meters instead of kilometers. Operations like these tend to be difficult to perform when using tagged serialization systems like protobuf. This flexibility is very important for the development of GiantTinyWorld -- it means we can continually revise it into the best product it can be.
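As a sketch of the bools-to-bitfield migration mentioned above, consider a made-up door_state struct whose two bool fields were replaced by a Flags field at a hypothetical SV_FLAGS version. Everything here other than the ADD and REM macros is an assumption for illustration, including the in-memory buffer used in place of a FILE* and the LoadOldDoor helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum : int32_t { SV_INITIAL = 1, SV_FLAGS };

// Assumption: an in-memory byte buffer stands in for the article's FILE*.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
};

// Raw-bytes leaf shared by the primitive overloads.
void SerializeBytes(lbp_serializer* LbpSerializer, void* Data, size_t Size)
{
    if (LbpSerializer->IsWriting)
    {
        const uint8_t* Src = static_cast<const uint8_t*>(Data);
        LbpSerializer->Bytes->insert(LbpSerializer->Bytes->end(), Src, Src + Size);
    }
    else
    {
        assert(LbpSerializer->Cursor + Size <= LbpSerializer->Bytes->size());
        std::memcpy(Data, LbpSerializer->Bytes->data() + LbpSerializer->Cursor, Size);
        LbpSerializer->Cursor += Size;
    }
}

void Serialize(lbp_serializer* S, bool* Datum) { SerializeBytes(S, Datum, sizeof(bool)); }
void Serialize(lbp_serializer* S, uint32_t* Datum) { SerializeBytes(S, Datum, sizeof(uint32_t)); }

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

constexpr uint32_t DOOR_LOCKED = 1u << 0;
constexpr uint32_t DOOR_OPEN = 1u << 1;

struct door_state
{
    // bool IsLocked; Added with SV_INITIAL, removed with SV_FLAGS
    // bool IsOpen;   Added with SV_INITIAL, removed with SV_FLAGS
    uint32_t Flags; // Added with SV_FLAGS
};

void Serialize(lbp_serializer* LbpSerializer, door_state* Datum)
{
    REM(SV_INITIAL, SV_FLAGS, bool, IsLocked, false);
    REM(SV_INITIAL, SV_FLAGS, bool, IsOpen, false);
    ADD(SV_FLAGS, Flags);
    if (LbpSerializer->DataVersion < SV_FLAGS)
    {
        // Old file: fold the two bools into the new bitfield.
        Datum->Flags = (IsLocked ? DOOR_LOCKED : 0u) | (IsOpen ? DOOR_OPEN : 0u);
    }
}

// Hypothetical helper: fabricates an SV_INITIAL-era file and loads it.
door_state LoadOldDoor(bool IsLocked, bool IsOpen)
{
    std::vector<uint8_t> Bytes;
    lbp_serializer Writer = {SV_INITIAL, &Bytes, 0, true};
    Serialize(&Writer, &IsLocked);
    Serialize(&Writer, &IsOpen);

    door_state Door = {0};
    lbp_serializer Reader = {SV_INITIAL, &Bytes, 0, false};
    Serialize(&Reader, &Door);
    return Door;
}
```

Because writing always happens at the latest version, the conversion branch only ever runs while reading old data. A unit change such as kilometers to meters would follow the same shape, with a multiplication in place of the bit packing.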

The LBP Method isn't platform-independent, though this issue likely isn't hard to fix. Regardless, GiantTinyWorld is console-exclusive.
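One plausible way to address the byte-order part of this, shown as a sketch rather than as anything Media Molecule is known to do, is to have the leaf Serialize function always store integers least-significant byte first. The in-memory buffer and the Encode and Decode helpers are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: an in-memory byte buffer stands in for the article's FILE*.
struct lbp_serializer
{
    int32_t DataVersion;
    std::vector<uint8_t>* Bytes;
    size_t Cursor; // read position; unused while writing
    bool IsWriting;
};

// Stores the value least-significant byte first, whatever the host's
// native byte order is.
void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        uint32_t Bits;
        std::memcpy(&Bits, Datum, sizeof(Bits)); // reinterpret as unsigned for shifting
        for (int Shift = 0; Shift < 32; Shift += 8)
        {
            LbpSerializer->Bytes->push_back(static_cast<uint8_t>((Bits >> Shift) & 0xFFu));
        }
    }
    else
    {
        assert(LbpSerializer->Cursor + 4 <= LbpSerializer->Bytes->size());
        uint32_t Bits = 0;
        for (int Index = 0; Index < 4; ++Index)
        {
            uint32_t Byte = (*LbpSerializer->Bytes)[LbpSerializer->Cursor + Index];
            Bits |= Byte << (8 * Index);
        }
        LbpSerializer->Cursor += 4;
        std::memcpy(Datum, &Bits, sizeof(Bits));
    }
}

// Hypothetical helper: the bytes produced for a single value.
std::vector<uint8_t> Encode(int32_t Value)
{
    std::vector<uint8_t> Bytes;
    lbp_serializer Writer = {1, &Bytes, 0, true};
    Serialize(&Writer, &Value);
    return Bytes;
}

// Hypothetical helper: the value recovered from those bytes.
int32_t Decode(std::vector<uint8_t> Bytes)
{
    int32_t Value = 0;
    lbp_serializer Reader = {1, &Bytes, 0, false};
    Serialize(&Reader, &Value);
    return Value;
}
```

Since every struct-level Serialize function bottoms out in leaf functions like this one, the change stays confined to the primitives. Differences in type sizes or floating-point formats would need similar treatment and are not covered here.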

Finally, only experienced in-house developers will be working with GiantTinyWorld's code. For that audience, the LBP Method is sufficiently simple and abstract, and it does not need to comply with any standards.

This analysis shows that the LBP Method is a good fit for games like LittleBigPlanet. What's interesting is that Alex Evans tells me this method was one of the very first ones he tried, and not the last. After LittleBigPlanet 1, Media Molecule tried to find some alternative solutions to work around a few of the LBP Method's shortcomings. Some solutions, such as branchable version numbers and per-structure revisions, ended up increasing the complexity for comparatively little added capability. Others, such as self-descriptive serialization, reduced complexity but at too high a cost in terms of flexibility. It's serendipitous that the LBP Method, which, according to Alex, "you could code in an afternoon" (close -- it took me a whole day), ended up being the one they'd use consistently for over a decade.

So far in Swedish Cubes' development, I have been using a C# version of the LBP Method, and it's been working well. Will I use it for the finished product? I'm not sure. I'll have a concern that Media Molecule has not yet had -- licensed developers interacting with my code. But the elegance, performance, reliability, handmade-ness, and proven track record of the LBP Method make me want to use it, or at least something not far from it. I strongly recommend that you try it out for your project, too.

Follow my twitter!

EDIT July 2018: Another upside to the "LBP Method" is that it allows you to very easily make simple dummy functions.

Comments

Wow this is super useful. Thanks for that!
Hello, and thanks for sharing this! I'm currently working on finding the best serialization system for my projects, and I find very few useful resources online.

Are you sure the LBP method is not platform-independent? If I'm not mistaken, PS3 has a variant of a PowerPC CPU, and PS4 is x86, so opening a PS3 save on PS4 should require at least an endianness conversion. And maybe a way to convert 32-bit pointers to 64-bit, if this system allows saving pointers.

By the way, this is a thing I tried to do in my system: not having to worry about using persistent pointers in the code, and not having to convert these to indices/handles when saving, or the other way around when loading. I managed to do it by putting all the "serializable" data in one place, and when it's time for saving, simply store the "starting" address of this place, and dump all that block of data. When loading, I simply have to offset all the saved pointers by the "starting" address I saved, in case I can't load to the exact same location in memory.

(I know, this doesn't sound like an update-friendly system...)

When I tested, this worked really well... Until I tried to open a file made with the 64-bit executable with the 32-bit executable, and vice-versa. I had begun to make a big conversion function, to offset all the data that was after each pointer, offsetting each pointer to take account of that difference, dealing with the difference of alignment requirements of structures between the two executables... But I felt this was not worth it.

In the end I used, in some places, indices instead of pointers, and in others, what I called "relative pointers", which have to be converted to actual pointers to access the data (with a little bit of magic, this conversion is invisible in the code, and almost none of the code using pointers had to be changed), a system that works only as long as the pointed data is in the same "serializable" block as the pointer...

Is this a problem you or Media Molecule had to deal with?
Great explanation. I do something very similar in my engine (www.fireflytech.org). Binary serialisation is VERY fast even with large meshes. If you want to support inheritance i.e. if you have built a plugin based engine then you'll want to track a version for your base framework (i.e. the version of the framework class your plugin inherits from) in addition to tracking your plugin class's version.
@Guntha Yes and I used indices like you. Generally any state that lives for more than one frame is pointer-free.

Doesn't using the REM macro in the struct definition (e.g. the player_mission struct) mean the field still exists in newer versions?

  1. Won't this waste space?
  2. If it still exists then why don't you just call the serialize function on the datum struct itself rather than make a local variable?
  3. You only serialize the removed field when the serializer's version is less than the _fieldRemoved enum, but since the removed field is still inside the struct, wouldn't that advance the file pointer incorrectly in newer versions?

No. REM(FIELD_ADDED, FIELD_REMOVED, int, score, 0) inside the struct definition expands to nothing (it's an empty define), so there is no wasted space at runtime.

Inside the serializer code, it expands to

    int score = (0); 
    if (LbpSerializer->DataVersion >= (FIELD_ADDED) && 
        LbpSerializer->DataVersion < (FIELD_REMOVED)) 
    { 
        Serialize(LbpSerializer, &(score)); 
    }

after which you can use the local variable to patch up the struct should you be loading a file from between FIELD_ADDED and FIELD_REMOVED. When the version is outside of those bounds the file pointer doesn't advance at all.

That's my point! Because REM expands to nothing, the removed field is still there!

// ...
ADD(FIELD_ADDED, int, myField) // expands to int myField;
REM(FIELD_ADDED, FIELD_REMOVED, int, myField, 0) // expands to nothing

// Later
struct MyStruct
{
    #include "MyStruct.h";
};

// will expand to
struct MyStruct
{
    // other stuff
    int myField; // Should have been removed
};

You don't use both the ADD and REM macro for the same field. It's one or the other. When you add a field you use ADD, when you remove the field you change it to REM. So the field is not in the struct anymore.

If you have a working example, try reading the preprocessor output to see what the final code looks like (/P option for MSVC).

https://docs.microsoft.com/en-us/cpp/build/reference/p-preprocess-to-a-file

That makes sense!