How Media Molecule Does Serialization

Recently, I decided that Swedish Cubes for Unity would need to have a robust system for serialization and data versioning. This would allow me to avoid getting a bad rap by releasing updates that break people's save data, enable the level designer hired for my own game to start making levels before the tools are fully finished, and finally, to have a leg up on competing products which have some problems in this regard.

This is one thing that has not been covered in great detail by HMH, but thankfully I had another valuable resource. For a while now, I've been exchanging messages on many topics with Alex Evans of Media Molecule, who I got in touch with thanks to Stephanie Hurlburt's software engineer mentor list. Media Molecule's LittleBigPlanet series features some of the strongest, most easy-to-use, and most reliable content creation tools of any game out there. The series's codebase has been updated thousands of times, and has had sequels, DLC, and patches released. Yet a level made on LittleBigPlanet 1 on a PS3 in 2008 can be opened in LittleBigPlanet 3 on a PS4 in 2017. I figured Alex would be the perfect person to ask about this.

Alex's answer (which can be read here) was thorough and informative. He described to me Media Molecule's in-house versioned serialization/deserialization system, which he calls the "LBP Method," because it was what was used across the LittleBigPlanet series. It's also being used in Media Molecule's next title, Dreams. This post will summarize and expand upon what Alex Evans told me, with a few creative liberties, most notably spelling "Serialize" with a Z.

Standard disclaimer here -- I'm not an employee or stakeholder of Media Molecule, my opinions don't reflect theirs, yada yada.

== THE LBP METHOD ==

In the LBP Method, both serialization and deserialization of data are performed via the same procedure. This procedure visits each field of each serialized struct in a recursive, field-by-field fashion. One subroutine exists for each serialized type. These subroutines are functions named Serialize, but they also perform deserialization. Each function takes in a pointer to the data value, and a pointer to a struct containing de/serialization-relevant state.

void Serialize(lbp_serializer* LbpSerializer, T* Datum)
{
    // ???
}

This de/serialization state struct doesn't need to contain more than the version of the data being de/serialized, a handle to wherever the input and output is taking place, and a bool indicating whether serialization or deserialization is being performed.

struct lbp_serializer
{
    int32_t DataVersion;
    FILE* FilePtr;
    bool IsWriting;
};

If T is a primitive type or is a struct that is very unlikely to change, then the Serialize function is simple and leafy. It checks whether the de/serialization state struct is flagged as reading or writing, then reads or writes the value accordingly.

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    if (LbpSerializer->IsWriting)
    {
        fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    }
    else
    {
        fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    }
}

This has the advantage of making it so that the read and write operations can't go out of sync.
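To see that property in action, here's a minimal round-trip sketch. The RoundTrip helper and the use of tmpfile() are my own assumptions for illustration, not part of Media Molecule's code, and the leaf function additionally asserts on the I/O result.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct lbp_serializer
{
    int32_t DataVersion;
    FILE* FilePtr;
    bool IsWriting;
};

// Leaf function from the article, with the I/O result checked by an assert.
void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done; // avoids an unused-variable warning when asserts are compiled out
}

// Hypothetical helper: writes Value to a temporary file, rewinds,
// then reads it back through the very same Serialize function.
int32_t RoundTrip(int32_t Value)
{
    FILE* File = tmpfile();

    lbp_serializer Writer = {1, File, true};
    Serialize(&Writer, &Value);

    rewind(File);

    int32_t Result = 0;
    lbp_serializer Reader = {1, File, false};
    Serialize(&Reader, &Result);

    fclose(File);
    return Result;
}
```

Calling RoundTrip(1234) should hand back 1234. The only thing that differs between the two passes is the IsWriting flag.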

If T is instead a struct which is expected to undergo a revision or two in the next decade, then de/serialization of its fields is delegated to other overloaded Serialize functions.

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);
}

Now, what happens if we want to update a struct but maintain backwards-compatibility? We can use the DataVersion value held by the de/serialization state struct. Each time a change to a struct is made, we signify it with a new revision code. In this example, we'll add two fields for the number of Fouls committed by each player. If we're reading a file from a time after this change was made, then we read these values from the file. Otherwise, we leave them at their default value (zero).

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
    int32_t P1Fouls; // Added with SV_FOULS
    int32_t P2Fouls; // Added with SV_FOULS
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);

    if (LbpSerializer->DataVersion >= SV_FOULS)
    {
        Serialize(LbpSerializer, &Datum->P1Fouls);
        Serialize(LbpSerializer, &Datum->P2Fouls);
    }
}

Checking the DataVersion against a constant value to decide whether or not to visit a Serialize function is a common enough operation that we use a macro for this, named ADD.

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    ADD(SV_FOULS, P1Fouls);
    ADD(SV_FOULS, P2Fouls);
}
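
As a rough check of the versioning logic, the sketch below loads the same struct from an "old" and a "new" file. The numeric values of SV_INITIAL and SV_FOULS and the WriteThenRead helper are assumptions made for this example. The old file is simulated by running the writer with an old DataVersion.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Assumed revision codes; the numeric values are illustrative.
enum : int32_t { SV_INITIAL = 1, SV_FOULS = 2 };

struct lbp_serializer { int32_t DataVersion; FILE* FilePtr; bool IsWriting; };

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
    int32_t P1Fouls; // Added with SV_FOULS
    int32_t P2Fouls; // Added with SV_FOULS
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done;
}

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    ADD(SV_FOULS, P1Fouls);
    ADD(SV_FOULS, P2Fouls);
}

// Hypothetical helper: writes State laid out as it was at Version,
// then reads the file back into a zero-initialized struct.
game_score_state WriteThenRead(int32_t Version, game_score_state State)
{
    FILE* File = tmpfile();

    lbp_serializer Writer = {Version, File, true};
    Serialize(&Writer, &State);

    rewind(File);

    game_score_state Result = {}; // fields absent from the file keep this default
    lbp_serializer Reader = {Version, File, false};
    Serialize(&Reader, &Result);

    fclose(File);
    return Result;
}
```

With Version set to SV_INITIAL, the Fouls fields never reach the file, so they come back as zero. With SV_FOULS, all four values survive the trip.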

How do we get this DataVersion value anyway? When reading, we read the version number first from the file and use it as we traverse down the deserialized structures. When writing, we have it set to the latest (highest) value, and write it to the file.

bool SerializeIncludingVersion(lbp_serializer* LbpSerializer, game_score_state* State)
{
    if (LbpSerializer->IsWriting)
    {
        LbpSerializer->DataVersion = LATEST_VERSION;
    }

    Serialize(LbpSerializer, &LbpSerializer->DataVersion);

    // We are reading a file from a version that came after this one!
    if (LbpSerializer->DataVersion > (LATEST_VERSION))
    {
        return false;
    }
    else
    {
        Serialize(LbpSerializer, State);
        return true;
    }
}

The data version is monolithic -- there's one version number for everything, and it must be increased anytime a developer adds or removes a field from a de/serialized struct. A great way to enforce this is with an enum.

enum : int32_t
{
    SV_AddedPartridge = 1,
    SV_AddedTurtleDoves,
    SV_AddedFrenchHens,
    SV_AddedCallingBirds,
    SV_AddedGoldenRings,
    // Don't remove this
    SV_LatestPlusOne
};

#define SV_LATEST (SV_LatestPlusOne - 1)
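
Putting the version header and the enum trick together, here's a sketch that checks which files the loader accepts. It assumes SV_LATEST and LATEST_VERSION are the same constant, uses made-up revision names, and relies on a hypothetical CanLoadFileWithVersion helper that fakes a file from a given version.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum : int32_t
{
    SV_INITIAL = 1,
    SV_FOULS,
    // Don't remove this
    SV_LatestPlusOne
};

#define SV_LATEST (SV_LatestPlusOne - 1)

struct lbp_serializer { int32_t DataVersion; FILE* FilePtr; bool IsWriting; };
struct game_score_state { int32_t P1Score, P2Score, P1Fouls, P2Fouls; };

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done;
}

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    Serialize(LbpSerializer, &Datum->P1Score);
    Serialize(LbpSerializer, &Datum->P2Score);
    if (LbpSerializer->DataVersion >= SV_FOULS)
    {
        Serialize(LbpSerializer, &Datum->P1Fouls);
        Serialize(LbpSerializer, &Datum->P2Fouls);
    }
}

bool SerializeIncludingVersion(lbp_serializer* LbpSerializer, game_score_state* State)
{
    if (LbpSerializer->IsWriting)
    {
        LbpSerializer->DataVersion = SV_LATEST;
    }
    Serialize(LbpSerializer, &LbpSerializer->DataVersion);

    // Refuse files written by a build newer than this one.
    if (LbpSerializer->DataVersion > SV_LATEST)
    {
        return false;
    }
    Serialize(LbpSerializer, State);
    return true;
}

// Hypothetical helper: fakes a file stamped with FileVersion, then tries to load it.
bool CanLoadFileWithVersion(int32_t FileVersion)
{
    FILE* File = tmpfile();
    game_score_state State = {1, 2, 0, 0};

    lbp_serializer Writer = {FileVersion, File, true};
    Serialize(&Writer, &Writer.DataVersion); // version header
    Serialize(&Writer, &State);              // body, laid out per FileVersion

    rewind(File);

    game_score_state Loaded = {};
    lbp_serializer Reader = {0, File, false};
    bool Ok = SerializeIncludingVersion(&Reader, &Loaded);

    fclose(File);
    return Ok;
}
```

Files stamped SV_INITIAL or SV_LATEST load fine, while a file stamped SV_LATEST + 1 is rejected before any of its body is touched.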

Let's say we decide that we made a mistake, and that we want to get rid of the "Fouls" fields. When a field is removed, it's not enough to just remove all associated code. Since the data is still there in files on users' hard drives, we'll want to advance the cursor past it when deserializing. The easiest way to do this is simply to read the data into a local variable, then discard it. This is also a common enough pattern that a macro exists for it, called REM.

#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

struct game_score_state
{
    int32_t P1Score;
    int32_t P2Score;
    // int32_t P1Fouls; Added with SV_FOULS, later removed with SV_NOFOULS
    // int32_t P2Fouls; Added with SV_FOULS, later removed with SV_NOFOULS
};

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P1Fouls, 0);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P2Fouls, 0);
}

Since the removed value temporarily exists as a local variable, you can use it in the Serialize function for a given struct. Suppose we updated a game so that a player's Score and Fouls are not tracked separately, but rather, a player's score is just the difference of the two. When reading data from an old version, we could use the local variables created by REM like so:

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P1Fouls, 0);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P2Fouls, 0);
    Datum->P1Score -= P1Fouls;
    Datum->P2Score -= P2Fouls;
}
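
Here's a sketch that runs this conversion against a fake old file. The revision values and the LoadOldFile helper are assumptions for the example. The old layout is written out by hand, since the current struct no longer has the Fouls fields.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Assumed revision codes, in the order the article introduces them.
enum : int32_t { SV_INITIAL = 1, SV_FOULS, SV_NOFOULS };

struct lbp_serializer { int32_t DataVersion; FILE* FilePtr; bool IsWriting; };
struct game_score_state { int32_t P1Score; int32_t P2Score; };

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done;
}

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum)
{
    ADD(SV_INITIAL, P1Score);
    ADD(SV_INITIAL, P2Score);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P1Fouls, 0);
    REM(SV_FOULS, SV_NOFOULS, int32_t, P2Fouls, 0);
    // For new files the locals stay at 0, so these lines change nothing.
    Datum->P1Score -= P1Fouls;
    Datum->P2Score -= P2Fouls;
}

// Hypothetical helper: hand-writes a file in the SV_FOULS layout, then loads it.
game_score_state LoadOldFile(int32_t P1Score, int32_t P2Score, int32_t P1Fouls, int32_t P2Fouls)
{
    FILE* File = tmpfile();

    lbp_serializer Writer = {SV_FOULS, File, true};
    int32_t OldFields[4] = {P1Score, P2Score, P1Fouls, P2Fouls};
    for (int32_t& Field : OldFields)
    {
        Serialize(&Writer, &Field);
    }

    rewind(File);

    game_score_state Result = {};
    lbp_serializer Reader = {SV_FOULS, File, false};
    Serialize(&Reader, &Result);

    fclose(File);
    return Result;
}
```

An old file holding scores of 10 and 8 with 3 and 1 fouls loads as scores of 7 and 7.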

We're almost done!

One useful thing to know is that for simple structs, it's possible to have your data layout defined in one place, then use that definition for both your struct's definition and the code for de/serializing it. You accomplish this via a C preprocessor trick reminiscent of the one used to print enum names to a console. Your code for defining the struct and its de/serialization looks like this:

#define ADD_TYPED(_fieldAdded, _type, _fieldName) _type _fieldName
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) // nothing!

struct player_mission
{
#include "player_mission.h"
};

#undef ADD_TYPED
#define ADD_TYPED(_fieldAdded, _type, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

#undef REM
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, player_mission* Datum)
{
#include "player_mission.h"
}

Then, in a file named player_mission.h, you can have your serialization logic, provided it exclusively uses ADD_TYPED and REM.

ADD_TYPED(SV_AddAssAndBubbleGum, int32_t, AssesToKick);
REM(SV_AddAssAndBubbleGum, SV_RemoveGum, int32_t, PiecesOfGumToChew, 0);
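
If you'd like to try the trick without juggling a second file, a single-file variant can be sketched by putting the field list in a macro instead of an #include. This is a swapped-in variation, not what the article's listing does: the PLAYER_MISSION_FIELDS macro, the LoadOldMission helper, and the enum values are all assumptions, and the semicolons move from the field list into the macros.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum : int32_t { SV_AddAssAndBubbleGum = 1, SV_RemoveGum };

struct lbp_serializer { int32_t DataVersion; FILE* FilePtr; bool IsWriting; };

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done;
}

// Single source of truth for the layout (stands in for player_mission.h).
#define PLAYER_MISSION_FIELDS \
    ADD_TYPED(SV_AddAssAndBubbleGum, int32_t, AssesToKick) \
    REM(SV_AddAssAndBubbleGum, SV_RemoveGum, int32_t, PiecesOfGumToChew, 0)

// Pass 1: live fields become struct members, removed fields vanish.
#define ADD_TYPED(_fieldAdded, _type, _fieldName) _type _fieldName;
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue)

struct player_mission
{
    PLAYER_MISSION_FIELDS
};

// Pass 2: the same list becomes the body of Serialize.
#undef ADD_TYPED
#undef REM
#define ADD_TYPED(_fieldAdded, _type, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }
#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, player_mission* Datum)
{
    PLAYER_MISSION_FIELDS
}

// Hypothetical helper: hand-writes a file from before the gum field was removed.
player_mission LoadOldMission(int32_t AssesToKick, int32_t PiecesOfGumToChew)
{
    FILE* File = tmpfile();

    lbp_serializer Writer = {SV_AddAssAndBubbleGum, File, true};
    Serialize(&Writer, &AssesToKick);
    Serialize(&Writer, &PiecesOfGumToChew);

    rewind(File);

    player_mission Result = {};
    lbp_serializer Reader = {SV_AddAssAndBubbleGum, File, false};
    Serialize(&Reader, &Result);

    fclose(File);
    return Result;
}
```

The struct ends up with a single int32_t member, while Serialize still knows to read past the removed gum count in old files.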

Finally, there's a way to increase your certainty that your LBP-style de/serialization code has not had any bugs recently added to it. Add a Counter value to the de/serialization struct, then use a simple macro which de/serializes Counter, compares Counter to what was de/serialized, then increments Counter. If, while reading the data, the counter read from the file does not match the reference value, the code will fire an assert.

struct lbp_serializer
{
    // ...
    int32_t Counter; // added
};

#define CHECK_INTEGRITY(_checkAdded) \
    if (LbpSerializer->DataVersion >= (_checkAdded)) \
    { \
        int32_t Check = LbpSerializer->Counter; \
        Serialize(LbpSerializer, &Check); \
        ASSERT(Check == LbpSerializer->Counter++); \
    }
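
To show the check catching a mistake, the sketch below deliberately breaks a reader. Everything bug-related here is invented for the demo: the ForgetP1Score flag simulates a programmer forgetting a field, and a hypothetical Mismatch flag replaces ASSERT so the failure can be observed without aborting.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum : int32_t { SV_INITIAL = 1 };

struct lbp_serializer
{
    int32_t DataVersion;
    FILE* FilePtr;
    bool IsWriting;
    int32_t Counter;
    bool Mismatch; // hypothetical: set instead of firing an assert
};

struct game_score_state { int32_t P1Score; int32_t P2Score; };

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done;
}

// Same shape as the article's macro, but it records the failure in a flag.
#define CHECK_INTEGRITY(_checkAdded) \
    if (LbpSerializer->DataVersion >= (_checkAdded)) \
    { \
        int32_t Check = LbpSerializer->Counter; \
        Serialize(LbpSerializer, &Check); \
        if (Check != LbpSerializer->Counter++) \
        { \
            LbpSerializer->Mismatch = true; \
        } \
    }

// ForgetP1Score simulates a bug where one field is skipped.
void Serialize(lbp_serializer* LbpSerializer, game_score_state* Datum, bool ForgetP1Score)
{
    if (!ForgetP1Score)
    {
        Serialize(LbpSerializer, &Datum->P1Score);
    }
    CHECK_INTEGRITY(SV_INITIAL);
    Serialize(LbpSerializer, &Datum->P2Score);
    CHECK_INTEGRITY(SV_INITIAL);
}

// Hypothetical helper: writes a correct file, then reads it with or without the bug.
bool LoadsCleanly(bool ForgetP1Score)
{
    FILE* File = tmpfile();
    game_score_state State = {10, 8};

    lbp_serializer Writer = {SV_INITIAL, File, true, 0, false};
    Serialize(&Writer, &State, false);

    rewind(File);

    game_score_state Loaded = {};
    lbp_serializer Reader = {SV_INITIAL, File, false, 0, false};
    Serialize(&Reader, &Loaded, ForgetP1Score);

    fclose(File);
    return !Reader.Mismatch;
}
```

The correct reader finds the counters 0 and 1 exactly where it expects them. The buggy reader lands on P1Score (10) where the first counter should be, so the mismatch shows up at the very first check.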

That's about it! Phew! If you want to cement these concepts further in your head, you can see most of them demonstrated in this piece of sample code, complete with five revisions, which demonstrates how LBP Method code changes over time. In the sample code, I begin with a very dumb imitation of a Play-By-Email Pong game (SV_Scores), but then I revise it. I make it so that players can foul (SV_Fouls), I make it so that more than two players can be in a game (SV_ExtraPlayers), I change how the players are modeled in code (SV_AllPlayersInList), then finally I remove the Fouls field in favor of a simpler game state (SV_FoulsUntracked).

== BUT WHAT DOES THE LBP METHOD MEAN FOR ME? ==

There are a number of competing concerns that go into choosing a format and method for de/serializing versioned data. Among those Alex mentioned are:

Backwards compatibility (being able to open file_v1.dat with program_v3.exe)
Forwards compatibility (being able to open file_v3.dat with program_v1.exe)
Self-description (tools can parse the data without special knowledge)
File size
Serialization speed
Reliability
Flexibility

And a few others I thought of are:

Human readability
Platform-independence
Complexity for users
Degree of abstraction
Standards compliance

Let's pretend I'm leading an MM-sized team that is making a console-exclusive game exactly like LittleBigPlanet, called GiantTinyWorld. Which of these concerns do I have, and how well does the LBP Method address them?

To begin, backwards compatibility is vital. My team does not want users or developers to have to retool old content to make it work with a new version. And it's something the LBP Method is great at. Check.

The LBP Method has no forwards compatibility, but I don't think this is a sticking point. It's fairly untroublesome to make sure that developers making content for GiantTinyWorld always have the latest build on their machines. The same is true for GiantTinyWorld's playerbase. Game consoles' online services generally prevent users from downloading the newest content for a game without first updating the software.

The LBP Method creates files which are neither self-describing nor human readable. This means that post-apocalyptic wasteland archeologists will not be able to extract GiantTinyWorld data should they find it on a radioactive thumb drive. But my development team will rarely need to open GiantTinyWorld data with anything other than our own internal tools. It's also fairly painless to code a one-off JSON exporter.

Since the LBP Method writes tagless binary files, its uncompressed size is quite small. And because the deserialization process involves no tag or metadata lookups, the LBP Method is fast -- certainly in comparison to a console's hard disk and network card. Both of these are big wins for GiantTinyWorld, because then it doesn't have long load times.

The LBP Method facilitates the creation of reliable code -- very important for GiantTinyWorld's audience, which would hate to see their Sonic The Hedgehog tributes destroyed by bugs. It's reliable because version differences are accounted for every time data is deserialized. It's much better than having dozens of isolated, specific conversion functions! And with a few well-placed CHECK_INTEGRITYs, any mistakes on the part of the programmer will cause the de/serialization procedure to fail early and hard.

Flexibility is perhaps the LBP Method's greatest strength. It allows conversions between old and new formats to be performed by unrestricted, procedural C code. One thing that Media Molecule often does with the LBP Method is replacing two or three named bool fields with an enum or bitfield. It's also easy to change the scale of values. For example, you can make a field represent meters instead of kilometers. Operations like these tend to be difficult to perform when using tagged serialization systems like protobuf. This flexibility is very important for the development of GiantTinyWorld -- it means we can continually revise it into the best product it can be.
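
As an illustration of the unit change, here's a sketch that turns a kilometers field into a meters field at load time. The race_state struct, the field names, the SV_METERS revision, and the LoadDistanceMeters helper are all hypothetical. This is my guess at how such a conversion could look, not Media Molecule's code.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum : int32_t { SV_INITIAL = 1, SV_METERS };

struct lbp_serializer { int32_t DataVersion; FILE* FilePtr; bool IsWriting; };

struct race_state
{
    // int32_t DistanceKilometers; Added with SV_INITIAL, removed with SV_METERS
    int32_t DistanceMeters; // Added with SV_METERS
};

void Serialize(lbp_serializer* LbpSerializer, int32_t* Datum)
{
    size_t Done = LbpSerializer->IsWriting
        ? fwrite(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr)
        : fread(Datum, sizeof(int32_t), 1, LbpSerializer->FilePtr);
    assert(Done == 1);
    (void)Done;
}

#define ADD(_fieldAdded, _fieldName) \
    if (LbpSerializer->DataVersion >= (_fieldAdded)) \
    { \
        Serialize(LbpSerializer, &(Datum->_fieldName)); \
    }

#define REM(_fieldAdded, _fieldRemoved, _type, _fieldName, _defaultValue) \
    _type _fieldName = (_defaultValue); \
    if (LbpSerializer->DataVersion >= (_fieldAdded) && LbpSerializer->DataVersion < (_fieldRemoved)) \
    { \
        Serialize(LbpSerializer, &(_fieldName)); \
    }

void Serialize(lbp_serializer* LbpSerializer, race_state* Datum)
{
    REM(SV_INITIAL, SV_METERS, int32_t, DistanceKilometers, 0);
    ADD(SV_METERS, DistanceMeters);

    // Old files only carry kilometers, so convert them with ordinary code.
    if (LbpSerializer->DataVersion < SV_METERS)
    {
        Datum->DistanceMeters = DistanceKilometers * 1000;
    }
}

// Hypothetical helper: hand-writes the single stored value, then loads it
// as if the file had been saved at FileVersion.
int32_t LoadDistanceMeters(int32_t FileVersion, int32_t StoredValue)
{
    FILE* File = tmpfile();

    lbp_serializer Writer = {FileVersion, File, true};
    Serialize(&Writer, &StoredValue);

    rewind(File);

    race_state Result = {};
    lbp_serializer Reader = {FileVersion, File, false};
    Serialize(&Reader, &Result);

    fclose(File);
    return Result.DistanceMeters;
}
```

An SV_INITIAL file that stored 5 (kilometers) and an SV_METERS file that stored 5000 (meters) both load as 5000 meters.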

The LBP Method isn't platform-independent, though this issue likely isn't hard to fix. Regardless, GiantTinyWorld is console-exclusive.

Finally, only experienced in-house developers will be working with GiantTinyWorld's code. This means that it is sufficiently simple and abstract, and does not need to comply with any standards.

This analysis shows that the LBP Method is a good fit for games like LittleBigPlanet. What's interesting is that Alex Evans tells me this method was one of the very first ones he tried, and not the last. After LittleBigPlanet 1, Media Molecule tried to find some alternative solutions to work around a few of the LBP Method's shortcomings. Some solutions, such as branchable version numbers and per-structure revisions, ended up increasing the complexity for comparatively little added capability. Others, such as self-descriptive serialization, reduced complexity but at too high a cost in terms of flexibility. It's serendipitous that the LBP Method, which, according to Alex, "you could code in an afternoon" (close -- it took me a whole day), ended up being the one they'd use consistently for over a decade.

So far in Swedish Cubes' development, I have been using a C# version of the LBP Method, and it's been working well. Will I use it for the finished product? I'm not sure. I'll have a concern that Media Molecule has not yet had -- licensed developers interacting with my code. But the elegance, performance, reliability, handmade-ness, and proven track record of the LBP Method make me want to use it, or at least something not far from it. I strongly recommend that you try it out for your project, too.

Follow my twitter!

EDIT July 2018: Another upside to the "LBP Method" is that it allows you to very easily make simple dummy functions.
