Space exploration, you say?
recently wrote space exploration
themed blog posts, and as such I couldn't really stay behind. Thankfully I'd already planned to do a write up on a new feature.
As you may have noticed, a long-awaited feature has recently emerged: the indication of whether or not a thread contains new posts. You might think this a trivial feature to add, but you'd be mistaken. Let me take you through the considerations, (some of) the possible ways to do it, and the way I ended up implementing it and why I did it this way.
So, without much further ado I explore
the problem space
in this Behind the Scenes post, hopefully the first of many.
A word of warning, though: it's 14.5kB worth of text and there's no illustrations, yet.
First, let's state the problem so that we might explore its space.
- There are some 1800 Handmade.Network members (since I last checked)
- We have 42 forums and 37 blogs
- Which have between them approximately 1500 forum threads and blog posts
- Those contain on the order of 8500 posts between them
These members will want to know when a new thread appeared or a previously read thread gained new entries, and preferably in such a way that if they move from one device to another, this read status is synchronised. In other words we'll want to keep this state on the server, not on the client.
You could store this information in a number of ways, each with their pros and cons.
What we need to store depends on the method we want to use. We'll get to the how of it.
Here's what we need for two naive ways to track the information we need:
- You could keep track of a timestamp or last post ID per thread per member, or around 2.7 million pieces of information in the current situation
- You could keep track of every thread (or blog post) a member has ever read and not store information for unread threads
Case 2 looks attractive at first blush, but what happens once a member has either read every thread or hits the 'mark all threads as read' button?
Right, it devolves into case 1 as time goes on. At best it's of temporary benefit. At worst it might fool you into believing you've implemented an efficient scalable solution, right up until the point where a good number of members have marked everything as read.
And even then, it's not that efficient or scalable to begin with. Why?
Well, consider what happens when someone, say you, wants to mark everything as read.
In each of these scenarios the site would have to iterate over all 1500-odd threads, get the date (or ID) of its last post and then update or create a row for each of them that pairs the member with that thread and the timestamp (or ID).
That's a lot of work that we'd obviously like to avoid if we can help it. Even more so because we'd like to see all those numbers mentioned under 'landscape' grow without it making the site burst at its seams.
So, how could we minimise the impact of this feature?
One way would be to keep track of the read state on the client side, but apart from the state not being shared between devices, that's just moving the problem around. Their browser would have to store all this data somehow, in either a massive cookie or in local storage.
Okay, you might say, but what if we also store a piece of data that says "everything before this timestamp is considered to have been read", and the only threads we track the state of are those with posts newer than that? Let's call it a watermark, below which everything is considered to have been read (or of no interest).
Now we're getting somewhere! But let's keep doing that server side and let's see what else might be useful to minimise the amount of data we need to consider.
What happens if you want to mark just a particular forum as read but still want to browse the other forums and see if something new has appeared since your last perusal?
I think you see that a global watermark won't cut it, because then we'd still have to update or create an entry for each of the threads in that particular forum. Indeed, extrapolate a bit further and you'll see that if threads are marked as read forum by forum, the global watermark doesn't gain you much until you've marked the very last forum as read. Meanwhile we've more or less devolved to the old scenarios again.
See, it's not as straightforward a feature as you might have assumed out of hand. Or rather, it's a simple enough feature, but if implemented without considering all these scenarios, you'll find that simple solution blow up in your face.
So we've seen that this global watermark doesn't help us very much. We can't even unconditionally move it if we mark one forum as read because one of the other forums might have older unread posts. If only we knew what the oldest unread post in each forum was, then we could move this global watermark safely to whatever the oldest was between them…
So a per-forum/blog watermark, then?
Indeed! You know what's funny? If we keep track of per-forum watermarks for each member, we no longer even need this global watermark. Good riddance, it didn't appreciably constrain the problem space to begin with.
So, how much do we have to keep track of now?
Now we only need to keep track of the threads in each forum or blog that has a last post newer than its category's watermark. As an aside, internally we call forums and blogs categories and its type determines where on the site you find it and how it's presented to you. More on that in another Behind-the-Scenes post another time, perhaps.
If someone marks the entire site as read, we only need to update or create 80 rows in the database, which is a 20x improvement over having to do the same for each thread. Better, the number of forums isn't likely to scale as explosively as the number of threads will, so the benefit will only grow as time goes on.
But what about those per-thread markers?
Those are still around of course. However, once you mark a category as read, all those thread markers can be deleted. Sure, we could keep them around in case an old thread sees some action, but as newer threads see more replies than older threads, that's just baggage in most of the cases.
Wait... Doesn't that mean we have to iterate over each thread in a category, see if we have an entry for it for this member and then delete that entry in our marker table? Not quite. This is why the thread marker table has a somewhat redundant field that says what forum or blog the thread belongs to. We don't need it most of the time, but it makes deleting all thread markers for a category a piece of cake. We don't have to iterate over the threads themselves to mark everything as read.
Marking everything on the site as read is even easier. In addition to updating those category markers, we can now delete all of that member's thread markers. (And we already paired each member to a thread in these markers because you won't have necessarily last read something when I have, it wasn't even an extra field we added for convenience).
There's two more optimisations to be had.
Optimisation: Conditional thread marking
Since we're in the business of not having the site do more work than is strictly necessary, let's see if we can find a reason not to make an entry when a thread is read. Heck, let's see if we can find several such reasons not to do any work.
- According to the category watermark, this thread's last post is old.
- According to the thread watermark, this thread's last post is old.
- The member reading this thread is also the author of its last post.
In case 1, we've already read all there is to read. Until such time that a newer post appears, we don't need to record this last post's timestamp as the newest thing for this thread.
Case 2 is essentially the same as 1, except you've explicitly read this thread since the last time the category watermark moved. Nevertheless, the thread's last post's timestamp and the thread marker are in agreement, so there's no need to update it, let alone create the thing we've just compared.
As for case 3, I'm no Sherlock Holmes, but I think we can consider you having penned the post, you've also read it. If not, perhaps we should hold you to a higher standard… ;-)
If all goes well and conversation flows back and forth on the forums (and blogs), case 3 won't hold long. But still, it saves an action. And if conversation does go back and forth and you're replying to someone else again, you'll again be its last author.
We really only need to know about new posts made by others, because when searching for the last unread post in a thread, we really only need to know about the first post after a given timestamp that wasn't authored by you. Thankfully, that's a very cheap thing to ask a database if you have a sane data model.
Optimisation: Coalescing the thread markers
But wait, there's more?
Indeed. We have this per-category watermark, don't we? Can't we exploit that thing some more?
Sure... how about when you browse a forum, apart from it showing you which threads have new entries or not, we see if we can't move the watermark a bit.
To do that all we'd need to know about is threads whose last post is older than the category watermark, so under normal circumstances there's not a lot of data to consider. But, consider them we shall, in reverse chronological order, even.
Of the threads under consideration, some will have their own marker and some won't. What we want to know is if the thread immediately 'adjacent' the category watermark has one. If not, we can bail out early.
If it has one, we can move the watermark to its last read date and delete its marker. And again we loop through the now shorter list as long as we can keep coalescing like this, and early out when we can't.
What all of this means when put together
- We store the read status of threads in a sparse manner
- This sparse storage cleans itself up, removing redundant entries where it can by moving the category watermark
And not altogether unimportant: Even in the very unlikely worst case scenario, it behaves no worse than the original common case in the naive scenarios. Indeed, it's more well-behaved in the worst case.
Where it really shines is the practical case where someone reads the threads of interest to them in one sitting, leaves a few as unread for the next sitting and marks all threads in forums of lesser interest to them as read from time to time.
Put differently: If you use the forums as most people read forums, or alternatively only scan the titles, read those of interest and then mark the entire site as read, a.k.a. the common case(s), the impact of all of this bookkeeping is essentially negligible.
But why didn't you such and so?
For one because 'such and so' isn't a very well defined thing to do and you'd need to clarify what you meant by that.
For another, the above is very efficient and has several benefits to some of these suches and soes.
Consider an alternative:
We could have just serialised all of that information in a blob per member.
Sure, we could've done that, but that doesn't consider the serialisation cost. I'll give you a hint: It wouldn't outperform the well-indexed queries used by the above, bloats memory usage for no benefit and has another important drawback.
If in the future we want to move a thread to another forum, all it would take in our scenario: 1 query to update all the thread markers for that thread, regardless of whom they are for, to update the category field to point to the new forum.
I think you can see where I'm going with this: We'd have to iterate over all members' 'read marker' blobs, deserialise them, update the field in question, serialise it and update the blob. That's a heck of a lot more work.
You might counter that this isn't something that happens every day, moving a thread. Not yet, no. But like mentioned at the start, we hope to grow this community, and we'd prefer not to have 'stop the world' scenarios happening a few times a day where the site slows down because every member's 'read blob' has to be updated when a thread moves.
And before you say, "But you could store just the thread id without the category, then there would be no need to update the category when you move the thread." Certainly, but then you're doing more work again when marking a whole category as read. Where previously you could just delete that entire branch from your tree in one go, now you'd have to iterate over all threads and match them up against the data from the blob.
Can you say amplification? You'd still be grinding the site to a halt.
And yes, there are ways to mitigate it by keeping the pertinent information for all threads in memory at all times, that's a few MB worth of info after all. Guess what? That applies to the scheme I've implemented as well. It too can benefit in that way and perform even fewer queries, but it still doesn't have any of these downsides. ;-)
I know :) Web technology may not be my favourite thing in the world, but I'm still going to implement data models and algorithms that work well in spite of this domain's problems. I'm somewhat strangely even looking forward to version 2 of this website, to be made public in around a year from now, hopefully. We'll see how much more of a performance non-issue we can turn this feature in when it's implemented in C with all data kept in memory as much as possible. :D
Can you use the description above to implement your own version of this algorithm in other forum software, or use it in another problem domain altogether? Sure, why not. Software patents are patently obvious most of the time and patently ridiculous always*.
If you've made this far, you've earned the right to a yourname++ on HMN IRC. Just ping me there and you'll get a free one-time imaginary internet point boost. Also, I'm impressed. What started as a medium long BTS post turned into quite a feature on this feature. :D
If you did and want to read more of these Behind the Scenes posts, feel free to leave a comment. You're still welcome to leave a comment even if you thought once was quite enough, thanks.
Until next time, perchance!
* This is my personal opinion, it does not necessarily represent the position of the Handmade Dev organisation. Not that I expect it to be materially different.