This is the second recording in a series of recordings on Mosaic. Today is the 4th of June, 2025. I'm Mike Dilger. In this recording I'm going to talk about serialization of binary data, as well as JSON, with respect to Mosaic. This is quite low-level information intended for developers.

When we are defining a network protocol with binary network packets, we need to be precise about exactly how the data in those packets is arranged. Data in computer memory is not organized the same way on every computer. Indeed, it's not even the same every time you run the program. There are at least four ways that data in memory looks different from data in a network packet or on disk: the first is alignment, the second is the sizes of data types, the third is the byte order of integers, and the fourth is that some data is spread out across your virtual memory space and connected via pointers.

Let's start by talking about data alignment. Most programmers these days don't worry about low-level details like alignment; their compiler handles it for them. But in certain cases it pays to understand it and do it right. If you load data that is aligned, it loads as fast as the CPU was designed to load it. But if that data is not aligned, it may take twice as long to load: the CPU first loads the high bytes and then the low bytes, in two separate instructions.

When you create a struct in C or Rust, you specify fields of various sizes. Let's say the first field is a 16-bit integer and the second field is a 64-bit integer. If that structure is packed, then the 64-bit integer starts at byte 2, which is not aligned. Reading and writing that field is twice as costly as it needs to be. And so by default C and Rust don't pack structures; they align each field. So this hypothetical struct will actually have 6 bytes of padding after the first 16-bit integer field and before the second 64-bit integer, so that the 64-bit integer will be aligned. That padding is invisible to the programmer. You cannot access it and you can easily forget that it's there. In a social media protocol, you are going to be pushing data across the network. You certainly don't want to push raw data structures that have padding you didn't even know about.

Computer languages provide something called serialization and its inverse, deserialization. This is a process where data in memory is packed into a linear "serial" form in a standardized fashion so that every computer can unpack it and end up with the same data, despite the different alignments, different type sizes, and integer byte orders of those different computers. Generally the data is not aligned in the serialized packed format because it doesn't need to be. Serialization and deserialization have a performance cost: you have to allocate space, convert, and copy data for serialization, and again for deserialization.

So next comes the brilliant idea. What if you could access the data in its serialized form without actually deserializing it? Then you could avoid memory allocation and you could avoid copying. The parts you can't avoid are the conversions, the byte-order adjustments, and the lookups of offsets and lengths. But computers are very fast when it comes to flipping byte orders and adding offsets, much faster than they are at memory allocation. So in many cases it's actually faster to design and use a serialized data format that can be directly accessed.
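To make that padding visible, here is a minimal Rust sketch. It is not Mosaic code, just the hypothetical two-field struct from above, comparing the default aligned layout with a packed one:

```rust
use std::mem::{align_of, size_of};

// Default (aligned) layout: the compiler inserts 6 bytes of padding
// after `a` so that `b` lands on an 8-byte boundary.
#[repr(C)]
struct Aligned {
    a: u16,
    b: u64,
}

// Packed layout: no padding, but `b` starts at byte 2 and is misaligned.
#[repr(C, packed)]
struct Packed {
    a: u16,
    b: u64,
}

fn main() {
    assert_eq!(size_of::<Aligned>(), 16); // 2 + 6 bytes of padding + 8
    assert_eq!(align_of::<Aligned>(), 8);
    assert_eq!(size_of::<Packed>(), 10); // 2 + 8, no padding
    assert_eq!(align_of::<Packed>(), 1);
}
```

Those 6 invisible bytes are exactly the kind of thing you don't want to ship across the network by accident.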
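And here is a rough sketch of what accessing the serialized form directly looks like in practice: an offset lookup plus a byte-order conversion, with no allocation and no copying. The field name and offset are made up for illustration; they are not Mosaic's actual layout.

```rust
/// Read a 64-bit little-endian field straight out of a serialized buffer.
/// Hypothetical layout for illustration: an 8-byte header, then the field.
fn timestamp(serialized: &[u8]) -> Option<u64> {
    let bytes = serialized.get(8..16)?; // offset lookup, no allocation
    Some(u64::from_le_bytes(bytes.try_into().ok()?)) // byte-order conversion only
}

fn main() {
    // 8-byte header of zeros, then a value encoded little-endian.
    let mut buf = vec![0u8; 8];
    buf.extend_from_slice(&1_700_000_000u64.to_le_bytes());
    assert_eq!(timestamp(&buf), Some(1_700_000_000));
}
```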
In such a case, the serialized data format needs to have aligned data, it should use the most common integer byte ordering, it should be laid out so that the fewest lookups of lengths and offsets are needed, et cetera, all while not wasting space.

Now, while using the serialized data directly will usually be faster, it won't always be faster. The exception is when you need to access a field many times over, because then you are doing the conversion over and over. You're not doing memory allocation, you're not doing copying, but you are doing that conversion repeatedly. So in that case it might be faster to actually deserialize it first and pay the memory allocation and copying cost. Guess what? We don't have to choose. We can design such a directly accessible serialized data format, and then programmers can either access it directly or deserialize it, whatever works best for their use case.

And that is precisely what we are doing in Mosaic. We try not to waste space, but we keep things aligned. We use little-endian byte ordering for integers since most CPUs use that layout. And we provide a mosaic-core library that offers accessor functions directly into the serialized data, as well as serialization and deserialization into structures for cases where that would be better. Note that these serialized data structures are a pain in the ass to edit, almost impossible. But they're meant to store digitally signed data, which you cannot edit anyway. For creating Tags and creating Records, you supply the parts to a function that assembles and signs. Creating filters is similar.

In Rust there is a type called String. It is an owned, allocated vector of bytes guaranteed to be UTF-8 encoded. There is also a borrowed variant called &str; the ampersand is the borrow or reference operator, and str is the type. This pattern repeats throughout the standard library, for example with PathBuf and &Path. In mosaic-core we followed the same pattern. An &Record is borrowed, and an OwnedRecord is owned. Because the serialized data is operated on directly, the inside of an OwnedRecord is just a vector of bytes, and the inside of an &Record is just a slice of bytes. The same goes for &Tag and OwnedTag. Figuring out where the fields are inside this slice of bytes, or where the slice of bytes ends, is handled by the field accessor functions.

OK, so many of you will be thinking "What the fuck? Why not just use JSON?" Nostr uses JSON. I think ActivityPub uses JSON. AT Protocol uses JSON or XML or some crap like that. Why didn't I just do that? Well, I guess it is because I know too much. I know what a JSON parser has to do in order to turn JSON into a struct with field access. Serialization isn't too bad, but deserialization is very expensive. I wrote a parser for nostr that is almost an order of magnitude faster than a general JSON parser, getting its performance edge from knowing what nostr events look like ahead of time. But it still has to do a large and unknown number of tiny memory allocations to build that struct. The performance cost is very high and, in my humble opinion, that cost buys you very little. It's unnecessary and wasteful.

It gets worse when it comes to digital signing. JSON isn't really a data format, it is a programming language. Quotes and backslashes aren't actually in the data, they are programming syntax to help arrange the data that lies in and around those characters. Object fields can be in any order and have the same meaning.
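Returning to the owned/borrowed split for a moment, here is a minimal sketch of that pattern over serialized bytes. The names, the field offset, and the accessor are all made up for illustration; the real mosaic-core types will differ, but the shape is the same: the borrowed type is literally a view over the bytes, and the owned type just holds those same bytes in a Vec.

```rust
use std::ops::Deref;

/// Borrowed view over serialized bytes, in the spirit of &str / &Path / &Record.
/// (Illustrative only; mosaic-core's real types and layout will differ.)
#[repr(transparent)]
struct Rec([u8]);

impl Rec {
    /// Reinterpret a byte slice as a borrowed Rec: no copy, no allocation.
    fn from_bytes(bytes: &[u8]) -> &Rec {
        // SAFETY: Rec is a repr(transparent) wrapper around [u8],
        // so &[u8] and &Rec have identical layout.
        unsafe { &*(bytes as *const [u8] as *const Rec) }
    }

    /// Hypothetical accessor: a little-endian u64 stored at offset 8.
    /// A real accessor would also check lengths and validity.
    fn timestamp(&self) -> u64 {
        u64::from_le_bytes(self.0[8..16].try_into().unwrap())
    }
}

/// Owned variant, in the spirit of String / PathBuf / OwnedRecord:
/// just a vector holding the same serialized bytes.
struct OwnedRec(Vec<u8>);

impl Deref for OwnedRec {
    type Target = Rec;
    fn deref(&self) -> &Rec {
        Rec::from_bytes(&self.0)
    }
}

fn main() {
    let mut bytes = vec![0u8; 8];
    bytes.extend_from_slice(&42u64.to_le_bytes());
    let owned = OwnedRec(bytes);
    assert_eq!(owned.timestamp(), 42); // deref to &Rec, then direct field access
}
```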
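As for JSON's flexibility, here is a tiny std-only sketch (no hashing library, just byte comparison) showing three byte-for-byte different encodings of the same object. Since the raw text differs, any hash computed over it, and therefore any signature, differs too:

```rust
fn main() {
    // The same object, three different but equally valid JSON encodings:
    let a = r#"{"name":"mike","id":7}"#;      // one field order
    let b = r#"{"id":7,"name":"mike"}"#;      // reordered fields
    let c = r#"{"name":"\u006dike","id":7}"#; // unnecessary unicode escape
    // All three mean the same thing, but none share the same bytes,
    // so a hash over the raw text would differ for each.
    assert_ne!(a.as_bytes(), b.as_bytes());
    assert_ne!(a.as_bytes(), c.as_bytes());
    assert_ne!(b.as_bytes(), c.as_bytes());
}
```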
The exact same data can be represented in a multitude of ways. Characters can be escaped that don't need to be escaped, using single-letter escapes or Unicode escapes. And each of these representations will hash to a different value, causing signature verification to fail unless signatures are computed over data that is first normalized somehow. That normalization procedure also has a cost.

Lots of people just assume that computers are fast enough. Around 2000 I remember the shift: software developers worldwide were repeating the belief that computers would just get faster and faster and developers should stop trying to optimize things. But even today we have performance issues. Write a loop inside of a loop inside of a loop and soon you will change your mind. It is still a good idea to eke out all the performance we can, especially since the cost of doing so is very low. Well, there is a cost in development time. But I'm paying that cost by designing the spec and writing mosaic-core. Every other developer can just take the win without paying the cost that I've already paid.

Let me be clear: there is nothing stopping anybody from converting from binary to JSON and back. In fact I intend to put that functionality into mosaic-core. At your layer of software development you can deal with 100% JSON and 0% binary if you wish. I encourage you to use JSON for display output or log files or input. Just at the low layer of the protocol itself, the hot path, let's do it the fast way.

There was a famous Linux-related debate about systemd's log files. Lennart Poettering made them binary, accessible with journalctl. Linus Torvalds wanted them to just be plain text files as they always had been. But Lennart kept them as binary files. Binary files pack much tighter and use less space, and you can always just run a journalctl command to convert from binary to text whenever you want... and then use awk and sed and whatever you want on that text. So why not use binary? Many people sided with Linus, probably because he is more famous, but I think Lennart was right. Log files have to be appended to constantly, so you can't just use any compression algorithm, and a compression scheme that understands the data is going to be more optimal. Despite the backlash, everything was just fine with binary data.

OK, well that's it. Short episode this time. Cheers.