Everything is serialization

Serialization pervades programming. Whether you are writing JSON, Tree-Buf, repr(C), or even x86 assembly, the result of your program is always serialized binary. By thinking about serialization actively, we can write faster, more straightforward programs.

We will talk a lot about trade-offs. We will delve into information theory and compression. At every step, we will show how Rust gives you complete control over the process.

Finally, we’ll build upon this knowledge to understand the design of Tree-Buf: a Rust serialization library for real-world data sets.

Presented by Zac Burns

  • Recordings

    Transcript

    Everything is serialization

    Bard:
    Zac Burns wants to serialize
    some ideas that we all won't despise
    into one talk to make
    us see what it will take
    to make code easier to realize

    Zac:
    Your computer by itself doesn't do anything of value. Here is a picture of the inside of a computer. You can't tell from the picture what the computer is doing, if it is doing anything at all. For the computer to be useful it must be a component connected to a larger system. The system that the computer is a part of includes other components attached to the computer components like mice, keyboards, speakers, screens, and network cards send data to and from the computer through wires like in this picture. Because of these wires, and the physical separation of the system's components, the data which drives each component must be and well specified agreed upon formats. At this level of abstraction of the system, we usually think of the data in terms of serialization. Serialization at this level includes many well known formats MP3, JSON, and Http among others.

    Here is a picture of the inside of a computer: a sub-system comprising several components, each driven by data sent over the wires that connect the components. The CPU, GPU, RAM, and hard drive are all data-driven components and sub-systems. We don't always think of the things happening at this level of abstraction in terms of serialization, but it is serialization just the same. Here too, the physical separation of each component, aside from the wires connecting them, necessitates that the data which drives each component be serialized in well-specified, agreed-upon formats. File system formats like NTFS are serialized by the CPU and sent to the hard drive. Data is buffered and sent to the GPU for drawing. When data is fetched from RAM, the fetch instruction is bytes in a serialization format, sent over wires between the CPU and RAM. Instructions are serialized code coming from assembly, coming from serialized MIR, coming from serialized Rust source files, coming from serialized keyboard presses, and so on. If you look into any of these sub-systems, RAM, CPU, GPU, network card, you will find the exact same setup: data-driven components connected by wires. We can take the CPU and see what it looks like on the inside.

    Here is a CPU, with components for instruction decoding, branch prediction, caches, and schedulers. Each component is data-driven and transfers information on wires in purpose-built formats. At each level of abstraction of the computer system, you will find components driven by data, sent over wires in serialization formats. Unsurprisingly, the design of the serialization format, which is the design of how the components interact, has a large effect on the system as a whole. Maybe you feel this characterization of the computer as driven by serialization is reductionist. Maybe you prefer to think in abstractions. I would like to point out that abstractions cannot be implemented without the use of serialization.

    Perhaps the greatest abstraction of all time is the function call. What happens when you call a function? The first thing that happens is that the arguments to the function are serialized to the stack. The order and layout of the arguments, or the file format if you will, is called the calling convention. Maybe you don't like to think in terms of implementation details. Perhaps you think in high-level tasks, like serving an HTTP request. That task is thought of in terms of parsing data, in this case a URL, which is a standardized serialization having a path, followed by a transform, and lastly a serialization, in this case an HTTP response. There are two serialization steps there. If we were to look at what the transform step entails, we would see it breaks down further into a series of parse, transform, and serialize steps. It's serialization all the way down. Your database is just a giant serialized file after all, as are the requests for the data in the database. In fact, all of the state of your entire running program is stored in memory in a complex format comprised of other, nested, simpler serialization formats.
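
    As a rough illustration of that parse, transform, serialize shape, here is a minimal Rust sketch. The Request type and the helper names are invented for this example, not taken from the talk.

        // Parse -> transform -> serialize: the shape of serving an HTTP request.
        // All names here are hypothetical, for illustration only.

        struct Request {
            path: String,
        }

        // Parse: the request line and URL are themselves standardized serializations.
        fn parse(raw: &str) -> Request {
            let path = raw
                .trim_start_matches("GET ")
                .split(' ')
                .next()
                .unwrap_or("/")
                .to_string();
            Request { path }
        }

        // Transform: in a real server this step breaks down into further
        // parse/transform/serialize steps of its own.
        fn transform(request: &Request) -> String {
            format!("Hello from {}", request.path)
        }

        // Serialize: an HTTP response is just another agreed-upon byte format.
        fn serialize(body: &str) -> Vec<u8> {
            format!(
                "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
                body.len(),
                body
            )
            .into_bytes()
        }

        fn handle(raw: &str) -> Vec<u8> {
            serialize(&transform(&parse(raw)))
        }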

    The first point I am trying to make today is that at every level, serialization doesn't just underlie everything we do but is in some sense the means and ends of all programs. As such, serialization should be at the front of your mind when engineering, and yet, despite this, how to represent data as bytes is not a hot topic in programming circles. There is the occasional flame war about whether it is better to have a human-readable format like JSON or a high-performance format with a schema like Protobuf, but the tradeoff space goes much deeper. My experience has been that the choice of representation of the data is an essential factor in determining the system's performance, the engineering effort required to produce the system, and the system's capabilities as a whole. The reasons for this are easy to overlook, so I will go over each. The first statement was that the data representation is an essential factor in determining the system's performance. This comes down to the limits of each component in the system to produce and consume data, and the limits of the wires connecting these components. If the serialization format has low entropy, the throughput of information flowing through the system is limited by the wires connecting components. Put another way, bloat in the representation of the data throttles the throughput of information. Also, data dependencies in the serialization format pause the flow of data through the system, incurring the latency cost of the wires. The system operates at peak efficiency only when the communication between components is saturated with efficiently represented data.

    The second point is that data representation is an essential factor in determining the engineering effort required to produce the system. Given input and output data representations and an algorithm sitting in between, a change to either representation necessitates a corresponding change to the algorithm; note that the inverse is not always true. A change in the algorithm does not necessarily require a change to the data. The algorithms available to us, and their limitations, including an algorithm's minimum possible complexity, are determined by the characteristics and representations of the data. The third point was that data representation is an essential factor in determining the system's capabilities as a whole. Given a representation size limit and a time limit, the set of inputs and calculated outputs expressible within those bounds is finite. That finite set of inputs and outputs is the total of the capabilities of the system. There is always a size limit, even when that limit is bound primarily by throughput and time. I would like to drive these points home with a series of case studies. We will look at some properties inherent to the data representations used by specific serialization formats and see how the formats either help us solve a problem or get in the way. In each example, we will also get to see how Rust gives you best-in-class tools for manipulating data across any representation. The first example will be parsing GraphQL, the second vertex buffers, and the third the use of compression in the Tree-Buf format.

    First, GraphQL. Serialization formats tend to reflect the architectures of the systems that use them. Our computer systems are comprised of many nested formats. For example, inside a TCP packet, a serialization format, you may find part of an HTTP request. The HTTP request comprises multiple serialization formats: the HTTP headers nest further serialization formats, and then there is the payload, whose format is specified by the headers. The payload may nest Unicode, which may nest GraphQL, which itself nests many different subformats as defined by the spec. If you find a string in the GraphQL, it may nest further still. The nesting reflects the system's architecture, because many layers exist to address concerns that manifest at that particular layer of abstraction of the computer. Because nesting is the natural tendency of serialization, we need formats that allow us to nest data efficiently. We also need tools for parsing and manipulating nested data. Rust gives you these tools in abundance. You need to view slices of strings and byte arrays safely and without copying. Interpreting bytes as another type, like a string or integer, is also important. Safe, mutable, appendable string and binary types allow us to progressively push serialized data from each format into the same buffer, rather than serializing into separate buffers and then copying each buffer into the nesting format above. Control over memory and safely passing data around is the name of the game.

    A surprising number of languages don't meet these requirements. Rust is the only memory-safe language that I am aware of that does. And what Rust gives you is much more. Here is a type from the graphql-parser crate. It is an enum named Value, containing all the different kinds of values in a GraphQL query, like numbers and objects and so on. Value is generic over the kind of text to parse into. One type that implements the Text trait is String, so you can parse a GraphQL query with String as the text type, and because the Value will own its data it allows you to manipulate the GraphQL and write it back out. That capability comes with a tradeoff. The performance will be about as bad as those other garbage-collected languages, because of all the extra allocating and copying that necessarily entails. &str also implements Text, so you could parse the GraphQL in a read-only mode that references the underlying text that the GraphQL was parsed from. With that, you get the best performance possible by avoiding allocations and copies, but you lose out on the ability to manipulate the data. In some cases that's OK. Rust takes this up a notch, because there is a third type from the standard library that implements Text. This type is Cow<str>. With this safe and convenient type, enabled by our friend and ally the borrow checker, we can parse the GraphQL in such a way that all of the text efficiently refers to the source except just the parts that you manipulate, and it is all specified at the call site. This is the kind of pleasantry I have come to expect from Rust dependencies. If you want to change some of the GraphQL text you can do so efficiently and safely with this type. Almost.
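
    To make the tradeoff concrete, here is a simplified sketch of the pattern, not the real graphql-parser definitions: a value enum generic over its text type, so the caller chooses owned, borrowed, or copy-on-write text at the call site.

        use std::borrow::Cow;

        // A simplified stand-in for a parser's value type, generic over text.
        enum Value<T> {
            Variable(T),
            Enum(T),
            Int(i64),
            // As discussed next, the real crate's string variant holds an owned
            // String rather than being generic.
            String(String),
        }

        // Owned text: allocate and copy, but freely mutate and write back out.
        fn owned_example() -> Value<String> {
            Value::Enum("ACTIVE".to_string())
        }

        // Borrowed text: zero-copy views into the source, but read-only.
        fn borrowed_example(source: &str) -> Value<&str> {
            Value::Enum(source)
        }

        // Cow: borrow everything by default, allocate only the parts you change.
        fn cow_example(source: &str) -> Value<Cow<'_, str>> {
            let mut text: Cow<str> = Cow::Borrowed(source);
            text.to_mut().push_str("_EDITED"); // this one value becomes owned
            Value::Enum(text)
        }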

    I say almost because there is a fundamental limitation of GraphQL that no amount of Rust features or library APIs could overcome. Looking at the list of different values here, we see that the variants for variables, enums, and objects are generic over text. Ironically, the string variant is not. The String variant just contains the owned String type, requiring allocating and copying. What's going on here? The issue is in the way that GraphQL nests its serialization formats. The GraphQL string value is Unicode, but the way that GraphQL embeds strings is by putting quotes around them. With this design choice, any quotes in the string must be escaped, which inserts new data interspersed with the original data, and this comes with consequences.

    One, when encoding a value the length is not known up front and may increase, which means you can't resize the buffer once up front but instead must continually check the buffer size while encoding the value, or over-allocate by twice as much. Two, when reading GraphQL it is impossible to refer to the string data in place, because it needs to go through a parse step to remove the escape characters. This problem compounds if you want to nest a byte array containing another serialization format in GraphQL. There is no support for embedding bytes directly in GraphQL, so they must be encoded with base64 or something similar. That means three encode steps are necessary to nest another format: encoding the data as bytes, encoding that as a string, and finally re-encoding the escaped string. That may compound even further if you want to store GraphQL in another format. It is common to store a GraphQL query as a string embedded in JSON alongside the GraphQL variables. JSON strings are also quoted strings, meaning the same data goes through another allocation and decode step. It is common to log the JSON. Another layer, another encode step. Now, if we want to get that binary data back out of the logs, we are just allocating and decoding the same data over and over, up through each layer, for every field. It doesn't have to be this way.

    One alternative method of storing strings and other variable-length data is to prefix the data with its length. Not doing this is a familiar mistake, one we have been making ever since null-terminated strings in C. The difference between the two can be the difference between decoding being a major bottleneck or being instant. No amount of engineering effort spent on optimizing the pipeline that consumes the data can improve the situation, because the cost is in the representation of the data. You have to design the representation differently to overcome this. I am not saying to avoid GraphQL. I use GraphQL, and we are all on the same team. I mention GraphQL in this example because using a standard format to illustrate the problem is easier than inventing one for this cautionary tale. When you go out and design your own formats, consider designing with efficient nesting in mind.
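
    Here is a minimal sketch of the length-prefixed alternative, assuming a little-endian u32 length prefix. Because the length is known before the payload, the writer reserves space once and the reader can borrow the string bytes in place, with no escaping or unescaping pass.

        // Write a length-prefixed string: one reservation, payload copied verbatim.
        fn write_str(buffer: &mut Vec<u8>, s: &str) {
            let len = s.len() as u32;
            buffer.reserve(4 + s.len());
            buffer.extend_from_slice(&len.to_le_bytes());
            buffer.extend_from_slice(s.as_bytes()); // never escaped
        }

        // Read a length-prefixed string as a zero-copy view, returning the rest
        // of the input so further fields can be read after it.
        fn read_str(input: &[u8]) -> Option<(&str, &[u8])> {
            if input.len() < 4 {
                return None;
            }
            let len = u32::from_le_bytes(input[..4].try_into().ok()?) as usize;
            let rest = &input[4..];
            if rest.len() < len {
                return None;
            }
            let (payload, rest) = rest.split_at(len);
            Some((std::str::from_utf8(payload).ok()?, rest))
        }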

    Let's look at one more example of how we can build capabilities into a serialization format, and how Rust works with us to take advantage of those capabilities. For this case study, we are going to be sending some data to the GPU. A GPU is driven by data sent to it in a serialization format we will call vertex buffers, which contain data like the positions of the points that make up polygons, colors, and material information needed for rendering. It comes in two parts: the first describes the format, and the second part is a contiguous region of memory containing the structs in an array. Here is a vertex buffer. The top portion is a description, including the names X, Y, and Z for a vertex position and r, g, and b color channels. The bottom part depicts the data, with three f32 slots, three u8 slots, and a blank slot for padding, making it all line up. These slots repeat over and over again, taking up the same amount of space each time. There is a good reason that the GPU receives data in fixed-size structs packed in contiguous arrays. The latest NVIDIA cards have a staggering 10,496 CUDA cores, and that's not even counting tensor cores. Here are 10,496 boxes. It is a lot. I am even in the way of some of these.

    If you want to break up data into batches for parallelism, the most straightforward way to do that is to have fixed-size structs in contiguous arrays. You know where any arbitrary piece of data lives, and you can break it up into any desired batch size. The format reflects the architecture of the system. Contrast that with sending the data to the GPU in, say, JSON. With JSON, the interpretation of every single byte in the data depends on every preceding byte. An item's length is unknown until you search for and find a token indicating the end of that item, often a comma or a closing bracket. If we graphed the data dependencies of a JSON document, it would form a continuous chain: the second byte depending on the first, the third depending on the previous two, and so on until the very last byte of the document. Consider a string in JSON. Is it a key or a value? It depends on whether it is inside an object. If I hid the values of the preceding bytes in the document, it would be impossible to tell. The problem is that data dependencies limit parallelism. A JSON document must be processed sequentially, because that is intrinsic to the format, making JSON a non-starter for a GPU. The data dependencies limit parallelism and add complexity to the engineering that goes into writing a parser. It is the data dependencies that make writing a correct JSON parser a challenging engineering problem in the first place. If we were to graph the vertex buffer's data dependencies, the interpretation of each byte in the data depends only on the first few bytes in the description of the buffer. Aside from that, all bytes are independent. By representing the data as an array of fixed-size elements, we can process it independently and therefore in parallel. There are downsides to arrays of fixed-width elements. While we gain data independence, we lose the ability to use compression techniques that rely on variable-length encoding. This means you can use some kinds of lossy compression but not lossless compression. JSON can utilize both.
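
    A small sketch of why fixed-size elements make this trivial: with a known stride, element i always starts at i times the stride, so a byte buffer can be split into independent batches with no parsing at all. The stride of 16 matches the example layout above (three f32s, three u8s, one byte of padding).

        const STRIDE: usize = 16; // bytes per vertex in the example layout

        // Split the buffer into batches that can each be handed to a different
        // core; no batch needs to look outside itself to find element boundaries.
        // `batch_len` is the number of elements per batch and must be non-zero.
        fn batches<'a>(buffer: &'a [u8], batch_len: usize) -> impl Iterator<Item = &'a [u8]> + 'a {
            buffer.chunks(batch_len * STRIDE)
        }

        // Random access is a multiplication, not a scan for delimiters.
        fn element(buffer: &[u8], i: usize) -> &[u8] {
            &buffer[i * STRIDE..(i + 1) * STRIDE]
        }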

    In JSON, a smaller number takes fewer bytes to represent than a larger number. Integers between 0 and 9 take one byte because they only need a single character. Numbers between 10 and 99 take two bytes, and so on. Here is a depiction of that. I wouldn't ever call JSON a compression format, but in principle the building blocks of lossless compression are there. There are better ways to do this, which we will return to later. The building blocks for lossy compression are present too, in the form of truncating floats, writing, say, just 3.14 instead of the full precision. The format used by vertex buffers has a different set of capabilities than JSON, and that can't be worked around when consuming the data. Those capabilities are inherent to the representations themselves. If you want different capabilities, you need to change the representation.

    OK. Having established that writing this data is the problem we are trying to solve, and the characteristics the serialization format must have because of the GPU's architecture, let's write a program to serialize the data. We will write the program in two languages, first in TypeScript and then in Rust. I don't do this to disparage TypeScript; parts of TypeScript are pretty neat. Rather, I want to show you the complexity a memory-managed language adds to a problem that wasn't there to start with. Without seeing the difference, it is hard to appreciate the power that Rust has over data. The function we will write is a stripped-down version of what you might need to write a single vertex to a vertex buffer for a game. Our vertex consists of only a position, with three 32-bit float coordinates, and a color, with three u8 channels. There are likely significantly more fields you would want to pack into a vertex in a real game, but this is good for illustration. Let's start with the TypeScript code.

    If you are thinking, whoa, that is too much code to put on a slide, that's the right reaction. It is also the point I am trying to make. I am going to describe the code, but don't worry about following it too closely. There is not going to be a quiz, and this is not a talk about TypeScript. Just listen enough to get a high-level feel for the concerns the code addresses and don't worry about the details. The first section defines our interfaces. Vertex, Position, and Color are unsurprising. We have this other interface, Buffer, which has a byte array and a count of how many items are written in the array. The next section is all about calculating offsets of where the data lives in the buffer. You could hard-code these, but the comment explaining what the magic numbers were would be just as long as the code anyway, so it might as well be code, since that makes it more likely to be correct and in sync with the rest of the function. Particularly cumbersome is the line that offsets the r field. The value is a byte, but the offset is the offset of the previous field plus one, times the size of an f32 in bytes. That mixing of types accounts for a discontinuity, because later we will use two different views over the same allocation, which is profoundly unsettling. We also have to calculate each element's size both in units of bytes and in units of floats, for similar reasons. The next thing we are going to do is possibly resize the buffer. This part is not interesting, but the code has to be there or the program will crash when the buffer runs out of space.

    Next, we set up the views and calculate the beginning position of the data we want to write within each view, relative to the data size in each view. These offsets are different even though they point to the same place. Lastly, we can finally copy the data from our vertex into the buffer, assuming all of the previous code is correct. Phew. Now, let's take a look at the Rust program. We define the structs. We leave out the interface for the buffer that held the byte array and the count. We aren't going to need that. Let's look at the function to write the vertex: buffer.push(vertex). That's it. Rust isn't hiding the fact that our data is represented as bytes under the hood, and it has given us control of the representation. We only needed to annotate the structs to move all the error-prone work into the compiler. Between JavaScript and Rust, which do you think would have better performance? The difference is starker than you might think, and not just because of the extra boilerplate code, or it being JavaScript, or the casts from float to int in the typed arrays, but mostly because of, again, data dependencies. This time they come in the form of pointer chases when accessing the properties of objects in TypeScript, for example element.position.x. It is slow because the serialization format used by the JavaScript runtime to represent objects introduces data dependencies. Part of what we mean by zero-cost abstractions is abstractions that don't introduce unnecessary serialization formats. Remember that, because the choice of serialization format is a deciding factor in how you can approach the problem, the advantage Rust gives us of being able to choose how data is represented carries forward into every problem, not just writing vertex buffers.
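
    The Rust code itself is only shown on a slide; here is a minimal, std-only sketch of what the push might look like. The explicit padding byte and little-endian writes are assumptions about the layout, and the real version presumably leans on derives or annotations rather than writing fields by hand.

        #[derive(Clone, Copy)]
        struct Position { x: f32, y: f32, z: f32 }

        #[derive(Clone, Copy)]
        struct Color { r: u8, g: u8, b: u8 }

        #[derive(Clone, Copy)]
        struct Vertex { position: Position, color: Color }

        struct VertexBuffer {
            bytes: Vec<u8>,
            count: usize,
        }

        impl VertexBuffer {
            fn push(&mut self, v: Vertex) {
                // Three f32s, three u8s, one padding byte: 16 bytes per element,
                // the fixed-size layout the GPU expects.
                self.bytes.extend_from_slice(&v.position.x.to_le_bytes());
                self.bytes.extend_from_slice(&v.position.y.to_le_bytes());
                self.bytes.extend_from_slice(&v.position.z.to_le_bytes());
                self.bytes.extend_from_slice(&[v.color.r, v.color.g, v.color.b, 0]);
                self.count += 1;
            }
        }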

    For the final case study, I would like to take some time to go into how a new, experimental serialization format called Tree-Buf represents data in a way that is amenable to fast compression. Before we talk about the format, we need to talk about the nature of datasets, and we will use the game of Go. We will use a game of Go as a baseline dataset for Tree-Buf. This movie depicts a game of Go. I haven't told you anything about how Go works, but by watching the movie you might pick up on some patterns in the data. The first pattern we might pick up on is that most of the moves are being made in certain areas of the board, and many are on the sides and corners. Very little is going on in the center. Another thing you might pick up on is that a lot of the time a move is adjacent to or near the previous move. The observation that nearby data is related, and that the full possibility space is not used, is not specific to Go. If you have an image, adjacent pixels are likely to be similar, and most images don't stray far from a limited color palette. We can extend this to a complex 2D polygon described by a series of points. Any given point is not likely to be randomly selected from all possible points with even probability. No, each point is very likely to be near the previous one. There are vast, vast regions of the possibility space that will not be selected at all. And so, what we observe is that datasets containing arrays are often predictable. This is the core of compression: predict what the data will be, and assign representations to the values so that if the prediction is accurate, few bits are needed to represent the value. If the prediction is wrong, you have to pay more bits. The quality of the prediction is the main factor determining how well the compression works. If you could accurately predict the content of every byte in a file, you could compress that file to zero bytes. No such prediction method exists.

    We have a dataset, a game of Go, and we want an algorithm to predict the next move in the game. To help us, we will visualize the raw data from the dataset. This scatter plot is a visual representation of the actual bytes of a Go game. The height of each dot is the value of the byte. If the game starts with a move at X coordinate 4 and Y coordinate 3, there would be a dot with height 4 followed by a dot with height 3, and so on. Our eyes can kind of pick up on some clustering of the dots. They don't appear random. That the data does not appear random is a good indication that some sort of compression is possible, but an algorithm to predict the value of a dot may not be apparent just by looking at the scatter plot.

    We can see that there is probably something there; we just don't know yet what it is. It is worth taking a moment to consider how a general-purpose algorithm like deflate, which is the algorithm used by gzip, would do. Deflate searches for redundancy: its prediction is that if you have seen some sequence of bytes, you are likely to see it again later. That prediction works great for text. At least in the English language, words are constructed from syllables, so it is possible to find repetition in text even in the absence of repeated words. In a Go game, the same coordinate on the board is seldom repeated. You can't place a stone on top of a previously played stone. Barring a few exceptions, each two-byte sequence in the file is unique. A redundancy-based method like this will produce a compressed file that is far from optimal, because the underlying prediction that sequences of bytes repeat does not help here. This observation generalizes to many other kinds of data as well. Recall that we stated that each move is likely to be near the previous move. We could try subtracting each byte from the last, so that instead of seeing moves in absolute coordinates, we see them in relative coordinates.

    Here is a visual representation of that. This is garbage. There are points everywhere and there seems to be no visual pattern at all. It looks random, indicating the data is difficult to predict and therefore difficult to compress. The problem with subtracting naively is that the X and Y coordinates are independent but interleaved in the data. When we subtract adjacent bytes, X gets subtracted from Y and vice versa. Here is the same image as before. We first need to separate the data so that logically related data is stored locally. Instead of writing an X followed by a Y, like most serialization formats would do, let's write out all of the Xs first and then all of the Ys. Here is a visual representation of that. It looks maybe a little tighter than before, indicating our data is less random. Now let's try subtracting again. Here is a visual representation of that. Now we are making progress. What I want you to notice is the three horizontal lines right near the center. Most of the points, about two thirds, lie on these lines. These lines correspond to the values 0, negative 1, and 1. If we wanted to write an algorithm to predict what would come next in the sequence, the algorithm could be minimal: the value is probably 0, negative 1, or 1. We can simplify further and say that the number is likely to be near 0, a "small" number, which sounds familiar from the variable-length encoding used in JSON.
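
    Here is a minimal sketch of those two transforms, assuming each move is a pair of u8 board coordinates: first separate the interleaved X and Y values into their own columns, then take deltas within each column so nearby moves become small numbers centered on zero.

        // Separate interleaved (x, y) moves into two columns.
        fn separate(moves: &[(u8, u8)]) -> (Vec<u8>, Vec<u8>) {
            moves.iter().copied().unzip()
        }

        // Replace each value with its difference from the previous one.
        fn deltas(column: &[u8]) -> Vec<i8> {
            let mut previous = 0u8;
            column
                .iter()
                .map(|&value| {
                    let delta = value.wrapping_sub(previous) as i8;
                    previous = value;
                    delta
                })
                .collect()
        }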

    With a prediction algorithm in hand, we need to come up with a representation. We are going to write a variable-length encoding. In this graphic, we have three rows of boxes describing the variable-length encoding. Each box holds a single bit. There are three boxes on the top row. The first box contains a 0; the next two boxes are blank. The 0 at the beginning is a tag bit. It indicates whether we are in the most likely case, the four smallest values 0, 1, negative 1, and 2, or in the unlikely case covering all the other values. The first bit is taken by the tag bit, leaving two bits for storing those four values. On the second row, we have the tag bit 1 followed by four bits, allowing us to store the 16 less likely values. The bottom row shows 8 bits for reference, which is how many bits are in a byte. Before, we were writing each coordinate in a single byte, so with this encoding every move will save some amount of space. It didn't have to work out that way, but it does because a Go board only has 19 points along each axis, which means we are not using the full range of a byte. If we did use the full range, the encoding would have to let some values extend beyond 8 bits, but most datasets don't use the full range, so this generalizes well to other datasets. The result is that our Go game compresses to less than half the size of writing the data out using one byte per coordinate. The prediction is more accurate while being computationally cheaper; it is better than searching for redundancy by scanning many values.
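
    Here is a rough sketch of that two-tier encoding. The assignment of the sixteen longer codes to concrete delta values is an assumption, and deltas outside the covered range are rejected; a real format would add another tier or an escape code for them.

        struct BitWriter {
            bytes: Vec<u8>,
            bits_used: usize, // total bits written so far
        }

        impl BitWriter {
            fn new() -> Self {
                Self { bytes: Vec::new(), bits_used: 0 }
            }

            // Append the `count` lowest bits of `value`, most significant bit first.
            fn push_bits(&mut self, value: u8, count: u8) {
                for i in (0..count).rev() {
                    if self.bits_used % 8 == 0 {
                        self.bytes.push(0);
                    }
                    let bit = (value >> i) & 1;
                    let offset = 7 - (self.bits_used % 8);
                    *self.bytes.last_mut().unwrap() |= bit << offset;
                    self.bits_used += 1;
                }
            }

            // Tag bit 0 plus 2 bits for the four most likely deltas,
            // tag bit 1 plus 4 bits for sixteen of the remaining ones.
            fn push_delta(&mut self, delta: i8) -> Result<(), &'static str> {
                match delta {
                    0 => self.push_bits(0b0_00, 3),
                    -1 => self.push_bits(0b0_01, 3),
                    1 => self.push_bits(0b0_10, 3),
                    2 => self.push_bits(0b0_11, 3),
                    // One possible assignment of the sixteen longer codes.
                    -9..=-2 => self.push_bits(0b1_0000 | (delta + 9) as u8, 5),
                    3..=10 => self.push_bits(0b1_1000 | (delta - 3) as u8, 5),
                    _ => return Err("delta outside the range this sketch covers"),
                }
                Ok(())
            }
        }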

    Note this isn't the best prediction algorithm possible. If you want to get serious about compression and squeeze the file down further, you could make an even better prediction algorithm. You could write a deterministic Go AI, have it sort predicted moves from best to worst, and bet that the player is more likely to make a good move than a bad one. This might compress twice as well, but the AI would be expensive, require a lot of engineering effort, and, once completed, would only be able to predict the game of Go, whereas the delta compression method sounds like it might be useful for more than just Go. Let's compare the methods in a matrix showing GZIP, delta, and AI. We have the compression ratio, which is how small the file is; performance, which is how fast we can read and write the file; and difficulty, which is the level of engineering effort it takes. A checkmark goes to the best, an X to the worst, and no mark to the one in between. Delta is at a sweet spot, not the worst in any category. If we were to assign a score of plus one for being the best at something and minus one for being the worst, delta compression would come out on top with a score of one, GZIP second with a score of zero, and AI last with a score of negative one. The overall score hardly matters, though, because where GZIP wins is in the difficulty category.

    You get a lot with minimal effort using something like GZIP. Effort is important for working professionals under tight deadlines. I would go so far as to say many of us code in a culture that is hostile to high-performance programming methods, at least when those gains come with any engineering cost. You are not likely to be criticized by peers for using GZIP, whereas the delta compression method requires a fair bit of custom code. What if we could move the difficulty checkmark from GZIP to the delta compression method? If we could do that, then the delta compression method would dominate GZIP, and that is the aspiration of Tree-Buf.

    If you have followed so far in understanding how the delta compression method works, you are already almost there in understanding Tree-Buf. If we forget about the details and look at the delta compression method's underlying principles, we find the essence of Tree-Buf. The first thing we did when applying our custom-designed delta compression method was to separate the X and Y coordinates into their own storage. Tree-Buf generalizes this to the entire schema. If we were to extend from just the X and Y coordinates to a whole tournament, it might look like this. At the top we have the root element, Tournament, which is a struct type with a few fields. If you follow that through all of the moves of the games and the coordinates down to the bottom row, there are X and Y properties which are buffers: one holding all the X coordinates of all the games, and another buffer containing all the Y coordinates of all the games in the tournament. This is a tree of buffers, hence the name. It brings locality to data that is semantically related and of the same type. This transformation is only possible if you know the schema of the data being written. The next thing we did with the compression was to apply a type-aware arrangement of the data. Writing the deltas and packing the ints was only possible because we knew the bytes were u8s and not strings, where subtracting JSON characters produces nonsense.
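
    As a sketch, the tree-of-buffers transform is the struct-of-arrays idea applied to the whole schema: every value of the same field, across all games, lands in one contiguous buffer, plus enough length information to rebuild the tree. The type names here are invented for illustration.

        struct Move { x: u8, y: u8 }
        struct Game { moves: Vec<Move> }
        struct Tournament { games: Vec<Game> }

        // The transposed form that would be written out: one buffer per leaf of
        // the schema, with lengths recording how the tree is reassembled.
        struct TournamentBuffers {
            moves_per_game: Vec<usize>,
            xs: Vec<u8>, // all X coordinates of all games, back to back
            ys: Vec<u8>, // all Y coordinates of all games, back to back
        }

        fn transpose(tournament: &Tournament) -> TournamentBuffers {
            let mut out = TournamentBuffers {
                moves_per_game: Vec::new(),
                xs: Vec::new(),
                ys: Vec::new(),
            };
            for game in &tournament.games {
                out.moves_per_game.push(game.moves.len());
                for m in &game.moves {
                    out.xs.push(m.x);
                    out.ys.push(m.y);
                }
            }
            out
        }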

    Tree-Buf generalizes this principle and uses type-aware compression methods for the different kinds of data in the tree. Since no compression method is one-size-fits-all, it even spends some performance trying a few techniques on a sample of the data from each buffer. The result approximates a hand-rolled file format that genuinely understands your data. So we have fantastic performance and compression. What about ease of use and engineering effort? I claim it is easier to use Tree-Buf than GZIP. How? The trick is that GZIP is not by itself a serialization format. Using GZIP assumes you already have some method for writing structured data, like Protobuf or CSV or MessagePack or whatever. Using GZIP also entails introducing a second step. Writing a Tree-Buf file is one step. The Rust implementation has an API very much like Serde. You just put encode or decode attributes on your structs and call it with your data. There is no extra pass through another dependency and format. It is just the same amount of work it would take to use any serialization format. Tree-Buf is easy to use.
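
    For a feel of that Serde-like shape, here is a sketch of what usage might look like. The derive names and the write/read functions are assumptions about the tree-buf crate's API written from memory, not a verified listing; check the crate's documentation for the exact names.

        // Assumed API shape: annotate the structs, then one call in each direction.
        use tree_buf::prelude::*;

        #[derive(Read, Write, PartialEq, Debug)]
        struct Move { x: u8, y: u8 }

        #[derive(Read, Write, PartialEq, Debug)]
        struct Game { moves: Vec<Move> }

        fn roundtrip(game: &Game) -> Game {
            // One step to write, one step to read; no separate compression pass.
            let bytes = write(game);
            read(&bytes).expect("decoding what we just encoded")
        }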

    How does it do on compression and performance? Time to look at benchmarks. This benchmark uses real-world production data served by The Graph, a decentralized indexing and query protocol for blockchain data. For this, a GraphQL query was made to an indexer on The Graph for 1,000 recent wearable entity auctions. Each entity in the response looks something like this. There are many properties of different types, nested objects, and arrays, and there are thousands of other entities like this one, with a cardinality in the data that reflects a real-world distribution of values. What we measure is relative CPU time to round-trip the data through serialization and deserialization. We compare against MessagePack, which describes itself as being like JSON, but fast and small. I have chosen this format because MessagePack is smaller and faster than JSON, and it is self-describing like JSON, which works well for GraphQL. Tree-Buf is also self-describing, which means you can open up and read any Tree-Buf file without requiring a separate schema to interpret the data, also making it a good fit for GraphQL. The feature sets are similar enough that we can't attribute the results to a difference in capabilities, side-stepping the argument that schemas are necessary for performance.

    Here are the results. The big green box's height is how long it takes the CPU to round-trip the MessagePack file; its width is the size of the file in bytes. The MessagePack file is more than 17 times as large as the Tree-Buf file and takes more than twice as long to serialize and deserialize. The improvements are significant. Considering that the first thing Tree-Buf has to do is reorganize your data into a tree of buffers before starting, and reverse that transformation when reading the data, it has no right to match the speed of MessagePack, much less significantly outperform it. The answer has everything to do with data dependencies and the choices made in representing the data as bytes. Everything we just covered. Let's take a look at a different dataset. For this benchmark we will consider GeoJSON, serializing a list of all the countries. The dataset includes things like the country names, but the bulk of the data is in the polygons that describe the borders. GeoJSON is a compact format because it doesn't describe each point with redundant tags like longitude and latitude repeated over and over, but instead opts to store that data in a giant nested array to minimize overhead. Here are the results. The green box is GeoJSON and the blue box is Tree-Buf. Tree-Buf is more than 10 times as fast as GeoJSON, and it produces a file that is less than one third of the size. The red box is what we get if we opt into lossy float compression, which allows us to specify how much precision is necessary to represent the dataset. The resulting file compresses down to less than one tenth the size without sacrificing speed.

    We have seen how the choices in the representation of data can have a significant impact on speed, size, engineering effort, and capabilities. These impacts are not restricted to the cases we have studied but affect every software engineering problem. There are many capabilities you can design into representations that we did not explore today. Do consider serialization and representation as first-class citizens next to algorithms and code structure. If you use the proper tools to parse and manipulate the data, you will be surprised by the impact. Thank you.

    Moderator:
    Thank you, Zac. Serialization is much deeper than I realized, to be honest, and I understand that now. Yeah, thank you for such a deep presentation. We have three minutes. No questions so far, so do you have anything to add to your presentation, or any comment or additional message?

    Zac:
    I am sorry I didn't make it as accessible as planned. I did find it a struggle to kind of get the ideas down into something like a small package and really present them. I am sorry it wasn't as easy to follow as I hoped when I planned for the talk.

    Moderator:
    Yeah, no, no worries about that. It has a lot of, I'd say, case studies, so it should be -- to be honest, I need some more time to digest your presentation. But I understand now how important serialization actually should be in programming. That, I think, is quite an important message. I agree.

    Zac:
    Yeah, that really is the focus of the talk. If people are interested in that kind of thing, there are other interesting presentations that you could watch. I would recommend watching the data-oriented design talk by Mike Acton. He talks about a lot of things in the same terms, so that's interesting. Definitely follow that thread into data-oriented design. There is a lot to learn in that field.

    Moderator:
    We have one question. Is there anything Tree-Buf is bad for?

    Zac:
    Sure. Tree-Buf takes advantage of being able to find predictability in data with arrays. If you want to do server-to-server communication for things which do not contain arrays, maybe something like Protobuf would be better for that. Tree-Buf tries hard not to be bad in the case where there are no arrays, but there are some fundamental tradeoffs there, whereas wherever Tree-Buf can optimize with arrays, it does pretty well compared to other serialization formats.

    Moderator:
    And we are running out of time. OK.

    Zac:
    I am going to stick around in the chat so if anyone has questions there, I will answer them.

    Moderator:
    Thanks again, Zac, for the great presentation. And the next session is starting in 10 minutes, I believe. I will see you later.