An Intro to Zero-Copy Reads and Serialization

The first time I stumbled upon the term `Zero Copy Serialization` was when I started looking into Apache Arrow. The website says:

> The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead

That’s when I started reading more about zero-copy operations, and why they are beneficial in high-performance data systems. This article highlights zero-copy I/O and compute, which typically come into play when serializing and deserializing data.

Often, systems need to serialize data: to move it between memory locations, store it on disk, or send it between machines in a distributed network.
Serialization transforms data structures such as classes, structs, and other primitive types into raw bytes, which can later be stored on disk or moved over a network.

This is where serialization formats come into the picture: the way data is serialized and stored on disk costs some CPU and memory. One may choose JSON, Protocol Buffers, FlatBuffers, or plain serialization without any format.

For instance, when a JSON API request is parsed into a Java POJO or Go struct, the data bytes are read into memory and transformed into native objects. No matter how small the payload is, the process of turning JSON bytes into a native object representation consumes some memory and CPU.
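As a tiny illustration of that cost (the payload and field names here are invented for the example), parsing JSON materializes brand-new objects from the bytes:

```python
import json

# Raw JSON bytes, e.g. the body of an API request.
payload = b'{"id": 42, "name": "Ada"}'

# json.loads walks every byte of the payload and builds brand-new
# objects (a dict, an int, a str). This parse-and-copy step is the
# serialization overhead described above.
record = json.loads(payload)
print(record["id"], record["name"])  # 42 Ada
```

The work is proportional to the payload size, and it happens on every read, even if the application only needs one field.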

Zero-copy serialization involves representing data in memory as raw bytes,
and operating on those bytes directly instead of transforming them; the very same bytes can represent the data both on disk and in memory. Lightweight wrappers may be written to facilitate various operations on this data.
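A minimal sketch of such a wrapper, assuming a made-up fixed layout (an 8-byte little-endian id followed by a UTF-8 name; this is not any real format): fields are read straight out of the buffer, with no intermediate object tree.

```python
import struct

class Record:
    """Thin wrapper over raw bytes laid out as:
    bytes 0..7  -> little-endian unsigned 64-bit id
    bytes 8..   -> UTF-8 name
    (toy layout, assumed for illustration)"""

    def __init__(self, buf):
        self.buf = memoryview(buf)  # a view over the bytes, not a copy

    def id(self):
        # Decode the integer in place -- no parsing pass over the payload.
        return struct.unpack_from("<Q", self.buf, 0)[0]

    def name_bytes(self):
        # A sub-view over the name bytes; still no copy.
        return self.buf[8:]

buf = struct.pack("<Q", 42) + b"Ada"
rec = Record(buf)
print(rec.id(), bytes(rec.name_bytes()))  # 42 b'Ada'
```

The same `buf` could have come from a file, a memory-mapped region, or a network socket; the wrapper never needs the data converted into another representation first.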

Such a technique can save significant CPU and I/O in data-intensive applications. As data sizes grow, the cost of serialization and transformation gets expensive. Moreover, having data represented the same way both in memory and on disk makes it possible to operate on data without loading all of it into memory. Imagine being able to process terabytes of data on disk without ever loading or deserializing the entire dataset.

Another significant benefit is interoperability. Zero-copy structures enable efficient transfer and processing of bytes regardless of the
application, language, or system that processes them.

A fantastic application of this concept that is worth checking out is Apache Arrow, an in-memory columnar format for fast data processing. Every program, in any language, understands the format, and all of them share the same memory representation of the data.

As a reference, Wikipedia has a good list of serialization formats that support zero-copy operations. Do give a new serialization framework a try and learn more.

Originally published at

Shanmukh Sista