Protobuf in R: Streaming multiple messages
Introduction
Protocol buffers (protobuf) are a method developed by Google (1) for serializing structured data that is both language and platform neutral.
The structure of the information is defined using a protocol buffer message type in a .proto file. The protobuf program then generates source code from this description for generating or parsing stream of bytes that represent the structured data. Unfortunately, R is not one of the languages supported.
Instead, the RProtobuf (3) R package provides a convenient wrapper around the C++ version of the protobuf interface which can be used to read and write protobuf messages.
The RProtobuf package provide functions for writing a single message to a
binary file. However, it does not provide any functionality to write
multiple messages (in fact only the java version of protobuf provides
this functionality). The standard way to including multiple protobuf
messages in one binary file is to precede the message by its byte size
in varint format (Variable Length Integer) (6). This is how functions
parseDelimitedFrom
and writeDelimitedTo
in the java implementation
work. A python implementation of this procedure is given at (2).

I was unable to find any R implementations of the varint encoders and decoders required to convert the message size to the right format so I ended up simplifying and rewriting the python versions available at (5) and (4) .
Example
First we need a .proto
file with the definition of the structure
of the messages. For example, in the following code block I show the
example in the protobuf tutorial (1). This code is provided in the
./Code/example.proto
in the repository 7.
syntax = "proto2";
package example;
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}
repeated PhoneNumber phones = 4;
}
message AddressBook {
repeated Person people = 1;
}
Load Protobuf Definition
In R we first need to load the .proto
file using the
readProtoFiles
function.
library("RProtoBuf")
source("./Code/varint.R")
readProtoFiles("./Code/example.proto")
Then we can create two messages that we would like to write to a binary file.
p1 = new(example.Person, id = 1, name = 'Dirk')
p1$email = "test@test.com"
p2 = new(example.Person, id = 2)
p2$name = "Mary"
print(p1)
print(p2)
name: "Dirk"
id: 1
email: "test@test.com"
name: "Mary"
id: 2
Write Protobuf To binary file
The function VarintBytes32
given in the ./Code/
folder is used to
encode the message length using the varint format. The length is
written to the binary file with the funciton writeBin
while the
actual message is written using the serialize
function.
fname = "./Code/binproto.protobuf"
fileo = file(fname, open = "wb", raw = TRUE)
mysize = VarintBytes32(p1$bytesize())
writeBin(mysize, fileo)
p1$serialize(fileo)
mysize2 = VarintBytes32(p2$bytesize())
writeBin(mysize2, fileo)
p2$serialize(fileo)
close(fileo)
There should now be a file called binproto.protobuf
on the Code
folder with the binary content.
system("ls -l ./Code")
total 24
-rw-r--r-- 1 fred staff 33 Apr 8 20:38 binproto.protobuf
-rw-r--r-- 1 fred staff 409 Apr 8 20:29 example.proto
-rw-r--r-- 1 fred staff 911 Apr 8 20:13 varint.R
system("cat ./Code/binproto.protobuf", intern=TRUE)
[1] "\027"
[2] "\004Dirk\020\001\032\rtest@test.com\b"
[3] "\004Mary\020\002"
Read Protobuf file
In order to read the binary protobuf file it is necessary to first read the binary file to memory.
fileloc = "Code/binproto.protobuf"
alldata = readBin(fileloc, "raw", file.size(fileloc))
alldata
[1] 17 0a 04 44 69 72 6b 10 01 1a 0d 74 65 73 74 40 74 65 73 74 2e 63 6f 6d 08
[26] 0a 04 4d 61 72 79 10 02
Then the function DecodeVarint32
is used to decode the length of the
message as well as the start position. With this information it is
possible to parse the protobuf using the read
function. This is
repeated for every message.
n = 1
pos = DecodeVarint32(alldata,n)
clen = pos[[1]]
n = pos[[2]]
nend = n + clen - 1
tmp = alldata[n:nend]
p1_read = example.Person$read(tmp)
n = clen + n
pos = DecodeVarint32(alldata,n)
clen = pos[[1]]
n = pos[[2]]
nend = n + clen - 1
tmp = alldata[n:nend]
p2_read = example.Person$read(tmp)
print(p1_read)
print(p2_read)
name: "Dirk"
id: 1
email: "test@test.com"
name: "Mary"
id: 2
References
- https://developers.google.com/protocol-buffers/
- https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/
- https://cran.r-project.org/web/packages/RProtoBuf/
- https://github.com/protocolbuffers/protobuf/blob/master/python/google/protobuf/internal/encoder.py
- https://raw.githubusercontent.com/protocolbuffers/protobuf/master/python/google/protobuf/internal/decoder.py
- https://carlmastrangelo.com/blog/lets-make-a-varint
- https://github.com/fkgruber/protobuf_reader