Decoding Bencode in Clojure
Preface
Majority of this code is taken from nREPL’s repository. This article aims to provide more of an explanation behind the code and ditches the netstring part of the repository.
What is Bencode
Bencode is an encoding scheme used in the BitTorrent protocol when encoding metadata. This bencoded metadata is stored
in .torrent files.
Bencode supports four types of values:
- byte strings
- integers
- lists
- dictionaries
Learn more: Wikipedia BitTorrent Specification
Reading bencode in Clojure
Bencode is tightly packed meaning there are no separators between structures, they are all packed like a single byte string. Because of that we need to ‘consume’ the byte string as we go along it. This means that we need to read a certain byte or bytes and then move past them. To accomplish this we will use Java Input Streams mainly Pushback Input Stream which is a variant of an input stream.
You may come across two types of Java classes that consume files, readers and input streams . The difference between them is that readers are used when we need to deal with Unicode character data, and input streams are used for dealing with binary data. Since bencode produces binary data we need to use an input stream rather than a reader.
Reading bytes
Before we can read bytes we need to figure out how we can read a single byte. To
do that we can use the read method of our
input stream instance. This method reads a byte and returns it or -1 indicating a fail.
1(defn read-byte
2 [input]
3 (let [c (.read input)]
4 (when (neg? c)
5 (throw (Exception. "Invalid byte read")))
6 c))
To read multiple bytes we can once again use the read method, but instead of providing no arguments we provide
it with with a buffer, offset and a length so we can read length number of bytes at an offset and put it
into our buffer. We first create a byte array by using (byte-array n), then read the bytes and put them in our
buffer. We also need to handle the case when this fails and throw an exception.
1(defn read-bytes
2 [input n]
3 (let [content (byte-array n)
4 result (.read input content 0 n)]
5 (when (neg? result)
6 (throw (Exception. "Invalid bytes read")))
7 content))
To know how many bytes we need to read, we read an integer before the colon. For example if we have a bencoded
byte string 4:duja, we first need to parse the 4 before the colon to know that we need to read the next 4 bytes
to get the full byte string.
Reading size numbers
These numbers differ in the way we read them from the integers which we will see how to read later. Because we
read a number digit by digit we need to come up with a way to transform the read digits into a number. We do this
by recuring until we hit a delimiter character (in our case the colon character). We multiply our current number with 10
to make space for our new digit, and then add our digit to it. We also need to transform our byte to a digit which we do
by subtracting 48 from it because numbers start from 48 in the ASCII table.
1(defn read-long
2 [input delim]
3 (loop [n (long 0)]
4 (let [b (read-byte input)]
5 (cond
6 (= b delim) n
7 :else (recur (+ (* n (long 10)) (- (long b) (long 48))))))))
Reading a byte string
We are now ready to read a byte string. We just combine our read-bytes function with our read-long function.
1(def colon 58)
2(defn read-bytestring
3 [input]
4 (read-bytes input (read-long input colon)))
Reading a bencode token
To read a bencode token we need to know what type it is. As mentioned at the beginner we have four options here: a bytestring,
integer, list and a dictionary. Every type except a bytestring has a starting character and ends with e character. Here are examples
of these types taken from the wiki:
- integer -
i4erepresents4 - list -
l4:spam4:eggserepresents["spam", "eggs"] - dictionary -
d3:cow3:moo4:spam4:eggserepresents{ "cow" => "moo", "spam" => "eggs" }
Now checking what is a type of a bencode token is pretty simple. We just check what are read byte is and act accordingly.
We also need to handle a reading of a bytestring which we know doesn’t start with any characters. When we don’t read any special
character we know we hit a byte string but our file pointer consumed the first byte of that bytestring. To reverse this we use unread
method of our input stream to put our lost byte at the front of the input stream. Then we just continue reading as a bytestring.
1(def e 101)
2(def i 105)
3(def l 108)
4(def d 100)
5(defn read-token
6 [input]
7 (let [ch (read-byte input)]
8 (cond
9 (= e ch) nil
10 (= i ch) :integer
11 (= l ch) :list
12 (= d ch) :map
13 :else (do
14 (.unread input (int ch))
15 (read-bytestring input)))))
Reading bencode
To read bencode we put together our token function with functions that read the bencode types. We now need to make those functions.
1(declare read-integer read-list read-map)
2
3(defn read-bencode
4 [input]
5 (let [token (read-token input)]
6 (case token
7 :integer (read-integer input)
8 :list (read-list input)
9 :map (read-map input)
10 token)))
Reading integers, lists and dictionaries
Because we didn’t hardcode the colon as a delimiter we can re purpose our read-long to read an integer by
providing e as a delimiter.
1(defn read-integer
2 [input]
3 (read-long input e))
For reading lists and dictionaries we need to be able to read a sequence of bytestring. This is just reading
a bytestring repeatedly until we read e which returns nil in our read-token function.
1(defn token-seq
2 [input]
3 (->> #(read-bencode input)
4 repeatedly
5 (take-while identity)))
Reading a list just means reading a sequence of tokens and putting it into a vector.
1(defn read-list
2 [input]
3 (vec (token-seq input)))
Reading a dictionary is a little more complicated. We also read a sequence and then partition all pairs because they represent a key value pair. Then we just transform the bytes of the key to a string.
1(defn read-map
2 [input]
3 (->> (token-seq input)
4 (into {} (comp (partition-all 2)
5 (map (fn [[k v]]
6 [(String. k "UTF-8") v]))))