Decoding Bencode in Clojure

2023-03-06

5 minute read

clojure

Preface

Majority of this code is taken from nREPL’s repository. This article aims to provide more of an explanation behind the code and ditches the netstring part of the repository.

What is Bencode

Bencode is an encoding scheme used in the BitTorrent protocol when encoding metadata. This bencoded metadata is stored in .torrent files.

Bencode supports four types of values:

byte strings
integers
lists
dictionaries

Learn more: Wikipedia BitTorrent Specification

Reading bencode in Clojure

Bencode is tightly packed meaning there are no separators between structures, they are all packed like a single byte string. Because of that we need to ‘consume’ the byte string as we go along it. This means that we need to read a certain byte or bytes and then move past them. To accomplish this we will use Java Input Streams mainly Pushback Input Stream which is a variant of an input stream.

You may come across two types of Java classes that consume files, readers and input streams . The difference between them is that readers are used when we need to deal with Unicode character data, and input streams are used for dealing with binary data. Since bencode produces binary data we need to use an input stream rather than a reader.

Reading bytes

Before we can read bytes we need to figure out how we can read a single byte. To do that we can use the read method of our input stream instance. This method reads a byte and returns it or -1 indicating a fail.

1(defn read-byte
2  [input]
3  (let [c (.read input)]
4    (when (neg? c)
5      (throw (Exception. "Invalid byte read")))
6    c))

To read multiple bytes we can once again use the read method, but instead of providing no arguments we provide it with with a buffer, offset and a length so we can read length number of bytes at an offset and put it into our buffer. We first create a byte array by using (byte-array n), then read the bytes and put them in our buffer. We also need to handle the case when this fails and throw an exception.

1(defn read-bytes 
2  [input n]
3  (let [content (byte-array n)
4        result (.read input content 0 n)]
5    (when (neg? result)
6      (throw (Exception. "Invalid bytes read")))
7    content))

To know how many bytes we need to read, we read an integer before the colon. For example if we have a bencoded byte string 4:duja, we first need to parse the 4 before the colon to know that we need to read the next 4 bytes to get the full byte string.

Reading size numbers

These numbers differ in the way we read them from the integers which we will see how to read later. Because we read a number digit by digit we need to come up with a way to transform the read digits into a number. We do this by recuring until we hit a delimiter character (in our case the colon character). We multiply our current number with 10 to make space for our new digit, and then add our digit to it. We also need to transform our byte to a digit which we do by subtracting 48 from it because numbers start from 48 in the ASCII table.

1(defn read-long
2  [input delim]
3  (loop [n (long 0)]
4    (let [b (read-byte input)]
5      (cond 
6        (= b delim) n 
7        :else (recur (+ (* n (long 10)) (- (long b) (long 48))))))))

Reading a byte string

We are now ready to read a byte string. We just combine our read-bytes function with our read-long function.

1(def colon 58)
2(defn read-bytestring 
3  [input]
4  (read-bytes input (read-long input colon)))

Reading a bencode token

To read a bencode token we need to know what type it is. As mentioned at the beginner we have four options here: a bytestring, integer, list and a dictionary. Every type except a bytestring has a starting character and ends with e character. Here are examples of these types taken from the wiki:

integer - i4e represents 4
list - l4:spam4:eggse represents ["spam", "eggs"]
dictionary - d3:cow3:moo4:spam4:eggse represents { "cow" => "moo", "spam" => "eggs" }

Now checking what is a type of a bencode token is pretty simple. We just check what are read byte is and act accordingly. We also need to handle a reading of a bytestring which we know doesn’t start with any characters. When we don’t read any special character we know we hit a byte string but our file pointer consumed the first byte of that bytestring. To reverse this we use unread method of our input stream to put our lost byte at the front of the input stream. Then we just continue reading as a bytestring.

 1(def e 101)
 2(def i 105)
 3(def l 108)
 4(def d 100)
 5(defn read-token 
 6  [input]
 7  (let [ch (read-byte input)]
 8    (cond 
 9      (= e ch) nil
10      (= i ch) :integer
11      (= l ch) :list
12      (= d ch) :map 
13      :else (do 
14              (.unread input (int ch))
15              (read-bytestring input)))))

Reading bencode

To read bencode we put together our token function with functions that read the bencode types. We now need to make those functions.

 1(declare read-integer read-list read-map)
 2
 3(defn read-bencode
 4  [input]
 5  (let [token (read-token input)]
 6    (case token 
 7      :integer (read-integer input)
 8      :list (read-list input)
 9      :map (read-map input)
10      token)))

Reading integers, lists and dictionaries

Because we didn’t hardcode the colon as a delimiter we can re purpose our read-long to read an integer by providing e as a delimiter.

1(defn read-integer
2  [input]
3  (read-long input e))

For reading lists and dictionaries we need to be able to read a sequence of bytestring. This is just reading a bytestring repeatedly until we read e which returns nil in our read-token function.

1(defn token-seq
2  [input]
3  (->> #(read-bencode input)
4       repeatedly
5       (take-while identity)))

Reading a list just means reading a sequence of tokens and putting it into a vector.

1(defn read-list
2  [input]
3  (vec (token-seq input)))

Reading a dictionary is a little more complicated. We also read a sequence and then partition all pairs because they represent a key value pair. Then we just transform the bytes of the key to a string.

1(defn read-map
2  [input]
3  (->> (token-seq input)
4       (into {} (comp (partition-all 2)
5                      (map (fn [[k v]]
6                             [(String. k "UTF-8") v]))))