Concord


Network Protocol

Based on _Kademlia_ (Maymounkov and Mazieres 2002) and its extension
_S/Kademlia_ (Baumgart and Mies 2007). Each node is identified by a
nodeId which is computed using a crypto puzzle. The puzzle generates
also a key pair that is used to sign and verify integrity of messages.

Kademlia builds a _structured overlay_ network, so nodes connections are
not built at random, but following the protocol's rules. Data (in a
file-sharing application for example) and routing information are held
in a _distributed hash table_ (DHT). The word distributed means that
global information are not held anywhere and any peer maintains a small
portion of it; more specifically only the portion it is responsible for,
again following the protocol's rules.

Kademlia uses four RPCs for node / value lookup, routing table
maintenance and storing values in the DHT:

-   ping: usual ping / pong request to check liveness of nodes;
-   store: instructs the target node to store a (key, value) pair;
-   find_node: takes a nodeId as an argument and the recipient of the
    request returns a list of k triples (IPaddr, UDPport, nodeId) of
    nodes closest to the queried nodeId;
-   find_value: behaves like find_node with the exception that if the
    target node has previously received a store request for the target
    key, it will return the stored value directly.

At the moment Concord implements only ping and find_node RPCs. This is
because the other ones seem related to a file-sharing application. The
future development of the voting protocol will tell us if and in what
extent Concord will need to implement store and find_value procedures.

Nodes are stored in a DHT. The local DHT is organized in _k-buckets_.
For each bit in the nodeId, a node maintains a k-bucket, that is a FIFO
list of nodes sorted with a last-seen policy (least recently seen node
at the head). Let be n the number of bits in the nodeId. Each node
maintains n k-buckets and each of them contains nodes at distance
between 2i and 2i + 1 ∀i ∈ {0, ..., n − 1}.

    digraph G {
      size = "10,10";
      ranksep = ".6 equally";
      nodesep = ".5";
      ordering = "out";
      node [shape=circle, height=0, label=""];
      edge [splines=line, labeldistance=3, labelfontsize=22, dir=none];

      root;
      { rank = same; node0; node1 };
      root -> node0 [headlabel="0"];
      root -> node1 [headlabel="1"];

      { rank = same; node00; node01; node10; node11 };
      node0 -> node00 [headlabel="0", color="/set16/1", style=bold];
      node0 -> node01 [headlabel="1", color="/set16/1", style=bold];
      node1 -> node10 [headlabel="0"];
      node1 -> node11 [headlabel="1"];

      { rank = same; node000; node001; node010; node100; node110; node011;
        node101; node111; };
      node00 -> node000 [headlabel="0", color="/set16/1", style=bold];
      node00 -> node001 [headlabel="1", color="/set16/1", style=bold];
      node01 -> node010 [headlabel="0", color="/set16/1", style=bold];
      node01 -> node011 [headlabel="1", color="/set16/1", style=bold];
      node10 -> node100 [headlabel="0", color="/set16/2", style=bold];
      node10 -> node101 [headlabel="1", color="/set16/2", style=bold];
      node11 -> node110 [headlabel="0"];
      node11 -> node111 [headlabel="1"];

      node [style=filled, fillcolor="#EEEEEE", height=.2];
      {
        rank = same;
        node [fillcolor="/set16/1"];
        node0000; node0001 [group=0001]; node0010; node0011; node0100 [group=0100];
        node0101 [group=0101]; node0110 [group=0110]; node0111;
        node [fillcolor="/set16/2"];
        node1000 [group=1000]; node1001 [group=1001]; node1010 [group=1010]; node1011;
        node1100 [fillcolor=black];
        node1101 [fillcolor="/set16/4", group=1101];
        node [fillcolor="/set16/3"];
        node1110 [group=1110]; node1111 [group=1111];
      }
      node000 -> node0000 [headlabel="0", color="/set16/1", style=bold];
      node000 -> node0001 [headlabel="1", color="/set16/1", style=bold];
      node001 -> node0010 [headlabel="0", color="/set16/1", style=bold];
      node001 -> node0011 [headlabel="1", color="/set16/1", style=bold];
      node010 -> node0100 [headlabel="0", color="/set16/1", style=bold];
      node010 -> node0101 [headlabel="1", color="/set16/1", style=bold];
      node011 -> node0110 [headlabel="0", color="/set16/1", style=bold];
      node011 -> node0111 [headlabel="1", color="/set16/1", style=bold];
      node100 -> node1000 [headlabel="0", color="/set16/2", style=bold];
      node100 -> node1001 [headlabel="1", color="/set16/2", style=bold];
      node101 -> node1010 [headlabel="0", color="/set16/2", style=bold];
      node101 -> node1011 [headlabel="1", color="/set16/2", style=bold];
      node110 -> node1100 [headlabel="0"];
      node110 -> node1101 [headlabel="1"];
      node111 -> node1110 [headlabel="0", color="/set16/3", style=bold];
      node111 -> node1111 [headlabel="1", color="/set16/3", style=bold];

      node1101 [fillcolor="/set16/4", height=.2, style=filled];
      node111  [fillcolor="/set16/3", height=.2, style=filled];
      node10   [fillcolor="/set16/2", height=.2, style=filled];
      node0    [fillcolor="/set16/1", height=.2, style=filled];
    }

In the above Figure, we show Kademlia's DHT for nodeId 1100 in a network
where the ID space is in 4 bits. Each color corresponds to a different
k-bucket's range. In each k-bucket node IDs can be present or not (in
the Figure, all nodes are shown as present) and each node maintains
locally a list of k nodes for each bucket. Note also that each k-bucket
is responsible for progressively twice the number of node IDs when
moving away from the local node's ID, that is
ni = 2ni − 1∀i ∈ {1, ..., n − 1} and n₀ = 1 where ni is the number of
possible node IDs in bucket i. By moving away from the current node ID,
each k-bucket covers twice the number of node IDs.

Broadcasting

This term is used to mean a _one to all_ communication originating at a
single node. Broadcasting is for example used in BitTorrent to announce
to other peers when a node starts downloading a file, in Concord it is
used to announce a new poll. The problem of efficient broadcasting (that
is, with a minimum number of messages) has been tackled in (El-Ansary et
al. 2003). The paper focuses on Chord's DHT and overlay structure, but
gives great hindsights on the subject. (Zoltdn Czirkos and Hosszu 2010)
and (Zoltán Czirkos and Hosszú 2013) expand on the same idea, with a
focus instead on Kademlia. The broadcasting algorithm exploits the
network structure so to optimize the number of messages exchanged. The
basic idea is an application of the divide-et-impera strategy: each node
at each round is responsible for a smaller subtree than nodes at the
previous round.

    digraph G {
      size = "10,10";
      ranksep = ".6 equally";
      nodesep = ".5";
      ordering = "out";
      node [shape=circle, height=0, label=""];
      edge [splines=line, labeldistance=3, labelfontsize=22, dir=none];

      root;
      { rank = same; node0; node1 };
      root -> node0 [headlabel="0"];
      root -> node1 [headlabel="1"];

      { rank = same; node00; node01; node10; node11 };
      node0 -> node00 [headlabel="0", color="/set16/1", style=bold];
      node0 -> node01 [headlabel="1", color="/set16/1", style=bold];
      node1 -> node10 [headlabel="0"];
      node1 -> node11 [headlabel="1"];

      { rank = same; node000; node001; node010; node100; node110; node011;
        node101; node111; };
      node00 -> node000 [headlabel="0", color="/set16/1", style=bold];
      node00 -> node001 [headlabel="1", color="/set16/1", style=bold];
      node01 -> node010 [headlabel="0", color="/set16/1", style=bold];
      node01 -> node011 [headlabel="1", color="/set16/1", style=bold];
      node10 -> node100 [headlabel="0", color="/set16/2", style=bold];
      node10 -> node101 [headlabel="1", color="/set16/2", style=bold];
      node11 -> node110 [headlabel="0"];
      node11 -> node111 [headlabel="1"];

      node [style=filled, fillcolor="#EEEEEE", height=.2];
      {
        rank = same;
        node [fillcolor="/set16/1"];
        node0000; node0001 [group=0001]; node0010; node0011; node0100 [group=0100];
        node0101 [group=0101]; node0110 [group=0110]; node0111;
        node [fillcolor="/set16/2"];
        node1000 [group=1000]; node1001 [group=1001]; node1010 [group=1010]; node1011;
        node1100 [fillcolor=black];
        node1101 [fillcolor="/set16/4", group=1101];
        node [fillcolor="/set16/3"];
        node1110 [group=1110]; node1111 [group=1111];
      }
      node000 -> node0000 [headlabel="0", color="/set16/1", style=bold];
      node000 -> node0001 [headlabel="1", color="/set16/1", style=bold];
      node001 -> node0010 [headlabel="0", color="/set16/1", style=bold];
      node001 -> node0011 [headlabel="1", color="/set16/1", style=bold];
      node010 -> node0100 [headlabel="0", color="/set16/1", style=bold];
      node010 -> node0101 [headlabel="1", color="/set16/1", style=bold];
      node011 -> node0110 [headlabel="0", color="/set16/1", style=bold];
      node011 -> node0111 [headlabel="1", color="/set16/1", style=bold];
      node100 -> node1000 [headlabel="0", color="/set16/2", style=bold];
      node100 -> node1001 [headlabel="1", color="/set16/2", style=bold];
      node101 -> node1010 [headlabel="0", color="/set16/2", style=bold];
      node101 -> node1011 [headlabel="1", color="/set16/2", style=bold];
      node110 -> node1100 [headlabel="0"];
      node110 -> node1101 [headlabel="1"];
      node111 -> node1110 [headlabel="0", color="/set16/3", style=bold];
      node111 -> node1111 [headlabel="1", color="/set16/3", style=bold];

      node1101 [fillcolor="/set16/4", height=.2, style=filled];
      node111  [fillcolor="/set16/3", height=.2, style=filled];
      node10   [fillcolor="/set16/2", height=.2, style=filled];
      node0    [fillcolor="/set16/1", height=.2, style=filled];

      subgraph broadcast1 {
        node [shape=box];
        0100 [label=0100, group=0100];
        1001 [label=1001, group=1001];
        1101 [label=1101, group=1101];
        1111 [label=1111, group=1111];
        edge [style=dotted, color=black];
        node0100 -> 0100;
        node1001 -> 1001;
        node1101 -> 1101;
        node1111 -> 1111;
        edge [color="/set16/5", splines=curved, style=solid];
        node1100 -> 0100 [dir=forward];
        node1100 -> 1001 [dir=forward];
        node1100 -> 1101 [dir=forward];
        node1100 -> 1111 [dir=forward];
      }

      subgraph broadcast2 {
        node [shape=box];
        0101 [label=0101, group=0101];
        0110 [label=0110, group=0110];
        0001 [label=0001, group=0001];
        1000 [label=1000, group=1000];
        1010 [label=1010, group=1010];
        1110 [label=1110, group=1110];
        edge [style=dotted, color=black];
        node0101 -> 0101;
        node0110 -> 0110;
        node0001 -> 0001;
        node1000 -> 1000;
        node1010 -> 1010;
        node1110 -> 1110;
        edge [color="/set16/5", splines=curved, style=solid];
        0100 -> 0001 [dir=forward];
        0100 -> 0101 [dir=forward];
        0100 -> 0110 [dir=forward];
        1001 -> 1000 [dir=forward];
        1001 -> 1010 [dir=forward];
        1111 -> 1110 [dir=forward];
      }
    }

The Figure above shows an example execution of the broadcast algorithm
with node 1100 as the initiator. It starts the procedure by sending the
message to a randomly selected node for each k-bucket. Each of the
contacted node is then responsible for the subtree referenced by the
k-bucket it belongs to. Each message contains also the height of the
tree the receiving node is responsible for. Recursively then each of
those nodes forwards the message to a randomly selected node in each of
its k-buckets among those that reference nodes at the height it is
responsible for (contained in the message originally received). When
forwarding the message, each node increases the height field in the
message, in order to limit the responsibility of the receiving node,
based on its distance from the sender.

BROADCAST ALGORITHM 1 _Input: message text, height_

    (0..height-1).each do |i|
      if (buckets[i].size != 0) then
        node = random node from buckets[i]
        send(node, message, i)
      end
    end

The algorithm above implements the broadcast. An initiator node sets the
height to n − 1, the number of k-buckets, and the message text at will.
Then starts sending the message to a randomly picked node from each
k-bucket i. The buckets are sorted in a way that buckets[0] can contain
maximum one node and is the closest possible from itself (their node IDs
differs by only one least significant bit), that is k-bucket indexed
with i contains nodes at a distance between 2i and 2i + 1. When sending
a message to a node in the i-th bucket, the sender sets the height in
the message to i, so to limit the broadcasting scope of the receiver.
The height in this sense, represent the tree height-level already
covered by other nodes.

The problem with this algorithm is that some nodes are responsible of
large subtrees. If one of these nodes fails to deliver the message, big
subtrees wont receive it. Consider the first round of broadcasting, so
messages are sent by the initiator to a node in each bucket. The node
picked from the farthest bucket will be the only responsible of
forwarding the message to other nodes in its subtree, which covers
addressed at a distance between 2n − 1 and 2n, that is half the global
tree. If that node fails to deliver the message (because of any of the
possible failure types, i.e. crash, Byzantine), then, supposing a
perfect binary tree, half of the nodes in the network wont receive the
message.

In (Zoltán Czirkos and Hosszú 2013) the _reliability_ of the
broadcasting algorithm is defined as the ratio between nodes that
receive the message and the total number of nodes in the network:
$m = \frac{N_r}{N}$ where N is the total number of nodes and Nr is the
number of nodes that receive the message. The authors set a
probabilistic framework to evaluate the reliability starting from a
measure of _successful delegate selection_, that is a message is
successfully broadcasted in a subtree through the selection of a
delegate like Algorithm 1. They define a delegate selection to be
successful if the packet does not encounter any of three cases I) it is
lost, II) the receiver is Byzantine or III) the entry in the routing
table for the receiver is stale. The successful delegation probability
is P = 1 − Ph where Ph denotes the probability of failure. The authors
show that for a balanced tree of height b, the reliability of Algorithm
1 is given by $m = \left( \frac{1 + P}{2} \right)^b$.

To improve the reliability of the broadcasting algorithm, the authors of
the same paper proposed an upgraded version of Algorithm 1 that uses
_replication_. In Algorithm 2 the message is sent to more nodes in the
same k-bucket, in this way there is not any more a single point of
failure in a subtree, but the responsibility is replicated among a
number kb of them. To maintain efficiency it is required that kb ≤ k in
order to avoid sending find_node requests. Moreover because a node can
now receive a message more than once, each message has to be tagged with
a unique identifier.

BROADCAST ALGORITHM 2 _Input: identifier, message text, height_

    return if seen_messages.includes? identifier
    seen_messages = seen_messages + [identifier]
    (0..height-1).each do |i|
      if buckets[i].size != 0 then
        nodes = Kb random nodes from buckets[i]
        nodes.each do |node|
          send(node, identifier, message, i)
        end
      end
    end

By selecting more nodes from each k-bucket, the probability of failing
to broadcast the message in a subtree will be Phkb where Ph is the
probability of a single message getting lost and of the broadcast
failing in the subtree the receiving node was responsible for. For
example in a network with a fairly high amount of packets loss, lets say
Ph = 10%, selecting a replication factor of kb = 2 will reduce the
probability of losing a message to Ph² = 1%, thus the probability of
successful delegation is increased from P = 90% to P = 99%. By
substituting Phkb into Ph in the reliability's formula and solve for kb,
we obtain an expression of the replication factor as a function of the
reliability m, the tree height b and the probability of packet loss Ph:
$$k_b = \left\lceil \frac{\ln (2 (1 - \sqrt[b]{m}))}{\ln P_h} \right\rceil$$


Voting Protocol

The goal of this protocol is to allow anyone to create a poll and
collect opinions on it. Each node is itself a voter and everyone is free
to participate in the voting process. Each voter is also responsible of
verification of votes and votes counting.

Starting a Poll

Whenever someone wants to start a poll, he / she has to send a
start_poll RPC. This procedure contains the poll's question and its
hash, computed from the text and a nonce. This hash is then used to
build the POLL'S MERKLE TREE (details to come in a later section). Each
node that wants to take part in the poll, has to find a key pair for
which the hash of the public key is numerically less then the start_poll
_genesis_ hash (the created with the aforementioned RPC). The process is
similar to the nodeId generation in S/Kademlia and is used to prevent
huge amounts of valid key pairs for a single poll. The start_poll
message contains also a TTL (time-to-live) indicating the maximum time
allowed to find a valid key pair for the poll. TTL and crypto puzzle
difficulty are measures that control the ability of nodes to generate an
high number of valid key pairs.

    title "Starting a poll"

    actor Alice
    actor Bob
    actor Charlie
    actor Dave

    Alice -> Bob: start_poll
    Alice -> Charlie: start_poll
    Bob -> Alice: ack_poll
    Charlie -> Alice: nack_poll
    Bob -> Charlie: start_poll
    Charlie -> Bob: nack_poll
    Bob -> Dave: start_poll

The node that starts the poll is responsible of sending the start_poll
RPC to other nodes. To do so it has to send the RPC to all nodes that it
knows about and sends the same request to other nodes that discovers
through recursively call find_node on new nodes that discovers. Another
strategy / extension could be to start searching for nodes randomly by
generating a random nodeId starting a recursive lookup for that node and
sending start_poll requests along the way. Nodes that decide to
participate in the poll sends an ack_poll back to the node from which
came to know about the poll. This RPC does not contain nodes and means
that the node is taking responsibility to forward the start_poll message
to its neighbors (all nodes that it knows about or only k closest
neighbors are options). Nodes that do not want to participate in the
poll send an nack_poll RPC containing its k closest nodes.

After the TTL expires, each node participating in the poll has a key
pair that can be used to vote the poll.


 References

Baumgart, Ingmar, and Sebastian Mies. 2007. “S/Kademlia: A Practicable
Approach Towards Secure Key-Based Routing.” In _Parallel and Distributed
Systems, 2007 International Conference on_, 2:1–8. IEEE.

Czirkos, Zoltán, and Gábor Hosszú. 2013. “Solution for the Broadcasting
in the Kademlia Peer-to-Peer Overlay.” _Computer Networks_ 57 (8).
Elsevier: 1853–62.

Czirkos, Zoltdn, and Gdbor Hosszu. 2010. “Enhancing the Kademlia P 2 P
Network.” _Periodica Polytechnica, Electrical Engineering_ 54 (3-4).
Budapest University of Technology and Economics: 87–92.

El-Ansary, Sameh, Luc Onana Alima, Per Brand, and Seif Haridi. 2003.
“Efficient Broadcast in Structured P2p Networks.” In _International
Workshop on Peer-to-Peer Systems_, 304–14. Springer.

Maymounkov, Petar, and David Mazieres. 2002. “Kademlia: A Peer-to-Peer
Information System Based on the Xor Metric.” In _International Workshop
on Peer-to-Peer Systems_, 53–65. Springer.