Concord Network Protocol

Concord's network protocol is based on _Kademlia_ (Maymounkov and Mazières 2002) and its extension _S/Kademlia_ (Baumgart and Mies 2007). Each node is identified by a nodeId, which is computed by solving a crypto puzzle. The puzzle also generates a key pair that is used to sign messages and verify their integrity. Kademlia builds a _structured overlay_ network: node connections are not established at random but follow the protocol's rules. Data (in a file-sharing application, for example) and routing information are held in a _distributed hash table_ (DHT). "Distributed" means that the global information is not held in any single place: each peer maintains only the small portion it is responsible for, again according to the protocol's rules.

Kademlia uses four RPCs for node / value lookup, routing table maintenance and storing values in the DHT:

- ping: usual ping / pong request to check the liveness of nodes;
- store: instructs the target node to store a (key, value) pair;
- find_node: takes a nodeId as an argument; the recipient returns a list of k triples (IPaddr, UDPport, nodeId) of the nodes closest to the queried nodeId;
- find_value: behaves like find_node, except that if the target node has previously received a store request for the target key, it returns the stored value directly.

At the moment Concord implements only the ping and find_node RPCs, since the other two seem tied to file-sharing applications. The future development of the voting protocol will tell us whether, and to what extent, Concord will need the store and find_value procedures.

Known nodes are stored in the local routing table, which is organized in _k-buckets_. For each bit of the nodeId, a node maintains a k-bucket: a FIFO list of up to k nodes kept in last-seen order, with the least recently seen node at the head. Let n be the number of bits in the nodeId. Each node maintains n k-buckets, and bucket i contains nodes at an XOR distance between $2^i$ and $2^{i+1}$, for $i \in \{0, \ldots, n-1\}$.
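To make the bucket layout concrete, here is a minimal sketch, not taken from the Concord codebase, of how a contact can be mapped to a k-bucket using the XOR metric; the names `distance` and `bucket_index` are illustrative only.

```ruby
# Illustrative sketch (not Concord code): XOR distance and k-bucket index.

# Kademlia's metric: the distance between two IDs is their bitwise XOR.
def distance(a, b)
  a ^ b
end

# Bucket i holds contacts whose distance from the local node lies in
# [2^i, 2^(i+1)), so the index is the position of the distance's highest set bit.
def bucket_index(self_id, other_id)
  d = distance(self_id, other_id)
  raise ArgumentError, 'a node never stores itself' if d.zero?
  d.bit_length - 1
end

bucket_index(0b1100, 0b1101) # => 0, closest bucket (a single possible ID)
bucket_index(0b1100, 0b0100) # => 3, farthest bucket (half of a 4-bit ID space)
```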
```dot
digraph G {
  size = "10,10"; ranksep = ".6 equally"; nodesep = ".5"; ordering = "out";
  node [shape=circle, height=0, label=""];
  edge [splines=line, labeldistance=3, labelfontsize=22, dir=none];
  root;
  { rank = same; node0; node1 };
  root -> node0 [headlabel="0"]; root -> node1 [headlabel="1"];
  { rank = same; node00; node01; node10; node11 };
  node0 -> node00 [headlabel="0", color="/set16/1", style=bold]; node0 -> node01 [headlabel="1", color="/set16/1", style=bold];
  node1 -> node10 [headlabel="0"]; node1 -> node11 [headlabel="1"];
  { rank = same; node000; node001; node010; node100; node110; node011; node101; node111; };
  node00 -> node000 [headlabel="0", color="/set16/1", style=bold]; node00 -> node001 [headlabel="1", color="/set16/1", style=bold];
  node01 -> node010 [headlabel="0", color="/set16/1", style=bold]; node01 -> node011 [headlabel="1", color="/set16/1", style=bold];
  node10 -> node100 [headlabel="0", color="/set16/2", style=bold]; node10 -> node101 [headlabel="1", color="/set16/2", style=bold];
  node11 -> node110 [headlabel="0"]; node11 -> node111 [headlabel="1"];
  node [style=filled, fillcolor="#EEEEEE", height=.2];
  { rank = same;
    node [fillcolor="/set16/1"]; node0000; node0001 [group=0001]; node0010; node0011; node0100 [group=0100]; node0101 [group=0101]; node0110 [group=0110]; node0111;
    node [fillcolor="/set16/2"]; node1000 [group=1000]; node1001 [group=1001]; node1010 [group=1010]; node1011; node1100 [fillcolor=black]; node1101 [fillcolor="/set16/4", group=1101];
    node [fillcolor="/set16/3"]; node1110 [group=1110]; node1111 [group=1111];
  }
  node000 -> node0000 [headlabel="0", color="/set16/1", style=bold]; node000 -> node0001 [headlabel="1", color="/set16/1", style=bold];
  node001 -> node0010 [headlabel="0", color="/set16/1", style=bold]; node001 -> node0011 [headlabel="1", color="/set16/1", style=bold];
  node010 -> node0100 [headlabel="0", color="/set16/1", style=bold]; node010 -> node0101 [headlabel="1", color="/set16/1", style=bold];
  node011 -> node0110 [headlabel="0", color="/set16/1", style=bold]; node011 -> node0111 [headlabel="1", color="/set16/1", style=bold];
  node100 -> node1000 [headlabel="0", color="/set16/2", style=bold]; node100 -> node1001 [headlabel="1", color="/set16/2", style=bold];
  node101 -> node1010 [headlabel="0", color="/set16/2", style=bold]; node101 -> node1011 [headlabel="1", color="/set16/2", style=bold];
  node110 -> node1100 [headlabel="0"]; node110 -> node1101 [headlabel="1"];
  node111 -> node1110 [headlabel="0", color="/set16/3", style=bold]; node111 -> node1111 [headlabel="1", color="/set16/3", style=bold];
  node1101 [fillcolor="/set16/4", height=.2, style=filled]; node111 [fillcolor="/set16/3", height=.2, style=filled];
  node10 [fillcolor="/set16/2", height=.2, style=filled]; node0 [fillcolor="/set16/1", height=.2, style=filled];
}
```

In the Figure above, we show Kademlia's DHT from the point of view of nodeId 1100, in a network with a 4-bit ID space. Each color corresponds to a different k-bucket's range. The node IDs in a k-bucket's range may or may not be present in the network (in the Figure, all of them are shown as present), and each node maintains locally a list of up to k nodes for each bucket. Note also that, moving away from the local node's ID, each k-bucket is responsible for twice as many node IDs as the previous one: $n_i = 2 n_{i-1}$ for $i \in \{1, \ldots, n-1\}$, with $n_0 = 1$, where $n_i$ is the number of possible node IDs in bucket i.
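The k-bucket itself is just a bounded list kept in last-seen order. The following is a minimal sketch of that ordering under the description above; the `KBucket` class, the `touch` method and the value of `K` are hypothetical and not part of Concord.

```ruby
# Illustrative sketch (hypothetical KBucket class, not Concord code) of the
# FIFO, last-seen ordering described above: least recently seen at the head.
K = 20 # bucket capacity; a protocol parameter, the value here is only an example

class KBucket
  def initialize
    @contacts = [] # head = least recently seen, tail = most recently seen
  end

  # Record that we just heard from `contact`.
  def touch(contact)
    if @contacts.delete(contact) # already known: move it to the tail
      @contacts << contact
    elsif @contacts.size < K     # still room: append as most recently seen
      @contacts << contact
    else
      # Bucket full. Kademlia pings the head and evicts it only if it does not
      # answer; that maintenance step is omitted from this sketch.
    end
  end

  def least_recently_seen
    @contacts.first
  end
end
```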
Broadcasting

This term is used to mean a _one-to-all_ communication originating at a single node. Broadcasting is used, for example, in BitTorrent to announce to other peers that a node has started downloading a file; in Concord it is used to announce a new poll. The problem of efficient broadcasting (that is, broadcasting with a minimum number of messages) has been tackled in (El-Ansary et al. 2003). That paper focuses on Chord's DHT and overlay structure, but gives great insight into the subject. (Czirkos and Hosszú 2010) and (Czirkos and Hosszú 2013) expand on the same idea, focusing instead on Kademlia.

The broadcasting algorithm exploits the network structure so as to minimize the number of messages exchanged. The basic idea is an application of the divide-and-conquer strategy: at each round, each node is responsible for a smaller subtree than the nodes at the previous round.
```dot
digraph G {
  size = "10,10"; ranksep = ".6 equally"; nodesep = ".5"; ordering = "out";
  node [shape=circle, height=0, label=""];
  edge [splines=line, labeldistance=3, labelfontsize=22, dir=none];
  root;
  { rank = same; node0; node1 };
  root -> node0 [headlabel="0"]; root -> node1 [headlabel="1"];
  { rank = same; node00; node01; node10; node11 };
  node0 -> node00 [headlabel="0", color="/set16/1", style=bold]; node0 -> node01 [headlabel="1", color="/set16/1", style=bold];
  node1 -> node10 [headlabel="0"]; node1 -> node11 [headlabel="1"];
  { rank = same; node000; node001; node010; node100; node110; node011; node101; node111; };
  node00 -> node000 [headlabel="0", color="/set16/1", style=bold]; node00 -> node001 [headlabel="1", color="/set16/1", style=bold];
  node01 -> node010 [headlabel="0", color="/set16/1", style=bold]; node01 -> node011 [headlabel="1", color="/set16/1", style=bold];
  node10 -> node100 [headlabel="0", color="/set16/2", style=bold]; node10 -> node101 [headlabel="1", color="/set16/2", style=bold];
  node11 -> node110 [headlabel="0"]; node11 -> node111 [headlabel="1"];
  node [style=filled, fillcolor="#EEEEEE", height=.2];
  { rank = same;
    node [fillcolor="/set16/1"]; node0000; node0001 [group=0001]; node0010; node0011; node0100 [group=0100]; node0101 [group=0101]; node0110 [group=0110]; node0111;
    node [fillcolor="/set16/2"]; node1000 [group=1000]; node1001 [group=1001]; node1010 [group=1010]; node1011; node1100 [fillcolor=black]; node1101 [fillcolor="/set16/4", group=1101];
    node [fillcolor="/set16/3"]; node1110 [group=1110]; node1111 [group=1111];
  }
  node000 -> node0000 [headlabel="0", color="/set16/1", style=bold]; node000 -> node0001 [headlabel="1", color="/set16/1", style=bold];
  node001 -> node0010 [headlabel="0", color="/set16/1", style=bold]; node001 -> node0011 [headlabel="1", color="/set16/1", style=bold];
  node010 -> node0100 [headlabel="0", color="/set16/1", style=bold]; node010 -> node0101 [headlabel="1", color="/set16/1", style=bold];
  node011 -> node0110 [headlabel="0", color="/set16/1", style=bold]; node011 -> node0111 [headlabel="1", color="/set16/1", style=bold];
  node100 -> node1000 [headlabel="0", color="/set16/2", style=bold]; node100 -> node1001 [headlabel="1", color="/set16/2", style=bold];
  node101 -> node1010 [headlabel="0", color="/set16/2", style=bold]; node101 -> node1011 [headlabel="1", color="/set16/2", style=bold];
  node110 -> node1100 [headlabel="0"]; node110 -> node1101 [headlabel="1"];
  node111 -> node1110 [headlabel="0", color="/set16/3", style=bold]; node111 -> node1111 [headlabel="1", color="/set16/3", style=bold];
  node1101 [fillcolor="/set16/4", height=.2, style=filled]; node111 [fillcolor="/set16/3", height=.2, style=filled];
  node10 [fillcolor="/set16/2", height=.2, style=filled]; node0 [fillcolor="/set16/1", height=.2, style=filled];
  subgraph broadcast1 {
    node [shape=box];
    0100 [label=0100, group=0100]; 1001 [label=1001, group=1001]; 1101 [label=1101, group=1101]; 1111 [label=1111, group=1111];
    edge [style=dotted, color=black];
    node0100 -> 0100; node1001 -> 1001; node1101 -> 1101; node1111 -> 1111;
    edge [color="/set16/5", splines=curved, style=solid];
    node1100 -> 0100 [dir=forward]; node1100 -> 1001 [dir=forward]; node1100 -> 1101 [dir=forward]; node1100 -> 1111 [dir=forward];
  }
  subgraph broadcast2 {
    node [shape=box];
    0101 [label=0101, group=0101]; 0110 [label=0110, group=0110]; 0001 [label=0001, group=0001]; 1000 [label=1000, group=1000]; 1010 [label=1010, group=1010]; 1110 [label=1110, group=1110];
    edge [style=dotted, color=black];
    node0101 -> 0101; node0110 -> 0110; node0001 -> 0001; node1000 -> 1000; node1010 -> 1010; node1110 -> 1110;
    edge [color="/set16/5", splines=curved, style=solid];
    0100 -> 0001 [dir=forward]; 0100 -> 0101 [dir=forward]; 0100 -> 0110 [dir=forward]; 1001 -> 1000 [dir=forward]; 1001 -> 1010 [dir=forward]; 1111 -> 1110 [dir=forward];
  }
}
```

The Figure above shows an example execution of the broadcast algorithm with node 1100 as the initiator. It starts the procedure by sending the message to a randomly selected node in each of its k-buckets. Each contacted node is then responsible for the subtree covered by the k-bucket it was selected from. Each message also carries the height of the subtree the receiving node is responsible for. Each of those nodes then recursively forwards the message to a randomly selected node in each of its k-buckets that falls within the subtree it is responsible for, as indicated by the height field it received. When forwarding the message, each node lowers the height field, limiting the responsibility of the receiving node according to its distance from the sender.

Broadcast Algorithm 1 (_input: message text, height_)

```ruby
(0..height - 1).each do |i|
  next if buckets[i].empty?
  node = buckets[i].sample # pick one random contact from bucket i
  send(node, message, i)   # i bounds the receiver's own broadcast scope
end
```

The algorithm above implements the broadcast. An initiator node sets the height to n, the number of k-buckets (so that the loop touches every bucket), and chooses the message text at will; it then sends the message to a randomly picked node from each k-bucket i. The buckets are sorted so that buckets[0] can contain at most one node, the closest possible to the local node (their node IDs differ only in the least significant bit); in general, the k-bucket with index i contains nodes at a distance between $2^i$ and $2^{i+1}$. When sending a message to a node taken from the i-th bucket, the sender sets the height in the message to i, so as to limit the broadcasting scope of the receiver. The height thus marks the part of the tree already covered by other nodes, i.e. the height of the subtree the receiver is still responsible for.

The problem with this algorithm is that some nodes are responsible for large subtrees. If one of these nodes fails to deliver the message, a whole large subtree will not receive it. Consider the first round of broadcasting, when the initiator sends the message to one node in each bucket. The node picked from the farthest bucket is solely responsible for forwarding the message within its subtree, which covers addresses at a distance between $2^{n-1}$ and $2^n$, that is, half of the tree. If that node fails to deliver the message (because of any of the possible failure types, e.g. crash or Byzantine behavior), then, assuming a perfect binary tree, half of the nodes in the network will not receive it.
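To make the recursion explicit, here is a minimal sketch of the receiving side under the description above; the names `on_broadcast` and `deliver_locally` are hypothetical, not Concord RPCs. The handler simply re-runs Algorithm 1 with the height it was given.

```ruby
# Illustrative sketch of the receiving side of Algorithm 1; on_broadcast and
# deliver_locally are hypothetical names, not Concord RPCs.
def on_broadcast(message, height)
  deliver_locally(message) # hand the message to the application layer
  # Forward only within the subtree this node is responsible for, i.e. run
  # Algorithm 1 again with the height received in the message.
  (0..height - 1).each do |i|
    next if buckets[i].empty?
    node = buckets[i].sample
    send(node, message, i) # the delegate's scope shrinks again, to bucket i
  end
end
```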
In (Czirkos and Hosszú 2013) the _reliability_ of the broadcasting algorithm is defined as the ratio between the nodes that receive the message and the total number of nodes in the network: $m = \frac{N_r}{N}$, where $N$ is the total number of nodes and $N_r$ is the number of nodes that receive the message. The authors set up a probabilistic framework to evaluate the reliability, starting from the probability of _successful delegate selection_, that is, the probability that the broadcast of a subtree is successfully handed off to a delegate as in Algorithm 1. They define a delegate selection to be successful if the packet does not run into any of three cases: (i) it is lost, (ii) the receiver is Byzantine, or (iii) the routing table entry for the receiver is stale. The successful delegation probability is $P = 1 - P_h$, where $P_h$ denotes the probability of failure. The authors show that for a balanced tree of height b, the reliability of Algorithm 1 is given by $m = \left( \frac{1 + P}{2} \right)^b$.

To improve the reliability of the broadcasting algorithm, the authors of the same paper propose an upgraded version of Algorithm 1 that uses _replication_. In Algorithm 2 the message is sent to several nodes in each k-bucket: there is no longer a single point of failure per subtree, because the responsibility is replicated among $k_b$ nodes. To maintain efficiency it is required that $k_b \le k$, so that no extra find_node requests are needed. Moreover, because a node can now receive the same message more than once, each message has to be tagged with a unique identifier.

Broadcast Algorithm 2 (_input: identifier, message text, height_)

```ruby
return if seen_messages.include?(identifier) # drop duplicates
seen_messages << identifier

(0..height - 1).each do |i|
  next if buckets[i].empty?
  nodes = buckets[i].sample(KB) # KB is the replication factor k_b
  nodes.each { |node| send(node, identifier, message, i) }
end
```

By selecting more nodes from each k-bucket, the probability of failing to broadcast the message into a subtree becomes $P_h^{k_b}$, where $P_h$ is the probability that a single delegation fails (the message is lost, and with it the broadcast in the subtree the receiving node was responsible for). For example, in a network with fairly high packet loss, say $P_h = 10\%$, a replication factor of $k_b = 2$ reduces the probability of losing a subtree to $P_h^2 = 1\%$, so the probability of successful delegation increases from $P = 90\%$ to $P = 99\%$. By substituting $P_h^{k_b}$ for $P_h$ in the reliability formula and solving for $k_b$, we obtain an expression for the replication factor as a function of the desired reliability m, the tree height b and the probability of packet loss $P_h$:

$$k_b = \left\lceil \frac{\ln (2 (1 - \sqrt[b]{m}))}{\ln P_h} \right\rceil$$
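As a quick numerical check of the formula above, one can compute $k_b$ directly; the helper name `replication_factor` and the parameter values are made up for illustration, not measurements of the Concord network.

```ruby
# Illustrative computation of the replication factor from the formula above.
# The parameter values are example inputs, not Concord measurements.
def replication_factor(m, b, p_h)
  (Math.log(2 * (1 - m**(1.0 / b))) / Math.log(p_h)).ceil
end

replication_factor(0.99, 160, 0.10) # => 4  (m = 99%, 160-bit IDs, 10% loss)
replication_factor(0.99, 160, 0.01) # => 2
```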
Voting Protocol

The goal of this protocol is to allow anyone to create a poll and collect opinions on it. Each node is itself a voter and everyone is free to participate in the voting process. Each voter is also responsible for verifying and counting votes.

Starting a Poll

Whoever wants to start a poll has to send a start_poll RPC. This procedure contains the poll's question and its hash, computed from the question text and a nonce. This hash is then used to build the poll's Merkle tree (details to come in a later section).

Each node that wants to take part in the poll has to find a key pair such that the hash of the public key is numerically less than the start_poll _genesis_ hash (the one created with the aforementioned RPC). The process is similar to the nodeId generation in S/Kademlia and is used to prevent the generation of huge numbers of valid key pairs for a single poll. The start_poll message also contains a TTL (time-to-live) indicating the maximum time allowed to find a valid key pair for the poll. The TTL and the crypto puzzle difficulty are the measures that limit a node's ability to generate a high number of valid key pairs.

```
title "Starting a poll"
actor Alice
actor Bob
actor Charlie
actor Dave
Alice -> Bob: start_poll
Alice -> Charlie: start_poll
Bob -> Alice: ack_poll
Charlie -> Alice: nack_poll
Bob -> Charlie: start_poll
Charlie -> Bob: nack_poll
Bob -> Dave: start_poll
```

The node that starts the poll is responsible for sending the start_poll RPC to other nodes. To do so it sends the RPC to all the nodes it knows about, and then to the new nodes it discovers by recursively calling find_node on them. Another strategy / extension could be to search for nodes at random: generate a random nodeId, start a recursive lookup for it, and send start_poll requests along the way. Nodes that decide to participate in the poll send an ack_poll back to the node from which they learned about the poll. This RPC does not carry any node contacts and signals that the sender takes responsibility for forwarding the start_poll message to its neighbors (either all the nodes it knows about or only its k closest neighbors; both are options). Nodes that do not want to participate in the poll reply with a nack_poll RPC containing their k closest nodes. After the TTL expires, each node participating in the poll holds a key pair that can be used to vote in the poll.

References

Baumgart, Ingmar, and Sebastian Mies. 2007. “S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing.” In _Parallel and Distributed Systems, 2007 International Conference on_, 2:1–8. IEEE.

Czirkos, Zoltán, and Gábor Hosszú. 2010. “Enhancing the Kademlia P2P Network.” _Periodica Polytechnica, Electrical Engineering_ 54 (3–4). Budapest University of Technology and Economics: 87–92.

Czirkos, Zoltán, and Gábor Hosszú. 2013. “Solution for the Broadcasting in the Kademlia Peer-to-Peer Overlay.” _Computer Networks_ 57 (8). Elsevier: 1853–62.

El-Ansary, Sameh, Luc Onana Alima, Per Brand, and Seif Haridi. 2003. “Efficient Broadcast in Structured P2P Networks.” In _International Workshop on Peer-to-Peer Systems_, 304–14. Springer.

Maymounkov, Petar, and David Mazières. 2002. “Kademlia: A Peer-to-Peer Information System Based on the XOR Metric.” In _International Workshop on Peer-to-Peer Systems_, 53–65. Springer.