[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[lsd0003] branch master updated: work on section 4/5.1
From: |
gnunet |
Subject: |
[lsd0003] branch master updated: work on section 4/5.1 |
Date: |
Thu, 14 Jan 2021 12:06:25 +0100 |
This is an automated email from the git hooks/post-receive script.
grothoff pushed a commit to branch master
in repository lsd0003.
The following commit(s) were added to refs/heads/master by this push:
new cab45ba work on section 4/5.1
cab45ba is described below
commit cab45ba74dc984053f893d8caf5e406f643bed93
Author: Christian Grothoff <christian@grothoff.org>
AuthorDate: Thu Jan 14 12:05:36 2021 +0100
work on section 4/5.1
---
draft-summermatter-set-union.xml | 195 +++++++++++++++++++++++++--------------
1 file changed, 125 insertions(+), 70 deletions(-)
diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
index f19bb8f..b108c79 100644
--- a/draft-summermatter-set-union.xml
+++ b/draft-summermatter-set-union.xml
@@ -180,12 +180,13 @@
<name>Bloom Filters</name>
<t>
A Bloom filter (BF) is a space-efficient datastructure to
test if am element is part of a set of elements.
+ Elements are identified by an element ID.
Since a BF is a probabilistic datastructure, it is
possible to have false-positives: when asked
if an element is in the set, the answer from a BF is
either "no" or "maybe".
</t>
<t>
A BF consists of L buckets. Every bucket is a binary value
that can be either 0 or 1. All buckets are initialized
- to 0. A mapping function M is used to map each element of
the set to a subset of k buckets. M is non-injective
+ to 0. A mapping function M is used to map each the ID of
each element from the set to a subset of k buckets. M is non-injective
and can thus map the same element multiple times to the
same bucket.
The type of the mapping function can thus be described by
the following mathematical notation:
</t>
@@ -346,7 +347,9 @@
+-------------+-------------+-------------+-------------+-------
count | COUNTER | COUNTER | COUNTER | COUNTER | C...
+-------------+-------------+-------------+-------------+------
- value | XHASH | XHASH | XHASH | XHASH | X...
+ idSum | IDSUM | IDSUM | IDSUM | IDSUM | I...
+ +-------------+-------------+-------------+-------------+------
+hashSum | HASHSUM | HASHSUM | HASHSUM | HASHSUM | H..
+-------------+-------------+-------------+-------------+-------
]]></artwork>
</figure>
@@ -355,7 +358,7 @@
<section anchor="ibf_operations" numbered="true" toc="default">
<name>Operations</name>
<t>
- When an IBF is created, all counters and XHASH values of
+ When an IBF is created, all counters and IDSUM and HASHSUM
values of
all buckets are initialized to zero.
</t>
<section anchor="ibv_operations_insert" numbered="true"
toc="default">
@@ -363,12 +366,14 @@
<t>
To add an element to a IBF, the element is mapped to a
subset of k buckets using
the mapping function M as described in the <xref
target="bf" format="title" /> section introducing
- BFs. For the buckets selected by the mapping function,
the counter is increased by one and the value
- field is set to the XOR of the hash of the element and
the previously stored XHASH.
+ BFs. For the buckets selected by the mapping function,
the counter is increased by one and the
+ IDSUM field is set to the XOR of the element ID and the
previously stored IDSUM. Furthermore,
+ the HASHSUM is set to the XOR of the hash of the element
ID and the previously stored HASHSUM.
</t>
<t>
- In the following example the insert operation for the
element [0101] with the hash 0x4242 and
- the element [1100] with the hash 0x0101 is
demonstrated.
+ In the following example, the insert operation is
illustrated using an element with the
+ ID 0x01020304 and a hash of 0x4242, and a second element
with the ID 0x03040506 and
+ a hash of 0x0101.
</t>
<t>Empty IBF:</t>
<figure anchor="figure_ibf_insert_0">
@@ -377,7 +382,9 @@
+-------------+-------------+-------------+-------------+
count | 0 | 0 | 0 | 0 |
+-------------+-------------+-------------+-------------+
- value | 0000 | 0000 | 0000 | 0000 |
+ idSum | 0000 | 0000 | 0000 | 0000 |
+ +-------------+-------------+-------------+-------------+
+hashSum | 0000 | 0000 | 0000 | 0000 |
+-------------+-------------+-------------+-------------+
]]></artwork>
</figure>
@@ -408,8 +415,9 @@
<name>Remove Element</name>
<t>
To remove an element from the IBF the element is again
mapped to a subset of the buckets using M.
- Then all the counters of the buckets selected by M are
reduced by one, and the XHASH is
- replaced by the XOR of the old XHASH and the hash of
the element being removed.
+ Then all the counters of the buckets selected by M are
reduced by one, the IDSUM is
+ replaced by the XOR of the old IDSUM and the ID of the
element being removed, and the
+ HASHSUM is similarly replaced with the XOR of the old
HASHSUM and the hash of the ID.
</t>
<t>
In the following example the remove operation for the
element [1100] with the hash 0x0101 is demonstrated.
@@ -432,7 +440,7 @@
+-------------+-------------+-------------+-------------+
count | 0 | 1 | 0 | 1 |
+-------------+-------------+-------------+-------------+
- value | 0000 | 4242 | 0000 | 4242 |
+ value | 0000 | 0x4242 | 0000 | 0x4242 |
+-------------+-------------+-------------+-------------+
]]></artwork>
</figure>
@@ -444,12 +452,13 @@
present in the set: a bucket with a counter of zero can
be the result of one
element being added and a different element (mapped to
the same bucket) being removed.
To check that an element is not present requires a
counter of zero and an
- XHASH of zero --- and some assurance that there was no
hash collision. Thus,
+ IDSUM and HASHSUM of zero --- and some assurance that
there was no collision due
+ to the limited number of bits in IDSUM and HASHSUM.
Thus,
IBFs are not suitable to replace BFs or IBFs.
</t>
<t>
Buckets in an IBF with a counter of 1 or -1 are crucial
for decoding an IBF, as
- they might represent only a single element, with the
XHASH being the hash of that element.
+ they might represent only a single element, with the
IDSUM being the ID of that element.
Following Eppstein (CITE), we will call buckets that
only represent a single
element pure buckets.
Note that due to the possibility of multiple insertion
and removal operations
@@ -463,18 +472,46 @@
<section anchor="ibf_operations_decode" numbered="true"
toc="default">
<name>Decode IBF</name>
<t>
- To decode an IBF there are pure buckets needed, pure
buckets are buckets which contain only
- one element - the counter is set to 1 or -1. If there
is no pure bucket in the IBF, its not possible
- to decode the IBF. In this case a new IBF has to be
created, the new IBF needs to be bigger or a different mapping function
- should be used.
- If there are pure buckets its possible to decode the
IBF by removing elements as described
- in the section <xref target="ibf_operations_remove"
format="title" /> from the pure buckets from the filter creating new pure
buckets
- until the IBF is empty and all elements have been
decoded. Its possible the an IBF only partly decodes, in this case a new IBF
has to be
- created.
+ Decoding an IBF yields the HASH of an element from the
IBF, or failure.
+ </t>
+ <t>
+ A decode operation requires a pure bucket, that is a
bucket to which M
+ only mapped a single element, to succeed. Thus, if
there is no bucket with
+ a counter of 1 or -1, decoding fails. However, as a
counter of 1 or -1 is
+ not a guarantee that the bucket is pure, there is also
a chance that the
+ decoder returns an IDSUM value that is actually the
XOR of several IDSUMs.
+ This is primarily detected by checking that the
HASHSUM is the hash of the IDSUM.
+ Only if the HASHSUM also matches, the bucket could be
pure. Additionally,
+ one should check that the IDSUM value actually would
be mapped by M to
+ the respective bucket. If not, there was a hash
collision.
+ </t>
+ <t>
+ The very rare case that after all these checks a
bucket is still
+ falsely identified as pure must be detected (say by
determining that
+ extracted element IDs do not match any actual
elements), and addressed
+ at a higher level in the protocol. As these failures
are probabilistic
+ and depend on element IDs and the IBF construction,
they can typically
+ be avoided by retrying with different parameters, such
as a different
+ way to assign element IDs to elements, using a larger
value for L, or
+ a different mapping function M.
+ A more common scenario (especially if L was too small)
is that
+ IBF decoding fails because there is no pure bucket. In
this case, the
+ higher-level protocol also should retry using
different parameters.
</t>
<t>
- In the following example the successful decoding of an
IBF containing the two elements previously
- added.
+ Suppose the IBF contains a pure bucket. In this case,
the IDSUM in the
+ bucket identifies a single element. Furthermore, it
is then possible
+ to remove that element from the IBF (by inserting it
if the counter
+ was negative, and by removing it if the counter was
positive). This
+ is likely to cause other buckets to become pure,
allowing further
+ elements to be decoded. Eventually, decoding should
succeed with
+ all counters and IDSUM and HASHSUM values reaching
zero. However, it is also
+ possible that an IBF only partly decodes and then
decoding fails after
+ yielding some elements.
+ </t>
+ <t>
+ In the following example the successful decoding of an
IBF containing
+ the two elements previously added in our running
example.
</t>
<t>
IBF with the two encoded elements:
@@ -530,7 +567,7 @@
To calculate the IBF representing this set difference,
both IBFs must have the same
length L, the same number of buckets per element k and
use the same map M. Given this,
- one can compute the IBF representing the set
difference by taking the XOR of the XHASH values
+ one can compute the IBF representing the set
difference by taking the XOR of the IDSUM and HASHSUM values
of the respective buckets and subtracting the
respective counters. Care should be taken
to handle overflows and underflows by setting the
counter to "infinity" as necessary.
The result is a new IBF with the same number of
buckets representing the set difference.
@@ -538,16 +575,9 @@
<t>
This new IBF can be decoded as described in section
<xref target="ibf_operations_decode" format="counter" />.
The new IBF can have two types of pure buckets with
counter set to 1 or -1. If the counter is set to 1
- the element is missing in the secondary set and if the
counter is set to -1 the element is missing in
+ the element is missing in the secondary set, and if
the counter is set to -1 the element is missing in
the primary set.
</t>
- <t>
- Through this addition/subtraction its possible that a
counter is 1 or -1, but is not pure.
- This case its easy to detect by checking if an element
exist in either set, if
- the element does not exist its a falsely pure bucket
in this case continue test the
- next pure bucket until one is found that is truly
pure. If no truly pure buckets are found the IBF
- fails to decode.
- </t>
<t>
To demonstrate the set difference operation we compare
IBF-A with IBF-B and generate as described
IBF-AB
@@ -609,24 +639,35 @@
counter reaches "infinity" (-128).
</t>
<t>
- For the "HASH VALUE", we always use a 32-bit
+ For the "IDSUM", we always use a 64-bit representation.
+ The IDSUM value must have sufficient entropy for the
+ mapping function M to yield reasonably random buckets
+ even for very large values of L. With a 32 bit
+ value the chance that multiple elements may be mapped
+ to the same ID would be quite high, even for moderately
+ large sets. Using more than 64 bits would at best make
+ sense for very large sets, but then it is likely always
+ better to simply afford additional round trips to handle
+ the occasional collision. 64 bits are also a reasonable
+ size for many CPU architectures.
+ </t>
+ <t>
+ For the "HASHSUM", we always use a 32-bit
representation. Here, it is mostly important to
avoid collisions, where different elements are
- mapped to the same hash. The protocol is designed
+ mapped to the same hash. However, we note that
+ by design only a few elements (certainly less than
+ 127) should ever be mapped
+ to the same bucket, so a small number of bits
+ should suffice. Furthermore, our protocol is designed
to handle occasional collisions, so while with
- 32-bits there remains a significant chance of
- accidental collisions, at 32-bits the chance is
+ 32-bits there remains a chance of
+ accidental collisions, at 32 bit the chance is
generally believed to be sufficiently small enough
for the protocol to handle those cases efficiently
for a wide range of use-cases. Smaller hash
- values would safe bandwidth, but drastically
- increase the chance of collisions even for
- moderately-sized sets. Larger hash values would
- reduce the chance of collisions for very large
- sets, alas with very large sets the bandwidth used
- by the protocol is also typically larger and thus
- additional round-trips to address a few collisions
- are unlikely to have a major impact. 32-bit is
+ values would safe bandwidth, but also drastically
+ increase the chance of collisions. 32 bits are
also again a reasonable size for many CPU
architectures.
</t>
@@ -638,21 +679,28 @@
<section anchor="se_description" numbered="true" toc="default">
<name>Description</name>
<t>
- Strata Estimators should help to estimate the difference
between two different set of elements.
- That is necessary to efficiently determinate the needed
size of an IBF.
+ Strata Estimators help estimate the size of the set
difference between two set of elements.
+ This is necessary to efficiently determinate the tuning
parameters for an IBF, in particular
+ a good value for L.
</t>
<t>
- Basically an Strata Estimator (SE) is an IBF with the
difference that only a certain share of
- the elements is added to to the SE. The size of the share
is described by the metric stratum which
- means "Stratum n" contains 1/(2^n) (stratum factor) of all
elements.
+ Basically a Strata Estimator (SE) is a series of IBFs
(with a rather small value of L)
+ in which increasingly large subsets of the full set
+ of elements are added to each IBF. For the n-th IBF, the
function selecting the
+ subset of elements should sample to select
(probabilistically) 1/(2^n) of all
+ elements. This can be done by counting the number of
trailing bits set to "1"
+ in an element ID, and then inserting the element into the
IBF identified by that
+ counter. As a result, all elements will be mapped to one
IBF, with the n-th
+ IBF being statistically expected to contain 1/(2^n)
elements.
</t>
<t>
- The trick is to decode the SE with the biggest stratum
first and calculate the difference of
- the set if the SE decodes successfully decode the next
smaller strata estimator repeat
- this until the decoding of the SE fails or Stratum 0 is
reached. Then its possible to estimate
- how big the difference between two sets is by adding the
number of extracted hashes up (C) and scale it
- by the expected number of elements (E) in the remaining
unencoded IBF's (C*E=[estimated count of objects]).
- If non of the SE decoded choose a smaller stratum or try a
other mapping function.
+ Given two SEs, the set size difference can be estimated by
trying to decode all of the
+ IBFs. Given that L was set to a rather small value, IBFs
containing large strata
+ will likely fail to decode. For those IBFs that failed to
decode, one simply
+ extrapolates the number of elements by scaling the numbers
obtained from the
+ other IBFs that did decode. If none of the IBFs of the SE
decoded (which given
+ a reasonable choice of L should be highly unlikely), one
can retry using a different
+ mapping function M.
</t>
</section>
</section>
@@ -661,28 +709,35 @@
<section anchor="modeofoperation" numbered="true" toc="default">
<name>Mode of operation</name>
<t>
- The set union protocol uses the above discussed topics IBF and
Strata Estimators.
+ The set union protocol uses IBFs and SEs as primitives.
Depending on the state of the two sets there are different
strategies or operation modes how to efficiently
- determinate missing elements in two sets of two peers.
+ determinate missing elements between the two sets.
</t>
<t>
- The simples mode is the full sync mode. The idea is that if
the difference between the sets of the two
- peers exceeds a certain threshold the overhead of determinate
which elements are different outweighs
- the overhead of sending the complete set. In this case its
more efficient to just exchange the full sets.
+ The simplest mode is the "full" synchronization mode. The idea
is that if the difference between the sets of the two
+ peers exceeds a certain threshold, the overhead to determine
which elements are different outweighs
+ the overhead of sending the complete set. In this case, the
most efficient method can be to just
+ exchange the full sets.
############# IMAGE ##################
</t>
+ <t>
+ The second possibility is that the difference of the sets is
small compared to the set size.
+ Here, an efficient "delta" synchronization mode is more
efficient. Given these two possibilities,
+ the first steps of the protocol are used to determine which
mode should be used.
+ </t>
<t>
- In the following section the operation mode independent flow
in the beginning is described in detail:
+ Thus, the set synchronization protocol always begins with the
following operation mode independent steps.
</t>
<t>
- The initiating peer is initially in the <strong>Initiating
Connection</strong> state and the receiving peer in the <strong>Expecting
Connection</strong>
+ The initiating peer begins in the <strong>Initiating
Connection</strong> state and the receiving peer in the <strong>Expecting
Connection</strong>
state. The first step for the initiating peer in the protocol
is to send an <em><xref target="messages_operation_request" format="title"
/></em> to the receiving peer and
- change into <strong>Expect SE</strong> state. After receiving
the <em><xref target="messages_operation_request" format="title" /></em> the
receiving peer changes in <strong>Expecting IBF</strong> state and answers with
the
- <em><xref target="messages_se" format="title" /></em> message.
When the initiating peer receives the Strata Estimator the initiating peer
decides
- with some heuristics which operation mode is best fitted for
the the estimated set difference and the environment.
+ transition into the <strong>Expect SE</strong> state. After
receiving the <em><xref target="messages_operation_request" format="title"
/></em> the receiving peer
+ transitions to the <strong>Expecting IBF</strong> state and
answers with the
+ <em><xref target="messages_se" format="title" /></em> message.
When the initiating peer receives the <em><xref target="messages_se"
format="title" /></em> message,
+ it decides with some heuristics which operation mode is likely
more suitable for the estimated set difference and the application-provided
latency-bandwidth tradeoff.
The detailed tradeoff between the <xref
target="modeofoperation_full-sync" format="title" /> and the <xref
target="modeofoperation_individual-elements" format="title" />
is explained in the section <xref
target="modeofoperation_combined-mode" format="title" />.
</t>
@@ -690,7 +745,7 @@
<name>Full Synchronisation Mode</name>
<t>
- When the initiating peer decide to use the Full
Synchronisation Mode and the set of the initiating peer is bigger than the set
of the receiving peer, the initiating
+ When the initiating peer decides to use the full
synchronisation mode and the set of the initiating peer is bigger than the set
of the receiving peer, the initiating
peer sends a <em><xref target="messages_request_full"
format="title" /></em> message and change from <strong>Expecting SE</strong> to
the <strong>Full Receiving</strong> State.
In all other cases the initiating peer sends all set
elements to the other peer followed by the <em><xref
target="messages_full_done" format="title" /></em> message and
changes into <strong>Full Sending</strong> state.
@@ -720,9 +775,9 @@
</dl>
</section>
<section anchor="modeofoperation_individual-elements"
numbered="true" toc="default">
- <name>Individual Element Synchronisation Mode</name>
+ <name>Delta Synchronisation Mode</name>
<t>
- When the initiating peer in <strong>Expected SE</strong>
state decide to use the Individual Element Synchronisation Mode, the peer
+ When the initiating peer in <strong>Expected SE</strong>
state decide to use the delta synchronisation mode, the peer
sends a <em><xref target="messages_ibf" format="title"
/></em> to the receiving peer and changes into the <strong>Passive
Decoding</strong> state.
</t>
<t>
@@ -736,7 +791,7 @@
peer and the peer that is in <strong>Passive
Decoding</strong> and <strong>Finish Waiting</strong> state is called the
passive peer.
</t>
<t>
- ############# Statemaschine Individual Element
Synchronisation Mode ##################
+ ############# Statemaschine Delta Synchronisation Mode
##################
</t>
<t><strong>The behavior of the participants the different
state is described below:</strong></t>
<dl>
@@ -879,7 +934,7 @@
############# NOTE ############
To ensure that ...... the difference is multiplied by 1.5
if there are more than 200 elements differences between the sets (WHY? line
1398).
The Full Synchronisation Mode is used if the flag to force
full sync is set, the estimated difference between the two sets is bigger
- than 25% or the set size of the receiving peer is zero.
Otherwise the Individual Element Synchronisation Mode is used.
+ than 25% or the set size of the receiving peer is zero.
Otherwise the delta synchronisation mode is used.
############# NOTE END############
</t>
</section>
--
To stop receiving notification emails like this one, please contact
gnunet@gnunet.org.
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [lsd0003] branch master updated: work on section 4/5.1,
gnunet <=