[lsd0003] branch master updated: work on section 4/5.1

gnunet-svn
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[lsd0003] branch master updated: work on section 4/5.1

From:	gnunet
Subject:	[lsd0003] branch master updated: work on section 4/5.1
Date:	Thu, 14 Jan 2021 12:06:25 +0100
This is an automated email from the git hooks/post-receive script.

grothoff pushed a commit to branch master
in repository lsd0003.

The following commit(s) were added to refs/heads/master by this push:
     new cab45ba  work on section 4/5.1
cab45ba is described below

commit cab45ba74dc984053f893d8caf5e406f643bed93
Author: Christian Grothoff <christian@grothoff.org>
AuthorDate: Thu Jan 14 12:05:36 2021 +0100

    work on section 4/5.1
---
 draft-summermatter-set-union.xml | 195 +++++++++++++++++++++++++--------------
 1 file changed, 125 insertions(+), 70 deletions(-)

diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
index f19bb8f..b108c79 100644
--- a/draft-summermatter-set-union.xml
+++ b/draft-summermatter-set-union.xml
@@ -180,12 +180,13 @@
                 <name>Bloom Filters</name>
                 <t>
                     A Bloom filter (BF) is a space-efficient datastructure to 
test if am element is part of a set of elements.
+                    Elements are identified by an element ID.
                     Since a BF is a probabilistic datastructure, it is 
possible to have false-positives: when asked
                     if an element is in the set, the answer from a BF is 
either "no" or "maybe".
                 </t>
                 <t>
                     A BF consists of L buckets. Every bucket is a binary value 
that can be either 0 or 1. All buckets are initialized
-                    to 0.  A mapping function M is used to map each element of 
the set to a subset of k buckets.  M is non-injective
+                    to 0.  A mapping function M is used to map each the ID of 
each element from the set to a subset of k buckets.  M is non-injective
                     and can thus map the same element multiple times to the 
same bucket.
                     The type of the mapping function can thus be described by 
the following mathematical notation:
                 </t>
@@ -346,7 +347,9 @@
         +-------------+-------------+-------------+-------------+-------
   count |   COUNTER   |   COUNTER   |   COUNTER   |   COUNTER   |  C...
         +-------------+-------------+-------------+-------------+------
-  value |    XHASH    |    XHASH    |    XHASH    |     XHASH   |  X...
+  idSum |    IDSUM    |    IDSUM    |    IDSUM    |     IDSUM   |  I...
+        +-------------+-------------+-------------+-------------+------
+hashSum |   HASHSUM   |   HASHSUM   |   HASHSUM   |    HASHSUM  |  H..
         +-------------+-------------+-------------+-------------+-------
                  ]]></artwork>
                     </figure>
@@ -355,7 +358,7 @@
             <section anchor="ibf_operations" numbered="true" toc="default">
               <name>Operations</name>
               <t>
-                When an IBF is created, all counters and XHASH values of
+                When an IBF is created, all counters and IDSUM and HASHSUM 
values of
                 all buckets are initialized to zero.
               </t>
                 <section anchor="ibv_operations_insert" numbered="true" 
toc="default">
@@ -363,12 +366,14 @@
                     <t>
                       To add an element to a IBF, the element is mapped to a 
subset of k buckets using
                       the mapping function M as described in the <xref 
target="bf" format="title" /> section introducing
-                      BFs. For the buckets selected by the mapping function, 
the counter is increased by one and the value
-                      field is set to the XOR of the hash of the element and 
the previously stored XHASH.
+                      BFs. For the buckets selected by the mapping function, 
the counter is increased by one and the
+                      IDSUM field is set to the XOR of the element ID and the 
previously stored IDSUM. Furthermore,
+                      the HASHSUM is set to the XOR of the hash of the element 
ID and the previously stored HASHSUM.
                     </t>
                     <t>
-                        In the following example the insert operation for the 
element [0101] with the hash 0x4242 and
-                        the element [1100] with the hash 0x0101 is 
demonstrated.
+                      In the following example, the insert operation is 
illustrated using an element with the
+                      ID 0x01020304 and a hash of 0x4242, and a second element 
with the ID 0x03040506 and
+                      a hash of 0x0101.
                     </t>
                     <t>Empty IBF:</t>
                     <figure anchor="figure_ibf_insert_0">
@@ -377,7 +382,9 @@
         +-------------+-------------+-------------+-------------+
   count |      0      |      0      |      0      |      0      |
         +-------------+-------------+-------------+-------------+
-  value |     0000    |     0000    |     0000    |     0000    |
+  idSum |     0000    |     0000    |     0000    |     0000    |
+        +-------------+-------------+-------------+-------------+
+hashSum |     0000    |     0000    |     0000    |     0000    |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                     </figure>
@@ -408,8 +415,9 @@
                     <name>Remove Element</name>
                     <t>
                         To remove an element from the IBF the element is again 
mapped to a subset of the buckets using M.
-                        Then all the counters of the buckets selected by M are 
reduced by one, and the XHASH is
-                        replaced by the XOR of the old XHASH and the hash of 
the element being removed.
+                        Then all the counters of the buckets selected by M are 
reduced by one, the IDSUM is
+                        replaced by the XOR of the old IDSUM and the ID of the 
element being removed, and the
+                        HASHSUM is similarly replaced with the XOR of the old 
HASHSUM and the hash of the ID.
                     </t>
                     <t>
                         In the following example the remove operation for the 
element [1100] with the hash 0x0101 is demonstrated.
@@ -432,7 +440,7 @@
         +-------------+-------------+-------------+-------------+
   count |      0      |      1      |      0      |      1      |
         +-------------+-------------+-------------+-------------+
-  value |     0000    |     4242    |     0000    |     4242    |
+  value |     0000    |   0x4242    |     0000    |   0x4242    |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                     </figure>
@@ -444,12 +452,13 @@
                       present in the set: a bucket with a counter of zero can 
be the result of one
                       element being added and a different element (mapped to 
the same bucket) being removed.
                       To check that an element is not present requires a 
counter of zero and an
-                      XHASH of zero --- and some assurance that there was no 
hash collision.  Thus,
+                      IDSUM and HASHSUM of zero --- and some assurance that 
there was no collision due
+                      to the limited number of bits in IDSUM and HASHSUM.  
Thus,
                       IBFs are not suitable to replace BFs or IBFs.
                     </t>
                     <t>
                       Buckets in an IBF with a counter of 1 or -1 are crucial 
for decoding an IBF, as
-                      they might represent only a single element, with the 
XHASH being the hash of that element.
+                      they might represent only a single element, with the 
IDSUM being the ID of that element.
                       Following Eppstein (CITE), we will call buckets that 
only represent a single
                       element pure buckets.
                       Note that due to the possibility of multiple insertion 
and removal operations
@@ -463,18 +472,46 @@
                 <section anchor="ibf_operations_decode" numbered="true" 
toc="default">
                     <name>Decode IBF</name>
                     <t>
-                        To decode an IBF there are pure buckets needed, pure 
buckets are buckets which contain only
-                        one element - the counter is set to 1 or -1. If there 
is no pure bucket in the IBF, its not possible
-                        to decode the IBF. In this case a new IBF has to be 
created, the new IBF needs to be bigger or a different mapping function
-                        should be used.
-                        If there are pure buckets its possible to decode the 
IBF by removing elements as described
-                        in the section <xref target="ibf_operations_remove" 
format="title" /> from the pure buckets from the filter creating new pure 
buckets
-                        until the IBF is empty and all elements have been 
decoded. Its possible the an IBF only partly decodes, in this case a new IBF 
has to be
-                        created.
+                      Decoding an IBF yields the HASH of an element from the 
IBF, or failure.
+                    </t>
+                    <t>
+                        A decode operation requires a pure bucket, that is a 
bucket to which M
+                        only mapped a single element, to succeed.  Thus, if 
there is no bucket with
+                        a counter of 1 or -1, decoding fails. However, as a 
counter of 1 or -1 is
+                        not a guarantee that the bucket is pure, there is also 
a chance that the
+                        decoder returns an IDSUM value that is actually the 
XOR of several IDSUMs.
+                        This is primarily detected by checking that the 
HASHSUM is the hash of the IDSUM.
+                        Only if the HASHSUM also matches, the bucket could be 
pure.  Additionally,
+                        one should check that the IDSUM value actually would 
be mapped by M to
+                        the respective bucket. If not, there was a hash 
collision.
+                    </t>
+                    <t>
+                        The very rare case that after all these checks a 
bucket is still
+                        falsely identified as pure must be detected (say by 
determining that
+                        extracted element IDs do not match any actual 
elements), and addressed
+                        at a higher level in the protocol. As these failures 
are probabilistic
+                        and depend on element IDs and the IBF construction, 
they can typically
+                        be avoided by retrying with different parameters, such 
as a different
+                        way to assign element IDs to elements, using a larger 
value for L, or
+                        a different mapping function M.
+                        A more common scenario (especially if L was too small) 
is that
+                        IBF decoding fails because there is no pure bucket. In 
this case, the
+                        higher-level protocol also should retry using 
different parameters.
                     </t>
                     <t>
-                        In the following example the successful decoding of an 
IBF containing the two elements previously
-                        added.
+                        Suppose the IBF contains a pure bucket. In this case, 
the IDSUM in the
+                        bucket identifies a single element.  Furthermore, it 
is then possible
+                        to remove that element from the IBF (by inserting it 
if the counter
+                        was negative, and by removing it if the counter was 
positive). This
+                        is likely to cause other buckets to become pure, 
allowing further
+                        elements to be decoded.  Eventually, decoding should 
succeed with
+                        all counters and IDSUM and HASHSUM values reaching 
zero. However, it is also
+                        possible that an IBF only partly decodes and then 
decoding fails after
+                        yielding some elements.
+                    </t>
+                    <t>
+                        In the following example the successful decoding of an 
IBF containing
+                        the two elements previously added in our running 
example.
                     </t>
                     <t>
                         IBF with the two encoded elements:
@@ -530,7 +567,7 @@
 
                         To calculate the IBF representing this set difference, 
both IBFs must have the same
                         length L, the same number of buckets per element k and 
use the same map M.  Given this,
-                        one can compute the IBF representing the set 
difference by taking the XOR of the XHASH values
+                        one can compute the IBF representing the set 
difference by taking the XOR of the IDSUM and HASHSUM values
                         of the respective buckets and subtracting the 
respective counters.  Care should be taken
                         to handle overflows and underflows by setting the 
counter to "infinity" as necessary.
                         The result is a new IBF with the same number of 
buckets representing the set difference.
@@ -538,16 +575,9 @@
                     <t>
                         This new IBF can be decoded as described in section 
<xref target="ibf_operations_decode" format="counter" />.
                         The new IBF can have two types of pure buckets with 
counter set to 1 or -1. If the counter is set to 1
-                        the element is missing in the secondary set and if the 
counter is set to -1 the element is missing in
+                        the element is missing in the secondary set, and if 
the counter is set to -1 the element is missing in
                         the primary set.
                     </t>
-                    <t>
-                        Through this addition/subtraction its possible that a 
counter is 1 or -1, but is not pure.
-                        This case its easy to detect by checking if an element 
exist in either set, if
-                        the element does not exist its a falsely pure bucket 
in this case continue test the
-                        next pure bucket until one is found that is truly 
pure. If no truly pure buckets are found the IBF
-                        fails to decode.
-                    </t>
                     <t>
                         To demonstrate the set difference operation we compare 
IBF-A with IBF-B and generate as described
                         IBF-AB
@@ -609,24 +639,35 @@
                     counter reaches "infinity" (-128).
                 </t>
                 <t>
-                    For the "HASH VALUE", we always use a 32-bit
+                    For the "IDSUM", we always use a 64-bit representation.
+                    The IDSUM value must have sufficient entropy for the
+                    mapping function M to yield reasonably random buckets
+                    even for very large values of L.  With a 32 bit
+                    value the chance that multiple elements may be mapped
+                    to the same ID would be quite high, even for moderately
+                    large sets.  Using more than 64 bits would at best make
+                    sense for very large sets, but then it is likely always
+                    better to simply afford additional round trips to handle
+                    the occasional collision. 64 bits are also a reasonable
+                    size for many CPU architectures.
+                </t>
+                <t>
+                    For the "HASHSUM", we always use a 32-bit
                     representation.  Here, it is mostly important to
                     avoid collisions, where different elements are
-                    mapped to the same hash.  The protocol is designed
+                    mapped to the same hash.  However, we note that
+                    by design only a few elements (certainly less than
+                    127) should ever be mapped
+                    to the same bucket, so a small number of bits
+                    should suffice.  Furthermore, our protocol is designed
                     to handle occasional collisions, so while with
-                    32-bits there remains a significant chance of
-                    accidental collisions, at 32-bits the chance is
+                    32-bits there remains a chance of
+                    accidental collisions, at 32 bit the chance is
                     generally believed to be sufficiently small enough
                     for the protocol to handle those cases efficiently
                     for a wide range of use-cases.  Smaller hash
-                    values would safe bandwidth, but drastically
-                    increase the chance of collisions even for
-                    moderately-sized sets.  Larger hash values would
-                    reduce the chance of collisions for very large
-                    sets, alas with very large sets the bandwidth used
-                    by the protocol is also typically larger and thus
-                    additional round-trips to address a few collisions
-                    are unlikely to have a major impact.  32-bit is
+                    values would safe bandwidth, but also drastically
+                    increase the chance of collisions. 32 bits are
                     also again a reasonable size for many CPU
                     architectures.
                 </t>
@@ -638,21 +679,28 @@
             <section anchor="se_description" numbered="true" toc="default">
                 <name>Description</name>
                 <t>
-                    Strata Estimators should help to estimate the difference 
between two different set of elements.
-                    That is necessary to efficiently determinate the needed 
size of an IBF.
+                    Strata Estimators help estimate the size of the set 
difference between two set of elements.
+                    This is necessary to efficiently determinate the tuning 
parameters for an IBF, in particular
+                    a good value for L.
                 </t>
                 <t>
-                    Basically an Strata Estimator (SE) is an IBF with the 
difference that only a certain share of
-                    the elements is added to to the SE. The size of the share 
is described by the metric stratum which
-                    means "Stratum n" contains 1/(2^n) (stratum factor) of all 
elements.
+                    Basically a Strata Estimator (SE) is a series of IBFs 
(with a rather small value of L)
+                    in which increasingly large subsets of the full set
+                    of elements are added to each IBF.  For the n-th IBF, the 
function selecting the
+                    subset of elements should sample to select 
(probabilistically) 1/(2^n) of all
+                    elements.  This can be done by counting the number of 
trailing bits set to "1"
+                    in an element ID, and then inserting the element into the 
IBF identified by that
+                    counter.  As a result, all elements will be mapped to one 
IBF, with the n-th
+                    IBF being statistically expected to contain 1/(2^n) 
elements.
                 </t>
                 <t>
-                    The trick is to decode the SE with the biggest stratum 
first and calculate the difference of
-                    the set if the SE decodes successfully decode the next 
smaller strata estimator repeat
-                    this until the decoding of the SE fails or Stratum 0 is 
reached. Then its possible to estimate
-                    how big the difference between two sets is by adding the 
number of extracted hashes up (C) and scale it
-                    by the expected number of elements (E) in the remaining 
unencoded IBF's (C*E=[estimated count of objects]).
-                    If non of the SE decoded choose a smaller stratum or try a 
other mapping function.
+                    Given two SEs, the set size difference can be estimated by 
trying to decode all of the
+                    IBFs.  Given that L was set to a rather small value, IBFs 
containing large strata
+                    will likely fail to decode.  For those IBFs that failed to 
decode, one simply
+                    extrapolates the number of elements by scaling the numbers 
obtained from the
+                    other IBFs that did decode.  If none of the IBFs of the SE 
decoded (which given
+                    a reasonable choice of L should be highly unlikely), one 
can retry using a different
+                    mapping function M.
                 </t>
             </section>
         </section>
@@ -661,28 +709,35 @@
         <section anchor="modeofoperation" numbered="true" toc="default">
             <name>Mode of operation</name>
             <t>
-                The set union protocol uses the above discussed topics IBF and 
Strata Estimators.
+                The set union protocol uses IBFs and SEs as primitives.
                 Depending on the state of the two sets there are different 
strategies or operation modes how to efficiently
-                determinate missing elements in two sets of two peers.
+                determinate missing elements between the two sets.
             </t>
 
             <t>
-                The simples mode is the full sync mode. The idea is that if 
the difference between the sets of the two
-                peers exceeds a certain threshold the overhead of determinate 
which elements are different outweighs
-                the overhead of sending the complete set. In this case its 
more efficient to just exchange the full sets.
+                The simplest mode is the "full" synchronization mode. The idea 
is that if the difference between the sets of the two
+                peers exceeds a certain threshold, the overhead to determine 
which elements are different outweighs
+                the overhead of sending the complete set. In this case, the 
most efficient method can be to just
+                exchange the full sets.
                 ############# IMAGE ##################
             </t>
+            <t>
+                The second possibility is that the difference of the sets is 
small compared to the set size.
+                Here, an efficient "delta" synchronization mode is more 
efficient. Given these two possibilities,
+                the first steps of the protocol are used to determine which 
mode should be used.
+            </t>
 
             <t>
-                In the following section the operation mode independent flow 
in the beginning is described in detail:
+                Thus, the set synchronization protocol always begins with the 
following operation mode independent steps.
             </t>
 
             <t>
-                The initiating peer is initially in the <strong>Initiating 
Connection</strong> state and the receiving peer in the <strong>Expecting 
Connection</strong>
+                The initiating peer begins in the <strong>Initiating 
Connection</strong> state and the receiving peer in the <strong>Expecting 
Connection</strong>
                 state. The first step for the initiating peer in the protocol 
is to send an <em><xref target="messages_operation_request" format="title" 
/></em> to the receiving peer and
-                change into <strong>Expect SE</strong> state. After receiving 
the <em><xref target="messages_operation_request" format="title" /></em> the 
receiving peer changes in <strong>Expecting IBF</strong> state and answers with 
the
-                <em><xref target="messages_se" format="title" /></em> message. 
When the initiating peer receives the Strata Estimator the initiating peer 
decides
-                with some heuristics which operation mode is best fitted for 
the the estimated set difference and the environment.
+                transition into the <strong>Expect SE</strong> state. After 
receiving the <em><xref target="messages_operation_request" format="title" 
/></em> the receiving peer
+                transitions to the <strong>Expecting IBF</strong> state and 
answers with the
+                <em><xref target="messages_se" format="title" /></em> message. 
When the initiating peer receives the <em><xref target="messages_se" 
format="title" /></em> message,
+                it decides with some heuristics which operation mode is likely 
more suitable for the estimated set difference and the application-provided 
latency-bandwidth tradeoff.
                 The detailed tradeoff between the <xref 
target="modeofoperation_full-sync" format="title" /> and the <xref 
target="modeofoperation_individual-elements" format="title" />
                 is explained in the section <xref 
target="modeofoperation_combined-mode" format="title" />.
             </t>
@@ -690,7 +745,7 @@
                 <name>Full Synchronisation Mode</name>
 
                 <t>
-                    When the initiating peer decide to use the Full 
Synchronisation Mode and the set of the initiating peer is bigger than the set 
of the receiving peer, the initiating
+                    When the initiating peer decides to use the full 
synchronisation mode and the set of the initiating peer is bigger than the set 
of the receiving peer, the initiating
                     peer sends a <em><xref target="messages_request_full" 
format="title" /></em> message and change from <strong>Expecting SE</strong> to 
the <strong>Full Receiving</strong> State.
                     In all other cases the initiating peer sends all set 
elements to the other peer followed by the <em><xref 
target="messages_full_done" format="title" /></em> message and
                     changes into <strong>Full Sending</strong> state.
@@ -720,9 +775,9 @@
                 </dl>
             </section>
             <section anchor="modeofoperation_individual-elements" 
numbered="true" toc="default">
-                <name>Individual Element Synchronisation Mode</name>
+                <name>Delta Synchronisation Mode</name>
                 <t>
-                    When the initiating peer in <strong>Expected SE</strong> 
state decide to use the Individual Element Synchronisation Mode, the peer
+                    When the initiating peer in <strong>Expected SE</strong> 
state decide to use the delta synchronisation mode, the peer
                     sends a <em><xref target="messages_ibf" format="title" 
/></em> to the receiving peer and changes into the <strong>Passive 
Decoding</strong> state.
                 </t>
                 <t>
@@ -736,7 +791,7 @@
                     peer and the peer that is in <strong>Passive 
Decoding</strong> and <strong>Finish Waiting</strong> state is called the 
passive peer.
                 </t>
                 <t>
-                    ############# Statemaschine Individual Element 
Synchronisation Mode ##################
+                    ############# Statemaschine Delta Synchronisation Mode 
##################
                 </t>
                 <t><strong>The behavior of the participants the different 
state is described below:</strong></t>
                 <dl>
@@ -879,7 +934,7 @@
                     ############# NOTE ############
                     To ensure that ...... the difference is multiplied by 1.5 
if there are more than 200 elements differences between the sets (WHY? line 
1398).
                     The Full Synchronisation Mode is used if the flag to force 
full sync is set, the estimated difference between the two sets is bigger
-                    than 25% or the set size of the receiving peer is zero. 
Otherwise the Individual Element Synchronisation Mode is used.
+                    than 25% or the set size of the receiving peer is zero. 
Otherwise the delta synchronisation mode is used.
                     ############# NOTE END############
                 </t>
             </section>

-- 
To stop receiving notification emails like this one, please contact
gnunet@gnunet.org.
[Prev in Thread]
Current Thread
[Next in Thread]
[lsd0003] branch master updated: work on section 4/5.1, gnunet <=
Prev by Date: [libeufin] branch master updated: CLI avoids printing empty JSONs
Next by Date: [lsd0003] branch master updated: more comments
Previous by thread: [libeufin] branch master updated: CLI avoids printing empty JSONs
Next by thread: [lsd0003] branch master updated: more comments
Index(es):
- Date
- Thread