From 96d4b93366f4b81b4c3b3f297685ba9eeeefe0c3 Mon Sep 17 00:00:00 2001 From: fiatjaf Date: Tue, 29 Oct 2024 11:30:33 -0300 Subject: [PATCH 1/5] nip45: add hyperloglog relay response. --- 45.md | 45 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 38 insertions(+), 7 deletions(-) diff --git a/45.md b/45.md index 219368e4..ea0b7f5f 100644 --- a/45.md +++ b/45.md @@ -29,15 +29,38 @@ In case a relay uses probabilistic counts, it MAY indicate it in the response wi Whenever the relay decides to refuse to fulfill the `COUNT` request, it MUST return a `CLOSED` message. +## HyperLogLog + +Relays may return an HyperLogLog value together with the count, hex-encoded. + +``` +["COUNT", , {"count": , "hll": ""}] +``` + +This is so it enables merging results from multiple relays and yielding a reasonable estimate of reaction counts, comment counts and follower counts, while saving many millions of bytes of bandwidth for everybody. + +### Algorithm + +The HLL value must be calculated with a precision of `8`, i.e. with 256 registers. + +To compute HLL values, first initi the 256 registers to `0` each; then, for on every event to be counted, + + 1. take byte `16` of the `id` and use it to determine the register index; + 2. count the number of leading zero bits in the following bytes `17..24` of the `id`; + 3. if the number of leading zeros is bigger than what was previously stored in that register, overwrite it. + +That is all that has to be done on the relay side, and therefore the only part needed for interoperability. + +On the client side, these HLL values received from different relays can be merged (by simply going through all the registers in HLL values from each relay and picking the highest value for each register, regardless of the relay). 
+ +And finally the absolute count can be estimated by running some methods I don't dare to describe here in English, it's better to check some implementation source code (also, there can be different ways of performing the estimation, with different quirks applied on top of the raw registers). + +### `hll` encoding + +The value `hll` value must be the concatenation of the 256 registers, each being a uint8 value (i.e. a byte). Therefore `hll` will be a 512-character hex string. + ## Examples -### Followers count - -``` -["COUNT", , {"kinds": [3], "#p": []}] -["COUNT", , {"count": 238}] -``` - ### Count posts and reactions ``` @@ -45,6 +68,7 @@ Whenever the relay decides to refuse to fulfill the `COUNT` request, it MUST ret ["COUNT", , {"count": 5}] ``` + ### Count posts approximately ``` @@ -52,6 +76,13 @@ Whenever the relay decides to refuse to fulfill the `COUNT` request, it MUST ret ["COUNT", , {"count": 93412452, "approximate": true}] ``` +### Followers count with HyperLogLog + +``` +["COUNT", , {"kinds": [3], "#p": []}] +["COUNT", , {"count": 16578, "hll": "0607070505060806050508060707070706090d080b0605090607070b07090606060b0705070709050807080805080407060906080707080507070805060509040a0b06060704060405070706080607050907070b08060808080b080607090a06060805060604070908050607060805050d05060906090809080807050e0705070507060907060606070708080b0807070708080706060609080705060604060409070a0808050a0506050b0810060a0908070709080b0a07050806060508060607080606080707050806080c0a0707070a080808050608080f070506070706070a0908090c080708080806090508060606090906060d07050708080405070708"}] +``` + ### Relay refuses to count ``` From 867d2d9788339bf06b3ed4e4f7ff9e0ac4897d87 Mon Sep 17 00:00:00 2001 From: fiatjaf Date: Sun, 3 Nov 2024 16:49:02 -0300 Subject: [PATCH 2/5] nip45: negate pow attacks on hyperloglog using a stupid hack. 
--- 45.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/45.md b/45.md index ea0b7f5f..f6ecbeef 100644 --- a/45.md +++ b/45.md @@ -41,13 +41,14 @@ This is so it enables merging results from multiple relays and yielding a reason ### Algorithm -The HLL value must be calculated with a precision of `8`, i.e. with 256 registers. +This section describes the steps a relay should take in order to return HLL values to clients. -To compute HLL values, first initi the 256 registers to `0` each; then, for on every event to be counted, - - 1. take byte `16` of the `id` and use it to determine the register index; - 2. count the number of leading zero bits in the following bytes `17..24` of the `id`; - 3. if the number of leading zeros is bigger than what was previously stored in that register, overwrite it. +1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32th ascii character as a byte and take its modulo over 24 to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16; +2. Initialize 256 registers to 0 for the HLL value; +3. For all the events that are to be counted according to the filter, do this: + 1. Read byte at position `offset` of the event `pubkey`, its value will be the register index `ri`; + 2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey`; + 3. Compare that with the value stored at register `ri`, if the new number is bigger, store it. That is all that has to be done on the relay side, and therefore the only part needed for interoperability. From 2e31f714db942377736c2ce2d0927b363cdfe043 Mon Sep 17 00:00:00 2001 From: fiatjaf Date: Sat, 9 Nov 2024 07:59:14 -0300 Subject: [PATCH 3/5] nip45: mention hyperloglog attack and its solution. 
--- 45.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/45.md b/45.md index f6ecbeef..a5ce02db 100644 --- a/45.md +++ b/45.md @@ -56,6 +56,10 @@ On the client side, these HLL values received from different relays can be merge And finally the absolute count can be estimated by running some methods I don't dare to describe here in English, it's better to check some implementation source code (also, there can be different ways of performing the estimation, with different quirks applied on top of the raw registers). +### Attack vectors + +One could mine a pubkey with a certain number of zero bits in the exact place where the HLL algorithm described above would look for them in order to artificially make its reaction or follow "count more" than others. For this to work a different pubkey would have to be created for each different target (event id, followed profile etc). This approach is not very different than creating tons of new pubkeys and using them all to send likes or follow someone in order to inflate their number of followers. The solution is the same in both cases: clients should not fetch these reaction counts from open relays that accept everything, they should base their counts on relays that perform some form of filtering that makes it more likely that only real humans are able to publish there and not bots or artificially-generated pubkeys. + ### `hll` encoding The value `hll` value must be the concatenation of the 256 registers, each being a uint8 value (i.e. a byte). Therefore `hll` will be a 512-character hex string. From 545281646c4ce79667c1721937064aa78bb50d9d Mon Sep 17 00:00:00 2001 From: fiatjaf Date: Mon, 11 Nov 2024 22:12:37 -0300 Subject: [PATCH 4/5] nip45: a mike dilger fix and a change inspired by a mike dilger fix. 
--- 45.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/45.md b/45.md index a5ce02db..d2af7c26 100644 --- a/45.md +++ b/45.md @@ -43,11 +43,11 @@ This is so it enables merging results from multiple relays and yielding a reason This section describes the steps a relay should take in order to return HLL values to clients. -1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32th ascii character as a byte and take its modulo over 24 to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16; -2. Initialize 256 registers to 0 for the HLL value; +1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32th ascii character as a nibble (a half-byte, a number between 0 and 16) and add `8` to it to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number `16`; +2. Initialize 256 registers to `0` for the HLL value; 3. For all the events that are to be counted according to the filter, do this: 1. Read byte at position `offset` of the event `pubkey`, its value will be the register index `ri`; - 2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey`; + 2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey` and add `1`; 3. Compare that with the value stored at register `ri`, if the new number is bigger, store it. That is all that has to be done on the relay side, and therefore the only part needed for interoperability. From bf0d740d1257705cde1251d44181882afff56d28 Mon Sep 17 00:00:00 2001 From: fiatjaf Date: Sat, 7 Dec 2024 07:38:40 -0300 Subject: [PATCH 5/5] nip45: restrict hyperloglog to two hardcoded use cases with deterministic offset for now. 
--- 45.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/45.md b/45.md index d2af7c26..4866e579 100644 --- a/45.md +++ b/45.md @@ -43,10 +43,10 @@ This is so it enables merging results from multiple relays and yielding a reason This section describes the steps a relay should take in order to return HLL values to clients. -1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32th ascii character as a nibble (a half-byte, a number between 0 and 16) and add `8` to it to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number `16`; +1. Upon receiving a filter, if it is eligible (see below) for HyperLogLog, compute the deterministic `offset` for that filter (see below); 2. Initialize 256 registers to `0` for the HLL value; 3. For all the events that are to be counted according to the filter, do this: - 1. Read byte at position `offset` of the event `pubkey`, its value will be the register index `ri`; + 1. Read the byte at position `offset` of the event `pubkey`, its value will be the register index `ri`; 2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey` and add `1`; 3. Compare that with the value stored at register `ri`, if the new number is bigger, store it. @@ -56,6 +56,13 @@ On the client side, these HLL values received from different relays can be merge And finally the absolute count can be estimated by running some methods I don't dare to describe here in English, it's better to check some implementation source code (also, there can be different ways of performing the estimation, with different quirks applied on top of the raw registers). +### Filter eligibility and `offset` computation + +This NIP defines (for now) two filters eligible for HyperLogLog: + +- `{"#p": [""], "kinds": [3]}`, i.e. 
a filter for `kind:3` events with a single `"p"` tag, which means the client is interested in knowing how many people "follow" the target ``. In this case the `offset` will be given by reading the character at the position `32` of the hex `` value as a base-16 number then adding `8` to it. +- `{"#e": [""], "kinds": [7]}`, i.e. a filter for `kind:7` events with a single `"e"` tag, which means the client is interested in knowing how many people have reacted to the target event ``. In this case the `offset` will be given by reading the character at the position `32` of the hex `` value as a base-16 number then adding `8` to it. + ### Attack vectors One could mine a pubkey with a certain number of zero bits in the exact place where the HLL algorithm described above would look for them in order to artificially make its reaction or follow "count more" than others. For this to work a different pubkey would have to be created for each different target (event id, followed profile etc). This approach is not very different than creating tons of new pubkeys and using them all to send likes or follow someone in order to inflate their number of followers. The solution is the same in both cases: clients should not fetch these reaction counts from open relays that accept everything, they should base their counts on relays that perform some form of filtering that makes it more likely that only real humans are able to publish there and not bots or artificially-generated pubkeys.
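
Reviewer sketch (not part of the patch): the relay-side steps as they stand after PATCH 5/5, plus the client-side merge, could look roughly like the Python below. All function names are hypothetical. The sketch makes three assumptions the text does not spell out: position `32` is a zero-based index into the 64-character hex string, the leading-zero count runs over the 8 bytes that follow `offset` (mirroring the `17..24` window of PATCH 1/5), and a filter is eligible only if it contains exactly the two keys shown. Count estimation is deliberately left out, as in the NIP text.

```python
from typing import Optional


def hll_offset(filter_: dict) -> Optional[int]:
    """Return the offset for an eligible filter, or None if it is not eligible.

    Assumption: only the exact shapes {"#p": [x], "kinds": [3]} and
    {"#e": [x], "kinds": [7]} are eligible, and "position 32" is zero-based.
    """
    for tag, kind in (("#p", 3), ("#e", 7)):
        values = filter_.get(tag)
        if (
            set(filter_) == {tag, "kinds"}
            and filter_["kinds"] == [kind]
            and isinstance(values, list)
            and len(values) == 1
            and len(values[0]) == 64
        ):
            # One hex character is a value in 0..15, so offset is in 8..23.
            return int(values[0][32], 16) + 8
    return None


def hll_add(registers: bytearray, pubkey_hex: str, offset: int) -> None:
    """Update the 256 registers with one event's pubkey."""
    pubkey = bytes.fromhex(pubkey_hex)
    ri = pubkey[offset]  # register index
    # Assumption: count leading zero bits over the next 8 bytes only.
    # With offset <= 23 the slice is always a full 8 bytes of a 32-byte key.
    window = int.from_bytes(pubkey[offset + 1 : offset + 9], "big")
    rank = (64 - window.bit_length()) + 1  # leading zeros, plus 1
    if rank > registers[ri]:
        registers[ri] = rank


def hll_encode(registers: bytearray) -> str:
    """Concatenate the 256 uint8 registers into a 512-character hex string."""
    return bytes(registers).hex()


def hll_merge(*hll_hex_values: str) -> str:
    """Client side: keep the highest value seen for each register."""
    merged = bytearray(256)
    for value in hll_hex_values:
        for i, register in enumerate(bytes.fromhex(value)):
            if register > merged[i]:
                merged[i] = register
    return merged.hex()
```

A relay would call `hll_offset` once per `COUNT` filter, then `hll_add` once per matching event on a fresh `bytearray(256)`, and return `hll_encode(registers)` in the `hll` field. A client would pass the `hll` strings from several relays to `hll_merge` before running whichever estimator it prefers.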