If I may interject
I really wish DHTs would be removed from the literature because they seem so good but they have so many unsolved problems that they trick people into designing systems that can't be safely implemented.
1. DoS -> If you're hashing the /64 of IPv6 space then I need to get a /48 (trivial); getting a /32 is not so trivial, but it's not really hard. Domains are actually probably less bad than /64s.
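To put numbers on that, here is a rough sketch using Python's standard `ipaddress` module. It assumes node identities are derived one per /64; the function name `count_64s` is made up for illustration, and `2001:db8::/32` is the IPv6 documentation prefix.

```python
import ipaddress

def count_64s(prefix: str) -> int:
    """Number of distinct /64 networks contained in an IPv6 prefix.

    Assumption: one DHT identity per /64, so this is the number of
    identities whoever controls the prefix could claim.
    """
    net = ipaddress.ip_network(prefix)
    if net.version != 6:
        raise ValueError("expected an IPv6 prefix")
    if net.prefixlen > 64:
        return 0  # smaller than a single /64
    return 2 ** (64 - net.prefixlen)

print(count_64s("2001:db8::/48"))  # 65536
print(count_64s("2001:db8::/32"))  # 4294967296
```

So a single /48 already yields 65536 candidate identities, which is why hashing the /64 does little against a determined attacker.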
My instinct is to do repeated searches with DNS-like TTL caches. Forwarding a search to the entire fediverse is obviously pretty horrible, but nodes that didn't recently use a tag shouldn't be bothered with the question.
So perhaps use a pubsub which allows you to subscribe to recent tag activity, for example "I have messages from the past 1 minute which use the tags" [ .... ]
Then you can limit the number of nodes you have to search...
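A minimal sketch of that idea, assuming nodes announce the tags they used recently and each announcement expires after a TTL. The class and method names are hypothetical, and the 60-second TTL just mirrors the "past 1 minute" example.

```python
import time

class RecentTagIndex:
    """Remembers which nodes recently announced activity for a tag."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the sketch is testable
        self._seen = {}             # tag -> {node: expiry time}

    def announce(self, node, tags):
        """Record that `node` has recent messages using `tags`."""
        expiry = self.clock() + self.ttl
        for tag in tags:
            self._seen.setdefault(tag, {})[node] = expiry

    def nodes_for(self, tag):
        """Nodes worth querying for `tag`; expired entries are dropped."""
        now = self.clock()
        live = {n: e for n, e in self._seen.get(tag, {}).items() if e > now}
        if live:
            self._seen[tag] = live
        else:
            self._seen.pop(tag, None)
        return sorted(live)
```

A search for a tag then only goes to `nodes_for(tag)` instead of the whole network.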
@cjd @lain Searches alone don't provide any subscription functionality, as having to poll for posts will just overload the network at high interest for a tag.
Furthermore, many use cases mandate post delivery to happen at least close to real-time. This wouldn't be possible with TTL-flooding at all.
Regarding flooding, I hope you know how "well" Gnutella worked?
My idea is definitely not well thought out, but the reason why I tried to make it work on top of querying/pubsub is because when someone is running a shitty server (freefedifollowers), you can just block the server and be done with it, whereas in a DHT you would have to get everybody onboard.
So thinking about it a bit more, I think what I'd do is the following:
1. Gossip all of the data in order to reduce the load
2. When you get an update about server X from server Y, next time you want to learn about server X, ask server Y again (unless server Y goes down, in which case you switch)
3. Publish the chain of servers through which the updates from server X reached you
This way you can do blacklisting and whitelisting which is resistant to fakery.
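Roughly, the three steps could look like the sketch below. All names are hypothetical. Two things are assumptions on my part: falling back to asking the origin directly when the remembered relay is down (the steps only say "you switch"), and leaving out signatures entirely, so this shows the bookkeeping only, not the tamper resistance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Update:
    origin: str          # server X, where the update was created
    body: str
    path: tuple = ()     # servers the update was relayed through, in order

def relay(update, via):
    """Step 3: each relaying server appends itself to the published chain."""
    return Update(update.origin, update.body, update.path + (via,))

def accept(update, blocked):
    """Drop updates whose origin or any relay in the chain is blocked."""
    if update.origin in blocked:
        return False
    return not any(hop in blocked for hop in update.path)

class Upstreams:
    """Step 2: remember which server last told us about each origin."""

    def __init__(self):
        self.preferred = {}

    def learned(self, update):
        if update.path:
            self.preferred[update.origin] = update.path[-1]

    def ask_next(self, origin, is_up):
        # Assumption: if the remembered relay is down, ask the origin itself.
        peer = self.preferred.get(origin)
        if peer is not None and is_up(peer):
            return peer
        return origin
```

Blocking one bad server then filters everything that passed through it, without needing anyone else to cooperate.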
What do you mean with "all of the data"? Apart from the hashtag posts assigned to an instance by the DHT as a relay or storage node, instances shall only receive the posts they subscribed to or queried. Otherwise we'll get an overload-causing "all or nothing" situation for smaller nodes like with current relays.
2. & 3.: So we're building ourselves a multicast tree by learning, right? That is a possible approach, but how does it perform better in ->
Everything -> everything you proposed to put in the dht (I like your idea of a message id only)
re performance, nodes can set a preference number to indicate how much they want you to pull from them so you tend to build a tree with hub nodes that can handle it. Ofc each update from node x should be signed and time-stamped by node x so it can't be tampered with.
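As a sketch of the preference number: assuming each peer advertises a non-negative weight, picking an upstream with probability proportional to that weight tends to concentrate pulls on the hubs. The function name and the weights are invented for illustration, and signing and timestamping are left out here.

```python
import random

def pick_upstream(preferences, rng=random):
    """Pick a peer with probability proportional to its advertised preference.

    `preferences` maps peer name -> non-negative number; peers advertising 0
    are never chosen. Returns None if nobody wants to be pulled from.
    """
    peers = [p for p, w in preferences.items() if w > 0]
    if not peers:
        return None
    weights = [preferences[p] for p in peers]
    return rng.choices(peers, weights=weights, k=1)[0]

prefs = {"hub.example": 100, "tiny.example": 1, "off.example": 0}
print(pick_upstream(prefs))
```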
> Everything -> everything you proposed to put in the dht
Still not sure what you mean.
The DHT part assigns responsibility of handling a bunch of hashtags to an instance. All other hashtag posts don't have to reside on that instance if it doesn't deliberately fetch them/ subscribe to them at another instance out of interest.
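For readers unfamiliar with how a DHT hands out responsibility, here is a minimal sketch assuming a Chord-style ring: instances and hashtags are hashed onto the same ring, and a hashtag belongs to the first instance at or after its position. The SHA-256 hashing, the lowercasing of tags, and all names are assumptions for illustration, not the scheme from the paper.

```python
import bisect
import hashlib

def ring_pos(key: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")

class Ring:
    def __init__(self, instances):
        # Sorted list of (position, instance name) pairs.
        self._points = sorted((ring_pos(i), i) for i in instances)

    def responsible_for(self, hashtag: str) -> str:
        """Instance responsible for a hashtag: its successor on the ring."""
        if not self._points:
            raise ValueError("ring is empty")
        pos = ring_pos(hashtag.lower())
        idx = bisect.bisect_left(self._points, (pos, ""))
        return self._points[idx % len(self._points)][1]  # wrap around

ring = Ring(["a.example", "b.example", "c.example"])
print(ring.responsible_for("#fediverse"))
```

The responsible instance only relays or stores what falls to it; everything else stays where it is unless someone subscribes.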
Writing up a little something here: https://cryptpad.fr/code/#/2/code/view/DjX7MWbez2OF5uXSjN1aQjL6IMoE-tXC26AG6fN5OEw/present/
Still in progress...
@cjd Gonna take a look at it tomorrow
@cjd Sure. /me is exhausted and going to sleep right now.
The only table I really want to bang my fist on is please don't slip on the DHT-banana-peel.
AFAICT the only protocol using DHT at scale is bittorrent (are there others?) and their usage is very unique. I would argue that in their usage it's a motte and bailey.
I'm really glad we have more people taking interest; this is the part I know very little about. It would be good to have other people step up and take leadership in this area.
@kaniini @emacsen @cjd @lain @schmittlauch That's good news. Hooray! Now we need to figure out where to coordinate this work... we have #datashards on freenode, but I'm getting the sense we need to do something more long-lived.
A few options:
- The W3C Credentials CG might be interested in picking it up https://w3c-ccg.github.io/ and we could use their calls, mailing lists
- We could maybe coordinate it on socialhub.activitypub.rocks once we have it up. Thoughts @how ?
- Something else?
@cwebber @schmittlauch @lain @emacsen
The bailey in this case is the trackers. They work really well, they're fast, they're centrally administered so if something goes wrong, someone can deal with it.
But if the baddies threaten to take down the trackers, the bittorrent people say "ohh you are fools, you can take down the trackers all you like, we have <drumroll> The DHT", and that's true, if the trackers go down, the network will continue to function.
But then you have the DHT attacks...
@cjd @schmittlauch @lain @emacsen Do you think hosting such things over tor .onion services or I2P helps? Makes it harder to take down nodes. But OTOH, I'd also love to be able to use the fediverse servers we already have to distribute content without setting up separate daemons necessarily (I'm guessing that's where the Pleroma devs plan to take things)
I think it would be really cool if actually everything was gossiped, so then the fediverse could cross network boundaries (some nodes in tor, some in i2p, some in Hyperboria, some in China), but that's just a dream and the bandwidth to move media around makes such a thing untenable.
Should a node, once it has content that is "important" to it (eg, let's say my node containing this very post) continue to hold onto it and respond to queries asking for content?
On the one hand, this helps important content survive. On the other hand, it helps reveal who has the content.
I wonder if we can make progress on this without going full-freenet ;)
Having the originating server store the content and other servers only "cache" it makes logical sense because the originating server is the one which has the direct relationship with the person who created the content (who is probably the relevant data-subject).
You have a strong interest in this, evidenced by the paper you put significant time and effort into, I'm occupied by other things and I only have a marginal interest in making the fediverse more flexible in how it deals with attacks.
At this point your proposal is more standards-ready than mine, yours has a champion (you), mine doesn't because I don't have the time.
@cjd @cwebber @lain @emacsen I'll try to read your proposal as soon as possible. I like your enthusiasm and you quickly getting onto things, but am also a bit appalled by how quickly you put together an alternative suggestion and got people discussing it.
I need to remember my considerations for *not* building gossip (I didn't know that term back then) trees 6 months(!) ago 😅
@lain @emacsen @cjd @schmittlauch I think one thing that happened at APConf is that a lot of us started to get excited about the viability of bringing Datashards to the fediverse. It seems to me that the Pleroma team is looking to take leadership here, and that's really great and increases my confidence.
@lain @emacsen @cjd @schmittlauch @gargron @nightpool I also want to say that we want to be careful about rolling this stuff out in testing stages; Datashards is still in flux and *will be shaped by* the participation of the fediverse. We want to be careful about not rolling it out completely to the wider fediverse before we're sure about how it works.
@cjd @schmittlauch @lain @emacsen The good news also is that we don't have to do the 100% best thing initially; your statement here is *at minimum* an extremely good starting place and is way better than how the *current* fediverse distribution works. We have the advantage that Datashards doesn't specify the routing algorithm and can compose with multiple approaches, so we can tweak that later.
Then there is a third actor which is a sort of hidden tracker. Back in the day, they were running what were effectively sybil nets in the DHT which were "good sybil nets" that were answering requests just like a traditional tracker.
This system shouldn't be derided, it won a war. But opacity was a big part of it.
I'm trying to equip myself to express these ideas clearly, because I'm convinced that community engagement is essential to uptake, and we need to democratize this vocabulary for that to happen
What are the families of routing protocols that share common characteristics - DHTs, gossip, multicast, etc.? Which ones are decentralized, rather than distributed, and why? This community has reasons for being decentralized rather than distributed. How does a given architecture address those reasons?
How do you define message scopes? Is there a way to define public scope that will lead to people with similar opinions as me discovering my post in greater numbers than those opposed to my perspective? This takes labor from the community; how is it different from what we have now and why is that important? Forget what's possible for skilled attackers for 30 seconds: how do marginalized people dealing with Basic Becky Bigot benefit from a given feature?
In the FUD about DHTs, I missed the critical point that this paper isn't addressing general delivery or arbitrary retrieval of messages in the public scope, which many actually want to be something a little less public, but tagged posts, which is (currently) a signal that the poster is looking for broad discoverability
With these considerations in mind, the abuse profile is minimal. Disruption of the network means a fallback to the status quo. Targeted disruption of an instance means that the instance drops from the DHT network
There are 2 differences between this and my "n-dimensional hyper-torus" thread recently. Besides the fact that this paper is coherent, I would add that instances should participate in a Chord for each hashtag. This may result in multiple networks around a given tag, e.g. loli, as those given to opinions on some topics may also be given to sharp disagreements
The strength of the fediverse is social discovery along affinity groups. An intellectually rigorous proposal to create "one ring" for hash tag discovery might encounter resistance where a more accessible document describing a *slightly* more complex proposal that clearly shows the participation requirements continuing to scale with the size of the base would be better received
The only weak point I noticed in the analysis is that instance size has a long tail of small instances and the assertion "Storing 24 GiB of data for a year is manageable for a single node," is erroneous. If the storage requirement was commensurate with participation in hash tag usage, then a single user instance on lean hardware would be more consistently able to participate
@cj @cwebber @cjd @schmittlauch @lain @emacsen
If I'm (finally) reading the paper correctly, the suggested topology is that each node has a predecessor and a successor in each of two DHTs, where each DHT has a separate realm of function in a single network where the content is addressed by a hash of the tag. It's a toroid
The topology I'm suggesting is 2 or 4 relations per hash tag where posts are addressed by a content hash
@yaaps @cj @cwebber @cjd @lain @emacsen
Thx for the feedback.
I don't really see why you'd want to create a new Chord ring of multiple nodes for each hashtag: How do you use the key-value lookup capabilities of a DHT on just a single hashtag?
Regarding the resilience against deliberate blocking of a tag: Relay/ storage instances don't have to store the (questionable) content, but just the post IDs. ->
The 24 GiB calculation might be a bit misleading: This is the required number for one of the *largest* hashtags and was just used for deciding whether hashtag splitting is necessary or unnecessary.
Thanks for taking the time to make considered and well thought out responses when you've already invested considerable labor on this. It's after midnight here and I feel that I owe you the same courtesy 👍
@yaaps For me it's the morning, so don't worry ;)
@schmittlauch @cj @cwebber @cjd @lain @emacsen
The perception of the Fediverse as having a shape where a ring topology would be a natural choice for routing is highly local. Not only do instances grow towards others with affinities, many aggressively prune unfavorable connections. Unfortunately, this proposal isn't sufficiently agnostic to content for those instances to co-exist despite their antagonisms (cont'd)
@schmittlauch @cj @cwebber @cjd @lain @emacsen
While the network is sufficiently robust against technical attacks, it is wide open on common social vectors. When technical tags are allocated to justice-oriented instances, many will feel that it is not only acceptable, but a moral requirement, to avoid replicating messages and index entries to/from instances that don't conform to their expectations of conduct (cont'd)
@schmittlauch @cj @cwebber @cjd @lain @emacsen
The most common abuse of tags is hijacking. It's fairly common for interests operating in opposition to have turf wars over a hash tag. While this is sometimes necessary and always difficult to prevent, it would be helpful if posts representing the interests in these contests were routed to minimize interactions between the combatants (cont'd)
You need multiple interconnected networks with a division of work that preserves the social affinities of the networks involved and scales with instance size. Instance level blocks should not create undesired behavior and we have to anticipate that average packet sizes will grow with Datashards, Pixelfed, federated blogs, and gaming platforms entering the Fediverse
The requirement for multiple networks can be intuited from graphs of the existing relationships in the fedi or derived from the definition of consent in that meaningful consent cannot be determined in the absence of viable alternatives
@schmittlauch @cj @cwebber @cjd @lain @emacsen
Here's the link to my "n-dimensional hypertorus" post. It's part of a thread, but the only required context for understanding that post is that I'm describing idea spaces using polar (actually hyper-spherical) coordinates
I didn't set out in this thread to promote my own idea. The paper is thorough. It takes a good idea and presents it well, but it's disjoint to the needs of the community. That can be reconciled by rotating the coverage 90° and iterating the pattern over local subsections
> Accessing an object requires knowing its identifier, possessing the decryption key (if encrypted), and having a relationship with an affinity group homing the object
This sounds more relevant to groups, where like-minded folks are interacting.
But just imagine something like #MeToo being limited to affinity groups: ->