What's the best way to search the fediverse?

Yingwu@lemmy.dbzer0.com · 11 months ago

What's the best way to search the fediverse?

Lovable Sidekick@lemmy.world · 11 months ago

What if search itself were a federated function? Although I’m a software dev I really don’t know much about the mechanics of large-scale search engines such as Google, but I know their server farms somehow share the load of performing searches and maintaining whatever database they maintain to optimize searching. Seems like the fediverse could do search in a similar way. I’m just saying your critique of the idea, although well thought out, seems like a critique of a particular strategy. It’s not obvious to me that the very idea of federated search is outlandish.

Jupiter Rowland@sh.itjust.works · 11 months ago

Still, the issue would be to find all instances of all Fediverse server applications.

I mean, the idea was to cover the whole Fediverse with that search. Literally everything.

Like, imagine I spin up my own instance of Forte on a home server to try it out and see if it already works.

How’s a Fediverse search engine supposed to know about my brand-new Forte instance? Clairvoyance? Hah. A crawler? Yeah, right, as if any crawler out there was fast enough to discover a brand-new instance of something that doesn’t have a running instance at all yet. At least not beyond enclosed, experimental instances detached from the rest of the Fediverse.

I mean, instead of Forte, I could also install what Forte was forked from, namely something colloquially referred to as (streams). Something that intentionally doesn’t have a name, doesn’t have a brand identity, doesn’t have a unified server identifier. Unlike Mastodon whose instances all identify as “mastodon” and Lemmy whose instances all identify as “lemmy” and Hubzilla whose instances all identify as “hubzilla”, (streams) instances don’t all identify the same. That field is customisable. And it has been customised for as long as (streams) has been around. You can’t reliably crawl (streams) instances. Instead of “streams”, they can identify as “y” (because Y is not X) or “get ready to rumbly” (public instance actually) or “bunny of doom” or “diversi spiritus”.

In fact, crawlers would have to be able to identify any kind of Fediverse server software. Even if someone has only just forked something, a crawler would be able to recognise it as Fediverse server software. If you hard-code server identifiers into the crawler, it’d be out-of-date as soon as someone decides to fork Mastodon or Misskey or Firefish or Sharkey or whatever again. And, as mentioned above, you can’t crawl (streams) instances by identifier.

It simply is impossible to discover and index the whole Fediverse by crawling, Google-style. And if a Fediverse search engine can’t discover a (streams) instance that identifies as “y”, it can’t index the content coming from the man who created (streams) and Forte and still occasionally develops both. The man who created the oldest still existing Fediverse project, Friendica, as well as the Swiss Army knife of the Fediverse, Hubzilla, and the very concept of nomadic identity. One of the most competent and experienced Fediverse devs ever. A crawler couldn’t find him.

Still, the search engine needs to know all Fediverse instances, right?

Well, if crawling fails, and crawling does fail, there’s only one way to achieve that: Each instance would have to announce its presence to anything that’s supposed to be able to search the Fediverse.

But in order to be able that, each instance would have to know everything that can search the Fediverse. And all instances of it. Every single one of them.

And if it shall announce its existence when it spins up for the first time, it will have to know all these search instances immediately before spinning up.

How can it possibly know them all before even going online itself?

Two options. Either a centralised list of all search instance that’s being updated as soon as a new one is spun up.

But you said, “federated.” As in not centralised.

Or the list would have to be built into the source code as it’s being git pulled from the code repository. In fact, the list would have to be git pulled from the code repository immediately before the server spins up so that it’s up-to-date when the server spins up. This would mean that the whole server software would have to be updated before start-up.

Of course, each Fediverse server software project that’s started from scratch would have to implement this list, otherwise its instances couldn’t be found.

But how is this list supposed to be kept up-to-date?

I mean, let’s suppose what has been spun up here is something that has Fediverse search built in. It itself would have to be added to this list so that other new instances can announce themselves to this new instance, so that it can find them and index their content.

So how is this new search-equipped instance supposed to be added to the list of search instances?

Shall it add itself to the list by manipulating the production code of all Fediverse server applications that have Fediverse search built in? Past the maintainers and without their consent?

Perfect search that covers 100% of the Fediverse has to rely on lists of some kind, that’s clear. The Fediverse changes too quickly to be crawlable. It’s too diverse to be crawlable. And it has server software which itself is inherently uncrawlable because it’s undiscoverable by design.

But such lists are impossible to always be kept up-to-date, too.

Lovable Sidekick@lemmy.world · 11 months ago

Thanks for putting so much time and thought into the discussion. All the problems you talk about exist for every search engine in actual use today. For example, publishing a site on a brand new domain has the exact problem you’re describing with spinning up a new Forte instance. There can be a 24-hr lag before DNS can reliably find the site. Perfect search is an aspirational goal. The realistic goal is to satisfy most needs. No matter how many words you throw at it, I don’t think federated search is an outlandish idea at all.