I'm writing a bog-standard Unicode tokeniser to replace the crap one SQLite ships, and I'm wondering why I'm wasting my life writing C code again.

In the same vein as the old post I re-tooted, is there a Rust guru out there that can tell me if:

* Decent ICU bindings, or equivalent Unicode normalisation, case folding and word-break analysis, exist for Rust? (the last being key)
* Decent SQLite FTS5 custom tokeniser bindings or equivalent exist for Rust?

@mjog Haven't worked with these problems but perhaps the unicode crates will do for the first point? lib.rs/crates/unicode-segmenta


@YaLTeR thanks for the pointer! I took a look and it might not handle word segmentation for CJK, Thai, etc., which is mostly the point of using ICU.

I'll check it out though, probably a good project to get my feet wet.
