Wrote some bad Pandas code to try to devise an efficient labelling sampler for unlabelled data. The idea: you want a labelled dataset but you have a tight budget or little time. So you generate enough features that the data can be clustered, cluster it with HDBSCAN*, then sample from each cluster's core, from the outliers, and from the unclustered data. You label those samples and hopefully get enough to start a feedback loop with a crude classifier. Sane?
* A nice, very versatile clustering algorithm/library
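
A rough sketch of what the sampling step might look like, not the actual code from the post. It assumes the clustering has already produced three per-row arrays: a cluster label (-1 for unclustered), a membership strength, and an outlier score. The function name `sample_for_labelling`, its parameters, and the toy numbers are all made up for illustration, and "outliers" is read here as the highest-scoring clustered points overall rather than per cluster.

```python
import pandas as pd


def sample_for_labelling(labels, probabilities, outlier_scores,
                         n_core=1, n_outliers=2, n_noise=1, seed=0):
    """Pick rows to hand-label: cluster cores, outliers, and unclustered points.

    Hypothetical helper; inputs are per-row arrays from any clusterer that
    marks unclustered rows with the label -1.
    """
    df = pd.DataFrame({
        "cluster": labels,
        "prob": probabilities,
        "outlier": outlier_scores,
    })
    clustered = df[df["cluster"] >= 0]
    noise = df[df["cluster"] < 0]

    # Core picks: the strongest members of each cluster.
    core = (clustered.sort_values("prob", ascending=False, kind="stable")
                     .groupby("cluster")
                     .head(n_core))

    # Outlier picks: clustered rows with the highest outlier score,
    # excluding anything already chosen as a core pick.
    outliers = clustered.drop(core.index).nlargest(n_outliers, "outlier")

    # Unclustered picks: a random sample of the rows labelled -1.
    noise_pick = noise.sample(n=min(n_noise, len(noise)), random_state=seed)

    return pd.concat([
        core.assign(source="core"),
        outliers.assign(source="outlier"),
        noise_pick.assign(source="noise"),
    ])


# Toy inputs (invented numbers): two clusters plus two unclustered rows.
labels = [0, 0, 0, 1, 1, 1, -1, -1]
probabilities = [1.0, 0.9, 0.4, 1.0, 0.8, 0.3, 0.0, 0.0]
outlier_scores = [0.0, 0.1, 0.7, 0.05, 0.2, 0.9, 0.95, 0.99]

picks = sample_for_labelling(labels, probabilities, outlier_scores)
print(picks)
```

With the `hdbscan` library, a fitted clusterer exposes `labels_`, `probabilities_` and `outlier_scores_`, which should slot in as the three inputs. The `source` column records why each row was picked, which may be handy later when checking whether the crude classifier struggles more on outliers than on core points.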