Wrote some bad Pandas code to try to devise an efficient labelling sampler for unlabelled data. The idea: you want a labelled dataset, but you have a tight budget or little time. So you generate enough features that the data can be clustered, then this method clusters with HDBSCAN* and samples from each cluster's core, from the outliers, and from the unclustered data. You label those and hopefully get enough to start a feedback loop with a crude classifier. Sane?

* A nice, very versatile clustering algorithm/library

The motivation here is that I want to start building labelled data out of a huge dataset that contains lots of noise and irrelevant data, and I can't afford the time/cost to label a "statistically relevant" sample. So I'm hoping I can optimise which samples I label so that I need fewer of them to train a crude classifier, and I can then try sampling near the decision boundary to get more informative examples for the next round.
But maybe other work has been done here and my approach is dumb? Open to ideas.
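
The "sampling near the decision boundary" step is close to what is usually called uncertainty (or margin) sampling in active learning. A rough sketch, assuming scikit-learn and using Iris as a stand-in: the helper name `most_uncertain`, the logistic regression model, and the five-per-class seed set are all assumptions for illustration, not part of the approach above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def most_uncertain(model, X_pool, n=5):
    """Return positions in X_pool with the smallest gap between the top two class probabilities."""
    proba = model.predict_proba(X_pool)
    top_two = np.sort(proba, axis=1)[:, -2:]
    margin = top_two[:, 1] - top_two[:, 0]
    return np.argsort(margin)[:n]


X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Pretend only a handful of rows are labelled so far (5 per class here).
labelled = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=5, replace=False)
    for c in np.unique(y)
])
pool = np.setdiff1d(np.arange(len(X)), labelled)

# Crude first-round classifier trained on the labelled rows only.
model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

# Rows of the unlabelled pool to send for labelling in the next round.
next_batch = pool[most_uncertain(model, X[pool], n=5)]
print(next_batch)
```

One known catch with pure margin sampling is that it can keep picking near-duplicates from the same region, so mixing in a few cluster-based picks each round may help.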

Testing my little algorithm on the Iris dataset and I know it's not working because the results look too good to be true. :P
