Testing my little algorithm on the Iris dataset and I know it's not working because the results look too good to be true. :P

The motivation here is that I want to start building labelled data out of a huge dataset that contains lots of noise/irrelevant data, and I can't afford the time/cost to label a "statistically relevant" sample. So I'm hoping I can cut down the number of samples I'd need to train a crude classifier, and then sample near its decision boundary to get more informative examples for the next round.
But maybe other work has been done here and my approach is dumb? Open to ideas.
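
For what it's worth, picking examples near the decision boundary is usually called uncertainty (or margin) sampling in the active learning literature. A minimal sketch of that step, assuming the crude classifier can give per-class probabilities for each unlabelled row and that there are at least two classes. The function name `smallest_margin` and the list-of-lists input are made up for illustration, not part of any library:

```python
def smallest_margin(proba_rows, k):
    """Return indices of the k rows whose two most likely classes are closest.

    proba_rows: one list of class probabilities per unlabelled row
                (hypothetical input, e.g. whatever the crude classifier emits).
    """
    def margin(row):
        top = sorted(row, reverse=True)
        # Small gap between best and second-best class = near the boundary.
        return top[0] - top[1]

    order = sorted(range(len(proba_rows)), key=lambda i: margin(proba_rows[i]))
    return order[:k]


# Example: the last row is a coin flip, so it gets picked first.
proba = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.5, 0.5]]
to_label_next = smallest_margin(proba, k=2)
```

The rows it returns are the ones to label in the next round before retraining.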

Wrote some bad Pandas code to try and devise an efficient labelling sampler for unlabelled data. The idea is, you want a labelled dataset but you have a tight budget or little time. So you generate enough features that the data can be clustered, then this method clusters with HDBSCAN* and samples from each cluster's core, from its outliers, and from the unclustered data. You label those and hopefully get enough to start a feedback loop with a crude classifier. Sane?

* A nice, very versatile clustering algorithm/library
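
The sampling step described above could be sketched roughly like this. It is not the original Pandas code, just a plain-Python illustration under some assumptions: clustering has already been run, `labels` holds one cluster id per row with -1 meaning unclustered (the HDBSCAN convention), and `probabilities` holds each row's membership strength. "Core" is read here as the strongest members of a cluster and "outliers" as the weakest ones; the function name and the `n_core`/`n_edge`/`n_noise` budget parameters are hypothetical.

```python
import random
from collections import defaultdict


def pick_labelling_batch(labels, probabilities, n_core=3, n_edge=2, n_noise=5, seed=0):
    """Return sorted row indices worth hand-labelling.

    labels:        cluster id per row, -1 = unclustered
    probabilities: membership strength per row, in [0, 1]
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)

    chosen = set()

    # Unclustered rows have no ranking to exploit, so draw them at random.
    noise = by_cluster.pop(-1, [])
    chosen.update(rng.sample(noise, min(n_noise, len(noise))))

    for members in by_cluster.values():
        ranked = sorted(members, key=lambda i: probabilities[i], reverse=True)
        chosen.update(ranked[:n_core])        # strongest members: cluster core
        rest = ranked[n_core:]
        if n_edge:
            chosen.update(rest[-n_edge:])     # weakest members: cluster fringe
    return sorted(chosen)


# Toy example: two clusters plus two unclustered rows.
labels = [0, 0, 0, 0, 1, 1, 1, -1, -1]
probabilities = [1.0, 0.9, 0.2, 0.5, 1.0, 0.3, 0.8, 0.0, 0.0]
batch = pick_labelling_batch(labels, probabilities, n_core=1, n_edge=1, n_noise=1)
```

With a fitted HDBSCAN clusterer, the two inputs would presumably come from its `labels_` and `probabilities_` attributes. Using low membership strength as the outlier signal is an assumption; a dedicated outlier score would slot into the same place.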

I really wish there were more opportunities for professional that _weren't_ surveillance capitalism or "finance".

again? :)
I'm into: & evidence-based , I'm either or depending on what we're discussing. Politically fluid but leaning .
I work on & at Scrapinghub. I like , , & .
Trained in (yay !), spent years at & in a home lab. I miss it.
Currently experimenting with invertebrate , , & evidence-based .
