Currently writing a blogpost on a simple but effective sampling technique I developed for and I have this anxiety that, despite my efforts to find prior art, it's actually super well-known and I'll feel ignorant for thinking it's novel.
Time will tell, I guess.

Testing my little algorithm on the Iris dataset and I know it's not working because the results look too good to be true. :P


The motivation here is that I want to start building labelled data out of a huge dataset that contains lots of noisy/irrelevant data, and I can't afford the time/cost to label a "statistically relevant" sample. So I'm hoping I can optimise which samples I label first, get enough to train a crude classifier, and then try sampling near its decision boundary to get more informative examples for the next round.
But maybe other work has been done here and my approach is dumb? Open to ideas.
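
A rough sketch of the "sample near the decision boundary" step, assuming a binary classifier that outputs a positive-class probability per row (for example from scikit-learn's `predict_proba`). The function name `nearest_to_boundary` and its parameters are made up for illustration; this is plain uncertainty sampling, not necessarily what the blog post will describe.

```python
def nearest_to_boundary(positive_probs, k=3, already_labelled=()):
    """Return indices of the k unlabelled rows whose predicted
    probability is closest to 0.5 (hypothetical helper, binary case)."""
    skip = set(already_labelled)
    candidates = [i for i in range(len(positive_probs)) if i not in skip]
    # Smallest distance from 0.5 means the classifier is least sure.
    candidates.sort(key=lambda i: abs(positive_probs[i] - 0.5))
    return candidates[:k]


# Made-up probabilities from a crude first-round classifier.
probs = [0.02, 0.48, 0.91, 0.55, 0.50, 0.30]
next_round = nearest_to_boundary(probs, k=2, already_labelled=[4])
print(next_round)
```

Rows picked this way get labelled by hand, added to the training set, and the classifier is refit for the next round.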


Wrote some bad Pandas code to try and devise an efficient labelling sampler for unlabelled data. The idea is, you want a labelled dataset but you have a tight budget or little time. So you generate enough features that the data can be clustered, and this method uses HDBSCAN* to cluster it, then samples from each cluster core, the outliers, and the unclustered data. You label those and hopefully get enough to start a feedback loop with a crude classifier. Sane?

* A nice, very versatile clustering algorithm/library
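
A minimal sketch of how the three sampling pools could be picked, under some loud assumptions: the inputs are the per-point `labels_`, `probabilities_` and `outlier_scores_` arrays that the `hdbscan` library exposes after fitting, "cluster core" is read as "highest membership probability", "outliers" as "clustered points with the highest outlier score", and "unclustered" as points labelled `-1`. The function `sample_for_labelling` and its parameters are hypothetical, and it uses plain Python rather than the Pandas code mentioned above.

```python
import random
from collections import defaultdict


def sample_for_labelling(labels, probabilities, outlier_scores,
                         per_cluster=2, n_outliers=2, n_noise=2, seed=0):
    """Pick row indices to hand-label from three pools (hypothetical helper)."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    # HDBSCAN marks unclustered (noise) points with the label -1.
    noise = by_cluster.pop(-1, [])

    # Core pool: the most confidently assigned members of each cluster.
    core = []
    for label in sorted(by_cluster):
        members = sorted(by_cluster[label],
                         key=lambda i: probabilities[i], reverse=True)
        core.extend(members[:per_cluster])

    # Outlier pool: remaining clustered points with the highest outlier score.
    chosen = set(core)
    rest = [i for members in by_cluster.values()
            for i in members if i not in chosen]
    outliers = sorted(rest, key=lambda i: outlier_scores[i],
                      reverse=True)[:n_outliers]

    # Unclustered pool: a random draw from the noise points.
    unclustered = sorted(rng.sample(noise, min(n_noise, len(noise))))
    return {"core": core, "outlier": outliers, "unclustered": unclustered}


# Toy values standing in for a fitted clusterer's attributes.
labels = [0, 0, 0, 1, 1, 1, -1, -1]
probabilities = [1.0, 0.9, 0.4, 1.0, 0.8, 0.3, 0.0, 0.0]
outlier_scores = [0.0, 0.1, 0.7, 0.0, 0.2, 0.9, 0.95, 0.99]
picks = sample_for_labelling(labels, probabilities, outlier_scores,
                             per_cluster=1, n_outliers=2, n_noise=1)
print(picks)
```

The pool sizes are the labelling budget knobs: a tighter budget means smaller `per_cluster`, `n_outliers` and `n_noise`.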

I really wish there were more opportunities for professional that _weren't_ surveillance capitalism or "finance".

again? :)
I'm into: & evidence-based , I'm either or depending on what we're discussing. Politically fluid but leaning .
I work on & at Scrapinghub. I like , , & .
Trained in (yay !), spent years at & in a home lab. I miss it.
Currently experimenting with invertebrate , , & evidence-based .

