Looked at the compiler-generated assembly for the loop that I was going to optimize, and it's actually... very good? Like, I can't really figure out where I'd be able to cut out an instruction. So maybe that loop isn't what's slowing me down after all. Back to the drawing board.

