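it would be fairly easy to do a full hardware cache flush on every context switch (the equivalent of a CLFLUSH over every line, since CLFLUSH by itself only evicts a single cache line).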
performance could be retained by windowing the caches, so each context keeps its own slice instead of losing everything on a switch (something like what SPARC does with register windows).
but this isn't a simple patch to the CPU design; we will probably have to wait for the next couple of microarchitectures before this is done.
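
to make the flush-vs-window tradeoff concrete, here is a rough toy model in Python. it is only a sketch under loud assumptions: a direct-mapped cache, two contexts taking turns, each making two passes over a tiny working set per time slice. `ToyCache`, `run`, and all the parameters are made up for illustration and do not model any real CPU.

```python
class ToyCache:
    """Hypothetical direct-mapped cache; each line holds (context, address) or None."""

    def __init__(self, num_lines, windows=1):
        assert num_lines % windows == 0
        self.lines = [None] * num_lines
        self.windows = windows
        self.window_size = num_lines // windows

    def flush(self):
        # Invalidate every line, as a whole-cache flush would.
        self.lines = [None] * len(self.lines)

    def access(self, ctx, addr):
        # A context can only index lines inside its own window,
        # so it never sees or evicts another context's lines.
        base = (ctx % self.windows) * self.window_size
        idx = base + addr % self.window_size
        hit = self.lines[idx] == (ctx, addr)
        self.lines[idx] = (ctx, addr)
        return hit


def run(policy, switches=100, working_set=4, num_lines=16, contexts=2):
    """Return the hit rate for policy 'flush' or 'window' (assumed names)."""
    windows = contexts if policy == "window" else 1
    cache = ToyCache(num_lines, windows)
    hits = total = 0
    for s in range(switches):
        ctx = s % contexts  # round-robin scheduling
        if policy == "flush":
            cache.flush()  # flush on every context switch
        for _ in range(2):  # two passes over the working set per slice
            for addr in range(working_set):
                hits += cache.access(ctx, addr)
                total += 1
    return hits / total
```

with these defaults, `run("flush")` comes out at 0.5 (every slice starts cold), while `run("window")` comes out at 0.99 (each context only pays for its first pass). the catch is that each context gets a smaller effective cache, which is part of why this is not a quick fix in real hardware.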