Since callgrind allows to control stats collection from the guest, this
allows us to reset the collection right before the benchmark starts.
This change exposes this to the benchmark runner and integrates
callgrind data parsing into bench.py, so that we can run bench.py with
--callgrind argument and, as long as the runner was built with callgrind
support, we get instruction counts from the run.
We convert instruction counts to seconds using 10G instructions/second
rate; there's no correct way to do this without simulating the full CPU
pipeline but it results in time units on a similar scale to real runs.