it doesn't appear to be any faster than a single process.
It won't be any faster than a single process. It's doing replica exchange, so rather than sampling N frames with a single replica, each of the 10 replicas samples N frames. So you'll get 10x the sampling in the same amount of wall time. There's almost no communication overhead with replica exchange, so I'd be enormously surprised to see any slowdown.
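To put rough numbers on that, here's a back-of-the-envelope sketch. The function name, the 0.5 s per frame cost, and the 1% exchange overhead are all made up for illustration, not measured from any real run; the point is only that the replicas run side by side, so total frames scale with the replica count while wall time stays about the same.

```python
def replica_exchange_throughput(frames_per_replica, n_replicas,
                                seconds_per_frame, exchange_overhead=0.0):
    """Estimate (total_frames, wall_time_seconds) for replicas run in parallel.

    Hypothetical helper: assumes one process per replica, all replicas
    advancing at the same rate, and exchange cost expressed as a
    fraction of the per-frame cost.
    """
    # Every replica produces its own N frames.
    total_frames = frames_per_replica * n_replicas
    # Replicas run concurrently, so wall time depends on N, not on n_replicas.
    wall_time = frames_per_replica * seconds_per_frame * (1.0 + exchange_overhead)
    return total_frames, wall_time


# Illustrative values only (assumptions, not benchmarks).
single_frames, single_time = replica_exchange_throughput(1000, 1, 0.5)
remd_frames, remd_time = replica_exchange_throughput(1000, 10, 0.5,
                                                     exchange_overhead=0.01)

print(f"single process: {single_frames} frames in {single_time:.0f} s")
print(f"10 replicas:    {remd_frames} frames in {remd_time:.0f} s")
```

With these assumed numbers, the 10-replica run takes about as long as the single process but yields ten times as many frames, which is why the wall time looks unchanged.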