No, the Costas loop actually performs pretty poorly on GPUs because the algorithm is inherently sequential: each sample's phase correction depends on the error computed from the previous sample. The only way I've found to code it for OpenCL is as a single task-based kernel call, so it really only executes like a standard CPU routine on one GPU core rather than in parallel, and performance for that block is pretty low. It drops to less than 2 Msps even on an NVIDIA 1070 card, versus 34+ Msps for the gr-lfast version on an i7-6700, and the OpenCL performance didn't change much when I varied the data size. So far the best performance I've gotten out of the Costas loop is with the optimized code in gr-lfast.
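To make the sequential dependency concrete, here is a minimal sketch of a second-order BPSK Costas loop in plain Python. It is not the gr-clenabled or gr-lfast code; the function name, the gain formulas, and the default loop bandwidth are assumptions chosen for illustration. The point is that `phase` and `freq` are updated on every sample and the next sample cannot be corrected until that update is done, so the loop body can't simply be spread across GPU work-items.

```python
import cmath
import math


def costas_loop_bpsk(samples, loop_bw=0.05, damping=0.707):
    """Illustrative BPSK Costas loop (hypothetical helper, not library code)."""
    # Second-order loop gains derived from bandwidth and damping (assumed form).
    denom = 1.0 + 2.0 * damping * loop_bw + loop_bw * loop_bw
    alpha = (4.0 * damping * loop_bw) / denom
    beta = (4.0 * loop_bw * loop_bw) / denom

    phase = 0.0  # current phase estimate, carried from sample to sample
    freq = 0.0   # current frequency estimate, carried from sample to sample
    out = []
    for x in samples:
        # De-rotate the input using the phase estimate from the PREVIOUS step.
        y = x * cmath.exp(-1j * phase)
        # BPSK phase detector, clipped to [-1, 1].
        error = max(-1.0, min(1.0, y.real * y.imag))
        # Loop filter: this update must finish before the next sample is handled.
        freq += beta * error
        phase += freq + alpha * error
        # Keep the phase estimate wrapped to [-pi, pi).
        phase = (phase + math.pi) % (2.0 * math.pi) - math.pi
        out.append(y)
    return out


if __name__ == "__main__":
    # Alternating BPSK symbols with a fixed 0.5 rad phase offset.
    rx = [(1.0 if n % 2 == 0 else -1.0) * cmath.exp(0.5j) for n in range(2000)]
    locked = costas_loop_bpsk(rx)
    print(abs(locked[-1].imag))  # residual quadrature component after lock
```

Because iteration n needs the `phase` produced by iteration n-1, an OpenCL version ends up as one work-item stepping through the buffer, which matches the single-task behavior described above.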
For gr-clenabled, there's a tool called test-clenabled that gets installed with it. You can pass it a parameter for the data size, and it'll take timing measurements for both the OpenCL version and the CPU version, so you can run tests on your own hardware with whatever sizes you'd like.
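If you want to do the same kind of size sweep on your own code, the idea can be sketched roughly like this in Python. This is not how test-clenabled is implemented and it doesn't call it; `measure_msps` and `scale_by_two` are hypothetical names, and the workload is just a stand-in for whatever block you want to time.

```python
import time


def measure_msps(block_fn, num_samples, repeats=5):
    """Return best-case throughput of block_fn in Msps (hypothetical helper)."""
    data = [complex(1.0, 0.0)] * num_samples
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        block_fn(data)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse timers.
    best = max(best, 1e-12)
    return num_samples / best / 1e6


def scale_by_two(samples):
    """Stand-in workload: multiply every sample by a constant."""
    return [2.0 * s for s in samples]


if __name__ == "__main__":
    # Sweep a few data sizes to see whether throughput depends on block size.
    for size in (2048, 8192, 32768):
        print(size, round(measure_msps(scale_by_two, size), 2), "Msps")
```

Taking the best of several runs reduces the effect of one-off scheduling hiccups, which matters when comparing small data sizes.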
Also, when you get gr-clenabled running, it'll create two separate GNU Radio block groups. The blocks in the OpenCL-Accelerated group are the ones that actually run faster on GPUs, since their calculations can be done in parallel. Those in the OpenCL-Enabled group function in OpenCL, but their performance is generally worse than the native CPU blocks.
I'm also pushing some updates to it tonight to clean up some of the processing, but there are no major performance updates in this pass.