You are correct: it is very difficult to do testing perfectly, especially if you are not practiced at it and don't have a well-polished method. That's me all right, and I happily accept that it means I'm not in a position to make a definitive statement that other people can trust and rely on.
Having said all that, I do think that if you set about repeating tests as best you can, you will eventually average out the worst of the problems and end up with a reasonable idea of the typical differences, at least as far as your own workflow is concerned. Maybe not good enough to publish as a formal review, but good enough to inform your own opinion.
Now, I am only really interested in big differences: the kind that would make any viewer say "Wow, that one is so much better". I don't much care about the sort of minute differences that mean you have to squint at the pictures for half an hour, scratching your head and umming and ahhing until you eventually, doubtfully, say "I think that one might be a tiny bit better, but I'm not really sure". What I'm looking for is differences that would routinely show up in normal fieldwork, not just under careful test conditions.
In that spirit, I can overlook small methodological problems that don't materially change the result for my own purposes but would give critics cause to complain. In the interest of fair play, though, I'm happy to keep repeating the test in an attempt to remove those caveats.
I'd be very surprised if tightening the test parameters made me change my conclusions substantially; the images are just too close to call.