The Why and How of Blind Testing

Let’s face it: listening isn’t objective at all, especially when it comes to subtle details. The influence of preconceptions and your current state of mind can be huge. And if we can’t back our impressions with measurements, the last resort for getting an objective view of sonic reality is blind testing. Here’s a short introduction to the basic issues and techniques.

I recently experienced the power of autosuggestion in a very impressive way. I was working on a DSP algorithm and made a small change where I wasn’t sure whether it would make an audible difference. To compare it with the previous version, I abused a knob in my testing environment as a switch: turning the knob (a software knob, not a real one) past halfway would activate my change. I tested and listened a couple of times, switching back and forth between the two settings. And I swear there was a perfectly audible difference, although I wasn’t sure which one I liked better.

Nevertheless, the difference was subtle, so I became sceptical. To check whether this difference was really there, I turned the knob fully clockwise, closed my eyes and turned it down again very slowly until I heard a sudden change in sound. I was sure the sound changed, but when I opened my eyes again, the knob was at around 60%, so it hadn’t switched at all. I tried this a couple more times, and suddenly I heard no difference at all.

I guess everyone in the audio world knows more than one personal story like this. Tweaking subtle EQ changes for several minutes before realizing that it’s bypassed is probably something everyone can relate to. You really hear what you’re doing until you realize that you’re not doing anything at all.

So much for the “trust your ears” argument in technical discussions. I would phrase it differently: don’t trust anyone who trusts his ears!

OK, that was probably unfair. Our hearing isn’t that bad. It’s actually pretty amazing. And in fact, measurements and objective listening tests neither have taste nor artistic understanding. But everybody should be sensitive to the issue.

Usually what we want to do is compare two alternatives to each other and find out what the differences are. Let’s look at some of the problems (and possible solutions) with such comparisons.

Boring Differences

There’s a class of boring differences that is clearly audible and has a great effect on perception, but is sometimes hard to identify as boring. Level differences are probably the best example. A tiny level inaccuracy of 0.5 dB can already make a world of difference, and you won’t necessarily notice it as a level difference at first.

So getting rid of such inaccuracies is the first step in setting up a comparison experiment. Take very good care especially in adjusting levels!
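If you’re comparing digital recordings, level matching can be automated instead of done by ear. Here’s a minimal sketch in pure Python (the function names are my own, for illustration) that computes the gain needed to bring a candidate signal to the same RMS level as a reference:

```python
import math

def rms_db(samples):
    """RMS level of a signal in dB (samples as floats in [-1, 1])."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square)

def matching_gain(reference, candidate):
    """Linear gain to apply to `candidate` so its RMS matches `reference`."""
    diff_db = rms_db(reference) - rms_db(candidate)
    return 10 ** (diff_db / 20)

# Example: the candidate is the same signal at half amplitude (about 6 dB down)
ref = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
cand = [0.5 * s for s in ref]
g = matching_gain(ref, cand)  # ≈ 2.0, boosting the candidate back to the reference level
```

RMS matching is only one possible criterion; for material with very different spectra, a loudness measure would arguably be fairer, but the principle is the same.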

Clicks, Cuts and Side Effects

Next is the switching. Depending on what you are trying to compare it might not be very easy to switch between the options. If you need to crawl behind your racks to change cables, forget it. You need to have a simple and fast switching mechanism in place that lets you switch back and forth freely. Consider making test recordings that you can easily switch.

Bonus points are up for grabs if you can set up a switching mechanism that doesn’t make any noise when switching. A click or pop is already a hint for your brain that something should happen. Try to get rid of these expectation-inducing cues!

Another short note on changing cables: the contacts can degrade if the plugs have been connected for a long time, especially cheap ones. Just unplugging and replugging can already make quite some difference. Keep that in mind when testing expensive cables.

Blind Testing

The next step is to make yourself blind to the knowledge of which option is which. Consider letting a friend do the switching. But make sure they keep a poker face, or avoid eye contact during the experiment.

Or use a tool that can shuffle and hide the options for you, like Hofa’s free 4U+ Blind Test plugin.

The Advanced Version

Having two options A and B already suggests that there is a difference between the two. This too is a bias that should probably be avoided. To increase the level of difficulty you can do a so-called ABX test, where you switch between three versions: A, B and X. X is the same as either A or B, but you don’t know which one.

This way the test concentrates more strongly on finding out whether a difference exists at all. The questions of which one is which, or which one is better, can be asked in addition.
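To make the procedure concrete, here’s a small sketch of the ABX logic in Python. The `listener` callback stands in for the human step of auditioning A, B and the hidden X and making a guess; all names here are illustrative, not from any particular tool:

```python
import random

def run_abx_trials(n_trials, listener, seed=None):
    """Simulate an ABX test over n_trials.

    `listener(a, b, x)` receives the two labelled stimuli plus the hidden X
    and must return the stimulus it believes X matches.
    Returns the number of correct identifications.
    """
    rng = random.Random(seed)
    stim_a, stim_b = "version A", "version B"  # placeholders for real audio
    correct = 0
    for _ in range(n_trials):
        x = rng.choice([stim_a, stim_b])  # hidden assignment, reshuffled per trial
        if listener(stim_a, stim_b, x) == x:
            correct += 1
    return correct

# A listener with perfect discrimination always scores 100%:
perfect = lambda a, b, x: x
# A listener who ignores the audio is just guessing:
guesser = lambda a, b, x: random.choice([a, b])
```

The essential point is in the loop: X is reassigned randomly on every trial, so neither a lucky first guess nor a memorized pattern helps the listener.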

Evaluation

Here comes the really unfunny part. For a meaningful experiment, you need to make several trials, record the results, and evaluate them later.

If what you’re testing isn’t super-easy to hear, you won’t make the “right” decision in every trial. What you want to know is how often you picked the “right” option, whatever “right” means in your experiment.

For example in an ABX test you would evaluate how often you correctly identified the version that is different from X. You want that percentage to be significantly higher than 50%, which would be the score for a totally random pick. If your score is significantly lower than 50%, congratulations! You perform even worse than flipping a coin.

But what does “significantly” mean? That’s a tough question, because we’re talking about a limited sample from a random process. Even with random choices, there is still a small probability that you get 100 out of 100 trials right.

Let’s look at some examples for a score of 75%.

If you do 4 trials and get 3 of them right, you get that score. But the probability of doing at least that well by flipping a coin is still 31.25%. Double the number of trials (6 of 8 right) and the probability drops to about 14.5%. Doubled again, with 16 trials (12 right), it’s down to about 3.8%.
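If you want to verify such numbers yourself: the chance of scoring at least k out of n by coin-flipping is a one-sided binomial tail sum. A minimal Python sketch:

```python
from math import comb

def p_value(n_trials, n_correct):
    """Probability of getting at least n_correct out of n_trials
    right by pure guessing (one-sided binomial test, p = 0.5)."""
    tail = sum(comb(n_trials, k) for k in range(n_correct, n_trials + 1))
    return tail / 2 ** n_trials

print(round(p_value(4, 3) * 100, 2))    # 31.25
print(round(p_value(8, 6) * 100, 2))    # 14.45
print(round(p_value(16, 12) * 100, 2))  # 3.84
```

Note the “at least” in the tail sum: what matters for significance isn’t the chance of hitting your exact score, but of doing that well or better by accident.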

So you see you need to make a lot of trials to reduce the probability that you got this result by accident. Quite some effort, right?

Conclusions

Yes, quite some effort. In this post you got a small glimpse of how your hearing can sometimes betray you. You also learned some basic techniques to reduce these dishonest tendencies. And finally you learned the sad truth that you can never rely 100% on such experiments unless you put in an awful lot of work.

The purpose was to give you an overview and some intuition about the challenges and implications of listening tests. The thing is, it takes a lot of effort to get them right. As a consequence, I would advise against putting too much weight on your own results. Rigorous testing is best left to hearing researchers, and there are plenty of examples of even experts getting it totally wrong.

Listening tests can be fun though, and I won’t try to stop anyone from doing them. Just keep in mind that they probably won’t provide you with a reliable and definite outcome. And that’s probably OK.