How English Speakers Distinguish Similar-Sounding Words

What Your English Textbook Didn't Teach You: How English Speakers Distinguish Similar-Sounding Words and the Sounds That Compose Them

Introduction

We’ve all experienced doing something a certain way simply because we were taught that’s the way it should be done. It’s much the same when it comes to learning a language in a class or from a textbook; we’re taught that certain words and letters should be pronounced in a certain way and that the trick to pronouncing them correctly is to get all of the sounds right.

It turns out that language is a little more complex than that, especially English, where the way words are written doesn’t always faithfully correspond to the way words are pronounced. For example, most English speakers would agree that the words race and raise are pronounced the same way, except for the last sound. To keep things simple, let’s say that the last sound in race is something English speakers would call an [s] sound, and that the last sound in raise is something that would be called a [z] sound.

Before we proceed, it’s important to clarify one thing: written letters like the ones you’re reading right now are not sounds; they are symbols that encode sounds. Almost no written languages possess a perfect one-to-one correspondence between the sounds of a language and the symbols that are used to encode them. In English, for example, written c can be used to encode the sound [k] (cat), [s] (celery), and sometimes in borrowed words [ch] (cappuccino). Similarly, it is not the case that every time we see a z it is to be pronounced as [z]. But none of this should be particularly shocking.

In this article, I’m going to build on this observation and discuss some aspects of English pronunciation that are, to my knowledge, never taught in classes or textbooks. These details are incredibly important, however, especially when it comes to teaching or learning English as a second language. My goal is to shed some light on these details, explain why they’re so important, and discuss their implications for teaching.

The take-away here is that in some cases, we can unintentionally bias students to ignore important English pronunciation details.

When “different” sounds aren’t really different

Before we dive into the details, I need to explain a few things that will help us make sense of the discussion that follows. People who study the sounds of human languages often use tools to analyze how these sounds are articulated. One of these tools is the spectrogram, which is basically a picture of how much energy a sound carries at each frequency over time.

We don’t need to know how to interpret a spectrogram or what these waves mean in today’s discussion. All we need to know is one little tiny detail about the difference between how we pronounce [s] and how we pronounce [z].

If you pay attention to yourself when you say ssss (like the sound a snake makes) and zzzz (like the sound of a bee buzzing), you will notice that the vocal cords in your throat do not vibrate when you say [s] but do vibrate when you say [z]. This difference is what linguists call voicing: [s] is a voiceless sound (no vibration), whereas [z] is a voiced sound.

Now, if we look at spectrograms corresponding to [s] and [z], which I have included below (these are measurements of me saying ssss and zzzz), we will see one important difference between these sounds. Looking at the bottom of each graph (the area I’ve boxed in red), even a casual observer will notice that [s] has a very light-colored bottom area, while [z] has a very dark-colored bottom area.

What this tells us is that [s] has very little low-frequency (bass) sound, while [z] has quite a lot of it. A moment’s reflection reveals why we should expect this. When we vibrate our vocal cords (by humming, for example) we create a very low-frequency bass sound. This low-frequency bass sound is exactly what we make when we articulate [z], and this bass sound is absent when we pronounce [s]. The rest of the details in the spectrograms are not important to us at this time. All spectrograms were generated using Praat, open-source software for acoustic analysis (Boersma & Weenink 2016).
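To make the idea of “low-frequency energy” concrete, here is a minimal Python sketch. It does not analyze my recordings or reproduce what Praat does; it builds two toy signals and measures what share of their energy falls below 500 Hz. Everything in it is an assumption made for illustration: the 8000 Hz sample rate, the 120 Hz tone standing in for vocal cord vibration, the 500 Hz cutoff, and the function names.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000  # samples per second (assumed for this toy example)
N_SAMPLES = 400     # 0.05 seconds of signal


def frication(n, seed=0):
    """Hiss-like noise: differencing white noise boosts the high frequencies."""
    rng = random.Random(seed)
    white = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    return [white[i + 1] - white[i] for i in range(n)]


def voicing(n, f0=120.0):
    """A pure 120 Hz tone, a crude stand-in for vocal cord vibration."""
    return [math.sin(2 * math.pi * f0 * i / SAMPLE_RATE) for i in range(n)]


def low_band_fraction(signal, cutoff_hz=500.0):
    """Share of spectral energy below cutoff_hz, computed with a naive DFT."""
    n = len(signal)
    low = total = 0.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        power = abs(coeff) ** 2
        total += power
        if k * SAMPLE_RATE / n < cutoff_hz:
            low += power
    return low / total


# Toy [s]: hiss only. Toy [z]: the same hiss plus the low-frequency tone.
s_like = frication(N_SAMPLES)
z_like = [a + b for a, b in zip(frication(N_SAMPLES), voicing(N_SAMPLES))]

print(f"[s]-like low-band share: {low_band_fraction(s_like):.3f}")
print(f"[z]-like low-band share: {low_band_fraction(z_like):.3f}")
```

The toy [z] should report a much larger low-band share than the toy [s], which is the numerical counterpart of the dark band at the bottom of a spectrogram.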

Now that we know what distinguishes a simple [s] from a simple [z], we can take another step forward in this discussion. When I teach English pronunciation to students, there are often situations in which students expect certain words to be pronounced in certain ways. My experience is that they expect this because of the spelling, or because they were taught to pronounce things in a general and prescriptive way. The point I’m about to make is that this is not simply bad practice; it in fact biases learners to ignore important information that English speakers actually make use of in pronouncing and understanding words!

It’s general practice in most cases to teach that the final sound in a word like mace is pronounced as [s] and the final sound in a word like maze is pronounced as [z]. Indeed, if you ask a native speaker of English whether this is the case, the answer will most definitely be ‘yes’. These are two distinct words, distinguished only by switching [s] for [z]. Right?

Actually, this isn’t always true. To support this claim, I’ve included two spectrograms: one of mace [s] and one of maze [z]. I’ve segmented (broken up) these words into pieces based on a traditional analysis of the individual sounds. For those who are familiar with the International Phonetic Alphabet (IPA), I’ve used a very general notation along these conventional lines. For those unfamiliar with the IPA, there’s nothing to worry about. All you need to pay attention to here is the [s] in mace (annotated as [mes]), and the [z] in maze (annotated as [mez]). I’ve marked the critical region with a red box. Also, the duration window is the same for both figures, which lets us compare them directly.

Recall the difference between pronouncing [s] and [z] that I highlighted above. Do you notice anything unexpected when you look at the [s] in mace and the [z] in maze? It’s no trick—there is essentially no difference in how the [s] sounds and how the [z] sounds, and the [z] in maze looks nothing like the isolated [z] I showed you earlier. In fact, if you were to isolate the [s] in mace and the [z] in maze and listen to them independently, you would report for both that you heard an [s] being spoken.

And this is by no means something that I discovered; it’s been the subject of a great deal of linguistic research for decades (Liberman et al. 1957). And just to be absolutely clear here: I am not saying that [s] and [z] are the same sound (they’re not). I am pointing out that in cases like this, we recognize mace and maze as two different words, and even claim that it’s because one has [s] and the other has [z], but in fact in this case there is no clear difference between the [s] and the [z].

But wait: aren’t mace and maze two different words? Yes, they definitely are. Then how are these two words being distinguished by English speakers? If we return to the graphs of mace and maze and look carefully, we’ll notice that there’s something else very different about them: the vowel in maze is much longer than the vowel in mace (in fact, it’s nearly twice as long). Of course, this difference will not have exactly the same duration every time I say (or someone else says) these words. Speakers differ from one another, and durations can certainly be affected by factors like a speaker’s rate of speech. But the vowel length difference will still be evident if everything else is held constant.
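The vowel length comparison can be sketched in a few lines of Python. The segment boundaries below are hypothetical values invented for illustration, not measurements taken from my recordings, and the function names are my own. The second function shows one possible way (an assumption, not an established procedure from this article) to reduce the effect of speech rate: expressing the vowel as a share of the whole word.

```python
# Hypothetical segmentations as (label, start, end) in seconds.
# These numbers are invented for illustration only.
MACE = [("m", 0.00, 0.08), ("e", 0.08, 0.22), ("s", 0.22, 0.45)]
MAZE = [("m", 0.00, 0.08), ("e", 0.08, 0.35), ("z", 0.35, 0.52)]


def duration(segments, label):
    """Return the duration in seconds of the first segment with this label."""
    for seg_label, start, end in segments:
        if seg_label == label:
            return end - start
    raise ValueError(f"no segment labelled {label!r}")


def vowel_ratio(word_a, word_b, vowel="e"):
    """How many times longer the vowel is in word_a than in word_b."""
    return duration(word_a, vowel) / duration(word_b, vowel)


def vowel_share(segments, vowel="e"):
    """Vowel duration as a share of the whole word, to soften rate effects."""
    word_length = segments[-1][2] - segments[0][1]
    return duration(segments, vowel) / word_length


print(f"maze vowel / mace vowel: {vowel_ratio(MAZE, MACE):.2f}")
print(f"vowel share in mace: {vowel_share(MACE):.2f}")
print(f"vowel share in maze: {vowel_share(MAZE):.2f}")
```

With these made-up boundaries the maze vowel comes out a little under twice as long as the mace vowel, and it also takes up a larger share of its word.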

So is it the case that English speakers are using the length of the vowel in these words to distinguish them from each other? The answer is yes, at least when there is no context to help. When we hear something like “The knight has a ___”, the context alone makes it reasonably easy to predict that the next word is more likely to be mace than maze (if asked to choose between these two words). When context offers no such help, vowel length information is much more crucial.

But why would English speakers make use of vowel length information? English isn’t a language like Japanese, where vowel length is a core component, is it? The answer is that cues like the length of a sound can become important under certain circumstances, especially when other (default) cues are not as informative.

To unpack this statement, I’ve created the following graphs that show what we typically think of as [s] and [z] versus what happens to [s] and [z] when they come after a vowel at the end of a word, as in mace and maze. Let’s imagine that English speakers have some sort of “sound category” in their minds that is based on all of the times they’ve heard a particular sound that they recognize as somehow distinct. Further, let’s imagine that there is a unique sound category for [s] and another unique sound category for [z], and that both are very clearly distinguished by almost no voicing (for [s]) and lots of voicing (for [z]), just as I have shown in the figure on the left below. Let’s assume that this figure reflects the distributions that English speakers have somewhere in their brains when they think about [s] and [z].

But what happens to these sound categories when [s] and [z] are at the end of a word, as in mace and maze, or in the middle of a word? These two distinct categories start to overlap to the point where they are very difficult (if not impossible) to discriminate. When categories are unclear and overlapping, as they are in the figure on the right above, it’s natural to look to other information when it comes to distinguishing sound categories and words. (Note: I’m not saying that the overlapping of categories causes speakers to turn to other sources of information. I’m only claiming that these two phenomena are correlated in an interesting way.)
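The notion of overlapping categories can be put into numbers with a small Python sketch. It assumes, purely for illustration, that each sound category is a normal distribution over some “amount of voicing” on a 0-to-1 scale. The means and standard deviations are hypothetical, chosen only to mimic the clearly separated and heavily overlapping situations described above; `NormalDist.overlap` is part of Python’s standard `statistics` module and returns the shared area under two such curves.

```python
from statistics import NormalDist

# Hypothetical voicing distributions as NormalDist(mean, standard deviation).
# The values are invented to mimic the two situations described in the text.
ISOLATED = {"s": NormalDist(0.10, 0.05), "z": NormalDist(0.90, 0.05)}
WORD_FINAL = {"s": NormalDist(0.15, 0.10), "z": NormalDist(0.20, 0.10)}


def category_overlap(categories):
    """Shared area under the [s] and [z] curves: 0 = distinct, 1 = identical."""
    return categories["s"].overlap(categories["z"])


print(f"overlap in isolation: {category_overlap(ISOLATED):.3f}")
print(f"overlap word-finally: {category_overlap(WORD_FINAL):.3f}")
```

An overlap near 0 means voicing alone separates the two categories almost perfectly; an overlap near 1 means a listener relying on voicing alone would be close to guessing.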

Though I’ve leaned heavily on the examples of mace and maze to support my point here, the observation that English speakers use vowel length to distinguish similar-sounding words generalizes much more broadly. For example, this also applies to the sound pairs [t] and [d] (e.g. mate vs. made), [f] and [v] (leaf vs. leave), [sh] and [zh] (fisher vs. fissure), and many other sound pairs that differ only in voicing. The important point here is that for most native speakers of American and British English, similar sounds like [s] and [z] are not perfectly consistent sound categories. Instead, they are variable, and in some environments, such as at the end of a word, they overlap and are not really distinguishable. In cases like this, English speakers appeal to other linguistic cues, such as the length of the preceding vowel, in order to decide what word they heard.
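One way to picture a listener combining cues is the toy classifier below. It is a sketch under strong assumptions, not a claim about how the brain actually does it: each word is described by two independent normal distributions (amount of final voicing and vowel duration in milliseconds), all parameter values are hypothetical, and the word with the higher combined log-likelihood wins. The names `CUES`, `evidence_for_maze`, and `classify` are my own.

```python
import math
from statistics import NormalDist

# Hypothetical cue distributions per word; the numbers are invented.
CUES = {
    "mace": {"voicing": NormalDist(0.15, 0.10), "vowel_ms": NormalDist(150, 30)},
    "maze": {"voicing": NormalDist(0.20, 0.10), "vowel_ms": NormalDist(270, 40)},
}


def log_likelihood(word, token, cues):
    """Sum of log densities of the token's cue values under one word."""
    return sum(math.log(CUES[word][cue].pdf(token[cue])) for cue in cues)


def evidence_for_maze(token, cue):
    """Positive values favour maze, negative values favour mace."""
    return log_likelihood("maze", token, [cue]) - log_likelihood("mace", token, [cue])


def classify(token, cues=("voicing", "vowel_ms")):
    """Pick the word whose cue distributions best explain the token."""
    return max(CUES, key=lambda word: log_likelihood(word, token, cues))


# Two tokens with the same ambiguous final consonant but different vowels.
long_vowel = {"voicing": 0.18, "vowel_ms": 260}
short_vowel = {"voicing": 0.18, "vowel_ms": 140}

print(classify(long_vowel), classify(short_vowel))
print(f"voicing evidence: {evidence_for_maze(long_vowel, 'voicing'):.2f}")
print(f"vowel evidence:   {evidence_for_maze(long_vowel, 'vowel_ms'):.2f}")
```

In this toy setup the voicing cue contributes almost nothing to the decision, while the vowel duration tips it firmly one way or the other, which mirrors the pattern described above.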

Implications for learning and teaching English

Frequently I hear speakers of English as a second language very forcefully pronounce [z], [d], and [v] in words like raise, raid, and rave (to name a few). Many of these learners have been taught that this is the way such words should be pronounced, or have been led to believe that the spelling of the word dictates this pronunciation. But as we have seen here, neither assumption is universally accurate.

As teachers, we should arm ourselves with as much knowledge as we can about what is going on in our brains (and our mouths!) when we distinguish words — both when we produce them and when we understand them. Similarly, as learners, we should be consciously aware that (1) spelling does not give us as much information as we would want it to; and (2) sometimes, cues like vowel length are vital to native speakers’ decision-making process when they decide what word they have just heard. And this is especially the case when sounds that would otherwise be helpful in making the decision are nearly indistinguishable.

References

Boersma, P. & Weenink, D. (2016). Praat: doing phonetics by computer [Computer program]. Version 6.0.14, retrieved 11 February 2016 from http://www.praat.org.
Liberman, A. M., Harris, K. S., Hoffman, H. S. & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-368.
