The Evolution of Human Speech

Jan 29, 2022

An individual’s ability to express meaning to another depends on mutually understood symbols or systems. For the modern human, communication is a complex motor activity encompassing the rapid transfer of information from one person, or persons, to another. Although speech and language are the most frequent behaviors Homo sapiens engage in, they often go overlooked (note that the two are separate entities, even though the terms are often used interchangeably). “Speech is one characteristic feature that has evolved in humans and is by far the most powerful form of communication in the Kingdom Animalia” (Kumar, Bidarkotimath & Revankar, 2016, p. 1158). Our ability to talk and engage in such complex behavior is what differentiates us from other species. Canines are capable of understanding words to some extent, but they cannot reply vocally using speech.

In his book, On the Origin of Species, Charles Darwin claimed that our ability to produce speech is what enhanced our biological fitness, and that because other members of the human lineage, such as Neanderthals, did not inherit this trait, their extinction was inevitable. Darwin’s claim raises the question of whether speech has been necessary for survival. After all, individuals within the deaf population rely on gestures to communicate, in the same way a pilot relies on the gestures of a runway worker to assist in maneuvering the plane. How and why human speech came about is one of the most profound and fascinating questions that remains unanswered.

Noam Chomsky’s demonstration of certain linguistic principles has had considerable influence in the field of Speech-Language Pathology (SLP). Professionally socialized under this context, I committed early to Chomsky’s nativist model emphasizing language learning as a genetic endowment, an innate characteristic of all human beings. According to Chomsky, a particular gene or “circuit” must exist within the human brain and is responsible for our ability to produce and understand speech. Moreover, Chomsky notes that laws of order (biological, cognitive, and linguistic) are embedded within every organism; our knowledge is imposed upon the perceptual world, not derived from it.

Concerned with the origins of knowledge from infancy through the trajectory into adulthood, the field of Applied Behavior Analysis relies heavily on the works of Jean Piaget and B.F. Skinner. Skinner’s model, both empirical and simple, suggests language has reinforcing value, and it is because of those reinforcing components that human beings have evolved to use language. Those same reinforcing components are what drive human behavior. According to Skinner, the father of Behaviorism, human beings behave in ways that work for them: we behave as we do because environmental variables set up contingencies that make reinforcement available or allow escape from an aversive condition, object, or event.

Piaget understood language learning as an intricate interaction between the child and their environment. Throughout the semester, we discussed Piaget as having been very much experience-based, given his belief that an environment enriched with numerous learning opportunities would propel a child from one cognitive-developmental stage to the next. As dedicated to and captivated by both fields of study as I am, I am inevitably left in a quandary. The predicament lies in the nature of human behavior, specifically in how human beings learn: a complicated process shaped by maturational factors and furthered by cognitive development, psychological variables, and sociocultural influences. My innate curiosity about human cognition and behavior feeds my desire for exploration. Knowing that my clinical decisions regarding patient care are the result of my knowledge, or lack thereof, motivates me to understand the evolutionary history of human speech and language to the best of my ability.

Philip Lieberman (2002), a leading researcher in the Department of Anthropology at Brown University, provides a comprehensive description of current models explaining the origin of human speech. In his publication, On the Nature and Evolution of the Neural Bases of Human Language, Lieberman explicitly highlights the difference in communication modalities between apes and modern humans. His review of the literature includes comparative studies assessing the brain and behavior of other species as well as studies analyzing brain images of living Homo sapiens.

Lieberman (2002) compares and contrasts the relevant studies along with his own findings to untangle the very convoluted topic of human speech. A particular focus of Lieberman's work is brain localization and the inaccuracy of attributing speech to one area of the brain. Instead, he emphasizes a blending of processes, an orderly strategy in which each neuroanatomical component contributes to the formation of cortical-striatal-cortical circuits.

These circuits account for human behavior, including human speech and language, and our ability to generate and comprehend such complex, natural phenomena (Lieberman, 2002). Moreover, Lieberman highlights the consequences of pivotal moments in the evolutionary process, such as the development of the human supralaryngeal vocal tract (SVT), which he attributes to principles first laid out by Charles Darwin. Although the SVT exists in every species, certain qualities of this structure differ in the modern human as opposed to other animals. The shape of the SVT is not fixed, as any movement of the tongue, lips, or other articulators changes it. The SVT of the modern human, unlike that of nonhumans, consists of a larynx resting low in the throat and a posterior tongue that is round in shape. These characteristics allow humans to manipulate the articulators in a detailed and rapid fashion; the extrinsic tongue muscles maneuver the tongue in a variety of ways, yielding abrupt and extreme changes in the cross-sectional area of the human SVT (Lieberman, 2002). In all other living animals, the larynx sits much higher, locking into the nasopharynx.

Lieberman (2002) argues that although there is genetic evidence to support the divergence of Neanderthals from humans about 500,000 years ago, similarities between the two are limited to skeletal morphology, with the exception of comparisons made between an infant’s SVT and that of a Neanderthal. The period from birth until the age of two is the only stage of development in which the human SVT resembles that of Neanderthals (Lieberman, 2002). A connection exists between the information Lieberman provides and my background knowledge in SLP; knowing that typically developing young children are unable to produce vowel sounds, especially /a/ and /i/, until the age of five or six, the similarity is evident.

Fitch and Reby, two researchers out of Harvard University, investigated the larynx of thirteen male deer from two species in 2001. Among the many behaviors deer exhibit, the roar was noted for displaying a retracting protuberance on the body of each subject that is not observable otherwise. Measuring this behavior was central to the researchers’ hypothesis, which held that the observed marking represented the larynx. Specifically, Fitch and Reby (2001) measured the formant frequencies of each roar as well as the position of what they had hypothesized to be the larynges of seven French deer.

Confirming the length of the vocal tract is a prerequisite to measuring the larynges (Fitch & Reby, 2001). With this consideration, the researchers measured the distance from the corner of the mouth to the projection for each subject, and with this information they were able to locate the larynx. The results were more striking than the researchers had anticipated: the anatomical mark indicative of the hypothesized larynx retracted so far that the distance measured from the lips to the mark nearly doubled during a roar (Fitch & Reby, 2001).
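To make the acoustic consequence of that elongation concrete, here is a minimal sketch. It rests on a common textbook simplification, not on the method Fitch and Reby used: the vocal tract is treated as a uniform tube closed at the larynx and open at the lips, whose resonances fall at F_n = (2n - 1) * c / (4 * L). The speed of sound and both tract lengths below are assumed, illustrative values rather than measurements from the study. Under that model, doubling the tube length halves every formant frequency.

```python
# Assumed speed of sound in warm, humid air (illustrative value).
SPEED_OF_SOUND_M_PER_S = 350.0


def tube_formants(length_m, count=3, speed=SPEED_OF_SOUND_M_PER_S):
    """Resonances of a uniform tube closed at one end and open at the other.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    This is a simplification of a real vocal tract.
    """
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    return [(2 * n - 1) * speed / (4.0 * length_m) for n in range(1, count + 1)]


# Hypothetical tract lengths chosen only to show the effect of doubling.
resting = tube_formants(0.40)
roaring = tube_formants(0.80)

for n, (f_rest, f_roar) in enumerate(zip(resting, roaring), start=1):
    print(f"F{n}: {f_rest:.1f} Hz at rest, {f_roar:.1f} Hz when elongated")
```

Lower, more closely spaced formants are what a listener would hear from a longer tube, which is why the formant frequencies of a roar can serve as an indirect measure of vocal tract length.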

The observed protuberance was hypothesized to correspond to an elongation of the subjects’ vocal tract (Fitch & Reby, 2001). Testing the anatomical basis of this movement involved the use of radiographs and post-mortem dissection of four adult red deer and three adult fallow deer. In doing so, the researchers discovered that the larynges of male red and fallow deer shared a similarity with the larynx of modern humans: each was positioned low within the throat. The two differed, however, in that the velum in Homo sapiens loses contact with the epiglottis as the larynx descends in early childhood (Lieberman, 2002), whereas dissection of the deer revealed an elastic, elongated velum.

Writing in the International Journal of Primatology, Neaux, Gilissen, Coudyzer and Guy (2015) analyze the structure of the human face in an attempt to determine whether its arrangement is due to the angle of the cranial base. Understanding the craniofacial structure of the hominid, and more specifically the cranial base angle, is pertinent to our understanding of the changes that took place during hominid evolution and the implications these alterations have for discovering the origin of speech. Three individual hypotheses associate facial projection, facial orientation, and facial length with basicranium flexion, or the angle of the cranial base. To measure the facial projection, length, and orientation of each subject, markers were placed according to the operational definition of the corresponding characteristic. Because Pan and Gorilla are the closest living relatives of Homo sapiens (Neaux et al., 2015), the crania of 32 Pan troglodytes and 27 Gorilla were measured and then compared to the crania of 66 Homo sapiens. By scanning each cranium with a medical CT scanner (adjustments were made to the system to account for the size and thickness of each specimen), the researchers were able to construct three-dimensional models. The images produced were converted into data using Avizo v6.0, a visualization software package intended for use with digitized images.

Results of this study indicated that an increase in cranial base flexion, as observed in Homo sapiens and Pan, is related to facial orientation for both taxa and related to facial length only for Pan. The hypothesis that facial projection and basicranium flexion were dependent upon one another was not supported.

Kumar, Bidarkotimath and Revankar (2016), all professors within anatomy departments at various institutes, researched the origin of human speech with the intent of presenting their own hypothesis. Their method involved creating sketches of the cranial skeletons of all the major classes of phylum Vertebrata after analyzing prototypes of each class. Additionally, Kumar et al. (2016) studied the skeleton of a mammal in the Department of Anatomy at Hegde Medical Academy, whereas the Vertebrata prototypes were studied in museums located in India, including the Museum of Natural History, Imphal.

After examining the crania, the researchers were able to identify a common trait found in all classes of the Animal Kingdom. A successive formation of a deep notch at the base of the skull (Kumar et al., 2016) was observed to become more pronounced as the hierarchy ascended. As Homo sapiens came to adopt an upright posture, this notch at the base of the skull served as an apparatus for the head to balance on (Kumar et al., 2016). According to Kumar et al. (2016), bipedalism and this small area below the brain are two of the most meaningful adaptations to occur during evolution that have contributed to human speech.

Located in West Africa, Taï National Park served as the study site for research led by Klaus Zuberbuhler, a researcher at the Max Planck Institute for Evolutionary Anthropology. After running a pilot study which focused on the acoustic alarm calls of two different species, Zuberbuhler (2000) further investigated the response behavior of Diana monkeys and Campbell monkeys. The aim was to determine whether Diana monkeys, a different species from Campbell monkeys, respond to the call of a Campbell monkey because they understand the message the call serves to convey (i.e., a predator warning) or because they are responding to the acoustic features of the call. Zuberbuhler used playback experiments, tape-recording the Diana monkeys’ vocal responses to played-back calls from Campbell monkeys. Specifically, the calls served to signal a warning of the presence of a leopard or an eagle. Zuberbuhler also measured the intensity of each call and created visualizations of the acoustics using a spectrograph. The first of two experiments was to determine whether Diana monkeys respond to Campbell’s alarm calls (Zuberbuhler, 2000). The second experiment aimed to investigate whether Diana monkeys understood the context of the call (Zuberbuhler, 2000). “The data are consistent with the hypothesis that non-human primates are able to use acoustic signals of diverse origin as labels for underlying mental representations” (Zuberbuhler, 2000).


Lieberman (2002) disputes the brain-behavior models attributing human language to specific areas of the brain, such as Broca’s and Wernicke’s areas. Based on his findings, a complete loss of speech depends on damage to the subcortical area of the brain and not merely on damage to Broca’s or Wernicke’s area. These findings uphold that speech cannot be localized to one area of the brain if damage to that area does not result in a complete loss of speech. Furthermore, Lieberman often references Charles Darwin to support his explanation and findings on the development of Homo sapiens: “A lowered larynx has thus been considered a key anatomical prerequisite for modern human speech, and extensive debate had focused on precisely when in hominid evolution the larynx descended” (Lieberman, 2002).

Based on findings from Fitch and Reby (2001), one may challenge the amount of emphasis Lieberman (2002) gives to the human SVT. To improve the generalizability of their findings, Fitch and Reby (2001) went on to replicate their study in other independent lineages and reached the same results. The researchers note that their finding implies a low resting larynx must have some other, non-phonetic function, given that non-speaking animals also have a low-positioned larynx. The research conducted by Fitch and Reby (2001) is noteworthy as it contrasts with arguments made by Lieberman (2002) that the SVT of Homo sapiens is diagnostic of speech. Whereas Lieberman’s hypothesis proposes that the lowered larynx is exclusive to our species, Fitch and Reby show this claim to be inaccurate. Not only did the subjects have a low resting larynx analogous to that of Homo sapiens, but this defining characteristic became more pronounced during the observed behavior (Fitch & Reby, 2001).

However, Fitch and Reby (2001) did not take into consideration how the observed subjects move through their environment. Homo sapiens have adopted an erect posture, whereas deer walk on four limbs rather than two. Perhaps this change in stance, from horizontal to vertical, permits the production of speech. The researchers counter Lieberman’s theory based on their findings; however, their hypothesis has little validity without taking this difference into consideration.

Finally, Fitch and Reby (2001) did not touch on the neural bases of language. Although they are accurate in stating that a lowered larynx is not unique to the modern human, the claim that the SVT is not diagnostic of speech lacks validity without considering neurological influences. To argue that features of the modern human SVT exist in other species is accurate based on the findings of this study, as is the claim that the SVT allows other animals to generate vocalizations. Nonetheless, the ability to create sound is distinct from the capacity to create and comprehend speech, with its complicated syntax. According to Chomsky, syntax, the rules for how to arrange and construct speech, is often seen as the defining aspect of human language (Piattelli-Palmarini, 1983). Neurophysiological studies indicate that the cortical-striatal-cortical circuit is the mechanism in modern humans that makes syntax possible (Lieberman, 2002). Cognitive capability is responsible for more than just our capacity to recognize the rules of speech; our ability to adjust and govern the sequence of individual speech sounds is possible because of this neurological framework, which does not occur in other animals, including deer.

Parallels between the research conducted by Lieberman (2002) and Neaux et al. (2015) highlight transformations that took place during hominid evolution and that are associated with craniofacial integration, specifically a synthesis that resulted in the modern human SVT and brain size. Based on their findings, Neaux et al. (2015) raise a substantial point: as basicranium flexion increased in each subject, so too did the cranial capacity of each specimen measured.

One can better understand this by tilting the chin downward and noticing the lengthening of the posterior aspect of the neck. Space is created, which from an evolutionary standpoint provides more room for an increase in brain size. “As evolution progressed, natural selection played a role in developing our brain size, and this may have played a part in our ability to understand and generate more complexity” (Lieberman, 2002, p. 52).

Through their findings, Neaux, Gilissen, Coudyzer and Guy (2015) showed how the space necessary for the cranial structure to expand came about. Furthermore, an increase in brain size resulted in neural lexicon acquisition over time (Lieberman, 2002). The two areas responsible for conceptualizing information, the cerebellum and frontal cortex, increased in volume rather than physical size. As described by Lieberman, while the upper anatomical structure of Homo sapiens was evolving to make way for a more substantial brain, the SVT was simultaneously developing as a result of these structural changes. This is precisely what Kumar et al. (2016) hypothesized. It is perhaps the combination of these two critical changes, a growth in neural ability and the formation of the SVT, that serves as the birthplace of human speech. The implications of basicranium flexion are twofold: an increase in brain size, as previously mentioned, and a reduction in the length of the nasopharynx (Neaux et al., 2015). A longer nasopharynx with a complex construction yields a sound that differs acoustically from that of a shorter, simpler structure. A simple analogy for this concept is the mechanism by which wind instruments function: a flute is simple in its build, a single shape that requires little change in the acoustic waves flowing through it, as compared to a saxophone or trumpet. Neaux et al. (2015) suggest that adaptations during the evolutionary process could have formed in favor of the acoustics necessary for the production of speech; according to Kumar et al. (2016), this is precisely why these adaptations occurred.

A comparison of the two studies shows Kumar et al.’s (2016) ability to integrate these key points, conceptualizing them from an evolutionary standpoint. The diagram below (left) represents the digitized images from the Neaux et al. (2015) study. To the right is the same image, which I manipulated to demonstrate basicranium flexion. This visualization helps one understand the message conveyed by Kumar et al. (2016): a change in the tilt of the skull, resulting in a more linear appearance of the face, affects airflow out of the nasal and oral passages during speech production.

In parallel with the findings published by Neaux et al. (2015), Kumar et al. (2016) underscore the weight of basicranium flexion in contributing to the evolution of speech. Although the two publications mirror one another, Neaux et al. (2015) did not mention the successive formation at the bottom of the skull which, according to Kumar et al. (2016), is the highlight of evolutionary contributions to human speech. “Not only does it serve to balance the head but more notably, it houses the primary vocal resonator chamber” (Kumar et al., 2016).

Through their practical approach, Kumar et al. (2016) presented clear and concise research, allowing for a better understanding of the literature. However, a lack of detail in explaining their methodology may produce skepticism or the need for a replication study. For example, when studying the prototypes, which specific structures or markings indicated the degree of flexion of the splanchnocranium? Additionally, Kumar et al. (2016) describe the bipedal gait of Homo sapiens as the perfect platform for the evolution of speech, yet they fail to elaborate on this claim. Drawing on my experience in SLP, as well as the previous example of wind instruments, one would suspect Kumar et al. (2016) intended to convey that an upright posture favored the acoustics and resonance of airflow needed to produce speech.

Although his work does not touch on speech explicitly, Zuberbuhler (2000) should not be overlooked, as his findings make an essential contribution to the discussion as a whole. Demonstrating monkeys’ ability to discern meaning could conceivably contradict the viewpoints described by Neaux et al. (2015) and Kumar et al. (2016). Indeed, the development of a more substantial brain must have resulted in better neural circuitry; yet if Zuberbuhler (2000) was able to demonstrate that some similar circuitry must exist in other animals, we are left with a challenging question that requires further study.

Although we cannot conclusively say speech relies on one specific variable, the many lines of evidence I have discussed suggest that speech and language have been shaped by evolutionary transformations as well as by current influences in multiple domains. All fields concerned with understanding human speech and language would benefit from conducting research in partnership with one another. On the basis of the current literature, an interdisciplinary approach has the potential to yield extraordinary revelations. Unification should take place, and logically this will require radical thinking. However, the capacity to remain curious is worth cultivating, and nowhere is the need for a joint language community more apparent than in the study of how human beings learn. It only makes sense that a topic of investigation as complex as human behavior should draw on the intellect of a diverse group of scholars.


Fitch, W. T., & Reby, D. (2001). The descended larynx is not uniquely human. Proceedings of the Royal Society B: Biological Sciences, 268(1477), 1669–1675.

Kumar, S., Bidarkotimath, S., & Revankar, S. K. (2016). Evolution of speech: A new hypothesis. Journal of Evidence Based Medicine and Healthcare, 3(25), 1158–1161.

Lieberman, P. (2002). On the nature and evolution of the neural bases of human language. American Journal of Physical Anthropology, 119, 36–62. doi:10.1002/ajpa.10171

Neaux, D., Gilissen, E., Coudyzer, W., & Guy, F. (2015). Implications of the relationship between basicranial flexion and facial orientation for the evolution of hominid craniofacial structures. International Journal of Primatology, 36(6), 1120–1131.

Piattelli-Palmarini, M. (1983). Language and learning: The debate between Jean Piaget and Noam Chomsky. London: Routledge & Kegan Paul.

Zuberbuhler, K. (2000). Interspecies semantic communication in two forest primates. Proceedings of the Royal Society B: Biological Sciences, 267(1444), 713–718.

