"Data Is" vs. "Data Are"
Copyright © 1999 John Cullen

For the non-specialist, so many of these issues seem like nothing more than a difference of opinion. Even if there are gray areas, it helps to at least understand the background to common grammar and usage problems. "Data is" vs. "Data are" is one of those lightning rod flash points that cause office bickering. Let's take a closer look.

Let me crank up my Latin skills, easily as rusty and ancient as when the last Caesar ruled over the Tiber. The word datum is actually the past participle of the Latin dare, "to give." As a participle, and particularly as a neuter, "datum" literally means "a thing that has been given." Thus, scientists doing a QED use their "givens," or their data. Note that data is the plural of datum. Thus, one seems oblivious of subject/verb number agreement to say something like "The data is being displayed" -- kind of like saying "We is checking dem fakts."

But is that all there is to it? A matter of ignorance in a technically dominated age, in a melting pot society in which speaking non-English has always been the mark of the recently deboated, an embarrassment rather than a gift? Let's take a closer look.

Five hundred years ago, spelling was out the window. The educated still conversed in Latin, and you could be burned alive for owning a vernacular Bible. You could spell "Shakespeare" or "Marlowe" dozens of ways, all right, none wrong. Along came the grammarians to bring order to the language as the British Empire took shape. They (Cawdrey, Cockeram, Johnson, et al.) wrote dictionaries and insisted on standard spellings, even if the result was that hodge-podge of i's before e's (or is it e's before i's?). They borrowed from the logic of Latin, the precision of Greek, and the grandeur of Hebrew, and tamed the wild animal of late Middle English. The resultant ripening gave us the singular beauty of Shakespeare, the King James Bible, a flowering that began in Elizabethan times.

Some 200 years ago, "bad" could mean "good," and vice versa, because language constantly evolves. Here is the grammarian's dilemma: just when he has it figured out and laid out on paper, with the ink still drying, in blows a wind that changes everything again... The grammarian cannot dictate to the living, breathing populace who actually use the language, but must try to keep what is orderly and unchanging in place while letting the language go freely in the direction it is headed on the tongues of entrepreneurs, innovators, and the naive.

So why can't we get our datum and data (worse yet, datums and datas) straight? Are we having a meltdown?

Here's my theory, after I have spent enough time in the systems development industry to stop cringing (much) when I hear "the data is." When a scientist talks about data, she means discrete points or values that have no real information content beyond what they signify (a "5" on a chart, for example). The scientist's job is to assemble data -- to see how they are grouped, what the distances and relationships among them are, to come up with a unifying observation called, at the end of the process, "information."

In data processing, we do something rather opposite. We put the data together and process it, as the term indicates. Data in this context is not a group of individual things, like marbles in a jar, but a plastic entity whose life span and meaning exist in a unified form, kind of like ground beef going through a strainer. It's not a "they" but an "it."

If they are anything, good programmers and system developers are logical. It's the only way to survive and prosper vis-a-vis a compiler; one error, and the program bombs. The grammar of the programming languages is merciless -- as shown in the early 1960s, when a missing hyphen in a program reportedly caused the loss of Mariner 1, an expensive probe bound for Venus. English is a much higher-level language that is more forgiving -- hence its ability to adapt, to change, to sometimes reverse polarities ("good" and "bad"), and to annoy grammarians who like to keep things tidy.

Now perhaps the most surprising thing: "datum, data" is not the only word to be thus defenestrated. Consider the opera. Or is it the opus? The Latin opus (singular) means "work," as in a work of art or a musical composition -- hence, Symphony 42, Opus 3. However, the guys across the street, who wear makeup and sing, they don't produce an opus -- they produce an opera. Now isn't that silly? Opus and opera are the same word (like man and men -- one is plural, the other singular), but through some adaptive contortion of the language, this usage has become as fixed as concrete.

Other Latin words undergoing similar torture include bacterium, the plural of which is bacteria, and medium/media. Remember that English is a much less inflected language than, say, Latin. We often don't know what to do with suffixes that contain information about gender, number, and case in words borrowed from a more inflected language, so we assume that all words ending in "a" anywhere in the world are feminine, and all words ending in "o" are masculine (like thinking all dogs are male and all cats are female). This is why you cannot convince American writers writing about Mexican women to call them "Rosario" rather than "Rosaria" (which does not exist), or "Consuelo" rather than "Consuela" (not a real name in proper Spanish either). The real names are Consuelo and Rosario because, even though the bearer is a woman, the word itself is properly masculine in the language.

We have a convenience called the "collective noun" in English. If I say "the bacterium is..." I am being correct in using it as a singular noun. Of course, who looks at individual bacteria? If I say "the bacteria are..." then I am being precisely correct. If I say "the bacteria is..." then I am using a newly coined form of an ancient word as a collective noun. It may be incorrect usage in its classical form, but it is street-logical, and quite likely a form that will soon predominate (much as I detest it).

How about curriculum/curricula or criterion/criteria? You'll hear "the criteria is..." or "the curricula is..." and shrug sadly at where language is going; yet this is not a recent ill but a trend stretching over centuries.

Are there a price to pay for indiscriminate usage? (Sorry, that construction won't be "in" for another century, probably). The price, I think, is a loss of clarity, not to mention connectedness with the past. In this age of conspiracies, millennialism, whining, and suspicion, one does cringe when one hears people say: "It's all the fault of the media! The media is doing this, that, or the other thing..." See it there again? "The media is...?" No, dear, "The media are..." because there isn't one giant world conspiracy that owns all the media and cranks out propaganda that will end with us all having computer chips embedded in our glutei maximi; no, "The media are..." because the news editors went to similar schools and took similar courses, and recognize a story when they see one, but they don't belong to a giant conspiracy. In such subtle distinctions lies the chasm between two radically different views of the world. Clarity. May we keep it.

John Cullen welcomes your comments.

