

My wife has just returned from Peru, and brought back a list of Peruvian names from the newspapers. Odd spellings for foreign names are muy de onda (very hip).

Sthefany

Lesly

Jhony

Mijael

Yeni

Airon (Aaron?)

Jhair

Yanet

Exavier

Yodi

Jeylo (J. Lo)

Jhunior Brayan

Lian

Itan (Ethan?)

Johan Jonathán

Jilmer

Bili

Yordi

Yandy

Jannet

Jhoselin

 

Ginés

Yanika

I’ve updated the Numbers from 1 to 10 page!  For the first time in, well, many years.

Note: if you don’t see the new page, because the old page is cached, you may have to hit shift-refresh.

The major features:

  • It now makes extensive use of Unicode to finally present the numbers as they were intended to be seen. (If you can’t see all the characters it’s dredged up, check the notes page for how to download comprehensive Unicode fonts.)
  • As a corollary, I’ve started to include the native writing system for key languages.
  • The families are color-coded to help you navigate.
  • The page uses Javascript to allow you to customize the results.

Now the story behind the update. The original source file was an enormous Mac Word 5.1 file. To generate the html files, I would output the source file into RTF (which is how you were supposed to access .doc files). Then I ran a custom C program that converted the RTF into html.

So far so good, only my old PowerPC died a few years back, which meant I could no longer run Mac Word 5.1, which meant I couldn’t generate the RTF or the html files, which meant no updates period.

Sigh, Mac Word 5.1, released in 1991, was a thing of beauty. It had little of the cruft of later versions of Word, I had all the commands in muscle memory, and on the PowerPC it was damn fast. Plus it never crashed. I had to switch to Word 2008 when I needed Unicode, but I kept using 5.1 until I couldn’t. I’ve gotten used to Word 2008, but it is just not the reliable workhorse that 5.1 was. It crashes unpredictably with certain large files, especially if they have a lot of formatting— as many of my books do.

Word 5.1 had a neat feature that I used extensively for the numbers list: you could overtype characters. This was necessary to represent the many, many arcane and wacky characters that linguists have used over the last couple centuries to write their grammars and wordlists. Word 2008 can read these, but apparently can’t create them.

I had long envisioned a database or a text document that could hold the numbers, letting the web page itself be very simple. I was a bit worried that the database would be huge and slow, but then I remembered that most web pages these days pull down megabytes of cruft.

So, the source file is now plaintext.  I still use Word to create it, because it looks better there and I can use bolding to help me navigate, but all I do to make the plaintext file is copy and paste into TextEdit. It turns out that the whole file is only 400K, far smaller than the 1.4M html file that was the old mondo partly-Unicoded version. The text file is human-readable, but some pretty simple Javascript reads and prettifies it for the actual web page.
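To give an idea of the approach, here’s a minimal sketch of the kind of Javascript involved. The field layout I’m assuming (language name, family, then the numbers, tab-separated) is my own invention, not the site’s actual format, and the function names are hypothetical:

```javascript
// Parse the plaintext source into entries. The tab-separated layout
// (language, family, numbers...) is an assumed format for illustration.
function parseNumbersFile(text) {
  return text
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => {
      const fields = line.split("\t");
      return {
        language: fields[0],
        family: fields[1],
        numbers: fields.slice(2),
      };
    });
}

// Turn entries into table rows; color-coding by family could hang
// off a CSS class named for the family.
function prettify(entries) {
  return entries
    .map(e =>
      `<tr class="${e.family}"><td>${e.language}</td>` +
      e.numbers.map(n => `<td>${n}</td>`).join("") +
      "</tr>")
    .join("\n");
}
```

On the actual page, something like `fetch('numbers.txt').then(r => r.text())` would pull the file down and hand it to these functions; since the whole file is only 400K, that’s a trivial download by modern standards.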

There are probably some typos in the file, due to quirks in the old process that I may have missed or messed up during the conversion.  On the other hand, there were a lot of kludges in the old html version; the new version is much closer to the original sources.

I haven’t dealt with the sources page yet. (The problems are similar; it will be another fairly tedious project to update the document and access page.)

Edit: The sources page is done now too! As with the numbers page, you can zero in on specific regions.

If you happen to be a linguist or for some other reason study the less-spoken languages, I’m always open to additions and corrections, and finally I can make them again.

 

In your review of Overwatch, you said that you appreciate the fact that characters speak appropriately in Chinese, Korean, Russian, and French. However, I have read some complaints that the French accent of Widowmaker sounds fake. Since I have heard similar complaints about Leliana of the Dragon Age series, and since both are voiced by French people, I would like to know if this perception comes from actors deliberately exaggerating their pronunciation, or if Hollywood or something similar has misled people about what constitutes a true foreign accent.
Cordially,
Antonin BRAULT

Standards are changing, so I think this issue is in flux.

I can tell you what isn’t acceptable any more: mangling foreigners’ accents as in this book.

That is, it would be completely offensive if instead of having a Korean-Japanese-American woman (Charlet Chung) voice D.Va, they’d had a white American attempt a Korean accent.

So far as I can judge, Chloé Hollings, the voice of Widowmaker, pronounces the French perfectly— as she should; she’s French.

Is her French accent exaggerated? Yes, of course; Hollings is bilingual and speaks excellent English. I don’t have any inside knowledge of Blizzard’s production, but one can imagine for many of these voices a scene something like this:

Voice actor: (pronounces a line perfectly)

Director: Great! Only… can you make it sound more French?

And the director does have a point! If they’ve gone to the trouble of hiring bilingual voice actors, they kind of don’t want perfectly unaccented English. The characters are supposed to be cartoony, so they want to reach the sweet spot where the accents communicate the character but remain attractive. (Americans, at least, react negatively to a heavy foreign accent, but find a light accent enchanting.)

With Dragon Age, I saw a page that noted that Corinne Kempa (voice of Leliana) simply didn’t have the type of French accent Americans expect to hear. Again, American viewers aren’t very sophisticated here; few could even identify different varieties of French. (I liked Leliana— it was nice to have a fantasy game that didn’t over-rely on British accents.)

It’s hard to make everybody happy, but I think Blizzard took a pretty good approach. I also like the fact that, except for the two ninjas, the characters aren’t defined by their nationalities. E.g. Mei is a climatologist, who just happens to be Chinese. Zarya is much more defined as “butch power-lifting soldier” than as Russian. They do paint with a broad brush, but they’re nodding much more to media images than to ethnic stereotypes— e.g. McCree is a version of Clint Eastwood; Junkrat refers to Mad Max.  One character they could have done better with, in my opinion, is Pharah, who should speak some Arabic.

Edit: The new character, Ana, does speak some Arabic.

I saw this on Twitter, and decided that this was an important phrase to learn in Chinese:


網上虛擬交心不宜

wǎng-shàng xūnǐ jiāoxīn bù yí

web-above virtual entrust not should

You should not make virtual commitments online.

 

While we’re at it, my Overwatch pals have been quoting D.Va’s comments in Korean, so let’s look at those in more detail.

안녕하세요!

a̠nɲjʌ̹ŋ ɦa̠sʰe̞jo

Annyeong haseyo!

peace you.have

Do you have peace? = How are you?

That first word is a borrowing from Chinese 安寧— Mandarin ānníng ‘peace, tranquility’. You will undoubtedly recognize the first character from 西安 Xī’ān, the ancient capital of China; also Heian, the ancient name for Kyoto.

D.Va is very informal and also from the future, so she just says Annyeong!

감사합니다

ˈka̠ːmsʰa̠ɦa̠mnida̠

Kamsa hamnida!

thanks have.assertive

I am thankful! = Thank you!

Again, the first word is a borrowing: 感謝 gǎnxiè ‘gratitude’; the common way to say “Thank you” in Mandarin— which you can hear Mei say in Overwatch— is 謝謝 xièxiè.

And again, D.Va informally says just Kamsa!

Mei’s “Hello” is 你好 Nǐhǎo, literally “you good?”

 

Is there any advice which you used to give to conlangers but now
consider misguided? What was it, why did you think it was good advice,
and how has your attitude changed to make it not-good?

—Thomas

This is going to be pretty boring, but: nah, not really. My stuff is mostly not advice per se; it’s just introducing linguistics to people. When I do have regrets, it’s usually that I haven’t covered something, and the solution is usually to write another book.🙂  So a lot of things that didn’t get into the LCK got into ALC instead.

I did take the opportunity to revise the LCK to give a better introduction to aspect, though.

I always wanted to give a better overview of transformations and how they revolutionized syntax. I studied that a lot in college and found it fascinating.  On the other hand… well, I can’t really say a conlanger has to know that stuff.  Plus the field never reached a consensus on the best way to handle syntax.  (My Axunašin grammar attempts to do justice to transformations, though I think it’d have to be three times as long, and include lots of cumbersome trees, to really explain the concept.)

 

The Five Year Plan has come in from the Marketing Commissar here at the Zompist Fortressplex. That is, I thought I’d talk about the next books I’m working on.

First: a book on Quechua. Long ago I actually wrote, for myself, a reference grammar and dictionary. That was a good start, but they need a lot of refinement. Plus I need to work through my best sources to absorb more of the language myself.


One reason I wanted to visit the Seminary Co-op bookstore last weekend was to check if they had anything on Quechua… if there was a really good book on it in English I might have just recommended that.  But they didn’t (indeed, their stock of language and linguistics books is, sadly, less than a quarter of what it once was). The best materials on Quechua are all in Spanish; I think there should be a good introductory textbook/dictionary in English, and so that’s what I’m aiming to produce.

After that I’d like to write about India, parallel to my book on China. I’ve already started the research on this, and the books I did pick up at the Co-op were grammars of Hindi and Sanskrit. I’m already excited about the material: India has an incredibly rich history, and it’s even less known in the West than China’s. But I want to spread out the research and reading a lot more, partly because I’m starting much more from scratch, and partly because I can already see that finding the narrative through line is going to be more difficult.

Chinese history is a story— you can tell it well or badly, but it’s hard for it not to be coherent, because it’s the story of one ethnicity, one language family, and for the most part one empire, which collapses and suffers invasions but always returns to itself.

India is not like that. India is unavoidably miscellaneous, and Indian history has no coherence at all. Empires rise and fall, but they’re not the same empires. You can list the major kingdoms of a particular time and it tells you nothing about other periods. (Plus there’s a lot we just don’t know. One of my books mentions that a particular king probably lived in the first century, but we can’t pin him down for sure anywhere within a 200-year period.)

Now, this is pretty much true of Europe and the Middle East too, but there we have the advantage of familiarity, and traditional identifications… Americans are 99.6% not Greeks, and yet we read about the ancient Greeks as if they were the direct ancestors of our civilization.

One fascinating bit about India, which I get from Alain Daniélou, is that whenever some group started a kingdom or a religion in India, they’re still there. Ancient hunter-gatherers, Dravidians, Indic peoples, Persians, Muslims, Mongols, Portuguese, Brits, all came to India and you can still find them and their religions today.

It also strikes me that Westerners don’t know much about India in part because our maps stop too soon. A map of Europe + India stretches out too far; to make it fit nicely on the page, we cut it off somewhere east of Palestine. So one of the neat bits in reading Indian history is discovering the eastern half of many stories. Most of the big conquerors in the West— the Greeks, the Persians, the Huns, the Mongols, the Arabs— showed up in India too. The Greeks set up kingdoms in the Indus valley; the Romans traded with South India; the Mughals claimed descent from Genghis Khan.

Finally, for the few but anxious people who wonder if there will be another Incatena book: yes, though being able to pay rent and buy groceries is the higher priority, which is why the non-fiction books go first.  I have a few chapters written. Though honestly, this year has been discouraging for satirists. How do you top the absurdity that the daily news has been piling on us?

 

 

 

Recently I wrote about a commercial NLP project that bit off more than it could chew. In response an alert reader sent me a fascinating paper by Ami Kronfeld, “Why You Still Can’t Talk to Your Computer”.  It’s unfortunately not online, and Kronfeld is sadly no longer with us, but it was presented publicly at the International Computer Science Institute, so I figure it’s fair game.

Kronfeld worked for NLI (Natural Language Incorporated), which produced a natlang interface to relational databases. The project was eventually made part of Microsoft SQL Server (apparently under the name English Query), but it was allowed to die away.

It worked pretty well— Kronfeld gives the sample exchange:

Does every department head in Center number 1135 have an office in Berkeley?
[Answer: “No. All heads that work for center number 1135 are not located in an office in Berkeley”]

Who isn’t?
[Answer: Paul Rochester is the head not located in an office in Berkeley that works in center number 1135]

He points out that language and relational databases share an abstract structure: they have things (nouns, entities) which have properties (adjectives, values) and relate to one another (verbs, cross-references). This sort of matchup doesn’t always occur nicely.  (E.g. your word processor understands characters and paragraphs, but it hasn’t the slightest idea what any of your words mean.)

But the interesting bit is Kronfeld’s analysis of why NLI failed. One aspect was amusing, but also insightful: we humans don’t have a known register for talking to computers. For instance, one executive sat down at the NLI interface and typed:

How can we make more money?

The IT guys reading this are groaning, but the joke’s on us. If you advertise that a program can understand English, why be surprised that people expect that it can understand English?

Curiously, people attempting to be “computery” were no easier to understand:

Select rows where age is less than 30 but experience is more than 5

This seems to be an attempt to create an on-the-fly pidgin between SQL and English, and of course the NLI program could make nothing of it.

Of course there were thousands of questions that could be properly interpreted. But the pattern was not obvious. E.g. an agricultural database had a table of countries and a table of crops.  The syntactic template S grow O could be mapped to this— look for S in the country table, O in the crops— allowing questions like these to be answered:

  • Does Italy grow rice?
  • What crops does each country grow?
  • Is rice grown by Japan?
  • Which countries grow rice?
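To make the idea concrete, here’s a toy version of the kind of mapping Kronfeld describes. NLI’s internals aren’t public, so the table contents, the template, and the function names are all my own inventions:

```javascript
// A tiny "database": the rows relating countries to the crops they grow.
const grows = [
  { country: "Italy", crop: "rice" },
  { country: "Japan", crop: "rice" },
  { country: "Italy", crop: "wheat" },
];

// One syntactic template: "Does S grow O?" -- look for S among the
// countries and O among the crops, then check the table.
function answer(question) {
  const m = question.match(/^does (\w+) grow (\w+)\?$/i);
  if (!m) return "I don't understand.";
  const [, country, crop] = m;
  const hit = grows.some(
    r => r.country.toLowerCase() === country.toLowerCase() &&
         r.crop.toLowerCase() === crop.toLowerCase()
  );
  return hit ? "Yes." : "No.";
}
```

A single template like this already answers a whole family of questions, which is what makes the approach seductive: the first few rules buy a lot of coverage.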

But then this simple question doesn’t work:

  • Does rice grow in India?

Before I say why, take a moment to guess.  We have no trouble with this question, so why does the interface?

The answer: it’s a different syntactic template.  S grows in O is actually the reverse of our earlier template— before, the country was growing things, here the rice is growing, all by itself, and a location is given in a prepositional phrase. As I said before, language is fractally complicated: you handle the most common cases, and what remains is more complicated than all the rules you’ve found so far.

Now, you can of course add a new rule to handle this case.  And then another new rule, for the next case that doesn’t fit.  And then another.  Kronfeld tells us that there were 700 separate rules that mapped between English and the database structure.  And that’s one database.
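You can see why the rules multiply in a sketch like this one. Each phrasing gets its own pattern, and two patterns that look alike on the surface map to the database in opposite ways; everything here (patterns, field names) is my own toy stand-in for the hundreds of mappings Kronfeld mentions:

```javascript
// Each surface phrasing needs its own rule mapping it to a database
// query; note that the last two reverse which argument is the country.
const rules = [
  // "Does Italy grow rice?" -- subject is the country
  { pattern: /^does (\w+) grow (\w+)\?$/i,
    query: m => ({ country: m[1], crop: m[2] }) },
  // "Does rice grow in Italy?" -- subject is the crop, country in a PP
  { pattern: /^does (\w+) grow in (\w+)\?$/i,
    query: m => ({ country: m[2], crop: m[1] }) },
  // "Is rice grown by Italy?" -- passive, reversed again
  { pattern: /^is (\w+) grown by (\w+)\?$/i,
    query: m => ({ country: m[2], crop: m[1] }) },
];

function interpret(question) {
  for (const rule of rules) {
    const m = question.match(rule.pattern);
    if (m) return rule.query(m);
  }
  return null; // no rule fits: the system draws a blank
}
```

Three rules handle three phrasings; a real database needs a rule for every phrasing users actually produce, which is how NLI ended up with 700 of them for a single database.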

So, the surprising bit from Kronfeld’s paper is not “natural language is hard”, but that the difficulty lives in a very particular area: specifying the semantics of the relational database. As he puts it:

I realized from the very start that what was required for this application to work was nothing short of the creation of a new profession: the profession of connecting natural language systems to relational databases.

So, that’s a way forward if you insist on having a natlang interface for your database!  NLP isn’t just a black box you can tack on to your program. That is, parsing the English query, which is something you could reasonably assign to third-party software, is only part of the job.  The rest is a detailed matchup between the syntactic/semantic structures found, and your particular database, and that’s going to be a lot more work than it sounds like.

 

 
