Recently I wrote about a commercial NLP project that bit off more than it could chew. In response an alert reader sent me a fascinating paper by Ami Kronfeld, “Why You Still Can’t Talk to Your Computer”.  It’s unfortunately not online, and Kronfeld is sadly no longer with us, but it was presented publicly at the International Computer Science Institute, so I figure it’s fair game.

Kronfeld worked for NLI (Natural Language Incorporated), which produced a natlang interface to relational databases. The project was eventually made part of Microsoft SQL Server (apparently under the name English Query), but it was allowed to die away.

It worked pretty well. Kronfeld gives this sample exchange:

Does every department head in Center number 1135 have an office in Berkeley?
[Answer: “No. All heads that work for center number 1135 are not located in an office in Berkeley”]

Who isn’t?
[Answer: Paul Rochester is the head not located in an office in Berkeley that works in center number 1135]

He points out that language and relational databases share an abstract structure: they have things (nouns, entities) which have properties (adjectives, values) and relate to one another (verbs, cross-references). This sort of matchup doesn’t always occur nicely.  (E.g. your word processor understands characters and paragraphs, but it hasn’t the slightest idea what any of your words mean.)

But the interesting bit is Kronfeld’s analysis of why NLI failed. One aspect was amusing, but also insightful: we humans don’t have a known register for talking to computers. For instance, one executive sat down at the NLI interface and typed:

How can we make more money?

The IT guys reading this are groaning, but the joke’s on us. If you advertise that a program can understand English, why be surprised that people expect that it can understand English?

Curiously, people attempting to be “computery” were no easier to understand:

Select rows where age is less than 30 but experience is more than 5

This seems to be an attempt to create an on-the-fly pidgin between SQL and English, and of course the NLI program could make nothing of it.
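
For comparison, here is roughly what that user was presumably reaching for, written as real SQL and run through Python's built-in sqlite3 module. The table name `staff`, its columns, and the three rows are all invented for illustration; nothing here comes from Kronfeld's paper.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, age INTEGER, experience INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ann", 28, 6), ("Bob", 45, 20), ("Cy", 25, 2)],
)

# "age is less than 30 but experience is more than 5":
# the English "but" is just a logical AND as far as SQL is concerned.
rows = conn.execute(
    "SELECT name FROM staff WHERE age < 30 AND experience > 5"
).fetchall()
print(rows)
```

The pidgin sentence sits about halfway between this and ordinary English, which is why neither a SQL parser nor an English parser would know what to do with it.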

To be sure, there were thousands of questions that could be properly interpreted. But the pattern of which ones worked was not obvious. E.g. an agricultural database had a table of countries and a table of crops.  The syntactic template S grow O could be mapped to this (look for S in the country table, O in the crops), allowing questions like these to be answered:

  • Does Italy grow rice?
  • What crops does each country grow?
  • Is Rice grown by Japan?
  • Which countries grow rice?
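
To make the template idea concrete, here is a minimal Python sketch of how one S grow O rule might be tied to a country table and a crop table. Everything in it is an assumption made for illustration: the toy data, the regular expressions, and the function name `answer_grow`. It handles only the two yes/no questions from the list, and it is certainly not how NLI did it.

```python
import re

# Hypothetical toy data standing in for the agricultural database.
COUNTRIES = {"italy", "japan", "india"}
CROPS = {"rice", "wheat"}
GROWS = {("italy", "rice"), ("japan", "rice"), ("india", "wheat")}

# Two surface forms of the single template "S grow O":
# active ("Does Italy grow rice?") and passive ("Is rice grown by Japan?").
ACTIVE = re.compile(r"^does (\w+) grow (\w+)\?$")
PASSIVE = re.compile(r"^is (\w+) grown by (\w+)\?$")

def answer_grow(question):
    """Return True or False if the question fits the template, else None."""
    q = question.strip().lower()
    m = ACTIVE.match(q)
    if m:
        country, crop = m.group(1), m.group(2)
    else:
        m = PASSIVE.match(q)
        if not m:
            return None  # no rule covers this sentence
        crop, country = m.group(1), m.group(2)
    # The template applies only if S is a known country and O a known crop.
    if country not in COUNTRIES or crop not in CROPS:
        return None
    return (country, crop) in GROWS

print(answer_grow("Does Italy grow rice?"))
print(answer_grow("Is Rice grown by Japan?"))
```

The point of the sketch is that the rule is not just about English: it names specific tables, so it has to be written by someone who knows both the grammar and the database.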

But then this simple question doesn’t work:

  • Does rice grow in India?

Before I say why, take a moment to guess.  We have no trouble with this question, so why does the interface?

The answer: it’s a different syntactic template.  S grows in O is actually the reverse of our earlier template: before, the country was growing things; here the rice is growing, all by itself, and a location is given in a prepositional phrase. As I said before, language is fractally complicated: you handle the most common cases, and what remains is more complicated than all the rules you’ve found so far.

Now, you can of course add a new rule to handle this case.  And then another new rule, for the next case that doesn’t fit.  And then another.  Kronfeld tells us that there were 700 separate rules that mapped between English and the database structure.  And that’s one database.
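
The rule-upon-rule pattern can be sketched as a list that keeps growing. This is a toy under stated assumptions: the facts, the patterns, and the names `RULES` and `answer` are all hypothetical, and a real system would work from a parse, not from regular expressions.

```python
import re

# Hypothetical facts as (country, crop) pairs, invented for illustration.
GROWS = {("italy", "rice"), ("japan", "rice"), ("india", "rice")}

# Each rule pairs a surface pattern with the order in which its two
# captured words fill the (country, crop) roles.
RULES = [
    (re.compile(r"^does (\w+) grow (\w+)\?$"), ("country", "crop")),
    (re.compile(r"^is (\w+) grown by (\w+)\?$"), ("crop", "country")),
    # Added later, once "Does rice grow in India?" turned up:
    (re.compile(r"^does (\w+) grow in (\w+)\?$"), ("crop", "country")),
]

def answer(question):
    """Try each rule in turn; return True/False, or None if none applies."""
    q = question.strip().lower()
    for pattern, roles in RULES:
        m = pattern.match(q)
        if m:
            slots = dict(zip(roles, m.groups()))
            return (slots["country"], slots["crop"]) in GROWS
    return None

print(answer("Does rice grow in India?"))
print(answer("Where is rice grown?"))
```

Three rules cover three phrasings of one relationship between two tables, and the last question still falls through. Scale that up and 700 rules stops sounding extravagant.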

So, the surprising bit from Kronfeld’s paper is not that “natural language is hard”, but that the difficulty lives in a very particular area: specifying the semantics of the relational database. As he puts it:

I realized from the very start that what was required for this application to work was nothing short of the creation of a new profession: the profession of connecting natural language systems to relational databases.

So, that’s a way forward if you insist on having a natlang interface for your database!  NLP isn’t just a black box you can tack on to your program. That is, parsing the English query, which is something you could reasonably assign to third-party software, is only part of the job.  The rest is a detailed matchup between the syntactic/semantic structures found, and your particular database, and that’s going to be a lot more work than it sounds like.