Here is a recent article by Google's Alon Halevy, Peter Norvig, and Fernando Pereira [376 kB PDF], from the March/April issue of IEEE Intelligent Systems. It is intelligible to a lay person, and it lucidly contrasts two broad approaches to extracting meaning from information. Though the article does not use the terms, you might describe these two approaches as taxonomic and statistical. These excerpts (the second is the concluding paragraph) give you a flavour:
"The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need (think of the operations of the European Union or of news agencies). The same is true of speech transcription (think of closed-caption broadcasts). In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also diffi cult for experts to agree on, being bedeviled by many of the diffi culties we discuss later in relation to the Semantic Web. The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data."
"So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do."
"So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do."
I read the article carefully, and I admit it is fascinating. First, its title, "The Unreasonable Effectiveness of Data", is a fully intentional, acknowledged reference to Eugene Wigner's “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”. A nice play on titles, but a bit misleading: the role of mathematics in natural science is just the opposite of the role of pure data in human knowledge. I could elaborate on that at length, but what strikes me most deeply is something else.
It seems that the authors dismiss the message of the Semantic Web advocates, among them Tim Berners-Lee, for reasons that are not very clear.
Let me quote: "(...) But even if we have a formal Semantic Web “Company Name” attribute, we can’t expect to have an ontology for every possible value of this attribute. For example, we can’t know for sure what company the string “Joe’s Pizza” refers to because hundreds of businesses have that name and new ones are being added all the time. We also can’t always tell which business is meant by the string HP. (...)"
Well, in all Semantic Web proposals we do not care what "Joe's Pizza" or "HP" means!
We care about one thing: that "Joe's Pizza" is a Company Name. We do not need an ontology for the name itself; we need it for the different potential "Company Name" concepts!
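To make that concrete, here is a minimal sketch of my own (in Python with rdflib, using a hypothetical ex: vocabulary): the markup only asserts that a literal string fills a company-name property; nothing obliges anyone to resolve which company the string denotes.

```python
# Minimal sketch with rdflib: assert that a string is a value of a
# (hypothetical) companyName property, without saying which company it denotes.
from rdflib import BNode, Graph, Literal, Namespace

EX = Namespace("http://example.org/vocab#")  # hypothetical vocabulary

g = Graph()
g.bind("ex", EX)

listing1 = BNode()  # some business mentioned on a page
g.add((listing1, EX.companyName, Literal("Joe's Pizza")))

listing2 = BNode()  # another mention, equally unresolved
g.add((listing2, EX.companyName, Literal("HP")))

print(g.serialize(format="turtle"))
```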
Not everything in the article is bad, though.
What I liked was the call "So, follow the data": in some vague sense they reaffirmed Tim Berners-Lee's principle of least power.
I must also admit that the distinction between the "Semantic Web" and "Semantic Interpretation" is very convincing; it is another good part of the article.
Finally, I often think that Google is The One who could push the Semantic Web forward, and for some reason they don't. They could simply cry out loudly: "Hi, webmasters around the world: use RDF or microformats to mark up your contact/author data and we will use it in our search engine!"
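Just to show how little it would take, here is a hypothetical sketch (again with rdflib, using the standard FOAF vocabulary; the person and addresses are made up) of the kind of machine-readable author/contact data a webmaster could publish alongside a page for such a crawler to pick up.

```python
# Hypothetical sketch: author/contact data as RDF (FOAF vocabulary) that a
# webmaster could publish next to, or linked from, an ordinary HTML page.
# All names and addresses below are made up.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
g.bind("foaf", FOAF)

author = URIRef("http://example.org/people/jane#me")  # made-up author URI
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Jane Webmaster")))
g.add((author, FOAF.mbox, URIRef("mailto:jane@example.org")))
g.add((author, FOAF.homepage, URIRef("http://example.org/")))

print(g.serialize(format="xml"))  # RDF/XML a crawler could consume
```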
Conspiracy theories aside, there is something in this article, written by Google researchers, that justifies their unwillingness to start the ball rolling...
Mirek
PS. I posted the same comment on David Weinberger's blog and will probably post it on my own :-)
Posted by: Mirek Sopek | 30/03/2009 at 21:32