Monday, September 7, 2009

It Depends on What "Semantic" Is: NetBase And Natural Language Processing Hit Hiccups with HealthBase

The word "semantic" is bandied about quite a bit these days in online publishing. The term is used to label everything from systems that automatically categorize content based on the presence of key terms in its body to more human-assisted forms of content organization. Whatever the particular technology or methodology, though, using the language structure of content and queries to infer more than merely the presence of key terms or concepts can get a little tricky with content on the open Web. An example of the challenges of inferring meaning from both search queries and related online content surfaced recently with the launch of the new HealthBase online portal.

HealthBase is a showcase for the technologies of NetBase, a Mountain View, CA-based company specializing in using semantic language processing to unearth relationships in content collections not easily revealed by traditional keyword technologies. NetBase claims that HealthBase can help people to sort through Web content to find solutions to medical problems by parsing their queries through natural language semantic filters and then using semantic processing to find content organized by specific aspects of possible causes and solutions for medical problems. While HealthBase attracted some kind words from Search Engine Land, some test queries by Technorati delivered less flattering results. For the search query "aids," for example, a list of possible causes identified in Web content by HealthBase included "Jews," based on HealthBase interpreting "aids" as the word for assisting people rather than as the disease's acronym. The possible cures for this possible cause for "aids" included "salt" and "alcohol."

There can be little doubt that NetBase took an enormous risk by exposing its cutting-edge technology in an open Web service focused on something as critical as healthcare, a field on which many well-funded online providers have focused for several years. With many people doubting the reliability of the Web as a source of medical information, glitches in a new service are not likely to make people feel more comfortable with using online content from unvetted sources to consider courses of treatment. But the real problem is not the NetBase technology so much as the expectations of how well some technologies can deal with a wide array of semantic issues found in subject domains only tangentially related to a field of science.

The idea of exploring sources of content using semantic tools to parse out possible causal relationships can be made to work, but these technologies need a lot of pre-defined context to guide their efforts. For example, semantic analysis tools tend to work well on documents that are highly structured - say, a research paper abstract or a news article in which a lede paragraph contains key information in a fairly structured pattern. To get semantic processing working on more unstructured sources of content such as emails, Web pages and other more open-ended content formats requires a lot of "training data," documents that are typical of successful matches for a given domain of information. Similarly, search engines or databases that use natural language processing to infer a particular kind of topic from a query entered in a text interface may not be given enough words to infer the right context for a specific subject.

Keyword-oriented search engines such as Google remain popular in part because they don't try to infer too much semantic knowledge from a given query. Instead, they rely on the human understanding of the semantic context of a given keyword - for example, looking at the number of people visiting or linking to a page that appears to be a match - to help select possible matches for a given keyword. Type "aids" into Google, for example, and you get a lot of documents relevant to the disease AIDS. If you had this type of collection as a starting point and then applied semantic filters to look at causal relationships, then you'd probably be in a better position to apply domain-specific semantic processing tools.
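
To make that two-stage idea a little more concrete, here is a minimal Python sketch. It is only a toy under stated assumptions, not how HealthBase or Google actually work: the sample sentences, the MEDICAL_CONTEXT word list, the simple "X causes Y" pattern and the function names are all invented for illustration. It compares pulling "causes" out of every document that mentions the query word with first narrowing the collection to documents that carry some medical context.

```python
import re

# Hypothetical toy corpus; these sentences are invented for illustration.
DOCS = [
    "HIV causes AIDS, a disease of the immune system.",
    "Poor wiring causes hearing aids to fail.",
    "The virus weakens the immune system of a patient with AIDS.",
]

# Assumed domain vocabulary standing in for "pre-defined context".
MEDICAL_CONTEXT = {"disease", "virus", "immune", "patient", "infection"}

# Very naive causal pattern: "<cause> causes <effect>" up to a comma or period.
CAUSE_PATTERN = re.compile(
    r"(?P<cause>[\w ]+?) causes (?P<effect>[\w ]+?)[,.]", re.IGNORECASE
)


def naive_causes(query, docs):
    """Collect causes from any document whose effect phrase contains the query word."""
    found = []
    for doc in docs:
        match = CAUSE_PATTERN.search(doc)
        # Case-insensitive word match, so "aids" also matches "AIDS".
        if match and query.lower() in match.group("effect").lower().split():
            found.append(match.group("cause").strip())
    return found


def in_domain(doc):
    """Return True if the document contains at least one medical context word."""
    tokens = set(re.findall(r"[a-z]+", doc.lower()))
    return bool(tokens & MEDICAL_CONTEXT)


def domain_first_causes(query, docs):
    """Narrow the collection to in-domain documents, then extract causes."""
    return naive_causes(query, [doc for doc in docs if in_domain(doc)])


if __name__ == "__main__":
    print(naive_causes("aids", DOCS))
    print(domain_first_causes("aids", DOCS))
```

With this toy data the naive pass picks up a "cause" from the sentence about hearing aids as well as the one about the disease, while the domain-first pass keeps only the medical one. A real system would need far richer context than a five-word list, but the ordering of the two steps is the point.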

Semantic processing applied in the manner of HealthBase can help to expose exciting possible relationships between different sets of content that may have otherwise never surfaced, making its potential worthy of being taken very seriously. But applying the assumptions of one subject domain to any number of generally unrelated domains is like trying to learn a foreign language by just walking down the streets of an unfamiliar country: it is not always the most efficient or reliable way to discover the most obvious causal relationships. Being able to learn and to apply lessons rapidly from a wide range of experiences is key to making such semantic processing work effectively. To some degree these kinds of services must offer "self-learning," that is, the ability of the semantic technology to be trained by human input to recognize automatically when it has made mistakes, and to be tuned rapidly by humans who will understand complex semantic relationships more rapidly than most software.

No doubt HealthBase will benefit from such tuning over time. The expectations of people looking for concrete causal relationships, though, may take more time to adjust. HealthBase is an exciting experiment in technology, which will benefit from more experiments in how to apply these technologies effectively to specific market needs.