The very mention of the word "metadata" is enough to make some weary eyes in the content industry glaze over, perhaps triggering memories of conference speakers who droned on endlessly about precision, recall and taxonomies. Metadata is extremely important stuff, though: it is the "under the bonnet" information that powers the semantic organization of content into useful forms, helping to maximize search engine exposure and to drive valuable categorization tools.
Thomson Reuters takes metadata very seriously, and has been working via its Calais initiative to promote the use of its semantic processing capabilities to create more valuable content through metadata generation and source linking. The latest Calais release improves document processing so that a publisher's partners can access the rich metadata and source linking available via Calais by using a simple document code generated by the Calais document parser. Instead of having to settle for just a hyperlink to an online document, partners can use these codes to link into specific data and information sources related to the metadata in a document. So, for example, Calais can return links to Wikipedia articles on topics that surfaced from the metadata. That's a free example, but obviously premium content sources can be swapped into the picture for those wanting to develop more sophisticated content products and services.
Calais now includes a preview tool that can take any document and parse it out, providing a human-friendly display of the results or an XML-formatted RDF document that provides the original text with the Calais metadata and link tags inserted into the appropriate spots in the document. While the tool is not a demonstration of a production-ready process, it's easy enough to get the picture of how one could apply the Calais document processor in a production environment.
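To make the idea of tags "inserted into the appropriate spots" concrete, here is a minimal sketch in Python of how extracted entities might be woven back into the original text. It is only an illustration: the `Entity` fields, the `<entity>` tag name and the example.org link are assumptions made up for this sketch, not the actual Calais RDF schema or API.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Entity:
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character
    kind: str   # illustrative label such as "Company" or "Person"
    uri: str    # hypothetical link target for background information


def annotate(text, entities):
    """Wrap each entity span in an <entity> tag, leaving other text as-is."""
    out = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.start < cursor:
            continue  # skip spans that overlap one already emitted
        out.append(escape(text[cursor:ent.start]))
        out.append(
            f"<entity type={quoteattr(ent.kind)} href={quoteattr(ent.uri)}>"
        )
        out.append(escape(text[ent.start:ent.end]))
        out.append("</entity>")
        cursor = ent.end
    out.append(escape(text[cursor:]))
    return "".join(out)


# Hypothetical input: one company name found at offsets 0-15.
sample = "Thomson Reuters runs Calais."
found = [Entity(0, 15, "Company", "http://example.org/company/thomson-reuters")]
tagged = annotate(sample, found)
print(tagged)
```

The link in each tag is the "hook": a downstream application could follow it to pull in a company profile or a biography, which is where a free or premium source could be swapped in.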
I fed a recent weblog entry into the preview tool to get a flavor of its capabilities. In general, semantic processors don't do a great job of finding relevance in short documents, since there is just not enough content to "chew" and weigh, and Calais shows typical limits in some of its concept extraction. Nevertheless, it did an impressive job of pulling out a long list of concept keywords relating to the media topics covered in the post. Its entity extraction also identified the people and publishers mentioned, along with parent companies that weren't mentioned directly in the blog post. In other words, from a simple text document Calais can take you to a fully metatagged document with "hooks" in it that can pull in financial and company background information, biographies and other valuable content in a flash.
Calais is spreading its wings not only with media-oriented content but also with enterprise-oriented content sources. The release summary mentions enhancements for product identification, competitive intelligence and judicial events, as well as automated document-level categorization for recreation, environment, weather and legal content, which should give you a hint as to the types of organizations that are starting to put Calais through its paces.
While Calais remains a relatively low-profile project at Thomson Reuters, it's clear that the company is unfolding a sophisticated scheme for profiting from the virtual aggregation of content linked primarily through metadata tools such as Calais. In other words, why own the data when you can own the data relationships that add the most value to a content source? It's a compelling concept, one that has a lot of potential value for enterprise and media content markets and that is likely to grow in importance over time. I recommend stopping by the Calais site to poke around a bit and to get your own ideas as to how applying both metadata and linking capabilities to your own content sources can help to extend their value rapidly. One recalls that Calais, on the coast of France, was the decoy landing site for the Allies' D-Day invasion of Normandy in 1944. The Calais initiative may not look like a real product in and of itself, but it may serve as a beachhead for a broader product vision before long.