Knowledge Extraction

Semantic Studio includes a content extraction, categorization and retrieval system for unified search and discovery of structured and unstructured data.

Semantic Studio exploits the semantics (the structures and assumptions of meaning) that govern specific application domains and kinds of documents. Semantic Studio understands the intentions, interests, and vocabulary underlying content and implicit in user behavior.

The value and significance of Semantic Studio is directly related to two broad issues:

  • How can content be more accurately cataloged and accessed?
  • How can data that is presently locked inside documents be extracted in a high-quality, consistent, structured form?

The answers have a common basis. Semantic Studio does not treat all content as if it is the same. By distinguishing different document types (a scientific article, a resume, a job listing, a news article, a product blurb, etc.) Semantic Studio is able to:

  • Enhance content through automatic recognition, classification, and inference of important features within each document and related data and metadata.
  • Exploit the context in which content is used by focusing on features that are most relevant. For example, the skills that are expressed or implied in a resume are considered more relevant when they appear in more recent work experience as opposed to less recent work experience.
  • Increase understanding through pattern discovery and data mining of documents that have prior known characteristics.
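
The resume example above (recent experience counting for more than older experience) can be sketched as a simple decay weighting. This is an illustration only, not Semantic Studio's actual method: the function name `skill_scores`, the exponential decay, and the five-year half-life are all assumptions chosen for the sketch.

```python
def skill_scores(jobs, current_year, half_life_years=5.0):
    """Score skills so that recent jobs count more than older ones.

    Hypothetical sketch: `jobs` is a list of (end_year, [skills]) pairs.
    A skill's weight halves for every `half_life_years` since the job ended.
    """
    scores = {}
    for end_year, skills in jobs:
        age = max(0, current_year - end_year)
        weight = 0.5 ** (age / half_life_years)
        for skill in skills:
            scores[skill] = scores.get(skill, 0.0) + weight
    return scores


# Illustrative data only.
jobs = [(2020, ["python"]), (2010, ["cobol"])]
scores = skill_scores(jobs, current_year=2020)
```

With this weighting, a skill from a job that ended ten years ago contributes a quarter of the weight of a skill from the current job.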

Semantic Studio achieves much higher categorization and retrieval quality because, for a given document type (say, a resume), the relevant and interesting features (skills, positions held, degrees earned, etc.) can be automatically recognized and extracted as fields in a structured XML format. That fielded data can then be combined with other fielded data, inferences can be drawn, and the document can be indexed for much higher retrieval quality.

Semantic Studio attempts to extract as much meaning as it can from a document of a known type. It then can exploit that meaning in creating effective summarizations and high quality index entries.

Below is a diagram that illustrates the interactions between the information extraction components and the lexicon / ontology components of Semantic Studio.

 

 

Factory Schematic

 

Below is a screenshot of a Semantic Studio tool that has created a semantically marked-up version of a PDF scientific article. Semantic Studio has:

  • identified the actual text of the article;
  • eliminated visual mark-up (such as headers and footers);
  • pulled out floating text, such as captions of figures;
  • reflowed content text that may have been broken by pagination, footnotes, or illustrations;
  • created semantic mark-up of entities, concepts, and "gist" within the text; and
  • created an XML equivalent that structures the knowledge that had been locked within the PDF.

 

 

Screenshot of the Semantic Factory

 

 

A Slightly Potted History of Information Extraction Technology


Information Extraction technology has evolved in a series of generational leaps. Understanding the assumptions that underlie each of these approaches provides a greater understanding of why the categorization and retrieval technology inside of Semantic Studio is superior.

In the beginning (Generation 0) of content management systems, inverted indexes of all words within documents (or those remaining after simple stemming and stopping) were used in conjunction with Boolean query processing. And it was OK, but not great. Queries often led to nasty surprises: seemingly irrelevant documents were returned, and some important documents were missed.
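
A Generation 0 system can be sketched in a few lines. The following is a minimal illustration, not code from any particular product: the stop-word list, the function names, and the sample documents are all hypothetical, and stemming is left out for brevity.

```python
from collections import defaultdict

# Hypothetical stop-word list; real systems used much longer ones.
STOP_WORDS = {"the", "and", "with", "a", "of", "in", "at"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Illustrative documents only.
docs = {
    1: "Chemist at Dow Chemical",
    2: "Accountant at Dow Chemical",
    3: "Studied chemistry in the university",
}
index = build_index(docs)
```

Both kinds of surprise show up even at this scale: the query "dow chemical" returns the accountant alongside the chemist, and the query "chemist" misses the document that only mentions "chemistry".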

Gerard Salton begat the Vector Space Model and thus Generation 1 of content management systems took hold. It was better, but still not great. And, with hindsight, we can understand why through an examination of the underlying assumptions of Generation 1:

  • The content is just a bag of documents. By this we mean that all documents are treated the same way – even if the documents have a very different structure or intent. A short note or email is indexed using the same mechanism as a resume or a 10-K document.
  • Each document is just a bag of words. Unlike how a document is written and read, to a Generation 1 system, word order did not matter and the part of speech of individual words did not matter. At most, some words may be considered as noise and not indexed (such as “the”, “and”, and “with”) and others may be processed to a common root (“educator”, “education”, “educates” would all go to “educat”). This processing is known as “stopping and stemming.”
  • The retrieval (and ranking) of documents is computed using a notion of similarity that mathematically reduces to a weighted dot product of query terms with terms in each document. Weights are assigned based on word occurrence frequencies – these weights can be stored within the indexes for efficiency. Also, lists of synonyms may be used to broaden queries.
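
The three assumptions above can be sketched together as follows. This is a minimal illustration under stated assumptions: the TF-IDF weighting and cosine normalization shown here are one common choice among several, the function names and sample documents are hypothetical, and stemming and synonym expansion are omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list for the sketch.
STOP_WORDS = {"the", "and", "with", "a", "of", "in", "at"}


def terms(text):
    """Reduce a text to a bag of words: order and part of speech are discarded."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document, plus the idf table."""
    n = len(docs)
    tokenized = [terms(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def dot(u, v):
    """Dot product of two sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cosine(u, v):
    """Dot product normalized by vector lengths; 0.0 if either vector is empty."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank(query, docs):
    """Return indexes of matching documents, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(terms(query)).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for s, i in sorted(scores, key=lambda p: (-p[0], p[1])) if s > 0]


# Illustrative documents only.
docs = [
    "education policy and education funding",
    "the educator retired",
    "funding for chemistry research",
]
ranking = rank("education funding", docs)
```

Because stemming is omitted in this sketch, the document about the "educator" is not retrieved for a query on "education"; a stemmer that reduced both words to a common root would close that gap.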

These assumptions amount to a plausible model of retrieval that suffices for many applications. However, many people observed that documents are often related by having similar “themes” that often are reflected only very approximately by the actual words that are used. Themes are more general than synonyms.

And thus, these statistically minded people thought that themes were more important than literal words, and likened the problem to “factor analysis” and “singular value decompositions” and did say that latent semantic indexing (LSI) was good. Thus, Generation 2 of content management systems was born. It was (sometimes) a bit better than Generation 1. And again, with hindsight, we can examine the underlying assumptions of Generation 2:

  • The content is still considered to be merely a bag of documents – all of the documents are treated the same way. No particular difference from Generation 1.
  • Each document is still considered to be just a bag of words, but with a different twist from Generation 1. Usually stemming and stopping were not applied. And the math of indexing was very different. An implicit (or “latent”) structure was uncovered that tends to capture those words that have a strong tendency to appear together. This creates (in a way that is made mathematically precise) a moderate number of dimensions that a document can fall within. In an index that takes each document and assigns its place using these dimensions, “similar” documents will tend to cluster together. Now here a claim can be made that words really do not matter at all. No matter what language, no matter what topics, the same mathematical model applies.
  • Retrieval (and ranking) is computed using some math that transforms the words that form a query into the same dimensions that are used for indexing the documents. But something seems a bit odd – much of the precision of the Boolean queries of Generation 0 and Generation 1 is lost.
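
The math behind these assumptions can be written down compactly in the standard textbook formulation of LSI. The notation is an assumption of this sketch and does not come from the original text: A is the term-by-document matrix (m terms, n documents), k is the number of latent dimensions kept, a_j is the j-th column of A, and q is the query expressed as a term vector.

```latex
% Truncated singular value decomposition of the term-by-document matrix
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Document j and query q mapped into the same k latent dimensions
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} a_j,
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Ranking by cosine similarity in the latent space
\operatorname{sim}(q, d_j) =
\frac{\hat{q}^{\top} \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Since every query is projected onto the k latent dimensions before matching, a document can score well without containing any of the literal query words, which is one way to see why the precision of Boolean matching is lost.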

Arity (and others) have recognized that the assumptions of previous generations were out of step with actual applications and business needs. Semantic Studio, a Generation 3 categorization and retrieval product, is based on new assumptions:

  • The content is no longer considered to be merely a bag of documents. Unlike previous generations, documents of like kind are processed based on common patterns and expected features and conventions. Documents generally have sections that govern many assumptions of how they should be read. In a resume, working for a university is quite a different matter than attending that university.
  • Words matter! That means managing domain-specific vocabularies and recognizing names of things (people, companies, dates, locations, etc.). And structure counts. Again, in a resume, not everyone who works for Dow Chemical is a chemist!
  • The context of usage counts. If an application is matching documents to their counterparts (say, a job listing to candidate resumes) then the ranking is clearly a function of most relevant skills, location (including ability to relocate), and the level of previous experience (previous job title, length of employment). If, on the other hand, we are trying to find and match people by similar backgrounds, then we focus on companies, schools, and hobbies and interests that are in common. A different intent.

Semantic Studio exploits these new assumptions. The difference and its significance are evident when comparing the quality of retrieval typified by the Vector Space Model, LSI, and Semantic Studio.

 

Important Elements of Knowledge Extraction from Documents

What is to be Recognized | Clues within Document | What is Extracted
------------------------ | --------------------- | -----------------
Subject Area of Document | Overall structure, vocabulary | Triggers for subject area-specific templates (XML)
Section themes and "gist" | Ontology terms, context | Activation of major and nested sections of XML templates
Knowledge within sections | Ontology terms, named entities, spans of thematically related text | Semantic mark-up, semantic tags
Citations, figure captions, and other semi-structured text | Regular structures, breaks in text flow | Factually fielded data, often composite and nested
Body text | Linguistically analysed text | Concepts, named entities, relationships and "gist"