Adapting DBpedia for EXPRESSway
EXPRESSway allows users to draw information out of Knowledge Servers based on Semantic Studio. Most Knowledge Servers are relatively narrow in scope, addressing a particular domain such as nanotechnology, pharmaceuticals, or publicly-traded companies.
Arity has explored several options for creating Knowledge Servers that provide general knowledge (broad coverage, at the expense of depth in any particular area). Wikipedia is an obvious resource, and several projects aim to extract structured data from it, including Freebase and DBpedia. The extracted data can then be interlinked with other sources.
Quality of data derived from Wikipedia and other sources is an issue in two principal ways. First, the accuracy of links and of extracted facts needs to be high enough that users can feel confident using them. This is no small matter: for example, Yago (a very large extract of Wikipedia) claims a manually verified accuracy of 95%, which unfortunately translates to one out of twenty facts being wrong, a rate that is probably unacceptable.
Second, the coverage of data needs to be reasonably uniform. If the Knowledge Server provides relatively detailed information on some topics in an area (say, software products), then users come to expect that the whole area is reasonably well represented, without obvious topics missing while obscure topics are included. It is on this criterion that we have not been satisfied with Freebase.
The DBpedia Ontology is a subset of the entirety of DBpedia (about 1,478,000 instances organized into about 260 classes and about 1,200 relationship types) that improves on the accuracy of the full DBpedia with (we believe) an acceptable loss of coverage. Here I sketch the steps that I used to adapt the DBpedia Ontology to create a Knowledge Server using Semantic Studio.
First, the classes (concepts) and relationship types of DBpedia are defined using OWL. These are imported and prepared for further processing. The primary hurdle is the computation of the “classification graph,” which specifies the more general classes and more specific classes related to each class. For example, a “company” is more specific than an “organization” and less specific than a “publicly-traded company.” By contrast, “company” and “person” are non-comparable (and in fact disjoint, which means that the only thing they have in common is that instances of either belong to the class “Thing”).
Arity has developed a reasoner which computes (among other things) subsumption graphs – based on the EL++ description logic. While for this particular problem using our EL++ reasoner is like swatting a fly with a sledgehammer, it is convenient. (DBpedia as distributed does not completely specify the semantics of the ontology - omitting important information about the relationships used. This includes at a minimum assertions about reflexive relationships and relationship hierarchies. These needed to be added by Arity.)
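As a rough illustration of what a classification graph contains, here is a minimal Python sketch. It assumes only plain subclass assertions and computes their transitive closure; it is not Arity's EL++ reasoner, which handles far richer semantics. The class names and the `DIRECT_SUBCLASS_OF` list are hypothetical, modeled on the examples above rather than taken from the DBpedia Ontology files.

```python
# Hypothetical direct "is more specific than" assertions (child, parent),
# modeled on the examples in the text rather than copied from DBpedia.
DIRECT_SUBCLASS_OF = [
    ("Organization", "Thing"),
    ("Person", "Thing"),
    ("Company", "Organization"),
    ("PubliclyTradedCompany", "Company"),
]


def build_classification_graph(pairs):
    """Map every class to the set of all classes that are more general than it."""
    parents = {}
    for child, parent in pairs:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())

    graph = {}
    for cls in parents:
        seen = set()
        stack = list(parents[cls])
        while stack:  # walk upward through the hierarchy
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents[current])
        graph[cls] = seen
    return graph


def comparable(a, b, graph):
    """True if one class subsumes the other (or they are the same class)."""
    return a == b or a in graph[b] or b in graph[a]


graph = build_classification_graph(DIRECT_SUBCLASS_OF)
print(sorted(graph["Company"]))
print(comparable("Company", "Person", graph))
```

In this sketch, `graph["Company"]` holds Organization and Thing, while Company and Person come out as non-comparable. The more specific classes of each class can be obtained by inverting the same mapping.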
Second, DBpedia data is provided in N-Triples format, which is one popular way of encoding RDF triples. Each statement asserts a property or relationship: that a particular thing (an instance) is a member of a particular class; that two instances are related by a particular relationship type (“Honda” “makes” “Civic”); or that an instance has a faceted value (“Water” “freezesAt” “0 degrees C”).
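To make the three kinds of statement concrete, the following Python sketch reads a few lines written in the style of N-Triples and sorts them into class memberships, relationships, and values. It is a simplified reader, not a complete N-Triples parser: it assumes IRI subjects and predicates and ignores blank nodes and escape decoding. The `example.org` identifiers and the `parse_statement` helper are hypothetical and do not come from the DBpedia distribution.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Matches '<subject> <predicate> <object-IRI> .' or
# '<subject> <predicate> "literal" .' with an optional datatype or language tag.
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+'
    r'(?:<([^>]*)>|"((?:[^"\\]|\\.)*)"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
    r'\s*\.$'
)


def parse_statement(line):
    """Return (kind, subject, predicate, object), or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = TRIPLE.match(line)
    if match is None:
        raise ValueError(f"unsupported line: {line!r}")
    subject, predicate, obj_iri, literal = match.groups()
    if obj_iri is None:
        return ("value", subject, predicate, literal)
    if predicate == RDF_TYPE:
        return ("membership", subject, predicate, obj_iri)
    return ("relationship", subject, predicate, obj_iri)


# Illustrative data only; these identifiers are made up for the example.
SAMPLE = """
# three statements, one of each kind
<http://example.org/Honda> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Company> .
<http://example.org/Honda> <http://example.org/makes> <http://example.org/Civic> .
<http://example.org/Water> <http://example.org/freezesAt> "0"^^<http://www.w3.org/2001/XMLSchema#integer> .
"""

statements = [s for s in map(parse_statement, SAMPLE.splitlines()) if s is not None]
for kind, subject, predicate, obj in statements:
    print(kind, subject, predicate, obj)
```

For production use, an established RDF library would be a safer choice than a hand-written regular expression.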
These DBpedia assertions are not complete as stated. Again, a kind of reasoning applies. Each relationship type in the Ontology has a domain (the classes that apply to the “source”) and a range, which, if referring to an instance (and not a value), states the classes that apply to the “destination.” The relationship assertions on instances leave these domain and range class assertions implicit. A process similar to the EL++ reasoning that computed the subsumption graph can be used to make all instance class memberships explicit.
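The idea of making implicit memberships explicit can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: one domain class and one range class per relationship type, and a precomputed table of more general classes. The names `SCHEMA`, `SUPERCLASSES`, and `infer_memberships`, and the sample data, are hypothetical and are not Semantic Studio's actual process.

```python
# Hypothetical schema: relationship type -> (domain class, range class).
SCHEMA = {"makes": ("Company", "Product")}

# Hypothetical, precomputed "more general classes" for each class.
SUPERCLASSES = {
    "Company": {"Organization", "Thing"},
    "Product": {"Thing"},
}

# Relationship assertions on instances; no class memberships are stated.
RELATIONS = [("Honda", "makes", "Civic")]


def infer_memberships(relations, schema, superclasses):
    """Derive explicit class memberships from domain and range declarations."""
    memberships = {}

    def add(instance, cls):
        classes = memberships.setdefault(instance, set())
        classes.add(cls)
        # Membership in a class implies membership in its more general classes.
        classes.update(superclasses.get(cls, set()))

    for source, relation, destination in relations:
        if relation not in schema:
            continue  # nothing is known about this relationship type
        domain, range_ = schema[relation]
        add(source, domain)
        add(destination, range_)
    return memberships


memberships = infer_memberships(RELATIONS, SCHEMA, SUPERCLASSES)
for instance in sorted(memberships):
    print(instance, sorted(memberships[instance]))
```

From the single assertion that Honda makes Civic, the sketch concludes that Honda is a Company (and therefore an Organization and a Thing) and that Civic is a Product.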
Third, we can “mix in” additional knowledge that is available in sets of linkages between sources. For example, DBpedia has links to other data sources such as WordNet (a broad-coverage thesaurus of the English language) or the data provided by the US Census (www.rdfabout.com/demo/census/).
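One way to picture the “mix in” step is as a join across a link set. The Python sketch below is a minimal illustration assuming that links are simple pairs of identifiers and that each source keeps its facts in a dictionary. The identifiers, property names, and the `mix_in` helper are all hypothetical and do not reflect the actual DBpedia link files.

```python
# Hypothetical facts already held for DBpedia instances.
BASE_FACTS = {"dbpedia:Dog": {"label": "Dog"}}

# Hypothetical link set: (DBpedia instance, identifier in the external source).
LINKS = [("dbpedia:Dog", "wordnet:dog-noun-1")]

# Hypothetical facts published by the external source.
EXTERNAL_FACTS = {"wordnet:dog-noun-1": {"partOfSpeech": "noun"}}


def mix_in(base_facts, links, external_facts, source_name):
    """Copy external facts onto linked instances without altering the inputs."""
    merged = {instance: dict(props) for instance, props in base_facts.items()}
    for instance, external_id in links:
        for prop, value in external_facts.get(external_id, {}).items():
            # Prefix with the source name so provenance stays visible and
            # existing properties are never overwritten.
            merged.setdefault(instance, {})[f"{source_name}:{prop}"] = value
    return merged


merged = mix_in(BASE_FACTS, LINKS, EXTERNAL_FACTS, "wordnet")
print(merged)
```

Prefixing each imported property with its source is a design choice in this sketch: it keeps mixed-in knowledge distinguishable from facts that came from DBpedia itself.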
Fourth, we use the data packaging process of Semantic Studio to make this version of DBpedia available for use by EXPRESSway. The entire process can be automated to update the data whenever new versions of DBpedia are published.
Finally, we can extend this general knowledge server in a number of interesting directions using the information extraction and reasoning capabilities within Semantic Studio.
