STLabWikifier
From STLab
Description
The STLab Wikifier is a tool that automatically links any text to its related DBpedia entities. It looks for relevant terms (denoting either entities or concepts) within the text and selects their corresponding Wikipedia pages. Since terms are highly ambiguous in Wikipedia, a Word Sense Disambiguation (WSD) approach is used: disambiguation is based on the popularity of the destination page (measured by the number of input links) and on the similarity between the text to be analyzed and the candidate destination pages.
Technology
The tool is based on the following components:
Input
The input of the service is any Italian or English text.
Output
The output of the service is a list of URIs linking to DBpedia entities, together with the top ten dc-terms categories related to the text.
DEMO
An online demo of the service is available here. (For the old demo, click here.)
Using the tool (PRIVATE)
The tool can be used from the shell command line. A Java API is also available for building applications on top of the wikifier. For more details, see the Javadoc and the README file inside the API directory.
The process
The process consists of two steps: first, we create an index that captures the DBpedia features relevant to our purposes; second, we search the index's fields using the same Lucene analyzer used for indexing (this is very important). See the basic flow chart of the process.
Indexing
We use DBpedia resources to create the wikifier's index. The information we need is: the URI of the resource (called path in the index), the resource's label (called term), its wiki page (called page), the resource's comment or abstract (called cont), its categories (called category), its types (called type), and the number of input links to the resource, i.e. how many resources point to it (called npl).
Get data by sparql queries
The information gathered via SPARQL queries (see http://stlab.istc.cnr.it/stlab/STLabWikifier/sparql_queries) from a SPARQL endpoint is:
- the resources: from DBpedia we get all the resources that have a label, except disambiguation pages. (DBpedia dataset: dbpedia_labels_en/it; property: http://www.w3.org/2000/01/rdf-schema#label.)
- the labels: for each resource gathered in the first step we get its label, plus the labels of the Wikipedia redirect pages associated with it. (DBpedia datasets: dbpedia_labels_en/it and dbpedia_redirects_en; properties: http://www.w3.org/2000/01/rdf-schema#label and http://dbpedia.org/ontology/wikiPageRedirects. N.B. redirects are available only for the English version.)
- the resource's content: either the abstract or the comment. If the abstract exists we take it; otherwise we take the comment. (DBpedia datasets: dbpedia_short_abstracts_en/it and dbpedia_long_abstracts_en/it; properties: http://dbpedia.org/ontology/abstract and http://www.w3.org/2000/01/rdf-schema#comment.)
- the resource's page: the Wikipedia page. (DBpedia dataset: dbpedia_wikipedia_links_en/it; property: http://xmlns.com/foaf/0.1/page.)
- the resource's types: the ontology types. (DBpedia dataset: dbpedia_instance_types_en; property: http://www.w3.org/1999/02/22-rdf-syntax-ns#type.)
- the resource's categories: the dc-terms associated with the resource. The categories are used to classify the wikified text: over the resources output by the wikifier, we count the occurrences of each associated category and list the top ten to classify the text. (DBpedia dataset: dbpedia_article_categories_en; property: http://purl.org/dc/terms/subject.)
- the number of input links to the resource: the number of resources that link to the analyzed resource. (DBpedia dataset: dbpedia_page_links_en/it; property: http://dbpedia.org/ontology/wikiPageWikiLink.)
To run the queries we used the Virtuoso SPARQL endpoint with the JARs provided by Virtuoso and Jena. The wikifier ships with a set of default queries, but it is also possible to supply a file containing custom SPARQL queries. It is important that the user's queries gather at least the label, page, comment and categories information.
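As an illustration, a minimal query of the kind described for the labels step might look like the following. This is a hypothetical reconstruction built from the properties listed above, not one of the tool's actual default queries, and the language-tag filter is an assumption:

```java
public class LabelQuery {
    // Build a SPARQL query gathering every resource that has a label in the
    // given language, as in the "labels" step above. The FILTER on the
    // language tag is an assumption; the tool's real default queries are
    // shipped with the distribution and are not reproduced here.
    public static String forLanguage(String lang) {
        return "SELECT DISTINCT ?resource ?label WHERE { "
             + "?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label . "
             + "FILTER (lang(?label) = \"" + lang + "\") "
             + "}";
    }
}
```

A query string like this can then be submitted to the endpoint through the Jena/Virtuoso JARs mentioned above.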
Create the Lucene document
From the results of the SPARQL queries we create a Lucene document for indexing and storing. The document represents a DBpedia resource and has the following fields:
- path: the URI of the DBpedia resource. This field is not analyzed and is stored in the index.
- page: the Wikipedia page of the DBpedia resource. This field is not analyzed and is stored in the index.
- type: a string containing the resource's types separated by ";". This field is not analyzed and is stored in the index.
- category: a string containing the resource's categories separated by ";". This field is not analyzed and is stored in the index. It is used to categorize the wikified text.
- pagelink: the number of input page links to the resource, i.e. the count of the resources that link to this one.
- termLANGUAGE: a set of terms corresponding to the labels of the resource (the resource's own label plus those of redirects to it). This field has stop words removed, is stemmed, and is stored in the index.
- termLANGUAGEkey: the same set of label terms as before, but processed by Lucene's KeywordAnalyzer. This field is only stemmed before indexing and is stored in the index.
- contLANGUAGE: the abstract or the comment associated with the resource. This field has stop words removed, is stemmed, and is stored in the index.
Example of a document:
- path: stored,indexed; path: http://dbpedia.org/resource/Amaranthaceae
- page: stored,indexed; page: http://en.wikipedia.org/wiki/Amaranthaceae
- type: stored,indexed; type: http://dbpedia.org/ontology/Species; http://dbpedia.org/ontology/Plant; http://dbpedia.org/ontology/Eukaryote
- category: stored,indexed; category: Caryophyllales_families, Amaranthaceae
- pagelink: stored,indexed; pagelink: 337
- termENGLISH: amaranthaceae
- termENGLISH: goosefoot family
- termENGLISHkey: amaranthacea
- termENGLISHkey: goosefoot famili
- contENGLISH: family Amaranthaceae, the Amaranth family, contains about 160 etc ...
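The same example can be written down as a plain field-to-values record. This is only an illustration of the document's shape in ordinary Java collections; the tool itself builds an actual Lucene Document:

```java
import java.util.*;

public class WikifierDoc {
    // Plain-Java stand-in for the Lucene document above: each field name
    // maps to its values, with multi-valued fields (termENGLISH,
    // termENGLISHkey) holding several entries. In the real index, type and
    // category are single strings with ";"-separated values.
    public static Map<String, List<String>> amaranthaceaeExample() {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("path", List.of("http://dbpedia.org/resource/Amaranthaceae"));
        doc.put("page", List.of("http://en.wikipedia.org/wiki/Amaranthaceae"));
        doc.put("type", List.of("http://dbpedia.org/ontology/Species",
                                "http://dbpedia.org/ontology/Plant",
                                "http://dbpedia.org/ontology/Eukaryote"));
        doc.put("category", List.of("Caryophyllales_families", "Amaranthaceae"));
        doc.put("pagelink", List.of("337"));
        doc.put("termENGLISH", List.of("amaranthaceae", "goosefoot family"));
        doc.put("termENGLISHkey", List.of("amaranthacea", "goosefoot famili"));
        doc.put("contENGLISH", List.of("family Amaranthaceae, the Amaranth family, contains about 160 ..."));
        return doc;
    }
}
```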
To create the index we use Lucene's PerFieldAnalyzerWrapper. This class builds a multi-analyzer, with one analyzer for each document field to process:
PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(Version.LUCENE_31));
aWrapper.addAnalyzer("termENGLISH", stemanalyzerEN); //Terms, stemmed.
aWrapper.addAnalyzer("termENGLISHkey", keyanalyzer); //Terms as keywords.
aWrapper.addAnalyzer("cont" + language.toUpperCase(), stemanalyzerStopEN); //Content (the text), stemmed and stop words removed.
The wrapper contains the Snowball-based EnglishAnalyzer (or the Italian one) and the KeywordAnalyzer, which "tokenizes" the entire label as a single token:
EnglishAnalyzer stemanalyzerEN = new EnglishAnalyzer(Version.LUCENE_31, emptyset); //Only stemming
EnglishAnalyzer stemanalyzerStopEN = new EnglishAnalyzer(Version.LUCENE_31, (new StopWords(language)).getStopWord()); //Stemming plus stop words
KeywordAnalyzer keyanalyzer = new KeywordAnalyzer(); //Keyword analyzer
The only analyzed fields are:
- termLANGUAGE: stemmed and with stop words removed. It contains all the labels of a resource; there is one field per label.
- termLANGUAGEkey: contains the labels of a resource, one field per label. Each label is pre-processed by stemming.
- contLANGUAGE: contains either the comment or the abstract of a resource. It is stemmed and has stop words removed.
N.B. It is important that the SEARCH STEP USES THE SAME ANALYZER WRAPPER AS THE INDEXING STEP.
Searching
There are three kinds of search, distributed over two levels: exact search, proximity search and filter search. Exact and proximity search are on the same level, and filter search sits below them. Exact and proximity search take the keywords and search the termLANGUAGE and termLANGUAGEkey fields, whereas filter search takes the entire text (or the sentences containing the keywords) and searches a sub-index built from the results of the first two.
In the search we use the Lucene score (ls), the number of page links (NPL) and a total score defined as: W * log2(NPL) + (1 - W) * ls, with 0 < W < 1.
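The total score is a direct transcription of the formula above; the weight W is a tuning parameter whose value the page does not specify:

```java
public class TotalScore {
    // Total score as defined above: W * log2(NPL) + (1 - W) * ls, 0 < W < 1,
    // where NPL is the number of input page links of the candidate resource
    // and ls is its Lucene score.
    public static double totalScore(double w, long npl, double luceneScore) {
        double log2Npl = Math.log(npl) / Math.log(2); // Math has no log2, so change base
        return w * log2Npl + (1 - w) * luceneScore;
    }
}
```

For example, with W = 0.5, NPL = 337 (the Amaranthaceae example above) and ls = 4.2, the total score is 0.5 * log2(337) + 0.5 * 4.2.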
Exact Searching
This searches the field termLANGUAGEkey, using the same analyzer wrapper as the indexing. The keywords are pre-processed by stemming and then searched in the index using Lucene. If we get results, we order them by number of page links (NPL) and keep the first 50. These are sent to the filter search, which searches among them in the field contLANGUAGE using the input text or the sentences containing the keywords. If the filter search returns results, we take the two with the highest total score; otherwise we take the two exact-search results with the highest total score:
- Keywords in input.
- Pre-process the keywords by stemming.
- Search for documents that exactly match the keyword in the field termLANGUAGEkey and take the first 50 results with the highest Lucene score.
- Order the results by number of input links (NPL).
- Create a sub-index with those documents.
- Search the new sub-index, within the field contLANGUAGE, for the input text or the sentences containing the keywords, and take the first 50 results with the highest Lucene score.
- If there are results, take the two with the highest total score.
- If the filter search produces no results, take the two exact-search results with the highest total score.
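The two-stage selection in the steps above (order by NPL, keep the first 50, then re-rank by total score) can be sketched in plain Java. The real tool performs these steps through Lucene queries, so Hit and selectTopTwo are illustrative names only, with W as in the total-score formula:

```java
import java.util.*;

public class ExactSearchSelection {

    // "Hit" stands in for a Lucene search result: it carries the resource
    // URI, its number of input page links (NPL) and its Lucene score.
    static class Hit {
        final String uri;
        final long npl;
        final double luceneScore;

        Hit(String uri, long npl, double luceneScore) {
            this.uri = uri;
            this.npl = npl;
            this.luceneScore = luceneScore;
        }

        // Total score as defined above: W * log2(NPL) + (1 - W) * ls.
        double totalScore(double w) {
            return w * (Math.log(npl) / Math.log(2)) + (1 - w) * luceneScore;
        }
    }

    // Order the candidate hits by NPL, keep the first 50 (the set handed to
    // the filter search), then return the two with the highest total score.
    static List<Hit> selectTopTwo(List<Hit> hits, double w) {
        List<Hit> byNpl = new ArrayList<>(hits);
        byNpl.sort(Comparator.comparingLong((Hit h) -> h.npl).reversed());
        List<Hit> top = new ArrayList<>(byNpl.subList(0, Math.min(50, byNpl.size())));
        top.sort(Comparator.comparingDouble((Hit h) -> h.totalScore(w)).reversed());
        return top.subList(0, Math.min(2, top.size()));
    }
}
```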
Proximity Searching
This searches the field termLANGUAGE, using the same analyzer wrapper as the indexing. The keyword is searched assuming that a distance of zero, one or two words may exist between the words composing it (i.e. for a keyword "A B" we search for "A B", "A w B" and "A w w B"). Starting from zero and going up to two, we stop at the first distance that yields results; if none does, we run a simple search on the keyword and keep the results whose number of words is equal to or higher than the number of words in the keyword. Once we have results, we order them by number of page links (NPL) and keep the first 50. These are sent to the filter search, which searches among them in the field contLANGUAGE using the input text or the sentences containing the keywords. If the filter search returns results, we take the two with the highest total score; otherwise we take the two proximity-search results with the highest total score:
- Keywords in input.
- Start a cycle that searches the keyword with a distance of zero, one or two between words.
- Search for documents that contain the keywords at the current distance in the field termLANGUAGE and take the first 50 results with the highest Lucene score.
- If there are results, order them by number of input links (NPL).
- If there are no results, search the keywords as they are, again in termLANGUAGE, and keep the results whose number of words is equal to or higher than the length of the keyword.
- Create a sub-index with those documents.
- Start a cycle that searches the keyword with a distance of zero, one or two between words.
- Search the new sub-index, within the field contLANGUAGE, for the input text or the sentences containing the keywords, and take the first 50 results with the highest Lucene score.
- If there are results, take the two with the highest total score.
- If the filter search produces no results, take the two proximity-search results with the highest total score.
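The distance cycle in the steps above can be sketched as follows. The real tool does this with Lucene queries on the termLANGUAGE field; here a regular expression stands in for illustration, matching the keyword's words in order with 0, 1 or 2 intervening words and stopping at the first distance that matches:

```java
import java.util.regex.*;

public class ProximityMatch {
    // For a keyword such as "A B", try to match its words with 0, 1 or 2
    // intervening words, stopping at the first distance that matches.
    public static int firstMatchingDistance(String keyword, String text) {
        String[] words = keyword.trim().split("\\s+");
        for (int slop = 0; slop <= 2; slop++) {
            StringBuilder re = new StringBuilder(Pattern.quote(words[0]));
            for (int i = 1; i < words.length; i++) {
                // Allow up to `slop` intervening words before the next word.
                re.append("(\\s+\\S+){0,").append(slop).append("}\\s+")
                  .append(Pattern.quote(words[i]));
            }
            if (Pattern.compile(re.toString()).matcher(text).find()) {
                return slop;
            }
        }
        return -1; // no match within distance two; fall back to a simple search
    }
}
```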
Filter Searching
The filter search creates a sub-index from the results of the exact and proximity searches. In this sub-index we search, inside the field contLANGUAGE, for the input text or the sentences containing the keyword. If we get results, the two with the highest total score are chosen:
- Create a sub-index with the output of the exact and proximity searches.
- Search within the field contLANGUAGE for the input text or the sentences containing the keyword.
- From the results, choose the two with the highest total score.
The filter search is used for disambiguation.