Feature Generation for Text Categorization Using World Knowledge

This is an online appendix for the paper "Feature Generation for Text Categorization Using World Knowledge" by Evgeniy Gabrilovich and Shaul Markovitch, Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, UK, August 2005.

Abstract

We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.


Here we provide the dataset details that were omitted from the paper owing to lack of space: the RCV1 categories that make up each dataset.

Dataset              Categories comprising the dataset
Topic-16 (RCV1)      e142, gobit, e132, c313, e121, godd, ghea, e13, c183, m143, gspo, c13, e21, gpol, m14, c15
Topic-10A (RCV1)     e31, c41, c151, c313, c31, m13, ecat, c14, c331, c33
Topic-10B (RCV1)     m132, c173, g157, gwea, grel, c152, e311, c21, e211, c16
Topic-10C (RCV1)     c34, c13, gtour, c311, g155, gdef, e21, genv, e131, c17
Industry-16 (RCV1)   i81402, i79020, i75000, i25700, i83100, i16100, i1300003, i14000, i3302021, i8150206, i0100132, i65600, i3302003, i8150103, i3640010, i9741102
Industry-10A (RCV1)  i47500, i5010022, i3302021, i46000, i42400, i45100, i32000, i81401, i24200, i77002
Industry-10B (RCV1)  i25670, i61000, i81403, i34350, i1610109, i65600, i3302020, i25700, i47510, i9741110
Industry-10C (RCV1)  i25800, i41100, i42800, i16000, i24800, i02000, i34430, i36101, i24300, i83100
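
For readers who want to rebuild these subsets, the table above can be written down as a small lookup structure. The following is a minimal sketch in Python, not the procedure used in the paper: the category codes are copied from the table, while the DATASETS layout, the select_documents helper, and the toy documents are hypothetical names introduced only for illustration. It assumes each document comes with a list of RCV1 category codes and keeps the documents that carry at least one code from the chosen dataset.

```python
# Category codes copied from the table above (lowercase, as listed there).
# The dictionary layout and the helper below are illustrative assumptions.
DATASETS = {
    "Topic-16": ["e142", "gobit", "e132", "c313", "e121", "godd", "ghea", "e13",
                 "c183", "m143", "gspo", "c13", "e21", "gpol", "m14", "c15"],
    "Topic-10A": ["e31", "c41", "c151", "c313", "c31", "m13", "ecat", "c14",
                  "c331", "c33"],
    "Topic-10B": ["m132", "c173", "g157", "gwea", "grel", "c152", "e311", "c21",
                  "e211", "c16"],
    "Topic-10C": ["c34", "c13", "gtour", "c311", "g155", "gdef", "e21", "genv",
                  "e131", "c17"],
    "Industry-16": ["i81402", "i79020", "i75000", "i25700", "i83100", "i16100",
                    "i1300003", "i14000", "i3302021", "i8150206", "i0100132",
                    "i65600", "i3302003", "i8150103", "i3640010", "i9741102"],
    "Industry-10A": ["i47500", "i5010022", "i3302021", "i46000", "i42400",
                     "i45100", "i32000", "i81401", "i24200", "i77002"],
    "Industry-10B": ["i25670", "i61000", "i81403", "i34350", "i1610109",
                     "i65600", "i3302020", "i25700", "i47510", "i9741110"],
    "Industry-10C": ["i25800", "i41100", "i42800", "i16000", "i24800", "i02000",
                     "i34430", "i36101", "i24300", "i83100"],
}


def select_documents(docs, dataset_name):
    """Return (doc_id, codes) pairs restricted to the dataset's categories.

    `docs` is an iterable of (doc_id, list_of_codes). Documents with no code
    in the dataset are dropped; codes outside the dataset are discarded.
    """
    wanted = set(DATASETS[dataset_name])
    selected = []
    for doc_id, codes in docs:
        kept = [code for code in codes if code in wanted]
        if kept:
            selected.append((doc_id, kept))
    return selected


# Hypothetical toy input: three documents with made-up label assignments.
toy_docs = [
    ("d1", ["e142", "ccat"]),
    ("d2", ["gspo", "c15"]),
    ("d3", ["m11"]),
]
topic16_docs = select_documents(toy_docs, "Topic-16")
```

Keeping the category lists in one place also makes it easy to check that each dataset has as many categories as its name suggests (16 or 10).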


Evgeniy Gabrilovich
gabr@cs.technion.ac.il

Last updated on April 4, 2005