Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/747

Title: Properties of the n-overlap vector and n-overlap similarity theory
Authors: EGGHE, Leo
Issue Date: 2006
Publisher: Wiley
Citation: JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 57(9). p. 1165-1177
Abstract: In the first part of this paper we define the n-overlap vector, whose coordinates consist of the fraction of the objects (e.g. books, N-grams, …) that belong to 1, 2, …, n sets (more generally: families) (e.g. libraries, databases, …). With the aid of the Lorenz concentration theory we build a theory of n-overlap similarity and corresponding measures, such as the generalized Jaccard index (generalizing the well-known Jaccard index in the case n=2). Next we determine the distributional form of the n-overlap vector, assuming certain distributions of the object sizes and of the set (family) sizes. In this section the decreasing power law and the decreasing exponential distribution are explained for the n-overlap vector. Both item (token) n-overlap and source (type) n-overlap are studied. The final section is devoted to the n-overlap properties of objects indexed by a hierarchical system (e.g. books indexed by numbers from a UDC or Dewey system or by N-grams). We show that the results of Section II can be applied here. We also show that the Lorenz order of the n-overlap vector is respected by an increase or a decrease of the level of refinement in the hierarchical system (e.g. the value N in N-grams).
URI: http://hdl.handle.net/1942/747
DOI: 10.1002/asi.v57:9
ISI #: 000238519600003
ISSN: 1532-2882
Category: A1
Type: Journal Contribution
Validation: ecoom, 2007
Appears in Collections: Research publications

Files in This Item:

Description                     Size        Format
Published version               149.98 kB   Adobe PDF
Peer-reviewed author version    618.18 kB   Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.