Abstract
M. H. Heine has shown that, if one follows the retrieval procedure associated with Swets' model of an information retrieval system, the inverse relationship between Recall and Precision may not hold. In this paper we extend Heine's result to the case where the criterion‐parameter can assume discrete as well as continuous values. A plausible model of this kind is described, and it is shown that for that model Recall and Precision are in fact inversely related. The condition under which this relation may fail to hold is then examined, and the conclusion is reached that this behaviour is an effect of the customary retrieval procedure rather than anything intrinsic to the Swets model itself. An alternative logic is proposed in which the expected relationship is restored and performance is improved.
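To make the Recall and Precision trade-off concrete, the following is a minimal sketch under stated assumptions, not the model analysed in the paper. It assumes that the criterion variable is normally distributed over relevant and non-relevant documents with equal variances, and that documents scoring above a threshold are retrieved. The function name `recall_precision`, the distribution parameters and the generality value of 0.1 are all illustrative choices.

```python
from statistics import NormalDist


def recall_precision(threshold, relevant, nonrelevant, generality):
    """Return (recall, precision) when documents scoring above `threshold` are retrieved.

    `relevant` and `nonrelevant` are assumed score distributions;
    `generality` is the assumed fraction of relevant documents in the collection.
    """
    recall = 1.0 - relevant.cdf(threshold)      # P(retrieved | relevant)
    fallout = 1.0 - nonrelevant.cdf(threshold)  # P(retrieved | non-relevant)
    retrieved = generality * recall + (1.0 - generality) * fallout
    precision = generality * recall / retrieved if retrieved > 0 else float("nan")
    return recall, precision


# Hypothetical equal-variance case: relevant documents score higher on average.
relevant = NormalDist(mu=1.0, sigma=1.0)
nonrelevant = NormalDist(mu=0.0, sigma=1.0)

# Sweep the threshold from -2.0 to 3.0 in steps of 0.5.
curve = [recall_precision(t / 2.0, relevant, nonrelevant, 0.1) for t in range(-4, 7)]
for recall, precision in curve:
    print(f"recall={recall:.3f}  precision={precision:.3f}")
```

In this equal-variance case Recall falls and Precision rises as the threshold is raised. Setting unequal `sigma` values is one way to explore parameter choices for which the inverse relationship may not hold everywhere.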
Abstract
Bradford distributions describe the relationship between ‘journal productivities’ and ‘journal rankings by productivity’. However, different ranking conventions exist, implying some ambiguity as to what the Bradford distribution ‘is’. A need accordingly arises for a standard ranking convention to assist comparisons between empirical data, and also comparisons between empirical data and theoretical models. Five ranking conventions are described including the one used originally by Bradford, along with suggested distinctions between ‘Bradford data set’, ‘Bradford distribution’, ‘Bradford graph’, ‘Bradford log graph’, ‘Bradford model’ and ‘Bradford's Law’. Constructions such as the Lotka distribution, Groos droop (generalised to accommodate growth as well as fall‐off in the Bradford log graph), Brookes hooks, and the slope and intercept of the Bradford log graph are clarified on this basis. Concepts or procedures questioned include: (1) ‘core journal’, from the Bradfordian viewpoint; (2) the use of traditional statistical inferential procedures applied to Bradford data; and (3) R(n) as a maximum (rather than median or mean) value at tied‐rank values.
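As a rough illustration of what a ‘Bradford log graph’ plots, the sketch below ranks journals by decreasing productivity and pairs the logarithm of each rank n with the cumulative number of articles R(n). It assumes one simple convention in which every journal receives a distinct rank and ties are ordered arbitrarily; this is not necessarily any of the five conventions the paper describes. The function name `bradford_log_graph` and the productivity figures are hypothetical.

```python
import math
from itertools import accumulate


def bradford_log_graph(productivities):
    """Return (ln n, R(n)) pairs for journals ranked by decreasing productivity.

    Assumed convention: each journal gets its own rank n = 1, 2, ...;
    journals with equal productivity are ordered arbitrarily.
    """
    ranked = sorted(productivities, reverse=True)
    cumulative = list(accumulate(ranked))  # R(n): articles in the top n journals
    return [(math.log(n), total) for n, total in enumerate(cumulative, start=1)]


# Hypothetical data set: number of relevant articles found in each of seven journals.
points = bradford_log_graph([1, 12, 3, 1, 7, 2, 1])
for log_rank, articles in points:
    print(f"ln(n)={log_rank:.3f}  R(n)={articles}")
```

Under this convention the three journals with one article each occupy ranks 5 to 7 with different R(n) values, which is exactly the kind of ambiguity at tied ranks that a standard ranking convention would need to settle.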
Abstract
‘Language measures’ such as Swets's E or Brookes's S, which measure the separation of the PMFs defined by a weighting formula applied to the sets of relevant and non‐relevant documents, are different in kind to probabilistic ‘system measures’ such as Precision (P) and Recall (R). For a given query and collection the subset of {P} × {R} defined by varying the threshold is unaffected by monotonic transformations of the weighting formula used. If S is redefined so as to relate only to ranked probabilities it will retain its value, and so reflect this constancy in the graph, under such transformations. S, as redefined, is also an (approximate) indicator of a rule to retrieve documents with weights greater [less] than the threshold when S is positive [negative]. Language measures can also be used to determine the retrieval algorithm when multivariate weights are used.
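The invariance claim can be checked numerically. The sketch below, offered under stated assumptions rather than as the paper's procedure, computes the set of (Precision, Recall) pairs obtained by varying the threshold over a small hypothetical collection, then repeats the computation after applying a monotonically increasing transformation (here `math.exp`) to the weights. The weights, the relevance flags and the function name `pr_points` are illustrative. For a decreasing transformation the retrieval rule would have to be reversed to retrieve weights below the threshold.

```python
import math


def pr_points(weights, relevant_flags):
    """Return the set of (precision, recall) pairs reachable by varying the threshold.

    A document is retrieved when its weight is at least the threshold.
    """
    total_relevant = sum(relevant_flags)
    points = set()
    for threshold in set(weights):
        retrieved = [rel for w, rel in zip(weights, relevant_flags) if w >= threshold]
        hits = sum(retrieved)
        points.add((hits / len(retrieved), hits / total_relevant))
    return points


# Hypothetical weights and relevance judgements (1 = relevant) for six documents.
weights = [0.2, 1.5, 0.7, 3.1, 0.7, 2.4]
relevant_flags = [0, 1, 0, 1, 1, 0]

# A monotonically increasing transformation preserves the ranking of documents.
transformed = [math.exp(w) for w in weights]
same = pr_points(weights, relevant_flags) == pr_points(transformed, relevant_flags)
print(same)
```

The two sets coincide because the transformation preserves the order of the weights, so each threshold on the original scale has a counterpart on the transformed scale that retrieves the same documents.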
Abstract
The dispersion or ‘scatter’ of documents over some set of values of a document attribute is usually described by means of a frequency distribution. When the attribute is qualitative an order distribution can be defined, as in the usual descriptions of Bradford's law. A more succinct description is offered by an order statistic, such as Singleton's index. A novel order statistic, the ‘adapted Gini index’, is introduced and related to the conventional form of Bradford's law. Some simple properties of it are described. An alternative index of dispersion, not an order statistic, based on the relative entropy of the frequency distribution is also defined. For sets of bibliographies such indices themselves have distributions, and it is suggested that, in particular, the distribution pertaining to an indexed data base provides an objective characterization of the data base in so far as indexing terms have been applied to the items in it. A variety of experimental data is reported. This includes the distribution of two indices for samples of bibliographies taken from British Technology Index and Index Medicus, and studies of the variation of the indices with time when the attribute is that of journal title. Whether a new area of knowledge becomes less or more dispersed in its journals as it progresses depends in part on which index is chosen to represent the dispersion, and on whether a series of cumulative or cross‐section bibliographies is chosen.
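The two kinds of index mentioned here can be illustrated with a short sketch. The abstract does not define the ‘adapted Gini index’, so the code uses the conventional Gini index as a stand-in and makes no claim to reproduce the adapted form. Likewise, ‘relative entropy’ is assumed here to mean the Shannon entropy of the frequency distribution divided by its maximum possible value. The function names and the example counts are hypothetical.

```python
import math


def gini_index(counts):
    """Conventional Gini index of a frequency distribution (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def relative_entropy(counts):
    """Shannon entropy divided by its maximum, log(n); assumes at least two classes."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


# Hypothetical counts of documents per journal for two small bibliographies.
even = [5, 5, 5, 5]           # documents spread evenly over four journals
concentrated = [0, 0, 0, 20]  # all documents in a single journal

print(gini_index(even), relative_entropy(even))
print(gini_index(concentrated), relative_entropy(concentrated))
```

The two indices move in opposite directions: the Gini index grows as documents concentrate in few journals, while the normalised entropy grows as they spread out evenly, so the choice of index affects how ‘dispersion’ is read.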
Abstract
This paper is concerned with recent work in the theory of information retrieval. More particularly, it is concerned with theories which tackle the problem of retrieval performance, in a sense which will be explained. The aim is not an exhaustive survey of such work; rather it is an analysis and synthesis of those contributions which I feel to be important or find interesting.
BRIAN VICKERY and ALINA VICKERY
Abstract
There is a huge amount of information and data stored in publicly available online databases that consist of large text files accessed by Boolean search techniques. It is widely held that less use is made of these databases than could or should be the case, and that one reason for this is that potential users find it difficult to identify which databases to search, to use the various command languages of the hosts and to construct the Boolean search statements required. This reasoning has stimulated a considerable amount of exploration and development work on the construction of search interfaces, to aid the inexperienced user to gain effective access to these databases. The aim of our paper is to review aspects of the design of such interfaces: to indicate the requirements that must be met if maximum aid is to be offered to the inexperienced searcher; to spell out the knowledge that must be incorporated in an interface if such aid is to be given; to describe some of the solutions that have been implemented in experimental and operational interfaces; and to discuss some of the problems encountered. The paper closes with an extensive bibliography of references relevant to online search aids, going well beyond the items explicitly mentioned in the text. An index to software appears after the bibliography at the end of the paper.
Valery J. Frants, Jacob Shapiro and Vladimir G. Voiskunskii
Abstract
A simple notation for describing the internal structure of a document is presented, and contrasted with other, more conventional notations for describing documents, in particular those related to subject‐classification systems and document description for bibliographic purposes, as well as with document metalanguage codes such as those of SGML. It is suggested that such a notation should assist the science of human messaging through (1) permitting hypotheses concerning document structure to be more readily expressed and/or tested, and (2) facilitating the formation of taxonomies of documents based on their structures. Such a notation should also be of practical value in contributing to the processes of document specification, building and testing, and possibly also contribute to new generations of IR systems which link retrieval against record databases to the search systems internal to specific documents. It is suggested that, following formative criticism, professional standards for describing document structure should be sought based on the notation. The notation is at present limited to linear documents, but extensions to accommodate documents in non‐linear form (e.g. hypertext documents) and/or existing in physically distributed form could usefully be constructed. Examples of the application of the notation are provided.