A new and promising approach to document clustering uses previously formed clusters of queries to cluster documents. To employ this approach in practice, a similarity measure for queries must be available. This requirement poses no problem for information retrieval systems in which both the search request formulations and the document representations are sets of weighted or unweighted index terms. In most operational retrieval systems, however, search request formulations are Boolean combinations of index terms. Research into similarity measures for search request formulations of this type has already been undertaken by the author and reported elsewhere. The present paper provides further results of investigations in this area. The novelty of the approach discussed is the incorporation, within the methodology described earlier, of a weighting mechanism that indicates the relative importance of particular attributes of a given Boolean search request formulation. The suggested modification is based on the standard probabilistic approach to information retrieval.
The need for an information retrieval technique that retains the appeal of Boolean retrieval schemes while also providing the advantages of a ranked search output has been pointed out in the literature for many years. However, a previous attempt to incorporate into Boolean retrieval schemes a weighting mechanism that produces ranked lists of documents has not been fully successful. Specifically, further research has demonstrated that the theory behind the previous approach suffers from disturbing ambiguities and inconsistencies, with equivalent Boolean search request formulations yielding different rankings of the documents retrieved. As a result of this more recent research, an alternative approach has been outlined. A closer analysis of this second approach reveals, however, that it is also not free from intrinsic weaknesses. The present paper provides the results of this new analysis and suggests a more rigorous methodology.