Framework for entity extraction with verification: application to inference of data set usage in research publications

Svetlozar Nestorov (Department of Information Systems and Supply Chain Management, Loyola University Chicago, Chicago, Illinois, USA)
Dinko Bačić (Department of Information Systems and Supply Chain Management, Loyola University Chicago, Chicago, Illinois, USA)
Nenad Jukić (Department of Information Systems and Supply Chain Management, Loyola University Chicago, Chicago, Illinois, USA)
Mary Malliaris (Department of Information Systems and Supply Chain Management, Loyola University Chicago, Chicago, Illinois, USA)

The Electronic Library

ISSN: 0264-0473

Article publication date: 27 July 2022

Issue publication date: 8 August 2022

Abstract

Purpose

The purpose of this paper is to propose an extensible framework for extracting data set usage from research articles.

Design/methodology/approach

The framework uses a training set of manually labeled examples to identify word features surrounding data set usage references. Using the word features and general entity identifiers, candidate data sets are extracted and scored separately at the sentence and document levels. Finally, the extracted data set references can be verified by the authors using a web-based verification module.
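The extraction and scoring steps described above might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the cue words and their weights, the capitalized-phrase pattern standing in for a general entity identifier, and the rule of summing weights per sentence and taking the maximum per document are all hypothetical. The author-verification step is omitted.

```python
import re
from collections import defaultdict

# Hypothetical word features. In the framework these would be identified from
# manually labeled examples; the words and weights here are invented.
CUE_WEIGHTS = {"data": 1.0, "dataset": 1.0, "survey": 0.8, "using": 0.5, "from": 0.3}

# Assumed stand-in for a general entity identifier: a run of two or more
# capitalized words, or an acronym of three or more capital letters.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]+(?: [A-Z][a-z]+)+|[A-Z]{3,})\b")


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_score(sentence):
    """Sum the weights of the cue words in the sentence (assumed scoring rule)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(CUE_WEIGHTS.get(w, 0.0) for w in words)


def document_scores(text):
    """Score each candidate by its best-scoring sentence (assumed aggregation)."""
    scores = defaultdict(float)
    for sentence in split_sentences(text):
        s = sentence_score(sentence)
        for match in CANDIDATE.finditer(sentence):
            name = match.group()
            scores[name] = max(scores[name], s)
    return dict(scores)
```

Calling `document_scores` on the full text of an article returns a mapping from candidate names to scores. In a setup like the one described, the top-ranked candidates would then be passed to the authors for verification.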

Findings

This paper addresses a significant gap in the entity extraction literature by focusing on data set extraction. In the process, this paper: identified an entity-extraction scenario with specific characteristics that enable a multiphase approach, including a feasible author-verification step; defined the search space for word feature identification; defined scoring functions for sentences and documents; and designed a simple web-based author-verification step. The framework was successfully tested on 178 articles authored by researchers from a large research organization.

Originality/value

Whereas previous approaches focused on completely automated large-scale entity recognition from text snippets, the proposed framework is designed for longer, high-quality texts, such as research publications. The framework includes a verification module that enables requesting validation of the discovered entities from the authors of the research publications. This module shares some similarities with general crowdsourcing approaches, but the target scenario increases the likelihood of meaningful author participation.

Acknowledgements

This work was supported by an Alfred P. Sloan Foundation grant (G201514139) in the EERE initiative.

Citation

Nestorov, S., Bačić, D., Jukić, N. and Malliaris, M. (2022), "Framework for entity extraction with verification: application to inference of data set usage in research publications", The Electronic Library, Vol. 40 No. 4, pp. 453-471. https://doi.org/10.1108/EL-03-2022-0071

Publisher

Emerald Publishing Limited

Copyright © 2022, Emerald Publishing Limited
