Search results

1 – 2 of 2
Per page
102050
Citations:
Loading...
Access Restricted. View access options
Article
Publication date: 16 October 2018

Rajeswari S. and Sai Baba Magapu

The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document…

276

Abstract

Purpose

The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document without manual intervention.

Design/methodology/approach

For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. OCR is a well-established technology, so to develop the OCR, Microsoft Office document imaging tools were used. To account for the commonly encountered problem of skew being introduced, a method to detect and correct the skew introduced in the scanned documents was developed and integrated with the tool. The OCR tool was customized to build keywords and key phrases corpus for every document.

Findings

The developed tool was evaluated using a 100 document corpus to test the various properties of OCR. The tool had above 99 per cent word read accuracy for text only image documents. The customization of the OCR was tested with samples of Microfiches, sample of Journal pages from back volumes and samples from newspaper clips and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.

Social implications

The scanned documents are converted to keywords and key phrases corpus. The tool could be used to build metadata for scanned documents without manual intervention.

Originality/value

The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into keywords, and key phrases database). In addition, the image document is converted to editable and searchable document.

Details

The Electronic Library, vol. 36 no. 5
Type: Research Article
ISSN: 0264-0473

Keywords

Access Restricted. View access options
Article
Publication date: 30 November 2018

Sudarsana Desul, Madurai Meenachi N., Thejas Venkatesh, Vijitha Gunta, Gowtham R. and Magapu Sai Baba

Ontology of a domain mainly consists of a set of concepts and their semantic relations. It is typically constructed and maintained by using ontology editors with substantial human…

417

Abstract

Purpose

Ontology of a domain mainly consists of a set of concepts and their semantic relations. It is typically constructed and maintained by using ontology editors with substantial human intervention. It is desirable to perform the task automatically, which has led to the development of ontology learning techniques. One of the main challenges of ontology learning from the text is to identify key concepts from the documents. A wide range of techniques for key concept extraction have been proposed but are having the limitations of low accuracy, poor performance, not so flexible and applicability to a specific domain. The propose of this study is to explore a new method to extract key concepts and to apply them to literature in the nuclear domain.

Design/methodology/approach

In this article, a novel method for key concept extraction is proposed and applied to the documents from the nuclear domain. A hybrid approach was used, which includes a combination of domain, syntactic name entity knowledge and statistical based methods. The performance of the developed method has been evaluated from the data obtained using two out of three voting logic from three domain experts by using 120 documents retrieved from SCOPUS database.

Findings

The work reported pertains to extracting concepts from the set of selected documents and aids the search for documents relating to given concepts. The results of a case study indicated that the method developed has demonstrated better metrics than Text2Onto and CFinder. The method described has the capability of extracting valid key concepts from a set of candidates with long phrases.

Research limitations/implications

The present study is restricted to literature coming out in the English language and applied to the documents from nuclear domain. It has the potential to extend to other domains also.

Practical implications

The work carried out in the current study has the potential of leading to updating International Nuclear Information System thesaurus for ontology in the nuclear domain. This can lead to efficient search methods.

Originality/value

This work is the first attempt to automatically extract key concepts from the nuclear documents. The proposed approach will address and fix the most of the problems that are existed in the current methods and thereby increase the performance.

Details

The Electronic Library, vol. 37 no. 1
Type: Research Article
ISSN: 0264-0473

Keywords

1 – 2 of 2
Per page
102050