James Powell, Linn Collins, Ariane Eberhardt, David Izraelevitz, Jorge Roman, Thomas Dufresne, Mark Scott, Miriam Blake and Gary Grider
Abstract
Purpose
The purpose of this paper is to describe a process for extracting and matching author names from large collections of bibliographic metadata using the Hadoop implementation of MapReduce. It considers the challenges and risks associated with name matching at such a large scale and proposes simple matching heuristics for the reduce process. The resulting semantic graphs of authors link names to publications, and include additional features such as phonetic representations of author last names. The authors believe that this achieves an appropriate level of matching at scale, and enables further matching to be performed with graph analysis tools.
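The paper does not specify which phonetic encoding was used for last names. As one possibility, the sketch below attaches a Soundex key (here computed with the Apache Commons Codec library) to an author's last name so that spelling variants share a graph feature; the class and method names are illustrative assumptions, not the authors' code.

    import org.apache.commons.codec.language.Soundex;

    // Hypothetical illustration: derive a phonetic key for a last name so
    // that spelling variants ("Powell" / "Powel") carry the same feature
    // in the semantic graph. The choice of Soundex is an assumption; the
    // paper does not name the encoding it used.
    public class PhoneticKey {
        private static final Soundex SOUNDEX = new Soundex();

        public static String lastNameKey(String lastName) {
            return SOUNDEX.encode(lastName);  // e.g. "Powell" -> "P400"
        }

        public static void main(String[] args) {
            System.out.println(lastNameKey("Powell"));  // P400
            System.out.println(lastNameKey("Powel"));   // P400 (same key)
        }
    }

A key like this can be computed once per record during the map phase and carried as just another attribute of the author node.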
Design/methodology/approach
A topically‐focused collection of metadata records describing peer‐reviewed papers was generated from the results of a search. The matching records were harvested and stored in the Hadoop Distributed File System (HDFS) for processing by Hadoop. A MapReduce job was written to perform coarse‐grained author name matching, linking multiple papers to an author when the names were very similar or identical. Semantic graphs were then generated so that they could be analyzed to perform finer‐grained matching, for example by using other metadata such as subject headings.
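A minimal sketch of such a job follows, assuming one tab‐separated "recordId, authorName" pair per input line and a lower‐cased "lastname,first‐initial" normalization. The input layout, normalization rule, and class names are assumptions for illustration, not the authors' actual code.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: read one "recordId<TAB>authorName" line per metadata
    // record (a hypothetical input layout) and emit a normalized name as
    // the key, so all candidate matches for that name reach one reducer.
    public class CoarseNameMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            if (fields.length < 2) {
                return;  // skip malformed records rather than fail the job
            }
            context.write(new Text(normalize(fields[1])), new Text(fields[0]));
        }

        // Assumed normalization: lower-cased "lastname,first-initial".
        static String normalize(String name) {
            String[] parts = name.toLowerCase().split(",", 2);
            String last = parts[0].trim();
            String first = parts.length > 1 ? parts[1].trim() : "";
            String initial = first.isEmpty() ? "" : first.substring(0, 1);
            return last + "," + initial;
        }
    }

    // Reduce phase: every record id that arrived under the same normalized
    // name is linked to a single author node; names that only *might* match
    // are left apart here for later, finer-grained graph-based resolution.
    class CoarseNameReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text name, Iterable<Text> recordIds,
                Context context) throws IOException, InterruptedException {
            StringBuilder pubs = new StringBuilder();
            for (Text id : recordIds) {
                if (pubs.length() > 0) {
                    pubs.append(',');
                }
                pubs.append(id);
            }
            context.write(name, new Text(pubs.toString()));
        }
    }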
Findings
When performing author name matching at scale using MapReduce, the heuristics that determine whether names match should be limited to the rules that yield the most reliable results; a bad rule produces errors that are multiplied at scale. MapReduce can also be used to generate or extract other data that can help resolve similar names when stricter rules fail to do so. The authors also found that matching is more reliable within a well‐defined topic domain.
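As an illustration of what a deliberately strict rule might look like (the rule and names below are hypothetical, not the paper's exact heuristics): merge two names only when the last names are identical and the first names do not conflict.

    // Illustrative strict heuristic: treat two names as the same author
    // only when the last names are identical and the first names are
    // compatible (equal, or one is simply the initial of the other).
    // Weaker evidence is deliberately ignored and deferred to graph
    // analysis, since a lax rule multiplies errors at scale.
    public class StrictMatch {
        public static boolean sameAuthor(String a, String b) {
            String[] pa = split(a), pb = split(b);
            if (!pa[0].equals(pb[0])) {
                return false;        // different last names: never merge
            }
            String fa = pa[1], fb = pb[1];
            if (fa.equals(fb)) {
                return true;         // identical first names
            }
            // One side is a bare initial that agrees with the other side.
            return (fa.length() == 1 || fb.length() == 1)
                    && !fa.isEmpty() && !fb.isEmpty()
                    && fa.charAt(0) == fb.charAt(0);
        }

        // Assumed "Last, First" layout; returns {last, first} lower-cased.
        private static String[] split(String name) {
            String[] parts = name.toLowerCase().split(",", 2);
            String last = parts[0].trim();
            String first = parts.length > 1 ? parts[1].trim() : "";
            return new String[] { last, first };
        }

        public static void main(String[] args) {
            System.out.println(sameAuthor("Powell, James", "Powell, J"));     // true
            System.out.println(sameAuthor("Powell, James", "Powell, John"));  // false
        }
    }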
Originality/value
Libraries face some of the same big data challenges found in data‐driven science. Big data tools such as Hadoop can be used to explore large metadata collections, and these collections can serve as surrogates for other real‐world big data problems. MapReduce activities need to be appropriately scoped to yield good results, and code defects must be watched for, since their effects are magnified in the output of a MapReduce job.