James Powell, Linn Collins, Ariane Eberhardt, David Izraelevitz, Jorge Roman, Thomas Dufresne, Mark Scott, Miriam Blake and Gary Grider
Abstract
Purpose
The purpose of this paper is to describe a process for extracting and matching author names from large collections of bibliographic metadata using the Hadoop implementation of MapReduce. It considers the challenges and risks associated with name matching at such a large scale and proposes simple matching heuristics for the reduce process. The resulting semantic graphs of authors link names to publications, and include additional features such as phonetic representations of author last names. The authors believe that this achieves an appropriate level of matching at scale, and enables further matching to be performed with graph analysis tools.
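The paper does not specify which phonetic encoding was used for last names. As one possibility, the sketch below attaches a Soundex key (here computed with the Apache Commons Codec library) to an author's last name so that spelling variants share a graph feature; the class and method names are illustrative assumptions, not the authors' code.

    import org.apache.commons.codec.language.Soundex;

    // Hypothetical illustration: derive a phonetic key for a last name so
    // that spelling variants ("Powell" / "Powel") carry the same feature
    // in the semantic graph. The choice of Soundex is an assumption; the
    // paper does not name the encoding it used.
    public class PhoneticKey {
        private static final Soundex SOUNDEX = new Soundex();

        public static String lastNameKey(String lastName) {
            return SOUNDEX.encode(lastName);  // e.g. "Powell" -> "P400"
        }

        public static void main(String[] args) {
            System.out.println(lastNameKey("Powell"));  // P400
            System.out.println(lastNameKey("Powel"));   // P400 (same key)
        }
    }

A key like this can be computed once per record during the map phase and carried as just another attribute of the author node.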
Design/methodology/approach
A topically‐focused collection of metadata records describing peer‐reviewed papers was generated from the results of a search. The matching records were harvested and stored in the Hadoop Distributed File System (HDFS) for processing by Hadoop. A MapReduce job was written to perform coarse‐grained author name matching, linking multiple papers to an author when the names were very similar or identical. Semantic graphs were then generated so that they could be analyzed to perform finer‐grained matching, for example by using other metadata such as subject headings.
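A minimal sketch of such a job follows, assuming one tab‐separated "recordId, authorName" pair per input line and a lower‐cased "lastname,first‐initial" normalization. The input layout, normalization rule, and class names are assumptions for illustration, not the authors' actual code.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: read one "recordId<TAB>authorName" line per metadata
    // record (a hypothetical input layout) and emit a normalized name as
    // the key, so all candidate matches for that name reach one reducer.
    public class CoarseNameMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            if (fields.length < 2) {
                return;  // skip malformed records rather than fail the job
            }
            context.write(new Text(normalize(fields[1])), new Text(fields[0]));
        }

        // Assumed normalization: lower-cased "lastname,first-initial".
        static String normalize(String name) {
            String[] parts = name.toLowerCase().split(",", 2);
            String last = parts[0].trim();
            String first = parts.length > 1 ? parts[1].trim() : "";
            String initial = first.isEmpty() ? "" : first.substring(0, 1);
            return last + "," + initial;
        }
    }

    // Reduce phase: every record id that arrived under the same normalized
    // name is linked to a single author node; names that only *might* match
    // are left apart here for later, finer-grained graph-based resolution.
    class CoarseNameReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text name, Iterable<Text> recordIds,
                Context context) throws IOException, InterruptedException {
            StringBuilder pubs = new StringBuilder();
            for (Text id : recordIds) {
                if (pubs.length() > 0) {
                    pubs.append(',');
                }
                pubs.append(id);
            }
            context.write(name, new Text(pubs.toString()));
        }
    }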
Findings
When performing author name matching at scale using MapReduce, the heuristics that determine whether names match should be limited to the rules that yield the most reliable results; a bad rule produces errors that are multiplied at scale. MapReduce can also be used to generate or extract other data that can help resolve similar names when stricter rules fail to do so. The authors also found that matching is more reliable within a well‐defined topic domain.
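As an illustration of what a deliberately strict rule might look like (the rule and names below are hypothetical, not the paper's exact heuristics): merge two names only when the last names are identical and the first names do not conflict.

    // Illustrative strict heuristic: treat two names as the same author
    // only when the last names are identical and the first names are
    // compatible (equal, or one is simply the initial of the other).
    // Weaker evidence is deliberately ignored and deferred to graph
    // analysis, since a lax rule multiplies errors at scale.
    public class StrictMatch {
        public static boolean sameAuthor(String a, String b) {
            String[] pa = split(a), pb = split(b);
            if (!pa[0].equals(pb[0])) {
                return false;        // different last names: never merge
            }
            String fa = pa[1], fb = pb[1];
            if (fa.equals(fb)) {
                return true;         // identical first names
            }
            // One side is a bare initial that agrees with the other side.
            return (fa.length() == 1 || fb.length() == 1)
                    && !fa.isEmpty() && !fb.isEmpty()
                    && fa.charAt(0) == fb.charAt(0);
        }

        // Assumed "Last, First" layout; returns {last, first} lower-cased.
        private static String[] split(String name) {
            String[] parts = name.toLowerCase().split(",", 2);
            String last = parts[0].trim();
            String first = parts.length > 1 ? parts[1].trim() : "";
            return new String[] { last, first };
        }

        public static void main(String[] args) {
            System.out.println(sameAuthor("Powell, James", "Powell, J"));     // true
            System.out.println(sameAuthor("Powell, James", "Powell, John"));  // false
        }
    }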
Originality/value
Libraries face some of the same big data challenges found in data‐driven science. Big data tools such as Hadoop can be used to explore large metadata collections, and these collections can serve as surrogates for other real‐world big data problems. MapReduce activities need to be appropriately scoped to yield good results, and code defects must be watched for, since their effects are magnified in the output of a MapReduce job.