A three-phase MapReduce-based algorithm for searching biomedical document databases

Milana Grbić

Abstract


Retrieving information from large document databases has been a focus of scientific research in recent years. This paper presents a parallel algorithm, based on the MapReduce technique, for searching biomedical documents. The algorithm consists of three phases: preprocessing, document representation, and searching. In the first phase, lemmatization and stop-word elimination are performed. In the second phase, each document is represented as a list of pairs (word, tf-idf index of the word). The third phase is the main search procedure. It uses a specially designed ranking criterion that combines the term frequency-inverse document frequency (tf-idf) index with an indicator function for each query word. Four versions of the ranking criterion are proposed and analyzed. The algorithm's performance is tested on different subsets of the large, well-known PubMed biomedical document database. The experimental results indicate that the proposed parallel algorithm finds high-quality results in a reasonable time. Compared with the sequential variant of the algorithm, the parallel algorithm is more efficient: it finds high-quality solutions in significantly less time.
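
The three phases can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the stop-word list, the function names (`preprocess`, `tf_idf_index`, `score`, `search`), the weight `alpha`, and the additive form of the ranking criterion are all assumptions made for illustration. Lemmatization is omitted, and Python's built-in `map` and `sorted` stand in for the distributed map and reduce steps.

```python
import math
from collections import Counter

# Illustrative stop-word subset (assumption, not the list used in the paper).
STOP_WORDS = {"the", "of", "in", "a", "and", "is"}

def preprocess(text):
    """Phase 1: lowercase, split on whitespace, drop stop words.
    Lemmatization is omitted in this sketch."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_index(docs):
    """Phase 2: represent each document as {word: tf-idf value}."""
    tokenized = {doc_id: preprocess(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()  # number of documents containing each word
    for words in tokenized.values():
        df.update(set(words))
    index = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        index[doc_id] = {
            w: (c / len(words)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
    return index

def score(doc_vector, query_words, alpha=1.0):
    """Phase 3 (assumed criterion): for each query word, add its tf-idf
    value plus alpha times an indicator of its presence in the document."""
    return sum(doc_vector.get(w, 0.0) + alpha * (w in doc_vector)
               for w in query_words)

def search(index, query, top_k=3):
    """Score every document independently (map), then rank (reduce)."""
    query_words = preprocess(query)
    scored = map(lambda item: (item[0], score(item[1], query_words)),
                 index.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy corpus (hypothetical documents, not PubMed data).
docs = {
    "d1": "insulin regulates glucose in the blood",
    "d2": "glucose metabolism and insulin resistance",
    "d3": "protein folding in the cell",
}
index = tf_idf_index(docs)
results = search(index, "insulin resistance")
```

In this sketch the indicator term rewards a document for each distinct query word it contains, so a document that matches both query words ranks above one that matches only one of them, even when their tf-idf values are close.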

Full text:

PDF


DOI: http://dx.doi.org/10.7251/IJEEC1901001G



e-ISSN: 2566-3682
UDC: 621.3:004
Publication frequency: twice a year (June, December)
University of East Sarajevo, Faculty of Electrical Engineering
Vuka Karadžića 30, 71123 East Sarajevo, Republic of Srpska, Bosnia and Herzegovina
Phone/Fax: +38757342788
Web: http://www.ijeec.org
E-mail: ijeec.journal@gmail.com