Lucene Search Engine Development: A Beginner’s Experience

  • Azilawati Azizan, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, Perak, Malaysia
  • Najwa Izzah Najihah Mohd Sanusi, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, Perak, Malaysia
  • Nurkhairizan Khairuddin, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, Perak, Malaysia
  • Ana Salwa Shafie, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Pahang Branch Jengka Campus, Pahang, Malaysia

Abstract

Lucene provides a basic library package for building a complete text-based search engine. It can be used in various ways to benefit both researchers and users. However, a beginner who wants to create a search engine with Lucene needs a thorough understanding of its procedures and library packages. This project therefore explores and demonstrates the development of a search engine, using the Malay translation of the Quran as the test dataset. The project applied the fundamental Information Retrieval (IR) model as its main methodology. The Apache Lucene framework, a full-text search engine library written in Java, was used to construct all of the search engine's components, namely the indexer, searcher, query processor, and ranker. The resulting search engine was then evaluated using standard IR measures, achieving 67% precision and 32% recall. This paper provides a basic approach to developing a text-based search engine that can be used for any IR testing purpose. The results of this project may also help the IR community compare retrieval performance.
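
For readers new to the evaluation step mentioned above, the following is a minimal sketch of how precision and recall can be computed for a single query. It is not the authors' evaluation code: the class name `PrecisionRecall`, the method names, and the document identifiers are hypothetical, and the sample data is invented purely to illustrate the arithmetic.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names and data are assumptions, not taken from the paper.
public final class PrecisionRecall {

    // Number of retrieved documents that are also judged relevant.
    public static int relevantRetrieved(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant); // set intersection
        return hits.size();
    }

    // Precision = |retrieved AND relevant| / |retrieved| (0 when nothing was retrieved).
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    // Recall = |retrieved AND relevant| / |relevant| (0 when no document is relevant).
    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Hypothetical result list and relevance judgments for one query.
        Set<String> retrieved = Set.of("doc1", "doc2", "doc3", "doc4");
        Set<String> relevant = Set.of("doc1", "doc2", "doc4", "doc7", "doc9");

        System.out.println("Precision: " + precision(retrieved, relevant));
        System.out.println("Recall: " + recall(retrieved, relevant));
    }
}
```

In this invented example, three of the four retrieved documents are relevant (precision 3/4), and three of the five relevant documents were retrieved (recall 3/5). A full evaluation would typically average such values over a set of test queries.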

Published
2022-11-15
How to Cite
AZIZAN, Azilawati et al. Lucene Search Engine Development: A Beginner’s Experience. Mathematical Sciences and Informatics Journal, [S.l.], v. 3, n. 2, p. 80-92, nov. 2022. ISSN 2735-0703. Available at: <https://myjms.mohe.gov.my/index.php/mij/article/view/20290>. Date accessed: 15 sep. 2024. doi: https://doi.org/10.24191/mij.v3i2.20290.
Section
Articles
