Lucene Search Engine Development: A Beginner’s Experience
Abstract
Lucene provides a basic library package for building a complete text-based search engine. It can be used in various ways to benefit both researchers and users. However, for a beginner, creating a search engine with Lucene requires a thorough understanding of its procedures and library packages. Therefore, this project explores and demonstrates the development of a search engine, using the Malay Quran translation text as the test dataset. The project applied the fundamental Information Retrieval (IR) model as the main methodology for developing the search engine. The Apache Lucene framework, a full-text search engine library written in Java, was used to construct all of the search engine components, namely the indexer, searcher, query processor, and ranker. The developed search engine was then evaluated using standard IR measures, achieving 67% precision and 32% recall. This paper provides a basic approach to developing a text-based search engine that can be used for any IR testing purpose. The results of this project may also benefit the IR community as a baseline for comparing retrieval performance.
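To make the evaluation measures concrete, the following is a minimal sketch of how precision and recall can be computed for a single query. It is not the project's evaluation code: the class name `RetrievalMetrics`, the document identifiers, and the relevance judgments are all hypothetical and chosen only for illustration. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of Lucene or of the project described above.
public class RetrievalMetrics {

    // Counts how many retrieved documents appear in the relevant set.
    // Assumes the retrieved list contains no duplicate identifiers.
    private static long hits(List<String> retrieved, Set<String> relevant) {
        return retrieved.stream().filter(relevant::contains).count();
    }

    // Precision = relevant retrieved / total retrieved (0.0 if nothing was retrieved).
    public static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) hits(retrieved, relevant) / retrieved.size();
    }

    // Recall = relevant retrieved / total relevant (0.0 if no document is relevant).
    public static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up result list and relevance judgments for one query.
        List<String> retrieved = List.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d1", "d2", "d4", "d7", "d9");

        System.out.printf("precision = %.2f%n", precision(retrieved, relevant));
        System.out.printf("recall    = %.2f%n", recall(retrieved, relevant));
    }
}
```

In this made-up example, three of the four retrieved documents are relevant and three of the five relevant documents are retrieved. A system-level figure such as the one reported above would typically be obtained by averaging these per-query values over a set of test queries, although the exact averaging procedure is not stated here.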