Topic Modeling for Web Page using LDA Algorithm and Web Content Mining: Testing and Evaluation

Authors

  • Noor Muneam Abbas, University of Information Technology and Communications
  • Raad Mahmood Mohammed
  • Hasan Aqeel Abbood
  • Mohammed Ali Mohammed, University of Information Technology and Communications

Keywords

Topic Modeling, Web Content Mining, MRH dataset, LDA algorithm

Abstract

In recent years, website content has grown rapidly and become increasingly useful, and this information plays an important role in discovering knowledge on the web. This paper tests and evaluates our previous work on a new dataset. The previous system applied the Latent Dirichlet Allocation (LDA) algorithm for topic modeling in web content mining, and was tested and discussed in terms of different science content, a large dataset, and similarity values. Based on the results on our new dataset (298 rows and 6 columns, covering computer science, mathematics, physics, and chemistry), the results support the LDA algorithm as the best choice for this web content mining dataset.
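
As a rough illustration of the kind of pipeline the abstract describes, the following sketch fits LDA to a handful of tokenized documents and compares their topic mixtures. It is not the authors' implementation: the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, the toy documents, the function names, and the use of cosine similarity as the similarity value are all assumptions made for illustration.

```python
import math
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, hypothetical helper).

    docs is a list of token lists. Returns (theta, phi, vocab), where theta
    holds per-document topic mixtures and phi per-topic word distributions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy, made-up documents standing in for scraped web pages.
toy_docs = [
    "algorithm data program code algorithm".split(),
    "program code compiler data".split(),
    "acid base reaction molecule acid".split(),
    "molecule reaction bond base".split(),
]

theta, phi, vocab = lda_gibbs(toy_docs, n_topics=2)
print("similarity(doc 0, doc 1):", cosine_similarity(theta[0], theta[1]))
print("similarity(doc 0, doc 2):", cosine_similarity(theta[0], theta[2]))
```

Each row of `theta` is a topic mixture for one document, so comparing rows gives one possible similarity value between pages. On a real dataset the token lists would come from extracted page text after cleaning and stop-word removal, and a library implementation of LDA would usually replace this hand-written sampler.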

Author Biographies

  • Noor Muneam Abbas, University of Information Technology and Communications

    Computer Science Department, University of Technology, Baghdad, Iraq.

  • Raad Mahmood Mohammed

    College of Business Informatics, University of Information Technology and Communications (UOITC), Baghdad, Iraq.

  • Hasan Aqeel Abbood

    College of Business Informatics, University of Information Technology and Communications (UOITC), Baghdad, Iraq.

  • Mohammed Ali Mohammed, University of Information Technology and Communications

    College of Business Informatics, University of Information Technology and Communications (UOITC), Baghdad, Iraq.

Published

2025-07-22

Section

Articles

How to Cite

Noor Muneam Abbas, Raad Mahmood Mohammed, Hasan Aqeel Abbood, & Mohammed Ali Mohammed. (2025). Topic Modeling for Web Page using LDA Algorithm and Web Content Mining: Testing and Evaluation. International Journal of Computer (IJC), 55(1), 117-129. https://www.ijcjournal.org/InternationalJournalOfComputer/article/view/2405