Paradigms of ELT Data Pipeline Architectures for LLM Training
Keywords:
ELT pipeline, large language models, data architecture, data transformation, agent-oriented systems, LLM infrastructure, multimodal data, dynamic routing, cognitive processing, system scalability
Abstract
This article presents a systematic analysis of ELT pipeline architectures used in the training of large language models. The study takes an interdisciplinary approach that integrates engineering principles of data infrastructure design, the theoretical foundations of transformer architectures, and data-flow automation practices under conditions of high source variability. Particular attention is given to a content analysis of scientific and applied publications addressing the role of LLMs in transformation loops, the implementation of agent-oriented solutions, and the support of multimodal adaptive pipelines. Key ELT architecture types are identified, including prompt-driven, agent-based, high-throughput, and cognitively enhanced solutions, reflecting varying degrees of model involvement in data processing. The analysis shows that architectural shifts toward feedback integration and dynamic routing enable robust, adaptive solutions suited to contemporary training scenarios. Special emphasis is placed on data-stream instability, the lack of benchmarks for agent-based systems, and the insufficient integration of pipelines with model evaluation mechanisms. The paper proposes a conceptual classification of ELT paradigms and outlines their adaptive evolution toward scalable and logically coherent infrastructures. The article will be of interest to researchers in machine learning systems, LLM infrastructure developers, data platform architects, and professionals working on the digitalization and automation of AI training workflows.
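To make the abstract's notion of dynamic routing in an LLM-assisted transform step concrete, the minimal Python sketch below shows a prompt-driven transformation that selects a prompt per source type and gates output through a feedback check. It is illustrative only and not drawn from the article: all names (Record, ROUTES, call_llm, feedback_ok) are hypothetical, and call_llm is a stub standing in for any real model client.

```python
# Minimal sketch of a prompt-driven ELT transform step with dynamic routing.
# Illustrative only: every name here is hypothetical, and call_llm is a stub
# standing in for a real model client.
from dataclasses import dataclass


@dataclass
class Record:
    source: str   # e.g. "web" or "code", set at extract/load time
    payload: str  # raw loaded text awaiting in-warehouse transformation


def call_llm(prompt: str) -> str:
    """Stub for an LLM call; swap in an actual client in practice."""
    return prompt.upper()  # placeholder "transformation"


# Dynamic routing table: the transformation prompt is chosen per record
# rather than fixed once for the whole pipeline, which is how the sketch
# models adaptation to high source variability.
ROUTES = {
    "web": "Strip boilerplate and normalize this web text:\n{payload}",
    "code": "Summarize this code in one docstring line:\n{payload}",
}


def transform(record: Record) -> str:
    template = ROUTES.get(record.source, "Normalize this text:\n{payload}")
    return call_llm(template.format(payload=record.payload))


def feedback_ok(output: str) -> bool:
    """Feedback hook: gate outputs before they enter the training corpus."""
    return bool(output.strip())  # placeholder quality check


if __name__ == "__main__":
    rec = Record(source="web", payload="hello <div>world</div>")
    out = transform(rec)
    if feedback_ok(out):
        print(out)  # accepted records would join the curated training set
```

The same routing-plus-gate shape extends to the agent-based variants the abstract names, with the static routing table replaced by an agent's tool-selection policy.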
License
Copyright (c) 2025 Olusesan Ogundulu

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.