The Insatiable Hunger for Compute: Powering Large Language Models


Introduction

Language models are everywhere these days, and we constantly interact with them without even knowing it. Next-word prediction on our smartphone keyboards, suggestions while writing an email, and text-to-speech systems all use language models in some form. New research in the area is also being published at an unprecedented rate. Recently, these models have been getting big, and the largest of them are now referred to as Large Language Models, or LLMs.

The size of these LLMs is on the order of billions of parameters, and it keeps growing. They also require colossal training datasets to produce meaningful results. In fact, one of the most widely used LLMs today, ChatGPT, was trained on approximately 570GB of data (which is surprisingly not that large!) [2]. Given that the Internet has existed for a few decades now, it is not difficult to find data at that scale. The point to highlight here is the sheer size of the models themselves. ChatGPT is based on InstructGPT [3] and has around 175 billion parameters! This large number of parameters allows the model to learn complex underlying characteristics of the data and gives it a good understanding of language, but let's pause for a moment and think about the compute required to achieve such performance 🤔.
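To get a feel for what 175 billion parameters means in practice, here is a rough back-of-envelope sketch (the precision and optimizer assumptions are mine, not figures from the papers): just storing the weights in 16-bit precision takes hundreds of gigabytes, before even counting optimizer states or activations during training.

```python
# Rough back-of-envelope estimate of memory for a 175B-parameter model.
# Assumptions (mine, for illustration): fp16 weights (2 bytes/param) for inference,
# and the common rule of thumb of ~16 bytes/param for mixed-precision Adam training
# (fp16 weights + gradients, fp32 master weights, two Adam moment buffers).

NUM_PARAMS = 175e9  # ~175 billion parameters (GPT-3 scale)

inference_gb = NUM_PARAMS * 2 / 1e9   # fp16 weights only
training_gb = NUM_PARAMS * 16 / 1e9   # weights + optimizer states (activations excluded)

print(f"Weights alone (fp16): ~{inference_gb:.0f} GB")                    # ~350 GB
print(f"Training state (mixed-precision Adam): ~{training_gb:.0f} GB")    # ~2800 GB
print(f"That is roughly {training_gb / 80:.0f} A100 (80GB) GPUs just to hold the model state")
```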


How much compute is required?

The amount of compute required to train these models is significant. Let's look at a few examples of LLMs and their respective compute requirements. The first is a (relatively) small model called GPT-NeoX-20B [4], which has 20 billion parameters and was trained on 96 40GB Nvidia A100 GPUs. A more recent model, LLaMA 2 [5], has around 70 billion parameters and required 2000 80GB Nvidia A100 GPUs for training. Another model, BLOOM [6], has close to 176 billion parameters and took over 100 days to train on 384 80GB Nvidia A100 GPUs. Running this many GPUs consumes a lot of electricity, and as models get bigger, or as we train them for longer to reach the desired performance, the power consumption only grows.
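To put those GPU counts into perspective, here is a rough calculation of the GPU-hours and electricity involved in a BLOOM-style training run. The per-GPU power draw below is my own assumption for illustration, not a number reported by the BLOOM team.

```python
# Back-of-envelope: GPU-hours and energy for a BLOOM-scale training run.
# Assumptions (illustrative only): 384 A100 GPUs running for ~100 days,
# each drawing ~400 W on average (the A100's TDP), ignoring CPUs, cooling,
# and other datacenter overhead, so the real figure is higher.

num_gpus = 384
days = 100
avg_gpu_power_kw = 0.4  # ~400 W per GPU, an assumption

gpu_hours = num_gpus * days * 24
energy_mwh = gpu_hours * avg_gpu_power_kw / 1000

print(f"GPU-hours: {gpu_hours:,.0f}")               # ~921,600 GPU-hours
print(f"GPU energy alone: ~{energy_mwh:.0f} MWh")   # ~369 MWh
```

Even this simplified estimate lands in the same ballpark as the ~433 MWh reported for BLOOM in [7], which also accounts for infrastructure overhead.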

If new language models were released only every few months or so, this hunger for GPUs could be satisfied fairly quickly and power consumption could be kept in check. But that isn't the case. Take a look at the figure below, which shows the cumulative number of arXiv papers on language models published over the years (for the query terms "Language Model" and "Large Language Model"; figure taken from [1]).


In other words, as more language models are trained, more compute resources are consumed, leading to an ever-growing demand for electricity to run them. Is this something to be concerned about?


A simple comparison to daily life

Given the times we live in, it is important to recognize the impact of this increasing power consumption on the climate. Let's take the BLOOM model we talked about before. This 176-billion-parameter model is estimated to have consumed around 433 MWh (megawatt-hours) cumulatively during training (see the figure below for more statistics, taken from [7]). Moreover, the carbon emissions during training are estimated at approximately 25-30 tonnes of CO2 equivalent [7]. These numbers are certainly not something to overlook. For reference, 1 MW (megawatt) of generating capacity can power roughly 400-1000 homes for a year. So this is definitely something to pay attention to.

Figure taken from the paper estimating the carbon footprint of the BLOOM model [7].
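As a sanity check on those numbers, the carbon figure follows directly from the energy figure once you assume a grid carbon intensity. The intensity value below is an assumption on my part (roughly that of the low-carbon French grid BLOOM was trained on), chosen only to show the arithmetic.

```python
# Converting training energy into a rough carbon estimate.
# Assumption: grid carbon intensity of ~57 gCO2eq per kWh (roughly a
# low-carbon, nuclear-heavy grid); a coal-heavy grid would be 10-15x higher.

energy_mwh = 433                 # cumulative training energy reported for BLOOM [7]
carbon_intensity_g_per_kwh = 57  # assumption for illustration

tonnes_co2eq = energy_mwh * 1000 * carbon_intensity_g_per_kwh / 1e6
print(f"~{tonnes_co2eq:.0f} tonnes CO2eq from electricity alone")   # ~25 tonnes

# The same run on a coal-heavy grid (~820 gCO2eq/kWh, assumed) would emit:
print(f"~{energy_mwh * 1000 * 820 / 1e6:.0f} tonnes CO2eq")         # ~355 tonnes
```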


But one might say, "We need these big models to give us the best results!" Certainly! And as we saw before, the increase in model size has brought improved performance on NLP tasks, a better understanding of language in general, and even emergent capabilities! But we can definitely do more to address the challenges that come with training LLMs.

Recently, big companies such as Meta have started participating in sustainability programs to offset the carbon emitted during the development of LLMs. This is a step in the right direction, but can we do more?


Possible solutions to tackle such challenges

All of the above points should convince us to foster more research in areas such as reducing the size of LLMs while preserving performance, utilizing resources more efficiently, and devising clever training strategies. In fact, research on efficient resource utilization is already well underway. Techniques such as Parameter-Efficient Fine-Tuning (PEFT [8]) have transformed how LLMs are adapted to downstream tasks: instead of updating all of a model's parameters, only a small number of additional parameters are trained (a minimal example is sketched below). In addition, big corporations are trying to develop smaller models that rival their larger counterparts (for example, LLaMA 2 has 70 billion parameters versus GPT-3's 175 billion, yet it performs very well on numerous NLP tasks and beats ChatGPT in several respects [10]). So research is definitely moving in the right direction. Are there any other methods that might be worth investigating?
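As a concrete illustration of the PEFT idea, here is a minimal sketch using the Hugging Face peft library [8] to wrap a causal language model with LoRA adapters, so that only a tiny fraction of the parameters are updated during fine-tuning. The base model name and the LoRA hyperparameters are placeholders I chose for illustration, not a recommended configuration.

```python
# Minimal sketch: fine-tune a small set of LoRA adapter weights instead of
# all model parameters, using the Hugging Face peft library [8].
# Model name and hyperparameters below are illustrative placeholders.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor for the adapter output
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Reports that well under 1% of parameters are trainable; the frozen base model
# needs no gradients or optimizer states, which is where the savings come from.
```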

A method worth investigating?

Given my fascination with the human brain 🧠, I recently started investigating model training techniques inspired by the human mind. One such technique is known as Curriculum Learning [9]. Although there will be a more detailed post on this later, the general idea of curriculum learning is to train the model (LLM) by showing it examples in order of increasing difficulty. This mirrors how humans and animals learn complex concepts: start with the easy examples, then move on to the difficult ones. Numerous works have shown that this style of training improves performance on natural language understanding tasks. Furthermore, curriculum learning has also been shown to reduce the time models take to converge, which means fewer computational resources are used, and therefore less electricity.
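To make the idea concrete, here is a minimal sketch of one common way to set up a curriculum: score each training example with a difficulty proxy (sequence length is a popular, if crude, choice) and feed batches to the model in order of increasing difficulty. The scoring function and ordering below are simplifications I chose for illustration, not a recipe from the curriculum learning literature.

```python
# Minimal sketch of curriculum-ordered training.
# The difficulty proxy and schedule are simplifications for illustration:
# real curricula use task-specific difficulty measures and pacing functions.

def difficulty(example: str) -> int:
    # Crude proxy: longer texts are treated as harder.
    return len(example.split())

def curriculum_batches(dataset: list[str], batch_size: int):
    # Sort from easiest to hardest, then yield batches in that order.
    ordered = sorted(dataset, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

dataset = [
    "The cat sat.",
    "The cat sat on the mat near the window.",
    "Despite the rain, the cat that had been sleeping on the mat darted outside.",
]

for step, batch in enumerate(curriculum_batches(dataset, batch_size=1)):
    # In a real training loop, this is where the LLM would see the batch:
    # loss = model(batch); loss.backward(); optimizer.step()
    print(f"step {step}: {batch}")
```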

Okay, so why doesn't everyone use this method? One of the key challenges of curriculum learning is designing the curriculum itself. This is generally highly task-specific, and there is no one-size-fits-all solution. In fact, it is quite possible that designing a task-specific curriculum ends up consuming more electricity than simply training the LLM without one, which would undermine the original goal of reducing electricity consumption. In addition, as datasets get larger, designing a sophisticated curriculum becomes even more complex. That being said, the more research is carried out on clever training strategies such as curriculum learning, the greater the chance of discovering new ways to train LLMs more efficiently.

In another post, we will look at the specifics of curriculum learning and how it has improved language model performance on many NLP tasks!


NOTE: The blog has been moved to substack for a better user experience. Further posts will be published on substack. https://rohansaha60.substack.com/

References

1. Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2023). A Survey of Large Language Models (Version 12). arXiv. https://doi.org/10.48550/ARXIV.2303.18223

2. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners (Version 4). arXiv. https://doi.org/10.48550/ARXIV.2005.14165

3. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2203.02155

4. S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach, "GPT-NeoX-20B: An open-source autoregressive language model," CoRR, vol. abs/2204.06745, 2022.

5. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.

6. T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suárez, V. Sanh, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong, D. van Strien, D. I. Adelani, et al., "BLOOM: A 176B-parameter open-access multilingual language model," CoRR, vol. abs/2211.05100, 2022.

7. Luccioni, A. S., Viguier, S., & Ligozat, A.-L. (2022). Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2211.02001

8. https://github.com/huggingface/peft

9. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09: The 26th Annual International Conference on Machine Learning held in conjunction with the 2007 International Conference on Inductive Logic Programming. ACM. https://doi.org/10.1145/1553374.1553380

10. https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
