Foundation models: Transformative shift in AI

Published in Computational Sciences

Foundation models are used more and more, even when they are not explicitly recognized as such. These powerful artificial intelligence models play a crucial role in numerous sectors, driving progress in natural language processing, computer vision, and beyond. People frequently interact with applications powered by these models, benefiting from their ability to produce human-like text and carry out intricate vision tasks. Nevertheless, there is a noticeable gap in the general understanding of foundation models, the large, pre-trained models that form the basis for a wide variety of AI applications.

Beyond pre-trained models: Scale and homogenization 

Foundation models are large, pre-trained models that serve as a basis or starting point for diverse machine learning tasks. They are trained on extensive datasets to acquire patterns, features, and representations useful for a broad array of downstream applications. Technologically speaking, foundation models are not novel: they rely on self-supervised learning and deep neural networks, concepts that have been around for years. What is new is their scale and scope, which in recent years have pushed the boundaries of what is achievable. Their power stems from this scale, made possible by advances in computer hardware and access to far larger pools of training data, and it has led to an exceptional level of homogenization: researchers and developers can exploit the knowledge acquired by a foundation model and adapt it to a wide range of tasks without training from scratch, using different forms of transfer learning. That said, their robustness to domain shifts and out-of-distribution data remains questionable.
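The adaptation step can be illustrated with a toy "linear probe" sketch: a frozen pretrained encoder (here a hard-coded feature function standing in for a real backbone) provides features, and only a small classification head is trained on the downstream task. The task, data, and feature function below are purely illustrative assumptions, not drawn from any particular foundation model.

```python
import math

# Stand-in for a frozen pretrained encoder: maps a raw input (a pair of
# numbers) to a fixed feature vector. In practice this would be a large
# network whose weights are left untouched during adaptation.
def frozen_encoder(x):
    a, b = x
    return [a + b, a - b, a * b]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(data, labels, lr=0.1, epochs=200):
    """Train only a logistic-regression head on the frozen features."""
    feats = [frozen_encoder(x) for x in data]
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    f = frozen_encoder(x)
    return int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5)

# Toy downstream task: do the two inputs share the same sign?
data = [(1, 2), (2, 1), (-1, -2), (-2, -1), (1, -2), (-1, 2)]
labels = [1, 1, 1, 1, 0, 0]
w, b = train_linear_probe(data, labels)
print([predict(x, w, b) for x in data])
```

The point of the sketch is the division of labor: the expensive, general representation is reused as-is, and only a tiny task-specific head is learned, which is why adaptation is so much cheaper than training from scratch.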

Foundation models are transforming natural language processing and computer vision by exhibiting zero-shot and few-shot generalization, which extends their applicability to tasks beyond those they were exposed to during training. When these models are scaled up and trained on vast text datasets, their zero- and few-shot performance matches, and in some instances even surpasses, that of fine-tuned models. This ability is often exercised through in-context learning and prompt engineering, which instruct the model to generate a valid response to unseen tasks and data. Empirical observations indicate that these capabilities improve as model scale, dataset size, and overall training compute increase.
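In-context learning works by packing worked demonstrations into the prompt itself, so the model infers the task without any weight update. A minimal sketch of assembling such a few-shot prompt (the sentiment task and the "Input:"/"Output:" formatting are illustrative conventions, not tied to any specific model API):

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, worked
    demonstrations, then the new query for the model to complete."""
    lines = [instruction, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

examples = [
    ("The movie was wonderful.", "positive"),
    ("I wasted two hours of my life.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "A delightful surprise from start to finish.",
)
print(prompt)
```

With zero demonstrations this becomes a zero-shot prompt; adding a handful of examples is often enough for the model to pick up the expected format and labels.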

Foundation models for Natural Language Processing 

The introduction of foundation models first marked a significant milestone in natural language processing, where their origin can be traced to transformer-based models. The Generative Pre-trained Transformer (GPT-1) established the idea of pre-training on extensive volumes of text data to build a comprehensive grasp of language. GPT-2 followed, illustrating the scalability of the approach with a remarkable surge in the number of parameters. GPT-3, with its 175 billion parameters, is the impressive continuation of this line of models, exhibiting unparalleled capabilities in language comprehension and generation. Other large language models have of course emerged, e.g., T5 (Text-To-Text Transfer Transformer) with about 220 million parameters in its base version, but those mentioned are the most popular.

Foundation models reforming Computer Vision

Foundation models have also been explored in the field of computer vision, although more recently and to a lesser extent. A notable approach aligns paired text and images sourced from the web: CLIP and ALIGN, for instance, use contrastive learning to train encoders that align the text and image modalities. After training, engineered text prompts enable generalization to various downstream tasks and visual data. More recently, Meta AI introduced the Segment Anything Model (SAM) as the first foundation model for image segmentation. SAM is trained on its dedicated dataset (SA-1B), the most extensive segmentation dataset to date, featuring over 1 billion masks on 11 million licensed and privacy-respecting images. The model is deliberately designed and trained to be promptable, allowing zero-shot transfer to new tasks such as text-to-mask segmentation, instance segmentation, and edge detection.
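Once text and image encoders share one embedding space, CLIP-style zero-shot classification reduces to embedding the image and a set of engineered text prompts, then picking the most similar prompt. A toy sketch, with hand-made three-dimensional vectors standing in for the outputs of the trained encoders:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hand-made embeddings standing in for the outputs of trained text and
# image encoders that were contrastively aligned into one space.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the prompt whose embedding is most similar to the image's."""
    return max(text_embeddings,
               key=lambda p: cosine(image_embedding, text_embeddings[p]))

image_embedding = [0.85, 0.2, 0.05]  # pretend encoder output for a cat photo
print(zero_shot_classify(image_embedding, text_embeddings))
```

Swapping in a new set of prompts retargets the classifier to a new label set with no retraining, which is exactly what makes this prompt-driven zero-shot transfer attractive.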

What risks lie beneath?

Foundation models, especially specific-purpose ones, bring significant benefits to many practical and research domains, such as education, law, and biology. However, their use also poses substantial challenges and risks; two of them are presented here:

- Because of their scale, foundation models can harm the environment by contributing to increased carbon emissions, especially if their developers are not careful. Training these models involves extensive use of data and compute, sometimes spanning several months on numerous GPUs, so it is crucial to address these emissions. The negative environmental impact can be lessened in various ways, for instance by training models in regions with low carbon intensity or by opting for more efficient models and hardware. If all available mitigation measures have been considered and further mitigation is not feasible, a careful evaluation of the costs and benefits to society is necessary to determine whether deploying a larger foundation model is justified over a smaller, more efficient one. It is also worth recognizing that the initial training cost of a large foundation model may be amortized over its lifespan.

- Due to their extensive generative abilities, some foundation models can be misused for deceptive purposes. One alarming possibility is the generation of deepfake content: these models can create convincing audio or video material that is difficult to distinguish from real sources and can be used to spread misinformation and influence public perception. This was the case for the Indian investigative journalist Rana Ayyub, who became the target of a sophisticated deepfake that digitally inserted her likeness into a pornographic video, prompting her withdrawal from public life for several months.
