Are Large Multimodal Models Gen AI’s New LLMs?


Six months ago, an open letter called for a pause on further generative artificial intelligence (AI) development.

It was signed by tech luminaries including Elon Musk and Steve Wozniak and was published by the Future of Life Institute, a nonprofit focused on existential risks surrounding AI and other emerging technologies. The public plea called for a pause on the creation of AI models whose capabilities surpass those of GPT-4.

Now, they are going to have to write another letter.

That’s because a new foundational generative AI model has already arrived: a multimodal one that can move seamlessly across text and visual prompts, and even video.

In a new paper titled “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision),” published Friday (Sept. 29), researchers from Microsoft show how large multimodal models (LMMs) can extend the capabilities of large language models (LLMs) in unprecedented ways.

“LMMs extend LLMs with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence … Overall, this integration of LMMs with a pool of multimodal plugins opens up new possibilities for enhanced reasoning and interaction, leveraging the strengths of both language and vision capabilities,” the paper stated.

The researchers added that “[the] unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods … The flexibility of multimodal chains allows for a more comprehensive understanding and analysis of multimodal data, and can potentially lead to improved performance in various applications.”
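
To make the multimodal idea concrete, the sketch below shows what a combined text-and-image prompt can look like in code, using OpenAI’s Python client as one illustrative interface. The model name, image URL and prompt are placeholder assumptions, and the exact API surface for GPT-4V-style models may differ from what is shown.

```python
# Minimal sketch of a multimodal (text + image) prompt.
# Assumes the OpenAI Python client (v1+) and a vision-capable chat model;
# the model name and image URL below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder for a GPT-4V-style model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the chart in this image and summarize its key trend.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The point of the sketch is that a single request mixes modalities: the text instruction and the image travel together, and the model reasons over both at once.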

Read also: What’s Next for AI? Experts Say Going More Multimodal

Supercharged AI Models Will Reshape Industries

It will be a year next month since OpenAI first launched its ChatGPT product.

The generative AI solution was like nothing else the world had seen, becoming the fastest-growing app of all time and spurring the creation of an entirely new AI industry: more than $40 billion in venture capital funding has been allocated to AI firms since 2023 began.

And much of that buzz and excitement centered on a form of generative AI that has already been overtaken by the cross-modal capabilities of newer models, which can even begin to train themselves by recognizing data across modalities.

“As a natural progression, LMMs should be able to generate interleaved image-text content, such as producing vivid tutorials containing both text and images, to enable comprehensive multimodal content understanding and generation. Additionally, it would be beneficial to incorporate other modalities, such as video, audio, and other sensor data, to expand the capabilities of LMMs,” the Microsoft research stated.

“Regarding the learning process, current approaches predominantly rely on well-organized data, such as image-tag or image-text datasets. However, a more versatile model may be able to learn from various sources, including online web content and even real-world physical environments, to facilitate continuous self-evolution,” the paper added.

As reported by PYMNTS, SoftBank CEO Masayoshi Son on Wednesday (Oct. 4) said at the annual SoftBank World event that artificial general intelligence (AGI) — a computer system that can match human thought and reasoning — will be 10 times more powerful than all of humanity by the end of the decade.

“Do you want to be a goldfish?” Son asked, arguing that people who avoid AI and those who use it will be as different as apes and humans in intellectual abilities.

Mustafa Suleyman, co-founder of DeepMind and Inflection AI, said in September that “The third wave [of AI] will be the interactive phase. That’s why I’ve bet for a long time that conversation is the future interface. You know, instead of just clicking on buttons and typing, you’re going to talk to your AI.”

Read more: Google and Microsoft Spar Over Training Rights to AI Data

The Future of AI Will Be Hyper-Specialist vs. Hyper-Generalist

Still, despite the revolutionary promise of LMMs, building an AI model that is capable of everything is an incredibly expensive endeavor.

High computing costs and data constraints keep all but the most well-funded organizations from building out the kind of foundational models meant to one day mimic and scale the capabilities of the human mind.

These costs, however, have spurred the proliferation of smaller models trained on specific data sets to do specific things — and they promise to have revolutionary applications in everything from accounting to healthcare and drug discovery.

Fine-tuning foundation models with fit-for-purpose data represents a new way for AI to be democratized, and for solutions to have a greater, more focused impact.

According to a count by the open-source AI startup Hugging Face, around 1,500 versions of such fine-tuned models exist.
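
For readers curious what fine-tuning a foundation model with fit-for-purpose data can look like in practice, here is a minimal sketch using Hugging Face’s transformers and datasets libraries. The base model, dataset and hyperparameters are illustrative assumptions rather than a prescribed recipe; a real domain-specific project would swap in its own data and task.

```python
# Minimal sketch of fine-tuning a small foundation model on a
# task-specific dataset. Base model, dataset, and hyperparameters
# are illustrative assumptions, not a prescribed recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert-base-uncased"  # small, widely used base model
dataset = load_dataset("imdb")          # stand-in for fit-for-purpose domain data

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

def tokenize(batch):
    # Convert raw text into fixed-length token IDs the model can consume.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="finetuned-domain-model",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small slice for the sketch
    eval_dataset=tokenized["test"].select(range(500)),
)

trainer.train()
```

The design choice the sketch illustrates is the one the article describes: rather than training a giant generalist model from scratch, a team starts from an existing foundation model and adapts it with a comparatively small, focused dataset.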