Google throws down the gauntlet with Gemini — its multimodal genAI engine

Google has introduced the market's first native multimodal generative AI model, one that can ingest text, audio, images, and video and generate content from them.


Google on Thursday announced it has rebuilt and renamed its Bard chatbot — now called Gemini — to offer enterprises and consumers the industry's first multimodal generative AI (genAI) platform, one that no longer relies on text alone to provide human-like responses.

The release of Gemini represents a direct challenge to Microsoft’s Copilot, which is based on OpenAI's ChatGPT, and every other chatbot based only on large language model (LLM) technology.

“It [Gemini] is currently the only native multimodal generative AI model available,” said Chirag Dekate, a vice president analyst at Gartner. “Google is no longer playing catch-up. Now, it is the other way around.”

Dekate called Gemini “a really big deal” because, with a multimodal model, a single genAI engine performs individual tasks more accurately, learning as it does from a vastly larger body of knowledge. It essentially catapults Google to the head of the genAI pack.

Google first unveiled its Gemini AI model in December, touting multimodal capabilities that allow it to combine different types of information — inputs and outputs — including text, code, audio, images, and video.

Unlike LLM-only AI engines such as OpenAI’s GPT, Meta’s Llama 2, or even Google’s own PaLM 2 — all of which power today’s chatbots — Gemini doesn’t rely on that same technology. Instead, it can be trained using all types of media and content.

That matters because an enterprise can now create a chatbot that is no longer confined to answering queries from the text on which its LLM was trained.

“When I watch a movie, I am watching the video, I am reading the text (subtitles), I am listening to the audio, and it’s all happening simultaneously, creating a hyper-immersive experience," Dekate said. “This is multimodality in a nutshell. Compare this to experiencing a movie by reading its script alone (LLM); this is the difference between LLM and multimodality.”

Last year, Dekate said, was a year of ideation as enterprises and consumers learned about genAI and chatbots in the wake of ChatGPT’s release in late 2022. Now, enterprises better understand the possibilities of genAI and are opening their wallets to infuse back-end and front-end systems with it.

If you’re a healthcare company, for example, trying to design a more immersive chatbot for physicians, a multimodal genAI engine can ingest a physician’s audio snippets, radiological images, and MRI video scans to produce vastly more accurate prognoses and treatment recommendations.
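To make the idea concrete, here is a minimal sketch of what a multimodal request could look like using Google's google-generativeai Python client. The model name, image file, and prompt are illustrative assumptions, not a prescription for clinical use; a real workflow would involve far more than a single API call.

```python
# Minimal sketch of a multimodal prompt (illustrative only).
# Assumes the google-generativeai client is installed and an API key is available;
# the model name, image file, and prompt are hypothetical placeholders.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # assumption: key provisioned by the developer

# A multimodal Gemini model accepts mixed parts (text + image) in one request.
model = genai.GenerativeModel("gemini-pro-vision")

xray = PIL.Image.open("chest_xray.png")  # hypothetical radiology image
prompt = (
    "Summarize notable findings in this chest X-ray and list follow-up "
    "questions a physician might ask. This is a demo, not medical advice."
)

response = model.generate_content([prompt, xray])
print(response.text)
```

Audio and video would, in principle, ride along as additional parts in the same kind of request, though how those media are uploaded depends on the client library and model version.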

“This creates a hyper-immersive, personal experience. None of this is possible using a simple LLM experience,” Dekate said. “If Google can enable enterprises and consumers to experience this multimodal experience, then Google has the chance to change the market share.”

In 2024, spending on genAI solutions is expected to reach $40 billion, up from $19.4 billion in 2023. By 2027, genAI spending is expected to hit $143 billion, with a five-year compound annual growth rate of 73.3%, according to research firm IDC.

“What we saw last year was the emergence of task-specific models — text-to-text, text-to-image, text-to-video, image-to-text, etc.,” Dekate said. “Each task had its own model. So, if you have a narrow task of text-to-text, then LLMs perform well.”

Google’s $20-per-month Gemini subscription also appears aimed at taking market share from leader Microsoft.

US customers can subscribe for $19.99 a month to access Gemini Advanced, which includes the more powerful Ultra 1.0 AI model. Subscribers will receive two terabytes of cloud storage that typically costs $9.99 a month, and will soon gain access to Gemini in Gmail and Google's productivity suite.

Google's new One AI Premium plan is its answer to Microsoft and its genAI partner OpenAI, which developed the GPT LLM that powers ChatGPT.

"Part of it is competing with Microsoft, and part of it is to offer premium services to its premium [customers], mostly business office users who are already paying," said Jack Gold principal analyst at J.Gold Associates. "Also, if you charge a fee, you limit the number of users that would have signed on for free. That gives you the opportunity to fix any problems seen by a more limited number of users, and provides a revenue stream to keep up the engineering, rather than relying on ads to pay for it."

There is also the issue of cost for Google, because it's not cheap to train a large AI model in data centers.

"Not sure how they get paid with running all that AI in the background, which takes a lot more processing power, and power is one of the biggest expenses of running a cloud/data center," Gold said.

“What’s amazing about Gemini is that it’s so good at so many things,” said Google DeepMind CEO Demis Hassabis. “As we started getting to the end of the training, we started to see that Gemini was better than any other model out there on these very important benchmarks. For example, in each of the 50 different subject areas that we tested it on, it’s as good as the best expert human in those areas.”

A year ago, OpenAI's ChatGPT Plus pioneered the market for buying early access to AI models and other features, while Microsoft recently announced a competing subscription for AI in programs such as Word and Excel. Both subscriptions cost $20 a month in the United States.

