Google turbocharges its genAI engine with Gemini 1.5

Only a week after releasing Gemini 1.0, Google has pushed out for testing its latest multimodal AI model; it offers long-context understanding that can accept more than one million tokens.

Only a week after releasing its latest generative artificial intelligence (genAI) model, Google on Thursday unveiled that model’s successor, Gemini 1.5. The company boasts that the new version bests the earlier version on almost every front.

Gemini 1.5 is a multimodal AI model now ready for early testing. Unlike with OpenAI's popular ChatGPT, Google said, users can feed a much larger amount of information into its query engine to get more accurate responses.

(OpenAI also announced a new AI model today: Sora, a text-to-video model that can generate complex video scenes with multiple characters, specific types of motion, and accurate details of the subject and background "while maintaining visual quality and adherence to the user’s prompt." The model understands not only what the user asked for in the prompt, but also how those things exist in the physical world.)

OpenAI

A movie scene generated by Sora.

Google's Gemini models are the industry’s only native, multimodal large language models (LLMs); both Gemini 1.0 and Gemini 1.5 can ingest and generate content through text, images, audio, video and code prompts. For example, user prompts in the Gemini model can be in the form of JPEG, WEBP, HEIC or HEIF images.

"Both OpenAI and Gemini recognize the importance of multi-modality and are approaching it in different ways. Let us not forget that Sora is a mere preview/limited availability model and not something that will be generally available in the near-term," said Arun Chandrasekaran, a Gartner distinguished vice president analyst.

OpenAI's Sora will compete with start-ups such as text-to-video model maker Runway AI, he said.

Gemini 1.0, first announced in December 2023, was released last week. With that move, Google said it had reconstructed and renamed its Bard chatbot.

Gemini has the flexibility to run on everything from data centers to mobile devices.

Though ChatGPT 4, OpenAI’s latest LLM, is multimodal, it only offers a couple of modalities such as images and text or text to video, according to Chirag Dekate, a Gartner vice president analyst.

“Google is seizing its role as the leader as an AI cloud provider. They’re no longer playing catch up. Others are,” Dekate said. "If you’re a registered user of Google Cloud, today you can access more than 132 models. Its breadth of models is insane.”

"Media and entertainment will be the vertical industry that may be early adopters of models like these, while business functions such as marketing and design within technology companies and enterprises could also be early adopters," Chandrasekaran said.

Currently, OpenAI is working on its next-generation GPT 5; that model is likely to also be multimodal. Dekate, however, argued that GPT 5 will consist of many smaller models cobbled together, and won't be natively multimodal. That will likely result in a less-efficient architecture.

The first Gemini 1.5 model Google has offered for early testing is Gemini 1.5 Pro, which the company described as "a mid-size multimodal model optimized for scaling across a wide-range of tasks." The model performs at a similar level to Gemini 1.0 Ultra, its largest model to date, but requires vastly fewer GPU cycles, the company said. 

Gemini 1.5 Pro also introduces an experimental feature in long-context understanding, meaning it allows developers to prompt the engine with up to 1 million context tokens.

Developers can sign up for a Private Preview of Gemini 1.5 Pro in Google AI Studio.

Google AI Studio is the fastest way to build with Gemini models and enables developers to integrate the Gemini API in their applications. It’s available in 38 languages across more than 180 countries and territories.

Google

A comparison between Gemini 1.5 and other AI models in terms of token context windows.

Google’s Gemini model was built from the ground up to be multimodal, and doesn’t consist of multiple parts layered atop one another as competitors' models are. Google calls Gemini 1.5 “a mid-size multimodal model” optimized for scaling across a wide range of tasks; while it performs at a similar level to 1.0 Ultra, it does so by applying many smaller models under one architecture for specific tasks.

Google achieves the same performance in a smaller LLM by using an increasingly popular framework known as “Mixture of Experts,” or MoE. MoE is based on two key architecture elements: it layers a combination of smaller neural networks together, and it runs a series of neural-network routers that dynamically drive query outputs.

“Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways in its neural network. This specialization massively enhances the model’s efficiency,” Demis Hassabis, CEO of Google DeepMind, said in a blog post. “Google has been an early adopter and pioneer of the MoE technique for deep learning through research such as Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4 and more.”
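To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is an illustration only and not Google's implementation: the four toy "experts" simply scale their input, and the names `moe_forward`, `ROUTER_WEIGHTS` and `EXPERTS` are hypothetical. What it shows is the general pattern described above, in which a router scores every expert for a given input but only the highest-scoring few actually run.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical toy expert: multiplies every input value by `scale`."""
    return lambda x: [scale * v for v in x]


# Assumed toy setup: four experts and one routing vector per expert.
EXPERTS = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
ROUTER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route input x to the top_k highest-scoring experts and blend their outputs."""
    # The router scores each expert with a dot product against the input.
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    # Only the top_k experts are activated; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](x)
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, chosen


if __name__ == "__main__":
    output, chosen = moe_forward([2.0, 1.0], ROUTER_WEIGHTS, EXPERTS)
    print("experts used:", chosen)
    print("blended output:", output)
```

With `top_k=2`, only two of the four experts run for any given input, which is the source of the compute savings at inference time. Production MoE models learn the router weights during training rather than fixing them by hand as this sketch does.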

The MoE architecture allows a user to input an enormous amount of information but enables that input to be processed with vastly fewer compute cycles in the inference stage. It can then deliver what Dekate called “hyper-accurate responses.”

“Their competitors are struggling to keep up, but their competitors don’t have DeepMind or the GPU [capacity] Google has to deliver results,” Dekate said.

With the new long-context understanding feature, Gemini 1.5 has a 1 million-token context window, meaning it can allow a user to type in a single sentence or upload several books' worth of information to the chatbot interface and receive back a targeted, accurate response. By comparison, Gemini 1.0 had a 32,000-token context window.

Rival LLMs are typically limited to context windows of about 10,000 tokens, with the exception of GPT 4, which can accept up to 125,000 tokens.

Natively, Gemini 1.5 Pro comes with a standard 128,000-token context window. Google, however, is allowing a limited group of developers and enterprise customers to try it in private preview with a context window of up to 1 million tokens via AI Studio and Vertex AI; it will grow from there, Google said.
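For a rough sense of what those window sizes mean in practice, the sketch below estimates whether a document fits in each window cited in this article. The four-characters-per-token ratio is an assumption, a common rule of thumb for English text rather than a figure from Google, and real tokenizers will produce different counts. The names `CONTEXT_WINDOWS`, `estimate_tokens` and `models_that_fit` are illustrative and not part of any Google API.

```python
import math

# Window sizes as reported in this article, in tokens.
CONTEXT_WINDOWS = {
    "Gemini 1.0": 32_000,
    "Gemini 1.5 Pro (standard)": 128_000,
    "Gemini 1.5 Pro (private preview)": 1_000_000,
}

# Assumption: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text):
    """Rough token estimate from character count; not a real tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def models_that_fit(text):
    """Return the window names large enough to hold the whole text."""
    needed = estimate_tokens(text)
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


if __name__ == "__main__":
    # Stand-in for a long document of about 600,000 characters.
    document = "a" * 600_000
    print("estimated tokens:", estimate_tokens(document))
    print("fits in:", models_that_fit(document))
```

Under that assumption, a 600,000-character document comes to about 150,000 tokens, which exceeds the standard 128,000-token window and fits only in the 1 million-token preview.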

“As we roll out the full one-million token context window, we’re actively working on optimizations to improve latency, reduce computational requirements and enhance the user experience,” Hassabis said.