Google Gemini, the multimodal AI model, is here: know its features and use cases

Google Gemini was unveiled yesterday, December 6, by Alphabet CEO Sundar Pichai and Demis Hassabis, CEO of DeepMind, the company’s AI research division. Overtaking PaLM 2, it is now the largest large language model the company has released so far, and with its size come new capabilities. As a multimodal AI model, its top variant, Gemini Ultra, can work with text, images, video, and audio, pushing the boundaries of what a general-purpose foundation model can do. So, if you have been wondering about the features and use cases of Gemini AI, read on.

After announcing its new AI model, Google posted a YouTube video showcasing the capabilities of Google Gemini. The video states, “We’ve been capturing footage to test it on a wide range of challenges, showing it a series of images, and asking it to reason about what it sees”. The video goes on to highlight some of the more advanced features and use cases of Gemini.

Google Gemini features

Throughout the video, Gemini is given access to a camera so that it can see what the user is doing. The video puts the AI model through several tests in which it has to analyze what is happening in front of the camera.

1. Multimodal dialogue

In the first segment, the user draws on a piece of paper and asks Gemini to guess what it sees. The AI model keeps guessing as the user adds more detail to the drawing. At each step, Gemini offers a reasonable analysis of the drawing and provides additional information about the object. It also recognizes objects and offers information about what they might be made of.

2. Multilinguality

In the second segment, the user asks the AI how to pronounce a word in a different language. Not only does the AI show the response as text, but it also offers an audio response to help the user with the pronunciation.

3. Game creation

In the third segment, the user puts a world map and a rubber duck on the table and asks the AI to create a fun game based on them, using emojis. Gemini obliges and creates a country-guessing game in which the user has to guess the name of a country from three emojis.

4. Visual puzzles

In the next segment, the AI is asked to solve puzzles presented to it in the real world. The video shows it following the puzzles in real time and solving them with ease.

5. Making connections

In the next segment, the user places two random objects on the table and asks Gemini what it sees. Based on the visual context, the AI makes a connection between the two objects and categorizes them. The user keeps switching out the objects, but each time Gemini finds a fitting category that groups the items together.

6. Image and text generation

Next, the user places two balls of yarn of different colors on the table and asks the AI to suggest what could be made with them. The AI comes up with several ideas. While the primary response is text, it also shows an AI-generated reference image to help the user visualize the final result.

7. Logic and spatial reasoning

The AI is also shown to be comfortable answering logic-based visual puzzles, correctly identifying their various aspects before offering a solution.

8. Translating visuals

In the last segment, Google Gemini is asked to identify what the user is drawing. As he draws a guitar, the AI identifies it and plays AI-generated guitar music. The user keeps adding more instruments and themes, and the AI changes the music based on each new element.

The video highlights many of Gemini's capabilities and shows how the AI model, once it is integrated into different devices and turned into specific AI tools, may help users in different situations.
