Unveiling the Future of Content Search: Multimodal Prompting in Action with Gemini 1.5 Pro and a Classic Film

In a groundbreaking demonstration, Google used AI Studio to show off the capabilities of its newest model, Gemini 1.5 Pro, which includes an experimental feature known as “long context understanding”. The feature was showcased in a screen recording that used a 44-minute Buster Keaton film, which translates to roughly 600,000 tokens of data.
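To put those figures in perspective, the short calculation below works out the average token rate of the film and how much of the 1 million token limit cited later in this article it would occupy. It is only a back-of-the-envelope sketch: it assumes the token count is spread evenly across the running time, which the demo does not state.

```python
# Figures quoted in this article.
FILM_MINUTES = 44
FILM_TOKENS = 600_000
CONTEXT_LIMIT_TOKENS = 1_000_000

# Assumption: tokens are distributed evenly over the film's running time.
film_seconds = FILM_MINUTES * 60
tokens_per_second = FILM_TOKENS / film_seconds
context_share = FILM_TOKENS / CONTEXT_LIMIT_TOKENS

print(f"Running time: {film_seconds} s")
print(f"Average rate: {tokens_per_second:.0f} tokens per second")
print(f"Share of the context limit: {context_share:.0%}")
```

At that average rate, a video of a little over 70 minutes would reach the 1 million token limit.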

The demo involved uploading the film to Google AI Studio and issuing a multi-part prompt: find the exact moment a piece of paper is taken out of someone’s pocket, report key details written on it, and give the timecode. The screen capture is sped up, but it displays the actual processing time for each prompt, with the caveat that processing times may vary.
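The demo itself was run in the AI Studio web interface, but a similar request could in principle be made from code. The sketch below is an illustration, not the demo's method: it assumes the `google-generativeai` Python package, an API key in a `GOOGLE_API_KEY` environment variable, and the model name `gemini-1.5-pro`, which may no longer be available. The prompt wording is a paraphrase of the description above, and the function name and polling interval are hypothetical.

```python
import os
import time

# Paraphrase of the prompt described in the article (not the exact demo wording).
PROMPT = (
    "Find the moment when a piece of paper is removed from a person's pocket. "
    "Tell me some key information written on it, along with the timecode."
)


def ask_about_video(video_path: str, prompt: str = PROMPT,
                    model_name: str = "gemini-1.5-pro") -> str:
    """Upload a video and ask a question about it (hypothetical helper)."""
    # Imported lazily so this file can be loaded without the SDK installed.
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

    # Upload the file, then wait until the service has finished processing it.
    video = genai.upload_file(path=video_path)
    while video.state.name == "PROCESSING":
        time.sleep(10)
        video = genai.get_file(video.name)
    if video.state.name == "FAILED":
        raise RuntimeError("Video processing failed")

    # Send the video and the text prompt together as one multimodal request.
    model = genai.GenerativeModel(model_name=model_name)
    response = model.generate_content(
        [video, prompt], request_options={"timeout": 600}
    )
    return response.text
```

Calling `ask_about_video("film.mp4")`, where `film.mp4` is a placeholder path, would return the model's text answer. As in the demo, both the response time and the answer itself may vary between runs.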

Gemini 1.5 Pro responded with remarkable accuracy, giving a timecode of 12:01 and describing the paper as a pawn ticket from Goldman & Co Pawn Brokers, including the date and cost. Checking the film at that timecode confirmed that the model had located the right scene and extracted the text accurately.
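Verifying an answer like this means jumping to the reported timecode in the film. A minimal sketch of that step is shown below: it pulls timecodes out of a free-text response and converts them to seconds for seeking in a player. The helper names and the sample sentence are made up for illustration, and the regular expression assumes timecodes written as `MM:SS` or `H:MM:SS`.

```python
import re

# Matches MM:SS or H:MM:SS, for example "12:01" or "1:02:03".
TIMECODE_PATTERN = re.compile(r"\b\d{1,2}:[0-5]\d(?::[0-5]\d)?\b")


def find_timecodes(text: str) -> list[str]:
    """Return every timecode-like string found in the text, in order."""
    return TIMECODE_PATTERN.findall(text)


def timecode_to_seconds(timecode: str) -> int:
    """Convert 'MM:SS' or 'H:MM:SS' to a number of seconds."""
    parts = [int(part) for part in timecode.split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"Unsupported timecode: {timecode!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


# Hypothetical response text, written for this example.
sample = "The paper is removed from the pocket at 12:01."
for tc in find_timecodes(sample):
    print(tc, "is", timecode_to_seconds(tc), "seconds into the film")
```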

The demonstration also explored the model’s multimodal capabilities by presenting a simple drawing of a scene and asking for the corresponding timecode. The model returned the correct timecode, 15:34, showing that it can match an abstract sketch to a specific moment in the video.

These examples highlight Gemini 1.5 Pro’s potential for understanding and processing long multimodal contexts of up to 1 million tokens from short prompts. The model’s responses may vary and are not always flawless, but its ability to interpret complex prompts and abstract visuals without extensive explanation sets a high bar for generative AI models.