Experimenting with multimodal LLMs
An experiment… 🎾 can a multimodal LLM be a good tennis commentator? I live in Melbourne, and every year we are fortunate to have the Australian Open come to town. A few weeks ago OpenAI released Live Video in ChatGPT, and I wanted to see if it could commentate the women's final. Over 3.5 minutes of play I asked my AI tennis companion 18 questions about the match, and I was surprised by how well it did. It answered 13 correctly and got 5 wrong 🤖
Where it was really let down was on longer rallies: it quickly lost track of what was happening and would then happily hallucinate an incorrect answer.
To keep it simple, my questions for this test were factual (e.g. what is the score? who won the point?) rather than opinions or insights about the play, strategy, etc. (which, unlike McEnroe, the LLM cannot offer because it cannot keep up with the play). Human commentators are safe for some time yet.
Our team at Time Under Tension have also been experimenting with the Gemini Live Video model, which lets us build custom apps and add a knowledge base. For example, we can ground the LLM with player bios and their previous Aus Open matches. This additional context makes for a more well-rounded commentator, but the same limitations remain for fast action and keeping up with the play.
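For the curious, here's roughly what that grounding looks like. This is a minimal sketch, not our app: it assumes the google-genai Python SDK, sends a single captured frame rather than the live video stream, and the model name, player bios, frame file and API key are all placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Hypothetical "knowledge base": short player notes fed in as context.
PLAYER_BIOS = """
Player A: baseline aggressor, two previous AO finals.
Player B: serve-and-volley game, first AO final.
"""

# A single captured frame stands in for the live stream in this sketch.
with open("rally_frame.jpg", "rb") as f:
    frame = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder; any multimodal Gemini model
    contents=[
        types.Part.from_bytes(data=frame, mime_type="image/jpeg"),
        "Who just won the point, and what is the score?",
    ],
    config=types.GenerateContentConfig(
        system_instruction=(
            "You are a concise tennis commentator. Use these player notes:\n"
            + PLAYER_BIOS
        ),
    ),
)
print(response.text)
```

The same idea carries over to the Live API: the player notes ride along as system instructions, so every answer is grounded in that extra context rather than the video alone.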
These are version 1.0 of multimodal LLMs, and they're only going to get better. Imagine having your own (simple) AI commentator when watching the tennis, in any language, and personalised to you. By the time the next AO comes around, I expect this will be reality.