What is the relationship between LLMs and multi-modality?

Video Link: https://www.youtube.com/watch?v=O_G5WJPekTc
Duration: 3:12


Check out my essays: https://aisc.substack.com/
OR book me to talk: https://calendly.com/amirfzpr
OR subscribe to our event calendar: https://lu.ma/aisc-llm-school

AF: One of the interesting strategies Cohere is using, versus other providers, is specializing a bunch of different models for different tasks: you have a re-ranker, you have a few other things you mentioned. In a lot of real-world systems, as you presented, you want to use other tools because they're specialized and potentially much better at a particular task than LLMs would ever be. So, expand on that philosophy.

JA: This is one of the things that sets Cohere apart: the focus on practical applications right now instead of chasing AGI or trying to create superhuman intelligence. A lot of it comes down to: how can we build the best AI systems to empower the next generation of software systems? To us, that breaks down into two families of models. One is search and retrieval models; search and retrieval has been one of the deepest areas of computer science, a massive and fascinating research area that continues to be improved. We have teams who are specifically focused on search and retrieval.
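To make the retrieve-and-rerank idea concrete, here is a minimal sketch using Cohere's Python SDK. The API key, document list, and model name are illustrative placeholders, not anything discussed in the talk; check the current Cohere docs for available rerank models.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Candidate passages, e.g. the output of a first-stage vector
# or keyword search over internal documents.
documents = [
    "Our Q3 revenue grew 12% year over year.",
    "The onboarding guide covers SSO configuration.",
    "Re-rankers score query-document pairs directly.",
]

# The re-ranker scores each (query, document) pair and returns
# the passages ordered by relevance to the query.
results = co.rerank(
    model="rerank-english-v3.0",  # assumed model name
    query="How do re-rankers improve retrieval quality?",
    documents=documents,
    top_n=2,
)

for r in results.results:
    print(r.index, round(r.relevance_score, 3))
```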

Companies want the ability to chat with their data, or to make sense of their internal data, even through private deployments. They want the model to come to their data, which is another focus area.

I also speak with a lot of developers who ask: can I do this with a language model? Yes, you can send that problem to a trillion-parameter model, but a lot of the time it will be solved better by a 300-million-parameter model that's specifically geared for the use case. It does it at much better latency, and you don't have to shard the deployment across 100 GPUs.
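As a sketch of the small-specialized-model point: a compact encoder fine-tuned for one task runs with low latency on a single machine. The model below is an illustrative stand-in from the Hugging Face hub, not one mentioned in the conversation.

```python
from transformers import pipeline

# A distilled encoder (~66M parameters) fine-tuned for sentiment
# classification. No sharding across GPUs; it runs fine on CPU.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The new release fixed our latency problems."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```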

So the major focus here is just being practical: picking the best tool for the problem, and efficiency. Yes, you can build massive models that are general problem solvers, but to deploy things into production you need to think about cost, latency, how many GPUs you need, and memory footprint.

AF: And robustness. You could over-engineer a huge system that isn't fine-tuned to do anything specifically, and then you would complain: why is it hallucinating all these things?

What you presented was focused mostly on what happens to text, but the majority of the available data is structured or in other modalities. How does what you spoke about apply to those? Language and text are one portion of the data. That portion is largely untapped, but we also have a lot of systems already set up to handle those other types of data. So how do all these different worlds talk to each other?

JA: I'm excited about multi-modal embedding models. That's an area that can fit other modalities into vector search as it exists today. One modality that is relevant to all of this, and builds on text, is code. Once you improve the code generation capability, you improve things like tool use, reasoning, and the model's ability to become an operator of these tools. I would rank that as the second most important modality.
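A minimal sketch of how multi-modal embeddings slot into existing vector search: a CLIP-style model from sentence-transformers maps images and text into the same vector space, so ordinary cosine-similarity search applies unchanged. The model name and image path are illustrative assumptions.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A CLIP-style model that embeds both images and text into one space.
model = SentenceTransformer("clip-ViT-B-32")

# Embed a text query and an image the same way you would embed documents.
text_emb = model.encode("a cat sleeping on a laptop")
img_emb = model.encode(Image.open("photo.jpg"))  # hypothetical local file

# Ordinary cosine similarity: the vector-search machinery is unchanged.
print(util.cos_sim(text_emb, img_emb))
```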

Beyond that, the relevant modalities depend on the use case. If you're working in media, it's going to be video and audio. If you're working in music, audio and waveforms are probably the modalities relevant to you.


Tags:
deep learning
machine learning