HelpingAI-Vision test/demo video
Video Link: https://www.youtube.com/watch?v=H4qsKL-AbRU
The fundamental idea behind HelpingAI-Vision is to generate one token embedding per crop of an image, rather than N visual token embeddings for the entire image. This approach, built on HelpingAI-Lite and incorporating a LLaVA-style adapter, aims to improve scene understanding by capturing more fine-grained detail from each region of the image.
For every crop of the image, an embedding is generated using the full SigLIP encoder (shape [1, 1152]). All N crop embeddings are then passed through the LLaVA adapter, yielding token embeddings of shape [N, 2560]. Currently, these tokens carry no explicit information about their position in the original image; positional information is planned for a later update.
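The crop-to-token pipeline above can be sketched as follows. This is a minimal illustration with NumPy, not the actual model code: the two-layer MLP projector, the GELU activation, and the random weights are assumptions (LLaVA-style adapters commonly use a two-layer MLP), while the dimensions [N, 1152] and [N, 2560] come from the text.

```python
import numpy as np

SIGLIP_DIM = 1152  # per-crop SigLIP embedding size, from the text
LLM_DIM = 2560     # adapter output (LLM token) size, from the text

rng = np.random.default_rng(0)
# Hypothetical adapter weights: a two-layer MLP projector,
# a common choice for LLaVA-style adapters.
W1 = rng.standard_normal((SIGLIP_DIM, LLM_DIM)) * 0.01
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.01

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def adapter(crop_embeddings):
    # crop_embeddings: [N, 1152] -> token embeddings: [N, 2560]
    return gelu(crop_embeddings @ W1) @ W2

# One SigLIP embedding per crop, stacked into a single [N, 1152] matrix.
n_crops = 4
crop_embeddings = rng.standard_normal((n_crops, SIGLIP_DIM))
tokens = adapter(crop_embeddings)
print(tokens.shape)  # (4, 2560)
```

Note that nothing here encodes where each crop came from, which matches the text's point that positional information is still missing from these tokens.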