_______               __                   _______
|   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
|       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
|___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                     on Gopher (unofficial)

COMMENT PAGE FOR:
  All-in-one embedding model for interleaved text, images, and screenshots

 jonathan-adly wrote 15 min ago:
 If you are interested in this space, we'd throw our project into the
 mix; it uses ColPali under the hood, transparently. [1] The main
 benchmark for this is the ViDoRe leaderboard, where we would love to
 see how VoyageAI performs compared to the more open-source
 implementations.
 
 [1]: https://github.com/tjmlabs/ColiVara
 Zopieux wrote 40 min ago:
 API-only model. No thanks, but congrats anyway.

 djoldman wrote 51 min ago:
 This is a key observation that is simple and intuitive:
 
 >All CLIP-like models perform poorly on mixed-modality search due to a
 phenomenon known as the modality gap. As illustrated in the figure
 below, the closest vector to the snippet “I address you, members of
 the Seventy-Seventh Congress…” is not its screenshot, but other
 texts. This leads to search results that are skewed towards items of
 the same modality; in other words, text vectors will be closer to
 irrelevant texts than relevant images in the embedding space.
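 
 A quick way to see that gap yourself with an off-the-shelf CLIP-style
 checkpoint (a sketch using sentence-transformers; the checkpoint name
 and the screenshot path are placeholders):
 
   from PIL import Image
   from sentence_transformers import SentenceTransformer, util
 
   # any CLIP-style checkpoint works; this one is just an example
   model = SentenceTransformer("clip-ViT-B-32")
 
   speech = "I address you, members of the Seventy-Seventh Congress..."
   other_text = "Minutes of the city council budget meeting."
   screenshot = Image.open("speech_screenshot.png")  # placeholder path
 
   text_embs = model.encode([speech, other_text], convert_to_tensor=True)
   img_emb = model.encode(screenshot, convert_to_tensor=True)
 
   print("speech vs other text:", util.cos_sim(text_embs[0], text_embs[1]).item())
   print("speech vs screenshot:", util.cos_sim(text_embs[0], img_emb).item())
   # With CLIP-like models the text-text score typically wins even when
   # the screenshot is the relevant item -- that is the modality gap.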

 djoldman wrote 56 min ago:
 This is a cool way to look at multimodal embeddings. They look at
 performance as the percentage of inputs slides from one modality to
 another:
 
 [1]: https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/11...
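 
 Roughly, that curve can be reproduced for any embedding model: keep the
 queries fixed and re-embed the corpus with a growing share of documents
 swapped from their text rendition to their screenshot rendition. A
 sketch, where embed_text and embed_image are placeholders for whatever
 model is under test and query i is assumed to target document i:
 
   import numpy as np
 
   def recall_at_1(query_vecs, doc_vecs, relevant_idx):
       # cosine similarity via normalized dot products
       q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
       d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
       return float(((q @ d.T).argmax(axis=1) == relevant_idx).mean())
 
   def sweep_modality_mix(queries, docs_text, docs_image,
                          embed_text, embed_image, steps=11):
       # docs_text[i] and docs_image[i] are two renditions of the same doc
       n = len(docs_text)
       q_vecs = np.array([embed_text(q) for q in queries])
       relevant = np.arange(n)
       for frac in np.linspace(0.0, 1.0, steps):
           cut = int(frac * n)  # first `cut` docs stored as screenshots
           d_vecs = np.array([embed_image(docs_image[i]) if i < cut
                              else embed_text(docs_text[i]) for i in range(n)])
           print(f"{frac:.0%} screenshots -> recall@1 = "
                 f"{recall_at_1(q_vecs, d_vecs, relevant):.3f}")
 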
 mech4lunch wrote 1 hour 10 min ago:
 The Colab measures dot product values 0.428 and 0.498, describing them
 as "...similarity value is quite high." Is that high? Can you design a
 system that confidently labels data with a 0.4 threshold?
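 
 For what it's worth, raw dot-product or cosine values are only
 meaningful relative to a given model's score distribution, so a cutoff
 like 0.4 is usually calibrated on labeled pairs rather than chosen a
 priori. A sketch with made-up placeholder scores and labels:
 
   import numpy as np
 
   # cosine scores for candidate pairs plus 1/0 relevance labels
   # (placeholder values -- in practice from a labeled validation set)
   scores = np.array([0.43, 0.50, 0.31, 0.47, 0.28, 0.52, 0.39, 0.45])
   labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
 
   best_f1, best_t = -1.0, None
   for t in np.linspace(scores.min(), scores.max(), 50):
       pred = scores >= t
       tp = np.sum(pred & (labels == 1))
       precision = tp / max(pred.sum(), 1)
       recall = tp / max((labels == 1).sum(), 1)
       f1 = 2 * precision * recall / max(precision + recall, 1e-9)
       if f1 > best_f1:
           best_f1, best_t = f1, t
 
   print(f"best F1 {best_f1:.2f} at threshold {best_t:.2f}")
 
 Whether 0.4 is "high" then depends entirely on where the positive and
 negative score distributions sit for that particular model.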

 greatgib wrote 1 hour 24 min ago:
 Indeed, it's sad that their models are both commercial/proprietary and
 API-only.

 FergusArgyll wrote 2 hours 5 min ago:
 I'm missing something. Shouldn't any LLM that's 'natively multimodal'
 somehow include embeddings which are multimodal? For example, here's
 Google's blog post on Gemini:
 
   Until now, the standard approach to creating multimodal models
   involved training separate components for different modalities and
   then stitching them together to roughly mimic some of this
   functionality. These models can sometimes be good at performing
   certain tasks, like describing images, but struggle with more
   conceptual and complex reasoning.

   We designed Gemini to be natively multimodal, pre-trained from the
   start on different modalities. Then we fine-tuned it with additional
   multimodal data to further refine its effectiveness. This helps
   Gemini seamlessly understand and reason about all kinds of inputs
   from the ground up, far better than existing multimodal models — and
   its capabilities are state of the art in nearly every domain.

   aabhay wrote 1 hour 46 min ago:
    LLM embeddings contain superpositions of many concepts, so while
    they might predict the next token, they don't actually outperform
    contrastively pretrained embedding models.
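    
    For context, "contrastively pretrained" means the text and image
    towers are trained so that each matching pair scores higher than
    every other pairing in the batch, which is exactly the geometry
    retrieval relies on. A minimal sketch of that symmetric
    InfoNCE-style objective, assuming PyTorch and already-projected
    embeddings:
    
      import torch
      import torch.nn.functional as F
    
      def contrastive_loss(text_emb, image_emb, temperature=0.07):
          # (batch, dim) embeddings for matching (text, image) pairs
          text_emb = F.normalize(text_emb, dim=-1)
          image_emb = F.normalize(image_emb, dim=-1)
          logits = text_emb @ image_emb.t() / temperature
          targets = torch.arange(text_emb.size(0))  # pair i matches pair i
          return (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets)) / 2
    
      # usage: loss = contrastive_loss(text_tower(texts), image_tower(images))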

 unit149 wrote 3 hours 8 min ago:
 In the traditional Python API, the Voyage engine will tokenize blocks
 of text and output a sequence of tokens. This model seems to be doing
 the same for images by vectorizing them in a shared embedding space.
 
 Words like 'you' and 'apple' will be a single token. More complex
 terms like 'pikachu' may be divided into pik-a-chu.
 
 [1] 
 
 [1]: https://docs.voyageai.com/docs/tokenization
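 
 That splitting behavior is ordinary subword tokenization. As an
 illustration with a generic BPE vocabulary (tiktoken's cl100k_base,
 not Voyage's actual tokenizer, which is documented at the link above):
 
   import tiktoken
 
   enc = tiktoken.get_encoding("cl100k_base")  # generic BPE vocabulary
 
   for word in ["you", "apple", "pikachu"]:
       pieces = [enc.decode([t]) for t in enc.encode(word)]
       print(word, "->", pieces)
 
   # Common words usually map to a single token; rarer words get split
   # into several subword pieces. Exact splits depend on the vocabulary.
 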
 carschno wrote 4 hours 40 min ago:
 This does read as very impressive.
 Any critical perspectives on the presented evaluation?
 What about non-English text?
 
 I understand the model is, like other commercial ones, available
 exclusively through their API, right?

   stephantul wrote 4 hours 6 min ago:
   Yes, voyage models are API only.
   
   There was a part here about multilingualism but that was wrong!
   Sorry!
   
    FWIW: Voyage also has separate `law`, `code`, and `finance` models;
    see [1]. Really cool results, anyway.
   
   [1] 
   
   [1]: https://docs.voyageai.com/docs/embeddings
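    
    As a sketch of what using one of those domain models looks like
    through the Python client (method and parameter names per the
    linked docs; treat them as an approximation and double-check there):
    
      import voyageai
    
      vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable
      result = vo.embed(
          ["The lessee shall maintain the premises in good repair."],
          model="voyage-law-2",      # or voyage-code-2, voyage-finance-2
          input_type="document",
      )
      print(len(result.embeddings[0]))  # one embedding vector per input text
    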
     fzliu wrote 3 hours 42 min ago:
     Glad you liked the results! We do have multilingual models (and
     rerankers) -- voyage-3, in particular, is multilingual: [1]
     voyage-multimodal-3 is multilingual as well, supporting the same
     set of languages as voyage-3.
     
     [1]: https://blog.voyageai.com/2024/09/18/voyage-3/
       stephantul wrote 3 hours 39 min ago:
       Sorry for spreading false information. I edited the post above.
       
        It is interesting that you're not as up front about
        multilingualism as Cohere is. They seem to mention it a lot,
        which led to my confusion.

         fzliu wrote 3 hours 35 min ago:
         No worries at all. That's great feedback and an area of
         improvement for us when it comes to future posts -- we'll be
         more explicit about multilingualism in blogs and in our docs.
