OpenAI Embeddings (and Controversy?!)

Video Link: https://www.youtube.com/watch?v=5skIqoO3ku0



Duration: 15:58


#mlnews #openai #embeddings

COMMENTS DIRECTLY FROM THE AUTHOR (thanks a lot for reaching out Arvind :) ):
1. The FIQA results you share also have code to reproduce the results in the paper using the API: https://twitter.com/arvind_io/status/1488257004783112192?s=20&t=gB3c79VEX8hGJl6WfZa2iA There's no discrepancy AFAIK.
2. We leave out 6 not 7 BEIR datasets. Results on msmarco, nq and triviaqa are in a separate table (Table 5 in the paper). NQ is part of BEIR too and we didn't want to repeat it. Finally, the 6 datasets we leave out are not readily available and it is common to leave them out in prior work too. For examples, see SPLADE v2 (https://arxiv.org/pdf/2109.10086.pdf) also evaluates on the same 12 BEIR datasets.
3. Finally, I'm now working on time travel so that I can cite papers from the future :)
END COMMENTS FROM THE AUTHOR

OpenAI launches an embeddings endpoint in their API, providing high-dimensional vector embeddings for use in text similarity, text search, and code search. While embeddings are universally recognized as a standard tool for processing natural language, people have raised doubts about the quality of OpenAI's embeddings: one blog post found that they are often outperformed by much smaller open-source models, which can embed text at a fraction of what OpenAI charges. In this video, we examine the claims made and determine what it all means.
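
For readers new to the topic, here is a minimal sketch of how embeddings are typically used for text search: each text becomes a vector, and documents are ranked by cosine similarity to the query vector. The 3-dimensional vectors and the document names below are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the helper names are hypothetical. No OpenAI or other model API is called here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for this example. In practice they would come
# from an embedding model (hosted API or open-source).
docs = {
    "doc_finance": [0.9, 0.1, 0.0],
    "doc_cooking": [0.0, 0.2, 0.9],
    "doc_markets": [0.7, 0.3, 0.1],
}
query = [0.8, 0.2, 0.0]

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Whichever model produces the vectors, the ranking step is the same, which is why benchmarks such as BEIR can compare models of very different size and cost on equal footing.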

OUTLINE:
0:00 - Intro
0:30 - Sponsor: Weights & Biases
2:20 - What embeddings are available?
3:55 - OpenAI shows promising results
5:25 - How good are the results really?
6:55 - Criticism: Open models might be cheaper and smaller
10:05 - Discrepancies in the results
11:00 - The author's response
11:50 - Putting things into perspective
13:35 - What about real world data?
14:40 - OpenAI's pricing strategy: Why so expensive?

Sponsor: Weights & Biases
https://wandb.me/yannic

Merch: store.ykilcher.com

ERRATA: At 13:20 I say "better", it should be "worse"

References:
https://openai.com/blog/introducing-text-and-code-embeddings/
https://arxiv.org/pdf/2201.10005.pdf
https://beta.openai.com/docs/guides/embeddings/what-are-embeddings
https://beta.openai.com/docs/api-reference/fine-tunes
https://twitter.com/Nils_Reimers/status/1487014195568775173?s=20&t=NBF7D2DYi41346cGM-PQjQ
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
https://mobile.twitter.com/arvind_io/status/1487188996774002688
https://twitter.com/gwern/status/1487096484545847299
https://twitter.com/gwern/status/1487156204979855366
https://twitter.com/Nils_Reimers/status/1487216073409716224
https://twitter.com/gwern/status/1470203876209012736
https://www.reddit.com/r/MachineLearning/comments/sew5rl/d_it_seems_openais_new_embedding_models_perform/
https://mobile.twitter.com/arvind_io/status/1488257004783112192
https://mobile.twitter.com/arvind_io/status/1488569644726177796

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n







Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
natural language processing
mlnews
openai
openai embeddings
nils reimers
beir dataset
beir benchmark
text similarity
neural embeddings
gpt-3 embeddings
gpt 3
openai api
openai gpt embeddings
splade
sentencebert
neural retrieval
neural search engine
vector search engine
inner product search
semantic search engine
gpt-3 search
faiq dataset
how good is openai