OpenAI Embeddings (and Controversy?!)
#mlnews #openai #embeddings
COMMENTS DIRECTLY FROM THE AUTHOR (thanks a lot for reaching out, Arvind :) ):
1. The FIQA results you share also come with code to reproduce the results in the paper using the API: https://twitter.com/arvind_io/status/1488257004783112192?s=20&t=gB3c79VEX8hGJl6WfZa2iA There's no discrepancy AFAIK.
2. We leave out 6, not 7, BEIR datasets. Results on msmarco, nq and triviaqa are in a separate table (Table 5 in the paper). NQ is part of BEIR too, and we didn't want to repeat it. Finally, the 6 datasets we leave out are not readily available, and it is common to leave them out in prior work too. For example, SPLADE v2 (https://arxiv.org/pdf/2109.10086.pdf) also evaluates on the same 12 BEIR datasets.
3. Finally, I'm now working on time travel so that I can cite papers from the future :)
END COMMENTS FROM THE AUTHOR
OpenAI launches an embeddings endpoint in their API, providing high-dimensional vector embeddings for use in text similarity, text search, and code search. While embeddings are a standard tool in natural language processing, people have raised doubts about the quality of OpenAI's embeddings: one blog post found that they are often outperformed by open-source models that are much smaller and cost only a fraction of what OpenAI charges to run. In this video, we examine the claims made and determine what it all means.
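If you want to poke at this yourself, here is a minimal sketch of the kind of comparison the criticism is about: embedding the same sentences with OpenAI's endpoint (the `openai.Embedding.create` call and the `text-similarity-davinci-001` engine name are from OpenAI's launch docs linked below) and with a small open-source model. The `all-MiniLM-L6-v2` model choice and the example sentences are my own illustrative assumptions, not from the video.

```python
# Minimal sketch (2022-era openai client): embed two sentences with
# OpenAI's embeddings endpoint and with an open-source
# sentence-transformers model, then compare by cosine similarity.
# pip install openai sentence-transformers numpy
import numpy as np
import openai
from sentence_transformers import SentenceTransformer

openai.api_key = "sk-..."  # your API key here

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

texts = ["The cat sat on the mat.", "A feline rested on the rug."]

# OpenAI embeddings endpoint (engine name from the launch blog post).
resp = openai.Embedding.create(input=texts, engine="text-similarity-davinci-001")
oa_vecs = [d["embedding"] for d in resp["data"]]
print("OpenAI similarity:", cosine(oa_vecs[0], oa_vecs[1]))

# Open-source baseline; 'all-MiniLM-L6-v2' is an illustrative choice,
# far smaller than davinci and free to run locally.
model = SentenceTransformer("all-MiniLM-L6-v2")
os_vecs = model.encode(texts)
print("Open-source similarity:", cosine(os_vecs[0], os_vecs[1]))
```

The point of the open-source comparison in the blog post is exactly this trade-off: the local model is orders of magnitude smaller and costs nothing per query, so the question is whether the API's quality gap (if any) justifies the price.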
OUTLINE:
0:00 - Intro
0:30 - Sponsor: Weights & Biases
2:20 - What embeddings are available?
3:55 - OpenAI shows promising results
5:25 - How good are the results really?
6:55 - Criticism: Open models might be cheaper and smaller
10:05 - Discrepancies in the results
11:00 - The author's response
11:50 - Putting things into perspective
13:35 - What about real world data?
14:40 - OpenAI's pricing strategy: Why so expensive?
Sponsor: Weights & Biases
https://wandb.me/yannic
Merch: store.ykilcher.com
ERRATA: At 13:20 I say "better", it should be "worse"
References:
https://openai.com/blog/introducing-text-and-code-embeddings/
https://arxiv.org/pdf/2201.10005.pdf
https://beta.openai.com/docs/guides/embeddings/what-are-embeddings
https://beta.openai.com/docs/api-reference/fine-tunes
https://twitter.com/Nils_Reimers/status/1487014195568775173?s=20&t=NBF7D2DYi41346cGM-PQjQ
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
https://mobile.twitter.com/arvind_io/status/1487188996774002688
https://twitter.com/gwern/status/1487096484545847299
https://twitter.com/gwern/status/1487156204979855366
https://twitter.com/Nils_Reimers/status/1487216073409716224
https://twitter.com/gwern/status/1470203876209012736
https://www.reddit.com/r/MachineLearning/comments/sew5rl/d_it_seems_openais_new_embedding_models_perform/
https://mobile.twitter.com/arvind_io/status/1488257004783112192
https://mobile.twitter.com/arvind_io/status/1488569644726177796
Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n