Building an LLM fine-tuning Dataset

Video link: https://www.youtube.com/watch?v=pCX_3p40Efc
Duration: 1:01:55

A walkthrough of building a QLoRA fine-tuning dataset for a language model.
NVIDIA GTC signup: https://nvda.ws/3XTqlB6

Fine-tuning code: https://github.com/Sentdex/LLM-Finetuning
5000-step Walls1337bot adapter: https://huggingface.co/Sentdex/Walls1337bot-Llama2-7B-003.005.5000
WSB Dataset: https://huggingface.co/datasets/Sentdex/WSB-003.005
"I have every reddit comment" original reddit post and torrent info: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
2007-2015 Reddit Archive.org: https://archive.org/download/2015_reddit_comments_corpus/reddit_data/
Reddit BigQuery 2007-2019 (this has other data besides reddit comments too!): https://reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/
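BigQuery exports of this size land as many gzipped shards, which is why a whole chapter of the video is spent decompressing them. A minimal sketch of that step, using only the standard library (the `exports/` directory layout and shard names are assumptions, not the video's exact paths):

```python
import gzip
import shutil
from pathlib import Path

def decompress_shards(src_dir: str, dst_dir: str) -> list[Path]:
    """Decompress every .gz shard in src_dir into dst_dir; return output paths."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(src.glob("*.gz")):
        out_path = dst / gz_path.stem  # drop the .gz suffix
        with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
            # Stream the bytes through so a large shard never sits fully in RAM.
            shutil.copyfileobj(f_in, f_out)
        written.append(out_path)
    return written
```

Streaming with `shutil.copyfileobj` keeps memory flat regardless of shard size, which matters when the export is hundreds of files.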

Contents:

0:00 - Introduction to dataset building for fine-tuning
2:53 - The Reddit dataset options (torrent, Archive.org, BigQuery)
6:07 - Exporting the Reddit data from BigQuery (and some other data)
14:44 - Decompressing all of the gzip archives
25:13 - Recombining the archives for target subreddits
28:29 - How to structure the data
40:40 - Building training samples and saving them to a database
48:49 - Creating customized training JSON files
54:11 - QLoRA training and results
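The middle chapters (structuring the data through exporting the training JSON) boil down to pairing each Reddit comment with its parent and writing the pairs out as input/output samples. A rough sketch under assumed field names (`id`, `parent_id`, `body`, `score`, with Reddit's `t1_` prefix on comment parent ids); the video's exact schema and filtering will differ:

```python
import json
import sqlite3

def build_pairs(comments, db_path=":memory:"):
    """Match each comment to its parent comment and store (parent, reply) pairs
    in SQLite. `comments` is an iterable of dicts with 'id', 'parent_id',
    'body', and 'score' keys (assumed field names)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pairs (parent TEXT, reply TEXT, score INT)")
    by_id = {c["id"]: c for c in comments}
    for c in comments:
        # Reddit prefixes comment parents with 't1_'; posts use 't3_' and are skipped.
        parent = by_id.get(c["parent_id"].removeprefix("t1_"))
        if parent:
            con.execute("INSERT INTO pairs VALUES (?, ?, ?)",
                        (parent["body"], c["body"], c["score"]))
    con.commit()
    return con

def export_json(con, path, min_score=2):
    """Dump pairs above a score threshold as JSON lines for fine-tuning."""
    with open(path, "w") as f:
        for parent, reply in con.execute(
                "SELECT parent, reply FROM pairs WHERE score >= ?", (min_score,)):
            f.write(json.dumps({"input": parent, "output": reply}) + "\n")
```

Staging the pairs in a database first makes it cheap to re-export with different filters (score thresholds, subreddits) without re-parsing the raw dumps.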


Neural Networks from Scratch book: https://nnfs.io
Channel membership: https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ/join
Discord: https://discord.gg/sentdex
Reddit: https://www.reddit.com/r/sentdex/
Support the content: https://pythonprogramming.net/support-donate/
Twitter: https://twitter.com/sentdex
Instagram: https://instagram.com/sentdex
Facebook: https://www.facebook.com/pythonprogramming.net/
Twitch: https://www.twitch.tv/sentdex


Tags:
python
programming