It was tough to get this working, but I think I've figured it out enough to share.
Here's a quick guide on how to set up LLaMA-Factory with support for Flash Attention 2 and Unsloth training on Windows. This setup uses an RTX 3060 12GB GPU, Windows 10, and CUDA 12.1.
Unsloth is an optimization library that claims up to a 2x training speedup with no trade-off in accuracy.
There's also a quick and dirty script to convert bulk raw text to a dataset file, and a short overview of the dataset setup. I also cover how to fix the error when loading the trained adapter in the Text Generation WebUI, caused by mismatched PEFT library versions. Rough sketches of the commands, the script, and the config fix are included below the timestamps.
[00:00] Intro & Topics: Installing LLaMA-Factory, Unsloth; Adding Datasets; Making Datasets; Training
[01:05] System Specs... Probably CUDA 12.1 only?
[01:27] System requirements; Microsoft Build Tools, etc.
[01:55] Creating the Conda environment and installing dependencies (commands sketched below)
[02:05] Install Clang
[02:27] Install Flash Attention 2
[02:44] Install LLaMA-Factory requirements
[03:01] Install LLaMA-Factory
[03:15] Reinstall Numpy; Install Triton for Windows
[03:42] Datasets (registration example below)
[05:10] Script to convert .txt to an Alpaca-format .json with a single text column (sketch below)
[05:42] Run training with Unsloth (example command below)
[06:10] Loading the LoRA adapter in the Text Generation WebUI (fixing config file errors; fix sketched below)
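
For reference, here's a rough sketch of the install sequence from the video. Treat it as a sketch, not a copy-paste recipe: the Python version, package pins, and the Triton package name are assumptions, and on Windows you'll usually want prebuilt wheels for flash-attn and Triton rather than compiling them yourself.

```
# Run in an Anaconda/Miniconda prompt; # lines are annotations, not commands.
conda create -n llama-factory python=3.11
conda activate llama-factory

# PyTorch built against CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Flash Attention 2 -- a prebuilt wheel matching your Python/CUDA/torch combo
# is much easier than compiling it yourself with Build Tools + Clang
pip install flash-attn --no-build-isolation

# LLaMA-Factory and its requirements
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt
pip install -e .

# NumPy can get bumped to an incompatible version by the steps above;
# reinstalling a 1.x release is one way to settle it (assumption)
pip install "numpy<2"

# Triton has no official Windows build; install a community wheel
# (package name is an assumption -- use whichever build the video links)
pip install triton-windows

# Unsloth itself (the video may use a git URL or extras instead)
pip install unsloth
```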
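
On the dataset side, LLaMA-Factory only sees datasets registered in data/dataset_info.json. A single-text-column file like the one produced by the script below can be registered roughly like this; the dataset name and file name are placeholders, and the exact schema can differ between versions, so check data/README.md in your checkout:

```json
{
  "my_raw_text": {
    "file_name": "raw_corpus.json",
    "columns": {
      "prompt": "text"
    }
  }
}
```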
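
The conversion script from the video isn't reproduced here; this is a minimal sketch of the same idea: split a raw .txt file into chunks and write them out as JSON records with a single "text" column. The file names and chunk size are assumptions.

```python
import json
from pathlib import Path

SRC = Path("raw_corpus.txt")   # hypothetical input file
DST = Path("raw_corpus.json")  # output dataset file
CHUNK_CHARS = 2000             # rough chunk size; tune to your context length

def chunk_paragraphs(text: str, limit: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most `limit` characters
    (a single paragraph longer than the limit becomes its own chunk)."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > limit:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

text = SRC.read_text(encoding="utf-8")
records = [{"text": chunk} for chunk in chunk_paragraphs(text, CHUNK_CHARS)]
DST.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"Wrote {len(records)} records to {DST}")
```

Packing by paragraph keeps chunks from breaking mid-thought; if your source text has no blank-line breaks, split on sentences or a fixed character count instead.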
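
Training can be driven from LLaMA-Factory's web UI, but if you'd rather use the CLI, the key piece is the use_unsloth flag. This is a hedged example, not the exact command from the video: the entry point varies between versions (older checkouts use src/train_bash.py, newer ones ship a llamafactory-cli command), and the model, template, targets, and output path are placeholders.

```
python src/train_bash.py ^
    --stage sft ^
    --do_train ^
    --model_name_or_path meta-llama/Llama-2-7b-hf ^
    --dataset my_raw_text ^
    --template default ^
    --finetuning_type lora ^
    --lora_target q_proj,v_proj ^
    --use_unsloth ^
    --output_dir saves/my-lora
```

The `^` is cmd.exe line continuation; in PowerShell use a backtick, or put everything on one line.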
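
Finally, the adapter-loading error in the Text Generation WebUI: a LoRA saved by a newer PEFT can write keys into adapter_config.json that the older PEFT bundled with the WebUI doesn't recognize, so loading fails with an "unexpected keyword argument" error. Upgrading PEFT in the WebUI's environment is one fix; the other is deleting the offending keys from the config file. A throwaway sketch of the latter, where the path and key list are examples (remove whichever keys your error message actually names):

```python
import json
from pathlib import Path

# Strip adapter_config.json keys an older PEFT rejects (key list is an example)
cfg_path = Path("loras/my-lora/adapter_config.json")  # hypothetical path
cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
for key in ("use_dora", "use_rslora", "layer_replication", "loftq_config"):
    cfg.pop(key, None)
cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
```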