How Do You Choose Between Training, Fine-Tuning, and Using Small Models?

Video Link: https://www.youtube.com/watch?v=B_DVpDUq1jo



Duration: 2:53


AF: You talked about a few different approaches: domain adaptation, combining a few different models, and DeepSpeed for training your own model. In what scenarios would I use each of them?

AD: First, there's the question of combining a small language model and a large language model without doing any fine-tuning. Since it's cheap and easy to do, as long as you're experienced with the interfacing, where the output of the small model becomes the input of the LLM, it makes sense to do it as a baseline. It's cheaper and faster, but there is the complexity of the interfacing.
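
To make that interfacing concrete, here is a minimal sketch of the baseline pattern using Hugging Face pipelines: a small classifier labels the input, and its label is stitched into the prompt of a larger instruction-tuned model. The model names, prompt template, and `answer` helper are illustrative assumptions, not something prescribed in the conversation.

```python
# Sketch: the small model's output becomes part of the LLM's input.
# Model names and the prompt template are illustrative placeholders.
from transformers import pipeline

# Small model: cheap to run (or re-train) many times.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Larger instruction-tuned model, used as-is with no fine-tuning
# (a small open model here stands in for a bigger LLM).
generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def answer(ticket_text: str) -> str:
    # The classifier's label is injected into the LLM prompt.
    label = classifier(ticket_text)[0]["label"]
    prompt = (
        f"The following customer message was classified as {label}.\n"
        f"Message: {ticket_text}\n"
        "Draft a short, appropriate reply:"
    )
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```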

In any case where using an API doesn't make sense, you have to at least fine-tune and then serve your own model. You don't want inference to become less performant in terms of the round trip.

DeepSpeed for distributed training is great. If you can't get a cluster of eight H100 GPUs, then DeepSpeed with 32 NVIDIA T4 or V100 GPUs, which are cheaper and more available, would enable you to do fine-tuning as well as serving. You just add more GPUs and you gain that distributed training capability. It reduces the training time as well, which matters if your dataset is too big.
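
As a rough sketch of what that looks like with the Hugging Face Trainer, a DeepSpeed ZeRO config can be passed straight to TrainingArguments and the job launched with the deepspeed launcher across the cheaper GPUs. The model name, toy dataset, and ZeRO settings below are placeholder assumptions, not a tuned configuration.

```python
# Sketch: fine-tuning through the Hugging Face Trainer with a DeepSpeed ZeRO-3
# config, so optimizer state, gradients, and parameters are sharded across many
# cheaper GPUs (e.g. T4/V100) instead of one 8xH100 node. Placeholders throughout.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "facebook/opt-1.3b"  # illustrative; swap in your own base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in corpus; replace with your real training data.
dataset = Dataset.from_dict({"text": ["example training text"]}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

ds_config = {
    "zero_optimization": {"stage": 3},         # shard states across all GPUs
    "fp16": {"enabled": "auto"},               # T4/V100 support fp16, not bf16
    "train_micro_batch_size_per_gpu": "auto",  # filled in from TrainingArguments
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    deepspeed=ds_config,                       # hand the ZeRO config to Trainer
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# Launch across GPUs with the DeepSpeed launcher, e.g.:
#   deepspeed --num_gpus=32 train_sketch.py
```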

If you just want to experiment, don't bother with DeepSpeed. Try PEFT and QLoRA, quantization plus low-rank adaptation. That helps you fit large models onto one GPU. You gain memory efficiency, speedup, and scale. Hugging Face has all the code. Worst case, you have to write your own adapter, which is not too hard.
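
A minimal QLoRA-style sketch with the Hugging Face transformers, bitsandbytes, and peft integrations: the frozen base model is loaded in 4-bit and only small low-rank adapters are trained on top, which is what lets it fit on a single GPU. The model name, target modules, and LoRA hyperparameters are illustrative assumptions.

```python
# Sketch: QLoRA-style fine-tuning on a single GPU -- 4-bit quantized base model
# plus low-rank adapters trained on top. Names and hyperparameters are
# illustrative placeholders, not recommendations.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "facebook/opt-1.3b"  # illustrative base model

# Quantization: load the frozen base weights in 4-bit to fit on one GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Low-rank adaptation: only these small adapter matrices get trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
# From here, train with the usual Trainer loop as in the DeepSpeed sketch above.
```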

AF: If I want to summarize, it is less about where and more about at what stage you would use them. Early in the project, you should most probably focus on interfacing your LLM with other types of models, because training a BERT is way cheaper and easier, and you can repeat it a hundred times until you figure out the right way to do it. That's a very important freedom to have. There are complexities around interfacing, but it's completely worth it compared to alternatives like DeepSpeed.

Once you hit the ceiling of augmenting the large language model with other models, you will probably look into fine-tuning the large language model itself, say, if it needs to spit out a bunch of thought processes before doing something else. In that case, PEFT and LoRA are probably the right place to go.

Once you hit the ceiling of that, which means you have a very interesting, niche, and rare use case, that's where you would probably consider training the model from scratch.







Tags:
deep learning
machine learning