Batches with texts of different lengths

When I was experimenting with nanoGPT by Andrej Karpathy, I noticed that, as in many other machine learning training setups, it is common to concatenate the samples of the dataset, separating them with an end-of-text token. This makes sense when the model has to learn the structure of human language, but not during fine-tuning: we could end up concatenating texts with completely different and uncorrelated contexts, inducing the model to predict the next token based on, possibly, two different topics.
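
For context, here is a minimal sketch of that standard packing step. The sample token IDs and the end-of-text token value are assumptions for illustration, not taken from nanoGPT's actual data files:

```python
import numpy as np

# Hypothetical tokenized samples (lists of token ids) and an assumed
# end-of-text token id; the values are illustrative only.
samples = [[12, 7, 99], [4, 4, 81, 3], [56]]
EOT_ID = 0

# Standard pre-training packing: concatenate every sample into one
# long stream, separating samples with the end-of-text token.
stream = []
for ids in samples:
    stream.extend(ids)
    stream.append(EOT_ID)
data = np.array(stream, dtype=np.uint16)

# Fixed-size training windows can now cross sample boundaries,
# mixing unrelated contexts -- the issue described above.
block_size = 4
print(data[0:block_size])  # this window may span two unrelated texts
```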

To solve this, I built a simple batching pipeline, which I call the batchization process, that groups texts by their number of tokens, plus a few tricks. Let's check it out!

High-level guideline:

For clarity, we will divide the process into two steps (a rough sketch of both follows the list):

  • Grouping the texts by length
  • Creating a pseudo dataloader
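
The tricks themselves are in the full article; below is only a minimal sketch of the two steps under simple assumptions. The `batchize` name and the `pad_id` parameter are hypothetical, and sorting by length stands in for whatever grouping the real pipeline uses:

```python
import random

def batchize(tokenized_texts, batch_size, pad_id=0, shuffle=True):
    """Yield batches of similar-length texts, padded per batch.

    `tokenized_texts`: a list of token-id lists.
    `pad_id`: an assumed padding token id (illustrative).
    """
    # Step 1: grouping the texts by length -- sorting puts texts
    # with similar token counts next to each other.
    by_length = sorted(tokenized_texts, key=len)

    # Step 2: a pseudo dataloader -- slice the sorted list into
    # batches and pad each batch only up to its own longest text,
    # so almost no compute is wasted on padding.
    batches = []
    for i in range(0, len(by_length), batch_size):
        chunk = by_length[i:i + batch_size]
        longest = len(chunk[-1])  # chunk is sorted, last is longest
        batches.append(
            [ids + [pad_id] * (longest - len(ids)) for ids in chunk]
        )

    if shuffle:
        random.shuffle(batches)  # avoid feeding lengths in increasing order
    yield from batches

# Usage: each yielded batch is rectangular, built from similar-length
# texts, so no batch ever mixes a long text with a much shorter one.
texts = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
for batch in batchize(texts, batch_size=2):
    print(batch)
```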

Read the full article on Medium

I hope you have found this guide helpful; everything is ready to train the LLM beast model you desire. Please hit me up on Twitter for any corrections or feedback.

Written on June 18, 2023