Batches with texts of different lengths
When I was experimenting with nanoGPT by Andrej Karpathy, I noticed that, as in many other machine learning training setups, it is common to concatenate the samples of the dataset, separating them with an end-of-text token.
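As a quick illustration of that concatenation approach, here is a minimal sketch: each tokenized sample is appended to one long stream, with a special end-of-text token id marking the boundaries. The `EOT_ID` value and the toy token ids are placeholders, not taken from any real tokenizer.

```python
EOT_ID = 0  # hypothetical end-of-text id; real tokenizers define their own

def concat_samples(tokenized_samples):
    """Join tokenized samples into one stream, inserting EOT_ID between them."""
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        stream.append(EOT_ID)  # boundary marker between samples
    return stream

samples = [[5, 6, 7], [8, 9]]
print(concat_samples(samples))  # [5, 6, 7, 0, 8, 9, 0]
```

The training loop then slices fixed-size windows out of this stream, regardless of where individual texts begin or end.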
To solve this, I built a simple batching pipeline, which I call the batchization process, that groups texts by their number of tokens, plus a few extra tricks. Let's check it out!
High-level guideline:
For clarity, we will divide the process into two steps:
- Grouping the texts by length
- Creating a pseudo dataloader
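The two steps above can be sketched roughly as follows. This is a minimal illustration, not the article's actual implementation: texts are bucketed by token count, so every batch contains texts of a single length and no padding is needed; the function and variable names are my own.

```python
import random
from collections import defaultdict

def group_by_length(tokenized_texts):
    """Step 1: bucket texts by their token count."""
    buckets = defaultdict(list)
    for tokens in tokenized_texts:
        buckets[len(tokens)].append(tokens)
    return buckets

def pseudo_dataloader(tokenized_texts, batch_size, shuffle=True):
    """Step 2: yield batches where every text has the same length."""
    buckets = group_by_length(tokenized_texts)
    batches = []
    for texts in buckets.values():
        for i in range(0, len(texts), batch_size):
            batches.append(texts[i:i + batch_size])
    if shuffle:
        random.shuffle(batches)  # mix lengths across training steps
    yield from batches
```

Shuffling whole batches, rather than individual texts, keeps each batch length-homogeneous while still varying the sequence length the model sees from step to step.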
Read the full article on Medium
Please hit me up on Twitter with any corrections or feedback. I hope you have found this guide helpful. Everything is ready to train the LLM beast model you desire.