Nanochat vs Llama for training from scratch? [P]
Hey all - I'm working on a project training a model entirely on historical data, which I've posted about before on this subreddit. My last training run used Nanochat, and while it was very successful for pretraining and SFT of the initial model, I'm finding that nanochat is great for getting a model up and running but not so great for interoperability. There has been some work toward making nanochat transformers-compatible, but the latest version of nanochat (which I trained with) doesn't produce a transformers-compatible model.
So I'm considering doing my next training run with the Llama architecture and the transformers Trainer class. I have assembled a much larger dataset for pretraining, and I want this to be an open-source project that people can load directly with transformers. That said, I know nanochat has its advantages (such as the auto-scaling --depth parameter). Is Llama the best architecture for this scenario? Is there a better option I could use here? Or do I just go with nanochat again and hope I can build out a nanochat-to-HF export script on the other side?
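For reference, the transformers-native path I'm weighing would look roughly like this: instantiate a from-scratch Llama-style model via LlamaConfig and hand it to the Trainer. The sizes below are illustrative placeholders I made up for the sketch, not recommendations:

```python
# Sketch: a randomly-initialized Llama-architecture model via transformers.
# All dimensions here are placeholder values, not tuned recommendations.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,            # should match your tokenizer
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)  # random weights, ready for pretraining

# Report parameter count; a model built this way round-trips through
# save_pretrained / from_pretrained, so it stays transformers-compatible.
print(sum(p.numel() for p in model.parameters()))
```

The upside of this route is that anything saved with save_pretrained loads with from_pretrained (and can be pushed to the Hub), so downstream users never need a conversion script.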