So why would it be a problem if everything is tracked, right? Well, it worked pretty well for our traditional software systems, until we started using it for AI.
AI development produces trained models as large binary files, multi-gigabyte image datasets, checkpoint files from model training, and NumPy objects (.npy/.npz). Git was never designed to represent changes to these files efficiently. Even adding one layer or retraining the model produces an entirely new binary, so every version of the model or NumPy object gets tracked as a near-complete copy, and the repository size balloons with each commit!
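Here is a minimal sketch of why (file names are hypothetical): even a one-parameter change writes out a brand-new binary, and Git cannot delta-compress opaque binaries the way it does source code.

```python
import numpy as np

# ~8 MB of float64 "model weights"
weights = np.random.rand(1000, 1000)
np.save("model_v1.npy", weights)

# "Retrain": a tiny one-parameter tweak...
weights[0, 0] += 0.01
# ...still written out as a full ~8 MB binary file
np.save("model_v2.npy", weights)

# Git stores text changes as compact deltas, but these opaque
# binaries don't delta well, so committing both versions keeps
# roughly two full copies in the repository's history.
```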
This bloats the .git directory, and everything slows down and gets laggy. You can end up waiting hours to make a pull request or fetch new changes. Merging and rebasing binary files is another huge problem that usually needs manual intervention. And a teammate can accidentally commit huge binary files to a shared branch; once that gets merged into your dev branch, the whole repository drags that weight along for everyone!
There are solutions to this, though. In a production system you can use Git LFS for storing large files, push artifacts to cloud storage (S3, Azure Blob, GCS), or reach for open-source ML-versioning tools that work alongside Git, like DVC, Comet, or MLflow. Or, as the simplest fix, put all these files in .gitignore (see the sketches below).
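The simplest fix is a .gitignore entry so the bulky artifacts never enter Git history at all (the patterns here are placeholder assumptions, not from the original):

```
# .gitignore — keep bulky ML artifacts out of Git entirely
*.npy
*.npz
checkpoints/
data/raw/
```

For the tooling routes, a rough sketch of typical usage (the tracked patterns, remote name, and S3 bucket URL are placeholders): Git LFS keeps lightweight pointers in Git while storing the binaries separately, and DVC versions data alongside Git with the actual files pushed to a remote of your choice.

```
# Git LFS: store binaries outside normal Git objects, pointers in the repo
git lfs install
git lfs track "*.npy"
git lfs track "*.ckpt"
git add .gitattributes

# DVC: version large data alongside Git (bucket URL is hypothetical)
dvc init
dvc add data/images
git add data/images.dvc .gitignore
dvc remote add -d storage s3://my-bucket/dvcstore
dvc push
```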