Service training process
Concept of training in Caila
Caila lets you create services from your own data by training or fine-tuning existing services. The training process involves three inputs: the fittable service, a dataset, and a training configuration.
Training produces two entities:
- State of the trained service: files in S3 storage obtained during the training process. These could be, for example, neural network dumps, SVM weights, or simply JSON files with the original dataset. The state identifier is a path in S3 storage.
- Fitted service, configured to operate with this state. The state, specifically the path to the folder in S3 storage, is passed to the service through an environment variable.

Training can also be launched on an already created derived service. In this case, the service settings remain unchanged, but all instances restart with the new state.
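A fitted service therefore has to locate its state at startup from the environment. The following is a minimal sketch of that lookup, not Caila's actual API: the variable name `MODEL_DIR`, the file name `model.json`, and the helper `resolve_state_path` are all hypothetical, since the text only says that the path is passed through an env variable.

```python
import os
from pathlib import PurePosixPath

# Hypothetical variable name: the documentation only states that the state
# path is passed through an environment variable, not what it is called.
STATE_ENV_VAR = "MODEL_DIR"


def resolve_state_path(env=None, file_name="model.json"):
    """Return the S3 key of a state file, or None if no state is configured."""
    env = os.environ if env is None else env
    model_dir = env.get(STATE_ENV_VAR)
    if not model_dir:
        # No state path was passed, so the service has not been fitted yet.
        return None
    # S3 keys use forward slashes, so join with POSIX path rules.
    return str(PurePosixPath(model_dir) / file_name)
```

Passing the environment as an argument keeps the function easy to test; in a real service the default `os.environ` would be used.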
Training modes
Caila implements two training modes. They differ in where the training runs and when the derived service is created.
- singleFit: a derived service is created first, and training is conducted within it. The training is performed within the user account and on the user's resources. If the service requires substantial resources, for example for LLM fine-tuning, they must be allocated to the user account.
- multiFit: training is performed first and a checkpoint is created. Then a derived service is created with this checkpoint. Training is performed on resources within the account of the base service owner. In this case, the training operation may incur a charge for the end user.
Stages of training in different modes
singleFit
Step No. | Description | Status * |
---|---|---|
1 | Beginning of training. | INITIAL |
2 | Generating a new modelDir value and retrieving the previous one. Caila passes information about the previously trained state to the fit method to reduce retraining time. modelDir is the path to the folder in S3 storage where the service saves the training result. | INITIAL |
3 | Extracting the training configuration, if it was passed as an identifier. | INITIAL |
4 | Launching an instance of the derived service in the dedicated mode. In this case, the instance will be able to process the fit call, but will not accept predict requests. | INITIAL → WAIT_FOR_START |
5 | Waiting for the new instance to start. | WAIT_FOR_START |
6 | Getting the dataset content in the required format. | WAIT_FOR_START |
7 | Passing the fit call to the service. The call parameters include: • modelDir for saving the new service. • previousModelDir, pointing to the previous fitted service. • trainData and targetsData, extracted from the dataset. • config in JSON format. • datasetInfo: the identifier, name, and format of the passed dataset. • targetServiceInfo: the identifier and name of the service being created. | WAIT_FOR_START → WAIT_FOR_FIT |
8 | Waiting for fit to execute. | WAIT_FOR_FIT |
9 | Switching the instance from dedicated mode to normal mode. After this, the new instance begins to accept predict requests. | WAIT_FOR_FIT |
10 | Deleting all old instances. Caila records their number beforehand. | WAIT_FOR_FIT |
11 | Launching new instances in the same number as the old ones. | WAIT_FOR_FIT → SUCCESS |
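The parameters of the fit call in step 7 can be pictured as a single record. The sketch below is illustrative only: the field names come from the table, but the field types, the `FitCall` class, the `build_fit_call` helper, and all sample values are assumptions rather than Caila's actual SDK.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional


@dataclass
class FitCall:
    """Parameters of the fit call, named as in step 7. Types are assumed."""
    modelDir: str                    # S3 folder for the new training result
    previousModelDir: Optional[str]  # S3 folder of the previous state, if any
    trainData: Any                   # extracted from the dataset
    targetsData: Any                 # extracted from the dataset
    config: str                      # training configuration as a JSON string
    datasetInfo: Dict[str, str]      # identifier, name, and format
    targetServiceInfo: Dict[str, str]  # identifier and name


def build_fit_call(model_dir, previous_model_dir, train_data, targets_data,
                   config, dataset_info, target_service_info):
    """Assemble the call, serializing the configuration to JSON."""
    return FitCall(
        modelDir=model_dir,
        previousModelDir=previous_model_dir,
        trainData=train_data,
        targetsData=targets_data,
        config=json.dumps(config),
        datasetInfo=dataset_info,
        targetServiceInfo=target_service_info,
    )


# Made-up sample values for a first training run (no previous state).
call = build_fit_call(
    model_dir="states/new",
    previous_model_dir=None,
    train_data=["hello", "bye"],
    targets_data=["greeting", "farewell"],
    config={"epochs": 3},
    dataset_info={"id": "1", "name": "demo", "format": "json"},
    target_service_info={"id": "2", "name": "demo-service"},
)
```

On a first run `previousModelDir` is empty here; on retraining it would point to the earlier state so that the fit method can reuse it.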
multiFit
Step No. | Description | Status * |
---|---|---|
1 | Beginning of training. | INITIAL |
2 | Generating a new modelDir value and retrieving the previous one. Caila passes information about the previously trained state to the fit method to reduce retraining time. modelDir is the path to the folder in S3 storage where the service saves the fitting result. | INITIAL |
3 | Extracting the training configuration, if it was passed as an identifier. | INITIAL |
4 | Checking that there is an active instance of the fittable service. If it doesn’t exist, starting it and waiting for it to launch. | INITIAL → WAIT_FOR_TRAINING_CONTAINER |
5 | Searching for a dataset and retrieving its content in the required format. | WAIT_FOR_TRAINING_CONTAINER |
6 | Granting the account owner of the fitted service permission to access the S3 bucket of the account that initiated the training. | WAIT_FOR_TRAINING_CONTAINER |
7 | Passing the fit call to the fittable service specified in trainingModelId. The call parameters include: • modelDir for saving the new service. • previousModelDir, pointing to the previous fitted service. • trainData and targetsData, extracted from the dataset. • config in JSON format. • datasetInfo: the identifier, name, and format of the passed dataset. • targetServiceInfo: the identifier and name of the service being created, and the name of the S3 bucket corresponding to the service being created. | WAIT_FOR_TRAINING_CONTAINER → WAIT_FOR_FIT |
8 | Waiting for fit to execute. | WAIT_FOR_FIT |
9 | Updating the derived service parameters, including the value of modelDir. | WAIT_FOR_FIT |
10 | Starting a new instance of the derived service. If there were old instances, the new instance is launched first, and then the old ones are removed sequentially. As a result, the number of new instances will be equal to the number of deleted ones. | WAIT_FOR_FIT → WAIT_FOR_TARGET_CONTAINER |
11 | Revoking access to the S3 bucket of the account that initiated fitting. | WAIT_FOR_TARGET_CONTAINER → SUCCESS |
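The status columns of the two tables can be read as simple state sequences. The sketch below encodes only the successful path shown in the tables; the dictionary layout and the helper names `status_after` and `status_path` are illustrative assumptions, and failure statuses are not covered because the tables do not describe them.

```python
# Step number -> status the task moves to when that step completes,
# taken from the status columns of the singleFit and multiFit tables.
SINGLE_FIT_TRANSITIONS = {
    4: "WAIT_FOR_START",
    7: "WAIT_FOR_FIT",
    11: "SUCCESS",
}
MULTI_FIT_TRANSITIONS = {
    4: "WAIT_FOR_TRAINING_CONTAINER",
    7: "WAIT_FOR_FIT",
    10: "WAIT_FOR_TARGET_CONTAINER",
    11: "SUCCESS",
}


def status_after(step, transitions, initial="INITIAL"):
    """Return the status once the given step has completed."""
    status = initial
    for changed_at in sorted(transitions):
        if changed_at <= step:
            status = transitions[changed_at]
    return status


def status_path(transitions, initial="INITIAL"):
    """Return the ordered list of statuses a successful training passes through."""
    return [initial] + [transitions[step] for step in sorted(transitions)]


print(status_path(SINGLE_FIT_TRANSITIONS))
print(status_path(MULTI_FIT_TRANSITIONS))
```

Seen this way, multiFit differs from singleFit by waiting for the training container instead of a new instance of the derived service, and by the extra WAIT_FOR_TARGET_CONTAINER stage before SUCCESS.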