Service training process

Concept of training in Caila

Caila lets you create services from your own data by training or fine-tuning an existing service. The training process involves a fittable service, a dataset, and a training configuration:

Figure: Training process

Training produces two entities:

• State of the trained service: the files in S3 storage produced during training. These could be, for example, neural network dumps, SVM weights, or simply JSON files with the original dataset. The state identifier is a path in S3 storage.
• Fitted service, configured to operate with this state. The state, specifically the path to its folder in S3 storage, is passed to the service through an environment variable. Training can also be launched on an existing derived service. In that case the service settings remain unchanged, but all instances restart with the new state.
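
As an illustration of how a fitted service might locate its state at startup, here is a minimal sketch. The variable name `MODEL_DIR` and the helper names are assumptions made for this example; the text above only says that the path is passed through an environment variable, not what that variable is called.

```python
import os
from pathlib import PurePosixPath

# Hypothetical name: the documentation does not specify the variable.
STATE_ENV_VAR = "MODEL_DIR"


def resolve_state_dir(environ=None):
    """Return the S3 folder path of the trained state, or None if it is not set."""
    env = os.environ if environ is None else environ
    raw = env.get(STATE_ENV_VAR, "").strip()
    if not raw:
        # No state configured, for example a service that has not been trained.
        return None
    return PurePosixPath(raw)


def state_file(state_dir, name):
    """Build the key of a single state file inside the state folder."""
    return str(state_dir / name)
```

A service could call `resolve_state_dir()` once at startup and then read files such as `state_file(state_dir, "weights.json")` from S3 with whatever client it uses.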

Training modes

Caila implements two training modes. They differ in where the training runs and when the derived service is created.

• singleFit: a derived service is created first, and training runs inside it. Training is performed within the user account and on the user's resources. If the service requires substantial resources, for example for LLM fine-tuning, they must be allocated to the user account.
• multiFit: training runs first and produces a checkpoint, and a derived service is then created from this checkpoint. Training is performed on resources within the account of the basic service owner, so the training operation may incur a charge for the end user.
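
The difference between the two modes can be summarized in a small lookup table. This is only an illustrative sketch: the class and field names are invented for the example and are not part of any Caila API.

```python
from dataclasses import dataclass
from enum import Enum


class FitMode(Enum):
    SINGLE_FIT = "singleFit"
    MULTI_FIT = "multiFit"


@dataclass(frozen=True)
class ModeTraits:
    # True if the derived service exists before training starts.
    derived_service_created_first: bool
    # Whose account provides the resources used for training.
    training_account: str


MODE_TRAITS = {
    FitMode.SINGLE_FIT: ModeTraits(True, "user"),
    FitMode.MULTI_FIT: ModeTraits(False, "basic service owner"),
}


def describe(mode):
    """Return a one-line summary of a training mode."""
    traits = MODE_TRAITS[mode]
    order = "before" if traits.derived_service_created_first else "after"
    return (f"{mode.value}: derived service is created {order} training; "
            f"training runs in the {traits.training_account} account")
```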

Stages of training in different modes

singleFit

Step No.	Description	Status *
1	Beginning of training.	`INITIAL`
2	Generating a new `modelDir` value and retrieving the previous one. Caila transfers information about the previously trained state to the `fit` method to reduce retraining time. `modelDir` is the path to the folder in the S3 storage where the service saves the training result.	`INITIAL`
3	Extracting the training configuration, if it was passed as an identifier.	`INITIAL`
4	Launching an instance of the derived service in dedicated mode. In this mode the instance can process the `fit` call but does not accept `predict` requests.	`INITIAL` → `WAIT_FOR_START`
5	Waiting for the new instance to start.	`WAIT_FOR_START`
6	Getting the dataset content in the required format.	`WAIT_FOR_START`
7	Passing the `fit` call to the service. The call parameters include: • `modelDir`, the folder where the new state is saved. • `previousModelDir`, the folder with the previously trained state. • `trainData` and `targetsData`, extracted from the dataset. • `config`, in JSON format. • `datasetInfo`, the identifier, name, and format of the passed dataset. • `targetServiceInfo`, the identifier and name of the service being created.	`WAIT_FOR_START` → `WAIT_FOR_FIT`
8	Waiting for `fit` to execute.	`WAIT_FOR_FIT`
9	Switching the instance from dedicated mode to normal mode. After this, the new instance begins to accept `predict` requests.	`WAIT_FOR_FIT`
10	Deleting all old instances. Caila first records how many there were.	`WAIT_FOR_FIT`
11	Launching new instances in the same quantity as the old ones.	`WAIT_FOR_FIT` → `SUCCESS`
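
To make the `fit` call parameters from step 7 more concrete, here is a hedged sketch of a handler that receives them. The field names mirror the parameters listed in the table in snake_case; the `FitCall` class, the toy "training" logic, and the dictionary that stands in for S3 storage are assumptions for illustration and do not reproduce the actual Caila SDK interface.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FitCall:
    model_dir: str                     # modelDir: where the new state is saved
    previous_model_dir: Optional[str]  # previousModelDir: earlier state, if any
    train_data: Any                    # trainData, extracted from the dataset
    targets_data: Any                  # targetsData, extracted from the dataset
    config: dict                       # config, parsed from JSON
    dataset_info: dict                 # identifier, name, and format of the dataset
    target_service_info: dict          # identifier and name of the created service


def fit(call, storage):
    """Toy handler: counts target labels and saves them as the new state.

    `storage` is a plain dict standing in for S3 (key -> file content).
    """
    counts = {}
    for label in call.targets_data:
        counts[label] = counts.get(label, 0) + 1
    key = call.model_dir.rstrip("/") + "/state.json"
    storage[key] = json.dumps({"label_counts": counts, "config": call.config},
                              sort_keys=True)
    return key
```

A real handler would also read `previous_model_dir` to warm-start training, which is the reason the table gives for passing the previous state.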

multiFit

Step No.	Description	Status *
1	Beginning of training.	`INITIAL`
2	Generating a new `modelDir` value and retrieving the previous one. Caila transfers information about the previously trained state to the `fit` method to reduce retraining time. `modelDir` is the path to the folder in the S3 storage where the service saves the fitting result.	`INITIAL`
3	Extracting the training configuration, if it was passed as an identifier.	`INITIAL`
4	Checking that there is an active instance of the fittable service. If there is none, starting one and waiting for it to launch.	`INITIAL` → `WAIT_FOR_TRAINING_CONTAINER`
5	Searching for a dataset and retrieving its content in the required format.	`WAIT_FOR_TRAINING_CONTAINER`
6	Granting the account owner of the fittable service permission to access the S3 bucket of the account that initiated the training.	`WAIT_FOR_TRAINING_CONTAINER`
7	Passing the `fit` call to the fittable service specified in `trainingModelId`. The call parameters include: • `modelDir`, the folder where the new state is saved. • `previousModelDir`, the folder with the previously trained state. • `trainData` and `targetsData`, extracted from the dataset. • `config`, in JSON format. • `datasetInfo`, the identifier, name, and format of the passed dataset. • `targetServiceInfo`, the identifier and name of the service being created, and the name of the S3 bucket corresponding to that service.	`WAIT_FOR_TRAINING_CONTAINER` → `WAIT_FOR_FIT`
8	Waiting for `fit` to execute.	`WAIT_FOR_FIT`
9	Updating derived service parameters, including the value of `modelDir`.	`WAIT_FOR_FIT`
10	Starting a new instance of the derived service. If there were old instances, the new instance is launched first, and then the old ones are removed sequentially. As a result, the number of new instances will be equal to the number of deleted ones.	`WAIT_FOR_FIT` → `WAIT_FOR_TARGET_CONTAINER`
11	Revoking access to the S3 bucket of the account that initiated fitting.	`WAIT_FOR_TARGET_CONTAINER` → `SUCCESS`
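
The status columns of the two tables above can be read as linear progressions. The following sketch encodes them and checks whether an observed sequence of statuses is consistent with a mode. It covers only the statuses listed in the tables; any failure or cancellation statuses are not described here, so they are not modeled, and the function names are made up for this example.

```python
# Status order on the successful path, taken from the tables above.
SINGLE_FIT = ("INITIAL", "WAIT_FOR_START", "WAIT_FOR_FIT", "SUCCESS")
MULTI_FIT = (
    "INITIAL",
    "WAIT_FOR_TRAINING_CONTAINER",
    "WAIT_FOR_FIT",
    "WAIT_FOR_TARGET_CONTAINER",
    "SUCCESS",
)


def is_valid_transition(statuses, current, new):
    """A step may keep the current status or move to the next one in order."""
    if current not in statuses:
        return False
    if new == current:
        return True
    i = statuses.index(current)
    return i + 1 < len(statuses) and statuses[i + 1] == new


def replay(statuses, observed):
    """Check that an observed status history starts at INITIAL and never skips ahead."""
    if not observed or observed[0] != statuses[0]:
        return False
    return all(is_valid_transition(statuses, a, b)
               for a, b in zip(observed, observed[1:]))
```

For example, `replay(MULTI_FIT, ["INITIAL", "WAIT_FOR_FIT"])` is rejected because it skips `WAIT_FOR_TRAINING_CONTAINER`.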
