
Service training process

Concept of training in Caila

Caila lets you create services based on user data by training or fine-tuning existing services. The training process involves a fittable service, a dataset, and a training configuration:

[Figure: Training process]

Training produces two entities:

  • State of the trained service: files in S3 storage produced during training. These could be, for example, neural network dumps, SVM weights, or simply JSON files with the original dataset. The state identifier is a path in S3 storage.

  • Fitted service, configured to operate with this state. The state, specifically the path to its folder in S3 storage, is passed to the service through an environment variable. Training can also be launched on an existing derived service. In this case, the service settings remain unchanged, but all instances restart with the new state.
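
As a rough illustration of how a fitted service might pick up its state, the sketch below reads a state folder path from an environment variable and loads a JSON file from it. The variable name `MODEL_DIR`, the file name `state.json`, and the use of a local directory in place of S3 are all assumptions made for this example; the text above does not name them.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical names: the documentation does not specify the variable or file name.
STATE_ENV_VAR = "MODEL_DIR"
STATE_FILE = "state.json"


def load_state(env=os.environ):
    """Return the trained state, or None if no state is available."""
    model_dir = env.get(STATE_ENV_VAR)
    if not model_dir:
        return None  # no state path configured, so the service is not fitted
    state_path = Path(model_dir) / STATE_FILE
    if not state_path.exists():
        return None
    with state_path.open(encoding="utf-8") as f:
        return json.load(f)


# Demo: a temporary local folder stands in for the S3 folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / STATE_FILE).write_text(
        json.dumps({"labels": ["a", "b"]}), encoding="utf-8"
    )
    demo_state = load_state({STATE_ENV_VAR: tmp})
    print(demo_state)
```

Because the path arrives from outside the service, restarting an instance with a different value is enough to switch it to a new state, which matches the restart behavior described above.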

Training modes

Caila implements two training modes. They differ in where the training runs and when the derived service is created.

  1. First, a derived service is created, and training is conducted within it. This is the singleFit mode. Training runs within the user account and on the user's resources. If the service requires substantial resources, for example for LLM fine-tuning, they must be allocated to the user account.
  2. First, training is performed and a checkpoint is created. Then, a derived service is created with this checkpoint. This is the multiFit mode. Training runs on resources within the account of the basic service owner. In this case, the training operation may incur a charge for the end user.
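
The difference between the two modes can be summarized in a small data structure. This is only a reading aid built from the descriptions above; the class and field names are hypothetical and do not correspond to any Caila API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingMode:
    """Hypothetical summary record; not part of any Caila API."""
    name: str
    derived_service_created: str  # relative to the training run
    training_runs_in: str         # whose account and resources are used


SINGLE_FIT = TrainingMode(
    name="singleFit",
    derived_service_created="before training",
    training_runs_in="user account",
)
MULTI_FIT = TrainingMode(
    name="multiFit",
    derived_service_created="after training",
    training_runs_in="basic service owner account",
)


def describe(mode):
    """Build a one-line description of a mode."""
    return (
        f"{mode.name}: derived service is created {mode.derived_service_created}; "
        f"training runs in the {mode.training_runs_in}"
    )


for mode in (SINGLE_FIT, MULTI_FIT):
    print(describe(mode))
```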

Stages of training in different modes

singleFit

| Step No. | Description | Status * |
| --- | --- | --- |
| 1 | Beginning of training. | INITIAL |
| 2 | Generating a new modelDir value and retrieving the previous one. Caila passes information about the previously trained state to the fit method to reduce retraining time. modelDir is the path to the folder in S3 storage where the service saves the training result. | INITIAL |
| 3 | Extracting the training configuration, if it was passed as an identifier. | INITIAL |
| 4 | Launching an instance of the derived service in dedicated mode. In this mode, the instance can process the fit call but does not accept predict requests. | INITIAL → WAIT_FOR_START |
| 5 | Waiting for the new instance to start. | WAIT_FOR_START |
| 6 | Getting the dataset content in the required format. | WAIT_FOR_START |
| 7 | Passing the fit call to the service. The call parameters include: modelDir, for saving the new service; previousModelDir, pointing to the previous fitted service; trainData and targetsData, extracted from the dataset; config, in JSON format; datasetInfo, the identifier, name, and format of the passed dataset; targetServiceInfo, the identifier and name of the service being created. | WAIT_FOR_START → WAIT_FOR_FIT |
| 8 | Waiting for fit to execute. | WAIT_FOR_FIT |
| 9 | Converting the instance from dedicated mode to normal mode. After this, the new instance begins to accept predict requests. | WAIT_FOR_FIT |
| 10 | Deleting all old instances. Caila records their number beforehand. | WAIT_FOR_FIT |
| 11 | Launching as many new instances as there were old ones. | WAIT_FOR_FIT → SUCCESS |
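
To make the fit call parameters from step 7 more concrete, here is a minimal sketch of what a fit handler could look like. The function signature, the `state.json` file name, and the toy label-counting "model" are assumptions for illustration, and local folders stand in for S3 paths; the actual SDK interface is not described in this section and may differ.

```python
import json
import tempfile
from pathlib import Path


def fit(model_dir, previous_model_dir, train_data, targets_data, config,
        dataset_info, target_service_info):
    """Toy fit: count labels, optionally continuing from the previous state."""
    if len(train_data) != len(targets_data):
        raise ValueError("train_data and targets_data must have equal length")
    counts = {}
    # Reuse the previous state, if any, instead of starting from scratch.
    if previous_model_dir:
        prev_file = Path(previous_model_dir) / "state.json"
        if prev_file.exists():
            prev_state = json.loads(prev_file.read_text(encoding="utf-8"))
            counts = prev_state["label_counts"]
    for label in targets_data:
        counts[label] = counts.get(label, 0) + 1
    state = {
        "label_counts": counts,
        "config": config,
        "dataset": dataset_info,
        "target_service": target_service_info,
    }
    # Save the result into the folder given by model_dir.
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "state.json").write_text(json.dumps(state), encoding="utf-8")
    return state


# Demo with made-up data: two consecutive training runs.
with tempfile.TemporaryDirectory() as root:
    info = {"id": 1, "name": "demo", "format": "json"}
    target = {"id": 10, "name": "my-service"}
    first = fit(f"{root}/v1", None, ["hi", "bye"], ["greet", "part"],
                {"epochs": 1}, info, target)
    second = fit(f"{root}/v2", f"{root}/v1", ["hello"], ["greet"],
                 {"epochs": 1}, info, target)
    print(second["label_counts"])
```

The second run starts from the state saved by the first one, which is the purpose of passing previousModelDir alongside modelDir.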

multiFit

| Step No. | Description | Status * |
| --- | --- | --- |
| 1 | Beginning of training. | INITIAL |
| 2 | Generating a new modelDir value and retrieving the previous one. Caila passes information about the previously trained state to the fit method to reduce retraining time. modelDir is the path to the folder in S3 storage where the service saves the fitting result. | INITIAL |
| 3 | Extracting the training configuration, if it was passed as an identifier. | INITIAL |
| 4 | Checking that there is an active instance of the fittable service. If it does not exist, starting it and waiting for it to launch. | INITIAL → WAIT_FOR_TRAINING_CONTAINER |
| 5 | Searching for a dataset and retrieving its content in the required format. | WAIT_FOR_TRAINING_CONTAINER |
| 6 | Granting the account owner of the fitted service permission to access the S3 bucket of the account that initiated the training. | WAIT_FOR_TRAINING_CONTAINER |
| 7 | Passing the fit call to the fittable service specified in trainingModelId. The call parameters include: modelDir, for saving the new service; previousModelDir, pointing to the previous fitted service; trainData and targetsData, extracted from the dataset; config, in JSON format; datasetInfo, the identifier, name, and format of the passed dataset; targetServiceInfo, the identifier and name of the service being created, and the name of the S3 bucket corresponding to it. | WAIT_FOR_TRAINING_CONTAINER → WAIT_FOR_FIT |
| 8 | Waiting for fit to execute. | WAIT_FOR_FIT |
| 9 | Updating derived service parameters, including the value of modelDir. | WAIT_FOR_FIT |
| 10 | Starting a new instance of the derived service. If there were old instances, the new instance is launched first, and then the old ones are removed sequentially. As a result, the number of new instances will be equal to the number of deleted ones. | WAIT_FOR_FIT → WAIT_FOR_TARGET_CONTAINER |
| 11 | Revoking access to the S3 bucket of the account that initiated fitting. | WAIT_FOR_TARGET_CONTAINER → SUCCESS |
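
The status columns of the two step tables can be read as simple linear sequences. The sketch below encodes them and checks whether an observed status history is consistent with a mode. It assumes that a cell listing two statuses denotes a transition from the first to the second, and it covers only the successful path, since failure statuses are not listed in this section. The function names are hypothetical.

```python
# Status sequences as read from the step tables (successful path only).
STATUS_FLOW = {
    "singleFit": ["INITIAL", "WAIT_FOR_START", "WAIT_FOR_FIT", "SUCCESS"],
    "multiFit": [
        "INITIAL",
        "WAIT_FOR_TRAINING_CONTAINER",
        "WAIT_FOR_FIT",
        "WAIT_FOR_TARGET_CONTAINER",
        "SUCCESS",
    ],
}


def next_status(mode, current):
    """Return the status that follows `current`, or None at the end."""
    flow = STATUS_FLOW[mode]
    idx = flow.index(current)
    if idx == len(flow) - 1:
        return None
    return flow[idx + 1]


def is_valid_history(mode, history):
    """True if the history starts at INITIAL and each entry either
    repeats the current status or advances by exactly one step."""
    flow = STATUS_FLOW[mode]
    if not history or history[0] != flow[0]:
        return False
    pos = 0
    for status in history[1:]:
        if status == flow[pos]:
            continue  # still in the same status
        if pos + 1 < len(flow) and status == flow[pos + 1]:
            pos += 1  # advanced to the next status
        else:
            return False
    return True


print(next_status("singleFit", "INITIAL"))
print(is_valid_history("multiFit", ["INITIAL", "WAIT_FOR_FIT"]))
```

A client polling a training job could use a check like this to detect unexpected status jumps, though the real set of statuses may be larger than what is shown here.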