
Service training troubleshooting

If an error occurs during service training, examine the fitting logs. They are located on the Events history tab on your service page.

When analysing training errors, first pay attention to messages with the label ERROR, especially those that begin with Error processing FIT request. From these, you can learn the cause of the issue. For example, the Error processing FIT request: Instance was removed message indicates that the service failed to train because the active instance was stopped.
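The filtering described above can be sketched in a few lines of Python. This is only an illustration: the record layout (`level` and `message` keys) and the sample records are assumptions, and the real format of an exported event history may differ.

```python
# Assumed record layout: each event is a dict with "level" and "message" keys.
FIT_ERROR_PREFIX = "Error processing FIT request"

def find_fit_errors(events):
    """Return messages of ERROR events, with FIT request errors listed first."""
    errors = [e["message"] for e in events if e.get("level") == "ERROR"]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(errors, key=lambda m: not m.startswith(FIT_ERROR_PREFIX))

# Illustrative sample records, not real log output.
events = [
    {"level": "INFO", "message": "Training started"},
    {"level": "ERROR", "message": "Instance closed connection"},
    {"level": "ERROR", "message": "Error processing FIT request: Instance was removed"},
]

for message in find_fit_errors(events):
    print(message)
```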


Non-composite service errors

Invalid dataset format

A dataset with incorrect content was passed during training, or the dataset format does not match what the service expects.

Diagnostics

The error message indicates that the service was unable to convert the dataset to the required type, or there was an unexpected character when parsing the dataset (for example, if it is in JSON format).

Recommendations

  • Correct the dataset by following the hints in the error message.
  • Carry out the training on a different dataset, preferably smaller.
  • Check that the dataset type you pass during training is compatible with what the service expects.
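If the dataset is in JSON format, you can check it for syntax errors locally before starting the training. The sketch below uses only the Python standard library. The `texts` key in the sample is a made-up example, and the check covers JSON syntax only, not the structure a particular service expects.

```python
import json

def check_json_dataset(text):
    """Return None if text parses as JSON, else the position and reason of the first parse error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based and point at the offending character.
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None

# A well-formed dataset and one with a trailing comma (hypothetical content).
print(check_json_dataset('{"texts": ["hello", "world"]}'))
print(check_json_dataset('{"texts": ["hello", "world",]}'))
```

The exact wording of the error message depends on the Python version, so rely on the reported position rather than on the text.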

High resource consumption by fitted service

Services consume varying amounts of resources when training on datasets of different sizes. If the set limits are exceeded, the training is automatically interrupted.

Diagnostics

  • The event history contains the Error processing FIT request: Instance was removed or Instance closed connection message, but no Instance <instance_id> was evicted within <scope> scope message.
  • On the Diagnostics tab, the event list shows that the instance participating in training reached its Maximum resource limit for memory consumption. It also shows whether the instance was evicted during training.
tip
Note the reason for the eviction: it may provide information on which resource was lacking. For example: <…> service was evicted <…> to free Ram 500Mi Disk 100Mi <…>.
  • Training is successful on a small dataset, but fails on a large one.
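To see quickly which resources were lacking, you can extract the amounts from the eviction reason. The sketch below assumes that the reason always lists pairs of the form "name amount" after the words "to free", as in the example from the tip above. The actual message format may vary.

```python
import re

def parse_freed_resources(message):
    """Extract {resource: amount} pairs that follow 'to free' in an eviction message."""
    match = re.search(r"to free\s+(.*)", message)
    if match is None:
        return {}
    # Each pair looks like "Ram 500Mi": a capitalised name, then a number with an optional unit.
    pairs = re.findall(r"([A-Z][A-Za-z]*)\s+(\d+(?:\.\d+)?[A-Za-z]*)", match.group(1))
    return dict(pairs)

# The message from the tip above, with its elided parts left as they are.
sample = "<…> service was evicted <…> to free Ram 500Mi Disk 100Mi <…>"
print(parse_freed_resources(sample))  # {'Ram': '500Mi', 'Disk': '100Mi'}
```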

Recommendations

Increase the resource limits for the service. Memory and disk space are usually exhausted first.


High resource consumption by other services

When launching new services within the account or resource groups, Caila may evict the derived instance.

Diagnostics

There is the Instance <instance_id> was evicted message in the event history. The training ended with the Instance was removed error.
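The difference between this case and high resource consumption by the fitted service itself comes down to whether an eviction message is present. A minimal sketch of that check, assuming the event history is available as a plain list of message strings (the function name and return labels are illustrative):

```python
def classify_removed_instance(messages):
    """Guess why training ended, based on the event history messages."""
    evicted = any("was evicted" in m for m in messages)
    removed = any(
        "Instance was removed" in m or "Instance closed connection" in m
        for m in messages
    )
    if evicted:
        # Another service needed the resources.
        return "evicted"
    if removed:
        # No eviction message: the fitted service probably hit its own limits.
        return "resource_limit"
    return "unknown"

history = [
    "Instance <instance_id> was evicted within <scope> scope",
    "Error processing FIT request: Instance was removed",
]
print(classify_removed_instance(history))  # evicted
```

Treat the result as a starting point only: the Diagnostics tab remains the authoritative source for the reason.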

Recommendations

The eviction message indicates the scope within which the eviction occurred.

| If eviction occurred… | What to do |
|---|---|
| Within an account | Try not to run additional services during the training. You can also analyse the resource consumption by other services and either optimise it or expand the account limits. |
| Within a public resource group | Contact support using the widget in the bottom right corner. |
| Within a private resource group | Analyse the resource consumption within this group. Try to optimise it or increase the limits for the account. |

Training stuck at WAIT_FOR_START

When training in the singleFit mode, the service instance created for training must first start. The WAIT_FOR_START state means that Caila is waiting for this instance to launch.

If the state persists throughout the training and ultimately ends with a fitTimeout error, it indicates that, for some reason, the instance could not start.

Diagnostics

  • The WAIT_FOR_START state persists throughout the training. Training ends with a fitTimeout error.
  • There is the Instance could not start message in the event history. A related Instance was removed message may also appear.

Recommendations

Go to the Diagnostics tab and select the instance in the Waiting state. If such an instance does not exist, select the one created after training started. Analyse the Events section to identify errors:

  • If the error is related to Caila not being able to load the Docker image, check its availability. If you are not the owner of the image, contact the owner for further details.

  • If the error says that 0 of the allowed nodes are available, contact support using the widget in the bottom right corner.

  • If the error is different and/or the instance was restarted several times, you should download the instance logs and analyse the error causing the restart. If the logs are not available to you, contact the service owner.

  • If during training you see the Back-off restarting failed container event on the Diagnostics tab, it means the service crashed with an error while starting. Download the logs and analyse the cause of the error. If the logs are not available to you, contact the service owner.

  • If there are no errors, pay attention to the Pulling image: event. Your image may be too large to load into Caila in time. If you are sure the image is small, contact support.

  • If the problem is unresolved, contact support using the widget in the bottom right corner.

tip
When contacting support, provide the instance logs (if you have access to them) and the event history from the Diagnostics tab.

Service configuration error

During service training, you can configure its operation. If you make an error in the configuration, it will only appear when the service starts.

Diagnostics

  • The error occurs immediately after the service starts.
  • When using the default configuration, service training is successfully completed.
  • The event history explicitly states that the problem is in the service configuration.

Recommendations

  • If the error message indicates where the problem is or how to fix it, follow those instructions.
  • Compare your configuration with the sample configuration from the service documentation.

Internal service error

Various internal errors may occur during training.

Diagnostics

Training transitions to the WAIT_FOR_FIT state and after some time ends with an error not described above.

Recommendations

  • If the service is not yours, copy the error message from the event history and contact the service owner.
  • If the service is yours, examine the logs on the Diagnostics tab.
  • If the issue is related to communication between the service and Caila, contact support using the widget in the bottom right corner.
  • Otherwise, to identify the problem, add more logging and error handling to the service and run it locally.
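As an illustration of the last recommendation, here is a minimal sketch of wrapping a training routine with logging and error handling using the Python standard library. The `fit` function and the logger name are hypothetical stand-ins. Replace them with your service's own training code.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("my_service.fit")  # hypothetical logger name

def fit(dataset):
    """Hypothetical stand-in for the service's training routine."""
    if not dataset:
        raise ValueError("dataset is empty")
    return {"samples": len(dataset)}

def fit_with_logging(dataset):
    """Run fit() and log the outcome; re-raise so the caller still sees the failure."""
    logger.info("fit started, %d records", len(dataset))
    try:
        result = fit(dataset)
    except Exception:
        # logger.exception logs at ERROR level and includes the full traceback.
        logger.exception("fit failed")
        raise
    logger.info("fit finished: %s", result)
    return result

if __name__ == "__main__":
    fit_with_logging(["example record"])
```

Running the service locally with a wrapper like this puts the full traceback in the logs, which makes it easier to tell a bug in the service from a problem in its communication with Caila.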

Caila internal error

An internal error may occur in Caila, leading to errors during training.

Diagnostics

None of the previous points apply.

Recommendations

  • Restart the training.
  • Contact support using the widget in the bottom right corner.

Composite service errors

Different types of errors can occur in composite services.

Such services interact with other services during training. If the operation of those other services is disrupted, it directly affects the training of the composite services that depend on them.

Unavailability of public service

The required public service may be unavailable for various reasons beyond the user’s control. For example, this may be due to a lack of resources, an error when loading the image, or the deletion of the service itself.

Diagnostics

  • The training error message indicates service unavailability or a predict request error.
  • When contacting a public service directly through the Caila interface, an error is returned.

Recommendations

Contact the service owner.


Errors on starting derived services

The fitted service can launch other services within the account. These derived services may either not fit within the limits of the user account or, when starting, cause the eviction of the fitted service itself.

Diagnostics

In the event history:

  • There is the Instance <instance_id> was evicted message. The training ended with the Instance was removed error.
  • There is a message stating that child training ended with an error.

Recommendations

  • Increase the limits for the account.
  • If you have a private resource group, set the minInstancesCount value via REST API.
  • Contact the service owner.