Service training troubleshooting
If an error occurs during service training, examine the fitting logs. They are located on the Events history tab of your service page.
When analysing training errors, first pay attention to messages with the label ERROR, especially those that begin with Error processing FIT request. From these, you can learn the cause of the issue. For example, the Error processing FIT request: Instance was removed message indicates that the service failed to train because the active instance was stopped.
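If you copy the event history into a text file, the filtering described above can be sketched in a few lines. This is a minimal illustration, not a Caila tool: the layout of the log lines (label first, then the message) and the helper name find_fit_errors are assumptions.

```python
FIT_ERROR_PREFIX = "Error processing FIT request"


def find_fit_errors(log_lines):
    """Return lines labelled ERROR, with FIT request errors listed first."""
    # Assumption: each line carries its label as the literal text "ERROR".
    errors = [line.strip() for line in log_lines if "ERROR" in line]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(errors, key=lambda line: FIT_ERROR_PREFIX not in line)


sample = [
    "INFO  Training started",
    "ERROR Something else failed",
    "ERROR Error processing FIT request: Instance was removed",
]
for line in find_fit_errors(sample):
    print(line)
```

Lines that mention Error processing FIT request are printed first because they usually state the cause most directly.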
Content
- Non-composite service errors
  - Invalid dataset format
  - High resource consumption by fitted service
  - High resource consumption by other services
  - Training stuck at WAIT_FOR_START
  - Service configuration error
  - Internal service error
  - Caila internal error
- Composite service errors
  - Unavailability of public service
  - Errors on starting derived services
Non-composite service errors
Invalid dataset format
During training, a dataset with incorrect content was passed, or the dataset format does not match what the service expects.
Diagnostics
The error message indicates that the service was unable to convert the dataset to the required type, or there was an unexpected character when parsing the dataset (for example, if it is in JSON format).
Recommendations
- Correct the dataset by following the hints in the error message.
- Run the training on a different, preferably smaller, dataset.
- Check that the dataset type you pass during training is compatible with what the service expects.
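If the dataset is a JSON file, you can look for an unexpected character locally before starting the training. The sketch below uses only the Python standard library. The helper name check_json_dataset is hypothetical, and the check covers JSON syntax only, not whether the structure matches what a particular service expects.

```python
import json


def check_json_dataset(text):
    """Return None if text is valid JSON, otherwise describe the first parse error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno and colno point at the character the parser could not accept.
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


print(check_json_dataset('{"texts": ["a", "b"]}'))
print(check_json_dataset('{"texts": ["a", "b",]}'))
```

The first call returns None. The second reports the position of the trailing comma, which is the kind of hint the training error message also gives.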
High resource consumption by fitted service
Services consume varying amounts of resources when training on datasets of different sizes. If the set limits are exceeded, the training is automatically interrupted.
Diagnostics
- The event history contains the Error processing FIT request: Instance was removed or Instance closed connection message, but no Instance <instance_id> was evicted within <scope> scope message.
- On the Diagnostics tab, the event list shows that the instance participating in training reached the Maximum resource limit for memory consumption. It also shows whether the instance was evicted during training.
- Training is successful on a small dataset, but fails on a large one.
Recommendations
Increase the resource limits for the service. Memory and disk space limits are usually exhausted first.
High resource consumption by other services
When launching new services within the account or resource groups, Caila may evict the derived instance.
Diagnostics
There is the Instance <instance_id> was evicted message in the event history. The training ended with the Instance was removed error.
Recommendations
The eviction message indicates within which space the eviction occurred.
| If eviction occurred… | What to do |
|---|---|
| Within an account | Try not to run additional services during the training. You can also analyse the resource consumption by other services and either optimise it or expand the account limits. |
| Within a public resource group | Contact support using the widget in the bottom right corner. |
| Within a private resource group | Analyse the resource consumption within this group. Try to optimise it or increase the limits for the account. |
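The two resource-related cases differ only in whether an eviction message is present, which can be sketched as a small decision function. This is an illustration of the diagnostic logic, not part of Caila: the function name, the return labels, and the idea of passing the event history as a list of strings are all assumptions.

```python
def classify_removed_instance(events):
    """Guess why a training instance was removed, based on event history messages."""
    removed = any(
        "Instance was removed" in e or "Instance closed connection" in e
        for e in events
    )
    evicted = any("was evicted" in e for e in events)
    if evicted:
        # An eviction message means resources were reclaimed for other services.
        return "evicted"
    if removed:
        # Removed without eviction: the fitted service likely hit its own limits.
        return "own-limit"
    return "unknown"


history = [
    "Instance 42 was evicted within account scope",
    "Error processing FIT request: Instance was removed",
]
print(classify_removed_instance(history))
```

For the evicted case, the scope named in the eviction message tells you which row of the table applies.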
Training stuck at WAIT_FOR_START
When training in the singleFit mode, the service instance created for training must first start. The WAIT_FOR_START state means that Caila is waiting for this instance to launch.
If the state persists throughout the training and ultimately ends with a fitTimeout error, it indicates that, for some reason, the instance could not start.
Diagnostics
- The WAIT_FOR_START state persists throughout the training. Training ends with a fitTimeout error.
- There is the Instance could not start message in the event history. A related Instance was removed message may also appear.
Recommendations
Go to the Diagnostics tab and select the instance in the Waiting state. If such an instance does not exist, select the one created after training started. Analyse the Events section to identify errors:
- If the error is related to Caila not being able to load the Docker image, check its availability. If you are not the owner of the image, contact the owner for further details.
- If the error is related to the fact that 0 nodes are available from the allowed quantity, contact support using the widget in the bottom right corner.
- If the error is different and/or the instance was restarted several times, you should download the instance logs and analyse the error causing the restart. If the logs are not available to you, contact the service owner.
- If during training you see the Back-off restarting failed container event on the Diagnostics tab, it means the service crashed with an error while starting. Download the logs and analyse the cause of the error. If the logs are not available to you, contact the service owner.
- If there are no errors, pay attention to the Pulling image: event, as your image may be too large and not load into Caila in time. If you are sure the image is small, contact support.
- If the problem is unresolved, contact support using the widget in the bottom right corner.
Service configuration error
You can configure how a service operates during training. A mistake in the configuration only surfaces when the service starts.
Diagnostics
- The error occurs immediately after the service starts.
- When using the default configuration, service training is successfully completed.
- The event history explicitly states that the problem is in the service configuration.
Recommendations
- If the error message points to the location of the problem and/or explains how to fix it, follow those instructions.
- Compare your configuration with the sample configuration from the service documentation.
Internal service error
Various internal errors may occur during training.
Diagnostics
Training transitions to the WAIT_FOR_FIT state and after some time ends with an error not described above.
Recommendations
- If the service is not yours, copy the error message from the event history and contact the service owner.
- If the service is yours, examine the logs on the Diagnostics tab.
- If the issue is related to communication between the service and Caila, contact support using the widget in the bottom right corner.
- Otherwise, to identify the problem, add logging and error handling to the service, and run it locally.
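As one way to add logging and error handling, you can wrap the training entry point so that every failure is logged with a full traceback. The sketch below assumes a hypothetical fit(dataset, config) function in your own service code. It does not use any Caila SDK calls, and the logger name my_service is a placeholder.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("my_service")  # placeholder logger name


def fit_with_logging(fit, dataset, config):
    """Call a hypothetical fit(dataset, config), logging progress and failures."""
    logger.info("Fit started: %d items, config=%r", len(dataset), config)
    try:
        model = fit(dataset, config)
    except Exception:
        # logger.exception writes the traceback at ERROR level; the error is
        # then re-raised so the training still fails visibly.
        logger.exception("Fit failed")
        raise
    logger.info("Fit finished")
    return model


# Local run with a stand-in fit function before deploying the service.
print(fit_with_logging(lambda data, cfg: {"size": len(data)}, ["a", "b"], {"epochs": 1}))
```

Running the wrapper locally with a small dataset often reproduces the error without waiting for a full training cycle.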
Caila internal error
An internal error may occur in Caila, leading to errors during training.
Diagnostics
None of the previous points apply.
Recommendations
- Restart the training.
- Contact support using the widget in the bottom right corner.
Composite service errors
Different types of errors can occur in composite services.
Such services interact with other services during training. If those other services malfunction, the training of the composite service that depends on them is directly affected.
Unavailability of public service
The required public service may be unavailable for various reasons beyond the user’s control. For example, this may be due to a lack of resources, an error when loading the image, or the deletion of the service itself.
Diagnostics
- The training error message indicates service unavailability or a predict request error.
- When contacting a public service directly through the Caila interface, an error is returned.
Recommendations
Contact the service owner.
Errors on starting derived services
The fitted service can launch other services within the account. These derived services may either not fit within the user account or lead to the eviction of the fitted service itself when starting.
Diagnostics
In the event history:
- There is the Instance <instance_id> was evicted message. The training ended with the Instance was removed error.
- A message states that the child training ended with an error.
Recommendations
- Increase the limits for the account.
- If you have a private resource group, set the minInstancesCount value via REST API.
- Contact the service owner.