How Can AI Work in Microservice Systems?


Artificial intelligence in microservice systems, in Kubernetes clusters, … does that work at all? Aren’t these large, unwieldy modules a poor fit for this kind of software architecture?

The use of artificial intelligence methods is growing rapidly in practice, and demand keeps increasing. In many cases, the development of these systems begins in research. This is good - very good - because it allows the latest findings and developments to reach productive practice very quickly. At the same time, it is a problem: in research, such systems are normally designed as “demonstrators”. They serve as a proof of principle, help to examine the newest scientific findings, and prepare or accompany a publication. They are not meant to perform practical work in the form in which they were developed.

Unfortunately, this often results in the AI module not being “reentrant”. This means that the program is not capable of processing multiple requests at the same time. In the worst case, it is suitable for only a single call and must be restarted afterwards. AI systems in particular also have a non-negligible loading time (ramp-up phase): before they are ready for operation, the trained (often huge) AI models have to be loaded and internal structures have to be initialized. Research does not focus on resource consumption and processing time, since it is primarily concerned with proving usability in principle. Research cannot be blamed for this, since research and application development have completely different motivations and goals.

Another challenge for the practical use of AI systems can arise from the programming environment used in research. In most cases, this is Python. Python offers many advantages: it combines an easy-to-use programming language with a large runtime environment of additional functions and mathematical modules. These make work in research and development much easier. Unfortunately, this advantage also brings a disadvantage for productive use: a large “resource hunger”, which can have a disruptive effect in a cloud and microservices environment.

Practice begins on the horizon, far behind the prototypes

So caution is advised when adopting AI systems from research into practice at short notice. It is highly likely that further investigation and additional steps are necessary to integrate the new AI into a productive system.

Many articles can be found on the Internet about Kubernetes, microservices, AI systems and their combinations. Mostly, they describe how something like this can be quickly put to use. For practical use in software systems deployed in production, however, many constraints have to be met. It is a long way from “We tried this once and it worked prototypically …” to stable, practical use for customers.

… and the solution?

As you would expect, the solution is to perform many tasks, only some of which will be mentioned here:

  • The AI model must be examined for biases. Nobody wants to use a system in practice that, for example, stands out for discriminating against individuals or groups, or that cannot deal with unusual meteorological conditions and therefore evaluates extreme situations incorrectly. For this purpose, the data used to train the AI model must be carefully selected and supplemented.

  • If possible, the trained AI model should be reduced in size. This has a direct influence on the resource consumption, runtime behavior and ramp-up phase of the AI system. For deep learning models, for example, converting the model into the ONNX format is a good option; this is often associated with a reduction in size.

  • To run multiple processing tasks in parallel, the AI system should become reentrant. For this, problems that can arise from static data areas and global variables should be investigated, since these can prevent further tasks from being processed correctly. In addition, the system must be transformed into a service (or server) that can process multiple calls in parallel and does not require restarts.

  • One goal is to create the “smallest” possible service (usually a container) for efficient use in a cloud computing environment such as Kubernetes. The size can be decisively influenced by the choice of runtime environment and programming language. Today, Go, Rust, and C++ are suitable choices. However, with some additional effort, it is also possible to build optimized, small services with Python.

  • The search for and elimination of memory leaks is one of the most important tasks. After all, the service should be able to run for days/weeks/months without restarting and its memory requirements should not grow endlessly.

  • If possible, the ramp-up phase of the service should be reduced to a few seconds. Short time spans for the start of an instance of the AI system allow the use of autoscaling in order to be able to react quickly to increasing requirements in productive use.

  • Caching data and models can help reduce both resource consumption and the duration of the ramp-up phase.

  • In any case, monitoring of the essential system parameters must be integrated. With the aid of such a health system, the service can, for example, be replaced and restarted automatically when problems occur. This helps to keep operation uninterrupted.
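Several of the points above (load the model once, accept multiple calls in parallel without a restart, expose health information) can be sketched together in a few lines of Python. This is only a rough sketch under stated assumptions: `InferenceService`, `load_toy_model` and the word-scoring “model” are hypothetical stand-ins, not a real AI system or library API. In CPython, threads mainly illustrate the structure; they do not guarantee a speed-up for CPU-bound models.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InferenceService:
    """Loads the model once and serves many calls without a restart."""

    def __init__(self, load_model, workers=4):
        started = time.monotonic()
        self._model = load_model()            # ramp-up happens exactly once
        self.ramp_up_seconds = time.monotonic() - started
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._lock = threading.Lock()         # guards the counters only
        self._handled = 0
        self._failed = 0

    def _run(self, text):
        try:
            # Per-request data lives in local variables only,
            # so parallel calls do not interfere with each other.
            result = self._model(text)
        except Exception:
            with self._lock:
                self._failed += 1
            raise
        with self._lock:
            self._handled += 1
        return result

    def submit(self, text):
        """Queue one request; returns a Future holding the result."""
        return self._pool.submit(self._run, text)

    def health(self):
        """Basic figures a monitoring system could poll."""
        with self._lock:
            return {"ready": self._model is not None,
                    "handled": self._handled,
                    "failed": self._failed}

    def close(self):
        self._pool.shutdown(wait=True)


def load_toy_model():
    """Hypothetical stand-in for loading a trained model from disk."""
    vocabulary = {"good": 1, "bad": -1}

    def model(text):
        return sum(vocabulary.get(word, 0) for word in text.lower().split())

    return model


service = InferenceService(load_toy_model)
futures = [service.submit(t) for t in ["good good", "bad", "good bad"]]
print([f.result() for f in futures])  # [2, -1, 0]
print(service.health())  # {'ready': True, 'handled': 3, 'failed': 0}
service.close()
```

In a Kubernetes deployment, the values returned by `health()` would typically be exposed over an HTTP endpoint and queried by liveness and readiness probes. That wiring is assumed here and not shown.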

Although the above list contains the essential work, it is by no means complete. Depending on the selected cloud computing environment in which the AI system is to be used in practice, further investigations, adaptations and additions are still necessary.
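As one example of such an investigation, the search for memory leaks mentioned in the list can be started with Python's standard `tracemalloc` module. The sketch below is only an illustration: `leaky_predict`, `clean_predict` and `memory_growth` are hypothetical names, and the “prediction” is a placeholder that allocates about one kilobyte per call. The idea is to call the service function many times and check whether the traced memory keeps growing.

```python
import tracemalloc

# Hypothetical leak: every result is kept forever at module level.
_history = []


def leaky_predict(x):
    result = bytes(1024)       # stand-in for a per-request result
    _history.append(result)    # never released, grows with every call
    return len(result)


def clean_predict(x):
    result = bytes(1024)       # released again when the call returns
    return len(result)


def memory_growth(func, calls=500):
    """Return the bytes of traced memory still allocated after `calls` calls."""
    tracemalloc.start()
    func(0)                    # warm-up so one-time allocations are not counted
    before, _peak = tracemalloc.get_traced_memory()
    for i in range(calls):
        func(i)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


print("leaky growth in bytes:", memory_growth(leaky_predict))
print("clean growth in bytes:", memory_growth(clean_predict))
```

A growth figure that scales with the number of calls points to a leak, while a figure close to zero suggests that per-request memory is released. For a service that should run for weeks, the same measurement would have to be repeated with realistic requests and over much longer periods.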

Image by Gerd Altmann at Pixabay