Computational Technologies And AI For Automated Behaviour Analysis

White Paper Series Part 4

Development of a Prediction Model


The current surge in video recordings of professional interactions (e.g. video CVs, webinars, video conferences) represents a real opportunity to develop automated systems for behaviour-based inference of traits and states, as explained above. Such systems rely on fast, scalable algorithms built on ground truth data that is both large and of high quality. Their development, including that of Vima’s system, typically has three steps:


  1. Human raters or annotators evaluate video recordings of expressions (e.g. video CVs) on key questions relating to particular skills and personality traits (e.g. “Is the person willing, eager to talk?”). At Vima, human annotators are trained, expert observers who meet the statistical standards of high interrater agreement. This annotation quality check is especially important when the labels refer to complex or “high-level” entities (e.g. judging personality) as opposed to “low-level” entities of an image, sound, or video (e.g. detecting pedestrian crossings). In the appendix, we list our standard set of assessed skills and traits.
  2. In parallel, the recorded expressions are analysed by several behaviour feature extraction tools taken from libraries of independently developed algorithms (some of them described below).
  3. The outputs of steps (1) and (2) are combined to train a machine learning model that aims to predict the human evaluations from the extracted behaviour features alone. The final test of the system is performed on video material that was not included in the training phase (held-out evaluation, e.g. by cross-validation). If the prediction model is accurate on this unseen material, it means that this type of human evaluation of skills and personality traits is reliably and systematically correlated with the extracted behaviour features, and that the model can be applied to new sets of video recordings showing a similar type of expression (e.g. video CVs).
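
The three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic stand-ins for extracted features and human ratings, and a plain linear model fitted by gradient descent replaces whatever model Vima actually trains. The point is only the evaluation logic, in which the model is fitted on one part of the data and scored on recordings it has never seen.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def fit_linear(X, y, lr=0.05, epochs=500):
    """Fit y ~ w.x + b by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(model, X):
    w, b = model
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]

# Synthetic stand-ins (assumption): 200 "videos", 3 behaviour features each,
# and a human rating that depends on the first two features plus noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [0.8 * x[0] - 0.5 * x[1] + rng.gauss(0, 0.3) for x in X]

# Train on the first 75% and evaluate on the remaining, unseen 25%.
split = int(0.75 * len(X))
model = fit_linear(X[:split], y[:split])
held_out_r = pearson(predict(model, X[split:]), y[split:])
print("correlation with human ratings on held-out data:", round(held_out_r, 2))
```

A high held-out correlation is what the text means by an accurate model: the ratings can be recovered from the features alone on new material.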

Behaviour Extraction and Prediction of Personality Traits and Soft Skills


Once a prediction model is trained, it becomes a stand-alone automatic system that can be applied to other video material of human expressions. A model can contain one or several modules, each extracting a large and distinct set of behavioural features. Vima has built and tested robust extraction tools for speech, acoustic, and visual signal processing. Figure 3 illustrates how Vima’s machine learning model operates. It starts with a pre-processing step applied to the video and audio channels. Large sets of feature vectors are then used as input to make predictions for each modality separately. These modality-specific predictions are finally combined using fusion (balancing) techniques to maximize prediction accuracy and to calculate recommendations based on the full multimodal behavioural expression.

Figure 3. Schematic overview of Vima’s submodules and how they interact to provide skills and traits predictions based on late fusion techniques.

The Audio Module of Vima’s Prediction Model Focuses on Five Subgroups of Acoustic Features:


  1. Spectral features (MFCC, PLP, Chroma, …)
  2. Temporal features (variational characteristics of silence and voiced segments)
  3. Intonation (pitch, formants, …)
  4. Speech intensity (energy, …)
  5. Voice quality (Jitter, Shimmer, HNR, …)
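
To make a few of these feature families concrete, the sketch below computes simplified versions of three of them: frame energy (speech intensity), the share of active frames (a crude stand-in for voiced/silent segmentation), and local jitter and shimmer (voice quality). The function names, the energy threshold, and the synthetic signal are assumptions for illustration; a production extractor would rely on a dedicated speech-processing toolkit and would first need a pitch tracker to obtain the period and amplitude sequences.

```python
import math

def frame_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def active_ratio(energies, threshold):
    """Share of frames whose energy exceeds the threshold."""
    return sum(e > threshold for e in energies) / len(energies)

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods / mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes / mean amplitude."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# Synthetic signal (assumption): half a second of silence, then a 100 Hz tone.
sr = 8000
silence = [0.0] * (sr // 2)
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr // 2)]
energies = frame_energy(silence + tone, frame_len=200)
speech_share = active_ratio(energies, threshold=0.01)

# Hypothetical pitch periods (seconds) and peak amplitudes from a pitch tracker.
jitter = local_jitter([0.010, 0.011, 0.010, 0.011])
shimmer = local_shimmer([0.50, 0.50, 0.50, 0.50])
print(speech_share, round(jitter, 4), shimmer)
```

Each function returns a single number per recording or per frame, which is the form in which such features enter a feature vector.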

Vima’s video module contains two submodules built using so-called Deep Neural Network¹ techniques: a data-driven feature extractor and a prediction submodule.


  1. Data-driven feature extraction is used to detect the face region, compress the video information, and extract the most salient features from that region. As Figure 4 illustrates, three hidden layers (the “encoder”) convert a sequence of video frames into a fixed-length feature vector (about 2,000 features). This encoding significantly reduces the dimensionality of the feature space while retaining discriminative temporal information about changes in expressive behaviour.
  2. The prediction submodule is the output layer of the visual module. A trained neural network regressor (a multilayer perceptron, or MLP) takes the encoder’s features as input to predict skills and traits.

Figure 4. Schematic overview of the processing flow of the visual module. A hidden layer is located between the input and output of the network. It applies weights to its inputs and passes the result through an “activation” function to produce its output. In short, the hidden layers perform (nonlinear) transformations of the inputs entered into the network.
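
The flow described in the caption can be sketched in a few lines. The layer sizes, the ReLU activation, the random (untrained) weights, and the averaging of per-frame encodings over time are all assumptions chosen to keep the example small; the source does not specify how frames are pooled. What the sketch does show is how three hidden layers turn a variable-length frame sequence into a fixed-length vector, which a final regression layer maps to a score.

```python
import math
import random

def dense(x, W, b, activation):
    """One layer: weighted sum of the inputs plus bias, then an activation."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    W = [[rng.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
sizes = [16, 12, 10, 8]  # toy sizes; the real encoder outputs about 2,000 features
encoder = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
regressor = init_layer(sizes[-1], 1, rng)

def encode(frames):
    """Pass each frame through three hidden layers, then average over time."""
    encoded = []
    for x in frames:
        for W, b in encoder:
            x = dense(x, W, b, relu)
        encoded.append(x)
    return [sum(col) / len(encoded) for col in zip(*encoded)]

def predict_trait(frames):
    """Output layer: linear regression on the fixed-length encoding."""
    W, b = regressor
    return dense(encode(frames), W, b, identity)[0]

short_clip = [[rng.random() for _ in range(16)] for _ in range(5)]
long_clip = [[rng.random() for _ in range(16)] for _ in range(50)]
print(len(encode(short_clip)), len(encode(long_clip)))
print(predict_trait(short_clip))
```

Both clips yield an encoding of the same length even though they contain different numbers of frames, which is what allows a single regressor to be applied to recordings of any duration.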

¹ Neural networks pass data between units (“neurons”) as input and output values along weighted connections.

Vima further extends and optimizes the video and audio features for each concrete business case, selecting the most informative features to further boost performance.
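
One simple way to select informative features is to rank them by the strength of their correlation with the human ratings and keep the top few. The source does not say which selection method Vima uses, so the criterion, the function name `select_top_k`, and the synthetic data below are assumptions for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (vx * vy) ** 0.5

def select_top_k(X, y, k):
    """Indices of the k features most correlated (in absolute value) with y."""
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Synthetic data (assumption): five features, of which only 1 and 3 matter.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(300)]
y = [x[1] - x[3] + rng.gauss(0, 0.2) for x in X]

selected = select_top_k(X, y, k=2)
print(selected)
```

A univariate filter like this ignores interactions between features, so it is a baseline rather than a complete selection strategy.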

The language module of Vima’s prediction engine includes three major steps. First, the speech in the audio signal is transcribed to text. Vima’s product supports several languages (currently English, French, and German), and the language spoken in a video CV should match one of the supported languages. Second, the text transcription is encoded using a transformer-based technique for Natural Language Processing. The encoding algorithm was trained on a large volume of text, learning language representations in an unsupervised manner. These representations have been shown to be very effective in predicting traits and states from verbal behaviour in a video CV. Third, the last block of the module is a regression algorithm trained in a multi-task learning mode. The task of the regressor is to predict the corresponding personality traits and skills from the text encodings.
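
The multi-task idea in the last step can be sketched as follows: one shared hidden layer feeds several output heads, one per trait, and all heads are trained jointly so that they shape the same shared representation. Transcription and transformer encoding are not shown; random vectors stand in for text encodings, and the network size, learning rate, and two synthetic "traits" are assumptions for illustration.

```python
import math
import random

def forward(model, x):
    """Shared tanh layer followed by one linear head per task."""
    W, V = model
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return h, [sum(v * hk for v, hk in zip(row, h)) for row in V]

def train_multitask(X, Y, n_hidden=4, lr=0.05, epochs=200, seed=0):
    """SGD on the summed squared error of all tasks."""
    rng = random.Random(seed)
    d, t = len(X[0]), len(Y[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n_hidden)]
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(t)]
    for _ in range(epochs):
        for x, y in zip(X, Y):
            h, out = forward((W, V), x)
            err = [o - yt for o, yt in zip(out, y)]
            # Every task's error flows back into the shared layer.
            grad_h = [sum(err[i] * V[i][k] for i in range(t))
                      for k in range(n_hidden)]
            for i in range(t):
                for k in range(n_hidden):
                    V[i][k] -= lr * err[i] * h[k]
            for k in range(n_hidden):
                g = grad_h[k] * (1 - h[k] ** 2)
                for j in range(d):
                    W[k][j] -= lr * g * x[j]
    return W, V

def mse(model, X, Y):
    total = 0.0
    for x, y in zip(X, Y):
        _, out = forward(model, x)
        total += sum((o - yt) ** 2 for o, yt in zip(out, y))
    return total / (len(X) * len(Y[0]))

# Stand-in "text encodings" and two synthetic trait scores (assumptions).
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(120)]
Y = [[0.5 * x[0] + 0.2 * x[1], -0.4 * x[1] + 0.3 * x[2]] for x in X]

model = train_multitask(X[:90], Y[:90])
trained_mse = mse(model, X[90:], Y[90:])
zero_mse = sum(v * v for y in Y[90:] for v in y) / (30 * 2)
print(round(trained_mse, 4), round(zero_mse, 4))
```

The design choice worth noting is the shared layer: because correlated traits draw on the same representation, each task can benefit from the training signal of the others.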

Finally, fusion techniques are used to combine the predictions from individual modalities – visual, audio and speech (language). The fusion is done by combining outputs of multiple neural networks, where each was trained in the multitask mode. The fusion module boosts the overall performance of the product, making it higher than the performance of any individual modality.
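
A minimal late-fusion sketch is given below. It combines per-modality predictions by a weighted average, with weights inversely proportional to each modality's error on validation data. This weighting rule, the noise levels, and the synthetic scores are assumptions for illustration; the source states only that the outputs of several neural networks are combined.

```python
import random

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def fusion_weights(val_preds, val_target):
    """Weights proportional to 1 / validation MSE, normalised to sum to 1."""
    inv = {m: 1.0 / mse(p, val_target) for m, p in val_preds.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}

def fuse(preds, weights):
    """Weighted average of the modality-specific predictions."""
    n = len(next(iter(preds.values())))
    return [sum(weights[m] * preds[m][i] for m in preds) for i in range(n)]

# Synthetic example (assumption): each modality predicts the true score with
# its own independent error.
rng = random.Random(0)
truth = [rng.uniform(1, 5) for _ in range(500)]
noise = {"visual": 0.3, "audio": 0.4, "language": 0.5}
preds = {m: [t + rng.gauss(0, s) for t in truth] for m, s in noise.items()}

val, test = slice(0, 250), slice(250, 500)
weights = fusion_weights({m: p[val] for m, p in preds.items()}, truth[val])
fused = fuse({m: p[test] for m, p in preds.items()}, weights)

single_mse = {m: mse(p[test], truth[test]) for m, p in preds.items()}
fused_mse = mse(fused, truth[test])
print({m: round(v, 3) for m, v in single_mse.items()}, round(fused_mse, 3))
```

In this synthetic setting the fused error falls below that of the best single modality because the modality errors are independent and partly cancel; with strongly correlated errors the gain would be smaller.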


Language-specific Modelling and Cultural Diversity


Traits and skills can be expressed and perceived in different ways by speakers of different languages. Moreover, difficulty in speaking a given language may lead to lower perceived skills and traits. Vima actively prevents language-fluency and cultural bias from slipping into its algorithms by training only on data sets in which language and cultural variables are specified. Concretely, this means that language-specific prediction models are built on data from known native or highly fluent speakers from countries where that language is a mother tongue. It also means that these data were labelled by annotators who speak the same native language and have the same nationality as the individuals they evaluated (e.g. a native English-speaking annotator from the UK or US viewed and rated videos of native or highly fluent English speakers from the US or UK).

On the other hand, culturally singular models are a myth and would not reflect the reality of a culturally mixed world. That is why Vima tries to represent this culturally mixed reality in its models and communicates with full transparency about the rich cultural metadata of each prediction model (country of origin, country of residence, other languages spoken, etc.). This information helps Vima to give expert guidance and to select the most appropriate solution (model, application) for a company’s needs.
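
The metadata described here could be attached to each prediction model as a small record. The sketch below is hypothetical: the class name, field names, registry, and `select_model` helper are not Vima's actual data model. Only the English entry is filled in, because it is the only case for which the text names the countries involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelMetadata:
    """Cultural metadata published alongside a prediction model (hypothetical)."""
    language: str
    speaker_countries: tuple
    annotator_countries: tuple
    other_languages_spoken: tuple = ()

# Hypothetical registry keyed by language code. French and German models are
# mentioned in the text but omitted here, since their countries are not given.
REGISTRY = {
    "en": ModelMetadata(
        language="English",
        speaker_countries=("UK", "US"),
        annotator_countries=("UK", "US"),
    ),
}

def select_model(language_code):
    """Return the metadata of the model for a language, or fail loudly."""
    try:
        return REGISTRY[language_code]
    except KeyError:
        raise ValueError(f"no prediction model registered for {language_code!r}")

card = select_model("en")
print(card.language, card.speaker_countries, card.annotator_countries)
```

Failing loudly on an unregistered language reflects the requirement that the language spoken in a recording match a model trained and annotated for that language.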

Finally, model-adaptation techniques can be used to compensate for cultural differences during automatic behaviour modelling. Vima uses transfer learning to adapt prediction models trained on a multicultural dataset to new cultures, even with a limited number of training instances. As more data from the new culture become available, the adapted model can be refined further to improve the accuracy of its recommendations.
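
One simple form of transfer learning is fine-tuning: training starts from the weights of the existing model and continues on the few examples available for the new culture, with a penalty that keeps the weights close to their starting point. The sketch below applies this to a linear model. The weights, the penalty strength `pull`, and the synthetic data are assumptions for illustration and do not describe Vima's actual adaptation method.

```python
import random

def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    return sum((sum(wj * xj for wj, xj in zip(w, x)) - t) ** 2
               for x, t in zip(X, y)) / len(y)

def fine_tune(w_source, X, y, lr=0.05, epochs=200, pull=0.1):
    """Gradient descent from w_source with an L2 penalty toward w_source."""
    w = list(w_source)
    n = len(X)
    for _ in range(epochs):
        grad = [pull * (wj - sj) for wj, sj in zip(w, w_source)]
        for x, t in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, x)) - t
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

def make_data(rng, w_true, n):
    X = [[rng.gauss(0, 1) for _ in w_true] for _ in range(n)]
    y = [sum(wj * xj for wj, xj in zip(w_true, x)) + rng.gauss(0, 0.2) for x in X]
    return X, y

rng = random.Random(0)
w_source = [0.8, -0.5]       # weights learned on the multicultural dataset
w_target_true = [0.6, -0.2]  # feature-rating relation in the new culture

X_small, y_small = make_data(rng, w_target_true, 15)   # limited target data
X_test, y_test = make_data(rng, w_target_true, 200)

w_adapted = fine_tune(w_source, X_small, y_small)
source_mse = mse(w_source, X_test, y_test)
adapted_mse = mse(w_adapted, X_test, y_test)
print(round(source_mse, 3), round(adapted_mse, 3))
```

The penalty term is what makes a small target sample usable: the adapted model only moves away from the source weights as far as the new data justify.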

In sum, language-specific models are necessary because language affects the expression and perception of personality and skills. These models are not culturally sterile but represent the cultural variety that also characterizes today’s society. Finally, adaptation techniques such as transfer learning provide computational solutions for extrapolating to cultures and languages other than those defined in a prediction model.

Ground Truth in Person Assessment


Emotions, personality traits, and soft skills cannot be sensed or observed directly. Social sciences are often faced with so-called latent constructs and need to find solutions to make them measurable. In emotion research, for example, researchers experimentally induce affective states in their participants and then search for behaviour correlates for each state. Here, the elicited emotion is known and defined by the experimenter; it is considered the criterion or “ground truth”. Its behavioural manifestations are therefore considered truthful representations of an underlying internal state.

Alternatively, and for several theoretical and practical (even ethical) reasons, researchers can collect less “laboratory-controlled” but more naturally occurring expressions of states, traits and skills. Moreover, stable personal characteristics such as personality traits or skills cannot, of course, be experimentally induced or controlled in the way emotions can. Thus, a common approach in assessing traits and skills is the use of large databases of natural, labelled expressions.

Vima owns several databases containing a large number and variety of natural expressions, labelled by a large group of human annotators with expertise in emotion, personality, and skill assessment. Annotation quality assurance is essential for establishing a reliable ground truth; best-practice statistical measures are therefore used to establish reliability (interrater agreement). When a type of expression is unanimously evaluated by experts as socially skilled, the behaviours that discriminate this set of expressions from other expressions have high predictive power for that skill.
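
As an illustration of an interrater agreement measure, the sketch below computes Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. The source does not name the statistics Vima uses, and the ratings below are invented for the example. With more than two raters or with continuous ratings, measures such as Fleiss' kappa, Krippendorff's alpha, or the intraclass correlation would be the usual choices.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented answers of two annotators to "Is the person willing, eager to talk?"
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "yes", "no", "no", "no", "no", "yes", "no", "yes", "yes"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # observed 0.9, expected 0.5, kappa 0.8
```

The raters agree on 9 of 10 videos, but half of that agreement would be expected by chance given how often each says "yes", which is why kappa (0.8) is lower than the raw agreement (0.9).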

In sum, absolute ground truth does not exist in the social sciences. Instead, the best practice is to establish a solid benchmark that serves as relative ground truth. Several practices have been used, ranging from laboratory-controlled elicitation to large-scale databases of expressions described by consensus labels.

Vima fosters a behavioural approach with scaling potential by leveraging state-of-the-art machine learning algorithms on quality labelling of human behaviour to obtain solid and reliable ground truth.