Symposium: Facing Choices When Modelling Modern Slavery Risk
The work by Pablo Diego-Rosell and Jacqueline Joudo Larsen represents a remarkable step forward for the scientific study of modern slavery. First, through the Walk Free Foundation’s collection of nationally representative data on forced labour and forced marriage in collaboration with the Gallup World Poll, we have seen the evidence base substantially enriched, providing researchers with the necessary building blocks for the 2017 Global Estimates of Modern Slavery. In “Modelling the Risk of Modern Slavery” the authors further enhance our understanding of the factors that shape vulnerability to slavery at the individual and national levels of analysis. The multistage modelling exercise presented in this paper showcases an impressive level of technical rigour, expanding the horizon of methodological approaches in the area of modern slavery.
As with all modelling of any phenomenon, researchers face choices regarding empirical strategy and a range of limitations flowing from those choices. Choosing to press beyond the exploratory or explanatory towards extrapolation and prediction is a bold move that can open doors to new knowledge. Along with that innovation comes concern about whether prediction is appropriate or possible at this early stage, when the data environment is as sparse and unevenly developed as it is in this particular field. With that in mind, I have centred my discussion on the choices the authors made in terms of model selection and estimation method.
When selecting independent variables to include in a predictive (or explanatory) model, the approach used by quantitative researchers often falls into one of two camps: theory-driven or data-driven modelling. The more classical approach in the social sciences is theory-driven. We build models of the world that shave away noise and complexity in order to isolate the main causal mechanisms we hypothesize are responsible for change in the phenomenon we are interested in predicting, in this case modern slavery risk. Normally, we have an idea about primary influences, which are spelled out in our hypotheses. These main influences are measured and tested in models alongside many other variables that are assumed to have some effect on the outcome of interest.
More recently, with ever-increasing computing speeds, using data to find answers to questions has become far more exploratory and less driven by stories about how we see the world. There is arguably nothing fundamentally wrong with taking a more inductive, exploratory approach to model selection. It can be messier than the theory-driven approach when it comes to understanding why some characteristic or factor has an influence on the outcome. The added advantage of the data-driven approach is that it does not rely on the way an individual or institution sees the world. Instead, the best predictive model is selected on the basis of statistical evidence of model fit.
The model selection strategy presented in the paper takes a middle-ground approach. As the Gallup World Poll instrument collects information on 157 variables, there are many choices when it comes to specifying a model. The approach used by Diego-Rosell and Larsen is to assess each variable’s correlation with forced labour or forced marriage, discarding variables one by one if a significant effect is not apparent. In the next stage, the variables found to have a significant bivariate correlation with forced labour or forced marriage are entered in groups, and those found insignificant are excluded.
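The screening step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual procedure: the data, the variable names and the correlation threshold are all invented for the example.

```python
# Illustrative sketch of bivariate screening: keep only those candidate
# predictors whose absolute correlation with a binary outcome clears a
# chosen threshold. (Toy data and threshold; not the authors' procedure.)
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def screen(candidates, outcome, threshold=0.3):
    """Return names of variables whose |r| with the outcome exceeds threshold."""
    return [name for name, values in candidates.items()
            if abs(pearson(values, outcome)) > threshold]

# Toy data: 'age' tracks the outcome closely, 'noise' does not.
outcome = [0, 0, 1, 1, 0, 1, 1, 1]
candidates = {
    "age":   [20, 22, 40, 45, 25, 50, 48, 44],
    "noise": [1, 9, 2, 8, 3, 7, 4, 6],
}
print(screen(candidates, outcome))  # only 'age' survives the screen
```

A real implementation would screen on the p-value of each correlation rather than a fixed cutoff, but the one-variable-at-a-time logic is the same, and so is its blind spot, discussed next.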
The problem with this approach is that two variables can appear to have no correlation with one another because the relationship between them is more complex, possibly nonlinear or interactive. Furthermore, when variables are included in a model, their effect on the outcome depends heavily on the other variables on the right-hand side of the model. So, for example, if age were included as a predictor in a model with other characteristics, such as years of education, income or political preference, the influence of age might be wiped out by the effects of those other variables. Age might then be discarded as a non-influential characteristic when it actually predicts the outcome quite well.
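The nonlinearity point above has a minimal numeric illustration: a perfectly deterministic but U-shaped relationship produces a Pearson correlation of exactly zero, so a bivariate screen would discard the variable despite its predictive power. The data here are synthetic.

```python
# A U-shaped relationship: y is completely determined by x, yet the
# linear (Pearson) correlation between x and y is exactly zero.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]                 # y = x**2, fully determined by x

print(pearson(x, y))                   # 0.0 -- a linear screen sees nothing
print(pearson([v * v for v in x], y))  # ~1.0 -- the relationship is there
```

Transforming the predictor (here, squaring it) recovers the relationship, which is exactly the kind of structure a one-shot bivariate screen cannot detect.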
That said, it seems extremely unlikely to me that research this technically thorough would make obvious errors like the one in my example. However, the justification of model selection would be strengthened by adhering to the more traditional modelling approach informed by more robust theory, which the authors acknowledge, or by using more advanced, data-driven model selection techniques, such as various model averaging approaches or fully automated model selection using regression methods or random forest modelling.
The choice to employ Bayesian estimation instead of the more common frequentist approach, which is built around hypothesis testing, requires better explanation.
Bayesian statistics differ from frequentist statistics in their ability to incorporate prior information about the world, which seems appealing. We might hypothesize that gender has an influence on the risk of forced marriage. In the frequentist approach, the null hypothesis, that gender does not have an influence on forced marriage, is assumed to be true until we have evidence to the contrary. A Bayesian approach instead starts from prior information suggesting gender does have such an influence, and updates that belief as the data come in.
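The prior-updating mechanics can be shown with a stylized beta-binomial example. This is not the authors' model, and every number below is invented for illustration: suppose prior evidence suggested about 60 per cent of forced-marriage victims in some region are women, and a small new sample is then observed.

```python
# Hedged sketch of Bayesian updating (beta-binomial conjugacy).
# All figures are hypothetical and chosen only to illustrate the mechanics.
prior_a, prior_b = 6, 4      # Beta(6, 4) prior: prior mean 0.6
women, n = 7, 10             # hypothetical new sample: 7 women in 10 cases

post_a = prior_a + women           # posterior is Beta(13, 7)
post_b = prior_b + (n - women)

prior_mean = prior_a / (prior_a + prior_b)   # 0.6  (prior belief alone)
data_only  = women / n                       # 0.7  (frequentist point estimate)
post_mean  = post_a / (post_a + post_b)      # 0.65 (prior pulled toward the data)

print(prior_mean, data_only, post_mean)
```

The posterior sits between the prior belief and the raw data, which is precisely why the quality of the prior matters: a poorly grounded prior pulls every estimate toward it.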
Correct use of Bayesian modelling is thus predicated on the existence of solid prior information. Do we have enough solid prior information about the risk of modern slavery to justify using it to guide risk models? I think it would be difficult to convince researchers in this space that we do, or that the approach does not introduce bias. The authors also point out that the results of the Bayesian modelling align with the frequentist version of the hierarchical models. If the less complex modelling yields the same results, the Bayesian approach is not necessary and may even overcomplicate the analysis, interpretation and communication of the findings.
Attempting to present results of an innovative model of slavery risk, whether explanatory or approaching prediction, is a foundational step towards knowing more about which individuals may require greater social protection, which communities may be more vulnerable and which countries may face the largest challenges. Extending the model to make inferences about countries where information is very limited presents challenges, and the authors are appropriately transparent about these limitations in the paper.
Pressing on with analysis despite imperfect data environments has inspired the use of novel proxies, instrumental variables and a range of statistical modelling approaches. Innovating in the pursuit of estimating individual risk of modern slavery is moving the field forward. As long as the current limitations are presented clearly so that we can attempt to overcome them, I am confident that future research in the field will build on this important and ground-breaking work.
This piece has been prepared as part of the Delta 8.7 Modelling the Risk of Modern Slavery symposium.
Dr Kelly Gleason is the Delta 8.7 Data Science Lead.
This article has been prepared by Dr Kelly Gleason as a contributor to Delta 8.7. As provided for in the Terms and Conditions of Use of Delta 8.7, the opinions expressed in this article are those of the author and do not necessarily reflect those of UNU or its partners.