The Importance of Standardized Child Labour Data to Machine Learning and AI

2 March 2022
Research Innovation

Elizabeth Burroughs  | Head of Data, HACE
Anahad Kaur Khangura  | Lead Social Data Researcher, HACE
Eleanor Harry  | Managing Director, HACE

Machine learning and AI require consistent, reliable data in order to produce analyses for effective policy to eradicate child labour. Data on child labour is collected by many governmental and international sources, such as national statistical offices, the International Labour Organization and UNICEF-MICS. However, variables are not measured or collected consistently throughout repeated, longitudinal surveys, and they are not standardized between different sources. HACE uses AI and machine learning to generate cross-sectoral analyses of child labour, but to do so, we must standardize secondary data and produce our own time series models which are appropriate for machine learning. During this standardization process, we see the following problems repeated across data sources and across time. We use the collection of child labour data in Bangladesh as an example.

Age groups

Child age is important in identifying the severity and nature of child labour, as seen in the ILO Minimum Age Convention 138, national laws and private sector policies. However, data is often collected in a way that makes it impossible to confirm child labour or its severity. For example, the Census of Agriculture 2008 for Bangladesh does not provide data in age ranges from which child labour can be analysed. In the census, the age ranges for children engaged in agricultural work are 0-10 years, 10-14 years, and 15+ years, which do not correspond to the internationally recognized age groupings for child labour (5-11 years, 12-14 years and 15-17 years). Analysis of permissible work versus child labour is therefore not possible in this example: under Bangladesh national law, a child of 14 years is allowed to work more hours than a child of 13 years, yet both are recorded in the same group in the census.

For the sake of accurate data analysis, it is therefore essential to collect discrete ages instead of age groupings; for example, a child should be recorded as 7 years old, not 6-11 years old. The discrete ages can later be aggregated to suit the analytical needs of whichever actor or machine learning algorithm is using them. Additionally, international age groups on child labour range from 5 to 17 years, despite the prevalence of child labour in the 0-5 age group (see Figure 1) and the severity of the Occupational Safety and Health risks associated with children under 5 years old.
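The point about aggregating discrete ages can be illustrated with a minimal Python sketch. The function name `aggregate_ages`, the sample ages and the second grouping are hypothetical and chosen only for illustration; the first grouping uses the international age groups quoted above. The sketch shows that the same discrete records can be re-binned under any scheme, whereas pre-grouped data cannot.

```python
from collections import Counter

# International child labour age groups named in the text (inclusive bounds).
INTERNATIONAL_GROUPS = [(5, 11), (12, 14), (15, 17)]


def aggregate_ages(ages, groups):
    """Count how many discrete ages fall into each inclusive (low, high) group."""
    counts = Counter()
    for age in ages:
        for low, high in groups:
            if low <= age <= high:
                counts[f"{low}-{high}"] += 1
                break
        else:
            # Ages not covered by any group are kept visible, not dropped.
            counts["outside groups"] += 1
    return dict(counts)


# Hypothetical records, one discrete age per child (not real survey data).
ages = [4, 7, 10, 13, 14, 16]

# Aggregate to the international groupings.
print(aggregate_ages(ages, INTERNATIONAL_GROUPS))

# Re-aggregate the same records under a hypothetical grouping that
# separates 13-year-olds from 14-year-olds.
print(aggregate_ages(ages, [(0, 13), (14, 17)]))
```

Because each child is stored with a single age, the child aged 4 remains visible as falling outside the international groups, and a split between ages 13 and 14 can be applied without returning to the field.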

Figure 1: Estimated ages at which working children (child labour by international definition) started work. “Most conservative estimate” describes a scenario where every child from the interviewed group of children aged 5-9 years old was 9 years old. “Least conservative estimate” describes a scenario where every child from the interviewed group of children aged 5-9 years old was 5 years old. We can assume the true prevalence is somewhere between the two estimates. Adapted from Bangladesh Child Labour Survey 2002-03.

Types of activity

Since 2000, UNICEF has collected data on child labour in more than 50 Multiple Indicator Cluster Surveys (MICS) following a standard module questionnaire. In 2010, the MICS definition for child labour was revised to align with international standards, with the guidance of the ILO. As a result, early MICS data is not comparable with data collected in subsequent rounds of surveys, because “child labour” no longer includes hazardous working conditions, which is now a separate indicator. This limitation is often acknowledged by UNICEF-MICS, but beyond acknowledgement, there appear to be few mitigating actions taken to address its impact on data analysis.

Time series

Intervals between data collection on child labour vary between countries and sources. In the example of UNICEF-MICS, the timing of the surveys varies based on how often other surveys are carried out in the country, as they aim to supplement governmental data. This is logical and should work, but the lack of standardization of child labour definitions regarding age groupings and types of activities between government surveys and UNICEF-MICS data renders much of the data incomparable and leads to various half-formed time series. There are also issues with consistency of data collection. For example, the 2012-13 MICS for Bangladesh is missing data on child labour, despite it being a set MICS indicator[1]. A lack of standardization in the collection of indicators impairs the formulation of a time series for the purpose of analysis.
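A short Python sketch can make the idea of a "half-formed" time series concrete. The function name `collection_gaps`, the years and the indicator values below are hypothetical placeholders, not figures from any survey; the sketch only shows how missing rounds and uneven intervals might be surfaced before any modelling is attempted.

```python
def collection_gaps(series):
    """Summarise a survey series given as {year: value or None}.

    Returns the years with usable data, the years where the indicator
    was not reported, and the intervals between usable years.
    """
    usable = sorted(year for year, value in series.items() if value is not None)
    missing = sorted(year for year, value in series.items() if value is None)
    intervals = [later - earlier for earlier, later in zip(usable, usable[1:])]
    return usable, missing, intervals


# Placeholder series: None marks a survey round that did not report
# the child labour indicator. All years and values are illustrative only.
series = {2000: 10.0, 2005: None, 2010: 9.0, 2012: 8.5}

usable, missing, intervals = collection_gaps(series)
print("usable years:", usable)
print("rounds missing the indicator:", missing)
print("intervals between usable years:", intervals)
```

In this placeholder series, one missing round turns two five-year intervals into a single ten-year gap followed by a two-year gap, which is the kind of irregular spacing most time series methods have to be adapted to handle.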

Variation in regional groupings

The various international regional groupings provided in the Child Labour Global Estimates 2020: Trends and the Road Forward report (ILO regions, SDG regions and UNICEF regions; see Table 1) are not standardized across organizations and therefore across surveys and data collection. This makes it challenging to form a longitudinal time series of global trends or to disaggregate data accurately into countries.

ILO regions | SDG regions | UNICEF regions
Sub-Saharan Africa | Sub-Saharan Africa | Sub-Saharan Africa
Africa | Northern Africa and Western Asia | Middle East and North Africa
Arab States | Eastern and South-Eastern Asia | South Asia
Europe and Central Asia | Central and Southern Asia | Europe and Central Asia
Americas | Europe and Northern America | North America
Latin America and the Caribbean | Latin America and the Caribbean | Latin America and Caribbean
Asia and the Pacific | | East Asia and Pacific
Table 1: Different regional groupings by the ILO, SDG and UNICEF; adapted from Child Labour Global Estimates 2020: Trends and the Road Forward
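One way to work around differing regional groupings, when country-level data is available, is to keep a crosswalk from each country to its region under every scheme and aggregate on demand. The Python sketch below is an illustration under stated assumptions: the names `REGION_CROSSWALK` and `aggregate_by_scheme` are hypothetical, the country values are placeholders, and the two country-to-region assignments reflect our understanding of the three schemes and should be checked against each organization's published country lists.

```python
from collections import defaultdict

# Illustrative crosswalk: country -> region under each grouping scheme.
# Assignments are assumptions to be verified against official lists.
REGION_CROSSWALK = {
    "Bangladesh": {
        "ILO": "Asia and the Pacific",
        "SDG": "Central and Southern Asia",
        "UNICEF": "South Asia",
    },
    "Egypt": {
        "ILO": "Africa",
        "SDG": "Northern Africa and Western Asia",
        "UNICEF": "Middle East and North Africa",
    },
}


def aggregate_by_scheme(country_values, scheme):
    """Sum country-level values into the regions of the chosen scheme."""
    totals = defaultdict(float)
    for country, value in country_values.items():
        region = REGION_CROSSWALK[country][scheme]
        totals[region] += value
    return dict(totals)


# Placeholder country-level counts (not real estimates).
country_values = {"Bangladesh": 100.0, "Egypt": 50.0}

for scheme in ("ILO", "SDG", "UNICEF"):
    print(scheme, aggregate_by_scheme(country_values, scheme))
```

The same two countries fall under a different region name in every scheme, so regional totals published under one scheme cannot be compared with another unless the underlying country-level figures are accessible.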

Effect on understanding the impact of COVID-19

In the absence of standardization measures, it is difficult to see the impact of COVID-19 on child labour. Incomparable data adversely impacts the formulation of any time series and thus impairs any analysis or machine learning that relies on it. Without standardized data on child labour from before, during and after the pandemic, that impact cannot be reliably determined.

The impact of COVID-19 on child labour is difficult to ascertain, particularly so when we see the prevalence of child labour in all wealth index quintiles, including higher income households (see Figure 2). Only the holistic consideration of all potential drivers of child labour can help determine the impact of the pandemic on child labour. In the Child Labour Global Estimates 2020: Trends and the Road Forward report, the ILO explores school closures, food insecurity and a lack of legal youth employment opportunities as drivers of child labour during COVID-19. However, as there is limited data on these variables before and after the pandemic, it is impossible to measure how changes in these factors caused by COVID-19 contributed to increased child labour.

Figure 2: Non-hazardous and hazardous child labour by wealth index quintile; adapted from Bangladesh MICS 2019.

A road forward

Standardizing or improving previous data is challenging without open access to raw data and collaboration between leading organizations in the field. In general, we should improve standardization of data measurement and collection through collaboration. This can be achieved in various ways, which include ensuring consistent collection of standardized child labour data, collecting data on discrete ages, and maintaining one set of regional groupings to make country disaggregation easy for the purpose of analysis and time series.

By standardizing data on child labour and its related variables, we can better examine the trends in child labour and therefore use AI and machine learning to combat the issue. As child labour is a complex issue and a unique form of labour, data on it should be collected and analysed through a purely child labour lens by child labour data experts, not as a subset of modern slavery or from an adult labour force perspective. The limitation of utilizing machine learning and AI solutions to tackle child labour is not yet a data science problem; it is a data problem.

This article has been prepared by Elizabeth Burroughs, Anahad Kaur Khangura, and Eleanor Harry as a contribution to Delta 8.7. As provided for in the Terms and Conditions of Use of Delta 8.7, the opinions expressed in this article are those of the authors and do not necessarily reflect those of UNU or its partners.


[1] Missing child labour data from Bangladesh MICS 2012-13 may be explained by the Bangladesh Bureau of Statistics Child Labour Survey 2013, as UNICEF-MICS is designed to work around governmental surveys already in place.

This piece has been prepared as part of the Code 8.7 Symposium: Using Tech-Driven Data to Address Child Labour.