How many times have you provided personal information when filling out a hospital form, applying for a home loan, or simply responding to a national survey such as the Census? Government information systems hold data on many aspects of citizens’ lives over the years, and these data can help science advance its understanding of society.
To use these data, however, researchers face technological challenges, such as how to make the systems “talk” to one another, since government registers have no single number that identifies each citizen. Record linkage arose to solve this problem. It is a methodology that assesses the similarity of records either deterministically (when a unique identifier exists, as in the social registers that use the Social Identification Number) or probabilistically (using a combination of information such as name, date of birth and mother’s name).
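The difference between the two approaches can be illustrated with a minimal Python sketch. It is not the algorithm used by any of the teams mentioned here: the field names (`nis`, `name`, `birth_date`, `mother_name`), the 0.9 threshold and the simple average of string similarities are all assumptions made for illustration. Real probabilistic linkage usually weights each field rather than averaging them.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Upper-case the value and collapse repeated whitespace."""
    return " ".join(value.upper().split())


def deterministic_match(a, b, key="nis"):
    """Match only when both records carry the same non-empty unique identifier.

    The key name "nis" is a hypothetical field standing in for a unique ID.
    """
    return bool(a.get(key)) and a.get(key) == b.get(key)


def similarity(a, b, fields=("name", "birth_date", "mother_name")):
    """Average string similarity (0.0 to 1.0) over the comparison fields.

    A plain average is an illustrative simplification, not a weighted model.
    """
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def probabilistic_match(a, b, threshold=0.9):
    """Declare a match when the average similarity reaches the threshold.

    The 0.9 threshold is an assumed value chosen for this example.
    """
    return similarity(a, b) >= threshold
```

With this sketch, two records that share an identifier match deterministically, while records without one can still be linked if their names, birth dates and mothers’ names are similar enough, for example when one name contains a typing error.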
The potential, the current outlook and the challenges of linking administrative data for scientific research were addressed in a study published in the journal “Big Data and Society” and made available in early December.
The article “Challenges in administrative data linkage for research” brought together researchers from different parts of the world in an attempt to better understand the various aspects of the method, among them Mauricio Barreto, coordinator of the Center for Data Integration and Knowledge for Health (Cidacs / Fiocruz Bahia). The team also included researchers from the London School of Hygiene & Tropical Medicine, the University of Edinburgh and the University of Bristol, UK; the Institute for Clinical Evaluative Sciences, Canada; and Curtin University, Australia.
The researchers argue that, while it does not replace classical studies based on primary data collection, analyses based on linked data can answer questions that require large samples or detailed data. The method also has advantages over cohorts and survey questionnaires, which are costly and suffer from low response and / or adherence rates (in the case of cohorts, which follow the same individuals throughout life).
Separating linkage production from data analysis is generally considered good practice in the field, because it helps protect the confidentiality of information. The data scientist uses sensitive information (which could identify an individual, such as name and date of birth) to link data from different databases, and provides the researcher with linked, de-identified data, that is, data without the identifying information. What matters to scientific research is the collective results generated from individual information, not the data of any one person. However, the authors of the paper recognize that this practice may limit the analysis, since part of the process becomes opaque to those who analyze and interpret the linked data.
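As a rough illustration of this separation, the Python sketch below shows a linking step that uses identifiers to join two datasets and then hands over only de-identified records. The field names, the `IDENTIFIERS` tuple and the function `link_and_deidentify` are hypothetical, and the exact-match join on a shared key is a simplifying assumption. It does not describe the procedure followed at any particular center.

```python
import uuid

# Hypothetical identifying fields that must never reach the analyst.
IDENTIFIERS = ("name", "birth_date", "mother_name")


def link_and_deidentify(social, health, key="nis"):
    """Join two lists of records on a shared key, then strip identifiers.

    Runs on the linkage side only. The returned records keep the analytic
    attributes plus a random link_id that carries no personal information.
    """
    health_by_key = {record[key]: record for record in health}
    linked = []
    for social_record in social:
        health_record = health_by_key.get(social_record[key])
        if health_record is None:
            continue  # no counterpart in the second dataset
        merged = {**social_record, **health_record}
        # Remove the identifiers and the linkage key itself.
        for field in IDENTIFIERS + (key,):
            merged.pop(field, None)
        merged["link_id"] = uuid.uuid4().hex
        linked.append(merged)
    return linked
```

The analyst receives only the output of this function, which is also why part of the process is hidden from them: they cannot see which records failed to link or why.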
Other problems related to linkage are data quality, since information is not always recorded accurately, and the difficulty of linking some datasets in the absence of a standard identifier.
Linkage is the main method used at Cidacs to integrate big data. At the center, administrative data from different databases are linked to answer relevant scientific questions in health, such as the analysis of the impact of public social policies on the health of Brazilians. That analysis is carried out in the 100 Million Cohort, which combines data from social programs and health information systems.
Learn more about how linkage is applied at Cidacs in this interview with data scientist Robespierre Pita, a Ph.D. in Computer Science responsible for developing the Center’s linking algorithms.
What are the main challenges faced by the data production team in this area? How do you align the work with the needs of each researcher’s scientific question?
The current Big Data context still lacks tools and infrastructure capable of handling large data repositories. That is why we make a point of working with state-of-the-art technology to handle these data and to ensure that database queries return ever faster responses. In addition, the scientific questions demand an exhaustive analysis of the data, so we need to make good use of the available computational resources and keep response times acceptable. Another underlying challenge is the confidentiality of the individual data used in research and analysis to generate aggregate results. This calls for a dedicated infrastructure to deal with this type of problem and prevent data from being misappropriated [Safe Room].
What progress has been made in the area in this first year of operation?
In this first year we were able to gain a thorough understanding of many of the databases in the Center’s custody. In addition, we were able to link large pairs of databases, such as the Cohort Baseline based on Notifications (114 million records) and Hospitalization by Tuberculosis (1.2 million).
What are the accuracy rates of the currently linked data?
Currently our tools achieve accuracy above what is reported in the [scientific] literature, where results average 95%. Our results are around 98%. In that sense, what Cidacs is doing in this area is unprecedented in Brazil.
What is it like to be a pioneer? What is the Center’s role in spreading this method in Brazil?
In our state, our region or even the country as a whole, there is no professional or academic training focused on the work we do here. Even so, we have a wonderful, innovative and dedicated computing team capable of handling all the difficulties and requirements of an environment like Cidacs. Being a pioneer makes us responsible for ensuring the reproducibility and scientific acceptance of our work, both to support existing projects and to allow what has already been done to be validated. Still, being on the crest of the wave of innovation in this area, and being recognized for it, is rewarding.