Owing to the complexity of our health care system, patients access care in various ways and settings, and their medical information is captured in an array of electronic medical records (EMRs) that do not always communicate with one another.1 Thus, if you want to gather information about patients for a quality improvement or research study, you might find yourself confronted with several unlinked data sources and need an efficient way to link them. The linkage must also comply with patient privacy laws and ethical requirements.2 This task may appear impossible at first, but it can be accomplished using 2 techniques from computer science: hashing and salting.
Real-life research conundrum
Consider a situation where you wish to study trends in bloodwork of patients in your town over the past decade. These patients would have sought care from different physicians and clinics, and the bloodwork ordered by one physician (or clinic) would rarely or never have been shared with another. As you prepare to gather data from EMRs around town, you ponder how you will link patient-specific data from multiple clinics together.
In this situation, you might have been advised to assign random unique alphanumeric codes (eg, A1A1A1) to individual patients (Figure 1). The disadvantage of this is that you cannot conduct studies across multiple physicians or health care settings because, without access to a patient’s identifying information (eg, name, date of birth, provincial or territorial health card number), there is no way to assign the same alphanumeric code to that same patient at another location. Thus, a patient who sought care at different clinics would be assigned different codes at each site, and their data could never be linked.
Practical solution
Hashing is a 1-way algorithm that takes data you want to secure (eg, health card numbers) and turns the information into scrambled strings of characters.3 This procedure has become a key aspect of Internet security and is used in other popular technologies, such as cryptocurrency.3,4
Hashing works by applying an algorithm to patient data that are highly unlikely to change over a person’s lifetime (eg, date of birth, health card number, first name) to create a unique identifier, making it possible to link a patient’s data collected from multiple sites. The study team chooses which personal data to input into the algorithm and, as long as the exact same fields, formatted the same way, are used each time, the same unique identifier will be produced for an individual patient every time, across time and across different health care settings (Figure 1).
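To make the idea concrete, here is a minimal sketch in Python of how the same patient fields can yield the same identifier at 2 different clinics. It is an illustration only, not the program used in our study: the choice of SHA-256, the function names, and the normalization step (trimming spaces, lowercasing, dropping punctuation) are all assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Assumed clean-up step: trim, lowercase, and keep only letters and
    digits, so that formatting differences between EMRs do not change
    the resulting hash."""
    text = unicodedata.normalize("NFKC", value).strip().lower()
    return "".join(ch for ch in text if ch.isalnum())


def make_identifier(first_name: str, health_card: str, birth_year: str) -> str:
    """Hash the chosen fields into one identifier (SHA-256 is an assumption)."""
    fields = [normalize(first_name), normalize(health_card), normalize(birth_year)]
    joined = "|".join(fields)  # separator keeps field boundaries unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# The same patient, recorded with different formatting at 2 clinics
clinic_a = make_identifier("John", "999-999-999", "1988")
clinic_b = make_identifier(" john ", "999 999 999", "1988")

# A different patient
other = make_identifier("Jane", "999-999-999", "1988")

print(clinic_a == clinic_b)  # same fields give the same identifier
print(clinic_a == other)     # different fields give a different identifier
```

Without a normalization step of some kind, a stray space or a difference in capitalization between EMRs would produce a completely different identifier and the records would not link.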
Figure 1. Assignment of unique alphanumeric identifiers to a patient using standard practices versus with hashing and salting
How secure is this approach?
As an example, feeding John (first name), 999-999-999 (provincial or territorial health card number), and 1988 (year of birth) into a hashing algorithm produces the unique identifier 221c5ae9b14d19bd469ca529b41cc102. Reverse engineering this code to obtain the data from which it originated would require trying 4.54×10¹⁷ combinations, the rough equivalent of 1 million computer years. However, computer processing speeds have increased: a code that would have taken 1 million computer years to reverse engineer in the early 2000s could be cracked by a single computer in 1 year in 2023. To counter this, a salt can be added to the hashing algorithm to make the calculations even more complex.5 A salt is a block of characters, chosen or randomly generated, that is combined with the input data before hashing and adds trillions more years to any attempted reverse-engineering calculations. These estimates also assume that the reverse engineer knows which data fields were used to create the unique identifier. Because it is unlikely that a researcher would reveal those fields, an additional layer of protection is created, in keeping with the principle of security through obscurity.
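The effect of a salt can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our study's program: SHA-256, the 32-byte salt length, and the name salted_identifier are choices made for the example. One practical point follows from how linkage works: the same salt has to be used at every site, and it has to be kept secret by the study team.

```python
import hashlib
import secrets


def salted_identifier(fields: list[str], salt: bytes) -> str:
    """Combine a secret salt with the patient fields before hashing.
    (SHA-256 and simple concatenation are assumptions for this sketch.)"""
    joined = "|".join(fields).encode("utf-8")
    return hashlib.sha256(salt + joined).hexdigest()


# One study-wide salt, generated once and kept confidential (assumed 32 bytes)
study_salt = secrets.token_bytes(32)

patient = ["John", "999-999-999", "1988"]

# Without a salt, anyone who guesses the fields can recompute the hash
unsalted = hashlib.sha256("|".join(patient).encode("utf-8")).hexdigest()

# With the shared salt, 2 sites still produce matching identifiers
site_1 = salted_identifier(patient, study_salt)
site_2 = salted_identifier(patient, study_salt)

# A different salt produces an unrelated identifier for the same patient
other_salt = salted_identifier(patient, secrets.token_bytes(32))

print(site_1 == site_2)      # True: linkage is preserved
print(site_1 == unsalted)    # False: the salt changes the output
print(site_1 == other_salt)  # False: the salt must match across sites
```

An attacker who does not know the salt must guess it in addition to the patient fields, which is what multiplies the number of combinations to try.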
How did it work in real life?
The situation described above is one that our research team encountered while examining bloodwork trends among patients in a mid-sized town in British Columbia. Here are the steps we followed:
1. After agreeing which personal patient data would be inputted, our team’s computer programmer wrote a hashing program.
2. Our team’s research assistant visited participating clinics and, using the typical EMR reporting function, created reports with the data of interest (ie, bloodwork). These reports included personally identifying patient data.
3. The reports were exported into spreadsheet files, which were collected in a folder on the clinic computer’s desktop.
4. The research assistant installed the hashing program on the clinic computer’s desktop.
5. The data folder was processed using the hashing program. The program removed personally identifying data and replaced them with unique identifiers (hashes). Thus, hashed files were created.
6. The hashed files were copied to a flash drive that the research team took from the clinic.
7. The original data reports, hashed files, and hashing program were deleted from the clinic’s computer, thus destroying all retrieved personal patient information and ensuring the details of the hashing program remained confidential.
8. Once all hashed files from all participating clinics had been retrieved, an automated search looked across all files for repeated unique identifiers. When these were found, the data belonging to a given unique identifier were linked together.
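The hashing step in the list above, in which identifying data are removed from an exported report and replaced with hashes, might look something like the following Python sketch. It is not our team's program. The column names (first_name, health_card, birth_year, test, value), the salt value, and the use of SHA-256 are hypothetical, and the report is held in memory here rather than read from a spreadsheet file so that the example runs on its own.

```python
import csv
import hashlib
import io

# Hypothetical names of the identifying columns in the exported report
ID_COLUMNS = ["first_name", "health_card", "birth_year"]


def hash_row(row: dict, salt: bytes) -> dict:
    """Return a copy of the row with identifying columns replaced by a hash."""
    key = "|".join(row[col].strip().lower() for col in ID_COLUMNS)
    digest = hashlib.sha256(salt + key.encode("utf-8")).hexdigest()
    kept = {k: v for k, v in row.items() if k not in ID_COLUMNS}
    return {"patient_hash": digest, **kept}


def hash_report(source, target, salt: bytes) -> None:
    """Read a CSV report and write a de-identified version of it."""
    reader = csv.DictReader(source)
    out_fields = ["patient_hash"] + [
        name for name in reader.fieldnames if name not in ID_COLUMNS
    ]
    writer = csv.DictWriter(target, fieldnames=out_fields)
    writer.writeheader()
    for row in reader:
        writer.writerow(hash_row(row, salt))


# A made-up report with 2 results for the same patient
report = io.StringIO(
    "first_name,health_card,birth_year,test,value\n"
    "John,999-999-999,1988,HbA1c,6.1\n"
    "John,999-999-999,1988,LDL,2.9\n"
)
hashed = io.StringIO()
hash_report(report, hashed, salt=b"example-salt-not-for-real-use")

hashed_text = hashed.getvalue()
print(hashed_text)  # header is patient_hash,test,value; no name or card number
```

In practice the same function could be applied to every file in the data folder, with the output written to new files for transfer to the flash drive.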
Once data are hashed, there is no way to identify a specific patient’s file, which could be an issue if a clinic had entered or coded data incorrectly. However, if the clinic were to correct the data, they could be rehashed and exported again, with each patient receiving the same unique identifier as before. For this reason, in our study we assigned each clinic a unique alphanumeric code and recorded which clinic each set of hashed data came from. In this way, we could not only report aggregate results back to clinics about their own patients, but also identify the affected clinics if any anomalies in hashed data were detected.
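The linkage and clinic-tracking steps can be sketched as follows in Python. The records, clinic codes (C01, C02), shortened hashes, and function name are invented for illustration and do not come from our data set; real hashes would be much longer.

```python
from collections import defaultdict

# Made-up hashed records pooled from 2 clinics (hashes shortened for display)
hashed_records = [
    {"clinic": "C01", "patient_hash": "ab12", "test": "HbA1c", "year": 2014, "value": 6.1},
    {"clinic": "C02", "patient_hash": "ab12", "test": "HbA1c", "year": 2019, "value": 6.8},
    {"clinic": "C01", "patient_hash": "cd34", "test": "HbA1c", "year": 2015, "value": 5.4},
    {"clinic": "C02", "patient_hash": "ef56", "test": "HbA1c", "year": 2018, "value": 7.2},
]


def link_records(records: list[dict]) -> dict[str, list[dict]]:
    """Group all records that share the same unique identifier."""
    linked = defaultdict(list)
    for record in records:
        linked[record["patient_hash"]].append(record)
    return dict(linked)


linked = link_records(hashed_records)

# Identifiers seen at more than 1 clinic: the same patient, linked across sites
multi_site = {}
for patient_hash, rows in linked.items():
    clinics = sorted({row["clinic"] for row in rows})
    if len(clinics) > 1:
        multi_site[patient_hash] = clinics

print(multi_site)  # {'ab12': ['C01', 'C02']}

# Because each record keeps its clinic code, an anomaly can be traced to a site
suspect = [row["clinic"] for row in hashed_records if row["value"] > 7.0]
print(suspect)  # ['C02']
```

Keeping the clinic code alongside each hashed record is what allows results to be reported back per clinic, and a data problem to be traced to its source, without ever identifying a patient.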
Conclusion
By using a hash-and-salt algorithm, we were able to conduct a multisite study through which we retrospectively gathered 10 years’ worth of data about patients across multiple physicians and family medicine clinics while keeping patient data linked, despite having removed personally identifiable information.
Acknowledgment
Financial assistance was received from the Rural Coordination Centre of British Columbia through the Rural Physician Research Support Project grant in the amount of $10,000.
Notes
Hypothesis is a quarterly series in Canadian Family Physician (CFP), coordinated by the Section of Researchers of the College of Family Physicians of Canada. The goal is to explore clinically relevant research concepts for all CFP readers. Submissions are invited from researchers and nonresearchers. Ideas or submissions can be sent online at https://mc.manuscriptcentral.com/cfp or through the CFP website https://www.cfp.ca under “Authors and Reviewers.”
Footnotes
Competing interests
None declared
Copyright © 2023 the College of Family Physicians of Canada