Research data in family medicine often come from 2 sources: self-report and medical record review. The quality of these data sources is frequently assumed to be high, but measuring the reproducibility of these data is essential to evaluating the quality of the information collected. In ideal circumstances, data obtained from either source would be equivalent. However, no source of data is without error. When agreement between data sources is low, research findings differ depending on the method of data collection used1 and leave the researcher with questions about which estimate is correct. Comparing data from different sources can give family medicine researchers insight into which data source is most appropriate to answer a specific research question or can direct efforts to improve the collection and recording of health data.2
Imagine that we are interested in the prevalence of fever or cough in outpatients over the past influenza season. Neither the medical record nor patient self-report is considered the true criterion standard for symptoms. We are not assessing the accuracy of one data source compared with another; rather, we are examining agreement between the sources of data. The presence or absence of patient symptoms is considered a binary variable—a categorical variable in which there are 2 possible conditions (eg, yes or no, positive or negative). This paper describes indicators for determining agreement between binary variables: total agreement, κ, and positive and negative agreement.
Interpreting the value of κ
Table 1 displays data from the Hutterite Influenza Prevention Study in 2 × 2 contingency tables.3 Symptoms reported by Hutterite community members were compared with documentation in the medical records. Total agreement is the number of concordant pairs divided by the total sample. In Table 1A, total agreement is 74%, which is the number of concordant yes’s for fever (18) plus the concordant no’s (112) divided by 176 participants. However, this simple measure does not take into account that a certain amount of agreement between medical charts and self-report is expected by chance alone4; assessment of κ, on the other hand, measures the strength of agreement beyond what we expect solely by chance. The calculation for κ is as follows:
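In standard notation, where $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance (calculated from the row and column totals of the 2 × 2 table),

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$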
Table 1. Contingency tables of data from the Hutterite Influenza Prevention Study3: A) fever, B) earache, C) cough, and D) chills.
The resulting value of κ falls on a scale of −1 to 1, where 0 indicates chance agreement and 1 indicates perfect agreement. In 1977, Landis and Koch proposed the following guidelines for interpreting κ values: less than 0, no agreement; 0.01 to 0.20, slight agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, substantial agreement; and 0.81 to 1.0, almost perfect agreement.5 While these guidelines are widely used and cited, the cutoffs are not universally accepted and have been criticized for being arbitrary divisions based on personal opinion rather than evidence.6,7
The value of κ is not simple to interpret because it is influenced by the prevalence of the variable being measured.8 Tables 1A and 1C have similar total agreements (as do Tables 1B and 1D), but their κ values differ according to the distributions. The κ value represents the proportion of total variance that is not attributable to chance or random error. Because total variance is minimal in a uniform (homogeneous) population, in which prevalence is relatively high (or low), κ will be low even though total agreement might be high (Table 1D). Because chance agreement is smallest in a mixed (heterogeneous) population, κ will be higher when prevalence is closer to 50% (Tables 1B and 1C). This makes it difficult to compare κ values across symptoms or other variables that have different prevalences.9
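A minimal sketch in Python, using hypothetical cell counts rather than the study data, illustrates this effect: 2 tables with identical total agreement (80%) yield very different κ values once prevalence moves away from 50%.

def kappa_2x2(a, b, c, d):
    # a = concordant positive pairs, d = concordant negative pairs,
    # b and c = discordant pairs
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical tables, each with total agreement of 80%
print(round(kappa_2x2(40, 10, 10, 40), 2))  # 0.6 (prevalence near 50%)
print(round(kappa_2x2(5, 10, 10, 75), 2))   # 0.22 (prevalence near 15%)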
Calculation of κ is also influenced by bias, that is, disagreement between the 2 sources in the proportion of positive (or negative) cases.6 With bias, discordant responses are not random but go in one direction rather than the other,8,10 which tends to happen when the prevalence of a symptom is high or low. High or low prevalence might result in a low κ value even though total agreement is substantial (Tables 1A and 1D); in contrast, the value of κ is higher when there is a large bias and lowest when bias is absent.11
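Reusing the hypothetical kappa_2x2 sketch above with 2 further made-up tables, each with a total agreement of 70%, shows this contrast:

# Hypothetical tables, each with total agreement of 70%
print(round(kappa_2x2(35, 15, 15, 35), 2))  # 0.4 (no bias: b equals c)
print(round(kappa_2x2(35, 30, 0, 35), 2))   # 0.45 (large bias: disagreements all in one direction)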
The κ value does not distinguish between various types and sources of agreement and disagreement.6,8,12,13 The aim of measuring agreement is to discover the bases of differences and reduce them if possible, rather than, for example, simply quantifying the degree of disagreement.9 In fact, it might be that no single agreement statistic can adequately capture agreement.11
Calculating positive and negative agreement
To help interpret κ values, calculating both positive and negative agreement has been recommended.11,14 Using the cells of the 2 × 2 table, where a is the number of concordant positive pairs, d is the number of concordant negative pairs, and b and c are the discordant pairs, positive and negative agreement are calculated as follows:
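$$p_{\text{pos}} = \frac{2a}{2a + b + c} \qquad\qquad p_{\text{neg}} = \frac{2d}{2d + b + c}$$

Returning to the hypothetical low-prevalence table above (total agreement of 80%, κ of 0.22), a brief Python sketch shows what these indices add:

def pos_neg_agreement(a, b, c, d):
    # positive and negative agreement for a 2 x 2 table
    p_pos = 2 * a / (2 * a + b + c)
    p_neg = 2 * d / (2 * d + b + c)
    return p_pos, p_neg

p_pos, p_neg = pos_neg_agreement(5, 10, 10, 75)
print(round(p_pos, 2), round(p_neg, 2))  # 0.33 0.88

Agreement on the absence of the symptom is high (0.88), but agreement on its presence is only 0.33, a distinction that neither total agreement (80%) nor κ (0.22) conveys on its own.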
Family medicine practitioners should consider these concepts when evaluating various aspects of clinical care, such as data collection for a new practice quality assurance process. Although total agreement and the value of κ are commonly reported in agreement studies, we recommend the additional calculation of positive agreement and negative agreement.
Notes
Hypothesis is a quarterly series in Canadian Family Physician, coordinated by the Section of Researchers of the College of Family Physicians of Canada. The goal is to explore clinically relevant research concepts for all CFP readers. Submissions are invited from researchers and nonresearchers. Ideas or submissions can be submitted online at http://mc.manuscriptcentral.com/cfp or through the CFP website www.cfp.ca under “Authors.”
Footnotes
Competing interests
None declared
Copyright © the College of Family Physicians of Canada