|
|
| |
| Appraising studies evaluating diagnostic tests |
| |
| |
In the approaching era of prospective medicine, we will increasingly be faced with new diagnostic or screening test kits. However, whether a test is clinically useful is often difficult to assess, especially if we have not previously been exposed to studies examining the issues surrounding diagnostic testing. Bear in mind that assessing a diagnostic test includes examining not only its immediate utility in individuals but also its utility and costs in a large cohort of patients in the long term. For example, studies of fecal occult blood tests for detecting colorectal carcinoma have evolved from early studies examining sensitivity and specificity in small cohorts to large population-based studies examining their utility in reducing mortality. Listed in the following table are some simple guidelines that are useful in evaluating studies involving diagnostic tests.
Dr Benjamin Chua
Visiting Specialist
Division of Clinical Trials and Epidemiologic Sciences
A Useful Framework for Evaluating a Clinical Research Paper
| Types of studies |
Critique Issues |
| Studies of test reproducibility |
Asks the question “How often will I get the same results from a single specimen if I test on different occasions?”
Presence of: |
| 1) |
Intraobserver variability – lack of reproducibility in results when the same observer or laboratory performs the test at different times. Assessing it is usually a first “screening” step before a test is developed further.
|
| 2) |
Interobserver variability – lack of reproducibility among 2 or more observers.
|
When reproducibility is poor, a diagnostic test is unlikely to be useful. A cross-sectional design is the most commonly used method of testing reproducibility, as it allows comparison of results from more than 1 observer on more than 1 occasion.
Outcome is: |
| 1) |
Categorical variable – the simplest measure of interobserver agreement is the concordance rate, i.e. the proportion of observations on which the observers agree exactly. When there are more than 2 categories or the observations are not evenly distributed among the categories, kappa (κ) is a better measure because it corrects for the agreement expected by chance. κ ranges from -1 (perfect disagreement) to 1 (perfect agreement); κ = 0 indicates that the observed agreement is no more than would be expected by chance.
|
| 2) |
Continuous variable – the coefficient of variation is the most commonly used measure, i.e. the standard deviation of the results on a single specimen divided by their mean, expressed as a percentage. If the results are normally distributed, about 95% of the results from different observers or machines will lie within 2 standard deviations of the mean.
|
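The two agreement measures for categorical outcomes described above can be sketched in a few lines of Python. This is a minimal illustration, not a validated statistical routine: the function name `concordance_and_kappa` and the two observers' readings are hypothetical, and κ is computed as Cohen's kappa for two observers, (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter


def concordance_and_kappa(rater_a, rater_b):
    """Return (concordance rate, Cohen's kappa) for two observers' categorical readings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty lists of readings of equal length")
    n = len(rater_a)

    # Concordance rate: proportion of specimens on which the observers agree exactly.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each observer's category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both observers use one identical category")

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical readings of 10 specimens by two observers (illustrative data only).
obs_1 = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "neg", "pos", "neg"]
obs_2 = ["pos", "neg", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg"]

observed, kappa = concordance_and_kappa(obs_1, obs_2)
print(f"concordance = {observed:.2f}, kappa = {kappa:.2f}")
# prints: concordance = 0.80, kappa = 0.58
```

In this made-up example the observers agree on 80% of specimens, yet κ is only about 0.58, because roughly half of that agreement would be expected by chance alone.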
|
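For a continuous outcome, the coefficient of variation described above is simple to compute. The sketch below assumes repeated measurements on a single specimen; the function name and the measurement values are hypothetical, and the sample standard deviation (n - 1 denominator) is one reasonable choice rather than the only one.

```python
from statistics import mean, stdev


def coefficient_of_variation(results):
    """Sample standard deviation divided by the mean, expressed as a percentage."""
    m = mean(results)
    if m == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return stdev(results) / m * 100


# Hypothetical repeat measurements on one specimen (illustrative data only).
repeats = [5.0, 5.2, 4.9, 5.1, 4.8]

cv = coefficient_of_variation(repeats)
m, sd = mean(repeats), stdev(repeats)
print(f"CV = {cv:.1f}%")
# If results are normally distributed, about 95% fall within 2 SD of the mean.
print(f"approximate 95% range: {m - 2 * sd:.2f} to {m + 2 * sd:.2f}")
```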
| Studies of the accuracy of the test |
Asks the question “To what extent does the test give the right answer?”
There needs to be a “gold standard” for comparison and the assessment of outcome should not be influenced by the diagnostic test being studied.
Outcome measures: |
| 1) |
Sensitivity and Specificity – reflect the results of dichotomous tests. Sensitivity is the proportion of subjects with the disease in whom the test gives the right answer (i.e. true positive). Specificity is the proportion of subjects without the disease in whom the test gives the right answer (i.e. true negative). |
| 2) |
ROC curves – used when outcomes are ordinal or continuous, where sensitivity and specificity depend on the cutoff point used to define a positive test. Several cutoff points are selected and the sensitivity and specificity are calculated at each. A graph of sensitivity (y-axis) as a function of 1 - specificity (x-axis) is then plotted. The ideal test has a curve that reaches the upper left corner. The area under the ROC curve summarizes the accuracy of the test and can be used to compare 2 or more tests; it ranges from 0.5 (useless test) to 1.0 (perfect test).
|
| 3) |
Likelihood ratios – a better measure than ROC curves for handling continuous or ordinal outcomes. Defined as:
$$\mathrm{LR} = \frac{P(\text{positive result} \mid \text{disease})}{P(\text{positive result} \mid \text{no disease})}$$
The higher the likelihood ratio, the better the test result is at ruling in disease. The lower the likelihood ratio, the better the result is at ruling out disease.
|
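Sensitivity, specificity and the likelihood ratio of a positive result can all be read off a 2x2 table of test result against the gold standard. The following is a minimal sketch: the function name and the counts are hypothetical, and the likelihood ratio of a negative result, (1 - sensitivity) / specificity, is included as the standard counterpart even though the text above defines only the positive form.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table of test result versus gold standard.

    tp/fn: diseased subjects testing positive/negative
    fp/tn: non-diseased subjects testing positive/negative
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one diseased and one non-diseased subject")

    sensitivity = tp / (tp + fn)          # P(positive result | disease)
    specificity = tn / (tn + fp)          # P(negative result | no disease)
    false_positive_rate = fp / (fp + tn)  # P(positive result | no disease)
    false_negative_rate = fn / (tp + fn)  # P(negative result | disease)

    # Likelihood ratio of a positive result, as defined in the text.
    lr_pos = sensitivity / false_positive_rate if fp > 0 else float("inf")
    # Likelihood ratio of a negative result (assumed standard counterpart).
    lr_neg = false_negative_rate / specificity if tn > 0 else float("inf")

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_pos,
        "lr_negative": lr_neg,
    }


# Hypothetical study: 100 diseased and 200 non-diseased subjects (illustrative only).
measures = accuracy_measures(tp=90, fp=40, fn=10, tn=160)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, a positive result is about 4.5 times as likely in a diseased subject as in a non-diseased one, while a negative result is about one eighth as likely.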
|
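The ROC procedure described above (pick several cutoffs, compute sensitivity and specificity at each, then plot) can be sketched as follows. The function names, the test scores and the disease labels are hypothetical; a result is assumed to be positive when the score is at or above the cutoff, and the area is approximated with the trapezoidal rule.

```python
def roc_points(scores, diseased):
    """Return (1 - specificity, sensitivity) pairs, one per cutoff, ordered for plotting."""
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one diseased and one non-diseased subject")

    points = [(0.0, 0.0)]  # cutoff above every score: nobody tests positive
    # Lower the cutoff step by step; a result is positive when score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, diseased) if s >= cutoff and d)
        fp = sum(1 for s, d in zip(scores, diseased) if s >= cutoff and not d)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical continuous test results with gold-standard disease status.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
diseased = [True, True, False, True, True, False, False, False]

points = roc_points(scores, diseased)
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
# prints: AUC = 0.875
```

An area of 0.875 here means that, in this made-up sample, a randomly chosen diseased subject scores higher than a randomly chosen non-diseased subject in 14 of the 16 possible pairs.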
| Studies of feasibility, cost and risks of the test |
Generally descriptive studies.
The method of sampling patients is important, as test results vary among the institutions or persons performing them as well as among the subjects. Studies should define criteria for determining whether the test is acceptable.
Risk studies should have pre-defined criteria for determining adverse effects or events. Cost studies should use the costs of the test, rather than the charges incurred, as the outcome measure.
|
| Studies of the effect of testing on outcomes |
Usually involves long-term outcomes following administration of the test. The outcome(s) measured should reflect morbidity or mortality and be broad enough to include many types of adverse effects. There are 2 broad groups of studies: |
| 1) |
Observational studies – easier and less costly to conduct but often confounded by the indication for testing, e.g. patients who are tested may be at higher risk of disease and hence end up being volunteered for testing, a form of selection bias.
|
| 2) |
Clinical trials – minimize or eliminate confounding and selection bias. However, there may be ethical issues about withholding potentially beneficial tests.
|
|
| Pitfalls |
| 1) |
Inadequate sample size – especially true if the disease or outcome being measured is rare.
|
| 2) |
Inappropriate exclusion – the basic rule is that if any patients who test positive are excluded from the numerator, then similar patients must also be excluded from the denominator.
|
| 3) |
Institution-specific results – especially important when a study suggests that a test is not helpful in clinical practice. They can also occur in institutions that do exceptionally well in conducting a particular test.
|
| 4) |
Dropping uninterpretable or borderline results – occurs when a test gives results that fall into the “grey zone” or when test specimens have deteriorated. Depending on what the study aims to achieve and the cutoffs used, these results may be classified as “positive” or “negative”.
|
|
|
| |
|