Tools for Direct Observation and Assessment of Clinical Skills of Medical Trainees
A Systematic Review
Jennifer R. Kogan, MD; Eric S. Holmboe, MD; Karen E. Hauer, MD
Author Affiliations: Department of Medicine, University of Pennsylvania Health System (Dr Kogan) and American Board of Internal Medicine (Dr Holmboe), Philadelphia, Pennsylvania; and University of California, San Francisco (Dr Hauer).
Corresponding Author: Jennifer R. Kogan, MD, Department of Medicine, University of Pennsylvania Health System, 3701 Market St, Ste 640, Philadelphia, PA 19104 (jennifer.kogan@uphs.upenn.edu).
Abstract
Context Direct observation of medical trainees with actual patients is important for performance-based clinical skills assessment. Multiple tools for direct observation are available, but their characteristics and outcomes have not been compared systematically.
Objectives To identify observation tools used to assess medical trainees' clinical skills with actual patients and to summarize the evidence of their validity and outcomes.
Data Sources Electronic literature search of PubMed, ERIC, CINAHL, and Web of Science for English-language articles published between 1965 and March 2009 and review of references from article bibliographies.
Study Selection Included studies described a tool designed for direct observation of medical trainees' clinical skills with actual patients by educational supervisors. Tools used only in simulated settings or assessing surgical/procedural skills were excluded. Of 10 672 citations, 199 articles were reviewed and 85 met inclusion criteria.
Data Extraction Two authors independently abstracted studies using a modified Best Evidence Medical Education coding form to inform judgment of key psychometric characteristics. Differences were reconciled by consensus.
Results A total of 55 tools were identified. Twenty-one tools were studied with students and 32 with residents or fellows. Two were used across the educational continuum. Most (n = 32) were developed for formative assessment. Rater training was described for 26 tools. Only 11 tools had validity evidence based on internal structure and relationship to other variables. Trainee or observer attitudes about the tool were the most commonly measured outcomes. Self-assessed changes in trainee knowledge, skills, or attitudes (n = 9) or objectively measured change in knowledge or skills (n = 5) were infrequently reported. The strongest validity evidence has been established for the Mini Clinical Evaluation Exercise (Mini-CEX).
Conclusion Although many tools are available for the direct observation of clinical skills, validity evidence and description of educational outcomes are scarce.
KEYWORDS: CLINICAL COMPETENCE; EDUCATION, MEDICAL; EDUCATION, MEDICAL, GRADUATE; EDUCATIONAL MEASUREMENT; FACULTY, MEDICAL; INTERNSHIP AND RESIDENCY; QUALITY ASSURANCE, HEALTH CARE; QUALITY OF HEALTH CARE; STUDENTS, MEDICAL
Direct observation of medical trainees with actual patients by clinical supervisors is critical for teaching and assessing clinical and communication skills. A recent Institute of Medicine report calls for improved supervision of trainees to enhance patient safety and quality of clinical education.1 The Liaison Committee on Medical Education and Accreditation Council for Graduate Medical Education require ongoing assessment that includes direct observation of trainees' clinical skills.2,3 By observing and assessing learners with patients and providing feedback, faculty help trainees to acquire and improve skills and help patients through better supervision of clinical care.4
Direct observation of medical trainees occurs infrequently and inadequately.5,6 End-of-rotation global rating forms are often completed by supervisors who have not directly observed trainees with patients.7 However, assessment based on direct observation should be an essential component of outcomes-based education and certification.8,9 With current interest in establishing an outcomes-based medical education system that enhances trainee development and patient safety, there is a great need for robust work-based evaluation tools. To our knowledge, a rigorous systematic review has not been performed of the utility and quality of the numerous existing tools for direct observation and assessment of medical trainees with actual patients. We therefore systematically reviewed the literature to determine available tools for direct observation by supervisors of trainees' clinical skills with actual patients. The aim was to describe existing tools and the evidence of their validity and outcomes to provide medical educators with evidence-based assessment measures and an understanding of areas for further research.
METHODS
Data Sources
A systematic literature search was conducted using specific eligibility criteria, electronic searching, and hand searching to minimize risk of bias in selecting articles. The search, conducted with the assistance of a library science expert, covered relevant English-language studies published between January 1965 and March 2009 in the PubMed, Education Resource Information Center (ERIC), Cumulative Index to Nursing and Allied Health Literature (CINAHL), and Web of Science electronic literature databases. Search terms were combined across 3 concepts: competence (clinical competence; clinical skills), medical education (education; students, education, medical; clinical clerkship, internship and residency/methods; preceptorship), and learner level (student; intern; resident). Tables of contents of medical education journals not indexed in PubMed (Teaching and Learning in Medicine, 1986-1996; Medical Teacher, 1979-1980) were hand-searched. The reference lists of all included articles and identified review articles were examined. A key word search of instruments identified in the included articles was also conducted. A more detailed search strategy is available on request.
Study Selection
Studies were included if they described a tool designed (1) for direct observation of skills in clinical settings with actual patients (observer in the room or observing by remote camera) and (2) for use by educational supervisors (interns, residents, fellows, faculty, nurses, nurse practitioners, other trained observers) with medical trainees (medical students, interns, residents, fellows). Studies were excluded if they described tools intended (1) for use with standardized patients, (2) for use in simulated settings (ie, without actual patients), or (3) to assess surgical or procedural skills, or (4) if a full article was not available for review.
Title and Abstract Review
The initial search identified 10 672 citations (Figure). All 3 authors independently reviewed citation titles and abstracts to assess eligibility for review, with each title/abstract reviewed by at least 2 authors. Of those, 199 were appropriate for detailed review to determine if they met inclusion criteria. Review articles were excluded. When reviewers disagreed or an abstract was insufficient to determine study eligibility, the full article was retrieved.
Study Review and Data Extraction
A Best Evidence in Medical Education abstraction form10 was modified to focus on the settings, learners, tool content, and outcomes described in studies. Every article was independently abstracted by 2 authors (J.R.K. and K.E.H.). Each reviewer then reconciled half of the abstractions for completeness and accuracy. Differences in data abstraction were resolved through consensus adjudication. Extracted information included tool characteristics and implementation, validity, and outcomes. Tool characteristics comprised assessed skills, number of items and how they were evaluated, and space for open-ended comments or an action plan. Implementation comprised research study design,11 setting (country, single/multi-institution, specialty, inpatient/outpatient, trainee level), observer characteristics, and use for formative/summative evaluation.
Information on reliability and validity was extracted. Although many frameworks to evaluate assessment tools exist,12,13,14 the unitary theory of Messick13 was used. In this approach, validity evidence is used to support the overarching framework of construct validity, the degree to which an assessment measures the underlying construct.13,15,16 Validity evidence was sought in 5 areas:
- Content: relationship between the tool's content and the construct it intends to measure
- Response process: evidence showing raters have been properly trained (faculty development)
- Internal structure (reliability): internal consistency, test-retest reliability, agreement (interrater reliability), generalizability
- Relationship to other variables (concurrent, predictive validity): correlation of scores with other assessments or outcomes; differences in scores by learner subgroups
- Outcomes (educational outcomes): consequences of assessment
A modified version of Kirkpatrick's hierarchy was used to evaluate outcomes of implementing a tool.17 Outcome levels abstracted included:
- Participation: learners' or observers' views on the tool or its implementation
- Self-assessed modification of learner or observer attitudes, knowledge, or skills
- Transfer of learning: objectively measured change in learner or observer knowledge or skills
- Results: change in organizational delivery or quality of patient care
Information regarding cost of tool development and implementation was also extracted.18
Data Synthesis and Analysis
Due to study heterogeneity, a meta-analysis was not possible. After ascertaining tools used for direct observation, we specifically identified those with evidence of internal structure validity and validity based on relationship to other variables. We determined whether these tools had an educational outcome beyond learners' or observers' attitudes about the tool or its implementation.
RESULTS
Search Results and Article Overview
The Figure summarizes the results of the review process. Of 10 672 citations, 85 met inclusion criteria after title, abstract, and full article review. Fifty-five unique tools were identified. The 85 studies were heterogeneous in their populations, methods, and outcomes (Table 1). The most common study design was a prospective cohort without a comparison group. Randomized controlled trials were used in 6 studies in internal medicine,19,20,21,22,23,24 1 in pediatrics,25 and 1 in an unspecified discipline.26 Of the studies, 64 (75%) occurred within single institutions. Twenty-seven studies mentioned institutional review board approval.20,21,22,23,24,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48 Costs of tool implementation, mentioned infrequently,37,39,49,50,51,52,53,54,55,56,57 usually focused on faculty time. One article specifically mentioned administrative costs56 but none included cost calculations. eTable 1 presents additional information about the characteristics of each study (objective, design, country, learner, specialty, observation location, assessment type [formative/summative], and how observations of trainees occurred).
Table 1. Characteristics of 85 Studies Describing Tools for Direct Observation of Medical Trainees' Clinical Skills
Description of Tools
Details about each of the tools are provided in Table 2A and Table 2B. Of the 55 unique tools identified, 21 (38%) were implemented with students, 32 (58%) with residents or fellows, and 2 (the Mini Clinical Evaluation Exercise [Mini-CEX] and 1 unnamed58) with both. The largest number of tools (17) were developed or tested in internal medicine settings. The Mini-CEX was the most studied, with adaptations for palliative care,37 ophthalmology,59,60 and cardiology41,61,62 and implementation in multispecialty settings.63 Most tools contained items on history taking, physical examination, and communication (eTable 2). Eleven tools (20%) contained scales with behavioral anchors.40,59,60,64,65,66,67,68,69,70,71,72,73 Twenty tools (36%) solicited open-ended comments or written action plans. Thirty-two tools (58%) were implemented for formative assessment, 7 (13%) for summative assessment, and 3 (5%) for both, although this distinction was not always clear (eTable 1). Many tools were used once per trainee, although some were used up to 10 times (eTable 1).
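The proportions reported in this section can be rechecked from the tool counts. The short Python sketch below is illustrative only and is not part of the original analysis; the helper name `pct` and the dictionary labels are hypothetical, and the counts are copied from the text above.

```python
# Counts of tools reported in the text; the denominator is the 55 unique tools.
TOTAL_TOOLS = 55

counts = {
    "students": 21,
    "residents_or_fellows": 32,
    "both_learner_levels": 2,
    "behavioral_anchors": 11,
    "open_ended_comments": 20,
    "formative": 32,
    "summative": 7,
    "formative_and_summative": 3,
}


def pct(n, total=TOTAL_TOOLS):
    """Return n as a whole-number percentage of total."""
    return round(100 * n / total)


# The three learner-level groups should account for every tool.
learner_total = (
    counts["students"]
    + counts["residents_or_fellows"]
    + counts["both_learner_levels"]
)

for label, n in counts.items():
    print(f"{label}: {n} of {TOTAL_TOOLS} ({pct(n)}%)")
```

The same check could be applied to any other count reported against the 55-tool denominator.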
Table 2. Description of Tools (n = 55) Used for Direct Observation of Clinical Skills and the Studies Describing Them
Validity Evidence
The frequency of reported validity evidence across tools is summarized in eTable 2. Table 2A and Table 2B describe whether validity was studied for each tool. Actual evidence by study is presented in eTable 3.
Content
Descriptions of tool content selection (content validity) were mentioned for 20 tools (36%)20,21,27,29,30,33,34,38,39,40,52,56,59,74,75,76,77,78,79,80,81 and typically involved expert or consensus groups reviewing educational competencies and literature.
Response Process
Observers were infrequently trained to use assessment tools. Rater training, described for 47% of tools,19,20,21,22,23,27,28,29,33,34,35,36,37,38,39,41,42,44,45,47,49,50,51,55,61,62,65,66,70,73,74,75,77,80,82,83,84,85,86,87 usually occurred once and was brief (10 minutes to 3 hours). Training usually included orienting observers to the tool or discussing feedback principles via e-mail, workshops, or preexisting institutional faculty/resident lectures and meetings.19,20,21,22,27,28,29,33,34,35,36,37,38,39,41,42,44,45,47,49,50,55,61,62,64,65,66,70,75,77,80,82,85,86,87 Training sessions that incorporated rater practice using the tool or review of videotaped performances of different competency levels were described for 8 tools.20,22,23,34,35,49,55,70,74,85 For 2 tools, observers were either given examples of effective feedback21,85 or trained to provide feedback using role play.23,49
Internal Structure
Interrater reliability, reported for 22 tools (40%), was the most commonly reported reliability assessment19,20,22,24,25,28,30,31,33,34,39,40,52,60,63,66,69,73,75,77,79,81,88,89,90,91,92,93 and was often suboptimal (<0.70).94 Intrarater reliability93 and test-retest reliability88 were reported for 1 tool each. Interitem correlations (correlations between items on the form) were reported for 2 tools,22,42,52,95,96 and item-total correlations (correlations between items and the overall rating) were reported for 4 tools.22,42,45,77,80,95,96 Internal consistency was described for only 8 tools30,36,40,41,42,45,60,68,75,77,97 but was usually high (Cronbach α ≥0.70).94 Generalizability/reproducibility coefficients were reported for 8 tools.22,28,42,47,61,63,66,69,73,74,75,77,95,96 Three studies, 1 describing the minicard and the other 2 a modified Mini-CEX, compared performance characteristics of 2 different tools.20,22,48
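To make the internal consistency statistic concrete, the following minimal Python sketch computes Cronbach α from a table of item ratings using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). It is an illustration only, not code from any reviewed study; the function name `cronbach_alpha` and the sample ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach alpha for a list of rows.

    Each row is one observed encounter; each column is one item on the
    rating form. Sample variances (n - 1 denominator) are used throughout.
    """
    n_items = len(ratings[0])
    if n_items < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 items and 2 encounters")
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every encounter must rate the same items")

    # Variance of each item across encounters.
    item_variances = [
        variance([row[j] for row in ratings]) for j in range(n_items)
    ]
    # Variance of the summed score across encounters.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: 5 encounters scored on 4 items (9-point scale).
example = [
    [6, 7, 6, 7],
    [4, 5, 4, 4],
    [8, 8, 7, 9],
    [5, 5, 6, 5],
    [7, 6, 7, 7],
]
alpha = cronbach_alpha(example)
print(f"alpha = {alpha:.2f}; meets 0.70 threshold: {alpha >= 0.70}")
```

A high α indicates only that items vary together; it does not show that different raters agree, which is why interrater reliability is reported separately.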
Relationship to Other Variables
Correlation of direct observation scores with other assessments was described for 17 tools (31%) in 22 studies.21,28,30,36,41,42,52,53,63,68,69,73,75,79,80,84,89,97,98,99,100,101 Assessments were compared to written examination scores21,28,41,42,73,75,84,89,97,98,99,101 and clinical performance ratings.21,28,30,36,42,52,53,69,84,89,97,99,100,101 Comparisons with objective structured clinical examinations/standardized patient examinations,28,41,63,73,75,101 chart audits,79 patient write-ups,42,68 or patient ratings30 were less common. In general, correlations were low (r = 0.1) or modest (r = 0.3).102 Correlations were disattenuated in 3 studies.41,73,75
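Disattenuation adjusts an observed correlation for measurement error in the two scores being compared. The classical approach is Spearman's correction, which divides the observed correlation by the square root of the product of the two reliabilities. The Python sketch below assumes that correction; the review does not state which method the 3 studies used, and the function name `disattenuate` and the example values are hypothetical.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation.

    r_observed    -- observed correlation between the two assessments
    reliability_x -- reliability coefficient of the first assessment
    reliability_y -- reliability coefficient of the second assessment
    """
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical example: a modest observed correlation between observation
# scores and an examination, each measured with imperfect reliability.
corrected = disattenuate(0.30, 0.60, 0.60)
print(f"observed r = 0.30, disattenuated r = {corrected:.2f}")
```

Because the correction divides by a quantity no greater than 1, the disattenuated value is never smaller in magnitude than the observed correlation, and with low reliabilities it can exceed 1, so corrected values should be read with caution.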
Performance scores were also compared across training level or other learner characteristics.24,28,35,38,39,41,42,51,58,61,69,72,80,83,92,93,95,96,97,103 Eight tools (10 studies) had scores that increased with training level35,38,42,51,58,61,69,80,83,95; with 4 tools this trend was not seen.51,72,83,97 The Mini-CEX had evidence both supporting24,41,42,61,95,96 and refuting97 score improvement with training level. With 4 tools, learners' performance improved after clinical skills training and/or feedback.39,72,92,93
Outcomes
Surveying trainees and observers about their experiences with a tool was the most common method for assessing outcomes, used with 19 tools (35%).21,23,30,37,41,42,44,45,47,49,50,54,55,56,57,61,62,65,66,67,70,76,86,87,88,89,93,95,96,100,104 Trainees generally rated observation experiences positively.
Modification of trainees' self-assessed knowledge, attitudes, or skills was reported for 9 tools (16%).21,27,30,37,50,76,89,91,100,104 Transfer of trainee learning (objectively measured skill or behavior change) was described for 5 tools.25,26,39,49,93 Studies describing these changes were often nonblinded and failed to control for baseline clinical skills.26,39
Outcomes of tool implementation on observer feedback or the effect of observer training on rating behaviors was described for 6 tools.22,23,27,49,56,70,88 Tool implementation increased the frequency,27,56 specificity,70,88 and timeliness70 of observation and feedback. Training increased confidence using the tool22,23 but inconsistently improved rater stringency and accuracy.22,23
Organizational change was described for 2 tools (Modified Leicester Assessment Package64; Patient Evaluation Assessment Form38). For both, it was suggested that deficiencies identified on assessments inspired curricular change.38,64 No tool had evidence that use affected patient care outcomes.
Tools With Multiple Elements of Validity Evidence
Eleven tools had evidence of internal structure validity and validity based on relationships to other variables. These included the Direct Observation Clinical Encounter Examination75 (multispecialty), Clinical Encounter Card27,28 (surgery), Direct Psychiatric Clinical Examination89 (psychiatry), Revised Infant Video Questionnaire39 (pediatrics), a 360-degree evaluation described by Wood et al30 (radiology), Davis Observation Code79,101 (family medicine), Mini-CEX,41,42,45,47,61,63,95,96,97 and unnamed tools described by Woolliscroft et al (unspecified discipline),68 Brennan and Norman73 (obstetrics), Beckman et al92 (internal medicine), and Nørgaard et al80 (internal medicine). Only 3 had evidence of learning. In a noncontrolled study, use of the Revised Infant Video Questionnaire was associated with increased learning.39 Residents self-assessed improved communication and counseling skills with a 360-degree evaluation.30 Students reported improved understanding of their history-taking, physical examination, and decision-making skills using the Clinical Encounter Card.27,28
COMMENT
Direct observation of medical trainees by faculty remains a vital component of assessment across specialties. Assessment through observation provides ongoing data on trainee performance with actual patients, and effective assessment helps medical educators meet their professional obligation to self-regulate effectively.105 Enhanced supervision (with observation) can be associated with better patient care and faster acquisition of clinical skills by trainees,106 and the 2008 Institute of Medicine report recommends greater supervision in medical education to improve patient safety and education.1 The development of expertise depends on accurate and detailed assessment and feedback.107 However, faculty and training institutions may not be held accountable for ensuring trainees' clinical competence, and high-quality direct observation of trainees should augment the quality of supervision.108
Although we identified many tools available for direct observation of clinical skills, few have been thoroughly evaluated and tested. One tool, the Mini-CEX, has been implemented repeatedly with medical students, residents, and fellows across specialties. The 20 Mini-CEX studies illustrate how validity evidence can accrue and how tool implementation can be modified (eg, adding behavioral anchors to increase score reliability and accuracy).20 Multiple publications support the validity of Mini-CEX scores. Ten other tools (Table 2A and Table 2B) possessing at least 2 levels of validity evidence have potential for wider use with additional research on implementation and consequential validity.27,28,30,39,68,73,75,79,80,89,92,101
Although many studies measured trainees' or observers' attitudes about the observation process, few demonstrated improved clinical skills or patient care quality with tool implementation in an educational program. Outcomes such as learning, transfer of skills to new situations, or improved patient care are important and relatively unstudied. Whether these tools are associated with health care system improvements remains an area for future research.
In many studies, rater training (the response process component of validity) was minimally described or did not occur. Whether this omission was related to perceived cost, time constraints, or unawareness of the importance of rater training is unknown. However, observers need training to rate learners' performance reliably and discriminate between performance levels.8 Randomized trials highlight the value of rater training and its effect on scores.22,23 Brief training is likely to be ineffective.19,22,23,77 Although rater training may initially be resource- and time-intensive, these costs should be weighed against potential benefits gained in teaching quality and learning.18 Given the relative inattention to implementation in the studies we reviewed, as well as the high expense associated with current assessment strategies such as simulation and standardized patient examinations, faculty development that enhances trainees' clinical skills and increases faculty supervision through observation could enhance care and may be cost-effective.
Our findings also suggest several next steps to improve the quality of research in this area. To enhance the quality of evidence in medical education, published research should include the assessment or intervention; methods of implementation; and evidence for reliability, validity, and educational outcomes.106 However, current research generally does not adhere to these recommendations. After utility of a tool has been demonstrated (validity evidence) and guidelines for implementation developed, randomized study designs should follow whenever possible to assess whether the tool affects educational outcomes.109,110 More multi-institutional studies could help improve generalizability of findings. However, these larger, complex studies will require more resources, often lacking for educational research,111 and may benefit from more streamlined institutional review board approval processes.112
A strength of this study is that the review included more than 10 000 abstracts and hand-searching of bibliographies from published studies. However, several limitations should be considered. Publication bias is possible; there are likely tools that have not been described in publications, although they may have relatively poor psychometric characteristics.113 The search strategy was limited to English-language studies and did not include unpublished abstracts from conference proceedings or nonindexed open-access journals. Although a library science expert assisted with the search, the lack of a specific Medical Subject Heading for direct observation and variability of terms used in the medical education literature may have limited the ability to identify all studies. The literature search may have missed relevant international studies because the search strategy did not include some terms commonly used in non-US countries (eg, registrar).
In conclusion, this systematic review identified and described a large number of tools designed for direct observation of medical trainees' clinical skills with actual patients. Of these, only a few have demonstrated sufficient evidence of validity to warrant more extensive use and testing.
Author Contributions: Dr Kogan had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Kogan, Holmboe, Hauer.
Acquisition of data: Kogan, Hauer.
Analysis and interpretation of data: Kogan, Holmboe, Hauer.
Drafting of the manuscript: Kogan, Holmboe, Hauer.
Critical revision of the manuscript for important intellectual content: Kogan, Holmboe, Hauer.
Statistical analysis: Kogan.
Obtained funding: Kogan, Holmboe, Hauer.
Study supervision: Hauer.
Financial Disclosures: Dr Holmboe reports being employed by the American Board of Internal Medicine and receiving royalties from Mosby-Elsevier for a textbook on physician assessment. No other disclosures were reported.
Previous Presentations: A subset of these data were presented in a poster at the Clerkship Directors in Internal Medicine National Meeting, Orlando, Florida, October 31, 2008.
Funding/Support: This study was funded by a grant from the American Board of Internal Medicine.
Role of the Sponsor: The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; or preparation, review, or approval of the manuscript.
Additional Contributions: Josephine Tan, MLIS (UCSF) provided help with literature searching; Joanne Batt, BA, and Salina Ng, BA (UCSF), provided administrative assistance and data organization; Patricia S. O’Sullivan, EdD, the ESCape works in progress group (UCSF), and Judy A. Shea, PhD (University of Pennsylvania), provided comments on the manuscript. These individuals did not receive compensation for their roles in the study.