++
Ideally, learners (and programs) will be assessed at multiple points, with performance data used to drive improvement iteratively. In a competency-based model, a threshold is set that all learners must achieve; however, benchmarks or milestones on the path toward competence can be used to track learner progression toward more advanced skills over time. Programs should articulate skill development pathways that repeatedly assess learners and offer remediation as needed.
++
A number of validated assessment tools are available for the BSS (see Suggested Readings and resource links). Table 43-2 provides examples of specific tools that assess core BSS areas such as social attitudes and behavior change counseling. Evaluators may select validated tools or develop their own based on their specific assessment needs. Remember that assessment instruments are not universally valid or reliable. Although it is helpful to review a tool's previously reported psychometric properties, it is still necessary to evaluate how the tool performs when used with your learners in the context of your own assessment. Assessment goals (e.g., changing attitudes, knowledge, or skills), available resources, and preferred methodologies all inform the selection of measurement tools.
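++
To make the point about local re-evaluation concrete, one psychometric property that can be recomputed from your own learners' responses is internal consistency; a minimal worked sketch using Cronbach's alpha follows, with purely hypothetical numbers:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{i}}{\sigma^2_{X}}\right)

where k is the number of items, \sigma^2_{i} is the variance of item i, and \sigma^2_{X} is the variance of the total score. For a hypothetical 10-item attitude scale in which the item variances sum to 4.0 and the total-score variance is 10.0, \alpha = (10/9)(1 - 0.4) \approx 0.67, a value lower than many published estimates and a signal that the instrument may behave differently with your learners.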
++
++
Educational programs seek to instill particular values or attitudes in their learners (e.g., empathy, respect, nonjudgment, egalitarianism). Educational activities targeting attitudes are often indirect or experiential and can be highly influenced by institutional culture and the “hidden curriculum” (see Chapter 46). Attitudes are most commonly assessed with self-report surveys, which are highly efficient but have fairly obvious demand characteristics: respondents may know the socially desirable answer and may not answer honestly. Indirect measures, such as the Implicit Association Test (IAT), attempt to tap unconscious bias by tracking millisecond differences in reaction time to words paired with race, gender, or other demographic identifiers. Multisource feedback (also known as “360-degree evaluation”) involves eliciting evaluations of the learner from peers, allied health professionals, patients, and others with first-hand knowledge of the learner’s tone, emotion, and behaviors, from which attitudes can be inferred.
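++
For readers unfamiliar with how millisecond reaction-time data become an implicit bias score, the following is a simplified sketch of one commonly described IAT scoring approach (a variant of the D score); the numbers are illustrative assumptions, and published scoring algorithms add further data-cleaning steps:

D = \frac{M_{\text{incompatible}} - M_{\text{compatible}}}{SD_{\text{pooled}}}

where M_{\text{compatible}} and M_{\text{incompatible}} are the mean response latencies in the two critical pairing blocks and SD_{\text{pooled}} is the standard deviation of latencies across both blocks. For example, hypothetical mean latencies of 850 ms and 750 ms with a pooled standard deviation of 200 ms would give D = 0.5, with larger positive values generally interpreted as stronger implicit associations.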
++
Educational or learning objectives are critical for both curriculum development and evaluation. Objectives should specify observable and measurable changes brought about as a consequence of curricular exposure. Even for objectives limited to medical knowledge, the level or content of knowledge expected can vary, ranging from simple recall to complex synthesis using foundational knowledge and principles (see Bloom’s Taxonomy). Written tests are commonly used to assess medical knowledge, but their content and format should be determined by the level of knowledge expected. Because learning can occur at many levels, from simple recall to problem solving, multiple-choice questions (MCQs) should target the level of learning appropriate to the content taught. Characteristics of effective MCQs relate to the item as a whole, the question stem, and the answer options. Poorly constructed MCQs will not produce accurate or meaningful test scores and could negatively affect learner pass rates. Tips for developing MCQs can be found on the National Board of Medical Examiners (NBME) website.
++
Although MCQs can be problematic in nearly any content area, they are especially difficult to develop for BSS content. MCQs commonly test recognition (choosing an answer) rather than recall (constructing an answer), allow guessing, and are time-consuming to construct, especially when the answer choices are neither intuitive nor unequivocally correct. Thus, case write-ups, critical essays, or innovative tools like concept mapping might better assess a learner’s mastery of BSS content. For example, at the UCSF School of Medicine, medical students keep a “sociocultural skills tracker” that includes a longitudinal series of patient-based assignments to demonstrate progress toward competence in working with patients of diverse backgrounds. “Tracker” entries are used to provide feedback and shape subsequent exercises.
++
Achieving behavioral science competency includes mastery of broad content as well as communication and interpersonal skills. Although relatively little has been published on assessing students’ behavioral science knowledge, a rich literature exists on performance-based testing to assess communication skills. Most of this work assesses learners at the “shows how” level of Miller’s pyramid (knows, knows how, shows how, does), and such assessments are stronger predictors than written examinations of whether learners will demonstrate these skills in the clinical setting. Using unannounced standardized patients (SPs) in the clinical setting is an example of a performance-based assessment that evaluates learners at the “does” level of the pyramid, assuming they are unaware they are being assessed.
++
In the following section, we discuss several methods of performance-based assessments to evaluate communication and interpersonal skills. Many of these performance-based assessment methods can also be used to directly measure knowledge and indirectly measure attitudes.
++
Direct observation of learners in medical education is essential but underutilized, especially in clinical settings. Both the Liaison Committee on Medical Education (LCME) and Accreditation Council for Graduate Medical Education (ACGME) require ongoing assessment of trainees that includes direct observation of clinical skills. The Mini-Clinical Evaluation Exercise (mini-CEX) is the most studied format for performance evaluation using direct observation. Less formal observations are also commonly done using behavioral checklists, global rating scales, or other observational tools.
+++
Inter-Rater Reliability
++
Several factors are important to consider when evaluating the quality and accuracy of assessments based on direct observation. Overall assessments based on clinical observation generally have poor reliability. Variability in rater assessments has been attributed to measurement errors of leniency, bias, and poor discrimination of performance. Although case specificity is typically unavoidable and affects reliability, rater variability also plays a role in accuracy. Judgments made by raters can be extremely subjective, influenced by the rater’s mood, and likely to be idiosyncratic.
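++
To illustrate what “poor reliability” means in practice, one widely used index for two raters scoring the same dichotomous checklist items is Cohen’s kappa, which corrects observed agreement for agreement expected by chance; the figures below are purely illustrative:

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the proportion of items on which the two raters agree and p_e is the agreement expected by chance. If two faculty raters agree on 80% of checklist items (p_o = 0.80) but chance agreement is 50% (p_e = 0.50), then \kappa = 0.30/0.50 = 0.60, often interpreted as only moderate agreement despite seemingly high raw agreement.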
++
Using interviews with faculty raters, Kogan and colleagues identified four themes that provide insight into how ratings are formed: (1) faculty rely on variable frames of reference during observations, including their own performance, the performance of others they have known at different levels of training, the standard of performance considered necessary for patient care, and patient outcomes; (2) inference often plays a key role, including assumptions about learner motivations and attitudes; (3) faculty use variable approaches to synthesizing their judgments into numerical ratings, both across and within observations; and (4) contextual factors (the complexity of the encounter, the learner’s level of experience, and the faculty–resident relationship) and residents’ responses to feedback play a role in faculty judgments. Thus, raters need both tools and training to improve reliability.
++
Tools to guide observations and structure assessments come in different forms. Kogan and colleagues published a systematic review of tools for direct observation of students, residents, and fellows in clinical settings. Another review evaluated 15 instruments for assessing communication skills against criteria from the Kalamazoo Consensus Statement. Global rating scales have been found to be the more efficient of these tools and are as reliable as checklists when completed by expert faculty. Checklists offer advantages for less experienced raters, such as students or SPs.
++
Both checklists and rating scales vary greatly in length and format depending on the objective of the assessment, the aspects of performance being assessed, and who is doing the rating. Checklists typically list observed behaviors that are scored dichotomously as “done” or “not done.” This approach is not ideal for assessing complex and nuanced behaviors such as empathy or respect. Rating scales, in general, measure implicit processes and offer a range of responses, usually on a numerical or Likert-type scale or a behaviorally anchored rating scale (BARS). A BARS explicitly lists the learner behaviors that must be observed to achieve a particular score. For example, for the item “Allows the patient to tell their story,” the behavioral anchors can range from the lowest level, “fails to let patient tell story or sets pace with closed Q&A style, not conversational,” to the highest level, “encourages and lets patient tell story with open-ended questions and doesn’t interrupt patient.” Behavioral anchors can improve inter-rater reliability because they provide a more explicit definition of each level of performance. Global rating scales use fewer items but assess larger domains that may be difficult to reduce to a set of micro-skills (e.g., empathy, cultural sensitivity). Global rating scales have been shown to be reliable when used by expert faculty or well-trained raters.
++
Despite the difficulties inherent in faculty ratings, faculty assessments of learners’ performance are essential for any educational program. Training faculty to improve their observation skills matters far more than choosing the “perfect” rating instrument.
++
Behavioral Observation Training (BOT) focuses on improving the observer’s ability to detect, perceive, and remember learner performance. Strategies include practicing by increasing the number of observations performed, using a tool to record observations, establishing the learning objectives, and considering the logistics of the encounter. Performance Dimension Training (PDT) is designed to familiarize faculty with the definitions and criteria for the competencies being observed. When faculty understand performance dimensions as a group, it is easier to achieve consensus on the behaviors that constitute superior versus inferior performance. Frame of Reference Training extends PDT by building faculty consensus on the minimum criteria that distinguish levels of performance. Training includes several exercises in which faculty work together to set criteria for satisfactory performance and create a BARS; vignettes are then viewed and rated using the scale, followed by feedback from the session trainer explaining the “true” ratings. Finally, Direct Observation of Competence training (developed by Holmboe and colleagues) incorporates elements of all of the above rater-training methods and adds relevant, practical exercises in direct observation. These exercises can be done in faculty development sessions and have been shown to lead to meaningful improvements.
+++
OSCE/Clinical Skills Examinations
++
Because direct observation by faculty in clinical settings does not have the reliability necessary for high-stakes assessments, most institutions now rely on structured performance testing. Clinical skills examinations, such as objective structured clinical examinations (OSCEs) or Clinical Performance Examinations (CPEs), are used by many programs to assess learners’ progression to the next level of training. The United States and Canada both include a clinical skills examination as part of their medical licensure requirements. These examinations often include a series of encounters with SPs who have been trained to portray patients with specific medical needs. Because variables in the encounter are held constant, these examinations are more reliable than observations with real patients. However, artificially creating and standardizing patient encounters does not fully eliminate threats to validity.
++
Standardized clinical examinations measure a learner’s skills at the “shows how” level of Miller’s pyramid; however, this occurs in an artificial situation. The key question for clinical skills examinations is how well their scores predict learner performance in actual clinical settings (i.e., the “does” level of Miller’s pyramid). Several studies support the predictive validity of CPE communication scores, showing strong correlations between medical students’ CPE scores and their later performance as residents. Although evidence supports the validity of CPEs for assessing communication, many aspects of communication in real clinical settings remain unmeasured, such as team communication and the ability to adapt communication style to different clinical situations.
++
Contextual fidelity, or the extent to which the CPE can be affected by the learner, environment, task, or technical system, can threaten the validity of communication skills assessment. The artificial context of the CPE may also deprive the learner of contextual cues found in a real clinical environment and thereby stymie the learner’s ability to communicate empathically. To address these challenges to validity, some experts recommend the use of reflective practice, challenging cases, and rater training in giving feedback to enrich SP simulations used to evaluate empathy. When the artificiality of the testing environment is further increased by “overstandardization” in SP training, encounters can lack both fidelity and validity.
++
There is some debate about whether BSS skills can be reliably assessed within general clinical cases or whether they are more accurately assessed when the explicit goal of the case is BSS focused, such as smoking cessation counseling. Some have also questioned whether a learner’s BSS skills can be assessed generically across several cases or whether assessment requires case specificity. Generally, each clinical case in a CPE has a separate, content-relevant checklist for history and physical examination items, whereas the CPE may use only one global rating scale to evaluate BSS skills across all cases. A study by Guiton and colleagues of fourth-year medical students found high generalizability for seven communication skills across seven OSCE cases, supporting the existence of a set of generic communication skills that can be identified across cases. However, they also found significant case–student variability in communication scores. This suggests that although one communication skills checklist can be used across several cases, any performance assessment of communication skills should include several cases presenting a variety of communication (or BSS) challenges in order to achieve adequate reliability.
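++
The practical value of adding cases can be illustrated with the Spearman–Brown prophecy formula, which projects how score reliability changes as the number of comparable cases increases; the figures below are hypothetical:

\rho_k = \frac{k\rho_1}{1 + (k-1)\rho_1}

where \rho_1 is the reliability of a single case and k is the number of cases. If a single communication encounter yields a reliability of 0.30, a seven-case examination would be projected to reach roughly 7(0.30)/(1 + 6 \times 0.30) \approx 0.75, one reason multi-case examinations are preferred for higher-stakes decisions about communication and other BSS skills.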
++
Although inter-rater reliability is generally higher in CPEs than in directly observed encounters with real patients, it remains important to address when assessing BSS skills. Findings in the literature are inconsistent as to whether faculty and SP ratings correlate or whether faculty are rating something different from SPs in assessments of communication and other BSS skills. Faculty often look for communication behaviors or skills known to be effective with patients, whereas SPs report, from inside the interaction, their experience of how effective the observed skills were. Some differences between expert faculty and SP raters have been found when global ratings are used in lieu of checklists. Training raters and using standardized tools (e.g., checklists) can increase inter-rater reliability and reduce the number of raters needed to achieve it.
++
Technically sophisticated mannequins and task trainers have been increasingly integrated into medical education for both teaching and assessment of clinical skills. A growing literature demonstrates that practicing high-risk procedures on computerized mannequins that appear to blink, sweat, and breathe creates enough realism, or fidelity, that the skills demonstrated in these scenarios carry over to real clinical situations. Psychological fidelity, or how “real” the situation feels and whether it requires the same cognitive and emotional skill set as a real situation, must also be considered. Although medical procedures are well suited to high-fidelity mannequin assessments, such simulations can also be used to test teamwork skills, clinical reasoning, and emotion and stress management.
++
Patient-focused simulation training combines mannequins and real people within a simulated scenario to provide psychological and functional fidelity. These simulations elicit both conscious and unconscious responses from learners and provide opportunities to assess skills such as delivering bad news, disclosing medical errors, ethical decision making, and end-of-life care.
++
Computer-based virtual patients are another emerging technology for BSS assessment. More can be learned about the use and development of virtual patients on the MedBiquitous web site (see resource links).
+++
Performance Portfolios
++
Assessment of learner attitudes, knowledge, and skills is complex and requires a robust toolbox of assessment instruments. For BSS skills, the most valid evaluations use planned overlaps among SP encounters, case assignments, faculty observations, and other “real-world” tools such as multisource feedback or chart review. These diverse data elements are often integrated into “performance portfolios,” in which learners and mentors are given some latitude in selecting artifacts that demonstrate progression toward, or achievement of, competence. By drawing from a pool of validated assessment instruments and evaluation data (e.g., CPE, OSCE), portfolios ensure some level of standardization while still allowing sufficient flexibility for learners with different strengths and needs. Ideally, portfolios serve both summative and formative purposes.