Course abstract

Constructed-response (CR) tasks offer test takers the opportunity to demonstrate complex skills with their responses. Constructed responses are not usually judged unambiguously correct or incorrect. Rather, they must be evaluated for the levels of knowledge, skills, or abilities they demonstrate. The promise of CR tasks to improve assessment hinges on the quality of the ratings of responses and the validity of inferences from the resulting scores and their uses (Livingston, 2009; McClellan, 2010).

Traditionally, trained human raters score responses using a rubric. Increasingly, artificial intelligence (AI) algorithms trained to predict human ratings are used to score constructed responses. But decisions must be made as to whether the AI predictions are adequate to serve as scores. The workshop will address how to make this judgment. We will discuss evaluating the performance of AI algorithms and the resulting scores in the context of the larger validity argument for constructed responses and the assessment as a whole. We will discuss the role of validity evidence for human ratings and the importance of the quality of those ratings for the validity of AI scores. We will also discuss how to judge the evidence needed in any given context, as well as issues of fairness and the maintenance of the AI scoring system once it is operational. With the release of large language models (e.g., ChatGPT), interest in using AI in assessment, including for CR scoring, is exploding. Practices to ensure that AI yields valid, fair, and reliable scores must accompany the rush to expand the use of AI scoring.
