By Antony Furlong
Consider the following question:
On a scale of 1-10, how lifelike is each of the following drawings of an elephant?
If I asked this question to a room full of people, I’d probably get a wide range of answers for each elephant. Maybe you think elephant A’s a 2 but the person sat next to you thinks it’s a 3. I’d award elephant B a 9, but any art critic in the room might have different expectations for what a 9 would look like.
The reason why we’d get such a range of numbers is really quite simple. Whether it’s how lifelike a drawing of an elephant is, how scary the last horror movie you saw was or how good a theory of knowledge (TOK) essay is, humans find it difficult to assign and agree on absolute values for things.
But when we mark student work, this is precisely what we have to do.
It is therefore perhaps no surprise that reliability is such a fiercely discussed topic within the assessment community. Here at the IB, as well as at other awarding organisations, a great deal of time and effort is invested in ensuring that our processes lead to a high level of examiner agreement.
Now, if I were to change the original question to the following, much simpler question:
Which of the two drawings of an elephant is more lifelike?
We would expect almost unanimous agreement.
It is this very simple idea that is the driving force behind Comparative Judgement (CJ). With CJ, examiners are repeatedly given pairs of candidate work and simply asked to decide which of the two is ‘better’. These judgements (usually at least 10 for every piece of candidate work) are then combined to create a scaled rank order, which can then be converted into ‘marks’ should we so wish. Gone is the need to worry about harshness or leniency or whether something is an 8, a 9 or a 10; all that matters is understanding what ‘better’ looks like.
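For readers curious how a pile of 'A is better than B' decisions can become a scaled rank order, here is a minimal sketch. It assumes a Bradley-Terry model, a common choice for paired comparisons and not necessarily the model used by any particular CJ tool or by the IB. The function name `fit_bradley_terry`, the script labels and the judgements themselves are all made up for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(judgements, iterations=200):
    """Estimate a score per script from (winner, loser) pairs.

    Assumption: a Bradley-Terry model, under which the chance that script i
    beats script j is strength_i / (strength_i + strength_j).
    Every script needs at least one win and one loss for a finite estimate.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            # Sum over every script that s was actually compared with.
            denom = 0.0
            for other in scripts:
                if other == s:
                    continue
                n = pair_counts.get(frozenset((s, other)), 0)
                if n:
                    denom += n / (strength[s] + strength[other])
            updated[s] = wins[s] / denom
        # Rescale so the strengths always sum to the number of scripts.
        total = sum(updated.values())
        strength = {s: v * len(scripts) / total for s, v in updated.items()}

    # Log-strengths give an interval-like scale centred near zero.
    return {s: math.log(v) for s, v in strength.items()}


# Hypothetical judgements: each tuple is (better script, weaker script).
judgements = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("A", "D"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("B", "D"), ("B", "D"),
    ("C", "D"), ("C", "D"), ("D", "C"),
]

scores = fit_bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
for script in ranking:
    print(script, round(scores[script], 2))
```

The gaps between the scores, and not just their order, carry information: a script that wins almost every comparison ends up well clear of the rest. Those scores are what could then be mapped onto 'marks' if required.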
As well as examiners having a much simpler task, each piece of work is looked at by several different people, making the process much more consensus-driven and considerably lessening the impact that one examiner’s opinion will have on a candidate’s final grade. It is therefore not surprising that CJ has been found to be very reliable in many different contexts.
Because of the nature of the judgements being made, a strong argument can also be made that CJ will lead to more valid results than traditional marking. During my time at the IB, I’ve sat in a number of meetings where marking has been taking place and on several occasions I’ve heard examiners say things like ‘X is the better candidate, but actually Y does better against the marking criteria’. This problem disappears with CJ.
In CJ, examiners are typically asked to make a holistic judgement about which piece of work shows a better understanding of a certain concept or construct, without having to worry about how well each piece of work matches a set of marking criteria. CJ therefore allows us to assess directly whatever it is we are trying to assess, rather than assessing a fixed and imperfect compartmentalisation of it that can easily be explained within a markscheme or set of marking criteria. This freedom could in turn allow those involved in designing assessments to create more open and interesting tasks without having to worry about what the accompanying markscheme will look like.
It is, however, not all good news. As mentioned above, a large number of judgements have to be made for each piece of work, which makes it a close call as to whether CJ is even feasible for the IB. Furthermore, teachers and examiners, whilst generally reacting positively to CJ, have expressed concerns about what feedback to students and Enquiry upon Results would look like, along with the difficulty of explaining how a final ‘mark’ is arrived at. Whilst these are perfectly legitimate concerns, it is hard to tell at this stage whether they are ‘dealbreakers’, challenges to the IB to alter our procedures to suit the process, or drawbacks that are tolerable given the potential advantages (please do leave a comment below if you have an opinion).
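To put the workload concern in rough numbers, here is a back-of-the-envelope sketch. It relies only on the fact that each judgement involves two pieces of work, so a target of 10 judgements per script works out at about five judgements per script overall. The cohort size and examiner count below are hypothetical, not IB figures, and the function names are made up for illustration.

```python
import math


def total_judgements(num_scripts, judgements_per_script=10):
    # Each judgement covers two scripts, so halve the naive product.
    return math.ceil(num_scripts * judgements_per_script / 2)


def judgements_per_examiner(num_scripts, num_examiners, judgements_per_script=10):
    # Assumes the judgements are shared evenly across the examiner pool.
    total = total_judgements(num_scripts, judgements_per_script)
    return math.ceil(total / num_examiners)


# Hypothetical session: 1,000 scripts shared among 50 examiners.
print(total_judgements(1000))             # 5000
print(judgements_per_examiner(1000, 50))  # 100
```

Whether that is feasible depends largely on how long a single comparison takes relative to marking one script against criteria.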
So there you have it: the hope of more valid and reliable results, but significant challenges around feedback, enquiries and auditability. Is CJ something the IB should be considering?
Antony Furlong is Manager, Assessment Research and Design at the IB.
*Image source: copied from clipartbest.com/clipart-ncEEdzR7i
**Image source: copied from yedraw.com/how-to-draw-elephant.html#.Wnw99G_wZaS