
The end of marking? The case for (and against) Comparative Judgement

Proudly telling the world about innovations at the IB in our 50th anniversary year

By Antony Furlong

Consider the following question:

On a scale of 1-10, how lifelike are each of the following drawings of an elephant?

Elephant A*

Elephant B**

If I asked this question to a room full of people, I’d probably get a wide range of answers for each elephant. Maybe you think elephant A’s a 2 but the person sat next to you thinks it’s a 3. I’d award elephant B a 9, but any art critic in the room might have different expectations for what a 9 would look like.
The reason we’d get such a range of numbers is really quite simple. Whether it’s how lifelike a drawing of an elephant is, how scary the last horror movie you saw was, or how good a theory of knowledge (TOK) essay is, humans find it difficult to assign and agree on absolute values for things.

But when we mark student work, this is precisely what we have to do.

It is therefore perhaps no surprise that reliability is such a fiercely discussed topic within the assessment community. Here at the IB, as well as at other awarding organisations, a great deal of time and effort is invested in ensuring that our processes lead to a high level of examiner agreement.

Now, if I were to change the original question to the following, much simpler one:

Which of the two drawings of an elephant is more lifelike?

we would expect almost unanimous agreement.

It is this very simple idea that is the driving force behind Comparative Judgement (CJ). With CJ, examiners are repeatedly given pairs of candidate work and simply asked to decide which of the two is ‘better’. These judgements are then combined to create a scaled rank order, which can then be converted into ‘marks’ should we so wish. Gone is the need to worry about harshness or leniency, or whether something is an 8, a 9 or a 10; all that matters is understanding what ‘better’ looks like.
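The post doesn’t say how the pairwise judgements are combined into a scaled rank order, but CJ implementations commonly fit a Bradley–Terry model to the outcomes, estimating a quality score for each script such that the observed ‘wins’ are most likely. The sketch below is a minimal illustration under that assumption; the `bradley_terry` helper and the script names are invented for the example, and a production system would need to handle ties, never-compared scripts and convergence checks more carefully.

```python
from collections import defaultdict

def bradley_terry(judgements, iterations=100):
    """Estimate a quality score per script from pairwise judgements.

    judgements: list of (winner, loser) pairs, e.g. ("script_A", "script_B").
    Returns a dict mapping each script to a non-negative score; higher means
    'better'. Uses the classic minorisation-maximisation (MM) update for
    the Bradley-Terry model.
    """
    wins = defaultdict(int)    # total wins per script
    pairs = defaultdict(int)   # number of comparisons per unordered pair
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    scores = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        new = {}
        for s in scripts:
            # Each comparison involving s contributes 1/(p_s + p_t) to the
            # denominator of the MM update.
            denom = sum(
                pairs[frozenset((s, t))] / (scores[s] + scores[t])
                for t in scripts if t != s
            )
            new[s] = wins[s] / denom if denom > 0 else scores[s]
        # Rescale so the scores keep a fixed total (they are only defined
        # up to a multiplicative constant).
        total = sum(new.values())
        scores = {s: v * len(scripts) / total for s, v in new.items()}
    return scores
```

The resulting scores give the scaled rank order; converting them into ‘marks’ is then just a matter of mapping the scale onto whatever mark range the assessment uses.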

As well as examiners having a much simpler task, each piece of work is looked at by several different people, making the process much more consensus-driven and considerably lessening the impact that one examiner’s opinion will have on a candidate’s final grade. It is therefore not surprising that CJ has been found to be very reliable in many different contexts.

Because of the nature of the judgements being made, a strong argument can also be made that CJ will lead to more valid results than traditional marking. During my time at the IB, I’ve sat in a number of meetings where marking has been taking place and on several occasions I’ve heard examiners say things like ‘X is the better candidate, but actually Y does better against the marking criteria’. This problem disappears with CJ.

In CJ, examiners are typically asked to make a holistic judgement about which piece of work shows a better understanding of a certain concept or construct without having to worry about how well each piece of work matches a set of marking criteria. Therefore, CJ allows us to assess whatever it is we are trying to assess directly, rather than assessing a fixed and imperfect compartmentalisation of it that can easily be explained within a markscheme or set of marking criteria. This freedom could in turn allow those involved in designing assessments to create more open and interesting tasks without having to worry about what the markschemes that accompany them will look like.

It is, however, not all good news. As mentioned above, a large number of judgements have to be made for each piece of work, which makes it a close call as to whether it is even feasible for the IB. Furthermore, teachers and examiners, whilst generally reacting positively to CJ, have expressed concerns about what feedback to students and Enquiry upon Results would look like, along with the difficulty in explaining how a final ‘mark’ is arrived at. Whilst these are perfectly legitimate concerns, it is hard to know at this stage whether they are ‘dealbreakers’, challenges to the IB to alter our procedures to suit the process, or drawbacks that are tolerable given the potential advantages (please do leave a comment below if you have an opinion).

So there you have it. The hope of more valid and reliable results but significant challenges around feedback, enquiries and auditability. Is CJ something the IB should be considering?

Antony Furlong is Manager, Assessment Research and Design, at the IB.

*Image source: copied from

**Image source: copied from


  • Pak Liam

Wow, does that not go against formative assessment principles of feedback? When we need to tell a student how they went against a clear and transparent set of criteria (or outcomes) and then what they need to do next time in order to improve?

‘Just be more like John’ is hardly informative feedback…

  • Margaret Thompson

    Yes, but I believe the idea is to use CJ for summative assessments, when a final judgment is being made about the student’s understanding of a concept or skill. This is not assessment for learning, but assessment of learning.
My concern is that assessment tasks, particularly complex ones, are rarely able to focus on a single skill or concept. What if Student A is better at articulating an idea, but Student B’s ideas show a better grasp of the underlying concepts, once you have parsed the language? I am thinking of EAL students in particular.

  • Tanya Haggarty

So what CJ is referring to is norm-referenced assessment, a form of assessment that I thought we had moved on from; criterion-referenced assessment is where we should be heading. Clear criteria for success, which don’t refer to how other students are performing, seem much more in keeping with the IB assessment philosophy.

  • Colin Duff

    I imagine comparative judgements are common practice despite having criteria to support assessment.

    Once familiarity with strand assessment statements is gained, there is a likelihood that teachers make faster decisions about the grade of specific pieces of work. In turn, there is a tendency to sort student work accordingly, resulting in separate piles of work: Excellent, Very Good, Good, Satisfactory, Unsatisfactory.

  • When the clearest way to a better grade is to have worse peers, I feel like the assessment scheme has gone off the rails.

  • Antony Furlong

    Hi Tanya,

Thanks for your comment. You hit on an interesting and important point here (and one I would have covered if I hadn’t already written quite a long blog post). In short, CJ wouldn’t have any impact on how you would choose to set standards, since it is a replacement for marking here and not grade awarding/standard setting. The procedure for deciding where grade boundaries would be positioned and which grade a piece of work deserves can be exactly the same as it is at the moment and so would continue to be in keeping with the current IB philosophy of criterion-related assessment.

    Thanks again for your comment,


  • Antony Furlong

    Thanks to everyone commenting here for your thoughts on this topic.

  • Antony Furlong

    Thanks, Margaret. This is a really interesting thing to think about. In many assessments in the IB Diploma, particularly those that are essay based, we already ask examiners to award a single mark for a piece of work based on a holistic judgement that weighs up all aspects of a task against one another. I suppose therefore that this challenge is not new but perhaps brought more into the foreground by CJ.

    What I think would be key here is how ‘better’ is defined and to make sure that those involved in the judging are given enough guidance and training about how to deal with such judgements without detracting from the spirit of the process.

  • WillKS

How would the exemplar used for comparison be determined? Would students see the comparison before or after the marking? And whether marking or CJ, aren’t they both comparative judgements: one against an exemplar and the other against descriptive criteria? I agree with the respondents who note the difference between drawings of elephants and the Extended Essay.

  • Mark Munday

I am struggling to see how the CJ idea would be applied in practice. Would it mean that instead of marking 180 scripts, I would have to rank them from 1 to 180? And what criteria would be used for ranking them?

  • Antony Furlong

    Dear Mark,

Many thanks for your comment. The way the process works is that you would be shown two scripts (onscreen), decide which is the better of the two and then be shown another (different) pair of scripts, and so on. All the judgements everyone has made are then combined, and a rank order of all scripts is created from them.

    Regarding what criteria are used for making a judgement, it can take any form we want it to but a balance needs to be struck between giving people enough information to make a judgement without discouraging the holistic nature of the judging task. For the trials that the IB have run, it has usually been a statement about the qualities a ‘good’ response would have for that particular assessment.

    Thanks again for your comment,