Ann M. Aly
Human-Centered Design Expert, Strategist, and Mixed-Methods Researcher

Quantifying Difficulty

How can you reliably report what you observe in usability studies?

A common way to test usability is to ask a participant to complete a task (such as "submit a support ticket on this website") and evaluate their experience

Some common metrics to evaluate participant experience on a task include:

  • Success rate (success/fail/partial success)

  • Time to completion (in minutes or seconds)

  • Level of difficulty (easy/medium/difficult)

  • Lostness metric (how many screens a user visits relative to the minimum needed for the task)
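The lostness metric is usually computed with Smith's (1996) formula, which compares the screens a user actually visits against the shortest possible path. A minimal Python sketch (the function name and example numbers are illustrative, not drawn from any study):

```python
from math import sqrt

def lostness(total_viewed, unique_viewed, minimum_needed):
    """Smith's (1996) lostness measure.
    0 means a perfectly efficient path; values above roughly 0.4
    are commonly read as the user being 'lost'."""
    s, n, r = total_viewed, unique_viewed, minimum_needed
    return sqrt((n / s - 1) ** 2 + (r / n - 1) ** 2)

# Illustrative: a participant views 12 screens (8 unique) on a task
# whose shortest path needs only 4 screens
print(round(lostness(12, 8, 4), 2))  # → 0.6
```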

One challenge with these metrics is knowing WHY a task was difficult or failed

Success or failure in a task can be due to many factors, such as:

  • System bugs

  • Unclear instructions

  • (Un)familiarity with a product or service

  • Workarounds

Another challenge is reliability

With 1 person rating tasks, there may be bias; with 2 or more raters, there may be both bias and disagreement

Combining nuance and reliability: A case study

DISCOVERY
Screengrab of employee service portal

We conducted a usability study on a service portal for federal employees to request equipment and repairs

To learn how difficult common tasks in the portal were for users, we asked our participants to complete the following tasks:

  • Requesting copies

  • Mailing printed materials

  • Ordering dual monitors

  • Requesting a ceiling light repair

  • Checking the status of a submitted request

EVALUATION
Unreliable observations?

Assessing user difficulty

We wanted a rubric that evaluated task success as well as the types of errors, barriers, and feedback users might encounter along the way.

For this study, we used a 5-point rubric that distinguished between task failure due to cognitive load and task failure due to a fatal system bug.

Future iterations of our rubric will also differentiate between user errors and system bugs.

Flowchart for choosing common IRR statistics

Measuring reliability

We incorporated an interrater reliability (IRR) statistic into our analysis. IRR measures agreement between 2 or more raters scoring the same tasks with the same rubric.

To do this, 3 researchers independently rated each task, and then I used the irr package in R to calculate Cohen's Kappa (an IRR statistic for raters using a nominal scale).
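Under the hood, Cohen's Kappa compares the agreement raters actually reached to the agreement expected by chance from each rater's category frequencies; with 3 raters, one common extension is Light's Kappa, the average of the pairwise Cohen's Kappas. A pure-Python sketch of the calculation (the ratings below are made up for illustration, not our study data):

```python
from itertools import combinations

def cohen_kappa(r1, r2):
    """Cohen's Kappa for two raters on a nominal scale."""
    n = len(r1)
    categories = set(r1) | set(r2)
    # Observed agreement: proportion of items rated identically
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected (chance) agreement from each rater's marginal frequencies
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe)

def light_kappa(ratings):
    """Light's Kappa: average of pairwise Cohen's Kappas for 3+ raters."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical rubric scores from three raters across five tasks
rater_a = [3, 3, 4, 4, 3]
rater_b = [3, 4, 4, 4, 3]
rater_c = [3, 3, 4, 4, 4]
print(round(light_kappa([rater_a, rater_b, rater_c]), 2))  # → 0.47
```

In practice the irr package handles this for you; the sketch just shows what the statistic measures.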

If you're curious about common types of IRR statistics, I created a flow chart for researchers to reference.

Empirically backed suggestions for implementation

Our results showed that all tasks had barriers or required workarounds for our participants

All of our task ratings were in the following two categories:

  • (3) Task is completed by the user with minimal difficulty or obstacle (can be an error, bug, some confusion). User is able to recover and complete the task; may have feedback or input for ways to improve the task.

  • (4) Task is difficult for the user; due to cognitive load, confusion, or frustration (but not an error or bug), the user finds an alternative way to complete the task.

These results, along with the qualitative feedback from participants, gave us quantitatively backed recommendations (for example, removing redundancies or supporting users through errors) to improve the user experience of a service portal used by over 6,000 federal employees.

We iterated on this rubric for future studies
Our Cohen's Kappa statistic showed that the Mailing printed materials task needed to be re-rated due to questionable reliability between raters

We also learned about reliable rating as a team

The IRR measurements told our research team how much agreement we had when assessing the tasks participants completed. Any cases of low agreement needed to be re-rated or arbitrated by an additional rater. Without the IRR measurements, we would not have known how faithfully our team was applying the rubric, which could have skewed the recommendations we made to improve the service portal.

See this article for more information about agreement thresholds
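One commonly cited set of thresholds comes from Landis & Koch (1977), who label Kappa values in bands from "poor" to "almost perfect". A small helper sketching those bands (the function name is mine, and these cutoffs are conventional guidelines rather than a strict standard):

```python
def kappa_label(kappa):
    """Interpretation bands from Landis & Koch (1977).
    Guidelines only; other authors use stricter cutoffs."""
    if kappa <= 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

print(kappa_label(0.47))  # → moderate
```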

6 Participants · 3 Researchers · ~5 Hours in R · 6,000 Employees Impacted

Want to learn more?