Experimental Research Method: Rapid Eval

 

Need:

Typical pilot tests of digital treatments (such as cognitive behavioral therapy self-treatment apps) cost $150,000 and take 3 to 8 months to produce a first estimate of validity. Paper prototypes and other user testing methods are insufficient to evaluate many concepts that resemble commercial video games, since the experience requires rich interaction. Some of these rich-media concepts are flawed in ways that could be detected by building something less than a full pilot; however, methods for doing so are lacking.

Solution:

“Rapid Eval” is the name of an experimental user testing method that trades accuracy for speed: it is quicker, is executed earlier in the product development cycle, and is deliberately less precise than traditional pilot testing. It combines the rapid, agile, lean approach used in commercial game prototyping with the efficacy assessment methods typically used to assess psychological interventions or treatments.

Key differences between traditional scientific pilot testing and Rapid Eval are summarized in the following chart:

                       Traditional Pilot                         Rapid Eval
Purpose                Reliable findings for all stakeholders    Rough estimate of efficacy, for developers only
Intervention design    Fixed during pilot                        Constantly evolving
Precision of outcomes  Reliable enough for publication           “Order of magnitude” estimate
Iteration speed        1–3 months                                1–2 weeks
Scope                  Business & scientific                     Scientific
Leadership/Team        Subject Matter Expert                     Game Designer/PM

Development Team:

The project was led by Josh Whitkin and Isabela Granic. Dr. Whitkin is a designer/researcher with 20 years of commercial and academic experience building and studying video games for good. Dr. Granic is a psychologist with a clinical and research background who has designed acclaimed video games for mental health.

Process:

We tested Rapid Eval as we iteratively built and tested 12 distinct prototypes (v4.0 through v11) of Scrollquest, a multiplayer game. We conducted an average of 3.7 playtests per week over 11 weeks, from October 2015 through January 2016, for 47 playtests in total. Each week, we:

  • reviewed the past week’s playtests
  • developed feature requirements
  • built those features
  • modified the playtest protocol
  • conducted new playtests

At times, our weekly cycle was stretched to two weeks to accommodate major feature changes.

In “Rapid Eval”, we measured efficacy by conducting weekly 1-hour playtests and structured interviews. Two coders independently coded observations of teens’ and parents’ play behavior and verbal statements. We evolved our protocol along with the product, as detailed in the chart below.
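
The coding scheme and the agreement statistic are not specified in this summary. As one illustration of how the double-coding step could be quantified, here is a minimal Python sketch of Cohen's kappa between two coders; the event codes are invented for the example and are not the project's actual scheme.

    # Illustration only: agreement between two independent coders, measured
    # with Cohen's kappa. The event codes below are hypothetical.
    from collections import Counter

    def cohens_kappa(coder_a, coder_b):
        """Cohen's kappa for two equal-length lists of categorical codes."""
        n = len(coder_a)
        observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
        counts_a, counts_b = Counter(coder_a), Counter(coder_b)
        labels = set(coder_a) | set(coder_b)
        expected = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
        return (observed - expected) / (1 - expected)

    # Hypothetical codes for six observed events in one playtest
    coder_1 = ["engaged", "dismissed", "engaged", "neutral", "dismissed", "engaged"]
    coder_2 = ["engaged", "dismissed", "neutral", "neutral", "dismissed", "engaged"]
    print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")  # kappa = 0.75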

A sample evaluation is shown in a short video: https://www.youtube.com/watch?v=lGnsZcI1FbI. Note that this was recorded mid-project, with a scope and aims different from those of the final pilot test.

Data:

Compared to our pilot study, we found the “Rapid Eval” method to be beneficial as predicted, though we also encountered unexpected limitations. Key findings:

  • We discovered efficacy problems earlier than a traditional pilot would have
    • e.g. at v4, we had built a playable prototype that revealed a key problem with efficacy: players were dismissing the rejection event as “just a game”. From v4 to v11, we iterated on that key problem. Without Rapid Eval, we might not have detected the problem until the final pilot.
  • Our efficacy measurement method was not accurate enough
    • We needed larger sample sizes: we had 3.7 playtests per week but needed roughly 20 per week. This caused problems; e.g. we incorrectly thought we saw “promising signs” at v7, but later findings and the final pilot suggest those signs were statistically insignificant. A rough illustration of this sample-size problem appears after this list.
    • Short, single-session playtests were a poor stand-in for the intended use pattern of multiple plays over several weeks
  • Online playtesting is a good fit for Rapid Eval
    • We got a sample more representative of the national population than local ‘bubble’ recruiting would have provided
    • Online testing was convenient for the team and participants alike
    • We got the benefit of ‘in-home’ settings at low cost (traveling to participants’ homes is normally very expensive)
    • Low cost: no equipment or research space needed, and recruitment costs were very low
    • Recruiting was convenient (mTurk ads)
    • We could change the recruiting language in ads quickly
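
The sample-size limitation above can be made concrete with a back-of-the-envelope sketch. The numbers here are assumptions for illustration, not project data: suppose that, with no real effect, each playtest has a 50/50 chance of yielding a positive coded observation, and that a week is read as “promising” when at least 75% of its playtests are positive.

    # Illustration with assumed numbers: how often random noise produces an
    # apparently "promising" week at ~4 vs. 20 playtests per week.
    from math import comb

    def prob_at_least(k, n, p=0.5):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    for n_per_week in (4, 20):
        threshold = -(-3 * n_per_week // 4)  # ceil(0.75 * n)
        p_false_signal = prob_at_least(threshold, n_per_week)
        print(f"{n_per_week:>2} playtests/week: "
              f"P(>= {threshold} positive by chance) = {p_false_signal:.2f}")

Under these assumptions, roughly one week in three looks encouraging purely by chance at 4 playtests per week, versus about 2% at 20 per week, which is consistent with our experience of reading too much into the v7 observations.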

Results:

The project conducted both the Rapid Eval study and a traditional pilot study so we could compare and confirm findings. The pilot study consisted of 10 playtests conducted in two countries using the final v11 prototype, from December 26 to January 6, and its results confirmed the findings from the Rapid Eval method.

Discussion:

 

We feel we successfully demonstrated that “Rapid Eval” is likely to be a useful method. We believe a 2-week iteration cycle is achievable and appropriate for a variety of games for health, and that the specific shortcomings listed below could be addressed within that cycle.

Future Rapid Eval projects should address the specific shortcomings of this project:

  • Precision of outcomes was insufficient to guide development reliably. The sample size and protocol need improvement; important design decisions were made on faulty data.
  • The project needed more hours from the subject matter expert (a research psychologist, in this case) than it had. While many academics contributed their time generously, those contributions were insufficient for the pace a Rapid Eval project requires. Many of the Rapid Eval failures may have been caused by delays and by the lack of time to investigate and experiment with a paid psychologist consultant who has game development experience.
  • We improved the measurement methods and protocol as we went, but were perhaps not ambitious enough in changing the intervention itself. We could have addressed the key problem noted in early findings by trying multiple-session protocols and other intervention designs. We mainly made changes to the product, but kept fixed the way the product was used with teens and parents (a single 30-minute playtest followed by an interview).