One question that often gets asked, both at my talks and on projects, is how many gestures can a user reasonably be asked to remember in order to control a system?
The honest answer is: I don’t know, and I doubt anyone else does either. I’d love to see a research paper on this. There are a number of data points for it, however.
First is sign language. Signers have an extensive vocabulary of gestures in order to communicate. So clearly the upper limit of the number of gestures any human being can learn is at least in the hundreds and likely in the thousands if necessary. Of course, no user is likely to learn anywhere near that number to control a system. Signers do it out of the necessity of communicating.
Looking at the world of sports, most physical games seem to have a core set of movements necessary to play them. The number of movements for any given game is usually fairly small (although of course mastering them requires intense practice). From my observation and rough estimate, most sports have fewer than seven core movements, and that is counting actions like running and jumping.
Musicians likewise have a core set of gestures, and again, this number seems to be fewer than seven. If we take playing a stringed instrument, for instance, those are pressing and holding strings down, sliding the hand up and down, sliding the hand left and right, bowing or strumming, and plucking or picking. It’s the combination and nuance of performance that brings the instrument to life and makes music.
When we look at the digital world, I’m sure somewhere there is an academic paper on the number of keystroke commands (probably the most cognitively similar thing to interactive gestures on traditional WIMP systems) that users remember. I’m guessing this too hovers around five to nine per application, with of course some key commands (control-s for save for instance) spanning across multiple applications. Certainly some advanced users (just like professional musicians and athletes have a more advanced set of actions) will know more key commands for specific applications that they spend a lot of time working with.
The iPhone’s set of gestures is nine: tap, double tap, flick, drag, swipe, pinch open, pinch close, touch and hold, and two-finger scroll.
So to sum up, we don’t really know yet, but the magical number would seem to once again be seven, plus or minus two. Unless you are planning on training users, having a core set of five to nine gestures that can be remembered and perhaps even used in combination seems to be a good practice.
11 Comments
It would be fascinating to see a study done on the matter, though I do wonder if there’s any connection between the 7±2 and consideration for initial learning of the gestures.
As you said, much of the nuance of both sports and music comes from combinations of movements and nuances of movements. For a gestural interface, it does seem more usable gestures could come from reversing or combining existing gestures, so long as the combination or reversal makes sense as part of the gesture’s result. The iPhone’s pinch open and pinch close almost register as one gesture with a user, despite technically being two, because of that similarity. I think the same kind of reversal could work for Undo/Redo and Copy/Paste, in interfaces where they’re necessary.
Interestingly, Opera has been offering a gesture-based browser interface since 2001 with Mouse Gestures. I’ve been using Opera for eight years, and I’ve only really integrated four of the gestures offered into my usage patterns. It might also be worth investigating whether an interface that combines gestures with more traditional input methods runs into the 7±2 magic number as a combined value across both gestures and traditional motions, or whether the two would be cognitively separate.
I wasn’t suggesting, by the way, there was a real link between George Miller’s Magical Number 7 and remembering gestures. Miller’s number was about cognitive chunking and what we can retain in short-term memory. Nothing to do with long term memorization (to my understanding, anyway).
This post makes me question what is meant by gesture. Some of the things described above, like running, are actions, not gestures. When I play soccer, I run, jump, kick, etc. Those are actions. I may put my hand out in front of me to indicate to my teammate that I would like the ball put into a space where I can run to it. That is a gesture.
Are tap, double tap, flick, drag, et cetera gestures or actions?
A gesture is an action done in a specific context for a specific effect. More specifically, for an interactive system, a gesture is any action a system can detect and respond to.
Gestures can be fluid movements with no clear beginning/end or they can be static poses. Tap, double tap, etc. are definitely gestures: actions done for a specific effect.
Running probably wasn’t the greatest example, although for instance, running to a base in baseball I would consider a gesture, albeit a large one.
Which of course raises the question: what actions AREN’T gestures? It might be possible that ANY action in the right context could be considered a gesture.
Isn’t the real answer “None”? Any skilled guitarist knows the position of all major and minor chords (at least), which makes 16 hand positions, and there are gestures for transitioning between each of them, so a guitarist knows at least 16^2 distinct gestures. Guitarists can tell you what the fingering for a chord is, but can’t tell you what the gestures are between them. They aren’t learned or remembered in the sense of “To transition between D and Am, move your hand one fret left, bring your middle finger up to the 4th string on the same fret, etc…” To describe that, I actually have to watch my fingers move; if I couldn’t look at my hand, I couldn’t tell you at all, so it’s as if my fingers know how to do it, but I don’t. I think this points to the possibility that the many thousands of gestures a musician knows are not reducible to a small set of basic gestures.
Granted, this is achieved through years of practice, but isn’t it true that virtually everyone above a certain age, certainly every adult, has this practice, having learned gestures for manipulating objects in space beginning in early infancy, and is therefore capable of many thousands of gestures? I think the main limitation on the number of gestures is the ability to provide affordances for what kinds of gestures are available, which is effectively the problem with Opera gestures or shortcut keys: they are almost completely abstract, totally disconnected from how we learned to manipulate objects in childhood. Even with something like Opera’s next/previous gestures, moving left to go back in time only makes sense because of an implicit assumption of an abstract model of time like a timeline, where the past is left and the future is right. But the way the page is rendered, it’s as if the page in the present has been overlaid on the page in the past, which is essentially how Apple’s Time Machine represents it.
This is actually reminding me of George Lakoff’s “Metaphors We Live By”. A mistake to be avoided is to “naturalize” or reify the linguistic metaphors we use: assuming that the linguistic and cognitive metaphors we use to talk about concepts are actually, physically present in our interaction with real objects. This reflects a kind of stupidity in the domain of kinesthetic intelligence that has to be overcome: the bias toward cognitive and conceptual ways of knowing.
By “remember” I don’t necessarily mean consciously recall. I don’t have to think about a key command every time I use it, nor do I have to think about what hand position I need to have to play an F# on the D string. Nor, one hopes, will users have to consciously recall what the gesture is to turn the lights on in a room or change the channel on the TV.
I think I’m just repeating what alsomike and others have already said more coherently, but anyway:
I think it’s useful to consider all these gestures on a scale from direct to indirect/abstract (assuming for now that it’s okay to call them all gestures). Only on the right side of the scale does memory (in the sense of consciously memorizing) really come into it.
The gestures at the left end of the scale are basically the same as traditional direct manipulation interactions — tapping, selecting, dragging windows and objects.
As you move a bit to the right you get gestures like the ones provided by the iPhone’s rubber sheet metaphor: sliding, stretching, flinging. These are a little less obvious to users but they’re still fairly direct and are suggested by the overall metaphor so they’re easy to learn and remember. A little further right are semi-abstract gestures like circular scrolling.
At the indirect/abstract end of the scale are gestures like: make a certain squiggle to open up application X, or use this abstract alphabet to input text.
For direct gestures you’re not really loading the user’s memory but for indirect/abstract gestures you are. So I think the question “how many gestures can users remember?” is really the wrong one most of the time. You want to choose direct, discoverable gestures as much as possible. A given metaphor will only suggest so many and if you have to invent more then you’re probably going too far. But if you do introduce abstract gestures then the question of learning and memory really does apply. You could probably find data about that in papers on text entry (e.g. unistrokes). That’s probably the wrong direction to go, though… The bottom line, I think, is that this question is the wrong one to begin with (at least for designers): just pick appropriate gestures that fit the design and then test with users and you’ll find out if you’ve got too many.
We’re especially interested in this topic. We’re actually coordinating with the KU School of Design to do a research project figuring out, at least first, what gestures are “natural”.
Hello,
I asked a similar question last night… and as happens, rephrased the question in my mind after the Q&A ended. Here is what I was trying to get at…
As Dan mentioned above, context is important. You need to set the context and engage the listener (or listening device) before communication can take place. This would reduce a great deal of noise and help to separate between a potentially large number of devices. (What happens in a home when several devices are all looking at me at once and interpreting gestures?)
You need to engage the device, establish that you would like to begin communication and the device needs to respond with “Yes, I’m listening.” It’s a call and response… or handshake just like a modem… set the speed, the protocols, and then begin communicating. I would imagine that some form of this would be required for any gestural system in order to calibrate the physical constraints of the discussion. The range of motion, speed, child vs. adult…
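The call and response described above can be sketched as a small state machine. This is only an illustration under assumptions, not a description of any real device: the class name `GestureSession`, the gesture names (`raise_hand`, `dismiss`) and the single `reach_cm` calibration value are all made up for the example.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # ignores everything except the wake gesture
    CALIBRATING = auto()  # has acknowledged the user, waiting for a range sample
    LISTENING = auto()    # interprets gestures as commands


class GestureSession:
    """Hypothetical engage / acknowledge / calibrate handshake for one device."""

    def __init__(self, wake_gesture="raise_hand"):
        self.wake_gesture = wake_gesture  # assumed name, not from any real system
        self.state = State.IDLE
        self.reach_cm = None              # assumed calibration: user's range of motion

    def on_gesture(self, name, reach_cm=None):
        if self.state is State.IDLE:
            if name == self.wake_gesture:
                self.state = State.CALIBRATING
                return "ack: listening, please calibrate"
            return None  # gesture was not addressed to this device

        if self.state is State.CALIBRATING:
            if reach_cm is None:
                return "ack: still waiting for calibration"
            self.reach_cm = reach_cm
            self.state = State.LISTENING
            return "ack: calibrated"

        # LISTENING: everything is a command until the user dismisses the device
        if name == "dismiss":
            self.state = State.IDLE
            self.reach_cm = None
            return "ack: goodbye"
        return f"command: {name}"
```

Because an idle device returns nothing for gestures other than its own wake gesture, several devices in one room could each run a session like this without all reacting to the same movement, provided their wake gestures differ.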
I wonder though… if you think about the spectrum of languages and notations, they may have a large or small set of individual symbols and a grammatical structure. I would expect that depending on the context a set of gestures would evolve based on the complexity of what is being communicated… some sort of subject, object, verb, etc…
There are also ways to use existing cultural artifacts to establish context. If you look at the Wiimote as a means of capturing gestures, it’s often being wrapped in a form to help communicate the context. You can plug it into a prosthetic form like a steering wheel for racing games, a racket for tennis, a golf club, gun, etc. Even though the device is capable of capturing the gestures on its own, and the artifact is not offering any technical affordance, we are still wrapping it in a form to specify the range of expected gestures.
I actually like the idea of a prosthetic form to set context… Dan spoke about getting tired making the same gesture over and over, and yet there are instances where this happens, aided by a form of some sort.
Think about driving a car. We climb into an interface where our hands are supported by the steering wheel, and our feet are resting on the floor, but manage to steer and push the gas for many hours with a fine degree of control… (sometimes assisted by cruise control for longer trips)
The bottom line is that by using a prosthetic cultural artifact one can establish engagement, set a protocol of expected gestures, and communicate with a vocabulary of common gestures without having to train the participants to memorize a fresh series of unprecedented actions.
Quick thought from an airport departure lounge. Apologies for spoolong mastiks.
Shannon and Weaver’s classic research into communication may be worth looking at. Their issue was related to signal compression but led to a classic model of communication which still has uses. That would suggest that the problem is not so much how many gestures the user can remember but how the device interprets a gesture. I think turning the problem around like this might be useful. Just as my iPhone is interpreting my cack-handed typing and correcting spelling, a device should be able to interpret my gestures. Similarly, if you’ve ever met someone with a strong accent you know that getting used to it is important to understanding them.
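One way to picture a device that interprets sloppy gestures the way a phone corrects sloppy typing is nearest-template matching. The sketch below is an assumption-laden toy, not how any actual product works: the template names, the reduction of a gesture to a single normalised (dx, dy) displacement, and the `max_distance` cutoff are all invented for illustration.

```python
import math

# Hypothetical templates: each gesture is summarised as one (dx, dy)
# displacement, normalised so that a full swipe has length 1.
TEMPLATES = {
    "swipe_right": (1.0, 0.0),
    "swipe_left": (-1.0, 0.0),
    "swipe_up": (0.0, 1.0),
    "swipe_down": (0.0, -1.0),
}


def interpret(observed, templates=TEMPLATES, max_distance=0.5):
    """Return the name of the closest template, or None if nothing is close.

    observed     -- (dx, dy) of the movement the device actually saw
    max_distance -- assumed tolerance; beyond it the device gives up
    """
    name, template = min(
        templates.items(),
        key=lambda item: math.dist(observed, item[1]),
    )
    if math.dist(observed, template) > max_distance:
        return None  # too ambiguous to guess, like an uncorrectable typo
    return name
```

With these made-up numbers, a slightly crooked movement such as `(0.9, 0.2)` is still read as `swipe_right`, while a diagonal `(0.5, 0.5)` falls outside the tolerance of every template and is rejected rather than guessed at.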
The example of sign language is interesting. That’s a system that grows just like verbal language. Neologisms depend on correct interpretation and subsequent adoption (unless you’re part of a subculture, in which case adoption is intended to be restricted).
But we all use sign language: we all know how to ask someone if they want a drink when we see them walk into a bar. But again it depends on the cultural context: the “time out” sign may mean nothing in the parts of the world that don’t play whatever game it is that uses it. At the same time, having been exposed to it via TV shows in my youth, it has, for me, grown outside its initial context.
I’m rambling. Basically gestures should be interpretable by the device or learnable. That is the paradigm shift that’s important because ultimately there is no magic number: some guitarists know three chords, some know many. Some know enough that they can work out how a rare chord is produced. And many others can’t remember even one. The challenge is to cater for all those possibilities.
It seems that context is the crux of gestural interfaces. Does the gestural input and the system’s response match a real-world response? For example, with the iPhone, a flick of the finger moves the screen much like one could imagine a piece of paper would in the real world, so the gesture is easy to remember. Conversely, the iPhone’s zoom gesture is a little more abstract, and I find it slightly harder to remember which motion zooms in and which zooms out. So, maybe the number of gestures isn’t AS important as the quality of the interaction.
One Trackback/Pingback
[...] How Many Gestures Can Users Remember? By being so far ahead of the pack (and with so much penetration) in terms of touchscreen mobile interfaces, I think Apple’s patterns will go a long way towards establishing the standards going forward. [...]