Hand Gestures and Voice for Desktop Control

Intelligent Multimodal User Interfaces (6.835) Class Project: Spring 2021

Introduction

6.835 (Intelligent Multimodal User Interfaces) is a course that explores how to build intelligently designed systems that take in multiple modalities of user input (such as voice, text, and camera input). During the first half of the course, we explored different methods for doing this, including machine learning and algorithms that detect handwriting.

In the second half of the course, students were tasked with coming up with an original multimodal application and implementing it. We were able to use any sensors on our local machines (such as cameras and microphones), and could also request a LeapMotion sensor (which tracks the hands) or an Xbox Kinect (for full-body tracking).

Although other students came up with many interesting ideas and use cases (such as a hands-free version of Spotify), I wanted to create something that improved technological usability and accessibility while still being useful for the general population. The idea I came up with was a way to interact with a desktop interface without having to use the mouse and keyboard. The original inspiration was that people with fine motor control disabilities (such as Parkinson's disease) have trouble with tasks that require control of smaller muscle groups, such as the fingers, and desktop interaction often demands exactly that kind of accuracy and speed (for example, when using the trackpad). The system can also be helpful for the general population: for instance, when you are sitting farther away from your computer, or when your hands are dirty and you need to do something without getting your computer messy (say, when following a recipe in the kitchen).

Summary

Implementation

The main idea behind this project was to map hand gestures to mouse actions (such as clicking and moving the cursor). I implemented the project in Python, using the LeapMotion 3.2.1 SDK to read the sensor data. Here is a list of the Python packages I used to control the desktop:

The core features I wanted were for the user to be able to navigate any desktop program, perform common actions (such as searching and typing), and do all of this without being significantly slower than a typical keyboard and mouse, especially for our target users (people with fine motor control disabilities).
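
To give a concrete sense of the control loop, below is a minimal sketch (not the project's exact code) of how Leap Motion frames can drive the cursor. It assumes the LeapMotion 3.2.1 Python bindings (the Leap module) and the pyautogui package for synthesizing mouse movement; the polling rate and mapping are illustrative.

    import time

    import Leap          # LeapMotion 3.2.1 Python bindings
    import pyautogui     # cross-platform mouse/keyboard control

    SCREEN_W, SCREEN_H = pyautogui.size()

    def palm_to_screen(frame, hand):
        """Map the palm position into screen coordinates using the Leap
        InteractionBox, which normalizes positions into the range [0, 1]."""
        point = frame.interaction_box.normalize_point(hand.palm_position)
        x = point.x * SCREEN_W
        y = (1 - point.y) * SCREEN_H   # Leap y grows upward, screen y grows downward
        return x, y

    def run():
        controller = Leap.Controller()
        while True:
            frame = controller.frame()
            if not frame.hands.is_empty:
                hand = frame.hands[0]  # the real system uses the hand chosen in the config GUI
                x, y = palm_to_screen(frame, hand)
                pyautogui.moveTo(x, y)
            time.sleep(0.02)           # poll at roughly 50 Hz

    if __name__ == "__main__":
        run()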

User Interaction

The first version of this project intended for the user to type with an on-screen keyboard. However, after a few rounds of testing I deemed this not viable: it was very tiring to hold a hand up for several minutes just to type simple sentences, let alone a full paragraph. The final version of the project instead uses a combination of hand gestures and voice input to either enter text into the currently selected text field or perform common actions.

When the user first logs into the system, they are greeted with a help document (linked here) giving an overview of the available features and functionality. This document can also be accessed at any time. There is also a GUI for configuring the system to the user's liking (e.g. which hand is used for gesture input). When voice input is activated, a small window with a microphone symbol appears in the corner, indicating that the microphone is ready to receive input. Below is a summary of the help document and features, followed by a quick demo showing some of the basic functionality of the system.

Feature Summary
The main functionality is that the cursor moves as the user's hand moves; holding a grabbing motion for 2 seconds performs a click, holding it for 4 seconds performs a double click, and moving the hand while keeping the grabbing motion performs a click and drag. A right click is performed by facing the palm up and holding the same grabbing motion for about half a second.
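
As a rough illustration of how these gestures could be detected, the sketch below uses the Leap hand attributes grab_strength (0.0 for an open hand, 1.0 for a fist) and palm_normal, and fires the click when the grab is released. The thresholds, and the decision to act on release, are my assumptions rather than the project's exact logic.

    import time
    import pyautogui

    GRAB_THRESHOLD = 0.8   # treat grab_strength above this as a closed fist

    class GestureClassifier:
        def __init__(self):
            self.grab_start = None

        def update(self, hand):
            grabbing = hand.grab_strength > GRAB_THRESHOLD
            palm_up = hand.palm_normal.y > 0        # palm facing the ceiling

            if grabbing and self.grab_start is None:
                self.grab_start = time.time()       # grab just started
            elif not grabbing and self.grab_start is not None:
                held = time.time() - self.grab_start
                self.grab_start = None
                if palm_up and held >= 0.5:
                    pyautogui.click(button="right") # palm-up grab -> right click
                elif held >= 4.0:
                    pyautogui.doubleClick()         # long grab -> double click
                elif held >= 2.0:
                    pyautogui.click()               # medium grab -> single click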

To enter text, the user faces the palm of their hand up and holds it for about 2 seconds, until the microphone icon appears in the bottom right-hand corner. The user then continues to hold this position while saying the text, and lowers their hand when they are done.
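
Here is a minimal sketch of the dictation path, assuming the SpeechRecognition package with Google's free web recognizer and pyautogui for typing the result; the actual recognizer and trigger logic in the project may differ.

    import pyautogui
    import speech_recognition as sr

    def dictate_into_focused_field():
        recognizer = sr.Recognizer()
        with sr.Microphone() as source:
            recognizer.adjust_for_ambient_noise(source)
            audio = recognizer.listen(source)   # records until a pause is detected
        try:
            text = recognizer.recognize_google(audio)
            pyautogui.typewrite(text)           # type the result into the focused field
        except sr.UnknownValueError:
            pass                                # speech was not intelligible; ignore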

There are also voice commands that the user can issue (this mode is entered the same way as above, by facing the palm of the hand up). Commands include "undo", "redo", "save file", "open file", "backspace", "enter", "tab", "exit tab", "exit window", and "help".
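
One possible way to dispatch these commands is a simple lookup table from recognized phrases to keyboard shortcuts via pyautogui. The shortcuts below are the common Windows/Linux bindings and are illustrative assumptions, not necessarily the exact bindings the project used ("help", which opens the help document, is omitted).

    import pyautogui

    COMMANDS = {
        "undo":        lambda: pyautogui.hotkey("ctrl", "z"),
        "redo":        lambda: pyautogui.hotkey("ctrl", "y"),
        "save file":   lambda: pyautogui.hotkey("ctrl", "s"),
        "open file":   lambda: pyautogui.hotkey("ctrl", "o"),
        "backspace":   lambda: pyautogui.press("backspace"),
        "enter":       lambda: pyautogui.press("enter"),
        "tab":         lambda: pyautogui.press("tab"),
        "exit tab":    lambda: pyautogui.hotkey("ctrl", "w"),
        "exit window": lambda: pyautogui.hotkey("alt", "f4"),
    }

    def handle_phrase(phrase):
        """Run the matching command if the phrase is known,
        otherwise treat it as dictation and type it out."""
        action = COMMANDS.get(phrase.lower().strip())
        if action is not None:
            action()
        else:
            pyautogui.typewrite(phrase)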

Demo


Below is a video featuring a walkthrough of the system, how it works, and some of the user interaction. The full project report with more details is also at the bottom of this page.



Final Paper and Report

This work was completed in Spring 2021 for the final project in 6.835 Intelligent Multimodal User Interfaces (a graduate level course at MIT).
The full paper with results is included below. It can also be viewed and downloaded here.