GSoC 2026 Project #36: Durham University - GestureCap: Creation of mappings to enable music and speech generation, to investigate musical creativity, agency and music-movement-dance interactions

Mentors: Alison Wang <jiaxi.wang@durham.ac.uk>, Deepansh Goel <deepansh.04614815623@cseaiml.mait.ac.in>, Suresh Krishna <suresh.krishna@mcgill.ca>

Skill level: Intermediate – Advanced

Required skills: If interested in music generation, experience/familiarity with a framework like Max/CSound/PureData/SuperCollider required and some experience with Python preferred. If interested in speech generation, fluency in Python required and experience with sign-language transcription / speech generation libraries preferred.

Time commitment: Full time (350 hours)

About: Over the last two years, we have developed GestureCap (GSoC 2024 report · GitHub), a tool that uses markerless gesture recognition and motion capture (via Google’s MediaPipe) to translate movements into sound with short latency, enabling, for example, new gesture-driven musical instruments. Our pipeline achieves gesture-to-sound latency as low as 12 ms, widening the range of possibilities for markerless gesture-driven musical expression. We have also created elementary mappings from gesture to sound.

Aims: This year, we aim to build on this initial proof of concept to create a usable tool that enables gesture-responsive music and speech generation. Of particular interest is the creation of a workflow/framework for building new mappings from detected gestures to sound. The development of GestureCap will facilitate both artistic creation and scientific exploration of multiple areas, including, for example, how people engage interactively with vision, sound, and movement and combine their respective latent creative spaces. Such a tool will also have therapeutic/rehabilitative applications for populations with limited ability to generate music, in whom agency and creativity in producing music have been shown to produce beneficial effects.

Project website: m2b3/gesturecap2025 · GitHub

Tech keywords: Sound/music generation, Image processing, Python, MediaPipe, Wekinator, AI, Deep-learning

Hi Alison, Deepansh, and Suresh,

I’m a first-year master’s student in cognitive science at ENS-PSL in Paris, with a background in embodied cognition, language iconicity, and cognitive semantics. I’m interested in contributing to GestureCap through the speech generation track.

The project connects well to work I’ve done previously — including studying iconicity computationally through transfer learning on multilingual vector embeddings, and modeling action verb learning using a spatial “language of thought” formalism. I currently work with Sho Tsuji on infant phonological discrimination, and before that spent nearly two years as a research assistant in psycholinguistics at UChicago (language contact, scalar semantics). I’m comfortable with Python and have experience building experiment pipelines and working with behavioral data.

I’m particularly drawn to the gesture-to-speech direction because it feels like a natural extension of my iconicity research into a tools-building context. During my PSL-week, I took a class that ended with a similar gesture-based instrument project — we used LabVIEW to map 7 handshapes to corresponding musical notes. It was a small project, but it gave me hands-on experience with the core loop of this kind of tool.

A few things I’d love to understand before writing a proposal: Is there a preferred architecture for the speech generation component — e.g., building on existing TTS systems, or something more generative? And is there room within the scope to study how users develop consistent gesture-meaning mappings over time, as a question about the cognitive side of the tool’s use?

Looking forward to the discussion.
Tony


@Yutong_Zhou Thank you for your interest. There are existing sign-language-to-sound/text packages that should be built upon; that would be the ideal path for the speech feature.

Hi Alison, Deepansh, and Suresh,

My name is Shiv Patil, and I’m interested in applying for the GestureCap project for GSoC 2026. I’ve started exploring the repository to better understand the current pipeline before drafting a proposal.

While trying to run the preview pipeline locally (latency_measurement/preview_flircam.py), I encountered an issue where the script fails immediately due to a dependency on PySpin (FLIR Spinnaker SDK). From reading the code, it looks like video/flircam.py imports PySpin at module load time, which causes the program to raise a ModuleNotFoundError on systems that do not have the FLIR hardware/software stack installed.

I understand that the project is designed around a FLIR Blackfly S camera for high-FPS capture, so the dependency itself is expected. However, it seems to make it difficult for contributors without that hardware setup to run even the preview pipeline or explore parts of the gesture detection workflow.

I wanted to ask whether this behavior is intentional for the current research setup, or if it would be useful for contributors to improve this part of the codebase - for example by adding clearer dependency handling or a development fallback so that the software pipeline can be explored without the FLIR stack.
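To make the suggestion concrete, the dependency-handling pattern I have in mind is a guarded import, so the failure moves from module load to first camera use. This is only an illustrative sketch (the class here is a stand-in, not the actual video/flircam.py code):

```python
# Guarded import: fail at first camera use, not at module load,
# so contributors without the FLIR stack can still run the rest.
try:
    import PySpin  # FLIR Spinnaker SDK; only present with the FLIR software stack
except ImportError:
    PySpin = None


class FlirCam:
    """Illustrative stand-in, not the real video/flircam.py class."""

    def __init__(self):
        if PySpin is None:
            raise RuntimeError(
                "PySpin (FLIR Spinnaker SDK) not installed; "
                "use a webcam-based VideoInput implementation instead."
            )


# On a machine without the FLIR stack, construction fails with a clear message:
failed_cleanly = False
try:
    FlirCam()
except RuntimeError:
    failed_cleanly = True
```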

I’d appreciate any guidance on whether this would be a good direction for an initial contribution.

Looking forward to the discussion.

Best,
Shiv Patil


@ShivPatil26 - we have code versions lying around where we use the laptop webcam. Setting up Mediapipe for use with your webcam is very straightforward, and then you can take the part in our code that outputs OSC signals for the mapping etc, or just do it yourself.

Once the coding period starts and we see who is still sticking around (selected interns, contributors, etc.), we can work on making this pipeline with regular webcams accessible.

Perhaps @Deepansh_Goel still has the code somewhere.

All the best.

Hi Suresh, thanks for the clarification. That makes sense. I’ll set up a simple MediaPipe pipeline using my webcam and experiment with the OSC output/mapping side to understand the workflow better. Looking forward to exploring the project further.
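Roughly the mapping side I plan to prototype, as a pure-Python sketch (the landmark input is faked so it runs without a camera; the address name and scaling are my own choices, not the project's):

```python
def wrist_to_osc(wrist_y: float) -> tuple:
    """Map a MediaPipe-normalized wrist y (0.0 = top of frame,
    1.0 = bottom) to one OSC message: (address, height), where
    height runs 0-1 and 1.0 means the hand is fully raised."""
    clamped = max(0.0, min(1.0, wrist_y))
    return ("/gesture/height", 1.0 - clamped)


# With python-osc, the message would then go out as:
#   SimpleUDPClient("127.0.0.1", 5005).send_message(*wrist_to_osc(y))
msg = wrist_to_osc(0.25)
```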

Best,
Shiv Patil

Hi @ShivPatil26,

Thanks for showing your interest.

The FLIR camera object imported in preview_flircam.py is an implementation of the VideoInput class in video/video_input.py.

You can implement your own camera object against the same ABC; since the rest of the code works through VideoInput implementations, no other changes are required.
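A minimal sketch of that pattern (the real ABC in video/video_input.py may use different method names; the placeholder frame stands in for an OpenCV capture so the sketch runs without any camera):

```python
from abc import ABC, abstractmethod


class VideoInput(ABC):
    """Shape of the interface only; the actual ABC in
    video/video_input.py may define different methods."""

    @abstractmethod
    def read_frame(self):
        ...


class WebcamInput(VideoInput):
    """Webcam-backed drop-in. With OpenCV this would wrap
    cv2.VideoCapture(0); here read_frame returns a black
    placeholder frame so the sketch is hardware-free."""

    def __init__(self, width=640, height=480):
        self.width, self.height = width, height

    def read_frame(self):
        # Real code: ok, frame = self.cap.read(); return frame if ok else None
        return [[0] * self.width for _ in range(self.height)]


cam = WebcamInput(width=4, height=2)
frame = cam.read_frame()
```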

Let us know if you have any other queries

Regards,

Deepansh


Hi Alison, Deepansh, and Suresh,

I’m Meghana R, a final-year CSE student and I’m applying for Project #36 via the speech generation path.

I went through the gesturecap2025 codebase before writing my proposal. A few things stood out: sounddevice and pyaudio are already in requirements.txt but unused, which suggests the audio output layer was always intended to be built out. The natural insertion point for speech output is after client.send_message('/trigger', 1) in the consumer() function, in a separate non-blocking thread so the 13 ms detection loop stays unaffected.
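The non-blocking handoff I have in mind is a standard queue-plus-worker pattern; here is a minimal sketch (the list stands in for actual audio playback, and on_trigger is a hypothetical hook, not a function in the repo):

```python
import queue
import threading

speech_queue = queue.Queue()
spoken = []  # stands in for actual TTS synthesis + playback


def speech_worker():
    # Drain phrases off-thread; a real engine (e.g. Coqui TTS)
    # would synthesize and play audio here instead of appending.
    while True:
        phrase = speech_queue.get()
        if phrase is None:  # sentinel: shut the worker down
            break
        spoken.append(phrase)
        speech_queue.task_done()


worker = threading.Thread(target=speech_worker, daemon=True)
worker.start()


def on_trigger(phrase):
    """Called right after the OSC /trigger send; returns immediately,
    so the detection loop's latency budget is untouched."""
    speech_queue.put(phrase)


on_trigger("hello")
speech_queue.put(None)  # stop the worker for this demo
worker.join()
```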

For the speech engine, I’ve tested Python TTS integration locally using gTTS as a lightweight baseline and plan to use Coqui TTS as the offline production engine; it has strong accessibility-focused model support and works without internet connectivity, which matters for therapeutic use cases. I also noted the earlier comment about building on existing sign-language-to-sound/text packages; I think Coqui fits well here, and I’d love to discuss whether the mentors have specific packages in mind.

On the gesture classification side, I have previously built a MediaPipe + CNN pipeline that classifies facial landmarks into emotional states. The coordinate system is identical to GestureCap’s hand landmark pipeline, so extending the classifier to gesture types is something I’ve already thought through in detail. My plan is to start with rule-based classification for the first gestures and extend to a lightweight CNN in later weeks.
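The rule-based starting point could look like this toy sketch (landmark indices follow MediaPipe's hand-landmark numbering; the "3 or more fingers extended" rule and the gesture names are my own illustrative choices):

```python
def classify_hand(landmarks):
    """Toy rule-based classifier over MediaPipe-style hand landmarks:
    21 (x, y) pairs with y growing downward. A finger counts as
    extended when its tip sits above its PIP joint."""
    TIPS = (8, 12, 16, 20)   # index, middle, ring, pinky fingertips
    PIPS = (6, 10, 14, 18)   # corresponding PIP joints
    extended = sum(landmarks[t][1] < landmarks[p][1] for t, p in zip(TIPS, PIPS))
    if extended >= 3:
        return "open_palm"
    if extended == 0:
        return "fist"
    return "other"


# Synthetic open palm: PIP joints mid-frame, fingertips raised above them.
open_palm = [[0.5, 0.9] for _ in range(21)]
for i in (6, 10, 14, 18):
    open_palm[i][1] = 0.5
for i in (8, 12, 16, 20):
    open_palm[i][1] = 0.1
```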

Regarding the FLIR dependency — I plan to implement a webcam version using the VideoInput ABC, so I can develop and test the speech and mapping layer without needing the FLIR hardware setup.


@Meghana - you are welcome. The tricky part to solve is, of course, the gesture-to-text conversion in your pipeline. Looking forward to your proposal.

Hi Alison, Deepansh, and Suresh. I’m Ilias Mahboub, a molecular biology and neuroscience student at Duke University and Duke Kunshan University, researching across neuroscience, neuroengineering, and cancer biology. I’m applying for the music generation track. I built a gestural MIDI instrument using MediaPipe hand landmarks to control a DAW in real time: right-hand wrist height selects notes quantized to selectable scales (including maqam approximations), pinch triggers onset with hysteresis, finger spread controls velocity, and the left hand modulates pitch bend and CC1. It’s the same core pipeline as GestureCap: webcam → MediaPipe → gesture mapping → audio parameters. Outside the lab I DJ and produce across deep house, afrobeats, and reggaeton, and play classical guitar (you can check out my beat catalogue at https://wav.iliasmahboub.com).
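The scale quantization and pinch hysteresis I described can be sketched in a few lines (simplified relative to my actual instrument; the scale, MIDI range, and thresholds are illustrative):

```python
C_MAJOR = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets within one octave


def height_to_note(y, scale=C_MAJOR, low=48, octaves=2):
    """Quantize a normalized wrist height (0.0 = bottom of range,
    1.0 = top) to the nearest MIDI note in `scale`, spanning
    `octaves` octaves upward from MIDI note `low`."""
    degrees = [low + 12 * o + s for o in range(octaves) for s in scale]
    idx = round(max(0.0, min(1.0, y)) * (len(degrees) - 1))
    return degrees[idx]


class PinchTrigger:
    """Hysteresis on normalized pinch distance: fire an onset when the
    distance drops below `on_thresh`, and re-arm only after it rises
    back above `off_thresh`, so jitter near one threshold can't
    retrigger notes."""

    def __init__(self, on_thresh=0.05, off_thresh=0.10):
        self.on_thresh, self.off_thresh = on_thresh, off_thresh
        self.armed = True

    def update(self, dist):
        if self.armed and dist < self.on_thresh:
            self.armed = False
            return True  # onset fires exactly once per pinch
        if not self.armed and dist > self.off_thresh:
            self.armed = True
        return False
```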

I read through both the original gesturecap repo and gesturecap2025 (I originally read the wrong repo). The 2025 rewrite replaced the modular FeatureExtractor → Mapper → AudioGenerator architecture with hardcoded tap detection in the consumer process: fast for latency benchmarking, but the gesture logic cannot be swapped or composed. I opened PR #3, which adds a GestureMapper ABC paralleling the original Mapper pattern, extracts tap detection into a reusable TapMapper, and adds PinchMapper, VelocityMapper, and CompositeMapper for running multiple gestures simultaneously. Also included are PureData patches mapping pinch distance to oscillator frequency and hand velocity to a noise filter, plus 19 unit tests, the first automated tests in the repo.
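The composition pattern in PR #3 looks roughly like this (a simplified sketch, not the PR code itself; the toy PinchMapper here measures only the x-distance between thumb and index tips):

```python
from abc import ABC, abstractmethod


class GestureMapper(ABC):
    """One mapper = one gesture -> a list of (osc_address, value) messages."""

    @abstractmethod
    def map(self, landmarks):
        ...


class PinchMapper(GestureMapper):
    # Toy pinch distance: thumb tip (index 4) to index tip (index 8), x-axis only.
    def map(self, landmarks):
        dist = abs(landmarks[4][0] - landmarks[8][0])
        return [("/pinch/distance", dist)]


class CompositeMapper(GestureMapper):
    """Runs several mappers on the same frame and concatenates their
    OSC messages, so gestures compose instead of being hardcoded."""

    def __init__(self, *mappers):
        self.mappers = mappers

    def map(self, landmarks):
        return [msg for m in self.mappers for msg in m.map(landmarks)]


# Synthetic frame: 21 landmarks, thumb and index tips 0.3 apart on x.
lm = [[0.0, 0.0] for _ in range(21)]
lm[4][0], lm[8][0] = 0.2, 0.5
msgs = CompositeMapper(PinchMapper(), PinchMapper()).map(lm)
```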

My portfolio’s at iliasmahboub.com, GitHub at github.com/iliasmahboub.

Looking forward to discussing the music generation direction!


Hi Alison, Deepansh, and Suresh,

I have submitted my draft proposal for Project #36 (GestureCap: Gesture-to-Speech Generation) on the GSoC portal. I would really appreciate any feedback or suggestions before the final deadline, if you get a chance to look at it.

Thank you for your time,

Shiv Patil
