GSoC 2024 Project Idea 10.1 Using markerless motion capture to drive music generation and develop neuroscientific/psychological theories of music creativity and music-movement-dance interactions (350 h)

This is a new exploratory project that aims to use recent AI-based advances in markerless motion capture software (e.g. MediaPipe from Google) to create a working framework in which movements are translated into sound with short latency, enabling, for example, new gesture-driven musical instruments and dancers who control their own music while dancing. The project will require familiarity with Python and the ability to interface with external packages like MediaPipe. Familiarity with low-latency sound generation, image processing, and audiovisual displays is an advantage, though not necessary. The development of such tools will facilitate both artistic creation and scientific exploration of multiple areas, including, for example, how people engage interactively with vision, sound, and movement and combine their respective latent creative spaces. Such a tool will also have therapeutic/rehabilitative applications in populations of people with limited ability to generate music, in whom agency and creativity in producing music have been shown to produce beneficial effects.

Skill level: Intermediate/advanced

Required skills: Comfortable with Python and modern AI tools. Experience with image/video processing and with using deep-learning-based image-processing models. Familiarity with C/C++ programming, low-latency sound generation, image processing, and audiovisual displays, as well as MediaPipe, is an advantage, though not necessary.

Time commitment: Full-time (350 h, large project)

Lead mentor: Suresh Krishna

Project website:

Backup mentor: Yohai-Eliel Berreby

Tech keywords: Sound/music generation, Image processing, Python, MediaPipe, Wekinator, AI, Deep-learning

Hi Professor Suresh, this is Aansh Samyani, a 2nd year undergraduate from Indian Institute of Technology, Bombay (IIT Bombay). I am deeply intrigued by the ideas in this project and am also familiar with the basics of Deep Learning and Image Processing. Can you guide me on how I should start getting hold of some of the prerequisites for this project, and how I should start contributing or building my proposal for it?

Thank You!

Greetings @suresh.krishna @greg_incf
My name is Tvisha Vedant and I am currently a second-year B.Tech Computer Science student at Veermata Jijabai Technological Institute, Mumbai.
I am well-versed in C++, Python, and PyTorch.
I have a strong background in various areas of computer science, including natural language processing (NLP), image processing, web development, libraries like numpy/pandas, and version control with Git/GitHub. I contributed to building a web app that controlled music volume and playback through gestures using MediaPipe. I have worked extensively with LLMs and related tools like LangChain, the Transformer architecture, and NLP, which I applied in developing a healthcare chatbot (using the Llama model from Hugging Face, after testing many models).

I am new to the world of open source, with just a few months’ experience, but the entire concept and setting of open source contributions appeals to me!

I found your project idea very intriguing and it aligns perfectly with my experiences and learnings.
Additionally, the various possibilities for learning new things and implementing different features appeal to me!

The repository mentioned does not provide any adequate tech-related information. Please guide me through the further tasks to be performed and the proposal preparation for GSOC’24.

Looking forward to collaborating with you.
My github repo:
My resume:

Hi Professor Suresh, this is Om Doiphode, a third-year undergraduate student in Computer Engineering at VJTI Mumbai. I am quite interested in this project. I have experience in Deep Learning and image processing. I was also selected for the Amazon ML Summer School 2023. Currently, I am an ML intern at an AI startup in Sunnyvale, California. Can you please guide me on how to contribute to this project or are there any prerequisite tasks to be done for this project? Thank You!

@Om-Doiphode @Tvisha_Vedant @Aansh_Samyani - thank you for your interest.

This is a new project, and as such, there is no existing code-base. We also do not anticipate active code development until the GSoC period starts and 0 - 2 of you are selected via GSoC (based on the projected, but not guaranteed, number of slots the project may get). Those who are not selected are of course welcome to contribute in any case if they are able to.

The goal for you between now and April 2, the application deadline, should be to research this space, and develop a concrete coding proposal for how to accomplish these goals, in a way that is aligned to your background and existing skills. There are several templates available on the web for successful GSoC proposals - use one of those. I can also give you feedback on 1 version, if you send it to me sufficiently in time.

In terms of the goals for this project, the project description is a good start. Then take a look at the Wekinator (software for real-time, interactive machine learning). The goal is to do something like that, but with modern software tools like MediaPipe. The project involves receiving images via a good webcam or a specialized high-speed camera (e.g. Blackfly S USB3 from Teledyne FLIR, or Genie Nano M640 Mono from the Teledyne DALSA Genie Nano-1GigE series), processing them with MediaPipe (with C++ and/or Python - this is non-trivial since the C++ version involves some compiler issues and the Python version is slower), and then sending output to a speaker via a low-latency sound card (via USB or a dedicated protocol). Additional work can involve setting up a framework that allows users to map gestures to sounds - this is a very complex process with a lot of research on it, and ideally the framework should give maximum flexibility, yet sufficient guidance, to the user.
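To make the latency concern concrete, here is a rough back-of-the-envelope budget for a camera-to-speaker chain. It is only a sketch: the function names are hypothetical, and the inference time, buffer size, and sample rate are illustrative assumptions, not measurements of MediaPipe or of any particular sound card.

```python
# Rough latency budget for a camera -> pose model -> sound pipeline.
# All numeric inputs used below are illustrative assumptions, not measurements.

def frame_interval_ms(fps: float) -> float:
    """Worst-case wait for the next camera frame, in milliseconds."""
    return 1000.0 / fps

def audio_buffer_ms(buffer_frames: int, sample_rate_hz: int) -> float:
    """Time needed to play out one audio buffer, in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate_hz

def latency_budget_ms(fps: float, inference_ms: float,
                      buffer_frames: int, sample_rate_hz: int) -> float:
    """Sum of frame wait, model inference time, and one audio buffer."""
    return (frame_interval_ms(fps)
            + inference_ms
            + audio_buffer_ms(buffer_frames, sample_rate_hz))

# Compare an ordinary webcam rate with a high-speed camera rate,
# holding the assumed inference time and audio settings fixed.
for fps in (30, 120):
    total = latency_budget_ms(fps, inference_ms=15.0,
                              buffer_frames=256, sample_rate_hz=48000)
    print(f"{fps} fps: about {total:.1f} ms")
```

Under these assumptions the frame interval dominates at webcam rates, which is one reason a high-speed camera is listed as an option.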

Please feel free to ask questions here as needed, but please do your own thinking and homework first, and spend time on your posts/questions.

Thanks @suresh.krishna I will start working accordingly.

ps. You are all welcome, and in fact encouraged, to write as much pilot code as you want that supports your proposal. For example, demos, successful installation of a working prototype, etc etc.

When I wrote above that “We also do not anticipate active code development until the GSoC period starts”, I meant that we do not anticipate mentored, curated code-development towards the final code-base.

Please use Neurostars for all communication, and not direct emails. Thank you.

Hi @suresh.krishna @greg_incf
I’m Deepansh Goel, a Computer Science Engineering student and ML Freelancer from MAIT Delhi.

I find this project aligns well with my past experience. I’ve worked on several projects involving MediaPipe and audio processing.

One of them is a physiotherapy/posture correction module that provides feedback through sound beeps. Although I’m uncertain about its low latency compatibility with the specified hardware, I can share a 3-minute demo video for your consideration.

The provided problem statement seems to be an enhanced version of this project that I completed 2 years ago.

I’m familiar with music-related parameters and have previously developed a clustering-based music sorter utilizing data from the Spotify API.

Moreover, I’ve conducted research and partially implemented an advanced sign language recognition project that involves sorting video frames, deviating from the conventional single-frame approach. This aligns with the gesture-to-sound association suggested in the final statement.

Additionally, I have worked on many smaller projects regarding the same domains.

Looking forward to the discussion.

Here is the video link for the first project:

[PS: It’s not the best GUI that’s out there, and I was in high school when I recorded it. I apologise for the quality issues]

Here is my profile: Deepansh Goel

@Deepansh_Goel - thank you for the interest. your previous work seems relevant and you should use that in your proposal. the more specific and detailed you can make your proposal, the easier it is to evaluate it correctly for feasibility, correctness, etc.

Thank you for the positive response. I have written some working pilot code, which uses pinch gestures to play the piano piece “Für Elise”.

I have created a PR with the source code and additional notes in the Github repository mentioned above.

Here is a demo video with the pilot code, of me playing Für Elise.
Here I have mapped all notes sequentially to a single gesture for testing; they can be mapped to different, relevant gestures, to be decided after further research.

Video Link

Edit: A youtube video link if the drive link fails to work:
Youtube Link

Please let me know if I can share a sample proposal with you by email for a quick review.
cc: @greg_incf

nicely done. how many distinct gestures do you estimate you can recognize ? how do you estimate that ?

for now, please don't create a pr. instead, can you host the code on your own github and send me a link ?

yes, you can share a sample proposal. but first please only send me a version that is fully done, readable and correctly formatted. just make sure you have enough time left to implement any additions/changes.


Thanks, here is some further research I did.
If we just consider the hand gestures, there are two ways to generalise them.

  1. Hardcode the landmark positions (with an error threshold): This is a temporary solution. We can cover any number of gestures just by writing conditionals on landmark positions, but it will be tedious and has to be adjusted for every camera device separately, since the values will be different.

  2. Train a model to classify gestures based on landmark coordinates: Hand landmarks are sequential data, which can be used to train or fine-tune a classification model.
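For reference, the first (hardcoded) option can be sketched as a pinch test on thumb and index fingertip positions. This is not the pilot code from this thread; it is a minimal illustration that assumes landmarks arrive as 21 `(x, y)` pairs in MediaPipe's hand-landmark order, and the `0.25` threshold is an arbitrary assumption. Dividing by a hand-size estimate is one way to reduce the dependence on camera distance noted above.

```python
import math

# Indices follow MediaPipe's 21-point hand landmark model.
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_MCP = 0, 4, 8, 9

def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def is_pinch(landmarks, threshold=0.25):
    """Return True if thumb tip and index tip are close relative to hand size.

    landmarks: sequence of 21 (x, y) pairs in normalised image coordinates.
    threshold: assumed ratio, to be tuned on real data.
    """
    hand_size = distance(landmarks[WRIST], landmarks[MIDDLE_MCP])
    if hand_size == 0:
        return False  # degenerate detection, treat as "no pinch"
    gap = distance(landmarks[THUMB_TIP], landmarks[INDEX_TIP])
    return gap / hand_size < threshold
```

Every additional gesture needs its own hand-written rule of this kind, which is the tedium that motivates the trained classifier.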

The second option is the appropriate choice. There are a few things to keep in mind here:

  • There are some pre-existing datasets; for example, the HaGRID dataset contains 18 gesture classes.
  • If we generate our own dataset, we can recognize any number of gestures as long as they are clearly distinguishable: source link
  • For the tasks above, we can use a tool called “MediaPipe Model Maker” to fine-tune a pose model with less data: source link
  • The accuracy of our model will depend heavily on the quality of our dataset.

Once we have a model that can generate classification outputs, we can design a map from different combinations of gestures to combinations of notes or music samples.
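A minimal sketch of such a map, assuming the classifier emits string labels. The gesture names here are hypothetical placeholders, and the values are MIDI note numbers (60 is middle C).

```python
# Hypothetical gesture labels; a real classifier would define its own classes.
# Keys are sorted tuples so that the order of detection does not matter.
GESTURE_TO_NOTES = {
    ("fist",): [60],                 # C4
    ("palm",): [64],                 # E4
    ("fist", "palm"): [60, 64, 67],  # C major triad
}

def notes_for(gestures):
    """Look up the notes for a set of simultaneously detected gestures."""
    key = tuple(sorted(gestures))
    return GESTURE_TO_NOTES.get(key, [])  # unknown combinations stay silent

print(notes_for(["palm", "fist"]))
```

Keeping the map as plain data means users could edit or extend it without touching the recognition code.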

We will be considering both hand and pose landmarks, using MediaPipe Holistic, to generate musical outputs from a full-body perspective rather than just from hand gestures.

There are two levels on which we can do so:

  1. We generalise full-body gestures at the training level, where our dataset contains both hand and pose landmarks.
  2. We perform classification at two separate levels, pose and hand gestures, and then use combinations of hand gestures and body poses for mapping.

In summary, the first option provides us with a simpler solution with a single mapping output, and the second option is a little more complex but it provides us with more freedom of customization (two mapping outputs).

I am currently researching a tree-based recognition structure, where we can look beyond just hand and pose landmarks and use any desired combination of landmarks (more than two mapping outputs), making the project modular enough for other people to work on. This is relevant only if running more than 2-3 models is fast enough for our use case.

Github Link for the pilot code: sudo-deep/Gestures
I’ll share the proposal with you soon, thanks
Apologies for the late response, I was occupied last week with some important work.


this is very good. you are on the right track !

I saw the proposal. it is good. i actually have no specific comments.

if you have time:

i sent you a sample proposal, take a look. also take a look at the template that gsoc provides.

take a look at the wekinator project. also take a look at livepose (sat-mtl / tools / LivePose on GitLab). look into osc-based communication to something like supercollider or max (or puredata) for low-latency sound output.
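As a starting point for the OSC route, an OSC 1.0 message with float arguments can be built and sent over UDP using only the Python standard library. This is a sketch under assumptions: the `/gesture/note` address is a made-up example, and port 57120 is the default port on which the SuperCollider language listens; the receiving patch would need to be set up to match.

```python
import socket
import struct

def _osc_string(text: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4."""
    raw = text.encode("ascii") + b"\x00"
    return raw + b"\x00" * (-len(raw) % 4)

def osc_message(address: str, *values: float) -> bytes:
    """Build an OSC 1.0 message whose arguments are all 32-bit floats."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", v) for v in values)  # big-endian
    return _osc_string(address) + _osc_string(type_tags) + payload

def send_osc(packet: bytes, host: str = "127.0.0.1", port: int = 57120) -> None:
    """Send one OSC packet over UDP (fire-and-forget, no acknowledgement)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))

# Example: MIDI note 60 at amplitude 0.8, on a hypothetical address.
packet = osc_message("/gesture/note", 60.0, 0.8)
# send_osc(packet)  # uncomment once a receiver is listening
```

In practice a maintained OSC library would likely be preferable; the point here is only that the wire format is small enough that sending adds little overhead.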

Thanks for the feedback, I will make some improvements according to the sample template, and research about the topics above.

I went through some Wekinator demos, and a Python library called “MIDIUtil” for music generation. We can slightly modify the approach and give the user more flexibility in terms of tempo, duration and volume, in addition to octaves and notes.

We can start by using piano maps, use the initial method to set notes and octaves, and then use movement in joint position to set tempo and duration.

For instance, the left hand is used to set the octave (also acting as a “lock”), the right hand is used to set the note, and the change in the X and Y coordinates of one of the body joints controls other musical parameters such as tempo. The duration can be controlled by when the “lock” is released.
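A small sketch of how such a mapping might be computed, assuming the left-hand gesture has already been decoded to an octave number, the right-hand gesture to a note name, and the joint's X coordinate is normalised to the range 0 to 1. The function names and the 60 to 180 BPM range are illustrative assumptions.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_note(octave: int, pitch_class: str) -> int:
    """MIDI note number, using the convention where C4 (middle C) is 60."""
    return 12 * (octave + 1) + PITCH_CLASSES.index(pitch_class)

def tempo_from_x(x: float, low_bpm: float = 60.0, high_bpm: float = 180.0) -> float:
    """Map a normalised joint X coordinate linearly onto a tempo range."""
    x = min(max(x, 0.0), 1.0)  # clamp in case the joint leaves the frame
    return low_bpm + x * (high_bpm - low_bpm)

# Left hand selects octave 4, right hand selects "A", joint is mid-frame.
print(midi_note(4, "A"), tempo_from_x(0.5))
```

The note's duration would then be the time between the lock gesture appearing and being released, measured by the capture loop.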

PS: Whenever I say playing notes, this can in general also mean triggering a preset chord progression instead of just single notes.

There is a possibility to create a gesture controlled menu as well where the user might have the options to set the volume, background beats, duration, etc.

The basic principle is to have a gesture-action map, wherein any number of parameters can be selected to generate music or control the flow of the application.
This way, whatever further research we do, we can incorporate the resulting changes in parameters, whether they are static or change with the flow.

I will update you on the research about OSC-based communication and outputs. I will possibly look for an approach to bridge MIDIUtil and audio playback.

I have submitted a proposal with improvements based on the template provided by INCF and the sample proposal you shared.
