GSoC Project Idea 16.1: Peer-to-peer file and metadata sharing for OpenWorm data management

The OpenWorm project has a data management tool called PyOpenWorm that aids in creating, storing, and sharing information about C. elegans and about the evidence that supports that information. Generally, this information is backed by data from the original work, available in the form of CSV files, videos, plots, etc. Although in many cases this data is made available either alongside a published article’s supplementary materials or in a data repository, PyOpenWorm should also allow these data to be shared, acting as a primary or secondary distribution mechanism.

Aims: Ideally, the student would design and implement a solution for peer-to-peer file sharing that integrates with the existing PyOpenWorm codebase and allows for access control (so that researchers can limit access to sensitive information, for instance), identity management (to support access control), and data-integrity checking (to protect against accidental and malicious changes). For redundancy and to reduce infrastructure costs, a peer-to-peer framework is desired. The student should review file sharing protocols (e.g., ed2k, Kademlia, BitTorrent) and determine which (if any) best align with OpenWorm’s goals.
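As a rough illustration of the data-integrity aim, the sketch below checks a downloaded file against a SHA-256 digest that is assumed to have been published alongside the data source's metadata. The function names `sha256_digest` and `verify_file` are hypothetical and not part of PyOpenWorm, and the choice of SHA-256 is an assumption; a real design would also need to decide how the expected digest itself is distributed and trusted.

```python
import hashlib


def sha256_digest(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large videos or CSV files
    do not have to fit in memory.
    """
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file's digest matches the expected digest.

    `expected_hex` is assumed to come from trusted metadata, not from
    the same untrusted peer that served the file.
    """
    return sha256_digest(path) == expected_hex.strip().lower()
```

A digest check like this only catches accidental corruption unless the expected digest arrives over a trusted channel; protecting against malicious changes would additionally need signatures or content addressing.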

Skills: Comfort with independent study, software development, and testing is expected. Python (2 and 3) experience is required. Familiarity with principles and practice of data management (equivalent to an undergraduate course in that topic) is recommended. Experience with peer-to-peer file sharing protocols is useful, but not required.

Mentors: Mark Watts (mark@openworm.org), Arnab Banerjee (arnab1896@gmail.com).

For the benefit of all potential applicants, here’s the baseline for what this project idea is talking about. What will distinguish students is whether they can propose a robust solution to this problem or even go beyond it.

This is a typical sequence which might require a file download:

  1. User requests to do a translation with pow translate.

    There can be other reasons to resolve the files associated with a data
    source, but this is currently the most well-developed use-case.

  2. One of the data sources to the translation is a LocalFileDataSource of
    some kind.

    LocalFileDataSource has a relative file path which must be resolved to
    an absolute path in order for the data source to be used. This is expressed
    by the FilePathCapability need in the LocalFileDataSource definition::

     needed_capabilities = [FilePathCapability()]
    

    That capability is met by the DataSourceDirectoryProvider in
    PyOpenWorm/command.py.

  3. Skipping over some detail, the file which the data source needs cannot be
    loaded by POWDirDataSourceDirLoader (see command.py for how this class
    fits into the story). This means that it’s not included in the ‘.pow’
    directory which pow needs to function. So, the file must come from some
    other DataSourceDirLoader.

  4. Everything above works now. The new work is to handle what happens when
    the file cannot be found by POWDirDataSourceDirLoader.
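The fallback described in steps 3 and 4 can be pictured as trying a list of loaders in order. The sketch below is only an illustration under stated assumptions: `LocalDirLoader`, `FetchingLoader`, and `resolve` are hypothetical stand-ins written for this example and do not reproduce the actual DataSourceDirLoader interface in PyOpenWorm, and the `fetch` callable is a placeholder for whatever peer-to-peer transport a proposal would choose.

```python
import os


class LocalDirLoader(object):
    """Hypothetical stand-in for a loader that only looks in a local
    directory (playing the role POWDirDataSourceDirLoader has in the text)."""

    def __init__(self, base_directory):
        self.base_directory = base_directory

    def load(self, identifier):
        path = os.path.join(self.base_directory, identifier)
        # Return None rather than raising so the next loader can be tried.
        return path if os.path.isdir(path) else None


class FetchingLoader(object):
    """Hypothetical fallback loader. `fetch(identifier, target_dir)` is an
    assumed callable that retrieves the files, e.g. from peers."""

    def __init__(self, base_directory, fetch):
        self.base_directory = base_directory
        self.fetch = fetch

    def load(self, identifier):
        path = os.path.join(self.base_directory, identifier)
        if not os.path.isdir(path):
            os.makedirs(path)
        self.fetch(identifier, path)
        return path


def resolve(identifier, loaders):
    """Return the directory from the first loader that can provide it."""
    for loader in loaders:
        path = loader.load(identifier)
        if path is not None:
            return path
    raise LookupError('No loader could provide %r' % (identifier,))
```

Ordering the local loader first means the network is only touched when the files are not already on disk; a fuller design would also verify what `fetch` returns before handing the path back.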

Hi,

I am Kushal. I have worked on many Python projects, especially with large-scale data and image analysis. I also built a peer-to-peer file sharing network for my friend using IPFS. I previously completed a project with The FreeType Org as part of GSoC 2017. I can share more details about my previous work and experience via mail. I am very interested in building the framework for PyOpenWorm and I want to get started with this. Can you point me towards starting points that would help me understand what is there and what needs to be done?

Thank You!

Best Regards,
Kushal K S V S

Hi, Kushal. Thanks for your interest! My advice is to join us on Slack (I’ve sent you an invite via direct message). You can get started by downloading PyOpenWorm and following the README. I also recommend that interested students try their hand at one of the issues currently on our issue tracker: https://github.com/openworm/PyOpenWorm/issues