The OpenWorm project has a data management tool called PyOpenWorm that aids in creating, storing, and sharing information about C. elegans and about the evidence that supports that information. Generally, this information is backed by data from the original work, available in the form of CSV files, videos, plots, etc. Although in many cases this data is made available either alongside a published article’s supplementary materials or in a data repository, PyOpenWorm should also allow this data to be shared, acting as a primary or secondary distribution mechanism.
Aims: Ideally, the student would design and implement a solution for peer-to-peer file sharing that integrates with the existing PyOpenWorm codebase and allows for access control (so that researchers can limit access to sensitive information, for instance), identity management (to support access control), and data-integrity checking (to protect against accidental and malicious changes). For redundancy and to reduce infrastructure costs, a peer-to-peer framework is desired. The student should review file sharing protocols (e.g., ed2k, Kademlia, BitTorrent) and determine which (if any) best align with OpenWorm’s goals.
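To make the data-integrity aim a bit more concrete, here is a minimal sketch of checking a downloaded file against a known SHA-256 digest. The function names `sha256_of_file` and `verify_file` are made up for illustration and are not part of PyOpenWorm. Note that a bare digest only guards against malicious changes if the expected value itself comes from a trusted source (for example, a signed record), which is where identity management would come in.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large videos or CSV files
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file's digest matches `expected_hex` (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.lower()
```

A receiving peer could call `verify_file(downloaded_path, digest_from_metadata)` and discard the file if it returns False.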
Skills: Comfort with independent study, software development, and testing is expected. Python (2 and 3) experience is required. Familiarity with principles and practice of data management (equivalent to an undergraduate course in that topic) is recommended. Experience with peer-to-peer file sharing protocols is useful, but not required.
For the benefit of all potential applicants, here’s the baseline for what this project idea is talking about. What will distinguish students is whether they can propose a robust solution to this problem or even go beyond it.
This is a typical sequence which might require a file download:

1. The user requests a translation with pow translate. There can be other
   reasons to resolve the files associated with a data source, but this is
   currently the most well-developed use case.
2. One of the data sources for the translation is a LocalFileDataSource of
   some kind.
3. LocalFileDataSource has a relative file path which must be resolved to
   an absolute path in order for the data source to be used. This is expressed
   by the FilePathCapability need in the LocalFileDataSource definition::

       needed_capabilities = [FilePathCapability()]

   That capability is met by the DataSourceDirectoryProvider in
   PyOpenWorm/command.py.
4. Skipping over some detail, the file which the data source needs cannot be
   loaded by POWDirDataSourceDirLoader (see command.py for how this class
   fits into the story). This means that it is not included in the ‘.pow’
   directory which pow needs to function, so the file must come from some
   other DataSourceDirLoader.
All of the above works now. The new work is to handle what happens when the
file cannot be found by POWDirDataSourceDirLoader.
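As a rough illustration of that fallback, the sketch below tries a local loader first and then falls through to a loader that fetches the files from elsewhere. Everything here is hypothetical: `LocalDirLoader`, `FetchingDirLoader`, `resolve_directory`, `LoadFailed`, and the `can_load`/`load` methods are stand-ins invented for this example and do not reproduce PyOpenWorm's actual DataSourceDirLoader interface (see command.py for that). The `fetch` callable stands in for whatever peer-to-peer transport a proposal ends up choosing.

```python
import os


class LoadFailed(Exception):
    """Raised when a loader cannot provide the directory for a data source."""


class LocalDirLoader(object):
    """Hypothetical loader: looks for the data source directory under base_dir."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def can_load(self, ident):
        return os.path.isdir(os.path.join(self.base_dir, ident))

    def load(self, ident):
        return os.path.join(self.base_dir, ident)


class FetchingDirLoader(object):
    """Hypothetical loader: obtains files through `fetch` and caches them locally.

    `fetch` is an assumed callable mapping an identifier to a dict of
    {file name: bytes}, or None if no peer has the data.
    """

    def __init__(self, fetch, cache_dir):
        self.fetch = fetch
        self.cache_dir = cache_dir

    def can_load(self, ident):
        # Availability is only known after asking, so always allow an attempt.
        return True

    def load(self, ident):
        files = self.fetch(ident)
        if files is None:
            raise LoadFailed(ident)
        target = os.path.join(self.cache_dir, ident)
        if not os.path.isdir(target):
            os.makedirs(target)
        for name, content in files.items():
            with open(os.path.join(target, name), 'wb') as f:
                f.write(content)
        return target


def resolve_directory(ident, loaders):
    """Return a directory for `ident` from the first loader that succeeds."""
    for loader in loaders:
        if not loader.can_load(ident):
            continue
        try:
            return loader.load(ident)
        except LoadFailed:
            continue
    raise LoadFailed(ident)
```

With a list such as `[LocalDirLoader(base), FetchingDirLoader(fetch, cache)]`, a data source already present on disk is resolved without any network access, and only a missing one triggers a fetch. Ordering the loaders from cheapest to most expensive is a design choice of this sketch, not something the project prescribes.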
I am Kushal. I have worked on many Python projects, especially with large-scale data and image analysis. I also built a peer-to-peer file sharing network for my friend using IPFS. I previously completed a project with The FreeType Org as a part of GSoC 2017. I can share more details about my previous work and experience via mail. I am very interested in building the framework for PyOpenWorm and I want to get started with this. Can you point me towards starting points that help me understand what is there and what needs to be done?
Hi, Kushal. Thanks for your interest! My advice is to join us on Slack (I’ve sent you an invite via direct message). You can get started by downloading PyOpenWorm and following the README. I also recommend that interested students try their hand at one of the issues currently on our issue tracker: https://github.com/openworm/PyOpenWorm/issues