GSoC 2020 project idea 21: Automated comparison of scientific methods for time-series analysis

malin · January 15, 2020, 11:54am

Time series are measured and analyzed across the scientific disciplines, with new methods for their analysis being developed regularly. How can we make sense of these hundreds of methods? How can we distinguish a real advance from a new method that actually reproduces the behavior of an existing method? The need for comprehensive and systematic comparison is paramount in methodological literatures like time-series analysis, but is incredibly difficult to achieve practically. We have recently developed an online portal for comparing time-series data, CompEngine (https://www.comp-engine.org/), that allows scientists to drag-and-drop their data onto our portal to get an answer to the question: “what sorts of data from across science are similar to the data that I measure?” However, there is still no way for scientists to compare their methods to alternative methods from other disciplines. Achieving this would enable scientists to better work together across disciplinary boundaries, towards a unification of methods for time series. Such an endeavor would be transformative in facilitating the concentration of scientific effort towards meaningful interdisciplinary progress, and could become a template for similar efforts applied to other data types.

Aims: In this project, we will develop an online platform to compare time-series analysis methods. As CompEngine does for time-series data, this would allow a scientist to upload their code, compute the results on a dataset, and search a library of existing features for the most similar methods. The scientist could then be given a ‘uniqueness’ score, and be able to visualize their method in a broader scientific context.

Skills: Web development and C/Python coding.

Mentors: Ben Fulcher (ben.fulcher@sydney.edu.au) and Nick Jones (nick.jones@imperial.ac.uk).

Salmankhancodes · February 13, 2020, 5:43am

Hi, I am Salman Khan from India, currently in the 2nd year of my undergraduate degree in Computer Science. I am looking forward to contributing to INCF in GSoC’20. @malin, could you please help me understand more about this project?

ben.fulcher · February 16, 2020, 12:04am

Hi Salman,

Welcome!

We will aim to develop an interface in which:

1. The user drags code onto the portal (e.g., a Python function that takes a single input, time-series data, and outputs a single real number).
2. The server evaluates this function on a suite of X diverse time series (e.g., X = 1000: see example dataset here), so that the new method is characterised by a 1000-long feature vector representing how it behaves on diverse data.
3. This feature vector is then compared to those of the existing library of Python functions already on the site, to bring up a list of the top matches, allow the user to understand the extent to which their function is unique, and highlight (perhaps unexpected) relationships to existing functions/time-series analysis methods.
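
The second step can be sketched as follows, assuming the uploaded code is a Python function that maps a list of numbers to one real number. The names `evaluate_feature` and `mean_abs_diff` and the tiny synthetic suite are hypothetical stand-ins, not part of CompEngine. Failures are recorded as NaN so that the vector keeps exactly one entry per time series.

```python
import math
from typing import Callable, List, Sequence


def evaluate_feature(func: Callable[[Sequence[float]], float],
                     suite: Sequence[Sequence[float]]) -> List[float]:
    """Apply `func` to every time series and return one output per series."""
    outputs = []
    for series in suite:
        try:
            value = float(func(series))
        except Exception:
            # User code may fail on some inputs; mark the output as missing.
            value = math.nan
        # Treat infinities as missing too, so later comparisons stay finite.
        outputs.append(value if math.isfinite(value) else math.nan)
    return outputs


def mean_abs_diff(series: Sequence[float]) -> float:
    """Hypothetical user feature: mean absolute successive difference."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(diffs) / len(diffs)  # ZeroDivisionError for length-1 input


# Tiny synthetic stand-in for the suite of X diverse time series.
suite = [
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 2.0, 3.0, 4.0],
    [5.0],  # too short for this feature, so its output becomes NaN
]
feature_vector = evaluate_feature(mean_abs_diff, suite)
```

A real server would also need to sandbox and time-limit uploaded code, which this sketch does not attempt.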

You can see that the process is very similar to the current one, except that instead of representing each piece of data with a feature vector of its properties (as judged by a set of features), we’re now representing each function by a feature vector of its outputs (across a set of time series). In both cases, we use this feature vector as the basis of comparison. A redundant method is one that reproduces the same pattern of outputs across empirical data as a method already in our library.
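
To make the idea of redundancy concrete, here is a sketch that scores two feature vectors by the absolute Spearman rank correlation of their outputs across the same time series. The choice of Spearman correlation, the helper names, and the 0.9 threshold are assumptions for illustration, not settings taken from CompEngine or hctsa.

```python
import math
from typing import List, Sequence


def rank(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Absolute Spearman correlation over series where both outputs are finite."""
    pairs = [(a, b) for a, b in zip(u, v)
             if math.isfinite(a) and math.isfinite(b)]
    if len(pairs) < 3:
        return math.nan
    xs, ys = zip(*pairs)
    return abs(pearson(rank(xs), rank(ys)))


# Made-up outputs: `existing` is a decreasing transform of `new_feature`.
new_feature = [0.1, 0.4, 0.2, 0.9, 0.5]
existing = [-1.0, -16.0, -4.0, -81.0, -25.0]
score = similarity(new_feature, existing)
is_redundant = score > 0.9  # assumed threshold
```

The absolute value is used so that a sign-flipped or otherwise monotonically transformed copy of an existing feature still counts as a match, since it orders the time series in the same way.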

We describe some of the process in our 2013 article.

In the GSoC project, we would start simple (using the feature vectors from Empirical1000) and work up from there. The CompEngine backend is built and open source, and would ideally be leveraged as much as possible to avoid duplicate work.

LovelyBuggies · February 17, 2020, 12:12pm

Hi, @ben.fulcher. I am Shuo (Nino), a senior student from Sun Yat-sen University, PRC. I have been passionate about Neuro Stars for a long time, and after reading the suggestions and information given, I regard automated comparison of scientific methods for time-series analysis as one of the best matches for me among these projects. I saw that your required skills are Web development and C/Python coding, and I consider myself a qualified candidate for your project since I have some experience with them, e.g., Make Spare Money System, Blockchain-based Scoring System, Python projects list, etc. At this stage, I think my strengths are my mastery of Python programming and my positive attitude.
At present, my weaknesses include a lack of understanding of the methods and implementation of time-series data analysis. I have read the recommended materials, and I believe I have enough enthusiasm and motivation to overcome these challenges under your guidance.

Are there any tests I should complete before joining this project? And what could I do to prepare for it, other than continuing to read the materials given (if you think I’m qualified)?

P.S. For web development, I prefer React/Vue, but I am happy to use other frameworks or tools to contribute to this project.

I’m looking forward to your reply, and feel free to contact me if you think I’m competent! Thank you!

LovelyBuggies · February 17, 2020, 12:13pm

Other personal information:

GitHub: https://github.com/LovelyBuggies
Blog: https://www.jianshu.com/u/ad132373fc48
Resume: https://drive.google.com/file/d/1rzaFqaEPxsFoRY7E1I-yeD2PMt2_gWZ4/view

kpiyush04 · February 17, 2020, 11:30pm

Hi @malin!
I’m Piyush Kumar, a sophomore student from Galgotias University, Greater Noida, India.
I find this project very interesting. I’m currently working on Python, data analysis, and visualization, and I am looking forward to contributing to open source.
Please suggest what I can do next for this project!

ben.fulcher · February 24, 2020, 8:09am

Dear Shuo, thank you for your message. The implementation of the time-series analysis is less important for this project, as this has already been done. We will want to follow the steps above for evaluating the behavior of an algorithm on a general time-series dataset.
For the web interface, we will try to leverage the existing architecture of CompEngine, as in this repository for the frontend.
I would suggest you read this paper, and familiarize yourself with the CompEngine website. It might also help to get a feeling for the highly comparative approach to selecting data analysis algorithms—you can gloss over many of the details, but this is a good summary to get an overview.
Thanks again for your interest in this project!

ben.fulcher · February 24, 2020, 8:10am

Dear @kpiyush04, welcome and thanks for your interest! Please read my replies to the other users in this thread for additional detail and links to relevant readings.

Salmankhancodes · March 5, 2020, 8:07am

@ben.fulcher Sir, I have 2 queries:

Can we just put all the time-series datasets into one DataFrame and then analyse that one big DataFrame with the user-supplied analysis method?
What are the parameters for feature selection, and can we use the tsfresh package for feature extraction?

Thank You

ben.fulcher · March 5, 2020, 9:12am

Yes, the results of applying all features to the time-series dataset could be stored as a time series × feature matrix (e.g., in a dataframe).
We are not doing any feature selection, as we want to compare any new feature to our library of existing features.
We will not use tsfresh because hctsa has many more time-series features (>7000): https://github.com/benfulcher/hctsa
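
As a small illustration of that layout, the sketch below stores results as a matrix with one row per time series and one column per feature, then pulls out a column as one feature's vector across the dataset. The feature names, helper names, and numbers are all made up; a pandas DataFrame would serve the same role.

```python
from typing import List

# Rows are time series, columns are features (all values are made up).
feature_names: List[str] = ["mean", "std", "lag1_autocorr"]
data_matrix: List[List[float]] = [
    [0.5, 1.2, 0.9],   # time series 1
    [3.1, 0.4, 0.1],   # time series 2
    [2.0, 2.2, -0.3],  # time series 3
]


def feature_column(matrix: List[List[float]], names: List[str],
                   name: str) -> List[float]:
    """Return one feature's outputs across all time series."""
    col = names.index(name)
    return [row[col] for row in matrix]


def add_feature(matrix: List[List[float]], names: List[str],
                name: str, outputs: List[float]) -> None:
    """Append a new feature as an extra column (one output per time series)."""
    if len(outputs) != len(matrix):
        raise ValueError("need exactly one output per time series")
    names.append(name)
    for row, value in zip(matrix, outputs):
        row.append(value)


# A newly uploaded feature becomes one more column of the same matrix.
add_feature(data_matrix, feature_names, "user_feature", [7.0, 8.0, 9.0])
std_vector = feature_column(data_matrix, feature_names, "std")
```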

ram · March 8, 2020, 8:00am

@ben.fulcher Hello, I am Ram Verma from Bangalore, India, and I am interested in this project for GSoC '20.
I have gone through this project, the answers, and the resources mentioned, but some things are still confusing to me. I would like to know:
- Where is the diverse time-series data? (Could you please give me the reference link?)
- What will be the basis of feature extraction from the time series?
- Where is the feature vector library to which the result will be compared?
- What is hctsa, which is built in Matlab? Are we going to replicate that project and build it in Python?
- How is the hctsa project different from the current project?
Thank you

ben.fulcher · March 8, 2020, 8:21am

Welcome, Ram. Thanks for your message.

An example time-series dataset is linked to in the message to Salman above.
Each time-series analysis algorithm yields a real-valued output that is a feature. We will use the hctsa library as a base for comparison to any new algorithm that produces a real number summary statistic from time-series data.
You can see the hctsa output on the example dataset in the message to Salman above.
We don’t need to do any new computation because the Matlab-computed outputs on the sample dataset are complete. We only need to compute the results on the new algorithm and compare to these precomputed outputs. So no recoding of hctsa into python is required for this task.
I do describe how you can do this task of comparison manually in the hctsa documentation, but in this project we will make it easy, by making a simple web platform that gives the user diagnostics on their algorithm in comparison to a vast interdisciplinary library of alternative methods.

Thanks for your interest, and I hope this answers your questions.

ram · March 8, 2020, 10:47am

But each time series will give a single value as a result, so computing on 1000 time series will give us 1000 values. Are we going to compare all of those values to the hctsa output file?
Also, the sample dataset contains 8 files. Could you please help me understand what each file contains and what it is used for? And are these the only files we are going to use in our project?

harsht24 · March 8, 2020, 4:44pm

Hello, @ben.fulcher
I am Harsh Tamkiya, currently pursuing my bachelor’s degree in computer science in Indore, India. This project seems interesting to me. I have experience in Python, Data Analysis, Data Visualization, and Web Development. I’m looking forward to contributing to INCF GSoC 2020.
LinkedIn profile - https://www.linkedin.com/in/harsh-tamkiya-749121165/
GitHub profile - https://github.com/harsht24

ben.fulcher · March 8, 2020, 11:30pm

Dear Harsh,
Welcome to Neurostars! Glad you find the project interesting.
Ben

ben.fulcher · March 8, 2020, 11:31pm

Hi Ram,
We will use the results of the feature computation as a basis of comparing a new feature. This is described in the hctsa documentation linked to above.
Ben

ram · March 9, 2020, 7:07am

Is it our job to extract the feature vector obtained by running the code on the diverse dataset, or does the user who uploads the code have to design their algorithm so that it produces a feature vector?

Also, are we going to embed the hctsa library, or are we just going to use its output file?

Lastly, when a numeric value from the existing library matches our algorithm’s output, will there be another file that points to the specific time-series analysis method?

(I think that covers all my questions; sorry if you find me annoying.)
Thank you

ben.fulcher · March 10, 2020, 6:08am

We will use the hctsa output file as a library to match against. When the user uploads their function, we will compute its outputs across the same dataset and compare them to the hctsa features. This will be the basis of a set of visualizations for the user about their function, including a visualization of the types of features already in hctsa that are most similar to their drag-and-dropped new method.
[A computationally cheaper version would be to have the user run the computation locally to avoid server-side computation, but I think we’re ok.]
We would like to have the ability for the user to contribute their function at their discretion, just as we have a living library of time-series data in CompEngine.
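
The matching step described here could look roughly like the sketch below, which ranks library features by similarity to a new feature's outputs and reports a simple uniqueness score. It assumes absolute Pearson correlation as the similarity measure (to keep the example short) and defines uniqueness as one minus the best match. Both choices, the function names, and the toy library are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def abs_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation; NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return abs(cov) / math.sqrt(var_x * var_y)


def top_matches(new_vector: Sequence[float],
                library: Dict[str, Sequence[float]],
                k: int = 3) -> List[Tuple[str, float]]:
    """Return the k library features most similar to `new_vector`."""
    scored = []
    for name, vector in library.items():
        score = abs_pearson(new_vector, vector)
        if not math.isnan(score):  # skip features with undefined correlation
            scored.append((name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def uniqueness(matches: List[Tuple[str, float]]) -> float:
    """Assumed score: 1 minus the best similarity (1.0 if nothing matched)."""
    return 1.0 - matches[0][1] if matches else 1.0


# Toy library: each entry holds one feature's outputs on the same 4 series.
library = {
    "feature_A": [1.0, 2.0, 3.0, 4.0],
    "feature_B": [4.0, 1.0, 3.0, 2.0],
    "feature_C": [5.0, 5.0, 5.0, 5.0],  # constant, so it is skipped
}
new_vector = [2.0, 4.0, 6.0, 8.0]
matches = top_matches(new_vector, library, k=2)
score = uniqueness(matches)
```

Here `new_vector` is a rescaled copy of `feature_A`, so that feature comes back as the top match and the uniqueness score is close to zero.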

ram · March 10, 2020, 4:48pm

I now have some clarity about what we need to do, but I’m still confused about the function of each file in the 1000 Empirical Time Series dataset.

Salmankhancodes · March 10, 2020, 6:33pm

Hi Ram,
Since I have been following this project for quite some time, maybe I can help clarify your doubts.
Everything is clearly explained in the description section of 1000 Empirical Time Series, so let me summarize the function of each file:

HCTSA_Empirical1000.mat - contains the results of the computation. The same data is available for non-Matlab environments as hctsa_datamatrix.csv. This is the main file that we are going to use for comparison.

Information about the rows (time series) of hctsa_datamatrix.csv is in hctsa_time-series-info.csv, and information about the columns (features) is in hctsa_features.csv.
Information about each time series is in hcts_timeseries-data.csv.
The remaining 2 PNG files are example visualizations of the output on 1000 sample time series and on 250 time series.

The computed output of the user’s function will be compared with the hctsa_datamatrix.csv file.
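
A sketch of how the data matrix might be read in Python. It assumes, without having checked the actual file, that hctsa_datamatrix.csv is a header-less, comma-separated numeric matrix with one row per time series and one column per feature, and it treats any non-numeric cell as a missing value. The in-memory text is a made-up stand-in for the real file, and the function names are hypothetical.

```python
import csv
import io
import math
from typing import List, TextIO


def load_data_matrix(handle: TextIO) -> List[List[float]]:
    """Parse a header-less CSV of numbers into rows of floats."""
    matrix = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        parsed = []
        for cell in row:
            try:
                parsed.append(float(cell))  # float("NaN") parses to nan
            except ValueError:
                parsed.append(math.nan)  # non-numeric cell: treat as missing
        matrix.append(parsed)
    return matrix


def column(matrix: List[List[float]], index: int) -> List[float]:
    """Return one feature's outputs (one column) across all time series."""
    return [row[index] for row in matrix]


# Made-up stand-in for the file: 3 time series (rows) x 3 features (columns).
fake_csv = io.StringIO("0.1,2.0,NaN\n0.4,1.5,3.0\n0.9,0.5,2.5\n")
matrix = load_data_matrix(fake_csv)
n_series, n_features = len(matrix), len(matrix[0])
first_feature = column(matrix, 0)
```

With the real file, the same function could be called on `open("hctsa_datamatrix.csv", newline="")`, and each column could then be compared against the vector of outputs from the user's function.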

I hope this helps you in some way.

@ben.fulcher, let me know if I am wrong at any point.

Thank You,