GSoC 2023 Project Idea 11.1 Improving automatic reviewer assignment using large language models for NBDT journal (350 h)

Automatic reviewer assignment aims to build a model that makes better reviewer suggestions for an incoming abstract. Last year with GSoC, we explored a tool for automatic reviewer assignment for the Neurons, Behavior, Data analysis, and Theory (NBDT) journal. This year, we aim to improve the assignment process by exploring and fine-tuning a large language model. We will mainly focus on applications in the neuroscience domain for the NBDT journal, but the tool can be used as an open-source project for other domains.

The candidate will be tasked with creating a proper training dataset, selecting and fine-tuning suitable language models, building an open-source model for automatic reviewer assignment, and evaluating the results of the experiments.

Skill level: Intermediate/advanced

Required skills: Python, Pytorch, Huggingface, Natural language processing

Time commitment: Full-time (350 h)

Lead mentors: Daniele Marinazzo, Titipat Achakulvisut

Project website: https://nbdt.scholasticahq.com

Backup mentor: Konrad Kording

Tech keywords: Python, Pytorch, Huggingface, Natural language processing, Language model, Fine-tuning, Contrastive learning

Hello @arnab1896, I would like to contribute to this project. Could you please tell me where I should begin and guide me through the process?

Hello, @titipata, @Daniele_Marinazzo

I recently came across the Automatic Reviewer Assignment project while exploring the INCF projects for this year’s Google Summer of Code program and was really excited about the opportunity to contribute. I have some background in NLP, and I believe my skills and experience align well with this project.

Before I get started, I have a few questions that I would like to clarify.

  • Can you give me more information about the current state of the automatic reviewer assignment tool and what improvements are being made?

  • I am also interested in understanding how the fine-tuned language model will be integrated into the existing tool.

  • In addition, I would like to know more about the training dataset, including the size and source of the data.

And, if possible, could you provide some guidance on where I can start contributing to the project?

Hi @Penguin_Man, thanks for your questions. I’ll try to answer them one by one:

  • More information about the current state: we use topic modeling methods such as latent semantic analysis (LSA) together with linear programming to match papers to reviewers. Here, we represent each reviewer as the sum or average of the topic vectors computed from their papers. However, LSA can be noisy and does not represent the text well; language models are a better choice for representing titles and abstracts (see the first sketch after this list).
  • Fine-tuning a language model means taking a pre-trained model and adapting it to the specific task, for example with contrastive learning ([1806.06237] PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review) or learning-to-rank ([2210.07774] Learning To Rank Diversely) on the given data. In this case, we can craft a public review dataset to fine-tune the language model with these approaches. The learned representations can then be used to suggest reviewers in the application (see the second sketch below).
  • There are quite a lot of public review datasets, though they may not be specific to the neuroscience domain, including https://data.mendeley.com/datasets/wfsspy2gx8 and many more. Finding suitable datasets to fine-tune the model is part of the exploration during GSoC.
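
To make the first point concrete, here is a minimal sketch of an LSA-plus-assignment pipeline. Everything in it is illustrative (the toy abstracts, the variable names, and the use of the Hungarian algorithm via `linear_sum_assignment` standing in for the fuller linear program with reviewer load constraints), not the actual NBDT code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.optimize import linear_sum_assignment

# Toy data: in the real pipeline these would be submission abstracts and
# each reviewer's own publication abstracts.
submissions = [
    "neural population dynamics during motor planning",
    "deep learning models of visual cortex responses",
]
reviewer_papers = {
    "reviewer_a": [
        "latent dynamics in motor cortical populations",
        "preparatory activity and movement generation",
    ],
    "reviewer_b": [
        "convolutional networks predict responses in visual cortex",
    ],
}

# Fit TF-IDF followed by truncated SVD (i.e., LSA) on all available text.
all_text = submissions + [t for texts in reviewer_papers.values() for t in texts]
topics = TruncatedSVD(n_components=2, random_state=0).fit_transform(
    TfidfVectorizer(stop_words="english").fit_transform(all_text)
)  # use a few hundred components on real data

# Represent each reviewer as the average of their papers' topic vectors.
sub_vecs = topics[: len(submissions)]
rev_names = list(reviewer_papers)
rev_vecs, offset = [], len(submissions)
for name in rev_names:
    n = len(reviewer_papers[name])
    rev_vecs.append(topics[offset : offset + n].mean(axis=0))
    offset += n
rev_vecs = np.vstack(rev_vecs)

# Match submissions to reviewers by maximizing cosine similarity.
cost = -cosine_similarity(sub_vecs, rev_vecs)
for i, j in zip(*linear_sum_assignment(cost)):
    print(f"submission {i} -> {rev_names[j]}")
```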
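
For the second point, here is a hedged sketch of contrastive fine-tuning with the sentence-transformers library. The positive pairs are placeholders and the base model is just one reasonable choice, not a decision we have made:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pre-trained sentence encoder; a scientific-text model such
# as SPECTER would be another reasonable starting point.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each positive pair: a submitted abstract and an abstract by a reviewer
# who is a known good match for it. These strings are placeholders.
train_examples = [
    InputExample(texts=["submitted abstract ...", "matched reviewer's abstract ..."]),
    InputExample(texts=["another submission ...", "its reviewer's abstract ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other pairs in each batch as
# negatives -- a standard contrastive setup for retrieval-style matching.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# At suggestion time: embed the incoming abstract and rank reviewers by
# cosine similarity against the embeddings of their papers.
query_embedding = model.encode("new incoming abstract ...")
```

On real data, the placeholder examples would be replaced by (submission, reviewer paper) pairs curated from one of the public review datasets mentioned above.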

Hope this answers some of your questions!

Thanks, @titipata for taking the time to provide such a comprehensive and informative response.
I appreciate the information on the fine-tuning process and the sources of public review datasets that may be suitable for the project.

I will start working on my draft proposal and keep these points in mind. Is there anything else that I need to take into consideration while preparing my proposal? I would also like to know if there is any specific format that the proposal should follow. Having a clear understanding of what is expected will enable me to create a well-structured and comprehensive proposal.

Thank you again for your time and support.

Hi @Shrutik, nice to hear from you. Did you get a chance to read through @titipata’s comments on this post? Some interesting next steps are shared there that you may want to go through.

Feel free to go through and ask any queries/pointed questions you may have that the mentors can help you with or give feedback on.

Thanks

Hi @arnab1896. I want to contribute to this project. I am moderately experienced in NLP but a beginner in open source. Could you please help me find the source code of this project so I can brainstorm some ideas for this task?

Hello @Daniele_Marinazzo, @titipata, @Konrad_Kording, I hope you are well!

I am an interdisciplinary master’s student at Ashoka University (India) studying Computer Science and Psychology. Prior to this, I completed my bachelor’s in Engineering (Computer Science) and worked for around 3 years as a software engineer. However, I am still relatively new to open-source development and neuroscience research, and I am hoping to participate in this year’s GSoC as a contributor to get started with both in earnest. I have a decent amount of experience in software engineering, full-stack web development, machine learning, and NLP, and I am very keen to pivot toward language technology and computational neuroscience. In this light, I think my interests and skills align quite well with this project.

I have been following last year’s contributions to this project and was excited to learn about the shift towards LLMs, even if only as an exploratory step. Additionally, the broad applicability of this project beyond neuroscience and the paper-reviewing setting appealed to me. I have read through the detailed comments by @titipata (thanks, btw) and am going through the papers/datasets they suggested. I’ll start working on a draft proposal shortly, and I had a few questions:

  1. Apart from the suggestions that have already been made in this thread, should I be aware of any other tools/references/approaches/suggestions before I start making a draft proposal?

  2. Will this year’s project build upon last year’s developments, or will it be a fresh attempt? I am asking because, if it is the former, I could start looking into last year’s codebase as well.

  3. Please let me know if I should continue using Neurostars to communicate about this project, or if there is an email address I should use to request feedback on my draft proposal.

I am really looking forward to discussing this project further and, hopefully, to working with you!

Hello Arnab,

Sorry for replying so late. I went through the comments and got a lot of insight into the project.
I will be turning in my proposal very soon.

Regards

Hello @titipata

Hope you’re doing well. I am drafting my proposal and had a question regarding the dataset.

I looked into the previous program and found out that they have created a reviewer database.

  • I wanted to know whether we can use this data for the language model, or whether I can create my own dataset by scraping scholarly websites such as JSTOR or arXiv (see the sketch after this list for what I have in mind).

  • You also mentioned public datasets; could I use those instead, or should I implement the scraping approach above?
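
For the scraping option, here is roughly what I have in mind, just an illustrative sketch against the public arXiv API (the category query and parsing choices are my own assumptions, not anything from the existing codebase):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Query the public arXiv API for abstracts in the Neurons and Cognition
# category; the search terms and result count are just for illustration.
params = urllib.parse.urlencode({
    "search_query": "cat:q-bio.NC",
    "start": 0,
    "max_results": 5,
})
url = f"http://export.arxiv.org/api/query?{params}"
with urllib.request.urlopen(url) as resp:
    feed = resp.read()

# The API returns an Atom feed; pull out titles and abstracts.
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in ET.fromstring(feed).findall("atom:entry", ns):
    title = entry.find("atom:title", ns).text.strip()
    abstract = entry.find("atom:summary", ns).text.strip()
    print(f"{title[:60]}... ({len(abstract)} chars of abstract)")
```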

Thanks.

Hello @titipata and @arnab1896, please let me know which email address (or Slack ID) I should use to send my draft proposal for review/comments/feedback.