Mentors: Jonathan Morris <jmorris28@wisc.edu>, Yohai-Eliel Berreby <yohai-eliel.berreby@mail.mcgill.ca>, Suresh Krishna <suresh.krishna@mcgill.ca>
Skill level: Intermediate – Advanced
Required Skills: Familiarity with agentic AI workflows and the use of LLMs. Familiarity with statistical practice at a moderately advanced level is a plus. Familiarity with setting up, using, and fine-tuning open-weight LLMs is a plus. Familiarity with Slurm and working with clusters is preferred.
Time commitment: Full time (350 hours)
About: Informal use and much anecdotal evidence suggest that the most recent LLMs, accessed via agentic AI coding systems, are now very capable of exploring large datasets under human supervision and guidance. Both exploratory and confirmatory analyses appear to be possible, with results presented to the practitioner for verification. The A in AStats could stand for autonomous, augmented, automatic, applied, etc.
Aims: This is a new project that the GSoC contributor will start from scratch, with help and mentorship from us. We have had good success in the past with this approach: successful projects have gone on to second and third years of additional development, and contributors from one year have joined as mentors the following year. The project will explore and define good practices for robust workflows that incorporate agentic AI into statistical exploration and practice. Practitioners already often use recipe-driven methods (e.g. JASP, Jamovi) to guide their use of statistical tools in familiar contexts. A major focus will be on the automatic exploration of large datasets, as well as on the possibility of fine-tuning workflows or even models, and of using open-weight models to reduce cost, customize usage, and make workflows more predictable.
Project website: m2b3/AStats (GitHub)
Tech keywords: Agentic AI, Statistics, Data science, Python, PyTorch, Visual search, Saliency, Science portals, Vision AI, Vision-language models.