GSoC 2026 Project #33: University of Wisconsin-Madison - AStats: an agentic-AI approach to applied statistical practitioner workflows

Mentors: Jonathan Morris <jmorris28@wisc.edu>, Yohai-Eliel Berreby <yohai-eliel.berreby@mail.mcgill.ca>, Suresh Krishna <suresh.krishna@mcgill.ca>

Skill level: Intermediate – Advanced

Required Skills: Familiarity with agentic AI workflows and with the use of LLMs. Familiarity with statistical practice at a moderately advanced level is a plus. Familiarity with setting up and using open-weight LLMs, and with fine-tuning LLMs, is a plus. Familiarity with Slurm and working with clusters is preferred.

Time commitment: Full time (350 hours)

About: Informal use and a good deal of anecdotal evidence suggest that the most recent LLMs, accessed via agentic AI coding systems, have reached a stage where they are very capable of exploring large datasets under supervision and with human guidance. Both exploratory and confirmatory analysis appear to be possible, with results presented for verification by the practitioner. The A in AStats could stand for autonomous, augmented, automatic, applied, etc.

Aims: This is a new project that the GSoC contributor will start from scratch, with help and mentorship from us. We have had good success with this approach in the past: successful projects have gone on to second and third years of additional development, and contributors from one year have joined as mentors the following year. The project will explore and define good practices for robust workflows that incorporate agentic AI into statistical exploration and practice. Practitioners already often use recipe-driven tools (e.g. JASP, Jamovi) to guide their use of statistical methods in familiar contexts. A major focus will be the automatic exploration of large datasets, along with the possibility of fine-tuning workflows or even models, and using open-weight models to reduce cost, customize usage, and make workflows more predictable.

Project website: github.com/m2b3/AStats

Tech keywords: Agentic AI, Statistics, Data science, Python, PyTorch, Visual search, Saliency, Science portals, Vision AI, Vision-language models.


Hey Mentors!
Thank you for sharing the project details. I am very interested in this project because it focuses on agentic AI workflows and LLM systems, which are areas I am currently learning and working with.

My name is Utsav Punia, and I am currently pursuing a Master’s in Artificial Intelligence and Machine Learning at the University of Adelaide, Australia. I am also an intern at SAGE Automation, where I work on LLM-based agentic workflows using tools such as LangChain and LangGraph, building systems where LLMs select tools and execute multi-step tasks. My work there also involves MCP servers, Ignition, and elements of statistical modelling for data-driven workflows.

Although I am not yet deeply familiar with some areas mentioned in the project, such as advanced statistical workflows or SLURM, I am eager to learn, adapt, and apply them. In fact, after reading the project description I started getting familiar with SLURM, as I understand it is widely used in research environments and will likely be important for my future ML research work as well. My eagerness to learn is also reflected in my past experience, where my coursework and internship have required me to quickly adapt to different technologies and work across new technical areas.

I would be happy to discuss the project further and hear any advice on how best to shape a strong proposal and which areas I should focus on while preparing. I am highly motivated to work on this project and would be ready to dedicate my full effort to it.

-Utsav Punia
Linkedin


Hello, thank you for the interest. Slurm is not important here, but knowledge of statistical workflows is, or at least, willingness and ability to learn quickly is, since that is what the agentic workflow is trying to achieve.

Please join the AStats community at alphatest.scicommons.org.

@Utsav_Punia

Hey Suresh,
That sounds great! I’ve requested to join the AStats community and look forward to the discussions there.

During my AI/ML coursework, I have worked on several projects that involved the full data analysis pipeline, including data collection, preprocessing, exploratory data analysis, visualization, and building predictive models. I believe this experience makes me comfortable with statistical workflows, and I am eager to learn and adapt quickly to any new methods or tools required.

Hi mentors,

My name is Gaurav Singh and I’m a third-year undergraduate at IIT (BHU) Varanasi. I work at the intersection of AI/ML and scientific computing and have experience building LLM-based systems and agent-style workflows using tools such as LangGraph, along with developing ML pipelines in Python/PyTorch that involve data preprocessing, EDA, modelling, and evaluation. Some of my projects include building agentic AI systems with retrieval and multi-model orchestration, as well as pipelines that evaluate and validate LLM outputs on structured tasks. These experiences gave me exposure to data analysis workflows and automated experimentation pipelines, which I believe align with the goals of AStats. I would like to deepen my understanding of formal statistical workflows (e.g., hypothesis testing, model validation, and structured statistical analysis pipelines) and learn how to integrate them effectively with agentic AI systems. I have requested to join the AStats community and look forward to participating in the discussions there as well as contributing to the project wherever possible.
Thank you.
Best regards,
Gaurav Singh


Hi, I am Utkarsh Tyagi, an undergraduate from the Indian Institute of Information Technology Sonepat. I’ve been working with agentic AI systems and scientific Python for a while now. The idea of an agent that checks assumptions before picking a test, rather than just running whatever is requested, is something I haven’t seen done properly anywhere.
I have some background in both the stats side (scipy, statsmodels, hypothesis testing) and the LLM/agent side (LangGraph, Ollama), so I think I can contribute meaningfully here.

One thing I was curious about comes from the survey paper you cited, which states that LLM agents frequently select the wrong test for non-normal data. Is building an evaluation harness in scope for this project, or a later-phase thing? I believe it could serve as a foundation for fine-tuning down the line.

Thanks,
Utkarsh Tyagi


Please read the discussions above for what has already been discussed.

Yes, evaluating is in scope.


Hi Suresh, following up on your reply. I built a prototype this weekend based on the pseudoreplication problem in the survey you cited.

The prototype included an evaluation harness that ran 40 scenarios with known ground truth and scored the full pipeline on test-selection correctness. Running it revealed something: the biggest failure mode wasn’t assumption checking but structure detection. Agents (including GPT-4.1) treat repeated-measures data as independent, committing pseudoreplication before any test is even selected. So I added a structure inference layer that runs before assumption checking; it catches repeated-measures designs that naive tools treat as independent.

Validation on the sleepstudy dataset confirmed the improvement: the pipeline correctly routes to the Friedman test via Subject-column detection, while the naive approach inflates the effective sample size 10x.
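To make the structure-inference idea concrete, here is a minimal, simplified sketch (illustrative only, not the prototype code; column names and data are made up): if the same subject appears under more than one condition level, the rows are not independent observations.

```python
import pandas as pd

def infer_structure(df: pd.DataFrame, subject_col: str, condition_col: str) -> str:
    """Flag repeated-measures designs: if any subject appears under more
    than one condition level, the observations are not independent."""
    levels_per_subject = df.groupby(subject_col)[condition_col].nunique()
    return "repeated-measures" if (levels_per_subject > 1).any() else "independent"

# Toy data: three subjects, each measured pre and post.
toy = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "condition": ["pre", "post"] * 3,
    "score": [10, 12, 9, 11, 14, 15],
})
print(infer_structure(toy, "subject", "condition"))  # → repeated-measures
```

A check like this runs before any assumption tests, so the pipeline never even considers independent-samples tests for paired data.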

Code: [GitHub link]

I’m eager to discuss open judgment calls, particularly small-sample policy and unequal RM downgrade, where I’m genuinely unsure of the right statistical decision.


Hello Jonathan, Yohai-Eliel, and Suresh,

My name is Atta ul Asad, a final-year Computer Science student at COMSATS University Islamabad, Pakistan. My research specialisation is Natural Language Processing and sequence modeling with Transformer architectures.

I have carefully read the project description and the survey paper referenced. What excites me most about AStats is a challenge I think is fundamental but easy to underestimate: understanding what the user is actually asking before any statistical decision can be made.

In my NLP research, I have found that even small differences in phrasing carry completely different analytical intent. For example:

  • Is there a difference between the two groups? → independent two-sample test
  • Does the score change across sessions? → repeated-measures / longitudinal
  • What predicts recovery time? → regression modelling
  • Are these two variables related? → correlation

These require fundamentally different pipelines, yet a user will rarely phrase their question in a statistically precise way. Real queries are ambiguous, domain-specific, and often missing key structural information. I believe a robust natural language intent and variable extraction layer, sitting before any assumption checking or test selection, is one of the most important and open design problems in AStats.
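As a toy illustration of the mapping above (a keyword heuristic only; the layer I am actually building uses zero-shot classification, and these patterns are purely hypothetical):

```python
import re

# Toy patterns mirroring the four example queries above.
INTENT_PATTERNS = [
    (r"\bdifference between\b.*\bgroups?\b", "independent two-sample test"),
    (r"\bchange\b.*\b(?:across|over)\b.*\b(?:sessions?|time)\b", "repeated-measures / longitudinal"),
    (r"\bwhat predicts\b", "regression modelling"),
    (r"\brelated\b|\bcorrelat", "correlation"),
]

def classify_intent(query: str) -> str:
    q = query.lower()
    for pattern, intent in INTENT_PATTERNS:
        if re.search(pattern, q):
            return intent
    return "ambiguous"  # trigger a clarifying question instead of guessing

print(classify_intent("Is there a difference between the two groups?"))
# → independent two-sample test
```

The "ambiguous" fallback is the important part: real queries often match no pattern cleanly, which is exactly where the system should ask rather than guess.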

My background that directly supports this:

  • NLP research using seq2seq Transformer models (PyTorch, HuggingFace)
  • Experience with zero-shot classification, intent detection, and LLM-based pipelines

I have joined the AStats community at alphatest.scicommons.org and am currently building a small prototype focused on the NL query understanding layer, specifically intent classification and variable extraction from free-form statistical questions using zero-shot classification. I will share it on this thread shortly.

My question for the mentors: When the user’s query is ambiguous, for example, it is unclear whether the design is independent or repeated-measures, is the agent expected to ask a clarifying question, or should it make a best-guess inference from the data structure alone? I think this decision significantly shapes the architecture of the front-end layer.

Looking forward to your guidance.

Best regards,
Atta Ul Asad


@ATTA_UL_ASAD - user should be in the loop. user can turn that off if needed. as for the rest, you seem to be on the right track. we can offer comments on 1 version of your proposal. all the best !

@utkarsh_tyagi – you seem to be on the right track. you can include your early investigations in the proof of concept part of your proposal. all the best.

Hi Jonathan, Yohai-Eliel, and Suresh !

I’m Trung Duc Anh Dang, currently an MSc student in IT and Cognition at the University of Copenhagen (UCPH) and a CS Valedictorian from Hanoi University of Science and Technology (HUST). I’ve been following the AStats project with great interest; your focus on robust, agentic workflows for statistical exploration aligns perfectly with my research in LLM steering and trustworthy AI.

My Background & Skills:

  • LLMs & Agentic AI: My current research focuses on steering LLM latent spaces and RAG evaluation. I recently ranked #1 in the CLEF 2025 Multilingual Text Detoxification task, where I developed and deployed custom finetuned LLM pipelines.
  • Coding & Infra (MLOps): I am highly proficient in PyTorch and HuggingFace. I have architected production-ready MLOps pipelines using Terraform, Docker, and GCP. I am very comfortable with the Slurm/cluster environments mentioned in the project description.
  • Open Source & Learning: While I have managed personal repos for finetuning open-source models (e.g., Whisper for low-resource languages), I am eager to transition my skills into a large-scale collaborative environment like GSoC to learn best practices for open-source maintenance.
  • Stats: With a background in Computational Cognitive Science, I am committed to making statistical exploration more interpretable and less of a “black box.”

My Idea for AStats: I’ve been following the discussion and found Atta Ul Asad’s points regarding query ambiguity very insightful. However, I believe we should move beyond traditional “intent and entity” extraction.

In my experience, rigid intent/entity mapping often introduces new failure points—specifically the complex overhead of storing, versioning, and re-mapping entities as the underlying data structure or statistical plan evolves. Instead, I’d love to discuss an approach that prioritizes dynamic intent grounding:

  1. “Refinement-first” planning: Rather than a black-box extractor, the LLM should treat “intent” as a dynamic conversation. When a query is ambiguous (e.g., the independent vs. repeated-measures distinction), the system should generate a Plan (DAG) and explicitly ask clarifying questions. This bypasses the need for fragile entity-to-variable mapping by grounding the user’s intent directly into a verified statistical design.
  2. Trusted & Managed Tool Registry: A unified, versioned wrapper for libraries like SciPy/Statsmodels. This architecture allows the grounded intent to trigger reliable, reusable tools that are managed centrally and can be easily extended with new models or data sources.
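A rough mockup of what a refinement-first plan node could look like (hypothetical names throughout, a sketch of the idea rather than a proposed implementation):

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    """A node in the analysis plan: either runnable as-is, or blocked on a
    clarifying question that grounds the user's intent first."""
    name: str
    needs_clarification: bool = False
    question: str = ""

def next_action(step: PlanStep) -> tuple:
    if step.needs_clarification:
        return ("ask", step.question)
    return ("run", step.name)

# Ambiguous design: pause the plan and ask rather than guess.
design = PlanStep(
    name="choose_design",
    needs_clarification=True,
    question="Are the two measurements from the same subjects "
             "(repeated-measures) or from different groups (independent)?",
)
print(next_action(design)[0])  # → ask
```

The point is that ambiguity is represented explicitly in the plan, so the user-in-the-loop question is a first-class plan action rather than an afterthought.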

I am excited about the potential of AStats to make high-level data science more accessible and predictable. I’d love to hear your thoughts on these directions!

Best regards,
Trung Duc Anh Dang
(GitHub: ducanhdt)


@Ducanh_Dangtrung - these are all open questions empirically as to what works best and in which scenario. since this is a greenfield project, there is substantial flexibility once the project starts in terms of which directions to take it. this can be discussed by whoever is still around once the gsoc coding period starts (including selected intern(s), volunteers, etc). you are welcome to propose a path, provide proofs of concept or mockups, always explaining what you want to do, why, how, when and why you should be the person to do it.

good luck ! we look forward to seeing your proposal.


@suresh.krishna Thank you for the guidance! I completely agree that these are empirical questions. I’m excited about the flexibility of a greenfield project like AStats. I will focus my proposal on providing a clear why and how for the dynamic refinement approach, alongside a mockup/PoC to demonstrate how it handles the mapping overhead I mentioned. Looking forward to submitting the full proposal!


Hi, building on the earlier discussion around failure modes like pseudoreplication and incorrect test selection…

It seems that many of these issues originate before the actual statistical test stage, particularly in how dataset structure and assumptions are interpreted.

I wanted to understand the role of the initial data understanding (profiling) stage in this context.

If we already extract signals like normality, outliers, variance properties, and column roles, should this layer also attempt to capture data structure hints (for example, detecting repeated-measures patterns or grouping relationships), rather than leaving all structure inference to a later agent/reasoning stage?

This might help reduce early misinterpretations, such as treating dependent samples as independent, before the agent reaches test selection.

So I wanted to ask:

  • should structure inference be part of the initial data understanding layer, or

  • is it expected to remain a separate step handled later by the agent?

Would appreciate your guidance on how this boundary is being thought about.

Thanks!


You are welcome to propose a coherent framework and work within that. These are details that will make more sense once the actual work starts.

Sir, I sent the proposal for review. Please check it whenever you are free.

Hello,

I’m Roaa, a 4th year AI/Data Science student at Zewail City. I’ve been building a prototype for the AStats project over the past week and wanted to share progress.

Repo: github.com/Roaa-838/INCF-AStats-cli-prototype/

What I’ve built:

I started with the assumption-checking layer (Shapiro-Wilk for normality, Levene’s test for variance homogeneity) and built a decision tree that routes to the appropriate test based on which assumptions hold. Then I implemented the test execution layer: 9 SciPy tests with proper effect size calculations (Cohen’s d, rank-biserial r, eta-squared, etc.).
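For anyone who hasn’t opened the repo, the routing idea is roughly the following (a simplified sketch of the approach, not the actual code from the repo):

```python
import numpy as np
from scipy import stats

def route_two_sample(a, b, alpha=0.05):
    """Check assumptions, then route: non-normal data -> Mann-Whitney U;
    normal data -> t-test (Welch's variant if Levene rejects equal variances)."""
    normal = (stats.shapiro(a).pvalue > alpha and
              stats.shapiro(b).pvalue > alpha)
    if not normal:
        return "mann-whitney", stats.mannwhitneyu(a, b)
    equal_var = stats.levene(a, b).pvalue > alpha
    return "t-test", stats.ttest_ind(a, b, equal_var=equal_var)

rng = np.random.default_rng(0)
name, result = route_two_sample(rng.normal(0, 1, 30), rng.normal(0.5, 1, 30))
print(name, round(result.pvalue, 4))
```

The routing itself is deterministic and rule-based; the LLM layer sits around it, parsing the query upstream and interpreting the result downstream.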

The eval harness shows 8/8 on synthetic benchmarks, and I tested it on the Iris dataset to verify it works on real data.

Design approach:

The pipeline uses deterministic routing (rule-based decision tree) because statistical assumptions are math, not opinions. But it’s designed to integrate with an LLM layer for parsing natural language queries upstream and generating domain-specific interpretations downstream.

Questions as I continue:

  1. Should I focus next on more real dataset validation (neuroscience, clinical trials), or on the LLM integration layer?

  2. For the R backend - is rpy2 the preferred approach, or subprocess for cluster environments?

Thanks for the project!


Please send me a DM with a link to a Google Doc here. @ATTA_UL_ASAD