Sooo, this is an inherently tricky thing for multiple reasons.
The traditional way to prepare for clusterwise correction is to estimate the spatial extent of correlations in the FMRI noise. In task FMRI, the “noise” signal is the residuals from modeling (in AFNI, often called “errts” = “error time series”). In olden times, one modeled the spatial autocorrelation of the noise as Gaussian; in modern times, we fit the “ACF” (= autocorrelation function) parameters, which is done using 3dFWHMx (or, even easier, using -regress_est_blur_errts as an afni_proc.py option; hopefully you are using afni_proc.py to set up your single subject processing!). Once you have the ACF parameters for each subject, you can typically average these across the group (for a given site/acquisition protocol, they tend to be quite similar across subjects). Then take the group mask and use 3dClustSim to estimate the size of clusters in simulated noise with those spatial characteristics. For your desired sidedness of testing (see Chen et al., 2018!), voxelwise p-value threshold and FPR/alpha level, you can see what cluster size the noise-only simulations produced, which becomes your minimum cluster size for your task data. There are still subtleties to this (for example, the residuals are not pure noise, because they contain structure from our inability to model the signal perfectly; but such is life).
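As a rough illustration of the averaging step, here is a minimal Python sketch. It assumes you have already collected each subject's three ACF parameters (a, b, c) from 3dFWHMx or the afni_proc.py blur estimates; the numbers, the mask name and the helper function names below are all made up for the example. The model function is the mixed Gaussian plus mono-exponential form described in the 3dFWHMx help. The last helper just assembles a 3dClustSim command string with the -mask, -acf, -pthr, -athr and -prefix options; check the 3dClustSim help for the options that suit your own analysis.

```python
import math
from statistics import mean


def average_acf(params):
    """Average per-subject (a, b, c) ACF parameters across a group."""
    a, b, c = (mean(col) for col in zip(*params))
    return a, b, c


def acf_model(r, a, b, c):
    """Mixed ACF model at distance r (mm): Gaussian part plus mono-exponential part."""
    return a * math.exp(-r * r / (2.0 * b * b)) + (1.0 - a) * math.exp(-r / c)


def clustsim_command(acf, mask, pthr=0.001, athr=0.05, prefix="ClustSim"):
    """Build a 3dClustSim command line as a string (nothing is executed here)."""
    a, b, c = acf
    return (
        f"3dClustSim -mask {mask} -acf {a:.4f} {b:.4f} {c:.4f} "
        f"-pthr {pthr} -athr {athr} -prefix {prefix}"
    )


# Hypothetical per-subject ACF parameters (a, b, c); replace with your own.
subj_acf = [(0.70, 3.1, 11.0), (0.66, 3.3, 12.0), (0.68, 3.2, 11.5)]

group_acf = average_acf(subj_acf)
# Hypothetical group mask file name.
print(clustsim_command(group_acf, "group_mask.nii.gz"))
```

The printed command can be pasted into a shell script. The tables that 3dClustSim writes then let you look up the cluster size for your chosen sidedness, voxelwise p-value and alpha level.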
Now, some subtlety comes in when you have non-task FMRI data, such as resting state or naturalistic scans: your output time series of interest after the modeling/regression stage is your residual time series! So, we are in the odd situation of not having separate “noise” and “signal” estimates. What do we do about clustering? Well, we actually default to the above paradigm, running the same programs on the same residuals to estimate the cluster size of “residual-only” data for the group. This is somewhat rooted in practicality and in the empirical fact that the spatial estimates of structure in the residuals of resting/naturalistic data are quite similar to those of task data (likely due to our continued inability to make detailed FMRI models; sigh). Anyways, this should still provide a pretty good estimate of the spatial extent of noise-only (or “uncontrolled”) structure in the time series. If anything, it may overestimate that extent, because having real structure in there would tend to bump up the apparent size of noise-only clusters, making the cluster size thresholds more conservative.
For your ISC data, where your actual analysis is on the paired correlation maps, the above seems like a reasonable way to approach clustering as well. You are basically trying to set a cluster size threshold to answer the question: how big should a cluster be for it to be unlikely to arise from chance/noise alone? Looking at the spatial extent of “noise-only” structure in your acquired time series seems a reasonable way to approach that. This is helped by the practical considerations noted above, including the fact that subjects in the same protocol tend to have similar spatial-extent-of-residual-structure characteristics.
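To make the "how big should a cluster be" question concrete, here is a toy Python sketch of applying a minimum cluster size. This is not AFNI code. It assumes the suprathreshold voxels are given as a set of (i, j, k) indices, and that voxels belong to the same cluster when they share a face (analogous to the NN=1 neighborhood). The function names and the example voxels are hypothetical.

```python
from collections import deque

# Offsets to the six face-sharing neighbors of a voxel.
FACE_NEIGHBORS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def find_clusters(voxels):
    """Group suprathreshold voxel indices into face-connected clusters."""
    remaining = set(voxels)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster = {seed}
        queue = deque([seed])
        # Breadth-first search outward from the seed voxel.
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in FACE_NEIGHBORS:
                nb = (i + di, j + dj, k + dk)
                if nb in remaining:
                    remaining.remove(nb)
                    cluster.add(nb)
                    queue.append(nb)
        clusters.append(cluster)
    return clusters


def surviving_clusters(voxels, min_size):
    """Keep only clusters with at least min_size voxels."""
    return [c for c in find_clusters(voxels) if len(c) >= min_size]


# Hypothetical suprathreshold voxels: one 3-voxel cluster and one isolated voxel.
suprathreshold = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)}
kept = surviving_clusters(suprathreshold, min_size=3)
```

In a real analysis, the minimum size would come from the 3dClustSim tables, and AFNI's own clustering tools would do this step on the actual volumes.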
For some explicit code related to these things, you might want to check out these pages:
AFNI code for Taylor et al., 2018
AFNI code for Chen et al., 2018