Consider the statistical challenges around Dysonian SETI. The standard approach (for searches within the Milky Way) is to find stars that are "outliers", e.g. those that are significantly dimmer than their spectrophotometric characteristics and parallax would suggest. Statistical interpretation of the outliers of a single metric's distribution is not straightforward, however.

The normal (Gaussian) distribution is less common in nature than its name suggests. Real-world distributions are often skewed and long-tailed, so outliers (or "black swans") are to be expected. It's therefore not clear which selection threshold should be used, or what it really means for a star to be an outlier.

That said, what if we had a second anomaly metric that is nominally independent of the first, so that the two would only be correlated if something like ETI linked them? Then we could use a probabilistic approach: look at the intersections of the anomalous subsets obtained with each metric, and determine whether those intersections are bigger than would be expected by chance.
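
As a rough illustration of the intersection idea, here is a minimal sketch (not the code from my kernels). It assumes the simplest possible null hypothesis: the second anomalous subset is a random draw that ignores the first, so the size of the overlap follows a hypergeometric distribution. The function names and the example counts are made up for illustration.

```python
from math import comb

def expected_overlap(n_total, n_a, n_b):
    """Mean overlap size if subsets A and B are unrelated."""
    return n_a * n_b / n_total

def overlap_p_value(n_total, n_a, n_b, n_both):
    """P(overlap >= n_both) under the hypergeometric null.

    n_total: number of stars in the catalog
    n_a:     stars flagged by the first metric
    n_b:     stars flagged by the second metric
    n_both:  stars flagged by both
    """
    upper = min(n_a, n_b)
    # Count the ways to pick B so that exactly k of its members fall in A,
    # summed over every k at least as large as the observed overlap.
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_b)

# Hypothetical counts, purely for illustration.
print(expected_overlap(10_000, 200, 300))
print(overlap_p_value(10_000, 200, 300, 20))
```

Exact integer arithmetic keeps this simple, but it gets slow for very large subsets; an approximation or a statistics library would be a reasonable substitute there.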

An interesting approximation of such a "second metric" can be obtained by assuming that multi-stellar extraterrestrial civilizations exist and that there's spatial clustering in their colonization patterns. The hypothesis is not far-fetched: if a civilization is able to build detectable astroengineering, interstellar travel (and colonization) shouldn't be far beyond its capabilities. Additionally, spatial clustering should be expected for the same reasons that human populations cluster on Earth.

I've written a number of Kaggle kernels (i.e. Jupyter notebooks) based on these ideas:

  • New Stellar Magnitude Model.
    This kernel trains a model of stellar magnitude and produces an anomaly metric for each of 221K stars. Model RMSE is ~0.10 magnitudes. This is very good. A 20% difference in flux is roughly 0.2 magnitudes, or about twice the RMSE, so a star that is 20% dimmer (or brighter) than expected could begin to be considered anomalous.
  • Multi-Stellar SETI Candidate Selection, Part 1.
    This kernel looks at the K nearest neighbors of each star, and calculates the skewness of their anomaly distribution. Skewness measures, more or less, whether a distribution has a right tail that is longer than its left tail. This skewness metric is nominally independent of the star's own anomaly, and it shouldn't depend on regional magnitude shifts or error, since skewness doesn't change when a constant is added to every value.
  • Part 2.
    This kernel looks at unusually dim stars that occur in regions that heavily skew right, and produces a list of candidates.
  • Part 3.
    This kernel is essentially a fork of Part 2, except it reverses metric ordering, such that it looks for unusually bright stars that occur in regions that heavily skew left. Its results are more significant than those of Part 2.
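
To make the regional skewness metric from Part 1 concrete, here is a minimal sketch under a few assumptions of my choosing for this example: positions are Cartesian (x, y, z) tuples, the anomaly is observed minus predicted magnitude (so positive means dimmer), the star itself is left out of its own neighborhood, and skewness is the plain moment coefficient. It uses a brute-force neighbor search, so it is only meant for small samples; the kernels may do this differently.

```python
import math

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (0.0 if no spread)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5

def neighborhood_skewness(positions, anomalies, k):
    """For each star, skewness of the anomalies of its k nearest neighbors.

    The star itself is excluded, so its own anomaly does not feed into
    the regional metric it is later compared against.
    """
    result = []
    for i, p in enumerate(positions):
        # Brute force: distance from star i to every other star.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(positions) if j != i
        )
        neighbors = [anomalies[j] for _, j in dists[:k]]
        result.append(skewness(neighbors))
    return result

# Tiny made-up example: five stars on a line, one of them much dimmer.
positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
anomalies = [0.0, 0.0, 0.0, 0.0, 1.0]
print(neighborhood_skewness(positions, anomalies, k=4))
```

For a catalog with hundreds of thousands of stars, a spatial index such as a k-d tree would be needed in place of the brute-force search.
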

If the variation in regional skewness were nothing but random noise, the candidate sets shouldn't be much bigger than you would expect by chance. But they do appear to be significantly bigger across a wide range of thresholds. So there's something to the observed regional skewness, regardless of cause.

My expectation is that candidates obtained in this manner are less likely to be spurious than those obtained in the usual way.