There is a feature of Kaggle that I had not tried previously: kernels. A kernel is a script with markdown (e.g. a Jupyter notebook) hosted by Kaggle. I decided to post an analysis as a kernel:
Dysonian SETI With Machine Learning
I think this is a great way to present statistical analyses. All of the code is there in the open, with markdown explaining what it does. Any Kaggler can clone a kernel, make modifications and re-run it. A kernel has an output tab, where readers can download analysis results. The code is run in the Kaggle (i.e. Google) cloud, so the reader doesn't need Jupyter installed. Kernels have interesting capabilities. For example, I have a couple of interactive 3D scatter charts in my kernel.
What the kernel is about
There was a paper a few months back (Zackrisson et al. 2018) that proposed a methodology for the detection of stars that host artificial megastructures big enough to produce noticeable dimming (e.g. Dyson Swarms).
Zackrisson et al. (2018) singled out one star: TYC 6111-1374-1. This is unrelated to the kernel, but there is an interesting observation I can make about TYC 6111-1374-1. If you look at a periodogram of its DASCH light curve, the second-best periodicity peak is at ~4.31 years, which is nearly an exact match for a well-known claim (Sacco et al. 2017) of a 1574-day (~4.31-year) periodicity in KIC 8462852, the only other star within the purview of Dysonian SETI at the time. The peak is not the main peak and it fails to reach significance, which is why I don't have a technical write-up about it. I did, of course, communicate this curious find to Gary Sacco privately.
That was enough to pique my interest, nonetheless. Zackrisson et al. (2018) was based on Gaia Data Release 1 (DR1), and DR2 came out at around the same time as the paper. Additionally, it occurred to me that state-of-the-art machine learning techniques – what Kaggle competitors rely on – could result in more precise detection, and therefore a lower likelihood of producing spurious candidates. And they do.
As it turns out, TYC 6111-1374-1 is not anomalous. After controlling for modeled error and so forth, the machine learning model indicates that, while the star is dimmer than expected, its anomaly metric doesn't even reach 1-sigma. The main reason is that its Gaia DR2 parallax, which should be more accurate, changed substantially relative to DR1.
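To see why a parallax revision alone can move a star in or out of "anomalously dim" territory, here is a minimal sketch. It uses the standard distance-modulus relation and ignores extinction. The `anomaly_sigma` function, the predicted magnitude and every number below are made up for illustration; they are not the kernel's actual metric or the real values for TYC 6111-1374-1.

```python
import math


def absolute_magnitude(apparent_mag, parallax_mas):
    """Absolute magnitude from apparent magnitude and parallax (milliarcseconds).

    Uses distance_pc = 1000 / parallax_mas and ignores extinction.
    """
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive for this simple inversion")
    return apparent_mag + 5.0 * math.log10(parallax_mas) - 10.0


def anomaly_sigma(observed_abs_mag, predicted_abs_mag, sigma_mag):
    """Hypothetical anomaly metric; positive means dimmer than predicted."""
    return (observed_abs_mag - predicted_abs_mag) / sigma_mag


# Made-up values for illustration only.
apparent_mag = 10.0
predicted_abs_mag = 4.5  # what some model expects for this star
sigma_mag = 0.4          # assumed total uncertainty, in magnitudes

for label, parallax_mas in [("older parallax", 12.0), ("revised parallax", 8.5)]:
    m_abs = absolute_magnitude(apparent_mag, parallax_mas)
    score = anomaly_sigma(m_abs, predicted_abs_mag, sigma_mag)
    print(f"{label}: M = {m_abs:.3f}, anomaly = {score:.2f} sigma")
```

A smaller parallax puts the star farther away, so the same apparent magnitude corresponds to a more luminous star, and the gap between observed and predicted brightness shrinks.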
The machine learning model did produce an interesting result on stars with an anomaly metric of at least 3-sigma: they cluster more than other stars. In the kernel I compared the group of anomalously dim candidates with an equal-sized group of anomalously bright controls. The shortest distances between stars in the two groups are more often those between two anomalously dim stars.
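One way to run that kind of comparison, as a rough sketch rather than the kernel's actual code: pool both groups, rank all pairwise distances, and count which kinds of pairs dominate the shortest ones. The function name `shortest_pair_types`, the Cartesian coordinates and the toy positions are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def shortest_pair_types(dim_xyz, bright_xyz, k):
    """Classify the k shortest pairwise distances in the pooled sample.

    dim_xyz and bright_xyz are lists of (x, y, z) positions in the same units.
    """
    pooled = [(p, "dim") for p in dim_xyz] + [(p, "bright") for p in bright_xyz]
    # Brute-force O(n^2) ranking of pairs; fine for small samples.
    pairs = sorted(
        combinations(pooled, 2),
        key=lambda pair: math.dist(pair[0][0], pair[1][0]),
    )
    counts = Counter({"dim-dim": 0, "bright-bright": 0, "mixed": 0})
    for (_, label_a), (_, label_b) in pairs[:k]:
        if label_a == label_b:
            counts[f"{label_a}-{label_b}"] += 1
        else:
            counts["mixed"] += 1
    return counts


# Toy positions: three dim stars close together, three bright stars spread out.
dim_stars = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
bright_stars = [(10.0, 0.0, 0.0), (-10.0, 0.0, 0.0), (0.0, 0.0, 10.0)]

print(shortest_pair_types(dim_stars, bright_stars, k=3))
```

If the two groups were distributed alike, the shortest pairs would be expected to split roughly 1:1:2 between dim-dim, bright-bright and mixed for equal-sized groups; an excess of dim-dim pairs is what suggests clustering. A real analysis would also need a significance estimate, for example by shuffling the labels.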
That's the analysis I presented in the kernel, as it's simple enough and runs quickly, but there are other ways to analyze the data. For example, if you split the anomalously dim candidates, it can be shown that those on the main sequence or below (in an H-R diagram) are separable in space from those that are above the main sequence. (Yes, the model does determine that some giants are anomalously dim.)
Of course, the clustering might be explainable in some way that's currently not clear. There could be data systematics/bias or a model artifact. It could be that these machine learning models are unable to fully adapt to ISM irregularities. We could also be dealing with a star type that exhibits clustering. White dwarfs, for example, have been found to occur in clusters.
Sacco et al. (2017). A 1574-day periodicity of transits orbiting KIC 8462852. arXiv:1710.01081
Zackrisson et al. (2018). SETI with Gaia: The observational signatures of nearly complete Dyson spheres. arXiv:1804.08351