Over the last few UMAP blogs, I have been looking at a case of outlying UMAP clusters to see whether the data is a specific population of its own, simply debris, or something else.
I back-gated the outlying clusters on the singlet and FSC/SSC plots and found that these events are spread quite randomly across the data. This leads me to think the cause may be non-specific staining, but there are a few more ideas I want to consider first.
Are these outliers sample specific?
Do these events occur in all samples or in specific ones within the 10 that were merged?
For this I look at the file scatter histogram.
The red histogram plot on the left shows the event counts across the ten data files that were merged; this plot is gated on the singlet gate A (below). The histogram plot on the right shows the events from the outlying cluster gate B (shown below). As we can see, the majority of these events come from just three of the ten merged data files.
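The per-file comparison above can also be done numerically. Below is a minimal sketch, assuming the merged events have been exported as two parallel lists: a file index per event and a flag saying whether the event falls inside gate B. The function name, variable names and toy values are hypothetical and are not taken from VenturiOne or CytoSwarm.

```python
from collections import Counter


def outlier_fraction_per_file(file_ids, in_gate_b):
    """Return {file_id: fraction of that file's events that fall inside gate B}."""
    totals = Counter(file_ids)
    outliers = Counter(f for f, hit in zip(file_ids, in_gate_b) if hit)
    # Counter returns 0 for files that have no events in the gate.
    return {f: outliers[f] / totals[f] for f in sorted(totals)}


# Toy data (invented for illustration): three files with four events each.
file_ids = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
in_gate_b = [False, False, False, False,
             True, True, False, False,
             False, True, False, False]

fractions = outlier_fraction_per_file(file_ids, in_gate_b)
for file_id, frac in fractions.items():
    print(f"file {file_id}: {frac:.0%} of events in gate B")
```

Reporting a fraction rather than a raw count is a design choice here: if the merged files differ a lot in size, a large file could dominate the gate B histogram simply because it has more events.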
Next, using the cluster gate B and the file scatter parameter, I look at a common antibody (CD3) to determine whether the three different outlying clusters appear together or separately (below).
From the above plot we can see that the different outlying clusters do show separately, at different intensities of CD3 staining. The plot below shows the same but with the rest of the data overlaid (grey). On this plot the red and orange clusters show higher stain intensity than the rest of the data, which for me backs up my non-specific antibody binding theory.
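One way to put a number on the overlay comparison is to compare the median CD3 intensity of each outlying cluster with that of the remaining events. This is a minimal sketch only: the intensity values and the cluster labels below are invented toy data, and `median_by_group` is a hypothetical helper, not a VenturiOne or CytoSwarm function.

```python
from statistics import median


def median_by_group(values, labels):
    """Group intensity values by cluster label and return the median of each group."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: median(vals) for label, vals in groups.items()}


# Toy CD3 intensities (invented): "rest" is the bulk of the data,
# "orange" and "red" stand in for two of the outlying clusters.
cd3 = [120, 150, 135, 900, 950, 1000, 2100, 2000, 2200]
labels = ["rest"] * 3 + ["orange"] * 3 + ["red"] * 3

medians = median_by_group(cd3, labels)
for label, value in medians.items():
    print(f"{label}: median CD3 = {value}")
```

The median is used rather than the mean because fluorescence intensity distributions are typically skewed, so a handful of very bright events would otherwise pull the summary upwards.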
What do you think about UMAP?
There is another possibility I want to check: the UMAP algorithm in use. The CytoSwarm software uses a UMAP algorithm from GitHub, and looking on the site, it seems that a few other people have seen outlying data when using a UMAP algorithm (some example links below):
This is not the specific algorithm that CytoSwarm uses (jlmelville/uwot), but a few cases have been reported across various UMAP algorithms, so this could in theory be a reason for the outlying events above.
So with these theories in mind, I have a few ways I could resolve this…
Within VenturiOne I can use the zoom function to zoom in on the relevant data in the UMAP plot and put a gate around it, effectively gating out the outlying cluster data (see below).
I could save the above merged FCS file excluding the outlying events, then re-run the multidimensional analysis on the gated data.
I could use the information from the file scatter plot to identify the files which contain the outlying cluster data, and re-run the multidimensional analysis excluding these files.
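The last two options differ in what gets thrown away before the re-run, which a small sketch can make concrete. It assumes each event is a record carrying the index of the file it came from; the field name `file_id`, both helper functions and the toy data are hypothetical and only illustrate the filtering step, not the re-run itself.

```python
def drop_gated_events(events, in_outlier_gate):
    """Keep only the events that are NOT inside the outlying cluster gate."""
    return [e for e, hit in zip(events, in_outlier_gate) if not hit]


def drop_files(events, excluded_files):
    """Keep only the events whose source file is not in excluded_files."""
    return [e for e in events if e["file_id"] not in excluded_files]


# Toy merged data (invented): six events from three files, two of them outliers.
events = [
    {"file_id": 1, "cd3": 120},
    {"file_id": 1, "cd3": 150},
    {"file_id": 2, "cd3": 950},   # outlier
    {"file_id": 2, "cd3": 140},
    {"file_id": 3, "cd3": 2100},  # outlier
    {"file_id": 3, "cd3": 130},
]
in_outlier_gate = [False, False, True, False, True, False]

kept_by_gate = drop_gated_events(events, in_outlier_gate)
kept_by_file = drop_files(events, excluded_files={2, 3})

print(len(kept_by_gate), "events kept when gating out the outliers")
print(len(kept_by_file), "events kept when excluding whole files")
```

Gating out the outliers keeps the well-behaved events from the affected files, while excluding whole files also discards those events. Which loss is acceptable depends on whether the affected samples are still trusted.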
What would you choose?