Basic setup

First and foremost, you’re going to need some data, and if it’s not too much trouble, also know where it lives on your hard drive.

folder_path <- "~/testimg.raw/"
idx_path <- paste0(folder_path, "_FUNC001.IDX")
file_path <- paste0(folder_path, "_FUNC001.DAT")

When you go through the process of calibrating an instrument before acquisition, this has absolutely no effect on the data stored in the .DAT file. Values are stored as-is and recalibrated in software post-acquisition. The coefficients of the calibration function are stored in plaintext in the _HEADER.TXT file

coefs <- getCoefs(paste0(folder_path, "_HEADER.TXT"))

The .IDX metadata stores the memory addresses of the start of each scan in the associated .DAT file. This helps speed up reading the scans, but also means if there’s a Windows copy error for your 1 megabyte file then your 25 gigabyte file becomes unusable. If you have no .IDX file, you can reproduce one with the minimal amount of data to access scans using https://github.com/drobertsicl/ubmf

adds <- getAdds(idx_path)

The .IDX metadata also contains per-scan TIC values, number of data points per pixel, base peak, and base peak intensity values. These can come in handy and take seconds to read.

scan_meta <- getMeta(idx_path, coefs)

Estimate the image dimensions, if using a path with a ~ then use path.expand.

numScans <- estDims(path.expand(folder_path))
numScans
#> [1] 172 168

In some cases, the function will report that a wacky dimension has been returned. Often this is simply that one dimension is calculated to be 10 times larger than it’s supposed to be, and you can obviously just alter that number. In more challenging cases, it may not be obvious at all what your dimensions are supposed to be.

As long as more/less scans than intended have not been acquired (which is actually frequently the case) you can also calculate all theoretical image dimensions as factor pairs of the number of scans.

dimslist <- estDims2(nrow(scan_meta))
plotlist <- plotDims(dimslist, as.matrix(log(scan_meta$tic)))

You can then inspect each individual plot to see which dimension is correct. In this case it’s the very last one, so we’ll cut to the chase.

plotlist[[46]]

However, if your run was terminated early, or the instrument decided you absolutely had to have an extra pixel somewhere, or the ghost from Ghost did a pottery lesson with your stage motor, you can try the following.

plot(scan_meta$scan_time)

plot(scan_meta$scan_time[1:500])

scan_n <- 50
if(length(unique(abs(diff(timeTest(scan_meta$scan_time[1:scan_n], scan_meta$scan_time))))) >1){
  print("More than one disjunction size detected, data may have variable length rasters or be incomplete")
}
cat(paste0(c(" Suggested x dimension: ", abs(diff(timeTest(scan_meta$scan_time[1:scan_n], scan_meta$scan_time)))[1], "\n", "Median no. of scans per raster: ", median(abs(diff(timeTest(scan_meta$scan_time[1:scan_n], scan_meta$scan_time)))), "\n", "Suggested y dimension: ", nrow(scan_meta) / abs(diff(timeTest(scan_meta$scan_time[1:scan_n], scan_meta$scan_time)))[1])))
#>  Suggested x dimension:  172 
#>  Median no. of scans per raster:  172 
#>  Suggested y dimension:  168

Very, very occasionally, there are images which HDImaging appears to do some kind of under the hood transformation on. If you have a weird, warped image, this might help. This interfaces to maldichrom.exe exactly the way HDImaging does, processes one peak per pixel, and extracts the x/y dimension information. It is therefore very slow and stupid.

You MAY have to be running RStudio with admin privileges, because randomly it will be unable to generate temporary files. This may be an R version specific issue and I’m not going to spend time investigating it. The cell below will generate a massive number of output messages in your console and hopefully not in this vignette, so try not to have anything useful in there before you run it

Plot a TIC image to make sure you’re not about to do something crazy and spend half an hour processing the wrong data

cimgplot <- matrix(data = 0, nrow = numScans[1], ncol = numScans[2])
cimgplot <- scan_meta$tic[1:prod(numScans)]
cimgplot <- imager::as.cimg(cimgplot, x = numScans[1], y = numScans[2])

p <- ggplot(data = as.data.frame(cimgplot)) +
  geom_raster(aes(x = x, y = rev(y), fill = log(value))) +
  scale_fill_viridis_c(option = "magma", name = "Intensity (A.U)") + coord_fixed() +
  scale_x_continuous(expand=c(0,0))+scale_y_continuous(expand=c(0,0)) + xlab(NULL) + ylab(NULL) + ggtitle("log TIC Image") +
  theme(plot.title = element_text(hjust = 0.5))
p

Now read in your data. If you don’t use multiple threads this will be extremely slow to the point of not being worth it.

plan(multisession, workers = 8)
handlers("txtprogressbar")
start <- Sys.time()
with_progress({
peaklist_sewage <- readScans(file_path, adds = adds, batch_size = 250, calib_coefs = coefs)
})
stop <- Sys.time()
stop-start
#> Time difference of 36.59566 secs

For comparison, to read this data into R any other way, you’d need to convert it via proteowizard, then read that intermediary file in via mzR

start <- Sys.time()
msconexe <- "C:/Users/eatmo/AppData/Local/Apps/ProteoWizard 3.0.24094.d2966db 64-bit/msconvert.exe"
convertR(path.expand(folder_path), path.expand(folder_path), msconexe)
peaklist <- mzR::openMSfile(paste0(path.expand(folder_path), "testimg.mzML"))
peaklist <- mzR::peaks(peaklist)
stop <- Sys.time()
stop-start
#> Time difference of 1.768819 mins

Me personally, I prefer 30 seconds. But is the data actually correct?

If the following evaluates as FALSE, all the intensities match

FALSE %in% c(unlist(sapply(peaklist_sewage, "[[", 2)) == unlist(sapply(peaklist, "[", , 2)))
#> [1] FALSE

If the following evaluates as false, all the m/z values match

FALSE %in% c(unlist(sapply(peaklist_sewage, "[[", 1)) == unlist(sapply(peaklist, "[", , 1)))
#> [1] TRUE

And you might think “way to go dumbass”, but let’s look at how this difference manifests in a single scan

plot(x = peaklist_sewage[[1000]][["mz"]], y = abs(peaklist_sewage[[1000]][["mz"]]) - peaklist[[1000]][,1], ylab = "Deviation from proteowizard m/z value", xlab = "m/z")

As the values are stored based on exponentiation, we see this error get magnified across different ranges as the exponent changes. As R has no native support for 32 bit floats, all calculations are performed to 64 bit precision. Truncating this to 32 bits results in:

library(float)
#> Warning: package 'float' was built under R version 4.4.3
FALSE %in% c(as.float(unlist(sapply(peaklist_sewage, "[[", 1))) == unlist(sapply(peaklist, "[", , 1)))
#> [1] FALSE

A perfect match, but not necessarily the speediest.

Memory size

MSI data is typically obscenely large. In this extremely small case it’s only about 1 gigabyte

file.size(file_path) / (1024^2)
#> [1] 1039.63

Due to the 64 bits of precision however, the size in RAM is larger than the size on disk

as.numeric(object.size(peaklist_sewage)) / 1024^2
#> [1] 2092.046

If all m/z values are converted to 32 bits:

for(i in 1:length(peaklist_sewage)){
  peaklist_sewage[[i]][["mz"]] <- as.float(peaklist_sewage[[i]][["mz"]])
}
as.numeric(object.size(peaklist_sewage)) / 1024^2
#> [1] 1589.262

In this instance, all intensities are stored as integers. This isn’t guaranteed as the format can store non-integer values, so there’s no code yet to force intensities to integers, but that depends on surveying more data.

for(i in 1:length(peaklist_sewage)){
  peaklist_sewage[[i]][["intensity"]] <- as.integer(peaklist_sewage[[i]][["intensity"]])
}
as.numeric(object.size(peaklist_sewage)) / 1024^2
#> [1] 1069.502

And now the size in memory is more or less the same size as on disk.

What are the practical benefits of any of this?

First and foremost, this is architecture and OS independent. Secondly, there’s no need for duplication of data into an intermediary format. Thirdly, it is at best significantly faster, and at worst the same as current approaches. Fourthly, you’re going to analyse it in an interpreted language anyway. Fifthly, say you have an absurdly large whole-slide image and you don’t want to spend a weekend converting and peak picking it, we can do something about that.

In simple cases, you may be able to simply cluster the TIC value and successfully segment your image.

tree <- genieclust::gclust(log(scan_meta$tic), k = 5)
cut <- cutree(tree, k = 5)
cimgplot <- matrix(data = NA, nrow = numScans[1], ncol = numScans[2])
for(i in 1:length(scan_meta$tic)){
cimgplot[i] <- cut[i]
}
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- imager::mirror(cimgplot, "y")
# tinker with rotations
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = y, fill = factor(value))) +
scale_fill_manual(values = c("#E69F00", "#000000", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#bcbcbc", "#a27500", "#5083ab", "#2a7353", "#ada82e", "#265283", "#944503", "#90587a")) + coord_fixed()
p

In this case, it’s ALMOST successful, but the persistent raster artifact results in scans that resemble the distribution of tissue TICs. Sometimes, including the number of data points, base peak, or base peak intensity can help.

tree <- genieclust::gclust(cbind(log(scan_meta$tic), scan_meta$bp), k = 5)
cut <- cutree(tree, k = 5)
cimgplot <- matrix(data = NA, nrow = numScans[1], ncol = numScans[2])
for(i in 1:length(scan_meta$tic)){
cimgplot[i] <- cut[i]
}
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- imager::mirror(cimgplot, "y")
# tinker with rotations
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = y, fill = factor(value))) +
scale_fill_manual(values = c("#E69F00", "#000000", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#bcbcbc", "#a27500", "#5083ab", "#2a7353", "#ada82e", "#265283", "#944503", "#90587a")) + coord_fixed()
p

In this case, we encounter a resounding “not really”. There are two ways of potentially mitigating this. The first is raster-wise normalisation:

tree <- genieclust::gclust(normRaster(as.matrix(cbind(log(scan_meta$tic), scan_meta$numpeaks)), numScans, method = "median"), k = 5)
cut <- cutree(tree, k = 5)
cimgplot <- matrix(data = NA, nrow = numScans[1], ncol = numScans[2])
for(i in 1:length(scan_meta$tic)){
cimgplot[i] <- cut[i]
}
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- imager::mirror(cimgplot, "y")
# tinker with rotations
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = y, fill = factor(value))) +
scale_fill_manual(values = c("#E69F00", "#000000", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#bcbcbc", "#a27500", "#5083ab", "#2a7353", "#ada82e", "#265283", "#944503", "#90587a")) + coord_fixed()
p

This worked well enough that we can probably get a good tissue mask. There’s also a separate function called deband, which eliminates horizontal artifacting by scaling directional Fourier coefficients to have the same variance as the rest of the image (if that doesn’t mean anything to you, don’t worry about it)

tissueclusts <- c(4,5)
tissuepix2 <- which(cut %in% tissueclusts)

# remove objects smaller than x pixels
remnum <- 2
cimgplot <- matrix(data = 0, nrow = numScans[1], ncol = numScans[2])
cimgplot[which(cut %in% tissueclusts)] <- 1
cimgplot <- imager::as.cimg(cimgplot)


test <- split_connected(cimgplot)
test <- test[-which(sapply(test, sum) < remnum)]
test2 <- imappend(test, axis = "c")
test2 <- imsplit(test2,"c") %>% add
test2 <- as.cimg(test2)

tissuepix2 <- which(test2 != 0)

# fill the outline of the largest connected blob, enlarging image by one pixel then returning to original dimensions so contours still works on image edge
templot <- cbind(rep(0, nrow(as.matrix(test2))), as.matrix(test2), rep(0, nrow(as.matrix(test2))))
templot <- rbind(rep(0, ncol(as.matrix(templot))), as.matrix(templot), rep(0, ncol(as.matrix(templot))))

ct <- contours(as.cimg(templot),nlevels=1)

# get top 3 connected cluster

objs <- order(sapply(ct, function(sublist) sum(sapply(sublist, length))), decreasing = TRUE)[1:3]

test <- expand.grid(x = 1:nrow(templot), y = 1:ncol(templot))
tissuepix2 <- point.in.polygon(test$x, test$y, ct[[objs[1]]]$x, ct[[objs[1]]]$y)
tissuepix2 <- which(tissuepix2 == 1)
cimgplot <- matrix(data = 0, nrow = nrow(templot), ncol = ncol(templot))
cimgplot[tissuepix2] <- 1
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- (as.cimg(as.matrix(cimgplot)[2:(numScans[1]+1), 2:(numScans[2]+1)]))
obj1 <- which(cimgplot == 1)

cimgplot <- matrix(data = 0, nrow = numScans[1], ncol = numScans[2])
cimgplot[obj1] <- 1
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- imager::mirror(cimgplot, "y")
# tinker with rotations
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = y, fill = factor(value))) +
scale_fill_manual(values = c("#000000", "#FFFFFF")) + coord_fixed() +
  ggtitle("Largest connected cluster") + theme(plot.title = element_text(hjust = 0.5))
p


test <- expand.grid(x = 1:nrow(templot), y = 1:ncol(templot))
tissuepix2 <- point.in.polygon(test$x, test$y, ct[[objs[2]]]$x, ct[[objs[2]]]$y)
tissuepix2 <- which(tissuepix2 == 1)
cimgplot <- matrix(data = 0, nrow = nrow(templot), ncol = ncol(templot))
cimgplot[tissuepix2] <- 1
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- (as.cimg(as.matrix(cimgplot)[2:(numScans[1]+1), 2:(numScans[2]+1)]))
obj2 <- which(cimgplot == 1)

cimgplot <- matrix(data = 0, nrow = numScans[1], ncol = numScans[2])
cimgplot[obj2] <- 1
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- imager::mirror(cimgplot, "y")
# tinker with rotations
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = y, fill = factor(value))) +
scale_fill_manual(values = c("#000000", "#FFFFFF")) + coord_fixed() +
  ggtitle("2nd largest connected cluster") + theme(plot.title = element_text(hjust = 0.5))
p


test <- expand.grid(x = 1:nrow(templot), y = 1:ncol(templot))
tissuepix2 <- point.in.polygon(test$x, test$y, ct[[objs[3]]]$x, ct[[objs[3]]]$y)
tissuepix2 <- which(tissuepix2 == 1)
cimgplot <- matrix(data = 0, nrow = nrow(templot), ncol = ncol(templot))
cimgplot[tissuepix2] <- 1
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- (as.cimg(as.matrix(cimgplot)[2:(numScans[1]+1), 2:(numScans[2]+1)]))
obj3 <- which(cimgplot == 1)

cimgplot <- matrix(data = 0, nrow = numScans[1], ncol = numScans[2])
cimgplot[obj3] <- 1
cimgplot <- imager::as.cimg(cimgplot)
cimgplot <- imager::mirror(cimgplot, "y")
# tinker with rotations
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = y, fill = factor(value))) +
scale_fill_manual(values = c("#000000", "#FFFFFF")) + coord_fixed() +
  ggtitle("3rd largest connected cluster") + theme(plot.title = element_text(hjust = 0.5))
p

Concatenate these to be our tissue mask:

tissuemask <- unique(c(obj1, obj2, obj3))
length(tissuemask) / prod(numScans)
#> [1] 0.2175388

It’s now only ~22% of scans, and we can subset only those scans. In reality this has cut out a chunk of the middle tissue, and you may want to expand a border around your tissue with morphological operations, freehand select an ROI, etc. However, I can’t knit a vignette with an interactive input, so you’re free to figure out the function roiSelect on your own.

plan(multisession, workers = 8)
handlers("txtprogressbar")
start <- Sys.time()
with_progress({
peaklist_sewage <- readScans(file_path, adds = adds, indices = tissuemask, batch_size = 250, calib_coefs = coefs)
})
stop <- Sys.time()
stop-start
#> Time difference of 21.25537 secs

And now just to prove that we’ve actually selected the right scans, we’ll calculate their TICs and throw them into an image:

temptic <- rep(0, prod(numScans))
temptic[tissuemask] <- log(unlist(lapply(sapply(peaklist_sewage, "[[", 2), sum)))
cimgplot <- matrix(data = temptic, nrow = numScans[1], ncol = numScans[2])
cimgplot <- imager::as.cimg(cimgplot, x = numScans[1], y = numScans[2])
p <- ggplot(data = as.data.frame(cimgplot)) +
geom_raster(aes(x = x, y = rev(y), fill = log(value))) +
scale_fill_viridis_c(option = "magma", name = "Intensity (A.U)") + coord_fixed() +
scale_x_continuous(expand=c(0,0))+scale_y_continuous(expand=c(0,0)) + xlab(NULL) + ylab(NULL) + ggtitle("log TIC Image") +
theme(plot.title = element_text(hjust = 0.5))
p

In the case of large data, it’s probably a better idea to incorporate code to peak pick each scan as it’s processed e.g via MassSpecWavelet. But for now, this is just about data accessibility.

Processing data with .RAW SEWAGE (Raw Spectra Extracted With A Ginormous Effort)

Basic setup

Memory size

What are the practical benefits of any of this?