Have you ever run into the problem of trying to write a vector or matrix that R cannot store? Dr. Kylie Bemis, faculty in the Khoury College of Computer Sciences at Northeastern University, ran into this problem during graduate school and wrote a package called matter that solves it. Development for matter was supported by a grant from the R Consortium.
Dr. Bemis holds a B.S. degree in Statistics and Mathematics, an M.S. degree in Applied Statistics, and a Ph.D. in Statistics from Purdue University. She’s run the Boston Marathon twice and has won numerous academic awards including the John M. Chambers Statistical Software Award from the American Statistical Association.
Dr. Bemis is currently working on providing support for matter and better handling of larger data sets and sparse non-uniform signals. Sparse data in mass spectrometry requires handling data that is zero or does not exist at all. The goal is to better interpolate or resample that data.
In particular, by providing support for non-uniform signal data, matter will be able to provide a back end to mass spectrometry imaging data. But working with large files is applicable in a lot of domains, covering other kinds of hyperspectral data. It is a problem in digital signal processing without many solutions.
What is matter and what does it do? What problem are you solving?
matter is an R package designed for rapid prototyping of new statistical methods when working with larger-than-memory datasets on disk. It provides memory-efficient reading, writing, and manipulation of structured binary data on disk as vectors, matrices, arrays, lists, and data frames.
Data sets might be larger than available memory. matter is completely file based and does its interpretation on the fly. matter aims to offer strict control over memory and maximum flexibility with file-based data structures, so it can be easily adapted to domain-specific file formats, including user-customized file formats.
matter is done with most of its major milestones and is currently on version 2.2. To download and get started now, see: http://bioconductor.org/packages/matter/. The matter 2 User Guide is here: http://bioconductor.org/packages/release/bioc/vignettes/matter/inst/doc/matter-2-guide.html
What types of formats are you extending matter to support?
Technically, we’re not extending to new formats but rather improving support for existing ones, in particular sparse matrix support. We do have sparse matrices in matter, but they are not as easy to work with as dense matrices. The main idea of matter is to work with larger-than-memory matrices without loading them into memory. We have a little of that with sparse matrices, but that part is written in R, so it’s not the fastest code in matter. The dense matrix is written in C and C++, so it’s efficient. It also means that we can use the ALTREP framework that R has introduced. You can have something that looks like an ordinary R matrix or array, and in the background, it’s supported by a matter matrix or array.
A few packages make use of this, and it’s something that we are working on and improving for the dense matrices. We can’t do that yet with the sparse matrices because ALTREP works through the C layer of R. We have to implement ALTREP at that level, and since the sparse matrix representation is in R, we can’t currently have a sparse matrix ALTREP object. That’s another reason that we want to have a sparse matrix in C and C++. Not only will it become faster, but as ALTREP becomes more mature and more efficient, we can use it with sparse matrices. The main thing that comes out of that is, hopefully, being able to use matter matrices in places where you wouldn’t normally get to use them, because with an ALTREP object, the function doesn’t have to be aware that this isn’t a regular R matrix or array.
Beyond that, I want to improve data frame and string support. Right now, we have a prototype data frame in the package, and that’s something that people will be interested in down the line. So, looking at the best way to make an out-of-memory data frame more mature and fully featured is another thing that we are working on now. Lastly, string support: we have some built-in support for strings. It’s not the most intuitive to use, and I’m not entirely sure how useful it will be for some people. But, based on the way that matter is built to be flexible, I realized that it is something that we could do. I wanted to explore what we could do in terms of storing strings and character vectors and reading them from files. One of the bigger challenges going forward is that, by its nature, this is not going to be as efficient as the other data types. If you have a character vector, each string might be of a different length. With a lot of other formats, we can assume we have an idea of what the file structure looks like, whereas with strings that won’t be the case. So we have to start by parsing the document and figuring out what we are breaking on, whether that is new lines or some other delimiter.
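To make the variable-length problem concrete, here is a minimal Python sketch of the general idea: scan a text file once, record where each line starts and how long it is, and then read any single string with one seek. This is only an illustration of the concept and not how matter implements string support. The function names `index_lines` and `read_string` are hypothetical, and the sketch assumes the strings are separated by new lines.

```python
import os
import tempfile

def index_lines(path):
    """Scan the file once and record (byte offset, byte length) of every line."""
    index = []
    offset = 0
    with open(path, "rb") as f:
        for line in f:                      # binary mode splits on b"\n"
            body = line.rstrip(b"\n")       # length excludes the line break
            index.append((offset, len(body)))
            offset += len(line)             # advance past the line break too
    return index

def read_string(path, index, k):
    """Read only the k-th string, using its recorded offset and length."""
    offset, length = index[k]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")

# A toy file with strings of different lengths, including an empty one.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")
with open(path, "wb") as f:
    f.write(b"alpha\nbe\ngamma rays\n\ndelta")

index = index_lines(path)
third = read_string(path, index, 2)
print(index)
print(third)
```

The up-front scan is the extra cost that fixed-width numeric data does not pay: for doubles the offset of element k is simply k times 8 bytes, while for strings it has to be discovered by parsing.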
How exactly does matter allow the use of such large files?
The way that matter works with these large files is kind of simple. Any programming language (R, C, C++, Java) will have a function somewhere that allows you to read some small chunk of a file, whether text or binary. To work with these files, matter calls a C++ function that reads a small chunk of a file. Where the magic happens is that we assume that we have a very large matrix or array, and a column in that matrix or a section of that array might come from different parts of a file or from different parts of multiple files. That’s where matter comes in. matter stores a blueprint, or dictionary, of where the different columns are stored in the files and maps them to their locations in the matrix or array. Depending on what part of the matrix we are accessing, matter figures out which parts of the file or files to read and the most efficient way to read them, so that it does not do too many reads (since that’s the slowest part of the process).
matter goes and figures this out, reads it into memory, rearranges it if needed so it’s in the desired shape, and returns it to you as an ordinary in-memory matrix or array.
A lot of similar packages use memory mapping, which maps an object directly onto some big file. Operating systems are really good and efficient at doing that sort of thing. The problem we have is that a lot of the time our data comes from multiple files and from different places within a file. So, we needed something more flexible. The main thing I wanted to do was figure out how to map between where the different columns were stored and how I wanted to interact with them as a programmer. That’s the main thing that matter does.
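For contrast, here is a minimal Python sketch of the memory-mapping approach, using the standard `mmap` module. It assumes the simplest possible layout: one file holding one dense, column-major matrix of little-endian doubles with no header. The helper `mapped_element` is a hypothetical name for this example. The layout assumption is exactly what makes memory mapping convenient, and also what makes it awkward when columns are scattered across several files.

```python
import mmap
import os
import struct
import tempfile

n_rows, n_cols = 4, 3
values = [float(v) for v in range(n_rows * n_cols)]  # 0.0 .. 11.0, column-major

path = os.path.join(tempfile.mkdtemp(), "dense.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))

def mapped_element(mm, i, j):
    """Element (i, j) of a column-major matrix of doubles inside the mapping."""
    offset = (j * n_rows + i) * 8       # position is a fixed formula of (i, j)
    return struct.unpack_from("<d", mm, offset)[0]

with open(path, "rb") as f:
    # Map the whole file; the operating system pages data in on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        corner = mapped_element(mm, 3, 2)   # last row, last column
        first = mapped_element(mm, 0, 1)    # first row, second column

print(corner, first)
```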
matter is hosted on Bioconductor, but other sectors (such as sentiment analysis) also use large datasets. How can matter be used in these applications?
Right now it is hosted on Bioconductor because we come from a Bioconductor background. My work comes from a mass spectrometry background with imaging and proteomics. So, Bioconductor was a natural home for matter and our other packages were hosted there. As a system, it made sense to host it there. There is no reason it can’t be used by other domains.
With Bioconductor, there is no reason that it’s restricted to just Bioconductor packages. Most of the packages are based on bioinformatics itself, but they might have applications outside of bioinformatics. And matter is one of those. So, if you are working in a different area, downloading and installing isn’t that hard and not really any more difficult than downloading and installing a package from CRAN.
So, right now we think Bioconductor is the best place to host matter, and we think people should be able to use it in most domains. The use of sparse data is more useful for mass spectrometry and proteomics, so that is why we are working on these, but we hope that it is usable across other domains.
Why did you realize you had to expand?
The main reason was that we needed better sparse matrix support. We had it as a prototype in R, but we needed it to be as fast as the dense implementation. To do this we needed to implement it in C and C++ and make sure it’s more compatible with ALTREP and some of the other Bioconductor big data packages. A lot of that came from mass spectrometry imaging. We work with a lot of imzML files. This format is specific to mass spectrometry imaging. It has two files for every data set. One is an XML file, which is a text file holding the metadata that describes the experiment and the mass spectra in the binary file. The second is a binary file that has all the mass spectra from the dataset, stored as intensity arrays and mass-to-charge ratio arrays. There are two subformats: one is sparse and one is dense. A lot of the more recent experiments, with high mass resolution and high spatial resolution, require storing more data in the sparse format. We are seeing more datasets coming in using this sparse format. I have another package called Cardinal that has matter as a backend. We didn’t want it to be slower on these large sparse data sets. So, to deal with that we needed to improve support for sparse data sets (processed imzML). That was the main driver: making sure things were fast for the new data sets. Also, there is a lot of potential for sparse data and sparse matrices. That was important for me.
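As a rough illustration of why sparse spectra need different handling from dense ones, the Python sketch below stores one spectrum as two parallel arrays (sorted mass-to-charge values and their intensities) and looks up the intensity on a shared grid, returning zero where no peak was recorded. This is a simplified sketch and not the algorithm used by matter or Cardinal. The function name `intensity_at`, the tolerance value, and the example numbers are all made up for illustration.

```python
from bisect import bisect_left

def intensity_at(mz, intensity, target, tol):
    """Intensity of the stored peak nearest to target, or 0.0 if none within tol."""
    k = bisect_left(mz, target)         # binary search in the sorted m/z array
    best = None
    for c in (k - 1, k):                # only the two neighbours can be nearest
        if 0 <= c < len(mz) and abs(mz[c] - target) <= tol:
            if best is None or abs(mz[c] - target) < abs(mz[best] - target):
                best = c
    return intensity[best] if best is not None else 0.0

# One sparse spectrum: only three peaks were recorded.
spectrum_mz = [100.0, 150.5, 300.25]
spectrum_int = [12.0, 3.5, 40.0]

# Resample onto a shared grid so it can be treated like a row of a dense matrix.
grid = [100.0, 150.0, 200.0, 300.0]
dense_row = [intensity_at(spectrum_mz, spectrum_int, g, tol=0.5) for g in grid]
print(dense_row)
```

Because each spectrum has its own mass-to-charge values, every access involves a search like this one rather than a fixed offset calculation, which is one reason a fast compiled implementation matters for sparse data.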
How did you get involved?
My background is in bioinformatics and my Ph.D. is in statistics. I started working on a collaboration with Graham Cooks’s lab at Purdue. They worked on a project called DESI for mass spectrometry imaging. The idea is that with mass spectrometry, we can collect from a sample and the mass spectrum tells us the different abundances in the sample. And with imaging mass spectrometry, we collect hundreds of thousands of mass spectra from across the surface of the sample. If we choose a particular chemical, we can reconstruct an image of where that chemical comes from and where we are seeing it at different abundances across the sample. My Ph.D. was on developing different statistical methods for analyzing this type of mass spectrometry imaging data, which is really interesting data because the mass spectra are very high dimensional. Then we also have this spatial component, so we have the x and y coordinates of the image as well. So it’s a very interesting, complicated problem.
During my Ph.D. I developed some statistical methods for working with this data. I also developed the Cardinal package for working with and analyzing this kind of data. One of the main difficulties was importing this data and working with it in R in the first place. That wasn’t something that existed before, so it was implemented in Cardinal. Then, as these experiments got bigger, the data files got larger, and I realized I couldn’t just pull the whole thing into memory. I needed a way to work with these larger-than-memory files, especially because a lot of our users, such as labs in the life sciences, chemists, and others, don’t necessarily have the access or the know-how to work on the cloud or on clusters. So a lot of work is done on PCs.
At that point I developed the project matter, primarily to be a back-end for Cardinal, but also as something that can be used by anyone who uses larger-than-memory files without converting to other formats. One of the main ideas of matter was that we didn’t want you to convert to another format if it’s easier to work with the original format. That’s what we did with imzML. A lot of these instruments are collecting data in some proprietary type and they have to convert to imzML. We didn’t want to make them convert to another format again to use matter.
What was your experience working with the R Consortium? Would you recommend applying for a grant to others?
The R Consortium was very understanding of delays with COVID and private issues. I’m grateful for that. The grant process provides a structure for building a complete application. That is useful, and I would recommend it to others.
I also presented at a useR! conference early on, and at Bioconductor, and I am considering doing more. I think connecting with the community, presenting your ideas, and getting feedback is a key part of the process.
What do you do for your day job?
I am teaching faculty at the Khoury College of Computer Sciences at Northeastern University in Boston. My duties include teaching R in the master’s program in data science. The classes I teach are introductory data science and introductory coding for our master’s students, and the project-based capstone course, where the students develop a project, work on it as a group, and hopefully include it in their portfolio for industry and interviewing.
What non-project hobbies do you have?
I am a writer and a runner. I write science fiction and fantasy. I haven’t been able to write too much during COVID. I have published two short stories and am working on revising a novel. One of my short stories was nominated for an Otherwise Award, which is an award for LGBTQ+ authors. That is something that I am proud of. I have also started running again, and I’m trying to get faster.
About ISC Funded Projects
A major goal of the R Consortium is to strengthen and improve the infrastructure supporting the R Ecosystem. We seek to accomplish this by funding projects that will improve both technical infrastructure and social infrastructure.