
Genamap
AUGUST 2016 - MAY 2017
Visualization application of a machine learning tool for Genome Wide Association Studies.
See website here for more information.

- Researched and designed various ways to display a large data set (on average over 15 million data points)
- Implemented a tile server that aggregates data with an aggregate function (such as min or max), allowing for quick visualizations of large data sets
- Worked with React for frontend, MongoDB for data aggregation, and Node.js for server-side development
- Utilized user research (gathered by expert interviews from University of Pittsburgh researchers) to inform development and design decisions
- Helped improve the user experience by improving features throughout the application
- Redesigned the visuals (i.e. logo) and email experience of Genamap
INTRODUCTION
When I joined the Genamap team, the application was slow and unusable most of the time. When analyzing medium to large data sets, it would either take a long time to load the graphs or crash the browser completely. At the time, the visualizations were built with D3.js, and each data point was drawn as a separate SVG element. But because we're dealing with genome-wide association studies, the data that the tool was intended for will almost always be large (the human genome has over 3 billion base pairs). Although the machine learning algorithms that produce the results behind these visualizations were solid, the web application was failing to provide a smooth user experience for understanding those results.
My role on this research team centered on one particular question: "how do we effectively display the results that our users want, but not in a way that makes the browser slow and ineffective?"
Genamap when I joined the team. This matrix displays only a small data set, yet it would take ~3 minutes to load.
PROCESS
Brainstorming:
At first, we didn't want to scrap the D3 efforts that were already in place for the matrix and Manhattan plot visualizations. I figured the main inefficiency of the application was that it rendered all of the data points, when in fact you can only see a small subset of them at any time. So one way to quickly address the issue was to chunk the data such that only the current viewport is queried and rendered. This lets us fetch only the points that are needed and spares the user a long wait.
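The viewport idea can be illustrated with a small sketch. This is not the code Genamap used; the `Viewport` type and the `visibleCells` function are hypothetical names, and the data is assumed to fit in an in-memory matrix purely for illustration.

```typescript
// Hypothetical types and names, for illustration only.
interface Viewport {
  rowStart: number; // index of the first visible row
  colStart: number; // index of the first visible column
  rows: number;     // number of visible rows
  cols: number;     // number of visible columns
}

// Return only the cells that fall inside the viewport,
// clamping the requested window to the matrix bounds.
function visibleCells(matrix: number[][], view: Viewport): number[][] {
  const rowEnd = Math.min(view.rowStart + view.rows, matrix.length);
  const visible: number[][] = [];
  for (let r = Math.max(view.rowStart, 0); r < rowEnd; r++) {
    const row = matrix[r];
    const colEnd = Math.min(view.colStart + view.cols, row.length);
    visible.push(row.slice(Math.max(view.colStart, 0), colEnd));
  }
  return visible;
}
```

With this approach the amount of work per render depends on the size of the viewport rather than on the size of the whole data set.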

We improved the placement of the graph and text so that words wouldn't be cut off or look awkward. We then thought of adding a mini-map to give context to the viewport and to help users navigate around the visualization more quickly.

We considered page numbers under the matrix, but decided that a page number didn't correspond to what the user would see.

We then thought about a "page-like" interface that, instead of pages, provides ranges of data points to jump to.
Solution:
Our team conducted user interviews with researchers at the University of Pittsburgh who conduct studies involving GWAS. These interviews gave us useful insights: for example, not every data point is interesting, but hotspots (outliers or spikes in the data) are the most interesting.
When we weighed these options (and others) against our user research, we decided against chunking the experience in a way that would keep you from understanding how one page connects to the next. Because our user interviews also indicated that not all of the data points needed to be there, only the "hotspots," we started to think about simplifying the process of finding these hotspots for our users. We looked to Google Maps for inspiration: how does Maps let us see the whole world in one view, but a street's shops and restaurants in another? The idea of aggregating our data on various levels became a reality in the spring of 2017, when we started working on a tile server, similar to Google Maps.
Implementing a tile server meant we would have to scrap the D3 implementation that was in place. Because this would involve a lot of work, we wanted to plan and design the new visualization tool to be sustainable for future visualizations and applications. Knowing this, we made sure to research the tradeoffs of various tools on the market: which tool should we use to implement the visualization? Which tool should we use to aggregate our data? Do we want to preprocess the aggregations or make them queryable on demand?
After deciding to use HTML tables (via the react-virtualized library) to display the visualizations (e.g. matrices, Manhattan plots, etc.), we shopped around for a database that supported aggregations. We settled on MongoDB at the time and started implementing an MVP for this new visualization tool, Geneviz if you will. The end result is a tool that allows the user to quickly dive into the graph, down to the hotspots that they find particularly interesting.
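As a rough sketch of what a queryable aggregation could look like in MongoDB, the function below builds an aggregation pipeline that groups the points of one column into fixed-size tiles and keeps the min or max of each tile. The document fields (`column`, `position`, `value`) and the function name `buildTilePipeline` are assumptions made for this example, not the actual Genamap schema; `$match`, `$group`, `$floor`, `$divide`, `$min`, `$max`, and `$sort` are standard MongoDB aggregation operators.

```typescript
// Assumed document shape (hypothetical): { column: number, position: number, value: number }
type TileAggregate = "min" | "max";

// Build a pipeline that buckets one column's points into tiles of
// `tileSize` consecutive positions and reduces each tile to a single value.
function buildTilePipeline(column: number, tileSize: number, agg: TileAggregate): object[] {
  const accumulator = agg === "min" ? { $min: "$value" } : { $max: "$value" };
  return [
    // Keep only the points that belong to the requested column.
    { $match: { column } },
    // The tile index is floor(position / tileSize); one output document per tile.
    { $group: { _id: { $floor: { $divide: ["$position", tileSize] } }, value: accumulator } },
    // Return tiles in positional order.
    { $sort: { _id: 1 } },
  ];
}
```

With the official Node.js driver, a pipeline like this would be passed to `collection.aggregate(pipeline)`. Computing tiles on demand this way trades query time for flexibility; precomputing them would be the alternative raised in the tradeoff questions above.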
Geneviz allows you to scroll into a particular column to get aggregates of its surrounding space. You can apply an aggregate function, such as min() or max(), to choose which summary of the data you want to see.
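The aggregate-then-zoom behavior can be sketched in a few lines. This is a simplified in-memory illustration, not the Geneviz tile server itself; `aggregateTiles`, `zoomInto`, and the tile sizes are hypothetical, and tile sizes are assumed to be positive integers.

```typescript
// The available aggregate functions; each reduces one tile to one number.
const aggregates = {
  min: (values: number[]) => values.reduce((a, b) => Math.min(a, b), Infinity),
  max: (values: number[]) => values.reduce((a, b) => Math.max(a, b), -Infinity),
};
type AggregateName = keyof typeof aggregates;

// Split `values` into consecutive tiles of `tileSize` points
// and reduce each tile with the chosen aggregate.
function aggregateTiles(values: number[], tileSize: number, name: AggregateName): number[] {
  if (tileSize < 1) throw new RangeError("tileSize must be at least 1");
  const tiles: number[] = [];
  for (let start = 0; start < values.length; start += tileSize) {
    tiles.push(aggregates[name](values.slice(start, start + tileSize)));
  }
  return tiles;
}

// Zooming in: take the points behind one coarse tile and
// re-aggregate them at a finer tile size.
function zoomInto(
  values: number[],
  tileIndex: number,
  tileSize: number,
  finerTileSize: number,
  name: AggregateName,
): number[] {
  const start = tileIndex * tileSize;
  return aggregateTiles(values.slice(start, start + tileSize), finerTileSize, name);
}
```

Using max() at a coarse zoom level keeps a spike visible even when thousands of points collapse into one tile, which is why an extreme-value aggregate suits the search for hotspots.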