Fish n' bits

How to Process and Visualize High-Density Aquaculture Data at Scale

Written by Manolin | Oct 10, 2025 8:53:14 PM

Written by Denver University Fellow, Vincent Engelstad 

When organizations face mountains of data, the instinct is often to reach for the most sophisticated algorithms available. But sometimes the most powerful solutions are surprisingly simple. This post demonstrates how targeted statistical methods can unlock valuable insights from large datasets, particularly in environments where computational efficiency and speed are critical factors.

During my work at Manolin, I was challenged to develop next-generation aquaculture analytics that would help fish farmers make better decisions about their operations. We focused specifically on feed products used for salmon and trout populations, leveraging Manolin's rich dataset on product usage and fish health outcomes. My goal was to analyze, rank, and visualize feed performance while mapping the range of possible health outcomes associated with different products.


Percentile-Based Ranking System

Working with my project sponsor, we needed to choose a ranking method that could handle the messy realities of aquaculture data: extreme outliers, skewed distributions, and highly variable farm conditions. We settled on an average percentile-based system that would be robust against these challenges.

The approach was straightforward: calculate percentile values for each observation, then average those percentiles by product across five key metrics: feed efficiency, mortality, growth, stress, and filet quality. Each metric was derived from farm-level measurements that we binned into meaningful categories.
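The core of that approach fits in a few lines of pandas. The sketch below is a simplified stand-in, with hypothetical column names rather than Manolin's actual schema:

```python
import pandas as pd

# Hypothetical metric columns; the real pipeline derives these from binned farm measurements.
METRICS = ["feed_efficiency", "mortality", "growth", "stress", "filet_quality"]

def rank_products(observations: pd.DataFrame, metrics=METRICS) -> pd.Series:
    """Average each product's per-metric percentiles into a single ranking score."""
    scored = observations.copy()
    for metric in metrics:
        # Percentile (0-1) of each observation within the metric's full distribution.
        # In practice, "lower is better" metrics such as mortality would be inverted first.
        scored[f"{metric}_pct"] = scored[metric].rank(pct=True)

    pct_cols = [f"{m}_pct" for m in metrics]
    # Average percentiles per product across all metrics, highest score first.
    return (
        scored.groupby("product")[pct_cols]
        .mean()
        .mean(axis=1)
        .sort_values(ascending=False)
        .rename("avg_percentile")
    )
```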

Manolin's existing analytics platform already emphasized rigorous data preprocessing, so I inherited a relatively clean dataset. Still, before diving into the analysis I needed to address the usual troublemakers: outliers, negative values, and null entries.

What seemed like a simple ranking problem quickly revealed two critical challenges:

  1. Computational efficiency: How do you make percentile calculations scalable when processing massive, continuously updating data streams?
  2. Communication: How do you translate statistical rankings into clear, actionable insights that non-technical stakeholders can easily understand and trust?


Hurdle #1: Dynamic and Scalable

The idea for solving this hurdle came as a literal shower thought from my main point of contact at Manolin, John Costantino. He had already tried to run code to calculate the percentiles for the dataset, but quickly exceeded the 48 GB of memory on his machine. To calculate exact percentiles, all the relevant values need to be held in memory and sorted from smallest to largest. For smaller or static datasets this might not pose much of an issue, but the environment these functions would run in is dynamic and could contain millions of observations. Some googling later, John sent me a paper on the percentile estimation algorithm T-digest, which can accurately estimate percentiles with far less memory by clustering observations into a compact set of centroids.

Two features made T-digest great for Manolin's needs:

Extreme efficiency: The algorithm handles data streams far larger than our use case, making it future-proof as the platform scales.

Incremental updates: New farm data can be incorporated without retraining the entire model, which is important when receiving daily updates from hundreds of farms.

I was able to implement T-digest in Python in fewer than 50 lines of code, most of which goes into defining the function that builds the digests.
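As a rough illustration, here is a minimal sketch of such a builder, assuming the `tdigest` PyPI package (one of several Python implementations; the post doesn't say which was used) and hypothetical `keys` and `metrics` arguments:

```python
from tdigest import TDigest

def build_digests(df, keys, metrics, delta=0.01):
    """Build one T-digest per (key, metric) pair from a batch of observations.

    `keys` might be a product-identifier column and `metrics` the measurement
    columns; both names are hypothetical.
    """
    digests = {}
    for key, group in df.groupby(keys):
        for metric in metrics:
            values = group[metric].dropna().values
            digest = TDigest(delta=delta)   # delta trades accuracy for digest size
            digest.batch_update(values)     # stream the whole batch into the digest
            digests[(key, metric)] = digest
    return digests
```

Because digests support incremental updates (and can be merged), each day's new observations can simply be streamed into the stored digests rather than recomputing percentiles from scratch.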

Once the dataframe, batch_scores, and keys were defined, along with functions for saving each model version, running the function was straightforward and took only a few minutes in my Colab environment.
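A runner in that spirit might look like the following sketch, where `build_digests` is the function above and `pickle` plus the `to_dict()` helper from the assumed `tdigest` package stand in for whatever versioned storage is actually used:

```python
import pickle
from datetime import datetime, timezone

def save_digest_version(digests, path_prefix="tdigest_model"):
    """Persist a dict of T-digests under a timestamped version tag (illustrative only)."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"{path_prefix}_{version}.pkl"
    # to_dict() reduces each digest to plain centroid data, which keeps the file portable.
    serializable = {key: digest.to_dict() for key, digest in digests.items()}
    with open(path, "wb") as fh:
        pickle.dump(serializable, fh)
    return path

# Hypothetical usage, mirroring the workflow described above:
# digests = build_digests(df, keys, metrics)
# model_path = save_digest_version(digests)
```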

I did not run it on the full production dataset, which contains millions of rows, while testing. Instead, I used a more compact dataset of roughly 30,000 rows containing a subset of the data that would be used in production. So you might ask: how do I know that T-digest will work with the much larger dataset in production?

One of the best parts about this algorithm is its ability to keep the digest size roughly constant, governed by its tuning parameters, regardless of how much data it has seen. My initial model used ~30,000 rows of data and the result was 20 KB per T-digest. The current production version, trained on ~100x more data, sits at only 26 KB.
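A quick way to see that property with the same assumed `tdigest` package is to compare serialized sizes on synthetic data (the larger run takes a while, since the pure-Python implementation updates point by point):

```python
import json
import numpy as np
from tdigest import TDigest

def digest_size_kb(n_points, delta=0.01, seed=0):
    """Build a digest from n_points synthetic values and return its serialized size in KB."""
    values = np.random.default_rng(seed).lognormal(size=n_points).tolist()  # skewed, like farm metrics
    digest = TDigest(delta=delta)
    digest.batch_update(values)
    return len(json.dumps(digest.to_dict())) / 1024

print(digest_size_kb(30_000))     # tens of KB
print(digest_size_kb(3_000_000))  # roughly the same order of magnitude, despite 100x more data
```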

Hurdle #2: An Effective Visualization for All

This can be a deceptively challenging problem to approach, as some of the most informative and accurate visualizations in data analysis can be difficult to interpret without prior analytics experience.

The ranking and product outcomes visuals needed three qualities to be successful for Manolin's intended use case: 

  • Visually Appealing
  • Informative
  • Easily intelligible 


Product Outcomes

After brainstorming visualization approaches with the Manolin team, we landed on joyplots as our solution. These ridge-style distribution plots were perfect: they clearly show individual measurement distributions while making it easy to compare multiple products at a glance. Unlike box plots, which require viewers to interpret quartiles and outliers, joyplots let the data speak for itself through intuitive distribution shapes.


Example Joyplot (Measurement Name Removed)

These plots tell a story of how different products can produce outcomes with very different distributions. Products with taller peaks delivered more consistent results on a given measurement, while products with more spread-out curves produced a wider range of outcomes. While this information alone should not drive decision making, it clearly establishes that fish health metrics vary by product.
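For readers who want to produce a similar figure, here is a minimal sketch assuming the `joypy` package and hypothetical `product` and measurement column names:

```python
import joypy
import matplotlib.pyplot as plt

def plot_product_outcomes(df, measurement, out_path=None):
    """Draw a ridge (joy) plot of one measurement's distribution for each product."""
    fig, axes = joypy.joyplot(
        df,
        by="product",        # one ridge per product (hypothetical column name)
        column=measurement,  # the farm-level measurement to compare
        overlap=1.5,         # how much neighboring ridges overlap
        colormap=plt.cm.Blues,
        figsize=(8, 6),
    )
    plt.xlabel(measurement)
    if out_path:
        plt.savefig(out_path, dpi=150, bbox_inches="tight")
    return fig, axes
```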


Ranking System Visual

The ranking algorithm and visualization pipeline came together in roughly 400 lines of Python code. The core function ingests percentile data and metric scores, then ranks products across five visible performance categories plus a sixth hidden metric: a weighted average that determines the final ranking order.
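The weighted-average step at the heart of that function can be sketched as follows; the weights and column names here are purely illustrative, not Manolin's actual values:

```python
import pandas as pd

# Illustrative weights for the five visible categories (not Manolin's actual weighting).
WEIGHTS = {"feed_efficiency": 0.30, "mortality": 0.25, "growth": 0.20,
           "stress": 0.15, "filet_quality": 0.10}

def final_ranking(product_scores: pd.DataFrame, weights=WEIGHTS) -> pd.DataFrame:
    """Add the hidden weighted-average metric and sort products by it.

    `product_scores` has one row per product and one 0-1 percentile column per metric.
    """
    w = pd.Series(weights)
    w = w / w.sum()  # normalize so the weights sum to 1
    ranked = product_scores.copy()
    ranked["weighted_avg"] = ranked[list(w.index)].mul(w).sum(axis=1)  # the hidden sixth metric
    return ranked.sort_values("weighted_avg", ascending=False)
```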

I designed the system to export ranking scores for flexible visualization, allowing the results to be integrated into dashboards, reports, or presentation tools beyond Python's ecosystem.


The Final Results

The visualization successfully communicates product performance differences through an intuitive color-coded bubble chart. Darker blue bubbles indicate higher performance scores, while lighter blue (turquoise) represents lower scores, creating an immediately readable performance map.
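A rough matplotlib sketch of that kind of chart, using synthetic scores and a blue colormap so that darker means better:

```python
import matplotlib.pyplot as plt
import numpy as np

def bubble_chart(products, metrics, scores):
    """Plot products (rows) vs. metrics (columns) as bubbles sized and colored by score."""
    xs, ys = np.meshgrid(np.arange(len(metrics)), np.arange(len(products)))
    fig, ax = plt.subplots(figsize=(8, 5))
    sc = ax.scatter(xs.ravel(), ys.ravel(),
                    s=scores.ravel() * 1500,         # bubble area scales with score
                    c=scores.ravel(), cmap="Blues",  # darker blue = higher score
                    vmin=0, vmax=1, edgecolors="grey")
    ax.set_xticks(range(len(metrics)))
    ax.set_xticklabels(metrics, rotation=30, ha="right")
    ax.set_yticks(range(len(products)))
    ax.set_yticklabels(products)
    ax.invert_yaxis()                                # top-ranked product on top
    fig.colorbar(sc, label="performance score")
    return fig, ax

# Synthetic example data, purely for illustration
products = ["Feed A", "Feed B", "Feed C"]
metrics = ["feed_efficiency", "mortality", "growth", "stress", "filet_quality"]
scores = np.random.default_rng(1).uniform(0.3, 0.95, size=(len(products), len(metrics)))
bubble_chart(products, metrics, scores)
plt.show()
```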

The project delivered three key tools for Manolin's analytics platform: a memory-efficient percentile calculation system using T-digest, a scalable product ranking algorithm, and automated joyplot generation functions.

Together, these components give Manolin new tools to process and present aquaculture product performance.