Even the memes recognize that modeling is the easy part. Source

Don’t Sleep On Data Operations

When most people think about deep learning practitioners, they think of data scientists who whisper to machine learning models using special powers they learned during their PhDs.

While that may be true for some organizations, the reality of most practical deep learning applications is more banal. The biggest determinant of model performance is now the data, not the model code. And when data is supreme, data operations becomes the most important part of your ML team.

An Intro To Data Operations

Fundamentally, data operations teams are responsible for the maintenance and improvement of the datasets that models train on. Some of their responsibilities include:


Here, we train a very simple model on the Speech Commands audio dataset and analyze its failure cases to see how best to improve it!
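
As a rough illustration of what "analyzing failure cases" can look like, here is a minimal sketch that counts which (true label, predicted label) pairs a classifier gets wrong most often. The helper name `top_confusions` and the toy labels and predictions are made up for this example; they are not results from the actual model in the post.

```python
from collections import Counter

def top_confusions(labels, predictions, k=3):
    """Return the k most frequent (true, predicted) pairs among the mistakes."""
    confusions = Counter(
        (true, pred) for true, pred in zip(labels, predictions) if true != pred
    )
    return confusions.most_common(k)

# Hypothetical ground truth and model outputs, for illustration only.
labels      = ["yes", "no", "up", "down", "no", "no", "up"]
predictions = ["yes", "go", "up", "down", "go", "no", "off"]

for (true, pred), count in top_confusions(labels, predictions):
    print(f"true={true!r} predicted={pred!r} count={count}")
```

Sorting mistakes this way points you at the confusions worth listening to first, instead of scrolling through errors at random.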

In the last decade, deep learning has become a popular and effective technique for learning on many types of data. While most people know of applications of deep learning to images or text, deep learning can also be useful for a variety of tasks on audio data!

At my previous job working on self-driving cars, I spent a lot of time working on deep learning models for imagery and 3D point clouds. While we had some cool applications of audio processing, I only dabbled in it and never seriously worked on it.

However, here at Aquarium, we have…


An example of an interactive embedding visualization generated in Aquarium.

Machine learning on unstructured data (like images or audio) is much harder than machine learning on structured data. Fortunately, deep neural networks can produce structured representations known as “embeddings” that are remarkably effective in organizing and wrangling large sets of unstructured data.

At Aquarium, we heavily utilize embeddings to make it easier for our users to work with unstructured data. In this article, we’re going to explore why these embeddings are so effective and how to use these embeddings to speed up common ML workflows.
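
To make the idea concrete, here is a minimal sketch of one thing embeddings let you do: find the items most similar to a query by comparing vectors. The three-dimensional vectors and file names below are invented for illustration (real embeddings come from a neural network and have many more dimensions), and `nearest_neighbors` is a hypothetical helper, not part of any Aquarium API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, embeddings, k=2):
    """Return the names of the k embeddings most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical embeddings, for illustration only.
embeddings = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "truck_1.jpg": [0.0, 0.1, 0.9],
}

print(nearest_neighbors([1.0, 0.0, 0.0], embeddings, k=2))
```

Once every image or audio clip is a vector, "find more examples like this one" becomes a simple distance comparison, with no hand-written features required.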

Structured and Unstructured Data


The sight of this curve can strike fear into the hearts of machine learning practitioners. Source

Whenever an ML team discusses what they should do to improve their models, there’s inevitably a point at which someone throws up their hands and says, “Well hey, let’s just get more data and retrain the model. Maybe that will help.”

There’s some promise to this idea. Holding the model code fixed and iterating on data can be the most effective way to improve your models. Retraining on randomly sampled data yields great improvements up to a certain point. But after that point, you hit diminishing returns: each additional batch of data improves model performance less than the one before it.


This is the second part of our series on scaling ML teams! Read on if you’re a leader in an ML team that is starting to grow beyond 10 people and scale from one team to a collection of teams working on the ML pipeline. If you’re a smaller team or you haven’t seen our first post on scaling from 0–10 people, you may want to read that first.

Building An Org

If your ML teams are misaligned, your org can start to feel a lot like Westeros.

It’s been a frantic few years. The ML pipelines have become more mature, more models have been deployed in new areas, and total ML-related headcount has grown quite a lot. What…


Illustrated With Memes

So you’re doing an ML project! Maybe you want to build an object detection system for a robotics application or you want to add a recommender system to your webapp.

You’ll need a team to build and improve this ML system. In the beginning, this can be a single (very stressed) engineer hacking together an MVP, but it can evolve into an entire department with highly specialized teams and hundreds of people. At each stage of developing a model pipeline, you will encounter different problems that require different team structures to overcome.

Leaders (Engineering Managers, Tech Leads, Product Managers, etc.)…


Adventures in ML Engineering

Improve your model by improving the data!

Let’s say you’re training a machine learning model to solve a problem. You’ve just gotten something working. Maybe you used a model from scikit-learn or tensorflow/models, but now it’s at 80% accuracy and you want to make it better. What do you do?

You may be tempted to try some clever feature engineering. Perhaps using YOLOv7 would be better than the YOLOv3 you’re using now. Or hey, maybe you should try that new optimizer (NAdamGradProp with momentum) you saw on arxiv-sanity last week.

Hold your horses. Before you try anything, figure out what the problem is and then…

Peter Gao

Cofounder and CEO of Aquarium! Ex-Cruise, Khan Academy, and Pinterest.
