<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://alexandruburlacu.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://alexandruburlacu.github.io/" rel="alternate" type="text/html" /><updated>2024-11-08T13:58:37+00:00</updated><id>https://alexandruburlacu.github.io/feed.xml</id><title type="html">Alexandru Burlacu</title><subtitle>A blog about advanced machine learning topics, MLOps, software engineering, distributed systems, and more.</subtitle><entry><title type="html">MLOps for independent research</title><link href="https://alexandruburlacu.github.io/posts/2023-01-12-mlops-for-independent-research" rel="alternate" type="text/html" title="MLOps for independent research" /><published>2023-01-12T21:00:00+00:00</published><updated>2023-01-12T21:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/mlops-for-independent-research</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2023-01-12-mlops-for-independent-research"><![CDATA[<p><strong>… Or how to run experiments on a budget.</strong></p>

<p>On December 5th, I presented online at the <a href="https://www.meetup.com/mlops-belgium/events/289639571/">Belgium MLOps meetup in Ghent</a>. I thought more people would benefit from the content of that presentation and my experience in general, so I decided to also publish it as an article on my blog. While working on that presentation, I found a few unexpected things, but more about that later.</p>

<p>Oh, by the way, one of the best alternative titles was:</p>

<center><img src="/_data/MLOpsBelgium/Alt-Title.webp" width="850" heigth="480" /></center>
<center><i>I went with food | Image based on the slides by the author</i></center>

<!-- **UPDATE**: Here's the recording from that presentation -->

<h2 id="prologue---some-context">Prologue - Some context</h2>

<p>I believe it’s important to outline my main research driver:</p>
<blockquote>
  <p><strong><em>I’m searching for methods to train strong neural networks from scratch with minimum annotated data. Ideally, with minimum data.</em></strong></p>
</blockquote>

<p>Why? Throughout my career, I had cases where data was scarce and expensive to acquire, and even a pre-trained model couldn’t help. So I had to create small bespoke models to tackle my problems. It was a huge pain; I never want to go through that hell again, and I wish no one else would have to either.</p>

<p>Besides, sometimes, using a pre-trained model can be restrictive, depending on its license. Currently, the most relevant type of restrictive license for AI is <a href="https://bigscience.huggingface.co/blog/the-bigscience-rail-license">RAIL</a>. If you wonder why such licenses are restrictive and don’t want to dive into the legal aspects, here are a few good links.</p>

<ul>
  <li><a href="https://blog.tidelift.com/evaluating-the-rail-license-family">Evaluating the RAIL license family</a></li>
  <li><a href="https://www.reddit.com/r/StableDiffusion/comments/z8x4k3/the_changes_between_the_creativeml_open_railm/">A Reddit discussion about various RAIL variants and their implications</a></li>
  <li><a href="https://www.youtube.com/watch?v=W5M-dvzpzSQ">The New AI Model Licenses have a Legal Loophole | Yannic Kilcher</a></li>
</ul>

<p>To form a more nuanced view of ML and licensing, see the two-part essay <a href="https://thegradient.pub/machine-learning-ethics-and-open-source-licensing/">by Christopher Moran on The Gradient</a>. We won’t dive any deeper into this rabbit hole, otherwise we’ll stray waaaaay too far from this blog’s scope.</p>

<!-- 
https://www.digitalocean.com/community/tutorials/understanding-open-source-software-licenses
https://fossa.com/developers-guide-open-source-software-licenses
https://www.digitalocean.com/community/conceptual-articles/free-vs-open-source-software
 -->

<p>So anyway, in the summer of 2021, I had a research internship at Université Paris Sorbonne Nord. I had my own research agenda, and my supervisor was super cool about it. My research project was about searching for more sample-efficient self-supervised learning (SSL) techniques. I was working with images, but the method had to be modality-agnostic.</p>

<p>The only downside, stemming from my not wanting to work on some existing, grant-covered project, was that I had no access to the necessary hardware.</p>

<p>But that’s alright. It is, isn’t it?</p>

<h2 id="you-want-to-do-some-independent-research">You want to do some independent research</h2>

<p>How do you proceed?</p>

<h3 id="solution-you-buy-a-gpu">Solution: You buy a GPU.</h3>

<!-- Emoji here -->
<p>🪄🪄 Or better yet, you buy many GPUs. 🪄🪄 <!-- Emoji here too --></p>

<p>Problem solved.</p>

<p>Bye.</p>

<p>Hold on, seriously. How do you proceed? A good GPU machine will set you back a few thousand USD, even with the crypto boom somewhat behind us.</p>

<p>Besides, my project was pretty short-term, so such an investment would be a net loss. And I’m not even counting the time I could spend on it playing games instead of training nets.</p>

<p>And if that wasn’t enough, depending on where you live and the quality of your electric wiring, such a machine will bring more pain and expenses than joy. Have you ever had your personal computer/workstation randomly shut down due to excessive power consumption, maybe even taking down all your desk appliances with it? I have.</p>

<h3 id="free-solution-google-colab">Free solution: Google Colab</h3>

<p>A popular alternative would be to use Google Colab. But not so fast. There are some limitations worth mentioning. Colab’s free tier will only allow you one GPU per account, you have to be mindful of the daily GPU quota (about 8 hours within 24h), and you can’t run the same notebook in parallel, even on the CPU runtime.</p>

<p>What about Colab Pro/Pro+?</p>

<ol>
  <li>You are not guaranteed any specific GPU. It could be a P100, a T4, or, once in a blue moon, a V100.</li>
  <li>It’s still a single notebook. What if I want multiple?</li>
  <li>What are “compute units”, and how much does each GPU cost?</li>
</ol>

<p>If I am to pay for a service, I’d like to understand what I am paying for and how I’m billed. The opacity of Colab Pro and Pro+ is something I’m not sure I’d be willing to accept.</p>

<h2 id="the-first-not-so-good-solution">The first (not so) good solution</h2>

<p>Given all that, my first solution was to rely on Colab because it has free access to some GPU resources. With the saved money, I indulged myself with over 20 different kinds of cheese and too many macaron flavors to count. Vive la France!</p>

<p>To run more experiments and somehow circumvent the limited access to GPUs, I used multiple Google accounts. Each account had a copy of the same Colab notebook; only the hyperparameters changed between them. If you wonder whether managing these identical-but-not-quite notebooks was a mess, I’ll answer you - it was an absolute mess.</p>

<p>As for my storage solution - I stored model checkpoints in a shared Google Drive. Given that a blob’s storage consumption is counted against the account that created it, not the account hosting the shared drive, in practice, the amount of available Google Drive storage doubled.</p>

<p>What about experiment tracking? - Google Sheets. Yes, it started to become a mess after the 3rd change of the experiment setup.</p>

<h2 id="towards-a-better-solution">Towards a better solution</h2>

<p>Of course, it was unsustainable and slow. And painful. And annoying. And somewhat challenging to replicate. So, I needed another solution, and by this time had outlined some constraints:</p>

<ul>
  <li><strong>Constraint One</strong>: Messy environment, mainly Jupyter, with relatively limited code refactoring</li>
  <li><strong>Constraint Two</strong>: Ideally, I wanted numerically replicable experiments</li>
  <li><strong>Constraint Three</strong>: Also, experiments take a long time, so I want to run many at the same time</li>
  <li><strong>Constraint Four</strong>: Cost is a big issue because the research is self-funded</li>
</ul>

<p>Based on these constraints, I had my core requirements: <strong>Cost-efficiency</strong>, <strong>Flexibility</strong>, and <strong>Reproducibility</strong>. I had some ideas in mind to accomplish these requirements, but I needed computing resources, so my next stop was to use a public cloud.</p>

<p>I picked GCP because I’m most familiar with it. I know about alternative GPU clouds like Paperspace or Linode, but <em>I felt</em> that they might be more expensive. Plus, again, I am most familiar with GCP.</p>

<center><img src="/_data/MLOpsBelgium/MLOps for independent research.gif" width="850" heigth="480" /></center>
<center><i>If you look long enough, you'll hear the song | Image based on the slides by the author</i></center>

<p>Initially, I provisioned stuff from the Web console. But it was tedious and error-prone; besides, I like CLIs better, and I had had Terraform and Ansible on my radar for a while.</p>

<h3 id="core-requirements-cost-efficiency">Core requirements: Cost-efficiency</h3>

<p>Here are some decisions that stemmed from this requirement.</p>

<ol>
  <li>I needed the cheapest powerful machines - Preemptible VMs with GPUs</li>
  <li>I also needed a simple way to quickly spin machines up and down, so that I don’t leave anything running by accident and don’t waste time setting up the environment - Terraform FTW, and Ansible too</li>
  <li>I had a hunch that by using the most powerful machine and maximizing its usage, I would have the best price-performance ratio - thus, I chose A100 GPUs. To be absolutely honest, another driver for this decision was the coolness factor</li>
  <li>I ran multiple experiments in parallel, as fast as possible - I used Papermill for the hands-off launch of multiple notebook-based experiments. Occasionally, I used tmux from the JupyterLab terminal window, but it was a total pain.</li>
  <li>The best cost optimization is not to run things at all - so I used hyperparameter optimization (HPO) to select which configurations to run. For HPO, I used Optuna.</li>
</ol>

<p>Of all the HPO tools out there, why did I choose Optuna, you may ask?</p>

<ul>
  <li>I like their API. It integrates nicely with Python control structures, like for-loops or if-elif-else; see the sketch right after this list.</li>
  <li>Optuna uses a Bayesian HPO approach. Bayesian methods are pretty accurate and more hands-off than random search, allowing me to launch the hyperparameter search sweep and not think about narrowing down the search space.</li>
  <li>A downside of Bayesian Optimization methods is that they are slow-ish / not very parallelizable. But that’s ok, my degree of parallelization is 2-5 parallel runs, and I didn’t intend to go multi-node.</li>
</ul>
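
<p>To make this concrete, here’s a minimal sketch of that API. The objective body is a stand-in for a real training run, not code from my notebooks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import optuna

def objective(trial):
    # Hyperparameters are proposed with plain Python calls,
    # so they compose naturally with if/else and loops
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optimizer = trial.suggest_categorical("optimizer", ["adam", "sgd"])
    if optimizer == "sgd":
        # Conditional hyperparameter: only suggested for SGD trials
        momentum = trial.suggest_float("momentum", 0.0, 0.99)
    # Stand-in for a real training run; return the validation metric here
    return (lr - 3e-4) ** 2

study = optuna.create_study(direction="minimize")
# n_jobs &gt; 1 runs trials concurrently, matching my 2-5 parallel runs
study.optimize(objective, n_trials=20, n_jobs=2)
print(study.best_params)
</code></pre></div></div>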

<p>These decisions converged in the following architecture.</p>

<center><img src="/_data/MLOpsBelgium/DeploymentDiag.drawio.webp" width="850" heigth="480" /></center>
<center><i>I'd get spanked by any half-decent security consultant for this architecture | Image based on the slides by the author</i></center>

<p>So, a lot of stuff going on here. Let me explain. On the left side, you’ll see the configuration files on the local machine, which are used to instantiate the infrastructure on the right side. Basically, it all starts with <code class="language-plaintext highlighter-rouge">terraform apply</code>, which reads and executes all the Terraform files in the project, like the snippet below.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">required_providers</span> <span class="p">{</span>
    <span class="nx">google</span> <span class="p">=</span> <span class="p">{</span>
      <span class="nx">source</span>  <span class="p">=</span> <span class="s2">"hashicorp/google"</span>
      <span class="nx">version</span> <span class="p">=</span> <span class="s2">"3.5.0"</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="nx">provider</span> <span class="s2">"google"</span> <span class="p">{</span>
  <span class="nx">credentials</span> <span class="p">=</span> <span class="nx">file</span><span class="err">(</span><span class="s2">"project-name-some-id.json"</span><span class="err">)</span>

  <span class="nx">project</span> <span class="p">=</span> <span class="s2">"project-name"</span>
  <span class="nx">region</span>  <span class="p">=</span> <span class="s2">"${var.region}"</span>
  <span class="nx">zone</span>    <span class="p">=</span> <span class="s2">"${var.region}-a"</span>
<span class="p">}</span>


<span class="nx">resource</span> <span class="s2">"google_compute_instance"</span> <span class="s2">"vm_instance_worker"</span> <span class="p">{</span>
  <span class="nx">name</span>         <span class="p">=</span> <span class="s2">"gcp-vm-instance-worker"</span>
  <span class="nx">machine_type</span> <span class="p">=</span> <span class="s2">"a2-highgpu-1g"</span>

  <span class="nx">boot_disk</span> <span class="p">{</span>
    <span class="nx">initialize_params</span> <span class="p">{</span>
      <span class="nx">image</span> <span class="p">=</span> <span class="s2">"deeplearning-platform-release/pytorch-latest-cu110"</span>
      <span class="nx">type</span>  <span class="p">=</span> <span class="s2">"pd-ssd"</span>
      <span class="nx">size</span>  <span class="p">=</span> <span class="mi">150</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="nx">metadata</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">ssh</span><span class="err">-</span><span class="nx">keys</span>              <span class="p">=</span> <span class="s2">"username:${file("</span><span class="err">~/.</span><span class="nx">ssh</span><span class="err">/</span><span class="nx">sshkey</span><span class="err">.</span><span class="nx">pub</span><span class="s2">")}"</span>
    <span class="nx">install</span><span class="err">-</span><span class="nx">nvidia</span><span class="err">-</span><span class="nx">driver</span> <span class="p">=</span> <span class="kc">true</span>
    <span class="nx">proxy</span><span class="err">-</span><span class="nx">mode</span>            <span class="p">=</span> <span class="s2">"project_editors"</span>
  <span class="p">}</span>

  <span class="nx">scheduling</span> <span class="p">{</span>
    <span class="nx">automatic_restart</span>   <span class="p">=</span> <span class="kc">false</span>
    <span class="nx">on_host_maintenance</span> <span class="p">=</span> <span class="s2">"TERMINATE"</span>
    <span class="nx">preemptible</span>         <span class="p">=</span> <span class="kc">true</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"null_resource"</span> <span class="s2">"provision_worker"</span> <span class="p">{</span>
  <span class="nx">provisioner</span> <span class="s2">"local-exec"</span> <span class="p">{</span>
    <span class="nx">command</span> <span class="p">=</span> <span class="o">&lt;&lt;</span><span class="no">EOF</span><span class="sh">
                ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook \
                -u username \
                -i "${google_compute_instance.vm_instance_worker.network_interface.0.access_config.0.nat_ip}," \
                --extra-vars "tracker_uri=${google_compute_instance.vm_instance_tracker.network_interface.0.access_config.0.nat_ip}" \
                ./config-compute.yml
</span><span class="no">            EOF
</span>  <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">.tf</code> files use the GCP provisioner, and as such, they need a service account key (<code class="language-plaintext highlighter-rouge">credentials</code> in <code class="language-plaintext highlighter-rouge">provider "google"</code>) to be able to provision resources like VMs, buckets, and networks.</p>

<!-- I don't know about you, but to me HCL (Hashicorp Configuration Language) looks a bit like JSON and Protobuf had a baby. -->

<p>Once the infrastructure provisioning part is done, the <code class="language-plaintext highlighter-rouge">local-exec</code> provisioner is triggered, which is responsible for running the Ansible playbook and configuring each provisioned VM. It installs drivers, sets env vars, and launches MLFlow or JupyterLab as background processes. See an example Ansible playbook below.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">jupyter-install</span>
  <span class="na">become</span><span class="pi">:</span> <span class="s">username</span>

  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install nvidia drivers</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s">sudo /opt/deeplearning/install-driver.sh</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">test nvidia drivers</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s">/opt/conda/bin/python -c 'import torch; print(torch.cuda.is_available())'</span>
      <span class="na">register</span><span class="pi">:</span> <span class="s">nvidia_test</span>

    <span class="pi">-</span> <span class="na">debug</span><span class="pi">:</span> <span class="s">msg="{{ nvidia_test.stdout }}"</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install mlflow</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s">/opt/conda/bin/pip install mlflow==1.20.2 google-cloud-storage==1.42.3 optuna==2.10.0 papermill==2.3.3</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">launch jupyterlab</span>
      <span class="na">environment</span><span class="pi">:</span>
        <span class="na">MLFLOW_TRACKING_URI</span><span class="pi">:</span> <span class="s1">'</span><span class="s">http://{{</span><span class="nv"> </span><span class="s">tracker_uri</span><span class="nv"> </span><span class="s">}}:5000'</span>
        <span class="na">MLFLOW_S3_ENDPOINT_URL</span><span class="pi">:</span> <span class="s">gs://some_bucket_address</span>
        <span class="na">PATH</span><span class="pi">:</span> <span class="s">/opt/conda/bin:{{ ansible_env.PATH}}</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s2">"</span><span class="s">nohup</span><span class="nv"> </span><span class="s">/opt/conda/bin/jupyter</span><span class="nv"> </span><span class="s">lab</span><span class="nv"> </span><span class="s">--NotebookApp.token=some_token</span><span class="nv"> </span><span class="s">--ip</span><span class="nv"> </span><span class="s">0.0.0.0</span><span class="nv"> </span><span class="s">--no-browser</span><span class="nv"> </span><span class="s">&amp;"</span>
</code></pre></div></div>

<p>I am provisioning two VMs, one for the experiment tracker and one for running experiments. I also need a firewall to allow TCP traffic on select ports, specifically 5000 (MLFlow), 8888 (JupyterLab), and 22 (SSH). Finally, I have a GCS bucket as the artifact repository for MLFlow.</p>

<p>Notice that my VMs receive a copy of my SSH public key. It’s necessary to allow SSH connections from my local machine because Ansible uses SSH to connect to its targets.</p>

<h3 id="core-requirements-flexibility-and-parallelism">Core requirements: Flexibility and Parallelism</h3>

<p>Research is quite messy. I try to fix the mess by extracting common code, maybe writing some utils, but sometimes I prioritize running experiments.
As mentioned, I was using Jupyter and Optuna. To make them work nicely together, I used Papermill.</p>

<p>Papermill allows for parametrized, programmatic execution of Jupyter notebooks. Let me explain with a table:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Example Usage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Parametrizes notebooks</td>
      <td>Propose hyperparameters</td>
    </tr>
    <tr>
      <td>Can inspect them</td>
      <td>Extract final scores</td>
    </tr>
    <tr>
      <td>Executes them</td>
      <td>Run notebooks from the command line</td>
    </tr>
    <tr>
      <td>Stores them</td>
      <td>Save specific notebook variants</td>
    </tr>
  </tbody>
</table>

<p>So, in my setup, a Python CLI program with Optuna and Papermill launches multiple parallel experiments, something like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python notebook_hpo.py <span class="se">\</span>
  <span class="nt">-i</span> Test.ipynb <span class="se">\</span>
  <span class="nt">-o</span> <span class="s1">'./out/Test.{run_id}.ipynb'</span> <span class="se">\</span>
  <span class="nt">-p</span> ./parameters.yml <span class="se">\</span>
  <span class="nt">-j</span> 8
</code></pre></div></div>

<p>Or, if you prefer a diagram to a code snippet, here’s one:</p>

<center><img src="/_data/MLOpsBelgium/HPODiagram.drawio.webp" width="550" heigth="480" /></center>
<center><i>I'd get spanked by any half-decent UML aficionado for this diagram | Image based on the slides by the author</i></center>
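
<p>For reference, here’s a minimal sketch of what such a launcher could look like. The flags mirror the invocation above, but the search space and the score-file convention are my own illustrative assumptions, not my exact script:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># notebook_hpo.py - a minimal sketch, not the exact script I used
import argparse
import json

import optuna
import papermill as pm

def objective(trial, args):
    # Optuna proposes the hyperparameters...
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [256, 448, 512]),
        "run_id": trial.number,
    }
    # ...and Papermill injects them into the notebook's "parameters" cell
    # and executes it top to bottom
    pm.execute_notebook(args.input, args.output.format(run_id=trial.number),
                        parameters=params)
    # Assumes the notebook dumps its final validation score as JSON
    with open(f"./out/score.{trial.number}.json") as f:
        return json.load(f)["val_loss"]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input")
    parser.add_argument("-o", "--output")
    parser.add_argument("-p", "--params")
    parser.add_argument("-j", "--jobs", type=int, default=1)
    args = parser.parse_args()

    study = optuna.create_study(direction="minimize")
    # One notebook executes per trial, args.jobs of them concurrently
    study.optimize(lambda trial: objective(trial, args),
                   n_trials=30, n_jobs=args.jobs)
</code></pre></div></div>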

<h3 id="core-requirements-reproducibility">Core requirements: Reproducibility</h3>

<p>I have suffered enough in the industry from unreplicable training runs, so I needed to eliminate this issue in my research.</p>

<p>I needed <strong>tracking</strong> and <strong>determinism</strong>.</p>

<p>I won’t dive deep into the matter of running reproducible experiments. But I’ll allow myself to repeat some stuff. You can find a more detailed overview <a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable">here</a>, in the <code class="language-plaintext highlighter-rouge">The takeaways &gt; Replicable experiments</code> part.</p>

<p>The deterministic experiments checklist (for PyTorch):</p>
<ul>
  <li>The most important thing you can do is to seed your pseudo-random number generators (Python, Numpy, PyTorch, CUDA), aka PRNGs.</li>
  <li>Be reasonable about (non-)determinism: Calling <code class="language-plaintext highlighter-rouge">torch.use_deterministic_algorithms()</code> is a Nope for me because <a href="https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms">it will throw errors</a> when calling <code class="language-plaintext highlighter-rouge">.backward()</code> for some layers. On the other hand, setting <code class="language-plaintext highlighter-rouge">torch.backends.cudnn.{benchmark,deterministic}</code> properties is fine; they won’t throw errors.</li>
  <li>Parallel data loaders need special consideration: PyTorch users, don’t forget to also seed the PRNGs in each of your <code class="language-plaintext highlighter-rouge">DataLoader</code> workers, like this:</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">seed_worker</span><span class="p">(</span><span class="n">worker_id</span><span class="p">):</span>
    <span class="n">worker_seed</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">initial_seed</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>
    <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>

<span class="n">g</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Generator</span><span class="p">()</span>
<span class="n">g</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> 
<span class="n">dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="n">num_workers</span><span class="p">,</span> 
                <span class="n">worker_init_fn</span><span class="o">=</span><span class="n">seed_worker</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">g</span><span class="p">)</span>
</code></pre></div></div>

<p>That’s kind of it with the determinism part. How should I handle my experiment tracking infra?</p>
<ul>
  <li>I need a minimal, dedicated, non-preemptible VM (<code class="language-plaintext highlighter-rouge">n1-standard-2</code> works fine) because I don’t want my tracking server preempted without first having a DB backup on my laptop, and implementing a half-decent backup script wasn’t something I wanted to do</li>
  <li>The experiment tracking server is a self-hosted MLFlow; I am quite familiar with it</li>
  <li>The tracking database is SQLite. SQLite, being basically a single file, allows me to <code class="language-plaintext highlighter-rouge">scp</code> it to my local machine when done working and load it with Terraform <code class="language-plaintext highlighter-rouge">file-provisioner</code> on startup</li>
  <li>All my artifacts are checkpointed to GCS, or rather, I’m using GCS as an artifact repository for MLFlow</li>
</ul>

<p>My tracking strategy:</p>
<ul>
  <li>Track all modifiable hyper-parameters</li>
  <li>During fine-tuning, track loss, top-1 and top-5 accuracy on both training and validation splits</li>
  <li>During pre-training, only track loss</li>
  <li>No need to track data because I use standard datasets like CIFAR100 or STL10</li>
  <li>Based on my previous experience, I find it quite annoying working with nested runs, so I don’t use those</li>
  <li>I created a new experiment on every qualitative/untracked change (a different dataset, changed pre-processing code, a different SSL pre-training method)</li>
</ul>
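
<p>In code, this strategy boils down to a few MLFlow calls per run. Here’s a minimal sketch; the experiment name and the values are made up:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import mlflow

# MLFLOW_TRACKING_URI is already set by the Ansible playbook above.
# One experiment per qualitative change; this name is made up.
mlflow.set_experiment("ssl-cifar100-simclr")

with mlflow.start_run():
    # Track all modifiable hyperparameters
    mlflow.log_params({"lr": 3e-4, "batch_size": 512, "epochs": 10})
    for epoch in range(10):
        train_loss = 1.0 / (epoch + 1)  # stand-in for the real training loop
        # During pre-training, only the loss; during fine-tuning, also
        # top-1/top-5 accuracy on the training and validation splits
        mlflow.log_metric("train_loss", train_loss, step=epoch)
    # mlflow.log_artifact("checkpoint.pt")  # artifacts land in the GCS bucket
</code></pre></div></div>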

<p>Some of it is also explained in detail in that same article referenced above (<a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable">here it is</a>, for your convenience), in the <code class="language-plaintext highlighter-rouge">The takeaways &gt; Experiment tracking</code> part.</p>

<p>Tracking all this stuff with MLFlow also allows me to compare runs with parallel coordinate plots, which is the best way to look at your hyperparameter optimization runs, IMO!</p>

<p>By the way, if you’re not familiar with MLFlow, <a href="https://mlflow.org/docs/latest/quickstart.html">here’s a link</a>.</p>

<h2 id="was-it-all-worth-it">Was it all worth it?</h2>

<p><strong>TL;DR:</strong> Yes, let me show you why.</p>

<p>First, let’s assume the following setup: ResNet50, pre-training (PT) + fine-tuning (FT), for 10 epochs, with batch sizes 512 (PT) and 4096 (FT).</p>

<p>Let’s first do some benchmarks.</p>

<table>
  <thead>
    <tr>
      <th>GPU type</th>
      <th>pre-training time</th>
      <th>fine-tuning time</th>
      <th>compared to A100</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Colab K80 12GB</td>
      <td>965s</td>
      <td>310s</td>
      <td>5.1x slower</td>
    </tr>
    <tr>
      <td>T4 16GB</td>
      <td>420s</td>
      <td>122s</td>
      <td>2.2x slower</td>
    </tr>
    <tr>
      <td>A100 40GB</td>
      <td>166s</td>
      <td>80s</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<!-- A100 w\ FP32 - 190 + 95
A100 w\ 448 batch size - 169s -->

<p>Let’s do some simple math with the same setup.</p>

<p>A model takes 7.2GB of VRAM - except on the A100, where <strong>it uses 8.4GB</strong> for the same setup. No idea why.</p>

<table>
  <thead>
    <tr>
      <th>GPU Type</th>
      <th>Nr. of parallel runs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Colab K80 12GB</td>
      <td>1</td>
    </tr>
    <tr>
      <td>T4 16GB</td>
      <td>2</td>
    </tr>
    <tr>
      <td>V100 16GB</td>
      <td>2</td>
    </tr>
    <tr>
      <td>A100 40GB</td>
      <td>4 (<strong>5 runs with <code class="language-plaintext highlighter-rouge">batch_size</code> 448</strong>)</td>
    </tr>
  </tbody>
</table>

<p>Let’s do some more math.</p>

<p>GCP billed my A2 instance for 44h, meaning I ran experiments for almost 44h. Of course, I launched those experiments manually with my script, and there was some idle time, but it was minimal. Anyway, 44 billed hours on A2. For the same volume of work with a T4 GPU, I’d get billed for…</p>

<p><code class="language-plaintext highlighter-rouge">44h x (5 runs / 2 runs) x 2.2 speedup == 240h w/ T4</code></p>

<p>… for 240 hours. That is a lot more, even if T4 GPUs are considerably cheaper!</p>

<p>Hold on - 5 parallel runs on the A100 are only possible with a batch size of 448, not 512. That’s an almost 10% smaller batch size, so training should take roughly 10% more time in this setting. Well, based on a few experiments, changing the batch size from 512 to 448 results in just a 3-5% pre-training slowdown, plus there’s the fine-tuning part, which we don’t alter, so all in all, it’s still going to be roughly 2.2x faster than the T4.</p>

<p>Anyway, <strong>for that 44h I paid 48 USD</strong>.</p>

<p>Before we move forward, let’s make one thing clear: based on the information we have so far, <strong>Colab Pro/Pro+ is not worth it</strong>, compared with my setup, at least.</p>

<p>Colab Pro+ is 43 EUR/month. It does not guarantee the accelerator type, it uses an opaque “compute units” payment scheme, and 200+ hours on a T4 would consume those units in no time.</p>

<p>Let’s do some more math. How much would I have to pay for 240h of using a T4 GPU, with a decent VM instance, like an <code class="language-plaintext highlighter-rouge">n1-standard-8</code>?</p>

<p><code class="language-plaintext highlighter-rouge">240h x 3.15 USD/h / 17.381h = 43.5 USD</code></p>

<p>Based on these calculations, I paid a <strong>~5 USD premium for a ~6x speedup</strong>. Totally worth it.</p>

<p>In fact, I would have paid more than 43 USD for 240h on a T4, because it seems the <strong>network is 1.8-2x slower</strong> on N1 instances, resulting in a long wait to download the necessary dataset after each provisioning. A few test runs of A2 and N1-standard-8 instances averaged 9m 30s and 19m, respectively, to download CIFAR100. On a side note, I could have kept copies of the datasets in a GCS bucket, but I didn’t. Maybe I thought it would cost a little too much for its worth, and I’d be annoyed by it. But what’s done is done. Given that I would need to run a T4 instance for considerably longer to do the same amount of work, I’d also have to provision my infrastructure more often, leading to more waiting for my CIFAR100 or STL10 datasets to download. That would definitely add up to more than 43 USD.</p>

<p>So A2 is both faster <strong>and</strong> cheaper in my setup. I wish my gut feeling would always work this well.</p>

<center><img src="/_data/MLOpsBelgium/gpu_spot_comparision.webp" width="350" heigth="280" /></center>
<center><i>It might not seem like it, but A100 is the better deal | Image based on the slides by the author</i></center>

<p>So, I hope you can see that using the most expensive single GPU setup on GCP turned out to be the best decision. It costs roughly the same or even less than using the seemingly most cost-efficient one while being soooooo much faster. Even if running an A2 instance was 2x more expensive than N1 with T4 GPU, I’d still take that expense to be able to do 240+h of work in 44h.</p>

<h2 id="future-directions">Future directions</h2>

<p>It may seem like I have my setup optimized to the limit. But it has room for improvement. I’d say the room is the size of a nice large kitchen with an island in the middle and a terrace for summer dining.</p>

<p>The most impactful missed opportunity is using Mixed Precision. Surprisingly, I wasn’t using it. Maybe because of my old trauma installing APEX from scratch. But now <a href="https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/">it’s pretty easy</a>, or so <a href="https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994/9">they say</a>. Thankfully, A100 GPUs have a magic trick, which seems to be enabled by default in PyTorch. This trick is called the TF32 float number representation. It’s a reduced-precision floating-point format that runs on Nvidia’s Tensor Cores and allows for a transparent and easy switch to FP32 when necessary.</p>
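
<p>For completeness, here’s roughly what the switch looks like - a minimal sketch of the <code class="language-plaintext highlighter-rouge">torch.cuda.amp</code> pattern with a toy model, assuming a CUDA device, not code from my notebooks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch import nn

model = nn.Linear(32, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 underflow

for _ in range(10):
    inputs = torch.randn(64, 32, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # ops run in half precision where safe
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
</code></pre></div></div>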

<p>A trickier thing I’d like to do is to optimize the data loading. CPUs are underutilized in my setup. Given that my datasets are all standard, I’m considering using <a href="https://ffcv.io/">FFCV</a>.</p>

<p>A few more niche things, with lower priority than the stuff described above:</p>
<ul>
  <li>Threaded checkpoint saving, because it currently runs in the same thread as training and takes a few seconds at the end of each epoch.</li>
  <li>Try MosaicML for additional gains. I’m thinking specifically of the <a href="https://docs.mosaicml.com/en/latest/method_cards/channels_last.html">ChannelsLast</a> and <a href="https://docs.mosaicml.com/en/latest/method_cards/progressive_resizing.html">ProgressiveResizing</a> methods, but also PyTorch’s <a href="https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html">OneCycleLR</a>.</li>
  <li>Automatic restart from checkpoints (GCP MIGs + startup scripts) for longer training runs.</li>
</ul>

<h3 id="not-my-case-for-now">Not my case, for now:</h3>
<ul>
  <li>Model/Tensor/Pipeline parallelism - my largest model is a ResNet101</li>
  <li>Huge datasets - I’m not even planning to use ImageNet</li>
  <li>Collaboration - I was the only one working on it and only discussed the results with my supervisor</li>
</ul>

<h2 id="a-few-takeaways">A few takeaways</h2>

<ol>
  <li><strong><em>Automate stuff</em></strong> - I’m sure you’ll be glad you did when you can spin up a complete work setup in minutes with a single click. And shut it down with the same ease. Not to mention leaving an instance running will be a thing of the past.</li>
  <li><strong><em>Track your experiments</em></strong> - If you want to reproduce your excellent results or figure out what other tricks to try, keeping a log of what you did and how it went is essential.</li>
  <li><strong><em>Invest in maximizing resource utilization</em></strong> - Having powerful hardware means nothing if it stays idle or is underutilized. Make sure you feed it enough work, so your investment breaks even faster.</li>
  <li><strong><em>The most powerful hardware can be the most cost-effective</em></strong> - That said, using the newest, most advanced, and most powerful hardware can be not only fun but also cost-effective. And finally,</li>
  <li><strong><em>Moving faster costs money</em></strong> - but it’s worth it.</li>
</ol>

<h2 id="ps">P.S.</h2>

<p>“Eventually I will buy a GPU”, from the Director of “I will stop binge-playing PS5” and “I promise I’ll go to the gym consistently”.</p>

<!-- https://www.canva.com/design/DAFRd5NNBRc/TT475viVVE0ZtjVzEkgxDg/edit -->]]></content><author><name></name></author><category term="posts" /><category term="mlops," /><category term="devops," /><category term="ml," /><category term="research," /><category term="infrastructure," /><category term="machine" /><category term="learning" /><summary type="html"><![CDATA[Find out how working on an independent research project led me to apply my MLOps skills to create a performant and cost-effective experiment infrastructure]]></summary></entry><entry><title type="html">A fable about MLOps… and broken dreams</title><link href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable" rel="alternate" type="text/html" title="A fable about MLOps… and broken dreams" /><published>2022-11-21T22:12:00+00:00</published><updated>2022-11-21T22:12:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/mlops-fable</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable"><![CDATA[<p>For a while, I was considering presenting more often at conferences and meetups. I postponed it for quite some time, but this summer, I thought, “No more!” and applied to be a speaker at the <a href="https://mdc.md/">Moldova Developer Conference</a>. And I was accepted with a talk about MLOps! I thought I’d make the talk a kind of fairytale/fable story with blackjack and easter eggs. Fast forward to a few weeks ago: in the first half of November, I presented at the conference, and because not everyone could attend it, I also decided to write a blog post on the topic.</p>

<!-- **UPDATE**: Here's the recording from that presentation -->

<h2 id="intro">Intro</h2>

<p>This article is divided into two parts, <em>The Story</em> and <em>The Takeaways</em>. Let’s start with the story.</p>

<h2 id="the-fable-about-mlops">The fable about MLOps…</h2>

<p>Note that all the characters in the story are fictional. So is the setting in which the story happens. They are not inspired by concrete people or organizations but rather distilled from my many experiences and a few industry stories. Alright, story time.</p>

<h3 id="act-1-we-need-a-poc-to-prove-ml-is-a-good-investment">Act 1: We need a PoC to prove ML is a good investment</h3>

<center><img src="/_data/webp/a_long_time_ago.webp" width="850" heigth="480" /></center>
<center><i>I'm sure you can figure out this reference | Image based on the slides by the author</i></center>

<p>In an alternate reality, or maybe just another time and place, there was a company - <strong>Lupine Corp.</strong>  Lupine Corp. is a logistics company with a very long history,
dating back to the revolution. However, no one remembers which one; it could be the French or the Bolshevik. Like any respectable company, they have a set of values and principles they abide by. One of their core tenets is to be <em>cost-efficient</em>. The other one is - <em>no unnecessary risks</em>.</p>

<center><img src="/_data/webp/the_adoption_cycle.webp" width="850" heigth="480" /></center>
<center><i>They were hyped by Hadoop, in 2020. I mean... | Image based on the slides by the author</i></center>

<p>Lupine Corp. are also known for doing their due diligence. So they knew that before launching their ML initiative, they needed to have their prerequisites in place.</p>

<ol>
  <li>
    <p>They made sure to know their success metrics, meaning they established some KPIs and a way to report and track those.</p>
  </li>
  <li>
    <p>They also had their data easily accessible and discoverable, not just existing somewhere in their databases. They knew this would be very important for the data scientists they would hire.</p>
  </li>
  <li>
    <p>Finally, the leadership knew that Data Science and ML are much more unpredictable than traditional software engineering, and they adjusted their expectations accordingly.</p>
  </li>
</ol>

<blockquote>
  <p>Side note: With only these 3 points, Lupine Corp. were so much better prepared for ML than the majority of the companies out there.</p>
</blockquote>

<p>Lupine Corp. imposed some budget limitations because of the unpredictable nature of ML projects, so they only hired two people:</p>

<ul>
  <li><strong>Nifel Nifenson</strong> (image below, left), who previously worked for two years as a lone Data Scientist in a small company</li>
  <li><strong>Nafaela Nafarri, PhD</strong> (image below, right), a Senior Data Scientist with six years of experience</li>
</ul>

<p>Nifel Nifenson is a very results-oriented guy. One could say he’s the (rough) embodiment of the Lean Startup philosophy. Nafaela Nafarri has a strong analytical mind. When Lupine Corp. asked them to deliver some results ASAP, they did just that and then some. The results were very promising and delivered in record time.
Senior management was ecstatic, and more use cases were in discussion.</p>

<center><img src="/_data/webp/great_success.webp" width="850" heigth="480" /></center>
<center><i>Dream team. Left - Alexander the Great in the Battle of Issus Mosaic. Right - Pallas Athena by Rembrandt | Image based on the slides by the author</i></center>

<h3 id="act-2-expanding-the-team-signs-of-trouble">Act 2: Expanding the team. Signs of trouble.</h3>

<p>As with all things in business and life, at a larger scale, the cracks became more apparent.</p>

<p>Nifel, Nafaela, and the new team members got along very well. It was a very nice team to work with. Everyone was professional and friendly. Yet somehow, the team’s velocity (as per Scrum, or “throughput” as per Kanban) wasn’t scaling as expected. It even started to go down after a few months. More people and more time were required to complete the same work Nifel and Nafaela had done a few months before. But why was this happening?</p>

<p>There are many reasons why.
For example, many promising experiments couldn’t be replicated, even with all the notes the team took.
Also, they observed increasing complaints from some of the users of their deployed models. The first few weeks after the models were put in production, everyone was happy, but over time more and more bad feedback came in.</p>

<p>And if all that wasn’t enough, some of those productionized use cases started to receive a lot of traffic, sometimes up to two thousand concurrent users. They decided to horizontally scale their existing Docker containers to serve them all. It wasn’t resource-efficient. It was hard to manage. And the latency SLAs were thrown out of the window with worrying regularity…</p>

<h3 id="act-3-bringing-the-big-guns">Act 3: Bringing the big guns</h3>

<p>Lupine Corp. was upset with the prospect of their ML initiative imploding, so they hired <strong>Nuf Nufelman</strong> as the new Head of Data Science.</p>

<p>Previously he worked as a lead data scientist at a big non-FAANG company, similar in structure to Lupine Corp. but quite different culturally. His previous employer was basically a “throw money at the problem” type of company, and Nuf was shaped by this mentality too. Nuf was also a great DevOps believer.</p>

<center><img src="/_data/webp/nuf_intro.webp" width="850" heigth="480" /></center>
<center><i> Nuf was born and raised in Odessa, but lost his way, a bit | Zeus' statue at Versailles | Image based on the slides by the author</i></center>

<p>He understood that the problem Nifel’s and Nafaela’s team faced was a replicability problem.</p>

<p>… and a retraining problem.</p>

<p>….. and a scalability problem.</p>

<p>They needed a well-structured process to research, develop, evaluate and productionize their work consistently.</p>

<p>In a meeting with the higher-ups, Nuf told them that if Lupine Corp. was serious about their ML intentions, they had to adopt MLOps, <em>wholly and without question</em>. They accepted.</p>

<p>To streamline adoption, Nuf suggested they don’t develop all the tools in-house but instead pay for an ML-platform-as-a-service (MLPaaS) by All-You-Need-And-A-Kitchen-Sink ($AYN). All-You-Need-And-A-Kitchen-Sink is a recently IPO-ed startup that <em>“solves all the MLOps pains”</em>.</p>

<p>Surprisingly, it worked.</p>

<p>Most of the past problems went away.</p>

<p>But a lot of the internal processes still needed adjustments. Because it was quite a generic tool, a lot of glue code had to be written. Also, people didn’t like using it. The learning curve was steep. And some of the API design choices and documentation could have been more pleasant to work with.</p>

<p>And did I mention the Enterprise tier was a-seed-investment-grant-per-month expensive? If you ever complain about AWS bills, this one was probably even worse, but I digress.</p>

<h3 id="act-4-burning-cash-and-its-consequences">Act 4: Burning cash and its consequences</h3>

<p>The ML and Data Science initiative continued to grow at Lupine Corp. They hired more people and sometimes heard more complaints about their ML platform. It was slightly annoying but not that important for the upper management. They had different pains.</p>

<p>How could they ever be content when this new MLPaaS gizmo was burning cash like crazy? And recall their main tenets. Increasing their operational efficiency was a recurring topic during their meetings.</p>

<p>But as with anything in old, large corporations, it was a lot of talking and not so much doing.</p>

<p>And then, the earnings call day came…</p>

<center><img src="/_data/webp/earnings_call.webp" width="850" heigth="480" /></center>
<center><i>That day rang both the telephones and hell's bells | Christ in Limbo by a Follower of Jheronimus Bosch</i></center>

<p>Financials showed Lupine was burning a lot more cash than its competitors. They were no startup or scaleup. This showed financial recklessness. Shareholders didn’t like it. Neither did the stock market. Their stock plummeted 20% in a week. Something between Meta and Netflix.</p>

<p>To alleviate the issue, Lupine Corp. decided to optimize its operations. Now for real.</p>

<p>They laid off many employees working on non-critical aspects of the business. Where possible, they terminated said initiatives too.</p>

<p>It was clear one of the main reasons they were burning money was their ML platform. Obviously, the ML initiative was impacted. Nafaela and Nuf stayed, but Nifel was laid off. Layoff decisions were based on tenure and seniority.</p>

<center><img src="/_data/webp/goodbye_nifel.webp" width="850" heigth="480" /></center>
<center><i>Poor Nifel | Image based on the slides by the author</i></center>

<p>Cutting costs worked. But it wasn’t a good long-term strategy, and Lupine Corp. knew this all too well. They needed to optimize their OpEx. So now, Lupine Corp. was looking for someone who could help. And they found someone. Someone,</p>

<center><strong>Legen-      </strong></center>
<center><strong>waaaait for it</strong></center>
<!-- <center><strong></strong></center> -->
<center><strong>-dary</strong></center>

<p>Meet <strong>Nahum Nahreba</strong>.</p>

<p>He’s a platform engineer. He is known for thinking from first principles and building nimble, scalable solutions. He’s something of a Jeff Dean, although he might not be able <a href="http://www.neohope.com/2014/04/24/jeff-dean-facts/">to shift bits from one computer to the other</a>. He helped scale a few startups. It wasn’t the first time he had to work on ML platforms.</p>

<center><img src="/_data/webp/nahum_intro.webp" width="850" heigth="480" /></center>
<center><i>Truly a legend | Image based on the slides by the author</i></center>

<p><strong>TL;DR:</strong> He came. He saw. He solved the mess.</p>

<p>He persuaded Lupine Corp. to greenlight a major refactoring of the ML platform, pruning it of many unnecessary features, reducing the bill, and implementing a few features and tools internally, with a specific focus on developer experience and integration with the rest of the company’s infrastructure. It’s a fable, not a technical report, so I won’t dive deep into how he did it.</p>

<p>And so they lived relatively happily until Lupine Corp. management discovered IoT…</p>

<p>The end.</p>

<h2 id="the-takeaways">The takeaways</h2>

<p>So, how could Lupine Corp. avoid this mess? And how can other companies like them avoid it too?</p>

<p>First things first, we need to give credit where credit’s due. This fictional company did a lot of stuff others don’t, so their success chances were already pretty high. They knew what success looked like for them, they had their data available and discoverable, and they had a correct mindset about this initiative. In my practice, most companies don’t have that.</p>

<p>I would argue one of the reasons Lupine Corp. had such <del>fun</del> hard times was a well-known quote:</p>
<blockquote>
  <p><em>“Premature optimization is the root of all evil”</em> - Donald Knuth</p>
</blockquote>

<p>… as cited by Nifel Nifenson, and most SWEs. Nifel, in this story, had somewhat more software engineering experience, and it was his responsibility to use an SWE mindset when starting their ML journey. He knew by heart the quote above, the KISS principle, and many others. But he also, like most of us, didn’t quite understand the nuances behind said quotes. Nifel treated MLOps as overengineering. Under management’s tight deadline and pressure to show good results and prove himself a specialist, he created good ML models but not-so-good ML systems.</p>

<p>By the way, the “fuller” quote goes like this:</p>

<blockquote>
  <p><em>“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”</em></p>
</blockquote>

<p>If only Nifel knew it like this… <strong>So, takeaway #1: Start early with MLOps</strong>.</p>

<p>Nifel’s (counter-)example shows we must consider adopting MLOps practices early on. But it’s not so simple either.</p>

<p>Software and data people are an enthusiastic bunch. We want to use many tools to solve many problems. We’re very prone to over-engineering. If we were rockstars, I think this tendency towards abuse would have manifested a bit differently. Thankfully we aren’t rockstars.</p>

<blockquote>
  <p>By the way, it’s not my first piece on picking tools, so you’d like to check out the <a href="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool">other article about it</a>.</p>
</blockquote>

<p>When starting with MLOps, we can be overwhelmed by multiple tools, terms, concepts, and practices. We’ll hear from every corner how crucial it is to have pipeline orchestration, 17 types of ML and data tests, three types of observability, feature stores, model stores, metadata stores, stores to store stores… alright, I’m exaggerating now, but you got the idea.</p>

<p>You don’t need all this tooling, not from the start, even if it comes all bundled together, like AWS, GCP, or Azure offerings.</p>

<p>Using a fully-featured MLOps solution from the beginning usually doesn’t work.</p>

<p>Either because it’s too generic, or because there are too many upfront costs. Also, it takes a lot of work to onboard your users.</p>

<p>Going head-first into MLOps is a bad idea for most of the same reasons.</p>

<p>What you do need in the beginning is to…</p>
<ul>
  <li>quickly find and access your data</li>
  <li>seed that model training code</li>
  <li>record your experiment configuration</li>
</ul>

<p>Then make sure to</p>
<ul>
  <li>easily deploy your models</li>
  <li>have some tests</li>
</ul>

<p>The rest will come after. <strong>All that said, takeaway #2: Start small with MLOps</strong>.</p>

<p>Now onto more technical advice.</p>

<h3 id="simple-data-collection-and-discovery">Simple data collection and discovery</h3>

<p>Lupine Corp. had this, but I’m sure you don’t. So, what should you do? First, you need to understand <em>The Why?</em> We’re past the Big Data hype by almost ten years. Organizations now have lots of data… but it takes a lot of work to use it properly. It wouldn’t be an exaggeration to say that for the absolute majority of the projects I worked on, accessing datasets was my second most annoying problem. The first one was the lack of a baseline and success metrics. As I said, Lupine Corp. was in fact really good. Your company probably isn’t.</p>

<p>Alright, we know what “data collection” is. ETL pipelines and all that. Or a few scripts running as CRON jobs, dumping files into an S3 bucket. But what about data discovery?</p>

<p>A short googling session will reveal terms and technologies like data governance, data lineage, Amundsen from Lyft, Apache Atlas, Google Data Catalog… yeah, no. Not yet.</p>

<p>Have a shared spreadsheet. In it, each row is about a dataset. Name, short description, update frequency, contact person, and location in the object store. That’s it, at least in the beginning.</p>

<p>Do this, and your data scientists and ML engineers will be happy as hell. You’ll get recruits just by word of mouth.</p>

<p>Here’s a wacky architectural diagram for what you need for <strong>simple</strong> data collection and discovery.</p>

<center><img src="/_data/webp/simple_data_col_and_disco.webp" width="850" heigth="480" /></center>
<center><i>A few backup and automation scripts running on a schedule, S3 or something similar, a spreadsheet. If you can't do this, please don't hire ML engineers, you'll just waste money. | Image based on the slides by the author</i></center>

<p><strong>Pro tip:</strong> when you dump your raw data into those buckets, don’t overwrite your old data. You’ll see why later.</p>
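
<p>To illustrate, here’s a minimal sketch of such a dump job. The bucket name, key layout, and the <code class="language-plaintext highlighter-rouge">boto3</code> client are my assumptions - any object store and client will do:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import datetime

import boto3  # assumption: S3; any object-store client works the same way

s3 = boto3.client("s3")

def dump_table(table_name, payload):
    # Date-stamped keys mean old dumps are never overwritten, so you can
    # always get the data exactly as it looked on a given day
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d")
    key = f"raw/{table_name}/{stamp}/dump.parquet"  # hypothetical layout
    s3.put_object(Bucket="my-raw-data-bucket", Key=key, Body=payload)
</code></pre></div></div>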

<h3 id="replicable-experiments">Replicable experiments</h3>

<p>This one requires a few steps, but they’re relatively straightforward. First, you need to seed your pseudo-random number generators, aka PRNGs.</p>

<p>Not everyone knows this, or maybe not explicitly, but ML code is full of randomness. We need to initialize the parameters of our ML models - we use some random distributions. We also need to shuffle our data - also randomness. This is trivial for a machine learning practitioner. What is less trivial is how this randomness is “created”. You see, randomness in computers is not entirely random.</p>

<p><strong>(Optional Paragraph)</strong> We use <a href="https://www.cryptosys.net/rng_algorithms.html">special algorithms</a>, based on stuff like chaos theory, which, given an initial state, or a seed, and a set of usually recurrent rules, will generate a sequence of values. The rules are fixed, so the algorithm is deterministic, but the values are chaotic, meaning there’s no discernible pattern. Now, the seed value - the initial state used in these PRNGs - is usually a genuinely random number: it can be the exact current temperature of the CPU, the clock drift between multiple CPU cores, or some other value that is naturally random. But you can manually provide the initial state, and thus, when running the same sequence of operations multiple times, get the same sequence of values.</p>

<p>Back to our business. We can seed, or manually provide the initial states for our PRNGs so that running the same code will give us the same results - same models, same performance.</p>

<p>This is super important because if we can get the same results, we can properly validate and compare ML models and pick the best ones.</p>

<p>Python ML code has multiple sources of randomness, which can, and should, be seeded. This is because most numerical libraries in Python are written in C/C++/Fortran, and Python is a convenient wrapper to access these routines.</p>

<p>But there are a few more things between you and numerically replicable experiments besides PRNGs.</p>

<p>cuDNN is also standing in the way. cuDNN is Nvidia’s low-level set of primitives for deep learning. It has multiple GPU-optimized implementations for convolutions, pooling, linear layers, various activation functions, and so on. Now, cuDNN has a clever way of achieving maximum performance on different hardware for various scenarios. It tests multiple implementations of the same algorithm <em>at the start of the program</em> and picks the fastest one. <a href="https://discuss.pytorch.org/t/what-is-the-differenc-between-cudnn-deterministic-and-cudnn-benchmark/38054/2">This selection <em>can</em> be non-deterministic (read: random)</a>. Why? I am not sure, but as far as I understood, its heuristics might behave differently if there’s anything else running on the GPU. To disable this behavior, one has to set <code class="language-plaintext highlighter-rouge">torch.backends.cudnn.benchmark = False</code>. To my knowledge, there are also a <a href="https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#reproducibility">few other sources of randomness in cuDNN</a>, and you can disable (some of) these by setting <code class="language-plaintext highlighter-rouge">torch.backends.cudnn.deterministic = True</code>. And if you’re interested in finding out more on how to run replicable PyTorch experiments, <a href="https://pytorch.org/docs/stable/notes/randomness.html">check out this page from the docs</a>. And if you’re not, search if there are similar behaviors in your favorite framework.</p>

<!-- [eta-greedy/random search policy](https://rl-book.com/learn/bandits/e_greedy/) at its base.  -->

<p>Finally, most of the time, ML algorithms will try to take advantage of modern multi-core CPUs, and when designing replicable experiments, one has to think about it too.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span><span class="p">,</span> <span class="n">os</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch</span>

<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">backends</span><span class="p">.</span><span class="n">cudnn</span><span class="p">.</span><span class="n">deterministic</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">torch</span><span class="p">.</span><span class="n">backends</span><span class="p">.</span><span class="n">cudnn</span><span class="p">.</span><span class="n">benchmark</span> <span class="o">=</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">seed_worker</span><span class="p">(</span><span class="n">worker_id</span><span class="p">):</span>
    <span class="n">worker_seed</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">initial_seed</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>
    <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>

<span class="n">g</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Generator</span><span class="p">()</span>
<span class="n">g</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> 
<span class="n">dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="n">num_workers</span><span class="p">,</span> 
                <span class="n">worker_init_fn</span><span class="o">=</span><span class="n">seed_worker</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">g</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Pro tip:</strong> when testing a machine learning model configuration, run it multiple times using different seed values. It will reduce the chance that you’re just lucky.</p>
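<p>A minimal sketch of what that could look like; <code class="language-plaintext highlighter-rouge">set_all_seeds</code> (the seeding routine above, wrapped in a function) and <code class="language-plaintext highlighter-rouge">train_and_evaluate</code> are hypothetical stand-ins for your own pipeline:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def run_with_seeds(config, seeds=(0, 1, 2, 13, 42)):
    # set_all_seeds and train_and_evaluate are hypothetical helpers:
    # the former seeds every PRNG as shown above, the latter trains a
    # model with the given config and returns a validation metric
    scores = []
    for seed in seeds:
        set_all_seeds(seed)
        scores.append(train_and_evaluate(config))
    return np.mean(scores), np.std(scores)

# report "metric = mean +/- std" instead of a single, possibly lucky, number
</code></pre></div></div>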

<p>But to replicate experiments, one needs to know all their parameters, which brings us to the next part…</p>

<h3 id="experiment-tracking">Experiment tracking</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mlflow</span>
<span class="kn">from</span> <span class="nn">mlflow.models.signature</span> <span class="kn">import</span> <span class="n">infer_signature</span>

<span class="k">with</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">start_run</span><span class="p">():</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"batch_size"</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
    <span class="c1"># Metrics can be updated throughout the run
</span>    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"accuracy"</span><span class="p">,</span> <span class="mf">0.973</span><span class="p">)</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"accuracy"</span><span class="p">,</span> <span class="mf">0.981</span><span class="p">)</span>

    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"outputs/test.txt"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"hello world!"</span><span class="p">)</span>

    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_artifacts</span><span class="p">(</span><span class="s">"outputs"</span><span class="p">)</span>

    <span class="n">model_signature</span> <span class="o">=</span> <span class="n">infer_signature</span><span class="p">(</span><span class="n">example_inputs</span><span class="p">,</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">example_inputs</span><span class="p">))</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">sklearn</span><span class="p">.</span><span class="n">log_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">artifact_path</span><span class="o">=</span><span class="s">"./sklearn-model"</span><span class="p">,</span> 
                             <span class="n">registered_model_name</span><span class="o">=</span><span class="s">"sklearn-rf-reg-model"</span><span class="p">,</span>
                             <span class="n">signature</span><span class="o">=</span><span class="n">model_signature</span><span class="p">)</span>

</code></pre></div></div>

<p>Just try to track as much as possible. I do. And it helped me a great deal. If you are ok with managing your own infra, use <a href="https://mlflow.org">MLflow</a>. If you would rather pay for a good managed solution, <a href="https://neptune.ai">Neptune.ai</a> and <a href="https://wandb.ai">Weights and Biases</a> are very nice.</p>

<p><strong>Pro tip 1:</strong> For maximum benefit, group similar algorithms together. It will make it easier to compare those with stuff like <a href="https://ai.facebook.com/blog/hiplot-high-dimensional-interactive-plots-made-easy/">parallel coordinate plots</a>.</p>

<p><strong>Pro tip 2:</strong> Also, try to track and version all your data, either with DVC or something else. That’s also why you shouldn’t overwrite the raw data in your buckets: if you do overwrite it, you won’t be able to replicate the results of your experiments.</p>

<p>So you have a trained ML model. You can also fully replicate it. Now what?</p>

<h3 id="ml-serving">ML Serving</h3>

<p>You need to deploy and serve it somehow! How? Use Docker and an app server! If you care about SLAs, consider Ray Serve, BentoML, or Seldon. These are specialized solutions that provide impactful features like adaptive batching, model pooling, and so on. If your SLAs are strict, try Triton Inference Server from NVidia. If you want to dive deeper into the details, <a href="https://alexandruburlacu.github.io/posts/2022-09-25-neptuneai-ml-serving">read my blog post on the topic</a>.</p>

<h3 id="ml-tests">ML Tests</h3>

<p>What about tests? ML code is still code. So it needs tests. ML testing is a big and hairy problem. I promise I will eventually write an article about it, but for now, think about it like this:</p>

<p>You need to have two types of tests,</p>
<ul>
  <li>Behavioral tests, which will measure predictions. These can become your regression suite, where you add various edge cases on which you don’t want to fail ever again.</li>
  <li>Unit/Integration tests, which will measure training, serving, and preprocessing code correctness. Stuff like “The model should reduce its loss after one iteration” or “The shape of the output should be [x,y,z] given that the input shape was [x,m,n]” and so on. These will spot bugs in your implementation. See the sketch after this list.</li>
</ul>
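<p>Here is a minimal pytest-style sketch of both kinds; <code class="language-plaintext highlighter-rouge">build_model</code>, <code class="language-plaintext highlighter-rouge">model_under_test</code>, and <code class="language-plaintext highlighter-rouge">preprocess</code> are hypothetical stand-ins for your own project code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def test_loss_decreases_after_one_step():
    # unit test: one optimization step should reduce the loss
    model = build_model()  # hypothetical factory, e.g. a small classifier
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))  # assumes 16 features, 2 classes

    loss_before = loss_fn(model(x), y)
    loss_before.backward()
    optimizer.step()

    loss_after = loss_fn(model(x), y)
    assert loss_after.item() &lt; loss_before.item()

def test_known_edge_case():
    # behavioral/regression test: an input we never want to get wrong again
    prediction = model_under_test.predict(preprocess("fr33 p!lls, buy now"))
    assert prediction == "spam"
</code></pre></div></div>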

<p>Depending on your application domain, here are a few links to help you with ML testing.</p>
<ul>
  <li><a href="https://docs.deepchecks.com/stable/getting-started/welcome.html">Deepchecks library</a></li>
  <li><a href="https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models">ML Model Testing: 4 Teams Share How They Test Their Models | Neptune.ai Blog</a></li>
  <li><a href="https://applyingml.com/resources/testing-ml/">Machine Learning in Production - Testing | ApplyingML</a></li>
  <li><a href="https://madewithml.com/courses/mlops/testing/#models">Made With ML Testing Machine Learning Systems: Code, Data and Models</a></li>
  <li><a href="https://www.jeremyjordan.me/testing-ml/">Effective testing for machine learning systems | By Jeremy Jordan</a></li>
</ul>

<h3 id="cicd">CI/CD</h3>

<p>If you have done everything until this point, having CI/CD should be easy. Bonus points for triggering the retraining steps conditionally, only when the training/model code changes. The conditional build behavior can be implemented with either something like <code class="language-plaintext highlighter-rouge">dvc repro</code> + some caching between runs or clever <code class="language-plaintext highlighter-rouge">git diff</code> manipulations.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># not the most production ready hack, but maybe it will help you</span>
<span class="nn">...</span>
<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">check</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-20.04</span>
    <span class="na">outputs</span><span class="pi">:</span>
      <span class="na">DIFFS</span><span class="pi">:</span> <span class="s">${{ steps.diffs.outputs.DIFFS }}</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">fetch-depth</span><span class="pi">:</span> <span class="m">0</span> <span class="c1"># actually will need some adjustments</span>
          <span class="c1"># fetch only as many as necessary: https://github.com/actions/checkout/issues/438</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Last good run commit</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">curl -s \</span>
          <span class="s">-H "Accept: application/vnd.github+json" \</span>
          <span class="s">-H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \</span>
          <span class="s">https://api.github.com/repos/{{ USER }}/{{ REPO_NAME }}/actions/workflows/training-trigger.yml/runs?status=success | jq \</span>
          <span class="s">-r ".workflow_runs[0].head_commit.id" &gt; last_good_commit.txt</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Show and set DIFFS</span>
        <span class="na">id</span><span class="pi">:</span> <span class="s">diffs</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">DIFFS=$(git diff HEAD $(cat last_good_commit.txt) --name-only | tr '\n' ' ')</span>
          <span class="s">echo "::set-output name=DIFFS::$DIFFS"</span>
          <span class="s">echo $DIFFS</span>

  <span class="na">train</span><span class="pi">:</span>
    <span class="na">needs</span><span class="pi">:</span> <span class="s">check</span>
    <span class="na">if</span><span class="pi">:</span> <span class="s">contains(needs.check.outputs.DIFFS, 'train.py')</span>
    <span class="na">uses</span><span class="pi">:</span> <span class="pi">{{</span> <span class="nv">USER</span> <span class="pi">}}</span><span class="s">/{{ REPO_NAME }}/.github/workflows/training.yml@master</span>
</code></pre></div></div>

<p>One important thing to note is somewhat related to CI. ML projects tend to have naturally tight coupling between EDA, data processing, training, and serving code. As a result, I highly recommend designing ML projects as monorepos and adopting monorepo-related practices and patterns for building, versioning, and code compatibility.</p>

<h3 id="epilogue">Epilogue</h3>

<p>All the advice above is focused on simplicity. You must understand that the solutions I suggest have a very clear scope. These are solutions you should only consider at <strong>the beginning</strong> of your MLOps journey.</p>

<p>Let me make it simpler with a table.</p>

<table>
  <thead>
    <tr>
      <th>Q</th>
      <th>A</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Is it going to scale?</td>
      <td>Nope</td>
    </tr>
    <tr>
      <td>Is it production-ready?</td>
      <td>It’s PoC-ready</td>
    </tr>
    <tr>
      <td>How quickly can I set it up?</td>
      <td>A few days at most</td>
    </tr>
    <tr>
      <td>Is it better than doing nothing?</td>
      <td>Yes!!!</td>
    </tr>
    <tr>
      <td>Is it cost-effective?</td>
      <td>Hell yes</td>
    </tr>
    <tr>
      <td>Is it more cost-effective than using a paid or even an existing OSS solution?</td>
      <td>IMO much more so</td>
    </tr>
  </tbody>
</table>

<p>These recipes are <strong>Maximum ROI - Minimum Effort</strong> solutions to get you started. Eventually, you will discover that they don’t quite suit you. Only then should you switch to something else. You’ll make a better-informed decision then.</p>

<h2 id="ps">P.S.</h2>

<p>I was serious about presenting more often at conferences and meetups. And that’s how I ended up presenting at the Belgium MLOps meetup on 5th December 2022. If you’d like to learn about my MLOps adventures in setting up my research environment, you can find the event details via <a href="https://www.meetup.com/mlops-belgium/events/289639571/">this link</a>.</p>

<h2 id="pps">P.P.S.</h2>

<p>The story is based on the “Three Little Pigs” one, in its Romanian/Russian variant, where the piglets are named Nif-Nif, Naf-Naf, and Nuf-Nuf. Now, the local, Russian-speaking population has a joke about the 4th piglet, whose name I’ll let you guess. Special kudos to those who also get the meaning/connotation of the fourth piglet’s name.</p>

<!-- https://www.unusual.vc/post/how-to-build-ml-products
 -->]]></content><author><name></name></author><category term="posts" /><category term="mlops," /><category term="devops," /><category term="ml" /><category term="deployment," /><category term="machine" /><category term="learning," /><category term="ml" /><category term="serving" /><summary type="html"><![CDATA[A fable about a company's journey through scaling their ML function, and some practical advice on how you should do it]]></summary></entry><entry><title type="html">How to Solve the Model Serving Component of the MLOps Stack</title><link href="https://alexandruburlacu.github.io/posts/2022-09-25-neptuneai-ml-serving" rel="alternate" type="text/html" title="How to Solve the Model Serving Component of the MLOps Stack" /><published>2022-09-24T22:00:00+00:00</published><updated>2022-09-24T22:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/neptuneai-ml-serving</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-09-25-neptuneai-ml-serving"><![CDATA[<blockquote>
  <p>This blog post was written by me and originally posted on <a href="https://neptune.ai/blog/model-serving-component-mlops-stack">Neptune.ai Blog</a>. Be sure to check them out. I like their blog posts about MLOps a lot.</p>
</blockquote>

<p>Model serving and deployment is one of the pillars of the MLOps stack. In this article, I’ll dive into it and talk about what basic, intermediate, and advanced setups for model serving look like.</p>

<p>Let’s start by covering some basics.</p>

<h2 id="what-is-model-serving">What is Model Serving?</h2>
<p>Training a machine learning model may seem like a great accomplishment, but in practice, it’s not even halfway to delivering business value. For a machine learning initiative to succeed, we need to deploy that model and ensure it meets our performance and reliability requirements. You may say, “But I can just pack it into a Docker image and be done with it”. In some scenarios, that could indeed be enough. But most of the time, it won’t be. When people talk about productionizing ML models, they use the term <strong>serving</strong> rather than simply deployment. So what does this mean?</p>

<p>To serve a model is to expose it to the real world and ensure it meets all your production requirements, aka your latency, accuracy, fault-tolerance, and throughput are all at the “business is happy” level. Just packaging a model into a Docker image is not “the solution” because you’re still left with how to run the model, scale the model, deploy new model updates, and so on. Don’t get me wrong, there’s a time and place for Flask-server-in-Docker-image style of serving; it’s just a limited tool for a limited number of use-cases, which I’ll outline later.</p>

<p>Now that we know what serving implies, let’s dive in.</p>

<h2 id="model-deployment-scenarios">Model Deployment scenarios</h2>

<p>When deciding how to serve our ML models, we must ask ourselves a few questions. Answering these should help us shape our model serving architecture.</p>

<h3 id="is-our-model-user-facing">Is our model user-facing?</h3>

<p>In other words, does the user trigger it through some action and need to see an effect dependent on our model outputs in real-time? If this sounds too abstract, how about an example? Are we creating an email autocomplete solution like the one in Gmail? Our user writes some text and expects a relevant completion. This kind of scenario needs an “interactive” deployment. This is probably the most common way to serve ML models. But it’s not the only way.</p>

<p>Suppose we don’t need the model’s predictions right away. We’re fine waiting even an hour or more to get what we need. How frequently do we need to get these predictions? Do we need something like a weekly Excel report or tagging some inventory item descriptions once per day? If this sounds about right, we can run a “batch” process as a way to serve our model. This setup would probably be the easiest to maintain and scale. But there’s another, 3rd way.</p>

<h3 id="does-the-latency-matter">Does the latency matter?</h3>

<p>You don’t need to “respond” to the user but still must act based on the user’s action. Something like a fraud detection model that gets triggered on a user’s transaction. This scenario asks for a “streaming” setup. A scenario like this is usually deemed the most complex to handle. Although it might sound like the interactive setup would be harder to build, streaming is generally harder to reason about and thus harder to implement properly.</p>

<p>Let’s dive into the details of each of these setups, the best time to use them, and the trade-offs.</p>

<h2 id="model-deployment-setups">Model Deployment setups</h2>

<p>We should consider a few general “setups” based on our business needs when it comes to exposing ML models to the outside world for consumption.</p>

<h3 id="batch-model-serving">Batch model serving</h3>

<p>This one is the easiest to implement and operate of all the possible setups. Batch processes are not interactive, i.e., they do not wait for some interaction with another user or process. They just run, start to finish. Because of this, there are mostly no latency requirements; all the setup needs is to scale to large dataset sizes.</p>

<p>Because of this latency insensitivity, you can use complex models – Kaggle-like ensembles, huge gradient boosted trees or neural networks, anything goes, because it is expected that these operations won’t be done in milliseconds anyway. To handle even multi-hundred GB datasets, all you need is something like CRON, a workstation/a relatively capable cloud VM, and to know how to develop out-of-core data processing scripts. Don’t believe me? Here’s <a href="https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94">an example</a> to prove my point.</p>
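<p>For illustration, here’s a minimal out-of-core batch scoring sketch with pandas; the file names, chunk size, <code class="language-plaintext highlighter-rouge">FEATURE_COLUMNS</code>, and <code class="language-plaintext highlighter-rouge">model</code> are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

FEATURE_COLUMNS = ["f1", "f2", "f3"]  # hypothetical feature names

def score_in_chunks(input_csv, output_csv, model, chunksize=100_000):
    # process the dataset chunk by chunk, so it never has to fit in memory
    first = True
    for chunk in pd.read_csv(input_csv, chunksize=chunksize):
        chunk["prediction"] = model.predict(chunk[FEATURE_COLUMNS])
        chunk.to_csv(output_csv, mode="w" if first else "a",
                     header=first, index=False)
        first = False
</code></pre></div></div>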

<p>It becomes a bit more challenging if you need to handle TBs of data. You will need to deal with multi-node Apache Spark, Apache Airflow, or something like it. You’ll have to think about potential node failure and how to maximize the resource utilization of said nodes.</p>

<p>Finally, if you’re operating at Google-size datasets, <a href="https://sre.google/sre-book/data-processing-pipelines/">check this link</a>. Operating at such a scale brings issues like “noisy neighbors”, straggling tasks/jobs, “thundering herds”, and timezones. Yeah, and congratulations on your gargantuan scale.</p>

<h3 id="streaming-model-serving">Streaming model serving</h3>

<p>As we already mentioned, batch processes are not the only ones that don’t need to wait on user interaction, i.e., they are not interactive. We can also have our models act on streams of data. These scenarios are much more latency-sensitive than batch processes.</p>

<p>Standard tools for streaming model serving are Apache Kafka, Apache Flink, and Akka. But if you need to operate your model as a streaming/event-driven infrastructure component, these are not your only options. You can create a component that will be a consumer of events on one side and a producer on the other. Whatever you do, be mindful of backpressure. Streaming setups care a lot about being able to process large volumes of continuously flowing data, so be sure not to make your deployed ML models the bottleneck of this setup.</p>
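<p>A minimal sketch of that consume-predict-produce pattern, assuming the kafka-python client; the topic names and broker address are made up, and <code class="language-plaintext highlighter-rouge">model</code> is assumed to be loaded elsewhere:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("transactions", bootstrap_servers="localhost:9092",
                         value_deserializer=json.loads)
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for message in consumer:
    # `model` is assumed to be in memory; score one event at a time
    score = float(model.predict([message.value["features"]])[0])
    producer.send("fraud-scores", {"id": message.value["id"], "score": score})
</code></pre></div></div>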

<p>Another thing to consider when developing streaming ML serving solutions is model serialization. Most streaming event processing systems are JVM-based, either Java or Scala native. As a result, you will likely discover that your model structure is limited by the capabilities of your serializer. For a story about how model serialization can become an issue, <a href="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl">check out this article’s sub-section</a> – the resulting models can be tedious to deploy.</p>

<p>Here are some useful links on the topic –</p>
<ul>
  <li><a href="https://towardsdatascience.com/deploying-ml-models-in-distributed-real-time-data-streaming-applications-217954a0b423">Deploying ML Models in Distributed Real-time Data Streaming Applications | TDS</a></li>
  <li><a href="https://www.lightbend.com/blog/akka-speculative-model-serving">Using Akka for leveraging speculative execution in model serving</a></li>
  <li><a href="https://aws.amazon.com/blogs/machine-learning/automated-model-refresh-with-streaming-data/">Automated model refresh with streaming data</a></li>
</ul>

<h3 id="interactive-model-serving-via-restgrpc">Interactive model serving (via REST/gRPC)</h3>

<p>The most popular way to serve ML models – using a server! In fact, when discussing ML serving, a lot of people refer to this specific setup rather than the broader category. An interactive setup means the user somehow triggers a model and is waiting for the output or something caused by the output. Basically, it’s a request-response interaction pattern.</p>

<p>There are many ways to serve ML models in this setup. From a Flask or FastAPI server with an in-memory loaded ML model to specialized solutions like TF Serving or NVIDIA Triton, and anything in between. In this article, we will mainly focus on this setup.</p>
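<p>For reference, a minimal FastAPI sketch of the in-memory-model flavor; the model file name and the input schema are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path; loaded once at startup

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
</code></pre></div></div>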

<p>I’ve seen people developing batch solutions where the ML component is actually a server being called by said batch program. Or components in a streaming event processing system calling HTTP servers that serve ML models. Because the interactive pattern is flexible, reasonably simple to reason about, and well-documented, many are “abusing” it.</p>

<h3 id="note-on-cloud-edge-and-client-side-serving">Note on Cloud, Edge and Client-side serving</h3>

<p>What if we are developing a mobile app and want our ML-enabled features to work without the internet? What if we want to provide our users with magical responsiveness? To make waiting for a response on a web page a thing of the past. Enter client-side serving and serving ML on edge.</p>

<h4 id="things-to-consider">Things to consider</h4>

<p>When designing ML systems, we need to be aware of this possibility and the challenges of such a deployment scenario.</p>
<ul>
  <li>Deployment on browser clients is straightforward using <a href="https://github.com/tensorflow/tfjs">TF.js</a>. <a href="https://github.com/microsoft/onnxruntime/tree/master/js/web">ONNX</a> can also be an option, albeit a bit more complicated.</li>
  <li>As for mobile, we have multiple variants, including CoreML from Apple, TFLite from Google, and ONNX.</li>
  <li>For edge devices, depending on their compute performance, we can either run ML models just like we’d do in the cloud or create custom TinyML solutions.</li>
</ul>

<p>Notice that, in theory, browsers and smartphones are edge devices. In practice, they are treated differently because of the wildly different programming models. More often than not, edge servers are classic computers, either running on ARM or x86 hardware, with traditional OSs, just much closer to the user, network-wise. Mobile devices need to be programmed differently because of the big difference between mobile and more common OSs. More recently, mobile devices have specialized DSPs or co-processors optimized for AI inference.</p>

<p>Browsers are even more different because browser code is usually architected around the idea of a sandboxed environment and the event loop. More recently, we have web workers, which make the creation of multi-process applications easier. Also, when serving an ML model in a browser, we can’t make any assumptions about the hardware on which the model will run, resulting in a potentially horrible user experience. It can very much be that a user opened our web app with the ML model on a low-end mobile device. Just imagine how laggy that site will be.</p>

<h3 id="trade-offs">Trade-offs</h3>

<p>There could be multiple reasons to move ML serving closer to the edge. The usual motives are latency sensitivity, bandwidth control, privacy concerns, and the capability to work offline. Keep in mind that we can have various hierarchical deployment targets, spanning from the user’s client device, to an IoT hub or router closest to the user, to a city or region-wide data center.</p>

<p>Deploying on edge devices or client devices usually trades off model size and performance for reduced network latency or the possibility of dramatically reducing the bandwidth. For example, deploying a model for automatic face recognition and classification on a mobile phone maybe isn’t such a good idea, but a tiny and simple one that can detect whether there’s a face in the scene or not is. The same goes for an automatic email response generator vs. an autocomplete keyboard model. The former usually isn’t needed on-device, while the latter must be deployed on-device.</p>

<p>In practice, it is possible to mix edge/on-device models with a cloud-deployed model for maximum predictive performance when online, but with the possibility to retain some AI-enabled features offline. This can mostly be done by writing custom code, but it is also possible to use something like <a href="https://github.com/kubeedge/sedna">Sedna</a> for <a href="https://kubeedge.io/en/">KubeEdge</a> if your edge devices are capable of running KubeEdge.</p>

<h3 id="a-real-world-use-case">A real-world use-case</h3>

<p>A common but less discussed scenario for deploying on edge – A retailer wants to use video analytics in their grocery stores. They developed a suite of powerful computer vision models to analyze the video feed from their in-store cameras and were met with a hard constraint. The internet provider couldn’t ensure the upload latency, and bandwidth from their locations couldn’t support multiple streaming video feeds. The solution? They bought a gaming PC per store, put it in the staff room, and did their video analysis locally without needing to stream videos from the stores. Yes, this is an edge ML scenario. Edge computing is not only about IoT.</p>

<h2 id="serving-ml-models-the-right-way">Serving ML models the right way</h2>

<p>Model serving has a tight relationship with metadata stores, ML model registries, monitoring components, and feature stores. That is quite a lot. Plus, depending on concrete organizational requirements, model serving might have to be integrated with CI/CD tooling. It might be necessary to either ensure a staging environment to test newly trained models or even continuously deploy to production environments, most likely as a shadow or canary deployment.</p>

<center><img src="/_data/webp/MLOps_process.webp" alt="End-to-end MLOps architecture and workflow with functional components and roles" /></center>
<center><i>End-to-end MLOps architecture and workflow with functional components and roles | Source: <a href="https://arxiv.org/abs/2205.02302">https://arxiv.org/abs/2205.02302</a></i></center>

<h3 id="what-makes-a-deployment-good">What makes a deployment good?</h3>
<p>Keep in mind that a good model serving solution isn’t only about cost-efficiency and latencies but also about how well it is integrated with the rest of the stack. If we have a high-performance server that is a nightmare to integrate with our observability, feature stores, and model registries, we have a terrible model serving component.</p>

<p>A common way to implement the whole model deployment/serving workflow is to have the model serving component fetch concrete models based on the information from the ML model registry and/or metadata store.</p>

<p>For example, using a tool like <a href="https://neptune.ai/">Neptune.ai</a>, we can track multiple experiments. At some point, if we decide we have a good candidate model, we tag it as a model ready for staging/canary. Remember, we’re still interacting with Neptune.ai, no need to use any other tool. Our ML serving component periodically checks in with the ML model registry, and if there’s a new model with the compatible tag, it will update the deployment like <a href="https://docs.neptune.ai/how-to-guides/model-registry/querying-and-downloading-models-and-metadata/accessing-production-ready-models">this</a>. This method allows for more accessible model updates without triggering image builds or other expensive and complex workflows. 
An alternative approach is to redeploy a pre-built serving component and only change its configuration to fetch a newer model, <a href="https://www.cloudskillsboost.google/focuses/17649?parent=catalog">something like this</a>. This approach is more common in cloud-native (Kubernetes) serving solutions.</p>

<p>Of course, as mentioned earlier, frequently, the model serving component has to interact with feature stores. To interact with feature stores, we need to be able to serve not just serialized ML models but also have support for custom IO-enabled components. In some cases, this can be a nightmare. A workaround is integrating the feature stores at the application-server level and not at the ML serving component level.</p>

<p>Finally, we also need to log and monitor our deployed ML models. Many custom solutions integrate with tools like the ELK stack for logs, OpenTelemetry for traces, and Prometheus for metrics. ML does bring some specific challenges, though.</p>

<blockquote>
  <p>For a dive into what a good observability setup consists of, be sure to check out <a href="https://alexandruburlacu.github.io/posts/2021-05-20-logs-traces-how-to">another blog post of mine</a>.</p>
</blockquote>

<p>First, we need to be able to collect new data for our datasets. This is mostly done either through custom infrastructure or ELK. 
Then, we need to be able to track ML-specific signals, like distribution shifts for input values and outputs. This is a highly un-optimized scenario for tools like Prometheus. To better understand these challenges, <a href="https://www.shreya-shankar.com/rethinking-ml-monitoring-3/">check out this blog post</a>. A few tools try to help with this, most prominently <a href="https://whylabs.ai/">WhyLabs</a> and <a href="https://arize.com/">Arize</a>.</p>

<h2 id="what-do-we-really-care-about">What do we really care about?</h2>

<p>Other than the usual suspects - tail latencies, number of requests per second, and application error rate, it is advisable to also track model performance. And here’s the tricky part. It’s rarely possible to obtain ground-truth labels in real-time or with a short delay. If the delay is significant, it will take longer to identify issues impacting our users’ experience.</p>

<p>Because of this, tracking the inputs and outputs distribution and triggering some action if these diverge significantly from what the model is expecting is pretty common. While this is useful, it doesn’t quite help track our predictive performance SLO (service-level objective).</p>
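<p>As an illustration, here’s a minimal drift check using a two-sample Kolmogorov-Smirnov test from SciPy; the threshold is arbitrary, and a real setup would tune it per feature and correct for multiple testing:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference_values, production_values, alpha=0.01):
    # a small p-value means the two samples likely come from different distributions
    _statistic, p_value = ks_2samp(reference_values, production_values)
    return p_value &lt; alpha

rng = np.random.default_rng(0)
print(feature_drifted(rng.normal(0, 1, 5_000), rng.normal(0.2, 1, 5_000)))  # True
</code></pre></div></div>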

<h3 id="the-problem-of-tracking-performance">The problem of tracking performance</h3>

<p>Let me explain. On one hand, we can reasonably assume that divergences in our input and output distributions can result in degraded performance, but on the other hand, we don’t actually know the exact relation between the two.</p>

<p>We can have scenarios where a distribution for a feature drifts a lot from the expected distribution but has no significant impact on our ML model performance. We will have a false alarm in this case. But these relations change over time. So next time, when the same feature drifts again, it can result in a significant loss of predictive power of our ML models. As you can imagine, this is a nightmare to manage. So what can be done?</p>

<h3 id="the-solution--detection-and-mitigation">The solution – detection and mitigation</h3>

<p>We deploy and update ML models to better our business. Ideally, we must “link” our model SLOs with business metrics. For example, if we notice that the ratio of users clicking on our recommendations drops, we know we are not doing well. For a text auto-correction solution, a similar business-derived model SLO could be the ratio of accepted suggestions. If it falls below some threshold, maybe our model is no better than the previous one. Regretfully, it isn’t always this easy to do.</p>

<p>Because this problem can be so hairy, we usually extract ML model performance monitoring into a separate component and only track the system-level metrics, traces, and logs at the ML serving component level. We hope that as the infrastructure for ML model monitoring becomes better, ML serving components will provide significantly better integrations with these tools to make the troubleshooting of deployed models significantly easier.</p>

<h2 id="evolving-model-serving">Evolving model serving</h2>
<p>Because the interactive serving setup is the most popular way to productionize ML models, we will discuss what a basic, intermediate and advanced setup looks like. What differentiates a good setup from a mediocre one is cost-effectiveness, scalability, and latency profile. Of course, the integration with the rest of the MLOps stack is also important. In general, deciding on what architecture and tools to use is always a tricky affair, with numerous trade-offs. If you’re interested in advancing your decision-making when it comes to making technical decisions, be sure to check out <a href="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool">this article</a> on what questions should you ask and some of the trade-offs you should expect. Don’t mind that it’s about programming languages, most questions apply to tools and frameworks too.</p>

<h3 id="basic-setup">Basic setup</h3>

<p>Recall, at the beginning of the article, I mentioned that there’s a time and place for an ML-model-in-Flask-server-in-a-Docker-container style of serving. A lot was said about this kind of serving, so I won’t dive into much detail. Note that the ML model can be either baked into the container or attached as a volume. If you are only creating a demo API or know for a fact that you won’t have much traffic (maybe it’s an internal application, which only 3-5 people will use), this can be an acceptable solution.</p>

<p>Or, if you can provision multiple very capable cloud VMs with powerful GPUs and CPUs and don’t mind having poor resource utilization and sub-optimal tail latencies, then it can also work. I mean, <a href="https://www.zdnet.com/article/why-facebook-doesnt-have-or-need-testers/">Facebook is doing very few tests for their software</a> and still manages to be a huge tech corporation, so it may not always make sense to follow all software engineering best practices.</p>

<p><strong>Pros</strong></p>
<ul>
  <li>This setup has the advantage of being very easy to implement and relatively scalable (need to handle more requests =&gt; run multiple replicas).</li>
</ul>

<p><strong>Cons</strong></p>
<ul>
  <li>The biggest issue is poor resource utilization because models are triggered on each request for a single input entry, and the web servers don’t need the same hardware as ML models.</li>
  <li>Then, there’s a huge lack of control over tail latencies, meaning you can’t enforce almost any SLO with this setup. The only hope to somewhat control your tail latencies is a good load balancer and enough powerful machines to run multiple replicas of your ML serving component.</li>
</ul>

<center><img src="/_data/webp/MLServing.drawio.webp" alt="Simple ML serving with a replicated container. The ML model can be either backed in or attached as a volume" /></center>
<center><i>Simple ML serving with a replicated container. The ML model can be either backed in or attached as a volume. | Source: author.</i></center>

<p>To improve this setup, we must move onto a medium-level configuration.</p>

<h3 id="intermediate-setup">Intermediate setup</h3>

<p>As mentioned above, we need to split the ML inference from the application server component to optimize the resource utilization and have better control over our latencies. One way to do it is using a publisher-subscriber asynchronous communication pattern, implemented with <a href="https://hanxiao.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/">ZeroMQ</a> or even <a href="https://pyimagesearch.com/2018/01/29/scalable-keras-deep-learning-rest-api/">Redis</a>, for example.</p>
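<p>To make the pattern concrete, here is a minimal sketch of the Redis-backed variant; the queue name, payload schema, and <code class="language-plaintext highlighter-rouge">model</code> are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

import redis

r = redis.Redis(host="localhost", port=6379)

# application server side: enqueue a request and return immediately
def enqueue_request(request_id, features):
    r.rpush("inference:queue", json.dumps({"id": request_id, "x": features}))

# ML worker side: drain up to a batch of requests, score them together
def worker_loop(model, batch_size=32):
    while True:
        batch = []
        while len(batch) &lt; batch_size:
            raw = r.lpop("inference:queue")
            if raw is None:
                break
            batch.append(json.loads(raw))
        if not batch:
            continue
        predictions = model.predict([item["x"] for item in batch])
        for item, pred in zip(batch, predictions):
            r.set(f"inference:result:{item['id']}", json.dumps(float(pred)))
</code></pre></div></div>

<p>The naive drain loop above is a crude cousin of adaptive batching: a production version would also wait up to a small timeout to fill the batch instead of busy-polling.</p>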

<p>So, after this “schism”, we can do a lot of cool tricks to perfect our serving component into an advanced one.</p>

<ul>
  <li>
    <p>First, we can enforce much more granular and fine-tuned timeouts and retries. With such a setup, it is possible to scale the ML servers independently from the application servers.</p>
  </li>
  <li>
    <p>Then, the most fantastic hack for this is to do <a href="https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf">adaptive batching</a>. In fact, it’s such a great technique that it would make a solution almost advanced-level, performance-wise.</p>
  </li>
</ul>

<p>A good model serving solution isn’t just about how good the server performance is but also about how easy it is to integrate the rest of the ML sub-systems. A machine learning serving component would need to provide at least some model management capabilities to easily update model versions without needing to rebuild the whole thing. For this kind of setup, the ML/MLOps team can design their ML workers to periodically check in with the model registry and, if there are any updates, fetch new models, something like <a href="https://docs.neptune.ai/how-to-guides/model-registry/querying-and-downloading-models-and-metadata/accessing-production-ready-models">this</a> or <a href="https://mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry">this</a>.</p>
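<p>A minimal sketch of that polling loop, using the MLflow registry and the model name from the tracking snippet earlier; the polling period is arbitrary:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

import mlflow.pyfunc
from mlflow.tracking import MlflowClient

client = MlflowClient()
current_version, model = None, None

while True:
    # check which version is currently marked as "Production" in the registry
    latest = client.get_latest_versions("sklearn-rf-reg-model",
                                        stages=["Production"])[0]
    if latest.version != current_version:
        model = mlflow.pyfunc.load_model(
            f"models:/sklearn-rf-reg-model/{latest.version}")
        current_version = latest.version
    time.sleep(300)  # check in with the registry every 5 minutes
</code></pre></div></div>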

<center><img src="/_data/webp/MLServingMedium.drawio.webp" alt="A medium ML serving blueprint, with both replicated application servers and ML servers. The solution also uses a feature store and a model registry" /></center>
<center><i>A medium ML serving blueprint, with both replicated application servers and ML servers. The solution also uses a feature store and a model registry. | Source: author.</i></center>

<p>I am sure you noticed that the moderate setup is considerably more complex than the basic one. This complexity brings major downsides to this approach. At this stage, one needs some form of container orchestration, usually K8s, and at least some system observability, for example, with Prometheus and ELK.</p>

<h3 id="advanced-setup">Advanced setup</h3>

<p>To be fair, a medium-level setup is enough for most ML serving scenarios. You shouldn’t consider the advanced ML serving setup as a necessary evolution of the last setup. The advanced setup is more like “heavy artillery”, which is required only in exceptional cases.</p>

<p>With all the bells and whistles proposed in the solution above, a question arises – “Why did we bother so much with all these tricks if there are pre-made solutions?”. And indeed, why? The answer would usually be – they needed something custom for their setup.</p>

<p>Specialized solutions like NVIDIA Triton, Tensorflow Serving, or TorchServe have solid selling points and pretty weak ones too.</p>

<p><strong>Pros</strong></p>

<ul>
  <li>First, these serving solutions are very well optimized and usually perform better than a “medium + bells and whistles” solution.</li>
  <li>Second, these solutions are straightforward to deploy; most provide a docker container or a Helm chart.</li>
  <li>Finally, these solutions usually contain relatively basic support for model management and A/B testing.</li>
</ul>

<p><strong>Cons</strong></p>

<ul>
  <li>Now the downsides. The biggest one is the awkward integration with the rest of the MLOps ecosystem.</li>
  <li>Second, related to the first, these solutions are hard to extend. The most convenient way to solve both these is to create custom application servers that act as proxies/decorators/adapters for the high-performing pre-built ML servers.</li>
  <li>Thirdly, and this is probably a thing that I personally don’t like, is that these solutions are very constraining in terms of what models can be deployed. I want to keep my options open, and having a serving solution that accepts only TF SavedModels or ONNX-serialized models isn’t aligned with my values. And yes, even ONNX can be limiting, for example, <a href="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl">when you have a custom model</a> (see the subsection – the resulting models can be tedious to deploy) which uses operations yet unsupported by ONNX.</li>
</ul>

<p>As you might have already guessed, I don’t use these solutions for the most part. I prefer PyTorch, so TF Serving is a no-go for me. Note, it’s just my context. If you use TF, consider using TF Serving. I tried it a few years ago for a TF project. It’s pretty good for serving, but a bit cumbersome for model management, if you ask me.</p>

<p>I said I use PyTorch primarily, so maybe TorchServe? To be frank, I haven’t even tried it. Seems good, but I’m afraid it has the same model management issues as TF Serving. What about Triton? I can speak of its older incarnation, TensorRT Inference Server. It was a nightmare to configure and then discover that because of a custom model head, it couldn’t be served properly. Plus model quantization issues, plus the same woes of model version management as the previous two candidates… To be fair, I’ve heard it got better, but I still am very skeptical of it. So, unless I know my model architecture is unchanged and I need maximum possible performance, I will not use it.</p>

<center><img src="/_data/adaptive-batching.svg" alt="Adaptive batching as a way to more efficiently use ML models" /></center>
<center><i>Adaptive batching as a way to more efficiently use ML models. Source: <a href="https://mlserver.readthedocs.io/en/latest/user-guide/adaptive-batching.html">Seldon MLServer docs</a></i></center>

<p>To summarize, specialized solutions like NVIDIA Triton or Tensorflow Serving are powerful tools, but if you opt to use them, you better have serious performance needs. Otherwise, I would advise against it. But that’s not all –</p>

<ul>
  <li>
    <p>Even if these solutions are feature-rich and performant, they still need extensive supporting infrastructure. Such servers are best suited as ML workers, so you still need application servers. To have a truly advanced ML serving component, you need to consider tight integration with your other systems and ML and data observability, custom-built or using services like <a href="https://arize.com/">Arize</a> and <a href="https://www.montecarlodata.com/">Montecarlo</a>.</p>
  </li>
  <li>
    <p>Also, you need to be able to perform advanced traffic management. The systems mentioned above provide some limited support for A/B testing. Still, in practice, you would have to implement it differently, either at the application server level, for more fine-grained control, or at the infrastructure level, with tools like <a href="https://istio.io/">Istio</a>. You usually need to be able to support gradual rollouts of new models, canary deployments, and traffic shadowing. No existing pre-built serving system provides all these traffic patterns. If you want to support these, be ready to get your hands, and whiteboards, dirty.</p>
  </li>
</ul>

<h2 id="note-on-cloud-offerings">Note on cloud offerings</h2>

<p><strong>TL;DR:</strong> Cloud offerings give you “full-lifecycle” solutions, meaning that the model serving is integrated with solutions for dataset management, training, hyperparameter tuning, monitoring, and model registries.</p>

<p>Cloud offerings try to give you the simplicity of the basic setup, with the feature-richness of the advanced setup and the performance of the moderate one. For most of us, this is a fantastic deal.</p>

<p>Common differentiators for cloud offerings are serverless and autoscaled inference, with GPUs and/or special chips support.</p>

<ul>
  <li>
    <p>Take Vertex AI from Google, for example. They provide you with a full MLOps experience and relatively easy model deployment, which can be served either as a cloud function or an autoscaled container, or even as a batch job. And because it’s Google, they have TPUs, which come in handy for really large-scale deployments.</p>
  </li>
  <li>
    <p>Or, with an even more complete solution, take AWS. Their SageMaker, just like Vertex AI, helps you along the whole MLOps lifecycle. Still, it also adds a simple and cost-efficient way to run models for inference with Elastic Inference accelerators, which seem to be fractional GPUs, possibly via NVIDIA’s Ampere-generation MIGs, or using a custom chip called Inferentia. Even better, SageMaker allows for post-training model optimizations for target hardware.</p>
  </li>
</ul>

<p>Yet neither offers adaptive batching, some form of speculative execution/request hedging, or other advanced techniques. Depending on your SLOs, you might still need to use systems like NVIDIA Triton or develop in-house solutions.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Running ML in production can be a daunting task. To truly master this, one has to optimize for many objectives – cost-efficiency, latency, throughput, and maintainability, to name a few. If there’s something you should get from this article, then let it be these three ideas –</p>
<ol>
  <li>Have a clear objective and priorities when serving your ML model</li>
  <li>Let the business requirements and constraints drive your ML serving component architecture, not the other way around.</li>
  <li>Think of the model serving as a component in the broader MLOps stack.</li>
</ol>

<p>Armed with these ideas, you should be able to filter subpar ML serving solutions from the good ones, thus maximizing the impact for your organization. But don’t make the mistake of trying to get everything right from the beginning. Start serving early, iterate on your solution, and let the knowledge from this article help you make your first few iterations somewhat better. Better to deploy something mediocre than not to deploy anything.</p>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://sre.google/sre-book/table-of-contents/">Google SRE Book</a></li>
  <li><a href="https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf">Clipper paper</a></li>
  <li><a href="http://learningsys.org/nips17/assets/papers/paper_1.pdf">TF Serving paper</a></li>
  <li><a href="https://arxiv.org/pdf/2205.02302.pdf">Some info about Serving within MLOps</a></li>
  <li><a href="https://www.tekhnoal.com/10-ways-to-deploy-an-ml-model.html">10 Ways to deploy an ML model</a></li>
  <li><a href="https://neptune.ai/blog/mlops-at-reasonable-scale">MLOps at reasonable scale</a></li>
  <li><a href="https://towardsdatascience.com/ml-latency-no-more-9176c434067b">ML Latency No More</a></li>
  <li><a href="https://youtu.be/YMtLI1Ub85s">How Cookpad Leverages Triton Inference Server To Boost Their Model Serving</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><category term="ml," /><category term="machine-learning," /><category term="deep-learning," /><category term="serving," /><category term="deployment," /><category term="inference" /><summary type="html"><![CDATA[It's important to be able to deploy a machine learning model when trained. But how do we approach serving ML models correctly?]]></summary></entry><entry><title type="html">Interviewing for a Senior ML Engineer position</title><link href="https://alexandruburlacu.github.io/posts/2022-07-23-senior-ml-interview" rel="alternate" type="text/html" title="Interviewing for a Senior ML Engineer position" /><published>2022-07-23T01:00:00+00:00</published><updated>2022-07-23T01:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/senior-ml-interview</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-07-23-senior-ml-interview"><![CDATA[<p>Interviewing is always a tiring and sometimes awkward process. Thankfully there are lots of resources online to help you prepare. But what if you need more specific advice for a more niche position?</p>

<p>This post is based on my personal experience going through the interviewing process at 5 not-FAANG companies. I also had some experience interviewing for not-senior ML Engineering roles at another 3 companies last year. So, I will also do a comparative analysis.</p>

<h2 id="before-we-begin">Before we begin…</h2>

<p>Let me start with a short prologue to explain why I’m writing this piece. In January 2022, I decided, again, it was time to search for another job outside of my home country. But this time, I decided to be sneaky/smart about it, so I changed my LinkedIn address to show that I’m in London. I also groomed my LinkedIn page a bit more to show some highlights of my recent experience. And then magic happened. For weeks, I had recruiters invite me to interviews. I didn’t even have to apply to anything myself, only to accept or reject opportunities arriving from recruiters. What surprised me was that the majority of options were senior or even lead roles. So, I felt like an imposter, but I still accepted a few of these and started the process. And then I searched for tips on how to nail senior ML engineering interviews… and found almost nothing. Sh*t. And that’s how I <del>met your mother</del> decided to write this blog post.</p>

<p>I brushed up my interviewing skills through mock interviews. I was also searching for technical questions for Senior ML roles. Surprisingly, I couldn’t find anything. All the info was only for MLE roles. It seemed a bit strange. In retrospect, it all makes sense now.</p>

<p>I know you are eager to find out why, so I’ll just give the TL;DR right away - <strong>ML and Senior ML have more or less the same complexity/hardness for technical questions</strong>. Surprise!</p>

<p>I bet you didn’t expect that. I know I didn’t. But then, what <strong>is different</strong>? And how does the interviewing process work for Senior ML Engineers?</p>

<h2 id="senior-vs-non-senior-ml-interviews">Senior vs non-senior ML interviews</h2>

<p>Based on my experience, I haven’t noticed much difference between senior ML and ML engineering interviews at the technical level.</p>

<p>What I did notice is the focus on soft skills for senior positions, and I don’t necessarily mean communication skills. Instead, how a candidate handled failures, team-level conflicts, cross-team communication, how they solved their most challenging problems, or how they handled a poor decision.</p>

<p>I recall the first technical interview for a Senior ML role I had. I was anxious about what kind of questions I would receive. It wasn’t so bad, I’ve had tougher questions than that, but the focus was undoubtedly higher on how I handled some scenarios or how I would handle them now.</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>ML engineer interview</th>
      <th>Senior ML engineer interview</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Coding</td>
      <td>Your usual leetcode-medium questions</td>
      <td>Same, haven’t seen dynamic programming at this stage</td>
    </tr>
    <tr>
      <td>Take-home assignment</td>
      <td>Either do EDA or deploy an ML model, focus on code quality, ease of use and tests</td>
      <td>Same, take-home assignments are not harder for senior positions</td>
    </tr>
    <tr>
      <td>ML Trivia</td>
      <td>How do the algorithms work? What would be the best solution for a given type of problem?</td>
      <td>On average, the same as for ML engineer</td>
    </tr>
    <tr>
      <td>System Design</td>
      <td>How to implement a system for a given scenario? Data collection issues?</td>
      <td>On average, same as for ML engineers, just be more conscious of budget constraints</td>
    </tr>
    <tr>
      <td><strong>Behavioral</strong></td>
      <td><strong>Focus on collaboration, individual growth, and adaptability</strong></td>
      <td><strong>Focus on failures, conflict management, and cross-team collaboration</strong></td>
    </tr>
  </tbody>
</table>

<p>One position for which I did notice some big differences when it comes to the technical questions is <strong>Research Engineer</strong>. I’m talking questions like <a href="https://www.image-engineering.de/library/technotes/745-how-does-the-jpeg-compression-work">how JPEG compresses</a> images, how to compute the <a href="https://baioc.github.io/blog/fibonacci/#fft-the-fast-fibonacci-transform">nth Fibonacci number in O(log n) time</a>, or <a href="https://drscotthawley.github.io/blog/2019/12/21/PCA-From-Scratch.html">how to compute PCA from scratch</a>. Now, for a research engineering position, these kinds of questions do make sense because of the innovative and research-oriented nature of the projects they have to work on. These frequently can involve a lot of <em>convert-math-to-code</em> or <em>let’s-break-it-down-and-then-improve</em> type of tasks.</p>

<p>Anyway, to give you a more detailed view, let’s see what is the general interviewing process when it comes to these kinds of roles.</p>

<!-- 
Graphcore      - Interviewer -> Take home project    -> Technical discussion -> Behavioral
ASOS           - Interviewer -> Take home project    -> Technical discussion + Behavioral
Yelp           - Interviewer -> Coding challenge     -> System design + Coding interview + 2 x Behavioral
Toptal         - Interviewer -> Coding challenge     -> Coding interview + Technical discussion -> Take home project + Technical discussion
Sprout.ai      - Interviewer -> Take home project    -> Technical discussion -> Behavioral
THG            - Interviewer -> Behavioral           -> Technical discussion
Hyperscience   - Interviewer -> Technical discussion -> Behavioral
Rasa.ai        - Interviewer -> Coding challenge     -> TBA
Tessian        - Interviewer -> Coding interview     -> Technical discussion -> System design + Behavioral
Audio Analytic - Interviewer -> Technical discussion + (Behavioral + Technical) + Behavioral -> Behavioral?
Zensors        - Behavioral/Interviewer -> Technical discussion + Coding interview -> ML Coding interview -> Behavioral
 -->

<h2 id="the-general-interviewing-flow">The general interviewing flow</h2>

<p>First, let’s go over the main steps in the process. Generally, there are at least 4 steps:</p>
<ol>
  <li>You have the first call with a recruiter or hiring manager. You get to know each other, go over your CV in general, discuss what makes you search for jobs, or accept invitations to interview, what you know about the company, what you are searching for, and so on. A pretty simple step if you ask me. Then, suppose the hiring manager thinks your goals and interests align with what the company seeks. In that case, you will be invited to the second, <strong>technical</strong> step. The dreaded one.</li>
  <li>I call this step just technical for a reason. Some companies split it into 2, a take-home assignment and then a discussion based on it. Others have the typical coding interview. And others yet just have a technical discussion. The technical discussion usually covers ML theory and some specifics, like what transfer learning is, or what transformer architectures are. It might also be a pen-and-paper exercise where you can be asked to derive how PCA works. The latter is more common for more research-oriented roles.</li>
  <li>Most of the time, there are two technical interviews, the second being more focused on system design. Or maybe some more technical challenges and discussions, YMMV, because this is very company- and team-specific.</li>
  <li>Finally, the last round of interviews is usually reserved for everything else that wasn’t covered in the previous steps, usually the behavioral interview. Some companies have three rounds, combining the 3rd step with the 4th.</li>
</ol>

<p>Now, let’s dive into details.</p>

<h2 id="1st-interview">1st interview</h2>

<p>Pretty simple. Make sure to learn about the company, even if you were invited to interview with them. At this point, the company searching for candidates has a few objectives:</p>
<ul>
  <li>to understand how interested you are in the company/position</li>
  <li>to find out whether there are any legal constraints that need to be acknowledged, like visa status</li>
  <li>or personal constraints, like the necessity to work remotely</li>
</ul>

<p>Also, at this stage, the recruiter is assessing whether you’d be a good fit based on your career aspirations, personal opinions, and past experiences.</p>

<p>But don’t be fooled, there’s a probability of failure even at this stage. For example, if the recruiter feels you’re not interested in the position or if your career plans don’t align with the responsibilities of this position.</p>

<h2 id="2nd3rd-interview">2nd/3rd interview</h2>

<p>As mentioned, different companies handle this stage differently. I’ve seen three types. And given that there are two steps here, most companies do a mix of these three methods.</p>

<h3 id="the-take-home-assignment-tribe">The “take-home-assignment tribe”</h3>

<p>Take home - either an ML serving solution or EDA + modeling. No one will expect you to deliver a robust, production-ready solution for the ML serving project, nor will anyone complain that your Jupyter notebook doesn’t contain a SotA ML model for a given dataset. For the former, the focus is on code quality, the presence of tests and features, and ease of running the code; for the latter, on reproducibility and the soundness of the solution.</p>

<p>Focus on quality over quantity. A good way to show professionalism is to follow up with clarifying questions once you receive the task. And please, read it carefully. Too often I have seen people do it all wrong, without even bothering to check the exact constraints of the homework.</p>

<h3 id="the-coding-challengers">The “coding challengers”</h3>

<p>Much has already been said about these. One point I consider worth reiterating is how important it is to actually talk through your problem-solving process and ask clarifying questions. I would argue that this could be even more important than solving the problem. Also, don’t forget about:</p>
<ul>
  <li>Asking about possible edge cases and then covering them.</li>
  <li>Explaining the time and space complexity of your solution.</li>
  <li>If you have the time, extra points for going through your code “debugger-style”. That is, step-by-step while telling what the current values of all your variables are.</li>
</ul>

<h3 id="the-technical-discussionists">The “technical discussionists”</h3>

<p>Discussion with a team of engineers. It usually goes like this: <code class="language-plaintext highlighter-rouge">Technical/ML Trivia + NotSoOptional[ML System Design] + Optional[Behavioral]</code>. ML questions are mostly one of:</p>
<ul>
  <li>“How would you handle X scenario”</li>
  <li>“What is Y? How does this work?”</li>
  <li>Occasionally, for research-heavy roles - “Could you compute Z from scratch, here’s a Google Doc”, as a follow-up to the previous questions.</li>
</ul>

<p>Where \(Y \in \{\text{BatchNorm}, \text{DropOut}, \text{SkipConnections}, \text{DataAugmentation}, \text{SGD}, \text{Transformers}, \text{Attention}, \ldots\}\) and
\(Z \in \{\text{PCA}, \text{Linear Regression}, \text{kNN}, \text{kMeans}\}\).</p>
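<p>To give you a taste of the “compute Z from scratch” part, here’s a barebones PCA sketch in NumPy, roughly what a pen-and-paper-turned-Google-Doc answer could look like: center the data, run SVD, and project onto the top principal directions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def pca(X, n_components):
    """Barebones PCA: center, SVD, project onto the top principal axes."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                     # principal directions
    explained_var = (S[:n_components] ** 2) / (len(X) - 1)
    return X_centered @ components.T, components, explained_var
</code></pre></div></div>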

<p>Sometimes technical discussions take a more ML-System-Design flavor.</p>

<p>It was COVID times, so system design was usually verbal-only, unless you could also text-draw a solution while sharing your screen. Pseudo-code also helps.
ML System Design seems not to be any different. It’s still one of “Design a Search Engine for X”, or “How are you going to design an X-which-is-actually-a-recommender-system”.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
---------   r/w  ----------    ----------   HTTP/2
|  DB   | &lt;------| API    |&lt;-- | NGINX  |  &lt;-------  Client
|       |        |        |    ---------- 
---------        ----------

</code></pre></div></div>
<center><i>Example of "text-drawing" #1</i></center>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                 /-------&gt; Users Service --&gt; MySQL
                                /
Client w/ Browser Cache ---&gt; Gateway -----&gt; Posts Service  --&gt; Cassandra x 6
                                                |                 write_to: 2
                                              Redis               read_from: 1

</code></pre></div></div>
<center><i>Example of "text-drawing" #2</i></center>

<p>Extra points for talking through efficiency/budget/business considerations at this step. For example, proposing to split the application in two, with ML logic on a GPU-enabled machine and business logic on a more conventional server. Or thinking out loud about a buy vs. build decision about some sub-component.</p>

<h2 id="some-personal-opinions">Some personal opinions</h2>

<p>I prefer take-home projects + technical discussions. This combination makes for a more meaningful technical discussion. It allows the candidate to express their ideas about how a proper production system should be designed based on the take-home assignment. Plus, a good take-home project can highlight candidates’ abilities to write code and how they handle logging, testing, documentation, and deployment. I would argue it’s much better than just solving leetcode problems.</p>

<p>I even used take-home assignments to filter candidates when we were hiring for my team. I know the main cons, but I believe that a well-defined problem can be solved in one or two evenings, a couple of hours each. Not great, but it feels much more relaxed than a 45-minute coding interview. Speak of the devil…</p>

<p>I don’t like coding challenges. IMO, it’s usually just lazy bs. These kinds of practices can be understandable for FAANG (<a href="https://www.reddit.com/r/csMajors/comments/qhtqre/faang_manga/">well, more like MANGA nowadays</a>) companies because of their scale*. But when coding challenges are used by small companies, I mostly find it in bad taste.</p>

<blockquote>
  <p>Disclaimer *: I don’t mean that at Google-scale, they need their devs to know very well how to sort an array or find 2 numbers that add up to something. I mean that they have to go through so many candidates that they need a standardized, time-efficient, and repeatable way to check their capabilities. It doesn’t seem realistic for companies this big to give take-home assignments and thoroughly check these without incurring significant time and productivity losses. That’s the sad reality.</p>
</blockquote>

<p>To add to the mess of coding interviews, companies are actually misusing them. Coding interviews are supposed to check for a candidate’s problem-solving <strong>and</strong> communication skills. You need to show the interviewer <em>what your thought process is</em> and <em>how you tackle a new problem</em>. Usually, it shouldn’t matter much if the solution you implemented is optimal or not. You need to be aware of this, though. Regrettably, interviewers usually just look for the “correct” answers, like it’s an exam and not a discussion, making the whole experience miserable.</p>

<p>In theory, coding tests are even worse, because there’s no way to see the candidate’s <em>thought process</em> and <em>the way they are tackling problems</em>. Thus, it becomes just a timed exam that has no actual value in assessing how good a candidate is. In practice, because most interviewers are no better, I would take a coding test over a coding interview almost any day of the week.</p>

<p>So, if I were to rank coding interviews, I would arrange them like this:</p>
<ol>
  <li>“Discussion” coding interview</li>
  <li>Coding test with no interviewer at all</li>
  <li>Exam-like coding interview, without much support from the interviewer</li>
</ol>

<p>Of course, there are exceptions. One time, <a href="https://www.youtube.com/watch?v=e-ftdcWqhUs">at band camp</a> (jk), I had a fantastic experience with a no-interviewer coding challenge. It was a 3.5h HackerRank challenge, in 3 stages, for a research engineering position. The questions ranged from probability to ML model serving, numerical stability, and basic ML theory. Then, for the second stage, it was a code review exercise! I was given a piece of code and had to identify a bug and suggest an improvement. How cool is that?! The final part was an actual coding challenge to implement a graph algorithm. It was exhausting, but at least it wasn’t generic, and because it was so diverse, I felt like it enabled people to show where their true strength lies.</p>

<p>Alright, I’ll stop complaining and move on to the next section of this post.</p>

<h2 id="4th-interview">4th interview</h2>

<p>This one is primarily behavioral. Although I would say candidates are asked behavioral questions throughout, it’s just that at this stage they become the primary focus.</p>

<p>I really like the questions about past experiences: how they could have been handled better, or, if something didn’t work, why.
I feel these questions correlate with actual skill better than generic theory questions do.</p>

<p>A few questions that I really liked were:</p>
<ul>
  <li>If I ask your manager what’s your greatest weakness, what would they tell me?</li>
  <li>What was a situation in which you made a mistake? How would you prevent it now, having more experience?</li>
  <li>Give me an example where you made a poor technical decision and then had to fix it. How did you do it?</li>
</ul>

<p>Generally, any question that asks you to reflect on past mistakes is especially valuable. Why? They help uncover how you have grown since then, how humble you are, and how your critical thinking works.</p>

<p>I have no recollection of such questions in a non-senior ML interview, but plenty of those for senior/lead positions. So maybe think about such scenarios before your next interview.</p>

<h2 id="some-final-tips-to-prepare">Some final tips to prepare</h2>

<p>To really nail that interview process, I like doing mock interviews. The best way to do it (that I found) is <a href="https://pramp.com">Pramp.com</a>. It’s not an advertisement, you can check the link - it has no referral code or anything. I just really find them helpful, especially for coding interviews and somewhat for system design interviews.</p>

<p>For ML system design, the best thing I have found so far is Chip Huyen’s booklet - <a href="https://huyenchip.com/machine-learning-systems-design/toc.html">Machine Learning Systems Design</a>. And of course, for generic system design - <a href="https://github.com/donnemartin/system-design-primer">The System Design Primer</a>.</p>

<p>And remember to really prepare for the behavioral interviews. Be ready to answer questions about how you failed and what you learned from it. Focus more on behavioral questions, specifically ones highlighting your leadership potential and learning-from-mistakes type of situations. For a good list of behavioral questions, see this <a href="https://business.linkedin.com/content/dam/me/business/en-us/talent-solutions/resources/pdfs/linkedin-30-questions-to-identify-high-potential-candidates-ebook-8-7-17-uk-en.pdf">PDF from LinkedIn</a>.</p>

<p>Throughout the process, ask questions and show your interviewers that you are engaged in conversations with them and are interested in the role. Ask them about their technical and business priorities, how specific processes are implemented in the organization, and their current pain points. <a href="https://github.com/viraptor/reverse-interview">Here’s a good list</a> of questions you can ask.</p>

<p>Interested in becoming a senior engineer? You’ll need both strong ML and superior soft skills to get that senior position. Also, maybe check my post <a href="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior"><em>Becoming a Senior Engineer</em></a>, which should help you define your own roadmap.</p>

<h4 id="a-little-disclaimer-last-one-in-this-post">A little disclaimer (last one in this post)</h4>

<p>These posts had been almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn’t be nice, to say the least, to post them back then. In Moldova, there’s a saying “Satu’ arde da baba sî chiaptănă”, which translates to something like “The (unreasonable) old lady is grooming herself while the whole village burns”. I didn’t want to be that lady, so I thought it would be better to wait until things became at least somewhat less chaotic.</p>

<p>#Слава Україні! #Героям слава!</p>]]></content><author><name></name></author><category term="posts" /><category term="machine" /><category term="learning," /><category term="career," /><category term="career" /><category term="advice," /><category term="senior" /><category term="engineer," /><category term="leadership," /><category term="programming," /><category term="interviews" /><summary type="html"><![CDATA[My experience interviewing for a few Senior ML and MLOps roles. You will learn what are the common steps, quirks, and tips how to nail an interview for senior ML engineer positions.]]></summary></entry><entry><title type="html">AutoML Solutions: What I Like and Don’t Like About AutoML as a Data Scientist</title><link href="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl" rel="alternate" type="text/html" title="AutoML Solutions: What I Like and Don’t Like About AutoML as a Data Scientist" /><published>2022-07-04T22:00:00+00:00</published><updated>2022-07-04T22:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/neptuneai-automl</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl"><![CDATA[<blockquote>
  <p>This blog post was written by me and originally posted on <a href="https://neptune.ai/blog/automl-solutions">Neptune.ai Blog</a>. Be sure to check them out. I like their blog posts about MLOps a lot.</p>
</blockquote>

<p>There’s a sentiment that AutoML could leave a lot of Data Scientists jobless. Will it? Short answer – Nope. In fact, even if AutoML solutions become 10x better, it will not make Machine Learning specialists of any trade irrelevant.</p>

<p>Why the optimism, you may ask? Because although a technical marvel, AutoML is no silver bullet. The bulk of work a data scientist does is not modeling, but rather data collection, domain understanding, figuring out how to design a good experiment, and what features can be most useful for a subsequent modeling/predictive problem. The same goes for most ML engineers and other data professionals.</p>

<center><img src="/_data/webp/FullDataScienceWorkflow.drawio.webp" alt="CRISP-DM process for data science projects" /></center>
<center><i>Inspired by <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining">CRISP-DM</a> workflow, but with all the real-world feedback loops | Image by author</i></center>

<p>Indeed, AutoML sounds like some sort of algorithmic magic, that upon receiving your labeled data, will output the best possible ML model for it. Truth be told, AutoML is a bit like interacting with a genie: “Be careful what you wish for”, or rather, what data you give it.</p>

<p>Remember the saying, garbage in – garbage out? Due to the additional feedback loops in an AutoML system, compared to a classic ML solution, the “garbage” will be amplified beyond your wildest imagination. I personally wasn’t careful enough and fell into this trap a few times, but more on that later.</p>

<center><img src="/_data/webp/FullDataScienceWorkflowTimeSpent.drawio.webp" alt="The time it takes to clean the data and create relevant features is significantly larger than to train ML models" /></center>
<center><i>Based on personal experience and the references at the end of the article | Image by author</i></center>

<p>Before making any more claims, we first need to understand what AutoML is, and what it isn’t.</p>

<h2 id="the-current-state-of-automl">The current state of AutoML</h2>

<p>In practice, AutoML can take quite different forms. Sometimes a relatively efficient hyperparameter optimization tool (HPO), which can pick different ML algorithms, can be called an AutoML tool. A few notable examples are <a href="http://epistasislab.github.io/tpot/">TPOT</a>, <a href="http://autokeras.com/">AutoKeras</a>, and <a href="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html">H2O.ai AutoML</a> (not to be confused with <a href="https://h2o.ai/products/h2o-driverless-ai/">Driverless.ai</a>). I could even speculate that given a GUI/Web interface to interact with these kinds of tools, and enough marketing budget, one can create a startup out of these.</p>

<center><img src="/_data/webp/tpot-ml-pipeline.webp" /></center>
<center><i>An example AutoML loop. Image by TPOT from Epistasis Labs | <a href="http://epistasislab.github.io/tpot/">Source</a></i></center>

<p>For some Deep Learning folks, AutoML would be about NAS, aka <strong>Neural Architecture Search</strong> algorithms or methods. These methods are actually a very interesting research direction, which brought us such computer vision architectures as EfficientNet, AmoebaNet, and methods like <a href="https://arxiv.org/abs/1806.09055">DARTS</a>, <a href="https://arxiv.org/abs/1802.03268">ENAS</a>, and <a href="https://arxiv.org/abs/1712.00559">PNAS</a>. A couple of notable open-source tools for NAS are <a href="https://nni.readthedocs.io/">Microsoft’s NNI</a> and <a href="https://auto.gluon.ai/">MXNet AutoGluon</a>.</p>

<p>Recall my speculation about <strong>HPO + nice interface == profit</strong>? It was more of a simplification, but some companies actually did this, of course adding features, scalability, security, and customer service. And it works; it indeed helps organizations enable their data scientists to solve a lot of problems. H2O’s Driverless.ai is probably the most well-known solution of this kind, but parts of <a href="https://www.datarobot.com/">DataRobot</a> and <a href="https://www.dataiku.com/">Dataiku</a>’s products are also managed AutoML behind an easy-to-use interface.</p>

<p>A special mention goes to the AutoML offerings from cloud giants like Google, Azure, and AWS. I don’t have much experience with Azure and AWS, but I can speak about my experience with <a href="https://cloud.google.com/automl">Google’s Vision AutoML</a>. From my experiments and knowledge, these solutions are some of the few that actually use NAS in a developer-oriented product, and this is amazing.</p>

<p>Note that NAS won’t be used for quick runs. The last time I checked, Google Vision AutoML specifically was using transfer learning for quick runs and NAS for 24-hour runs. It’s been a while since I checked, though.</p>

<p>Let’s structure all of this information a bit, shall we? The table below should give you a high-level sense of how different tools qualify as AutoML, in one way or another.</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Is it Open Source?</th>
      <th>On-prem/Managed?</th>
      <th>Features</th>
      <th>Kind</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Microsoft NNI</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>HPO + NAS + Some other interesting stuff</td>
      <td>NAS, has a Web UI</td>
    </tr>
    <tr>
      <td>AutoGluon</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>NAS, supports penalizing big models</td>
      <td>NAS</td>
    </tr>
    <tr>
      <td>AutoKeras</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>NAS, depending on scenario has baselines it tries first</td>
      <td>NAS</td>
    </tr>
    <tr>
      <td>TPOT</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>Builds pre-processing + algorithms + ensembles pipelines</td>
      <td>HPO++, actually uses genetic algorithms</td>
    </tr>
    <tr>
      <td>H2O.ai AutoML</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>Basically a free version of Driverless.ai</td>
      <td>HPO++, has a Web UI, w/ integrated evaluation</td>
    </tr>
    <tr>
      <td>H2O Driverless.ai</td>
      <td>No</td>
      <td>On-premise</td>
      <td>Uses many pre-processing, feature encoding and selection schemes</td>
      <td>HPO++ with a nicer UI, w/ integrated evaluation</td>
    </tr>
    <tr>
      <td>Google Vision AutoML</td>
      <td>No</td>
      <td>Managed</td>
      <td>Basically a managed, simple to use NAS</td>
      <td>Transfer learning + NAS, a minimalist UI and w/ integrated evaluation</td>
    </tr>
    <tr>
      <td>DataRobot</td>
      <td>No</td>
      <td>On-premise/Managed</td>
      <td>An integrated platform with XAI, Inference server, Model and Experiments management</td>
      <td>AutoML part seems to be an HPO++ w/ integrated evaluation and XAI and a lot of other stuff</td>
    </tr>
  </tbody>
</table>

<p>Fundamentally, AutoML is trading computational budget (or time) for expertise. If you have no idea how to solve a problem, you will opt for the largest possible search space and wait for the search to finish. On the other hand, if you want to cut your expenses for powerful servers, or don’t want to wait for a week until the results arrive, <strong>and know some things about your problem</strong>, you can reduce the search space and arrive at a solution faster.</p>
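<p>To make this trade-off concrete, here’s a hedged sketch with TPOT: the built-in “TPOT light” configuration restricts the search to fast, simple operators, so you get answers sooner at the cost of search breadth. The dataset is a stand-in, and exact results will vary with your TPOT version.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_digits(return_X_y=True), test_size=0.25, random_state=42)

# Smaller search space ("TPOT light") and few generations: cheaper and
# quicker, at the cost of possibly missing better, more exotic pipelines.
tpot = TPOTClassifier(generations=5, population_size=20,
                      config_dict="TPOT light", random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # dumps the winning pipeline as Python code
</code></pre></div></div>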

<p>AutoML should really be treated more like an exploration tool rather than an optimal model generation tool. It’s not an alternative to a data/ML professional.</p>

<h2 id="automl--the-good-parts-pros">AutoML – The good parts (pros)</h2>

<p>Alright, I think we have established that AutoML is not a panacea for all ML issues. Then what is AutoML good for?</p>

<h3 id="speeding-up-the-model-exploration-stage">Speeding up the model exploration stage</h3>

<p>Let’s be honest: more often than not, we are not especially experienced in the domains we’re working on. Note that by domain I don’t mean computer vision, NLP, or time series, but rather advertising, e-commerce, finance, cell biology, genomics, and the list can go on for much longer. To add to the challenge, businesses require quick and impactful results.</p>

<p>I have a semi-personal story on how AutoML can bridge the gap between those with expertise and those without. A few years ago, I was at a summer school about Deep Learning and Reinforcement Learning. The organizers arranged a Kaggle competition, basically trying to forecast some time series. I intentionally omit details, you know, it’s semi-personal so… Anyway, there were PhDs and postdocs, all trying to fit exceedingly complex models, while some others were focusing on creating meaningful features. Having somewhat shallow knowledge of working with time series, and out of pure laziness, I decided I could just use AutoML, namely TPOT. Without much EDA beforehand, and even less feature engineering. My result was in about the 50th percentile. Now, what do you think the winning submission was? Also TPOT, but with basic outlier removal, converting dates and times to categorical features like is_it_weekend and the likes of it, and running TPOT for 2 days.</p>

<blockquote>
  <p><strong>The moral of the story – if you lack subject matter expertise, or time to learn it, or are just lazy, AutoML is a fairly good starting point. It also frees up time to work on those features, and as seen from my story, features do indeed make a difference.</strong></p>
</blockquote>

<p>Although my story suggests it, it’s not always about delivering the final model; sometimes, analyzing the generated candidates for patterns can be of help too. For example, whether the best solutions use Naive Bayes, Decision Trees, Linear Classifiers, or maybe the AutoML tries to create increasingly complex ensembles, meaning you would also need a very expressive model to solve your problem.</p>

<h3 id="a-very-good-baseline">A very good baseline</h3>

<p>So, you’re working on a new ML project. The first thing you do, model-wise – you implement a simple heuristic baseline and see where you stand. Second, you try a simple ML solution and analyze how much it improves on the baseline. One thing you can try after this stage, and what I like to do, is to estimate your upper bound in terms of predictive performance by letting an AutoML solution squeeze the most out of your data and preprocessing.</p>

<blockquote>
  <p><strong>Not only does it sometimes deliver superior results quickly, but it also shifts your perception towards working on better features.</strong></p>
</blockquote>

<p>Note that sometimes you don’t have the resources or are constrained by some other factors. So YMMV, but do keep in mind this use case for AutoML when working on new projects.</p>
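<p>As a rough sketch of the baseline-then-AutoML progression described above, on a toy scikit-learn dataset (the AutoML run for the upper bound would be step three, following the same pattern):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: trivial heuristic baseline, predict the majority class.
baseline = DummyClassifier(strategy="most_frequent")
print("baseline:", cross_val_score(baseline, X, y, cv=5).mean())

# Step 2: a simple ML model, how much does it improve on the baseline?
simple = LogisticRegression(max_iter=5000)
print("simple model:", cross_val_score(simple, X, y, cv=5).mean())

# Step 3: an AutoML run on the same data to estimate the upper bound.
</code></pre></div></div>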

<h3 id="identify-quickly--what-works-and-what-doesnt">Identify quickly – what works and what doesn’t?</h3>

<p>The space of possible combinations of feature transformations, algorithms, their hyperparameters, and ways of ensembling said algorithms creates an immense search space of possible ML models. Even when you know what solutions can work and what can’t for a given problem, it’s still a vast search space. AutoML can help to fairly quickly test which configurations are more likely to work.</p>

<p>“How?” – you may ask. By running AutoML multiple times, and tracking:</p>

<ul>
  <li>what configurations get picked more often,</li>
  <li>how often,</li>
  <li>what is dropped,</li>
  <li>how quickly is it dropped,</li>
  <li>and so on.</li>
</ul>

<p>In a way, this is some kind of meta-EDA. One might say – Exploratory Model Analysis.</p>
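<p>A quick-and-dirty sketch of this idea with TPOT (reusing a train split like the one from the earlier snippet; <code class="language-plaintext highlighter-rouge">fitted_pipeline_</code> holds TPOT’s best scikit-learn pipeline):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter
from tpot import TPOTClassifier

# Assumes X_train, y_train are already defined, as in the earlier snippet.
winners = Counter()
for seed in range(5):
    tpot = TPOTClassifier(generations=3, population_size=20,
                          random_state=seed, verbosity=0)
    tpot.fit(X_train, y_train)
    final_step = tpot.fitted_pipeline_.steps[-1][1]
    winners[type(final_step).__name__] += 1   # which estimator won this run

print(winners)   # which model families keep being picked across seeds
</code></pre></div></div>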

<p>Now, why would you be interested in it? We want the best model, why not get straight to it? Because what we should aim for isn’t one good final model, but an understanding of what works, and what doesn’t. And based on this understanding, we can better solve problems further down the line. Even with AutoML, no one exempts you from such lovely issues as needing to periodically retrain your models on new data and also trying to reduce budget expenditure on ML.</p>

<h2 id="automl--the-bad-parts-cons">AutoML – The bad parts (cons)</h2>

<h3 id="a-false-sense-of-security">A false sense of security</h3>

<p>Honestly, this is the thing I hate the most about AutoML. It feels like magic and makes you lazy. And just like any automation, the more you use it, the more catastrophic it is when it fails.</p>

<p>Because of this, it’s easy to introduce data bugs. And due to AutoML’s sometimes opaque nature, these bugs are very hard to spot.</p>

<p>I have a personal anecdote about this, too – one that I will probably never get tired of recalling. We were working on a cell classification problem, where the distinction between the positive and negative classes was tough to observe even for a human. The images could be classified at least somewhat accurately only by SMEs. We had been trying for a few months to create a computer vision model to automate this task. The results weren’t good. Even with the most custom-built solution, which took into account various properties of our dataset and was capable of learning from small amounts of data without overfitting, the accuracy was close to 69%. On a binary classification problem.</p>

<p>At that stage, we had the opportunity to use Google Vision AutoML which was still in beta. The quick run results were a bit worse than ours. Eventually, we decided to run the full training, which was a bit pricey, and to make the most out of our data, we manually augmented the images to increase the dataset size. Lo and behold, 98.8% accuracy. Great success!</p>

<p>Only I was skeptical about it. After months of failed experiments, hundreds of hyperparameters tried, and dozens of methods used, I couldn’t believe some NAS could beat the problem, and do so by light-years. My superior was preparing to announce our outstanding results to the investors and other stakeholders. I insisted we inspect what was going on. A few weeks later, after a few dozen partially occluded images, total confusion, and despair, I figured it out.</p>

<p>We manually augmented the dataset before using it with Google Vision AutoML, but we didn’t manually specify the splits. As a result, augmented versions of the same image were in training, test, and validation splits. The model just memorized the images. Once we fixed it and ran it again, we got ~67%.</p>
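<p>The fix itself is a one-liner once you know it: split by source image, not by individual (augmented) sample. A minimal sketch with scikit-learn, assuming each augmented variant carries the id of its source image:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: 1000 source images, 5 augmented variants each.
X = np.random.rand(5000, 64 * 64)            # flattened images
y = np.random.randint(0, 2, size=5000)       # binary labels
source_ids = np.repeat(np.arange(1000), 5)   # id of the original image

# All variants of one source image land in the same split: no leakage.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=source_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
</code></pre></div></div>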

<blockquote>
  <p><strong>The moral of the story – don’t get comfortable with AutoML, it’ll bite you in the back.</strong></p>
</blockquote>

<h3 id="prone-to-over-optimizationover-fitting">Prone to over-optimization/over-fitting</h3>

<p>Depending on the nature of your data and your model validation setup, some AutoML solutions can easily overfit. By the nature of data I mean its properties, like label distributions, how many outliers you have, and the overall quality of your dataset. To be fair, often it’s not the tool’s fault, but yours, meaning most of the time the cause of overfitting is in your evaluation setup. So watch how you evaluate candidates and how you split your data, and if you’re working with time series – I don’t envy you. Treat the AutoML process like hyperparameter optimization, and split your data accordingly using something like <a href="https://weina.me/nested-cross-validation/">nested cross-validation</a>.</p>

<p>You can find a comprehensive guide on how to properly evaluate any machine learning model <a href="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis">here in this post</a>.</p>
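<p>For reference, a minimal nested cross-validation sketch in scikit-learn; the inner loop tunes hyperparameters, while the outer loop yields an unbiased performance estimate:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # for evaluation

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=inner_cv)

# Each outer fold re-runs the whole search, so the test folds stay untouched.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
</code></pre></div></div>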

<h3 id="too-much-emphasis-on-optimization">Too much emphasis on optimization</h3>

<p>As mentioned a few times already, the correct way to think of AutoML is as an enabler that lets you focus more on the data side of things. But in reality, many fall into the trap of thinking that model hyperparameters, and the model in general, are the most important factor in an ML project, because AutoML solutions can sometimes show excellent improvements, reinforcing this idea.</p>

<h3 id="the-resulting-models-can-be-tedious-to-deploy">The resulting models can be tedious to deploy</h3>

<p>I once had the opportunity, or misfortune, depending on when you ask me, to work on ad price forecasting. And eventually, I tried using AutoML, namely TPOT. It ran well and gave pretty good results, so we decided to have our best-performing model deployed. I was asked to convert the model into something that a Golang or, at least, a Java backend would understand because deploying Python services was a no-go.</p>

<p>After a few hours of research, I discovered <a href="https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language">PMML</a>, plus I already knew about <a href="https://onnx.ai/">ONNX</a>. Long story short, PMML-capable libs vary a lot in what models they can read. So, while my ensemble Python model generated by TPOT was somewhat unproblematic to convert to PMML format, making a Go program understand it was impossible. Why? Because the Go lib didn’t know how to work with ensembles, preprocessing, and most models except for some decision trees, linear classifiers, and maybe Naive Bayes. As for ONNX, it also proved problematic to convert a scikit-learn ensemble pipeline to ONNX.</p>

<p>Often AutoML candidate models grow very complex, and converting them into anything becomes a headache. That’s why a lot of production ML is based mostly on linear classifiers, Naive Bayes and random forests, and GBDTs. You will rarely if ever see some complex stacked ensemble of different classifiers. They are a priori slow and very hard to make fast or compatible with non-Python environments.</p>
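<p>For the curious, the conversion attempt itself is short; it’s the unsupported operators in complex ensembles that bite. A hedged sketch using skl2onnx, where <code class="language-plaintext highlighter-rouge">pipeline</code> and <code class="language-plaintext highlighter-rouge">n_features</code> are placeholders for your own fitted scikit-learn pipeline and its input width:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# 'pipeline' and 'n_features' come from your own training code.
# This is exactly where exotic AutoML-generated ensembles tend to fail:
# any operator the converter doesn't support raises an error here.
onnx_model = convert_sklearn(
    pipeline, initial_types=[("input", FloatTensorType([None, n_features]))])

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
</code></pre></div></div>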

<h3 id="hard-to-analyzedebug-the-model">Hard to analyze/debug the model</h3>

<p>Recall the Google Vision AutoML story. Google didn’t have any facilities to deeply inspect models, a la <a href="https://en.wikipedia.org/wiki/Explainable_artificial_intelligence">XAI</a>. Also, there was no way to obtain some kind of interpretability or explanations of predictions for individual images. As a result, I was stuck with obfuscating parts of input images and analyzing the predictions. Generally, explainability and debugging tools for AutoML are a special problem. AutoML-generated models tend to be quite complex, thus hard to analyze. Additionally, most of the time the complexity hits twice, because a complex model will take more time to run predictions, and this, in turn, makes obtaining explanations using black-box analysis tools even more burdensome.</p>

<p>If you’re interested in some of the most popular black-box XAI tools, check out <a href="https://alexandruburlacu.github.io/posts/2021-05-09-archive-understanding-a-black-box">this post</a>.</p>
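<p>If you’re wondering what “obfuscating parts of input images” looks like in practice, here’s a rough occlusion-sensitivity sketch, assuming a Keras-style <code class="language-plaintext highlighter-rouge">model.predict</code> that returns class probabilities:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def occlusion_map(model, image, target_class, patch=16, stride=16):
    """Slide a gray square over the image and record how much the
    target-class probability drops: a crude, black-box saliency map."""
    h, w = image.shape[:2]
    base = model.predict(image[np.newaxis])[0, target_class]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch] = 0.5   # gray box
            drop = base - model.predict(occluded[np.newaxis])[0, target_class]
            heat[i, j] = drop
    return heat   # high values mark regions the model relies on
</code></pre></div></div>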

<h2 id="automl-vs-data-scientists">AutoML vs Data Scientists</h2>

<p>Before I give you some numbers, just keep in mind that depending on the problem you’re trying to solve, your experience with AutoML will vary greatly. So, let’s dive in.</p>

<h3 id="a-word-on-automl-benchmarks">A word on AutoML benchmarks</h3>

<p>The literature on AutoML benchmarks is fairly scarce, and most often it compares the performance of AutoML solutions while omitting the performance of humans. Also, the studies are mostly about tabular datasets. Thankfully, we do have some work on establishing standardized ways to assess the performance of different AutoML solutions.</p>

<p>First, there’s the <a href="https://github.com/openml/automlbenchmark">AutoML benchmark</a>, and then there’s also a so-called Kaggle benchmark, which you can find examples of <a href="https://arxiv.org/pdf/2003.06505.pdf">in this paper</a> and in <a href="https://towardsdatascience.com/compare-popular-automl-frameworks-on-10-tabular-kaggle-competitions-9b1420e8942d">this Medium post</a>. For information on the use of AutoML/NAS in computer vision and text classification tasks, the easiest thing to do is to check the results of the <a href="https://github.com/google-research/nasbench">NAS Bench</a>(mark) and a <a href="https://www.automl.org/nas-overview/">few other competitions</a>. Still, not much comparative analysis between people-led and algorithm-led designs.</p>

<h3 id="is-all-hope-lost">Is all hope lost?</h3>

<p>No. On one hand, you can always try to run your models against the datasets mentioned above and see how good/bad you are against AutoML. But of course, this isn’t the answer you’re looking for. Enter <a href="https://arxiv.org/abs/2108.12193"><em>“Man versus Machine: AutoML and Human Experts’ Role in Phishing Detection”</em></a>. I’ll give you the gist of it, and a personal remark.</p>

<center><img src="/_data/webp/AutoMLvsNotAutoML.webp" alt="Comparisons of the AUC score and training duration of the best model built using AutoML and non-AutoML frameworks" /></center>
<center><i>Comparisons of the AUC score and training duration of the best model built using AutoML and non-AutoML frameworks* | See the article for more details</i></center>

<p>* One thing to note – Duration is calculated as the time it takes for a model to be trained on the given dataset.</p>

<ul>
  <li>
    <p>The authors conclude that AutoML models significantly outperform people when the datasets these solutions are applied to have some overlap in their classes and generally show high degrees of non-linearity. In other words, hard datasets. Otherwise, the performance is on par with not using AutoML. They also claim that AutoML solutions usually take much longer to create high-performing models compared to non-AutoML.</p>
  </li>
  <li>
    <p>And here’s the catch: the authors don’t mention the time it takes to come up with a high-performing model. Why, you may ask? Because for their non-AutoML solutions they take existing scikit-learn algorithms and don’t tune them at all. What does it all mean? First, take the duration conclusion with a grain of salt. Second, AutoML will only ever make sense for hard datasets, with noise, overlapping classes, and high degrees of non-linearity. Otherwise, you’ll be better off with the default settings of some off-the-shelf algorithm.</p>
  </li>
</ul>

<p>Their findings on the correlation between dataset complexity and AutoML advantage are quite in line with my personal experience and the results of the AutoML Benchmark, in which on more complex datasets some AutoML solutions have a 10%+ advantage in AUC and accuracy over manually created models. As you may recall from my story in the first part of AutoML cons, what took me a few months of work, Google’s AutoML almost matched in 24 hours.</p>

<p>How does all of this information help you? If you know your dataset is well-behaved, maybe don’t bother with AutoML. But how would you know? You can try running a few classic ML models, and see how their cross-validation performance varies. Or maybe just “look” at your data.</p>
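<p>A quick sketch of that first check, assuming a standard feature-matrix setup: fit a few classic algorithms and compare the spread of their cross-validation scores. Similar, stable scores hint at a well-behaved dataset; large gaps or high variance hint at a harder one.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for your own data

for model in (GaussianNB(),
              LogisticRegression(max_iter=5000),
              DecisionTreeClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: {scores.mean():.3f} +/- {scores.std():.3f}")
</code></pre></div></div>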

<p>Personally, I use AutoML first in the beginning as a quick exploration tool, and then when all hope is lost. Never in between. To help you make up your own mind about AutoML, check out the links below, and run experiments.</p>

<h3 id="further-reading--benchmarks-of-automl-methods-including-against-humans">Further reading – benchmarks of AutoML methods, including against humans:</h3>

<ul>
  <li><a href="https://towardsdatascience.com/automl-is-overhyped-1b5511ded65f">AutoML is Overhyped</a></li>
  <li><a href="https://towardsdatascience.com/automl-faceoff-2-machines-vs-15-humans-bfc9d03e590f">AutoML Faceoff: 15 Humans VS 2 Machines. Who won? | by Norm Niemer | Towards Data Science </a></li>
  <li><a href="https://towardsdatascience.com/compare-popular-automl-frameworks-on-10-tabular-kaggle-competitions-9b1420e8942d">Compare popular AutoML frameworks on 10 tabular Kaggle competitions | by Piotr Płoński | Towards Data Science</a></li>
  <li><a href="https://arxiv.org/abs/1907.00909">[1907.00909] An Open Source AutoML Benchmark</a></li>
  <li><a href="https://arxiv.org/abs/2108.12193">[2108.12193] Man versus Machine: AutoML and Human Experts’ Role in Phishing Detection</a></li>
  <li><a href="https://arxiv.org/abs/2003.06505">[2003.06505] AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data</a></li>
  <li><a href="https://arxiv.org/abs/1902.09635">[1902.09635] NAS-Bench-101: Towards Reproducible Neural Architecture Search</a></li>
</ul>

<h2 id="what-if-everyone-would-use-automl-always">What if… everyone would use AutoML, always?</h2>

<p>Before we dive into this thought experiment, recall that AutoML works by trading computation for expertise. If we are clueless and have tons of computing power, this is “The Tool”. Let’s analyze what would happen if we went all-in with AutoML in the case of a more classic, established business, and in the case of an innovative company.</p>

<h3 id="major-enterprises-like-ford">Major enterprises, like Ford</h3>

<p>Depending on which department would use AutoML instead of their existing ML/DS tools, we might have somewhat good results, for example in marketing and sales, somewhat worse results in logistics and planning, and probably absolutely rubbish results for stuff like ADAS (advanced driver-assistance systems) and simulation software. Besides, the increase in computing power required for the company to run these AutoML solutions would most certainly set them back by a non-trivial amount of cash.</p>

<p>And even if they had the money and irrationality to go all-in on AutoML, it would still be a bad idea, due to strict requirements for model interpretability, which a complex ensemble model resulting from AutoML just can’t give. Hard pass.</p>

<h3 id="innovative-companies-like-palantir">Innovative companies, like Palantir</h3>

<p>If we’re talking specifically about Palantir, I believe their software doesn’t really care whether AutoML is involved, because it’s about integrating and smartly using the data assets of an organization. Still, most of the analysis doesn’t require very advanced ML algorithms, so using AutoML would be a waste of money. Why use it when the best model is still going to be a linear regression or a decision tree? Because, again, their clientele consists of organizations that value model interpretability very much.</p>

<p>For any other innovative company, AutoML would have its place, but still within some serious limits. A lot of the time, the problems faced by these organizations can’t be simply formulated as supervised classification or regression, which makes it tricky to use AutoML.</p>

<p>The more innovative the use case, the harder it is to use off-the-shelf solutions. Can you imagine using an open-source AutoML tool to develop new drugs, or composite materials, or optimize the placement of transistors on a specialized chip? Me neither. These tasks can easily and should be treated as research directions. Is anyone in need of a startup idea?</p>

<h3 id="an-analysis">An analysis</h3>

<p>Maybe you noticed that a major problem for industry adoption of AutoML is interpretability. You might think “Oh, but maybe they haven’t heard about stuff like <a href="https://shap.readthedocs.io/en/latest/index.html">SHAP</a>, or XAI (Explainable AI) in general? That ought to change their minds”. I assure you, it won’t. Not soon, anyway.</p>

<p>You see, there’s a major difference between model interpretability and explainability. The former means that the model can be understood, as it is. The latter usually means either that there’s a way to infer why a certain prediction was made, or in more academic/cutting-edge cases, that a model will “tell you” the reasoning behind its prediction. And maybe you already see the problem here. No one can guarantee you that the explanation is correct.</p>

<p>This is the reason why, for example, there were thousands of people developing neural network-based computer vision models to detect if a patient has COVID based on their X-ray scans, and yet no major medical institution was using these. Doctors need to understand very well why the predictions were made. Likewise, legal, accounting, sales, marketing, and all the rest have different, sometimes non-negotiable requirements regarding model interpretability. And that’s why organizations are still big fans of linear models and decision trees and shy away from dense Neural Networks.</p>

<h2 id="so-what-would-be-a-good-use-case-for-automl">So what would be a good use case for AutoML?</h2>

<p>Now, let’s see some concrete use cases which can benefit the most from AutoML:</p>

<h3 id="batch-jobs">Batch jobs</h3>

<p>Most AutoML tools do not take into account model complexity/compute requirements, as a result giving you very well-tuned models which can be extremely slow or computationally demanding. Because of this, using such models is impossible in interactive or streaming scenarios, so what you’re left with is using them for batch jobs.</p>

<p>Maybe running ML as batch jobs doesn’t sound that exciting, especially after you read about incredible feats of engineering where ML models are deployed to interact directly with users, maybe even on edge devices, or how people are using ML models in streaming scenarios to process billions of events in near real-time. But trust me, a lot of businesses have processes that are absolutely fine with running on a schedule once every few hours, days, or even weeks. You’ve certainly heard that in business the quickest results beat the most accurate ones, but there are plenty of situations where accuracy is more critical than time.</p>

<h3 id="testing-the-waters-for-a-problem">Testing the waters for a problem</h3>

<p>I’ve said it before, and I will say it again – AutoML is best suited for quick prototyping. It’s my favorite use case for AutoML and one that helps me assess where an upper bound of performance might be, with my current dataset and pre-processing/feature engineering in place. When you adopt this mindset, you slowly turn towards a more data-centric ML/AI paradigm, because you just assume that you will always get an optimized model.</p>

<p>Keep in mind that this should be done <strong>after</strong> the EDA stage. Also, if possible, try to reduce the search space based on your EDA. If there are no significant correlations between attributes and the target variable, you can confidently drop linear classifiers from the search space. What I like doing is running a few quick experiments with a reduced search space using an AutoML tool, with only the simplest models and with different random seeds, for replicability, and seeing which models perform best; see the sketch below. Based on that, I can adjust the search space for the next runs.</p>
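<p>In TPOT, for instance, shrinking the search space to the simplest models is just a custom <code class="language-plaintext highlighter-rouge">config_dict</code>. A sketch, with operator paths following TPOT’s configuration format and the hyperparameter grids being placeholders:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from tpot import TPOTClassifier

# Only simple, interpretable models in the search space.
simple_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.linear_model.LogisticRegression": {"C": [0.01, 0.1, 1.0, 10.0]},
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": [3, 5, 10]},
}

# Assumes X_train, y_train are already defined, as in the earlier snippets.
for seed in (0, 1, 2):   # a few seeds, for replicability
    tpot = TPOTClassifier(generations=3, population_size=20,
                          config_dict=simple_config, random_state=seed)
    tpot.fit(X_train, y_train)
    print(seed, tpot.fitted_pipeline_)
</code></pre></div></div>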

<h2 id="takeaways">Takeaways</h2>

<p>AutoML is both a blessing and a curse. As with any tool, it can be used right to the greatest advantage, or it can be misused and then bad-mouthed.</p>

<blockquote>
  <p><strong>One thing to keep in mind is don’t abuse it.</strong></p>
</blockquote>

<p>It can be tempting to throw AutoML at any problem, even before analyzing your data or understanding your problem. Don’t be that person.</p>

<p>Another important thing you should get from this blog post: invest all the time you save using AutoML in feature engineering. Think of it this way: if you already had the best model for your dataset, what else could you do to improve the performance of your machine learning system? Obviously, you could fetch more data, ensure that the data is of higher quality, or build more informative features. Of course, AutoML won’t give you a perfect model, but the rationale holds. With modeling (almost) out of the way, and better performance still possible, you should focus on improving your data and features to reach those performance objectives. And if the results look too good – debug it.</p>

<p>Most importantly, make sure you understand very well the business requirements. So before running AutoML for hours on powerful CPUs and GPUs, take a few minutes to discuss whether your users will appreciate the slight increase in predictive performance, and won’t mind the lack of model interpretability.</p>

<p>As you can see, depending on who you ask, AutoML can mean quite different things. I recall the first time I realized that most of what is marketed as AutoML can be done with a multi-core workstation and a hyperparameter optimization library, all of it wrapped in a simple UI; I was somewhat disenchanted. As long as it works for you, I guess.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/">Data Prep Still Dominates Data Scientists’ Time, Survey Finds</a></li>
  <li><a href="https://blog.ldodds.com/2020/01/31/do-data-scientists-spend-80-of-their-time-cleaning-data-turns-out-no/">Do data scientists spend 80% of their time cleaning data? Turns out, no? – Lost Boy</a></li>
  <li><a href="https://medium.com/human-science-ai/how-i-spent-my-time-as-product-data-scientist-90e760044cd7">How I Spent My Time As Product Data Scientist | by andrew wong | Human Science AI | Medium </a></li>
  <li><a href="https://www.fast.ai/2018/07/12/auto-ml-1/">What do machine learning practitioners actually do?</a></li>
  <li><a href="https://doc.dataiku.com/dss/latest/">Dataiku Documentation</a></li>
  <li><a href="https://www.datarobot.com/platform/automated-machine-learning/">Automated Machine Learning – DataRobot AI Cloud </a></li>
  <li><a href="https://h2o.ai/platform/ai-cloud/make/h2o-driverless-ai/">H2O Driverless AI </a></li>
  <li><a href="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html">AutoML: Automatic Machine Learning — H2O 3.36.1.2 documentation </a></li>
  <li><a href="https://www.fast.ai/2018/07/16/auto-ml2/">An Opinionated Introduction to AutoML and Neural Architecture Search · fast.ai </a></li>
  <li><a href="https://arxiv.org/abs/1908.00709">[1908.00709] AutoML: A Survey of the State-of-the-Art </a></li>
  <li><a href="https://towardsdatascience.com/automl-is-overhyped-1b5511ded65f">AutoML is Overhyped</a></li>
  <li><a href="https://towardsdatascience.com/automl-faceoff-2-machines-vs-15-humans-bfc9d03e590f">AutoML Faceoff: 15 Humans VS 2 Machines. Who won? | by Norm Niemer | Towards Data Science </a></li>
  <li><a href="https://www.encora.com/insights/machine-learning-applied-to-medical-diagnosis">Machine Learning Applied to Medical Diagnosis </a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><category term="automl," /><category term="ml," /><category term="machine-learning," /><category term="deep-learning," /><category term="nas," /><category term="network-architecture-search," /><category term="hpo" /><summary type="html"><![CDATA[AutoML sounds like magic. But how effective is it? And when to better use a simpler approach?]]></summary></entry><entry><title type="html">Choosing programming languages for real-world projects</title><link href="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool" rel="alternate" type="text/html" title="Choosing programming languages for real-world projects" /><published>2022-06-17T22:00:00+00:00</published><updated>2022-06-17T22:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/choosing-a-tool</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool"><![CDATA[<p>A few years ago, when I was in my senior year at the university, during the distributed systems lecture our professor asked us a very nice question:</p>
<blockquote>
  <p>If we were to choose between a fancy new programming language, or Java/C#, for a greenfield commercial project, what would we choose and why?</p>
</blockquote>

<p>If you’re wondering what it has to do with distributed systems, I have to say - half of it was about software architecture.</p>

<p>The classroom was split into 2 camps, obviously. The fun and somewhat sad fact was that the Java camp won. I was part of that camp, even though I don’t like Java, to say the least. We had much better arguments. So, what were those winning arguments? Rich library and tooling ecosystem, and the relative availability of professionals in our local market, for a fair price too. Our professor deemed us project managers, not real programmers, then said we were right, and for a few seconds the atmosphere in the classroom turned sad and hopeless. Then we moved on with the lecture.</p>

<p><strong>TL;DR:</strong> We all want to play with the shiniest new toys, but when money is at stake, better stick to something tried and true.</p>

<p>So here are some questions to keep in mind when choosing a programming language, or any software tool for that matter, for a project. The focus will be on commercial projects, but some of the tips work for research projects and simple pet projects too.</p>

<h2 id="basic-level">Basic level</h2>

<p>Initially, the decision-making process is usually guided by a very narrow understanding of the consequences of choosing a specific tool. In increasing order of maturity, here are some basic reasons to make a choice:</p>
<ol>
  <li><em>I would like to learn this new tool/language/framework, people say it’s hot right now</em></li>
  <li><em>People say this is the best tool/language for this kind of problem</em></li>
  <li><em>I know this language/tool very well and can be very productive with it</em></li>
  <li><em>I and my team know this language/tool quite well and we can all be productive with it</em></li>
</ol>

<p>Reasons 1 and 2 are acceptable only for a pet project, with a small caveat, which I’ll explain later*. Although I would recommend sometimes taking a look at more niche, possibly peculiar tools to learn. Because, you know, <a href="https://www.goodreads.com/author/quotes/1164347.Alan_J_Perlis">if a language doesn’t change the way you think, it’s not worth learning</a>.</p>

<p>Reason 4 is a decent one, see Paul Graham’s post about <a href="http://www.paulgraham.com/avg.html">using LISP to build a startup</a>, but in the long run, it’s not that simple.</p>

<h2 id="higher-level-decision-making">Higher-level decision making</h2>

<p>The difference between programming, i.e. just getting stuff done, and software engineering is that the latter has significantly harder constraints (see <a href="https://abseil.io/resources/swe-book/html/toc.html">Software Engineering at Google</a>). Not just any code can be developed productively by a changing team of people and maintained over time. And most commercial software isn’t one-time scripts, but code that lives on for years, if not decades. Being a senior engineer, <a href="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior">among other things</a>, is also about making well-thought-out technical choices.</p>

<p>That’s why, when choosing a tool, language, or an entire stack, try to guide your decision-making with these questions, in no particular order:</p>

<ul>
  <li><em>How well documented this tool/language is?</em></li>
  <li><em>How actively used/developed is it?</em></li>
  <li><em>How many dependencies of any sort does it have?</em></li>
  <li><em>How stable this tool/language is?</em></li>
  <li><em>What is the size and quality of the ecosystem for this tool/language?</em></li>
  <li><em>How productive can someone be using this tool/language?</em></li>
</ul>

<p>More constraints, but doable.</p>

<h2 id="business-level-decision-making">Business-level decision-making</h2>

<p>Now we’ve reached the final frontier. Until now, it wasn’t particularly hard to make a choice, you just had to do your research. But now, we’re gonna have to enter the realm of never-ending trade-offs. Keep in mind that software is written by people, whom you have to employ and pay salaries to, while ideally keeping a positive return on investment.</p>

<ul>
  <li><em>How easy is it to teach someone, or how much time does it take to make someone productive with the given tool/language?</em></li>
  <li><em>How much reachable supply of professionals is out there for this tool/language? Is it sufficient for you?</em></li>
  <li><em>How much do professionals who are knowledgeable with this tool/language ask for (money, perks, whatever)?</em></li>
  <li><em>What is the quality of the supply? Are the engineers mostly newbies or seasoned professionals?</em></li>
  <li><em>How many people would like to work with the chosen tool/language? How excited are they?</em></li>
</ul>

<p>The raw performance of a tool or language is rarely a big issue. Some domains do care about that characteristic, like scientific computing, low-latency systems, and maybe embedded systems. More recently, how energy-efficient, or “green”, a language or tool is has been gaining importance. Yes, <a href="https://docente.ifsc.edu.br/mello/livros/java/paperSLE.pdf">I’m not kidding</a>. For example, <a href="https://aws.amazon.com/blogs/opensource/sustainability-with-rust/">Amazon cares</a> about such things, although like all things at this level, it’s <a href="https://news.ycombinator.com/item?id=30441771">not so simple</a>.</p>

<h3 id="an-example-of-picking-a-language">An example of picking a language</h3>

<p>Let’s do a “demo”. We will assume that we’re a remote-first startup and we want to build <del>a snowman</del> a serverless platform. How do we pick the programming stack? Well, at least the programming language. We will assume that the technical founders are capable of writing in any language. No, they are not <a href="https://en.wikipedia.org/wiki/Spherical_cow">spherical</a>.</p>

<p>An important technical constraint for our project is that serverless technology is especially effective when the startup time of a serverless function is short. If it’s not, why bother? Optionally, we might want to dive into serverless edge computing, meaning we need a programming language that can work even on resource-constrained devices. Maybe not microcontrollers, but something like a newer Raspberry Pi shouldn’t be considered unrealistic.</p>

<p>We are also budget-constrained because we’re a startup. We need to execute fast, or else we might not reach escape velocity, and no one will bother.</p>

<p>With that said, let’s prune some candidates. Because of our startup latency constraint, we can’t afford to run anything which needs a VM-like runtime. So no Java, C#, or even Erlang or Elixir. Although Erlang and Elixir have less severe problems with VM cold starts, they have another downside: a smaller talent pool. On yet another hand, this talent pool is usually very enthusiastic and professional. I personally love Elixir, it’s just a pleasure to write, <a href="https://alexandruburlacu.github.io/posts/2021-05-07-elixir-pattern-matching-magic">see why</a>. What a shame we’re not building a messaging system.</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Java</td>
      <td>No</td>
      <td>Very Large</td>
      <td>Very Good</td>
      <td>Can we go lower?</td>
      <td>Half of Java jokes are about this</td>
    </tr>
    <tr>
      <td>C#</td>
      <td>No</td>
      <td>Large</td>
      <td>Very Good</td>
      <td>A bit better than Java</td>
      <td>A bit better than Java</td>
    </tr>
    <tr>
      <td>Elixir/Erlang</td>
      <td>No</td>
      <td>Small</td>
      <td>Good</td>
      <td>Almost through the roof</td>
      <td>Good, for a VM-based language</td>
    </tr>
  </tbody>
</table>

<p>If we are planning for maximum efficiency, maybe we should use C++? Definitely not. C++ is quite dangerous. Besides, we need to keep in mind that we want to develop fast and preferably without much risk of segmentation faults, resource leaks, and other C++ surprises. Btw, a good C++ dev is quite expensive and hard to find nowadays.</p>


<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>C++</td>
      <td>No</td>
      <td>Moderate</td>
      <td>Moderate, hard to use IMO</td>
      <td>Depends on what kind of person you are</td>
      <td>Sonic the hedgehog approves</td>
    </tr>
  </tbody>
</table>

<p>We know that development speed is important. But we also want a performant language without VM cold-start problems. How about Python, or JS? These are popular, fast to work with, with a considerable talent pool, and JS can be speedy too. To be fair, this wouldn’t be the worst idea. Python, specifically CPython, can be slow, but with the right tooling, or by substituting it with <a href="https://www.pypy.org/">PyPy</a>, we can solve these problems. As for JS, one issue is that the language is not the most pleasant to debug, with its <a href="https://javascriptwtf.com/wtf/javascript-holy-trinity">unholy trinity of no-values</a> and subpar traceback messages. Regretfully, there are lots of not-so-good devs out there practicing these tools, so that’s an issue. Finally, these are not the best systems programming languages.</p>


<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>JS</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>Depends on what flavor you’re using</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (CPython)</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>It will be a bummer that it’s not used for DS/ML/AI</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (PyPy)</td>
      <td>Maybe/Yes</td>
      <td>Very Large (but there’s a catch)</td>
      <td>Good</td>
      <td>If you know, you know</td>
      <td>Good, and it’s very fast overall</td>
    </tr>
  </tbody>
</table>

<p>Ok, so I said it, systems programming languages. And we dropped C++. What do we have left? <a href="https://golangdocs.com/system-programming-in-go-1">Go</a>, <a href="https://msrc-blog.microsoft.com/2019/07/22/why-rust-for-safe-systems-programming/">Rust</a>, <a href="https://crystal-lang.org/">Crystal</a>. We drop Crystal right away due to the lack of a sizeable community, talent pool, and libraries. So, it’s Go vs Rust? Hold on, there’s another contestant - <a href="https://ocamlverse.github.io/content/systems_programming.html">OCaml</a>. So, why did it come down to these 3 languages? All of them are very suitable for systems programming, that is, interacting with lower-level OS constructs; they are quite efficient working closer to hardware, and in general are fast and resource-efficient. Of the 3, Go is the most mainstream, which is a plus. Also, it’s easy to onboard people to it. On the other hand, Rust and OCaml provide nicer guarantees for the programs you write, and although less popular, the quality of developers using them is usually pretty high. OCaml and Rust are pretty close idiomatically, but Rust’s syntax will be much more familiar to non-hardcore-FP people, aka common folk, so it’s probably 10 points to Rust. All in all, let’s see the final table.</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Java</td>
      <td>No</td>
      <td>Very Large</td>
      <td>Very Good</td>
      <td>Can we go lower?</td>
      <td>Half of Java jokes are about this</td>
    </tr>
    <tr>
      <td>C#</td>
      <td>No</td>
      <td>Large</td>
      <td>Very Good</td>
      <td>A bit better than Java</td>
      <td>A bit better than Java</td>
    </tr>
    <tr>
      <td>Elixir/Erlang</td>
      <td>No</td>
      <td>Small</td>
      <td>Good</td>
      <td>Almost through the roof</td>
      <td>Good, for a VM-based language</td>
    </tr>
    <tr>
      <td>C++</td>
      <td>No</td>
      <td>Moderate</td>
      <td>Moderate, hard to use IMO</td>
      <td>Depends on what kind of person you are</td>
      <td>Sonic the hedgehog approves</td>
    </tr>
    <tr>
      <td>JS</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>Depends on what flavor you’re using</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (CPython)</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>It will be a bummer that it’s not used for DS/ML/AI</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (PyPy)</td>
      <td>Maybe/Yes</td>
      <td>Very Large (but there’s a catch)</td>
      <td>Good</td>
      <td>If you know, you know</td>
      <td>Good, and it’s very fast overall</td>
    </tr>
    <tr>
      <td>Crystal</td>
      <td>No</td>
      <td>Very Small</td>
      <td>So-so</td>
      <td>If you know, you know v2</td>
      <td>Very Good, and it’s blazing fast overall</td>
    </tr>
    <tr>
      <td>Rust</td>
      <td>Maybe/Strong Yes</td>
      <td>Small-Moderate</td>
      <td>Moderate</td>
      <td>Almost through the roof</td>
      <td>Very good, and it’s very fast overall</td>
    </tr>
    <tr>
      <td>Go</td>
      <td>Yes</td>
      <td>Large</td>
      <td>Good</td>
      <td>Pretty good</td>
      <td>Good, and it’s very fast overall</td>
    </tr>
    <tr>
      <td>OCaml</td>
      <td>Maybe/Yes</td>
      <td>Small</td>
      <td>Moderate</td>
      <td>Almost through the roof, but only for FP geeks</td>
      <td>Very good, and it’s very fast overall</td>
    </tr>
  </tbody>
</table>

<p>All things considered, probably the safest choice would be to use Go. The next best thing would be Rust. A very good option would be PyPy, IMO. It’s almost 1-to-1 equivalent to CPython, but considerably faster. If you like it more hardcore-FP, you could try OCaml. You could in fact go polyglot and pick 2 languages, but don’t escalate to more than that. There’s a reason most full-stack engineers write JS only.</p>

<h2 id="time-to-discuss-that-caveat">*Time to discuss that caveat.</h2>

<p>Yes, picking a tool only because it’s <em>hot</em> or seems interesting is risky and will rarely be a good idea, except when it is. You see, a tool is usually “hot” for a reason. Maybe it’s solving a common pain in the industry, and does so elegantly. Or maybe it boosts productivity, efficiency, or the long-term maintainability of a system. Still, this alone isn’t enough to justify such a risky move.</p>

<p>On the other hand, there’s an interesting aspect here. If a tool is hot, people will want to work with it. This phenomenon boosts the desire to work for your team/business because you’re using this New Hot Thing ©. Combined with the intrinsic qualities of the new tool, it might make sense to actually give it a try. It is just as risky to never take a risk. Failing to grow and innovate will leave your business hard to hire for, your talent pool shrinking, and your operational efficiency slowly dying.</p>

<center><img src="/_data/bell_curve_languages.jpg" /></center>
<center><i>Follow sage's advice 😏 Made with: imgflip.com</i></center>

<h2 id="a-substitute-for-a-conclusion">A substitute for a conclusion</h2>

<p>I hope I haven’t fried your brains with this many things to consider. Even I sometimes don’t do the whole process, or am being sloppy when assessing some of the aspects. Still, having a checklist of things to consider is always a good thing, so I hope you’ll benefit from this.</p>

<p>Maybe a bit anti-climactic, but consider this: if you picked the wrong tool, it will rarely doom your project to failure. What will doom it is not realizing you made a bad choice, and not trying to fix it. Technical stacks are problems which can be fixed with money, and that’s a good thing.</p>

<p>Not the ending you expected? 😏</p>

<h3 id="ps">P.S.</h3>
<p>I should add a clarification about Java. Don’t get me wrong - I don’t “hate” Java, I just like pointing at its flaws, sometimes vehemently 😀. Java’s unnecessary verbosity is the main issue that I have with it. It isn’t the only issue, but with the sped-up release cycle and a lot of ideas borrowed from other languages and communities, Java is becoming a better language. Brilliant engineers use Java for many important, actively developed projects with no plans to retire or rewrite them. Ergo, it can’t be an objectively “bad” language.</p>

<!-- Also, on a more philosophical note, keep in mind - Java was created for mass producing of software, where developers would become interchangeble. From a business point of view, this is a very good idea. But from a craftsman's point of view, this is sad and uninspiring. Also this thing become so popular because Sun marketed it as hell and people started to believe Java is good. -->

<h3 id="2022-11-09-update">2022-11-09 Update</h3>

<p>I came across <a href="https://boringtechnology.club/">this amazing presentation</a>. It’s closely related to the arguments I propose, although it puts greater importance on the <code class="language-plaintext highlighter-rouge">Basic Level &gt; 3rd point</code> decision factor. Even if the factor initially seems simplistic, there’s sophistication in simplicity, and the author of this presentation does a great job uncovering it. TL;DR, it’s good, on topic, and I recommend you check it out after reading my article 😀.</p>

<h4 id="a-little-disclaimer">A little disclaimer</h4>

<p>These posts were almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn’t be nice, to say the least, to post it back then. In Moldova, there’s a saying “Satu’ arde da baba sî chiaptănă” which translates to something like “The (unreasonable) old lady is grooming while the whole village burns”. I didn’t want to be that lady, so I thought it would be better to wait until things become at least somewhat less chaotic.</p>

<p>#Слава Україні! #Героям слава!</p>]]></content><author><name></name></author><category term="posts" /><category term="software" /><category term="engineering," /><category term="programming," /><category term="programming" /><category term="languages," /><category term="decision" /><category term="making," /><category term="frameworks," /><category term="java," /><category term="kotlin," /><category term="lisp," /><category term="python," /><category term="go," /><category term="golang," /><category term="rust," /><category term="rustlang," /><category term="erlang," /><category term="elixir," /><category term="ocaml," /><category term="software," /><category term="engineering," /><category term="senior," /><category term="leadership" /><summary type="html"><![CDATA[How to pick a tool, language, or framework when real money and the business is at stake. What to consider when faced with this kind of situation.]]></summary></entry><entry><title type="html">Becoming a Senior Engineer</title><link href="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior" rel="alternate" type="text/html" title="Becoming a Senior Engineer" /><published>2022-05-23T20:00:00+00:00</published><updated>2022-05-23T20:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/becoming-senior-engineer</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior"><![CDATA[<p>Disclaimer. This post is based on frequent discussions with many of my friends and acquaintances who work in IT/Software Engineering, in a lot of different places, like outsourcing companies, product companies, big organizations with established processes, and small ones where <a href="https://www.youtube.com/watch?v=4L2ooG_MX9E">chaos reigns</a>.</p>

<p>Anyway, based on all this tribal wisdom, anecdotes, and my own experience and observations, there are four base properties/skills/traits which are a sure-fire way to grow into a senior/leadership position at your organization.</p>

<h1 id="the-four-axes">The four axes</h1>

<p>Depending on the organization, a mix of these four traits is necessary to take the mantle of a “senior engineer”.</p>

<ol>
  <li>
    <p><strong>Business acumen</strong> - You know how stuff works in your organization. You know the processes, the people, and the relations among all. You understand the vision and priorities of your company. You also have a rough idea of the company’s risk tolerance, budget, and context in which it operates. You know why certain things are done the way they are.</p>
  </li>
  <li>
    <p><strong>Communication skills</strong> - You explain your thoughts crystal clear. And I can’t stress this enough! If you can’t explain your thoughts in a clear, accessible way, you will impede not only your career prospects but others’ productivity too. The better you talk and write, the better everyone will understand what needs to be done, how to fix things, or what the roadmap is… you get the idea. Besides, the more senior you get, and the bigger the organization you have to work in, the more you’ll have to write and communicate with people. Especially now with all the work done remotely, you need clear writing like never before. Chats, emails, JIRA tickets, code reviews, meeting notes, post-mortems; this list can go on forever.
Another subskill deserving a place here would be explaining technical stuff to non-technical people. This is especially important when you have to deal with non-technical stakeholders of your projects. They very much appreciate the effort you put into explaining what’s going on without much technical jargon. Imagine if nuclear physicists explained what they are doing using their jargon. You wouldn’t understand a thing. Been there, done that. So be empathetic, and talk to people in a way they can understand.</p>
  </li>
  <li>
    <p><strong>Being a force multiplier</strong> - You have good coaching/mentorship skills. You also always think of ways to enable people to do a better job. Maybe by creating a script to automate something, or by creating a shared document explicitly telling how some process is done and why, or just being a knowledgeable and pleasant colleague to discuss issues and ideas with.</p>
  </li>
  <li>
    <p><strong>Superior hard skills</strong> - You are one of the most knowledgeable people in your organization/community on some technology/practice/domain. You have superior skills, and for that, you are respected. Part of this is superior debugging skills. More often than we’d like, we have to fix code that’s not working. The quicker this can be done, the more time is left for feature development, which is so important for the business. You think beyond just lines of code and understand the architecture and the tradeoffs which lie at its foundations. You understand that sometimes DRY is not a good idea, where you should apply design patterns, and where it’s ok not to. Also, good coding skills are infectious. People will see your beautiful code and will want to do the same. In a way, you’ll be a force multiplier, by influencing others to write better code, which in turn will make the codebase a nicer environment.</p>
  </li>
</ol>

<p><img src="/_data/webp/skills_radar_simple.webp" alt="The four main skills axes for a senior engineer are business acumen, communication, hard-skills and being a force multiplier" /></p>

<h2 id="the-a-potential-path-to-senior-positions"><del>The</del> A potential path to senior positions</h2>

<p>Let’s say you were hired as a software engineer, maybe even a junior one. You aspire to become a senior. What do you do?</p>
<ul>
  <li>Learn your project.</li>
  <li>Learn why your project is important. Who are its users? What’s the roadmap? How does it make/save money?</li>
  <li>Learn more nuanced technical skills. Maybe read a few books. Iterate on this.</li>
  <li>Spot inefficiencies in your team’s processes, try to ease these through explicit processes, helper tools, or any other way. Iterate on this.</li>
  <li>Make friends with colleagues outside your team, maybe even outside your business function.</li>
</ul>

<p>Do all these, and you will certainly be allowed to lead some projects or initiatives.</p>

<h3 id="a-warning-note">A warning note</h3>

<p>Everything in excess becomes harmful. Depending on the organizational culture of your employer, being overly interested in the hows and whys of the business might seem nosy. And if your intentions are perceived this way, you might damage your reputation instead of growing it. The same goes for strong initiatives to help your colleagues or the business. This one is more nuanced. It might be (and usually is) that your manager or colleagues are not <a href="https://www.dictionary.com/browse/dicks">unpleasant, counterproductive, or trying to dismiss your genius</a>; they just know that some stuff has been tried already, or that the current priorities do not leave space for such initiatives. Remember to be respectful, not too annoying, and if all else fails, start searching for another job.</p>

<h2 id="some-misc-skills-youll-also-need">Some misc skills you’ll also need</h2>

<p>I would argue the four traits above are crucial to becoming a senior engineer in any organization. But I’d also like to include the following 3 skills too. Let’s label them as <em>very good to have</em>.</p>

<ul>
  <li><strong>Attention to detail</strong>. Sloppily done tasks take a big hit on your karma. Depending on your place of employment, this could range from writing code that works well without immediately visible issues, to writing high-quality code, with good tests and without breaking the CI.</li>
  <li><strong>Humility</strong>. You know, don’t be an <a href="https://www.dictionary.com/browse/dick">unpleasant, counterproductive, or dismissive</a> person. If no one wants to work with you, you will either be put on the worst projects in your company or straight up fired. Note, don’t confuse <a href="https://tomhazledine.com/humility-in-tech/">humility</a> with low self-esteem.</li>
  <li><strong>A growth mindset</strong>. If you learned something to land a job and, once there, decided to sit still on your ass, I’m afraid your only chance to become senior is by having the rest of your colleagues <a href="https://en.wikipedia.org/wiki/Bus_factor">hit by a bus</a>. Stagnation should never be an option.</li>
</ul>

<p><img src="/_data/webp/skills_radar_full.webp" alt="A more complete picture of the necessary skills for a senior engineer should also include attention to details, humility, and a growth mindset" /></p>

<p>Of course, there are always exceptions, people who hold senior or technical leadership positions without these skills, but they are that - exceptions. So, it’s better to also be humble, attentive, and with a growth mindset than not to be.</p>

<h1 id="some-edge-cases">Some edge cases</h1>

<ul>
  <li>
    <p><strong>Senior engineer is the one who stayed the most with the company</strong>. This distills down to business acumen. She/he knows how things are done in the organization, and knows the codebase very well. Some communication and hard skills are also necessary. This path is prone to the “old junior” problem. “Old juniors” are a case you wouldn’t want to be in. It happens when someone stays with a company/product for too long without substantially growing their skills, but only acquiring business acumen. People in this situation remain stuck in their companies because of a growing chasm between their title and their actual skills.</p>
  </li>
  <li>
    <p><strong>Team leads</strong>. They usually are strong on Communication skills/being a force multiplier, and most are pretty good on the hard skills side too, but YMMV. A good team leader is an important asset for any organization, they are like <a href="https://civilization.fandom.com/wiki/Great_General_(Civ6)">Great Generals</a> for their teams.</p>
  </li>
  <li>
    <p><strong>An outsider is hired as a senior/lead right away</strong>. This does happen, and is more common in smaller organizations, in freshly established departments, or in new and specialized teams. Such people are almost always strong in hard skills and usually in communication skills. Occasionally, they may have very good business acumen because they have worked in similar industries before.</p>
  </li>
</ul>

<p>Remember, you need a mix of these. Having only hard skills won’t cut it. You’ll be just a very good software engineer. Nor will business acumen alone help you; it will just turn you into a mediocre manager in the best-case scenario, or the terror of the engineering team in the worst case. And if you’re only good at being a force multiplier? Have you heard about the Scrum master position?</p>

<h1 id="takeaways">Takeaways</h1>

<ul>
  <li>Ask questions about the business/product. Show interest in how things are done within your organization.</li>
  <li>Level up your communication skills and help your team. Technical writing, working on enabling tasks, and mentorship are some of the most important ones. You can level these up by volunteering to document some nasty parts of the codebase, describing internal processes, and working on/proposing tools to increase the productivity of your team. Mentorship skills can be acquired by either asking to be the mentor for new hires, or you can try teaching outside of work, CoderDojo-like organizations being probably the best at this.</li>
  <li>Learn hard skills. Read books. Work on pet projects, to crystalize the knowledge you got from reading. Being part of a specialized community will also help you grow your hard skills, by learning advanced concepts you won’t find by just googling, because you wouldn’t even know what to google. Reddit is pretty good at this, sometimes. Also, slack/gitter/discord groups, interested in specific technology are good too. If you use it right, Twitter and YouTube are also excellent channels for this.</li>
</ul>

<p>By the way, notice that throughout the whole post, there was no mention of years of experience. Of course, some of the traits outlined above correlate with years of experience, but the correlation is not perfect, meaning you could have 10 YoE and still not be as good as someone with 4 YoE. So focus on skills, not on mileage.</p>

<h2 id="before-i-go">Before I go</h2>

<p>Maybe this will be news to someone, but being a Senior is not the end of the road. Of course, many know about the “move into management” path. But there’s another way. Becoming a Staff software engineer. How? I don’t know yet. When I do, I’ll certainly write another blog post. Until then, I’ll leave you with <a href="https://www.reddit.com/r/ExperiencedDevs/comments/ltsoao/how_do_you_differentiate_a_staff_engineer_from_a/">this Reddit thread</a> and <a href="https://staffeng.com/book">this book</a>.</p>

<h4 id="a-little-disclaimer">A little disclaimer</h4>

<p>These posts were almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn’t be nice, to say the least, to post it back then. In Moldova, there’s a saying “Satu’ arde da baba sî chiaptănă” which translates to something like “The (unreasonable) old lady is grooming while the whole village burns”. I didn’t want to be that lady, so I thought it would be better to wait until things become at least somewhat less chaotic.</p>

<p>#Слава Україні! #Героям слава!</p>]]></content><author><name></name></author><category term="posts" /><category term="career," /><category term="career" /><category term="advice," /><category term="senior" /><category term="engineer," /><category term="leadership," /><category term="staff" /><category term="engineer," /><category term="software" /><category term="engineer," /><category term="programming," /><category term="machine" /><category term="learning," /><category term="skills" /><summary type="html"><![CDATA[Some advice how to grow to a senior engineering role. What skills are most valuable for a senior software engineering career, and how to aquire them.]]></summary></entry><entry><title type="html">Going beyond simple error analysis of ML systems</title><link href="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis" rel="alternate" type="text/html" title="Going beyond simple error analysis of ML systems" /><published>2021-07-26T00:10:00+00:00</published><updated>2021-07-26T00:10:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/ml-error-analysis</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis"><![CDATA[<h1 id="first-there-was-a-story">First, there was a story…</h1>

<p>Imagine yourself working as an ML engineer… very cool my friend!</p>

<p>First of all, congratulations, pat yourself on the back, your family must be proud.</p>

<p>Second, depending on the company size, culture, and the maturity of the machine learning team, you’re most likely in for a wild ride through many computer science and software engineering domains.</p>

<p>Again, pat yourself on the back. Now, let’s get to the chase.</p>

<p>As an MLE, part of your work is to pick, tune, and deploy ML models. I believe I don’t need to explain to you that this is not so trivial. You probably believe that the hard part of this process is tuning the model, don’t you? Or maybe deploying the algorithm? Although these are indeed non-trivial, especially the latter, here’s <em>The Question ©</em> for you:</p>
<blockquote>
  <p><strong><em>How do you make sure you have a high-quality model in production?</em></strong></p>
</blockquote>

<p>If you’re gonna tell me that you just tested your model on a held-out dataset and that your metric of choice was something like accuracy, or the mean squared error, just run. Fast. Far away. If you didn’t run, be prepared to be questioned whether or not you:</p>
<ul>
  <li>had a baseline,</li>
  <li>balanced the dataset or adjusted your metrics,</li>
  <li>used the held-out dataset for tuning/hyperparameter search 
… and so on.</li>
</ul>

<center><img src="/_data/nested_anakin.jpg" alt="So many questions... Made with: imgflip.com" /></center>
<center><i>So many questions... Made with: imgflip.com</i></center>

<p>I guess you figured out by now that a simple train/test split and a few error metrics, like accuracy or maybe even F1*, are not nearly enough to answer <em>The Question ©</em>. But what <em>would</em> be enough? Well, it depends, like all things in software engineering. You need to understand that reducing your model’s characteristics to only one or a few scalars forfeits way too much information about the model.</p>

<p><em>* F1 score is a much better choice, btw</em></p>

<h1 id="-and-then-words-of-wisdom-followed">… and then words of wisdom* followed</h1>

<p><em>* - more like personal war stories</em></p>

<blockquote>
  <p>Disclaimer, this is a long post, so maybe brew some tea/coffee, get a snack, you know, something to help you get through the whole thing. Maybe taking notes would help you to stay focused. It certainly helps me when reading a lot of technical text.</p>
</blockquote>

<p>Another little disclaimer: I had <a href="https://alexandruburlacu.github.io/posts/2021-05-09-archive-understanding-a-black-box">an older post</a> tangential to this topic, but the focus in it was on interpretability/explainability methods. In this blog post, I focus more on how to assess the errors of machine learning models. If you think these topics are pretty close to each other, somewhat overlapping, you are right. To better evaluate a model, we sometimes need to understand the “reasoning” it puts into making a prediction.</p>

<!-- The motif of this article is **_understanding how, by how much, and (maybe) why a machine learning model fails?_** -->

<p>Keep in mind - depending on the domain you apply machine learning to, a subpar model could be anything from a little annoyance for your users to a complete dumpster fire that amplifies biases and makes your customers run away from your business. While it could be easy for said users to opt out from the former, the latter can ruin your business. We don’t want that. Your employer certainly doesn’t.</p>

<p>Ok, copy that. But how do you <em>know</em> that a machine learning model is good? Do you need to understand its predictions? Does your use case have a specific group of users that you care about the most? These questions can help you derive an evaluation strategy and in turn to make sure nothing goes south after you deploy an ML model.</p>

<p>You know what, let me first define a few ML evaluation maturity levels. It will be easier for me to explain and for you to follow along. For now, don’t bother about the meaning of some more advanced terms here, I will explain them right after this section.</p>

<ul>
  <li><strong>Level 0 (L0)</strong>: Having a train+test split and one or two generic metrics, like MSE or Accuracy. At this level, deploying the ML model is not advised (read: irresponsible at best).</li>
  <li><strong>Level 1 (L1)</strong>: Previous level, but using cross-validation if possible, or worst-case scenario, having a big and diverse test set. You will need to have per-class metrics for classification problems or multiple metrics for regression problems. For classification use cases, metrics like the ROC-AUC score or F1 score are considerably better than accuracy, so use these. Moreover, understanding your model’s precision and recall characteristics can prove crucial for a successful ML product. In the case of regression, <a href="https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e">MAPE+RMSE+Adjusted R^2</a> are a good combination; you can consider using <a href="https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other">AIC and/or BIC</a> too. For regression, try to have at least one metric robust to outliers (<a href="https://www.h2o.ai/blog/regression-metrics-guide">MAPE is robust to some types of outliers, but not others</a>).</li>
  <li><strong>Level 1.1 (L1.1)</strong>: Check most wrong predictions, that is, entries with high prediction confidence, but that are predicted wrong. It can help you uncover error patterns, maybe even biases.</li>
  <li><strong>Level 2 (L2)</strong>: Perturbation analysis using counterfactuals and random alterations of input values. Usually, such an approach permits an understanding of feature importance for each entry, but that is more like a bonus you have to work to get.</li>
  <li><strong>Level 2.1 (L2.1)</strong>: <a href="https://scikit-learn.org/stable/modules/partial_dependence.html">ICE/PDP</a>/<a href="https://christophm.github.io/interpretable-ml-book/ale.html">ALE</a> plots can be used to better understand feature importances. Keep in mind these are fairly compute power demanding.</li>
  <li><strong>Level 2.2 (L2.2)</strong>: Surrogate local explanations (usually LIME) and/or additive feature explanations (i.e. SHAP) to understand model predictions before approving the model for deployment. Also computationally demanding.</li>
  <li><strong>Level 3 (L3)</strong>: Cohort-based model inspection. One way to define cohorts is through <a href="https://github.com/uber/manifold">Manifold</a>-like error groupings.
    At this level, it’s important to acknowledge the changes in data distributions and if applicable, to evaluate on data from different periods. Believe me when I tell you this, sometimes feature and/or label distributions can change even in domains where you don’t expect them to. And not accounting for this will give you some royal headaches.</li>
  <li><strong>(Optional) Level 4 (L4)</strong>: Adversarial examples checking. Stuff like Anchors and TCAV are at this level too. In principle, any other advanced model interpretability/explainability or security auditing is at this level.</li>
</ul>

<center><img src="/_data/evolution.jpg" alt="Power levels. Don't be L0. Made with: imgflip.com" /></center>
<center><i>Power levels. Don't be L0. Made with: imgflip.com</i></center>

<p>You would want to be at Level 1 when launching a model in beta, Level 2 when it’s in production, and from there grow to Level 3. Level 4 is more specific, and not every use case requires it. Maybe you are using your ML algorithms internally and there’s a low risk of malicious agents trying to screw you; in this case, I doubt you need to examine the behavior of your model when fed adversarial examples, but use your own judgment.</p>

<p>Note that although I mention regression use-cases, I omitted a lot of info about time-series forecasting. This is done on purpose, because the topic is huge, and this post is already a long-read. But if you have a basic understanding of what’s going on here, you can map different time-series analysis tools onto these levels.</p>

<h1 id="methods">Methods</h1>

<p>Let’s roughly cluster evaluation/error analysis methods into three broad categories: (1) metrics, (2) groupings, and (3) interpretations. Metrics are kind of obvious. Groupings are probably the most abstract: we put train/test splits, cross-validation, input data cohorts, and error groupings in this… oh god… group (no pun intended). Finally, under the interpretation umbrella fall such things as surrogate local explanations, feature importance, and even analyzing the most wrong predictions, among other things.</p>

<h2 id="metrics">Metrics</h2>

<p>I won’t dive deep into metrics-based evaluations, but will mention that depending on your use case you might want to consider metrics that are non-linear in their relation to how wrong the prediction is. Maybe you’re fine with a bit of error, but if the model is very wrong, or frequently wrong, you want to penalize it disproportionately more. Or, on the contrary, as there are more wrong predictions, or as the total loss of the model grows, you want a log-like behavior for your metric, i.e. the metric attenuates its growth as the model gets more wrong.</p>

<p>Furthermore, on the matter of metrics that are robust to outliers: sometimes these are merely nice to have, if you do some outlier removal beforehand. Other times they are a necessity, in cases when you can’t, or specifically don’t, remove the outliers, for whatever reason. Keep that in mind.</p>
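<p>To make the outlier point concrete, here’s a tiny, hedged sketch with made-up numbers, comparing how MSE and MAE react to a single very wrong prediction:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true     = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred_ok  = np.array([2.8, 5.2, 2.4, 6.9, 4.1])   # small errors everywhere
y_pred_out = np.array([2.8, 5.2, 2.4, 6.9, 14.0])  # one very wrong prediction

for name, y_pred in [("small errors", y_pred_ok), ("one outlier", y_pred_out)]:
    print(name,
          "| MSE:", round(mean_squared_error(y_true, y_pred), 2),
          "| MAE:", round(mean_absolute_error(y_true, y_pred), 2))
# MSE blows up quadratically on the outlier, while MAE grows only linearly.
</code></pre></div></div>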

<center><img src="https://scikit-image.org/docs/dev/_images/sphx_glr_plot_ransac_001.png" alt="Effects of outliers on model fitness. Source: https://scikit-image.org" /></center>
<center><i>Effects of outliers on model fitness. Source: https://scikit-image.org</i></center>

<p>Usually, in production scenarios, you will want to assess your model performance on different cohorts, and maybe even use different models for different cohorts. A cohort means a group of entities sharing a specific grouping criterion, like an age bracket, a location, or maybe something else.</p>

<h2 id="groupings">Groupings</h2>

<p>I mentioned cohorts in the paragraph above, so it will make sense to follow up on this. Cohorts are important because your stakeholders are interested in these, sometimes you might be too, but the business is usually the number one “fan” of cohorts. Why? Well, it could be due to many reasons. Maybe they are especially interested in providing top-notch services for a special group of customers, or maybe they must comply with some regulations that ask them for a specific level of performance for all the users.</p>

<p>Moreover, your dataset is most certainly skewed, if it’s real-world data. Meaning, you will have underrepresented classes, all sorts of imbalances, and even different distributions for your features for each class/group of classes. For example, it wouldn’t be ok for any business to give subpar recommendations for users outside the North America region, or to predict that <a href="https://www.cnet.com/news/google-apologizes-for-algorithm-mistakenly-calling-black-people-gorillas/">a person of color is some kind of ape</a>.</p>

<p>We need to create cohorts, or groups, based on some characteristics, and track the performance of our machine learning systems across these. Often you will discover that the teams who are conscious about their cohorts will deploy different models for different user groups, to ensure high-quality service for everyone.</p>
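<p>As an illustration, here’s a minimal sketch of per-cohort metric tracking with pandas. The column names (<code class="language-plaintext highlighter-rouge">age_bracket</code>, <code class="language-plaintext highlighter-rouge">y_true</code>, <code class="language-plaintext highlighter-rouge">y_pred</code>) are hypothetical; substitute your own cohort keys and predictions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
from sklearn.metrics import f1_score

# Toy data; in practice this would be your logged predictions.
df = pd.DataFrame({
    "age_bracket": ["18-25", "18-25", "26-40", "26-40", "40+", "40+"],
    "y_true":      [1, 0, 1, 1, 0, 1],
    "y_pred":      [1, 0, 1, 0, 1, 1],
})

per_cohort_f1 = df.groupby("age_bracket").apply(
    lambda g: f1_score(g["y_true"], g["y_pred"], average="macro")
)
print(per_cohort_f1)  # a cohort scoring much worse than the rest is a red flag
</code></pre></div></div>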

<p>But groupings aren’t just cohorts based on input data characteristics. Sometimes for model analysis, it makes sense to create groupings based on errors. Some kind of groupings by the error profile. Maybe for some inputs your model(s) gives low errors, for other inputs some very high errors, and for yet another group the error distribution is entirely different. To uncover and understand these, you could use <a href="https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick">K-Means</a> to cluster your losses and identify the reason your model might fail or just underperform. That’s what Manifold from Uber does, and that’s just brilliant!</p>
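<p>Here’s a rough, Manifold-inspired sketch of the idea (not Manifold’s actual API): compute a per-sample loss, cluster the losses with K-Means, and summarize each error group. The dataset and model are stand-ins for your own:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

# Per-sample log loss: negative log-probability of the true class.
proba = model.predict_proba(X)
per_sample_loss = -np.log(np.clip(proba[np.arange(len(y)), y], 1e-12, None))

groups = KMeans(n_clusters=3, random_state=0).fit_predict(
    per_sample_loss.reshape(-1, 1))
for g in range(3):
    mask = groups == g
    print(f"group {g}: n={mask.sum()}, mean loss={per_sample_loss[mask].mean():.3f}")
# Next step: compare feature distributions across the groups to figure out
# why some inputs are much harder for the model.
</code></pre></div></div>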

<center>
<span>
<img src="/_data/webp/error_dist_cluster.webp" alt="A violin plot to compare two ML models on error groups identified by a K-Means algorithm" />
<img src="/_data/webp/per_feat_dist_0_to_7.webp" alt="Per-feature distribution comparison of two ML models on different error groups" />
</span>
</center>
<center><i>(Top) 3 clusters of error distributions, and a comparision between 2 models. (Bottom) Once we have error groups, we'd like to find why are these happening. Visualizing differences in feature distribution between two of these clusters can help. <br /> Source: The author. Inspired by: <a href="http://manifold.mlvis.io/">http://manifold.mlvis.io/</a>.</i></center>

<p>Finally, groupings are also about how you arrange your data into training and testing splits. Or more splits, like evaluation during the training of your model. These help in noticing when the model starts to overfit, or whatever. Keep in mind, special care should be taken when doing a hyperparameter search. For fast-to-train models, a technique called <a href="https://weina.me/nested-cross-validation/">nested cross-validation</a> is an incredibly good way to ensure the model is really good. The nested part is necessary because when doing hyperparameter optimization (HPO) you’re optimizing against the evaluation set, so your results will be “optimistic”, to say the least. Having an additional split can give you a more unbiased evaluation of the final model.
What about slow models? Oh, boi. Try to have a big enough dataset such that you can have big splits for all your evaluation/testing stages. You don’t have this either? Have you heard about the <a href="https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007">AI hierarchy of needs</a>?</p>
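<p>A minimal nested cross-validation sketch with scikit-learn, assuming a fast-to-train model: the inner loop tunes hyperparameters, while the outer loop estimates the performance of the tuned model on data it never optimized against:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyperparameter search; outer loop: unbiased evaluation.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)

print(f"accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
</code></pre></div></div>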

<p>Also, an often overlooked issue is the target distribution of the dataset. It might be heavily imbalanced, and as a result, special care should be taken when sampling from it for train/validation/test splits. That’s why you should almost always search for a way to have your splits <em>stratified</em> (see scikit-learn’s <code class="language-plaintext highlighter-rouge">StratifiedKFold</code>; also, <code class="language-plaintext highlighter-rouge">train_test_split</code> has a <code class="language-plaintext highlighter-rouge">stratify=</code> parameter, and for multioutput datasets check out the <code class="language-plaintext highlighter-rouge">multioutput_crossvalidation</code> package). When a dataset is imbalanced you could try to do some sort of oversampling, a la SMOTE or ADASYN, but in my experience, it might not always work, so just experiment (a scikit-learn-like lib for this is <a href="https://imbalanced-learn.org/stable/index.html"><code class="language-plaintext highlighter-rouge">imbalanced-learn</code></a>).</p>
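<p>A short sketch of both ideas on a synthetic imbalanced dataset; the <code class="language-plaintext highlighter-rouge">SMOTE</code> part assumes the <code class="language-plaintext highlighter-rouge">imbalanced-learn</code> package is installed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Keep the class ratio identical across the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class on the *training* split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# StratifiedKFold preserves the class ratio in every CV fold too.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
</code></pre></div></div>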

<h2 id="interpretations">Interpretations</h2>

<blockquote>
  <p>Disclaimer #2, this part of the blog post is maybe one of the most overwhelming. There’s quite a body of literature about ML interpretability/explainability and I will only briefly mention some methods, for a more in-depth overview, check out <a href="https://christophm.github.io/interpretable-ml-book/">Interpretable Machine Learning by Christoph Molnar</a>.</p>
</blockquote>

<p>This category is pretty abstract, and some might argue that these methods are not really related to model evaluation, but rather to ML interpretability/explainability. To which I say: these methods allow uncovering hidden errors and biases. Based on them, you can pick one model over another, which makes interpretations useful for evaluation. These tools excel at identifying “<strong>right answer - wrong method</strong>” scenarios, which pass metrics and groupings checks without any issue.</p>

<p>So, what things can you “interpret” about a model that can help you evaluate it? First, if your model/API allows for it, you could check feature importances. You might discover that a model puts too much weight on some obscure feature or one that doesn’t make sense. At this point, you should become a detective, and find out why is this the case. This kind of feature importance is called <strong><em>global feature importance</em></strong>, because it is inferred at the model level, from all training data.</p>
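<p>A quick sketch of inspecting global feature importances; <code class="language-plaintext highlighter-rouge">permutation_importance</code> works even for models that don’t expose a built-in <code class="language-plaintext highlighter-rouge">feature_importances_</code> attribute. The model and dataset here are illustrative stand-ins:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

result = permutation_importance(model, data.data, data.target,
                                n_repeats=5, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(data.feature_names[i], round(result.importances_mean[i], 4))
# If an obscure or nonsensical feature dominates, put on the detective hat.
</code></pre></div></div>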

<p>The next easy thing to do is <strong><em>perturbation analysis</em></strong>, of which there are multiple categories. Perturbation analysis means altering the input and seeing what’s going to happen. We can alter the input with different purposes, to assess different aspects of the model.</p>
<ul>
  <li>Counterfactuals, aka “What if I change this one feature, how will my model prediction change?”. We can check, for example, how sensitive the model is to changes that intuitively should change the prediction. A prominent tool for this is <a href="https://www.tensorflow.org/tensorboard/what_if_tool">Tensorboard’s What-If tool</a> (a minimal sketch follows this list).</li>
  <li>Adversarial examples, aka “Can I craft an input that, while similar to a normal one, results in a messed-up prediction?”. Checking these is usually important for external user-facing systems, where an attack can have very nasty consequences, and because this kind of verification is more specific, it is usually left for later in the project.</li>
  <li>Random alterations, to assess how robust the model is to unimportant changes, or how well it captures “common sense-ness”; these can also be used for local feature importance. In the case of a sentiment analysis problem, a random alteration could be swapping in synonyms for words that don’t have positive or negative semantics, aka neutral words. <!-- A colleague of mine actually was in such a situation, where it turned out that location information was useful in predicting the kind of document we were dealing with, which was either a grant/award or a project request. It turned out that poorer countries usually ask for projects, while richer ones were giving awards/grants. --></li>
  <li>Out-of-distribution data. Ok, this one isn’t really perturbation analysis, but sometimes you want to make sure the model can generalize to data that is similar but not quite. Or maybe you just want <a href="https://www.youtube.com/watch?v=yneJIxOdMX4">to have some fun</a> at work and pass german sentences to a sentiment analysis model trained on Spanish text.</li>
</ul>
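<p>Here’s the promised bare-bones counterfactual probe: nudge one feature at a time and watch how the predicted probability moves. The model, dataset, and the 10% nudge are all illustrative placeholders:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0].copy()
base = model.predict_proba(x.reshape(1, -1))[0, 1]
for feature_idx in range(5):   # probe the first few features
    x_cf = x.copy()
    x_cf[feature_idx] *= 1.10  # a 10% nudge; pick changes that make sense
    new = model.predict_proba(x_cf.reshape(1, -1))[0, 1]
    print(f"feature {feature_idx}: prediction moved by {new - base:+.4f}")
# Large swings caused by tiny, implausible changes are a red flag.
</code></pre></div></div>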

<!-- Perturbation analysis can be thought of as a subset of a larger class of methods - [example-based interpretability](https://christophm.github.io/interpretable-ml-book/example-based.html) methods. In this set of methods, we can also put searching for prototypes representing a group of inputs or predictions, or methods that allow to search for the most similar entries (nearest neighbor search). -->

<p>Another way to uncover error patterns is by checking the wrong predictions which have very high model confidence. In simpler terms, the royal fuck-ups. I learned this method relatively late, from the Deep Learning Book by Goodfellow et al. I’m lazy, and this method, although obvious in hindsight, was new to me. I prefer doing perturbation analysis, since there’s no need for pretty printing and/or plotting with that one. But while working on my research project I am now “forcing” myself (it’s not so bad, really) to also do this step.</p>
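<p>This check is easy to script. A minimal sketch: take the wrong predictions, rank them by the model’s confidence, and inspect the top offenders by hand:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)
pred = proba.argmax(axis=1)

wrong = np.where(pred != y_te)[0]      # indices of wrong predictions
confidence = proba[wrong].max(axis=1)  # how sure the model was
most_wrong = wrong[np.argsort(confidence)[::-1][:10]]
print(most_wrong)  # inspect these entries by hand and look for patterns
</code></pre></div></div>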

<p>I would recommend defining some sort of regression test suite made up of previously problematic input examples. It can help you make sure that future versions of the ML model are indeed an improvement over the previous ones. It can contain previously wrongly classified entries, or examples from different types of perturbation analysis. You will thank yourself later for this regression suite.</p>
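<p>A self-contained sketch of the idea: collect yesterday’s failures and check that a newer model version handles them better. In a real project you would persist the hard examples (e.g. with joblib) and run this check in CI:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model_v1 = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
miss = np.where(model_v1.predict(X_te) != y_te)[0]
hard_X, hard_y = X_te[miss], y_te[miss]  # the regression suite entries

def check_on_old_failures(new_model):
    recovered = (new_model.predict(hard_X) == hard_y).mean()
    print(f"recovered {recovered:.0%} of previously wrong predictions")

model_v2 = LogisticRegression(max_iter=5000, C=10.0).fit(X_tr, y_tr)
check_on_old_failures(model_v2)
</code></pre></div></div>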

<p>Surrogate local explanations, of which the most prominent tool is LIME, are another kind of interpretability tool. They try to approximate a complex machine learning model with a simple one, but only on a subset of the input data, or maybe just for a single instance.</p>
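<p>For the impatient, here’s a hedged sketch of LIME on tabular data, following the <code class="language-plaintext highlighter-rouge">lime</code> package’s documented API; the model and dataset are stand-ins:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(data.data[0], model.predict_proba,
                                 num_features=5)
print(exp.as_list())  # (feature condition, weight) pairs for this one instance
</code></pre></div></div>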

<p>FINALLY (now for sure), another notable class of ML interpretability methods is additive feature explanations, and in this category one of the most prominent tools is SHAP. SHAP is especially interesting, albeit harder to understand, given it’s based on game theory and uses Shapley values to define local feature importances. One issue with this method is that Shapley values, like almost any other additive feature explanation method, don’t account for feature interactions, which can be a deal-breaker.</p>
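<p>And a short SHAP sketch for a tree ensemble. <code class="language-plaintext highlighter-rouge">TreeExplainer</code> is the fast path for tree models (the model-agnostic <code class="language-plaintext highlighter-rouge">KernelExplainer</code> is much slower); note that the exact shape of the returned values may differ between SHAP versions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:100])  # subsample: SHAP is slow

# Summary of how each feature value pushes predictions up or down.
shap.summary_plot(shap_values, data.data[:100],
                  feature_names=list(data.feature_names))
</code></pre></div></div>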

<center><img src="/_data/shap_additive_features.png" alt="Additive features from SHAP package can show which feature values impacted how the final prediction" /></center>
<center><i>SHAP uses Shapley Values to explain the effect of each feature value on the prediction. Source: author.</i></center>

<p>There are even more advanced tools, tuned specifically for neural networks. These use different forms of saliency or activation maps. Tools like these are cool and helpful, but harder to use, and not as general. Trying to cover even a subset of these would require <a href="https://christophm.github.io/interpretable-ml-book/">an entire book</a>, so if you’re interested, you know what to do ;). In the book, you can find much more detailed explanations about modern tools like SHAP, LIME, Anchors, but also more classic approaches like PDP, ICE, and ALE plots. And even concept identification approaches like <a href="https://github.com/tensorflow/tcav">Tensorflow’s TCAV tool</a>.</p>

<p>One thing to keep in mind, interpretability tools are crucial for a proper model evaluation. Although not a direct mapping, you can think of these interpretation methods for a model like code review for code. And you don’t merge code without code review in a production system, now do you?</p>

<h2 id="personal-recommendations">Personal recommendations</h2>

<p>We’re nearing the end of this post, so I would like to give you some recommendations on how to proceed when evaluating ML models as if those maturity levels weren’t enough. These recommendations are more low-level and practical, some gotchas if you will.</p>

<ul>
  <li>Of course, start with a couple of appropriate evaluation metrics. Don’t use just one. If you can, cross-validate. If doing HPO, have two testing splits. For classification, I would recommend at least some loss and some score function + scikit-learn’s <code class="language-plaintext highlighter-rouge">classification_report</code>, and if you don’t have a ton of classes, the confusion matrix is your friend (see the sketch right after this list). Some people use AUC and Precision-Recall curves, which are nice, but I’m just not used to these. Maybe after this blog post, I will start using them. (do as I say, not as I do)</li>
  <li>I usually do perturbation analysis (random and counterfactuals) after this. Looking for the top-k most wrong predictions helps, but I rarely do it (do as I say, not as I do, #2).</li>
  <li>If I’m not satisfied yet, I will certainly check for error groups a la Manifold and/or surrogate local explanations (LIME-like, I mostly use the <code class="language-plaintext highlighter-rouge">eli5</code> package). I prefer not to do the latter because it takes a looooot of time, especially with bigger-sized inputs. Regarding local explanations with surrogate models, sometimes I find it necessary to adjust the surrogate, because using the default might be just too simplistic. I do NLP, so both points are a real issue for me.</li>
</ul>
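<p>Here’s the sketch promised in the first recommendation, the metrics “starter pack” for a classifier:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print(classification_report(y_te, pred))  # per-class precision/recall/F1
print(confusion_matrix(y_te, pred))       # where exactly the misses land
</code></pre></div></div>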

<p>Sometimes, especially in the early stages of development, I could do a kind of “exploratory testing” of model predictions, namely feed out-of-distribution data and look at what will happen.</p>

<p>For personal experiments, I can sometimes use SHAP but I find it a bit frustrating that it’s hard to export the graphics and that it works best when working from Jupyter. Moreover, it’s slow, but that’s a general issue for all surrogate explanations.</p>

<p>I am yet to play around with Anchors, adversarial examples, and doing stuff like “Find the most similar entry with a different class” or “Find the most similar entries to this one”. The latter two can be done using kNN in either feature, embedding, and/or prediction spaces. Microsoft Data Scientists seem to be asking these kinds of questions to assess their models.**</p>

<p>In the end, I am sure this amount of information is overwhelming. That’s why maybe the best recommendation I could give is to just use a simple model, one that is easy to understand. To make it performant you could also try to invest time in features that make sense. All in all, just be the data scientist your company needs you to be, not the one you want to be. Boring and rational beats hype-driven.</p>

<center><img src="/_data/data_scientists.jpg" /></center>
<center>Choose your hero wisely. Made with: imgflip.com</center>

<h1 id="epilogue">Epilogue</h1>

<p>Probably this post, like no other, helped me crystalize a lot of the tacit knowledge gained through the years. Maybe you’ve heard the quote “When one teaches, two learn”. I believe something like this happened here too.</p>

<p>I know my posts are usually long and dense, sorry, I guess, but on the other hand, now you don’t have to bookmark 5-10 pages, just this one 😀😀😀 jk. Anyway, thank you for your perseverance in reading this article, and if you want to leave some feedback or just have a question, you’ve got quite a menu of options (see the footer of this page for contacts + you have the Disqus comment section). Guess it will take a while until next time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
&gt; Until then, you can play around                                &lt;
&gt; with most of the methods described in this blog post            &lt;
&gt; by checking the link below                                      &lt;
&gt; https://github.com/AlexandruBurlacu/error_analysis_code_samples &lt;
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
</code></pre></div></div>

<p><a href="https://github.com/AlexandruBurlacu/error_analysis_code_samples">You can also click on it here.</a> All examples are seeded, so it should be possible to reproduce everything. Have fun.</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>Special thanks to <a href="https://twitter.com/dgaponcic">@dgaponcic</a> for style checks and content review, thank you again <a href="https://twitter.com/anisoara_ionela">@anisoara_ionela</a> for thorough grammar checks, and thank you <a href="https://twitter.com/dianaartiom">@dianaartiom</a> for the last bits of feedback on ML. You’re all the best &lt;3</p>

<h2 id="a-few-references">A few references</h2>
<ul>
  <li><a href="http://people.duke.edu/~rnau/compare.htm">A detailed overview of regression metrics</a></li>
  <li><a href="https://christophm.github.io/interpretable-ml-book/">Interpretable Machine Learning by Christoph Molnar</a>; amazing work, a lot of info, a lot of details</li>
  <li>**<a href="/_data/ml_debugging/19_gamut_chi.pdf">Gamut paper</a> to help you ask the right questions about a model</li>
  <li><a href="/_data/ml_debugging/1808.00196.pdf">Manifold paper</a> and <a href="https://github.com/uber/manifold">Manifold GitHub repo</a></li>
  <li><a href="https://neptune.ai/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning">A good overview on how to evaluate and select ML models</a></li>
  <li>Github repos which also contain links to their respective papers:
    <ul>
      <li><a href="https://github.com/marcotcr/lime">LIME GitHub repo</a></li>
      <li><a href="https://github.com/slundberg/shap">SHAP GitHub repo</a></li>
      <li><a href="https://github.com/marcotcr/anchor">Anchors GitHub repo</a></li>
    </ul>
  </li>
  <li>And an <a href="https://github.com/altamiracorp/awesome-xai#critiques">Awesome GitHub repo</a> on different XAI tools and papers.</li>
</ul>

<!-- # Annex A: A few words about increasing the predictive performance of mostly classifiers

Robustification
- adversarial training
- focal loss for tail errors
- label smoothing
- self-distillation -->]]></content><author><name></name></author><category term="posts" /><category term="machine" /><category term="learning," /><category term="machine" /><category term="learning" /><category term="debugging," /><category term="error" /><category term="analysis," /><category term="deep" /><category term="learning," /><category term="machine" /><category term="learning" /><category term="evaluation," /><category term="machine" /><category term="learning" /><category term="testing," /><category term="artificial" /><category term="intelligence," /><category term="fairness," /><category term="ml," /><category term="ai," /><category term="data" /><category term="science" /><summary type="html"><![CDATA[When deploying machine learning algorithms, the stakes are much higher than in any toy problem or competition. For this reason, we need a much more thorough evaluation of our models, to make sure it is indeed good.]]></summary></entry><entry><title type="html">K-Means tricks for fun and profit</title><link href="https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick" rel="alternate" type="text/html" title="K-Means tricks for fun and profit" /><published>2021-06-19T18:30:00+00:00</published><updated>2021-06-19T18:30:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/kmeans-trick</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick"><![CDATA[<h1 id="prologue">Prologue</h1>

<p>This will be a pretty short post, but an interesting one nevertheless.</p>

<p>K-Means is an elegant algorithm. It’s easy to understand (scatter some random points, then iteratively move them until they become the centers of existing clusters) and it works well in practice. When I first learned about it, I recall being fascinated by its elegance. But in time the interest faded, as I kept noticing its limitations: the spherical cluster prior, its linear nature, and, what I found especially annoying in EDA scenarios, the fact that it doesn’t find the optimal number of clusters by itself, so you need to tinker with that parameter too. Then, a couple of years ago, I found out about a few neat tricks for using K-Means. So here they are.</p>
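<p>On that last annoyance: the usual workaround is the elbow method. Below is a minimal sketch of it, using the inertia scikit-learn already computes; the dataset and the range of candidate cluster counts are arbitrary choices.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=17).fit(X).inertia_ for k in ks]

# look for the "elbow", the point where adding clusters stops paying off
plt.plot(ks, inertias, marker="o")
plt.xlabel("n_clusters")
plt.ylabel("inertia")
plt.show()
</code></pre></div></div>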

<h1 id="the-first-trick">The first trick</h1>

<p>First, we need to establish a baseline. I’ll mostly use the breast cancer dataset, but you can play around with any other dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.93
</span></code></pre></div></div>
<p>So, what’s this neat trick that reignited my interest in K-Means?</p>

<blockquote>
  <p><strong><em>K-Means can be used as a source of new features.</em></strong></p>
</blockquote>

<p>How, you might ask? Well, K-Means is a clustering algorithm, right? You can add the inferred cluster as a new categorical feature.</p>

<p>Now, let’s try this.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>

<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_clusters</span><span class="p">]),</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_test</span><span class="p">,</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.937
</span></code></pre></div></div>

<p><img src="https://i.kym-cdn.com/photos/images/newsfeed/001/551/546/7ae.png" alt="Source: knowyourmeme.com" /></p>

<p><em>Source: knowyourmeme.com</em></p>

<p>This feature is categorical, but we can instead ask the model to output the distances to all the centroids, thus obtaining (hopefully) more informative features.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1">#                       ^^^^^^^^^
#                       Notice the `transform` instead of `predict`
# Scikit-learn supports this method as early as version 0.15
</span>
<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_clusters</span><span class="p">]),</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_test</span><span class="p">,</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)]),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.727
</span></code></pre></div></div>

<p>Wait, what’s wrong? Could it be that there’s a correlation between existing features and the distances to the centroids?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'mean radius'</span><span class="p">,</span> <span class="s">'mean texture'</span><span class="p">,</span> <span class="s">'mean perimeter'</span><span class="p">,</span> <span class="s">'mean area'</span><span class="p">,</span>
       <span class="s">'mean smoothness'</span><span class="p">,</span> <span class="s">'mean compactness'</span><span class="p">,</span> <span class="s">'mean concavity'</span><span class="p">,</span>
       <span class="s">'mean concave points'</span><span class="p">,</span> <span class="s">'mean symmetry'</span><span class="p">,</span> <span class="s">'mean fractal dimension'</span><span class="p">,</span>
       <span class="s">'radius error'</span><span class="p">,</span> <span class="s">'texture error'</span><span class="p">,</span> <span class="s">'perimeter error'</span><span class="p">,</span> <span class="s">'area error'</span><span class="p">,</span>
       <span class="s">'smoothness error'</span><span class="p">,</span> <span class="s">'compactness error'</span><span class="p">,</span> <span class="s">'concavity error'</span><span class="p">,</span>
       <span class="s">'concave points error'</span><span class="p">,</span> <span class="s">'symmetry error'</span><span class="p">,</span>
       <span class="s">'fractal dimension error'</span><span class="p">,</span> <span class="s">'worst radius'</span><span class="p">,</span> <span class="s">'worst texture'</span><span class="p">,</span>
       <span class="s">'worst perimeter'</span><span class="p">,</span> <span class="s">'worst area'</span><span class="p">,</span> <span class="s">'worst smoothness'</span><span class="p">,</span>
       <span class="s">'worst compactness'</span><span class="p">,</span> <span class="s">'worst concavity'</span><span class="p">,</span> <span class="s">'worst concave points'</span><span class="p">,</span>
       <span class="s">'worst symmetry'</span><span class="p">,</span> <span class="s">'worst fractal dimension'</span><span class="p">,</span>
       <span class="s">'distance to cluster 1'</span><span class="p">,</span> <span class="s">'distance to cluster 2'</span><span class="p">,</span> <span class="s">'distance to cluster 3'</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_records</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_clusters</span><span class="p">]),</span> <span class="n">columns</span><span class="o">=</span><span class="n">columns</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">corr</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=-</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="https://alexandruburlacu.github.io/_data/webp/corr_heatmap.webp" alt="The heatmap shows that our K-Means based features are most correlated with the target variable" /></p>

<p><em>Notice the last 3 columns, especially the last one, and their color on every row.</em></p>

<p>You have probably heard that we want the features in a dataset to be as independent as possible. The reason is that many machine learning models assume this independence in order to keep the algorithm simple. Some more info on this topic can be found <a href="https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features">here</a> and <a href="https://towardsdatascience.com/why-exclude-highly-correlated-features-when-building-regression-model-34d77a90ea8e">here</a>, but the gist of it is that redundant information destabilizes linear models, which in turn makes them more likely to mess up. On numerous occasions I have seen this problem, sometimes even with non-linear models, and purging the dataset of correlated features usually gives a slight boost to the model’s performance.</p>
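<p>As an aside, purging correlated features can be done mechanically from the same correlation matrix. A small sketch (the 0.95 threshold is an arbitrary choice):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    corr = df.corr().abs()
    # keep only the upper triangle, so each pair is inspected exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] &gt; threshold).any()]
    return df.drop(columns=to_drop)

# e.g. reusing the `data` DataFrame built above
# pruned = drop_correlated(data)
</code></pre></div></div>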

<p>Back to our main topic. Given that our new features are indeed correlated with some of the existing ones, what if we use only the distances to the cluster centroids as features? Will it work then?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>

<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_clusters</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.951
</span></code></pre></div></div>

<p>Much better. This example shows that we can also use K-Means for dimensionality reduction. Neat.</p>
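<p>Incidentally, since <code class="language-plaintext highlighter-rouge">KMeans</code> implements <code class="language-plaintext highlighter-rouge">transform</code>, the whole thing fits into a scikit-learn <code class="language-plaintext highlighter-rouge">Pipeline</code>, which removes the manual bookkeeping on the test set. A quick sketch, reusing the imports and data split from above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.pipeline import make_pipeline

# the pipeline calls fit_transform during fit and transform at predict time
clf = make_pipeline(KMeans(n_clusters=3, random_state=17),
                    LinearSVC(random_state=17))
clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # should match the ~0.951 above
</code></pre></div></div>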

<p>So far so good. But the pièce de résistance is yet to come.</p>

<h1 id="the-second-trick">The second trick</h1>

<blockquote>
  <p><strong><em>K-Means can be used as a substitute for the kernel trick</em></strong></p>
</blockquote>

<p>You heard me right. You can, for example, define <em>more</em> centroids for the K-Means algorithm to fit than there are features, much more.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">250</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>

<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_clusters</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.944
</span></code></pre></div></div>

<p>Well, not as good, but pretty decent. In practice, the greatest benefit of this approach shows up when you have a lot of data. Predictive performance-wise, your mileage may vary: I, for one, have run this method with <code class="language-plaintext highlighter-rouge">n_clusters=1000</code> and it worked better than with only a few clusters.</p>

<p>Kernel SVMs are known to be slow to train on big datasets. Impossibly slow. Been there, done that. That’s why there are numerous techniques to approximate the kernel trick using far fewer computational resources.</p>

<p>By the way, let’s compare how this K-Means trick does against a classic SVM and some alternative kernel approximation methods.</p>

<p>The code below is inspired by <a href="https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_approximation.html">these</a> <a href="https://scikit-learn.org/stable/auto_examples/kernel_approximation/plot_scalable_poly_kernels.html">two</a> scikit-learn examples.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>

<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span><span class="p">,</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.kernel_approximation</span> <span class="kn">import</span> <span class="n">RBFSampler</span><span class="p">,</span> <span class="n">Nystroem</span><span class="p">,</span> <span class="n">PolynomialCountSketch</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span><span class="p">,</span> <span class="n">Normalizer</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">MiniBatchKMeans</span>


<span class="n">mm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">MinMaxScaler</span><span class="p">(),</span> <span class="n">Normalizer</span><span class="p">())</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">mm</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>

<span class="n">data_train</span><span class="p">,</span> <span class="n">data_test</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">,</span> <span class="n">targets_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
</code></pre></div></div>

<p>We will test three kernel approximation methods available in scikit-learn against the K-Means trick, with a linear SVM and an SVM that uses the kernel trick as baselines.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a classifier: a support vector classifier
</span><span class="n">kernel_svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">linear_svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="c1"># create pipeline from kernel approximation and linear svm
</span><span class="n">feature_map_fourier</span> <span class="o">=</span> <span class="n">RBFSampler</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">feature_map_nystroem</span> <span class="o">=</span> <span class="n">Nystroem</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">feature_map_poly_cm</span> <span class="o">=</span> <span class="n">PolynomialCountSketch</span><span class="p">(</span><span class="n">degree</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">feature_map_kmeans</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">fourier_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_fourier</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

<span class="n">nystroem_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_nystroem</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

<span class="n">poly_cm_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_poly_cm</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

<span class="n">kmeans_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_kmeans</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

</code></pre></div></div>

<p>Let’s collect the timing and score results for each of our configurations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fit and predict using linear and kernel svm:
</span><span class="n">kernel_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">kernel_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
<span class="n">kernel_svm_score</span> <span class="o">=</span> <span class="n">kernel_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
<span class="n">kernel_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">kernel_svm_time</span>

<span class="n">linear_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">linear_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
<span class="n">linear_svm_score</span> <span class="o">=</span> <span class="n">linear_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
<span class="n">linear_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">linear_svm_time</span>

<span class="n">sample_sizes</span> <span class="o">=</span> <span class="mi">30</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">fourier_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">nystroem_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">poly_cm_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kmeans_scores</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">fourier_times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">nystroem_times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">poly_cm_times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kmeans_times</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">D</span> <span class="ow">in</span> <span class="n">sample_sizes</span><span class="p">:</span>
    <span class="n">fourier_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_components</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">nystroem_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_components</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">poly_cm_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_components</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">kmeans_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_clusters</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">nystroem_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">nystroem_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">fourier_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">fourier_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">poly_cm_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">poly_cm_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">kmeans_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">kmeans_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">fourier_score</span> <span class="o">=</span> <span class="n">fourier_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">fourier_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">fourier_score</span><span class="p">)</span>
    <span class="n">nystroem_score</span> <span class="o">=</span> <span class="n">nystroem_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">nystroem_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nystroem_score</span><span class="p">)</span>
    <span class="n">poly_cm_score</span> <span class="o">=</span> <span class="n">poly_cm_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">poly_cm_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">poly_cm_score</span><span class="p">)</span>
    <span class="n">kmeans_score</span> <span class="o">=</span> <span class="n">kmeans_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">kmeans_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">kmeans_score</span><span class="p">)</span>
</code></pre></div></div>

<p>Now let’s plot all the collected results.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">211</span><span class="p">)</span>
<span class="n">timescale</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">212</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">nystroem_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Nystroem approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">nystroem_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'Nystroem approx. kernel'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">fourier_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Fourier approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">fourier_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'Fourier approx. kernel'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">poly_cm_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Polynomial Count-Min approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">poly_cm_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'Polynomial Count-Min approx. kernel'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">kmeans_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"K-Means approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">kmeans_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'K-Means approx. kernel'</span><span class="p">)</span>

<span class="c1"># horizontal lines for exact rbf and linear kernels:
</span><span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
              <span class="p">[</span><span class="n">linear_svm_score</span><span class="p">,</span> <span class="n">linear_svm_score</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"linear svm"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
               <span class="p">[</span><span class="n">linear_svm_time</span><span class="p">,</span> <span class="n">linear_svm_time</span><span class="p">],</span> <span class="s">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'linear svm'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
              <span class="p">[</span><span class="n">kernel_svm_score</span><span class="p">,</span> <span class="n">kernel_svm_score</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"rbf svm"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
               <span class="p">[</span><span class="n">kernel_svm_time</span><span class="p">,</span> <span class="n">kernel_svm_time</span><span class="p">],</span> <span class="s">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'rbf svm'</span><span class="p">)</span>
</code></pre></div></div>

<p>And some more plot adjustments, to make it pretty.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># legends and labels
</span><span class="n">accuracy</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Classification accuracy"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Training times"</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(())</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">fourier_scores</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Sampling steps = transformed feature dimension"</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Classification accuracy"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Training time in seconds"</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'best'</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'best'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/_data/webp/big_comparative_study_kmeans_svm.webp" alt="K-Means as a kernel approximator maybe is not the most performant solution, but it still has some special characteristics" /></p>

<p><em>Meh. So was it all for nothing?</em></p>

<p>You know what? Not in the slightest. Even if it’s the slowest, K-Means as an approximation of the RBF kernel is still a good option. I’m not kidding. Scikit-learn has a special kind of K-Means called <code class="language-plaintext highlighter-rouge">MiniBatchKMeans</code>, one of the few estimators that support the <code class="language-plaintext highlighter-rouge">.partial_fit</code> method. Combine it with a model that also has <code class="language-plaintext highlighter-rouge">.partial_fit</code>, like a <code class="language-plaintext highlighter-rouge">PassiveAggressiveClassifier</code>, and you can create a pretty interesting solution.</p>

<p>Note that the beauty of <code class="language-plaintext highlighter-rouge">.partial_fit</code> is twofold. First, it makes it possible to train algorithms out-of-core, that is, with more data than fits in RAM. Second, depending on your problem, if you in principle (very much in principle) never need to swap the model out, it can keep training right where it is deployed. That’s called online learning, and it’s super interesting. Something like this is <a href="https://huyenchip.com/2020/12/27/real-time-machine-learning.html">what some Chinese companies are doing</a>, and in general it can be pretty useful for <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf">AdTech</a>, because feedback on whether your ad recommendation was right or wrong arrives within seconds.</p>

<p>You know what, here’s a little example of this approach for out-of-core learning.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">MiniBatchKMeans</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">PassiveAggressiveClassifier</span>

<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">batch</span><span class="p">(</span><span class="n">iterable</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="c1"># source: https://stackoverflow.com/a/8290508/5428334
</span>    <span class="n">l</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">iterable</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">ndx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="k">yield</span> <span class="n">iterable</span><span class="p">[</span><span class="n">ndx</span><span class="p">:</span><span class="nb">min</span><span class="p">(</span><span class="n">ndx</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="n">l</span><span class="p">)]</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="n">kmeans</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span> <span class="c1"># K-Means has a constraint, n_clusters &lt;= n_samples to fit
</span><span class="n">pac</span> <span class="o">=</span> <span class="n">PassiveAggressiveClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">batch</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span> <span class="n">batch</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">100</span><span class="p">)):</span>
    <span class="n">kmeans</span><span class="p">.</span><span class="n">partial_fit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>       <span class="c1"># fit K-Means a bit
</span>    <span class="n">x_dist</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>   <span class="c1"># obtain distances
</span>    <span class="n">pac</span><span class="p">.</span><span class="n">partial_fit</span><span class="p">(</span><span class="n">x_dist</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>     <span class="c1"># learn a bit the classifier, we need to indicate the classes
</span>    <span class="k">print</span><span class="p">(</span><span class="n">pac</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">))</span>

<span class="c1"># 0.909 after 100 samples
# 0.951 after 200 samples
# 0.951 after 300 samples
# 0.944 after 400 samples
# 0.902 after 426 samples
</span>

<span class="c1"># VS
</span><span class="n">kmeans</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">pac</span> <span class="o">=</span> <span class="n">PassiveAggressiveClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="n">pac</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">),</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pac</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span>
<span class="c1"># should be ~0.951
</span>
</code></pre></div></div>

<!-- Spherical k-means -->
<!-- https://sites.google.com/site/dataclusteringalgorithms/kernel-k-means-clustering-algorithm -->

<h1 id="epilogue">Epilogue</h1>

<p>So you’ve made it to the end. I hope your ML toolset is now richer. Maybe you’ve heard about the so-called “no free lunch” theorem; basically, there’s no silver bullet, in this case for ML problems. Maybe for the next project the methods outlined in this post won’t work, but for the one after that, they will. So just experiment and see for yourself. And if you need an online learning algorithm or method, well, there’s a good chance that K-Means as a kernel approximation is the right tool for you.</p>

<p>By the way, <a href="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis">there’s another blog post</a>, also on ML, in the works right now. Among many other nice things, it describes a rather interesting way to use K-Means. But no spoilers for now. Stay tuned.</p>

<p>Finally, if you’re reading this, thank you! If you want to leave some feedback or just have a question, you’ve got quite a menu of options (see the footer of this page for contacts + you have the Disqus comment section).</p>

<h2 id="some-links-you-might-find-interesting">Some links you might find interesting</h2>

<ul>
  <li><a href="https://datascience.stackexchange.com/questions/24324/how-to-use-k-means-outputs-extracted-features-as-svm-inputs">A stackexchange discussion about using K-Means as a feature engineering tool</a></li>
  <li><a href="https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html">A more in-depth explanation of K-Means</a></li>
  <li><a href="http://www.jcomputers.us/vol8/jcp0810-25.pdf">A research paper that uses K-Means for an efficient SVM</a></li>
</ul>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>Special thanks to <a href="https://twitter.com/dgaponcic">@dgaponcic</a> for style checks and content review, and thank you <a href="https://twitter.com/anisoara_ionela">@anisoara_ionela</a> for grammar checking this article more thoroughly than any AI ever could. You’re the best &lt;3</p>

<p><strong>P.S.</strong> I believe you noticed all these <code class="language-plaintext highlighter-rouge">random_state</code>s in the code. If you’re wondering why I added them, it’s to make the code samples reproducible. Tutorials frequently don’t do this, which leaves room for cherry-picking, where the author presents only the best results, and the reader who tries to replicate them either can’t or wastes a lot of time. But know this: you can play around with the values of <code class="language-plaintext highlighter-rouge">random_state</code> and get wildly different results. For example, when running the snippet with original features and distances to the 3 centroids, the one with a 0.727 score, with a random seed of 41 instead of 17, you can get an accuracy score of 0.944. So yeah, <code class="language-plaintext highlighter-rouge">random_state</code>, or whatever the random seed is called in your framework of choice, is an important aspect to keep in mind, especially when doing research.</p>]]></content><author><name></name></author><category term="posts" /><category term="machine" /><category term="learning," /><category term="clustering," /><category term="artificial" /><category term="intelligence," /><category term="k-means," /><category term="svm," /><category term="kernel" /><category term="trick," /><category term="kmeans," /><category term="kmeans" /><category term="svm" /><category term="trick," /><category term="ml," /><category term="ai," /><category term="unsupervised" /><category term="ml," /><category term="classification" /><summary type="html"><![CDATA[K-Means is an interesting, simple, and pretty intuitive algorithm. It turns out it can do more than just clustering, for example classification.]]></summary></entry><entry><title type="html">Logging, Tracing, Monitoring, et al.</title><link href="https://alexandruburlacu.github.io/posts/2021-05-20-logs-traces-how-to" rel="alternate" type="text/html" title="Logging, Tracing, Monitoring, et al." /><published>2021-05-18T22:10:00+00:00</published><updated>2021-05-18T22:10:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/logs-traces-how-to</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2021-05-20-logs-traces-how-to"><![CDATA[<h1 id="so-you-want-to-launch-your-codeappsystem-in-production">So, you want to launch your code/app/system in production?</h1>

<p>Wait, before you do, ask yourself this question: <em>If something goes south, how will I know what <strong>exactly</strong> happened?</em></p>

<p>A good question, indeed.</p>

<p>A more seasoned engineer might say: <em><strong>I will use logs!!!</strong></em> But what if I tell you logs are only the beginning?</p>

<blockquote>
  <p>[Disclaimer Time] This article is not about some concrete technology, framework, or library, although it references a few. It’s more of an overview of, and tips about, what logging/tracing/et al. are and how to approach them when designing and operating software systems. The information here is based mostly on my own experience, but also on papers and industry blog posts. You might need to google some stuff while/after reading, especially if you’ve never operated a system running in production.</p>
</blockquote>

<h1 id="act-1-ill-set-up-logs-alright">Act 1: I’ll set up logs, alright…</h1>

<p>So, what exactly is a log?</p>

<p><img src="https://media.giphy.com/media/xUOxfbAOLZmR356YgM/giphy.gif" alt="We'll talk about logs, just not this kind of logs" /></p>

<p>Technically, this is a log, but I want to talk about other kinds of logs.</p>

<blockquote>
  <p><strong>Logs are a record about some event in a system</strong></p>
</blockquote>

<p>Pretty abstract, huh? A log is like an entry in a journal about something that happened, maybe with some context. Somewhat like the Twitter feed of an Apple-reporter during the WWDC event. You have time, you have a record of something that just happened, and maybe you have context too. Now, jokes aside, logs are necessary for a system running in production. They help you uncover what was happening moments before applications crash. Or malicious activity. Or other stuff. But how do we make <strong>good</strong> logs?</p>

<h2 id="tenets-of-a-good-log-message">Tenets of a good log message</h2>

<p>So, how should we design our logs? Here are some tenets:</p>

<ul>
  <li>
    <p>Thy logs must be <strong>hierarchical</strong>: we need to respect the distinction between <code class="language-plaintext highlighter-rouge">DEBUG/INFO/WARNING/ERROR</code> and possibly other levels. We should not crowd the system with <code class="language-plaintext highlighter-rouge">WARNING</code> logs when <code class="language-plaintext highlighter-rouge">INFO</code> or <code class="language-plaintext highlighter-rouge">DEBUG</code> logs are more appropriate. Crowding also refers to how much information a log contains. That said, a good idea for an <code class="language-plaintext highlighter-rouge">ERROR</code> log is to register as much information as possible to aid in debugging. Use <code class="language-plaintext highlighter-rouge">DEBUG</code>-level logs to register what settings the program is using, even how much time or resources some subroutine consumes, but don’t abuse this. As for <code class="language-plaintext highlighter-rouge">INFO</code> logs: anything in between, like a call to a top-level route handler in an HTTP server. Also, <code class="language-plaintext highlighter-rouge">INFO</code> logs are the proper replacement for print statements in a running system.</p>
  </li>
  <li>
    <p>Thy logs must be <strong>informative</strong>: A good rule of thumb is to log everything that might help you debug your system. If an error happens, you will want to log the traceback. Also, logging the context in which the error happened will prove useful. By context, I mean the surrounding variables that might have something to do with the failure. If your system runs with multiple processes, or is multithreaded, or multi-whatever, do yourself a favor and log the PIDs/thread IDs. Finally, be very careful with how you represent time; explaining why would require an entire blog post, but time in computer systems is a pain, <a href="https://www.youtube.com/watch?v=-5wpm-gesOY">see for yourself</a>.</p>
  </li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ERROR: Error name, message, traceback, variables in scope if possible
WARNING: Warning name, message
INFO: Calls to top-level functions/handlers, like: [2021-05-17 00:06:23] INFO: GET /posts 200 OK
DEBUG: Program setup/initialization info, possibly memory or performance information*

*: more on that later
</code></pre></div></div>
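
<p>To make the hierarchy concrete, here’s a minimal sketch using Python’s standard <code class="language-plaintext highlighter-rouge">logging</code> module; the logger name and the messages are, of course, made up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging

logging.basicConfig(level=logging.INFO)  # DEBUG messages get filtered out
logger = logging.getLogger("payments")   # hypothetical subsystem name

logger.debug("Using connection pool of size 10")    # setup info, hidden at INFO level
logger.info("GET /posts 200 OK")                    # a top-level handler call
logger.warning("Retrying request, attempt 2 of 3")  # something is off, but not fatal

try:
    1 / 0
except ZeroDivisionError:
    logger.error("Computation failed", exc_info=True)  # register the full traceback
</code></pre></div></div>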

<ul>
  <li>Thy logs must be <strong>filterable</strong>: logs are meant to be analyzed. Make them as searchable as possible. Consider formatting them as JSON documents, and don’t abuse nesting.</li>
</ul>

<p>Why not? If the JSON is too nested, it becomes hard to search/analyze, defeating its purpose.</p>

<p>For example, Elasticsearch can’t properly index JSONs with two or more levels of nesting. That is, something like the example below can be indexed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"timestamp": "2021-05-18T21:09:54Z", "level": "error", "msg": "bad thing happened"}
</code></pre></div></div>

<p>Even something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"timestamp": {"date": "17th May, 2021", "time": "11:30:30am"}, "level": "error", "msg": "bad thing happened"}
</code></pre></div></div>

<p>But do something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"timestamp": {
    "date": "17th May, 2021",
    "time": [11, 30, 30, 124]
    },
 "level": "error",
 "msg": "bad thing happened",
 "context": {
    "some_key_for_multiple_values": []
    }
}
</code></pre></div></div>

<p>And Elastic will treat your deeply nested elements as strings, and then good luck filtering and aggregating those logs. So keep it flat whenever possible.</p>

<p>Another good format is the NCSA Common log format, but if possible, choose JSON. Why? Most log analysis tools work with JSON. Something like the NCSA Common log format is better for smaller systems, where you can search your logs with <code class="language-plaintext highlighter-rouge">grep</code> and friends. Finally: <em>whatever format you choose, be consistent across your whole system.</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bad log (1): [2021-05-17 12:30:30] ERROR: KeyError // JSON version would be just as bad
Bad log (2): {"datetime": {"date": "17th May, 2021", "time": "11:30:30am"}, "type": "ERROR", "msg": "A KeyError error occured in function some_function"}
Better log: {"timestamp": "2021-05-18T21:09:54Z", "level": "error", "pid": 1201, "traceback": &lt;your traceback as a string&gt;, "msg": "KeyError: 'key_name'"}
</code></pre></div></div>
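
<p>If you want flat JSON logs like these in Python, a minimal hand-rolled formatter could look like the sketch below; note that libraries such as <code class="language-plaintext highlighter-rouge">python-json-logger</code> already do this for you, so treat it as an illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import logging
import sys
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "pid": record.process,
            "msg": record.getMessage(),
        }
        if record.exc_info:  # keep the traceback flat, as a single string
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
</code></pre></div></div>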

<h2 id="some-wisdom-on-logging-ops">Some wisdom on logging ops</h2>

<p>So you have well-written logs. That’s great!!</p>

<p>But now you have to decide how to access and analyze them. Funny thing, these decisions should also be guided by the stage and the scale of your system. In other words, I would advise against a complex infrastructure if you have one app serving a few hundred people.</p>

<p>Now we should dive into details.</p>

<p>You will roughly have three stages.</p>

<ul>
  <li>Log collection/shipment</li>
  <li>Log storage</li>
  <li>Log processing/analytics</li>
</ul>

<p>First, log collection. We want to save our logs somewhere and not just let them print to stderr/stdout. So now we have to think about where to write them. It could be a file, or Syslog, for example, or we could even write them into a TCP or UDP socket, shipping them off to some logging server. To be honest, all of these choices are reasonable. As long as you don’t block the thread where the action happens, you should be fine; otherwise, prepare for a performance hit.</p>
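
<p>In Python, for instance, the standard library already covers the “don’t block the hot path” part with a <code class="language-plaintext highlighter-rouge">QueueHandler</code>/<code class="language-plaintext highlighter-rouge">QueueListener</code> pair; here’s a minimal sketch:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import logging.handlers
import queue

log_queue = queue.SimpleQueue()

# the application thread only does a cheap put into an in-memory queue
logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))

# a background thread drains the queue and does the actual (slow) I/O
listener = logging.handlers.QueueListener(
    log_queue, logging.FileHandler("app.log"))
listener.start()
</code></pre></div></div>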

<p>Regarding storage, for a simple app leaving logs in files should work for a while, but eventually you’ll want a storage solution with indexing support, or really anything that helps you search your logs quickly.</p>

<p>Once you have multiple services, you can think of a centralized logging server, something like an ELK (Elasticsearch, Logstash, Kibana) stack, with one or a few Elasticsearch instances in a cluster setup.</p>

<p>So here comes my personal opinion: you should start by logging into a file, and make sure to set up log file rotation, because you don’t want a single 10GB text file. Believe me… you don’t. At some point, you will also have to think about log compression and possibly log shipping. Log shipping means transferring the logs from where they were created to where they will be analyzed and stored long-term.</p>
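
<p>Rotation is usually a one-liner in logging libraries; with Python’s standard library, size-based rotation could look like this (the 10MB / 5 backups numbers are purely illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import logging.handlers

# roll over at ~10MB, keep app.log.1 ... app.log.5, then drop the oldest
handler = logging.handlers.RotatingFileHandler(
    "app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
logging.getLogger().addHandler(handler)
</code></pre></div></div>

<p>There’s also <code class="language-plaintext highlighter-rouge">TimedRotatingFileHandler</code> if you’d rather rotate, say, daily.</p>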

<p><img src="/_data/webp/LoggingArch.webp" alt="An efficient logging architecture will try to offload log shipping to a separate component" /></p>

<p>When it comes to log shipping, I would strongly suggest using TCP or HTTP over UDP and other protocols. Why, you may ask? First of all, with UDP you might lose logs in transit, because there is (1) no retransmission of lost packets and (2) no flow control, which is itself a common cause of lost packets. On top of that, a UDP message is limited to 65KB of data, or even less depending on network settings, which quite frankly might not be nearly enough. Also, your company firewalls might block this kind of traffic. So, a lot of trouble.</p>
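
<p>For what it’s worth, Python’s standard library has a TCP option out of the box, <code class="language-plaintext highlighter-rouge">SocketHandler</code>, which even retries dropped connections with a backoff; the address below is a placeholder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import logging.handlers

# ships pickled log records over TCP and reconnects with a backoff
# if the (hypothetical) logging server is temporarily unreachable
tcp_handler = logging.handlers.SocketHandler("logs.internal.example", 9020)
logging.getLogger().addHandler(tcp_handler)
</code></pre></div></div>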

<p>With a centralized logging solution, you will now absolutely need to ship the logs, and having them first written to a file will prove a very nice idea: your logs won’t be lost in case of a network outage, server failure, logging system failure, or any of the above being too slow.</p>

<p>Nice.</p>

<p><img src="https://media.giphy.com/media/k0hKRTq5l9HByWNP1j/giphy.gif" alt="Borat approves" /></p>

<h1 id="act-11-hey-i-think-i-can-make-a-chatbot-to-notify-me-when-something-blows-up">Act 1.1: Hey, I think I can make a chatbot to notify me when something blows up</h1>

<p>Yup, you can. And if you want to reduce MTTR (mean time to recovery), you most likely should. Just take into account a few things.</p>

<ul>
  <li>First and foremost, if you have the option, set up alerting thresholds. You don’t want to be notified when something is even slightly off every. single. time. If it’s some unique (non-critical) event, there’s no need to be bothered, but if the issue happens frequently, you’d better be notified.</li>
  <li>Another consideration, when it comes to alerting, is the possibility of <strong>escalation alerting</strong>. First, send an alert via email. If no action was taken, send it to a chat group of the responsible team. Still no activity? Send it as a DM to an engineer, or even to a technical manager.</li>
  <li>Finally, just aggregate the stuff; there’s no need for 12, or a hundred, emails/Slack messages about the same issue. Something like one log message and then some text like <code class="language-plaintext highlighter-rouge">X occurred 25 times in the last Y seconds</code> should be good. See the sketch right after this list.</li>
</ul>
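
<p>Here’s a back-of-the-napkin sketch of the threshold + aggregation idea; everything in it, the window size, the threshold, the <code class="language-plaintext highlighter-rouge">send_alert</code> function, is hypothetical:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window size
THRESHOLD = 25        # alert only if an error repeats this often per window
events = defaultdict(deque)

def maybe_alert(error_key, send_alert):
    now = time.time()
    window = events[error_key]
    window.append(now)
    while window and now - window[0] &gt; WINDOW_SECONDS:  # evict old events
        window.popleft()
    if len(window) &gt;= THRESHOLD:
        send_alert(f"{error_key} occurred {len(window)} times "
                   f"in the last {WINDOW_SECONDS} seconds")
        window.clear()  # reset, so every following event doesn't re-alert
</code></pre></div></div>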

<p>When it comes to what tools to use for alerting, well, you have Sentry; also, to my knowledge, it is possible to set up alerting in Kibana, although I don’t know whether this is a paid option or free, and there are of course other tools.</p>

<p>This is by no means a definitive guide on how to do it, only some things to keep in mind. This whole blog post isn’t a definitive guide if you haven’t noticed yet.</p>

<h1 id="act-2-my-system-is-slow-i-guess-ill-log-execution-time-and--of-requests-and-">Act 2: My system is slow, I guess I’ll log execution time, and # of requests, and …</h1>

<p><img src="https://i.kym-cdn.com/photos/images/newsfeed/001/246/726/244.png" alt="" /></p>

<p>… just. Stop. Please. The fact that you <strong>can</strong> do it doesn’t mean you should. Welcome to the world of telemetry and performance monitoring, where you will initially wonder: why not just use logs? In principle you could, but it’s better to have a separate infrastructure, so as not to mess everything up.</p>

<p>Mess up how? Well, if you’re like me, you might want to set up performance monitoring not just at the route controller level, to see how long requests take to be handled and responded to (assuming a hypothetical server). You will also want to track how long database queries take to execute, even individual functions! And now you have a ton of very fine-grained info, which will for sure overload the logging infrastructure. You don’t want that. Besides, even if everything runs smoothly, your read and write patterns will be different. Log analysis queries can be much more complex than the analysis required for performance monitoring. Also, performance monitoring usually deals with smaller messages that need to be recorded with lower latency.
All in all, better to set up a dedicated infrastructure for this.</p>

<p>The easiest thing is of course to use <code class="language-plaintext highlighter-rouge">TRACE</code>-level logging and, as said earlier, a dedicated infrastructure for performance monitoring. But this works only at a small scale, where, frankly, you don’t even need performance monitoring.</p>

<p>As the system scales, you might start looking towards a more constrained kind of log, maybe some binary protocol, given that you will be sending small packets of information right away, very frequently.</p>

<p>Performance monitoring has somewhat different write and query patterns than log analytics (I know, I said it earlier), so different storage is recommended. Queries are simpler, mainly showing trends, time series, current values, or simple aggregates like counts, means, medians, and percentiles; writes are very frequent but carry little data, only a few metrics, compared with logging tracebacks, contexts, and stuff like that.</p>

<p>That’s why, for example, the ELK stack is more common in logging infrastructure, where Elasticsearch can index and analyze even very unstructured data, while tools like Grafana + Prometheus are more commonly used for performance monitoring. Prometheus, among other things, contains a time-series database, just the right thing to store and quickly query performance metrics.</p>
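
<p>To make it concrete, with the official <code class="language-plaintext highlighter-rouge">prometheus_client</code> library, instrumenting a handler takes only a few lines; the metric names here are made up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request handling latency")

@LATENCY.time()          # records how long every call takes
def handle_request():
    REQUESTS.inc()       # tiny, frequent writes: just a counter bump
    ...                  # the actual work would happen here

start_http_server(8000)  # Prometheus then scrapes :8000/metrics periodically
</code></pre></div></div>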

<p>Also, when it comes to performance analysis, you will want to monitor your system utilization, not just the stuff intrinsic to your code. If you’re using Prometheus, that’s easy to do.</p>

<h1 id="act-3-my-microservice-system-is-slow-but-i-cant-figure-out-why">Act 3: My microservice system is slow, but I can’t figure out why</h1>

<hr />

<p><strong>First, a likbez (crash course) on networking and dynamic systems</strong>: Contrary to our intuition, a computer network is a shared resource with limited capacity. This basically means that if one service is very chatty, it will affect the throughput and latency of all the rest. Also, given that networks are a priori not 100% reliable and we mostly use TCP-based traffic, the network will carry plenty of packets (chunks of data, retransmissions, packets from administrative protocols). That’s only half the problem though. There’s more 😉</p>

<p>Our services depend on each other and on 3rd parties. So if one service is slow, it might affect other services, even ones not directly interacting with it. One metaphor to help you think of it is a spider web: when you touch it on one side, it ripples on the other. Kinda like a butterfly effect. And that’s not just a figure of speech; you can indeed see failures caused by some other service being just a bit slower.</p>

<hr />

<p>So, how do we monitor this?</p>

<p>Maybe logs? Or something like performance monitoring from the previous act?</p>

<p>Well, I mean, it’s a start, but logs alone won’t cut it, because we don’t see the full picture; specifically, we don’t see the interactions between services, only each service’s individual performance. We need something more. Enter <strong>tracing</strong>.</p>

<p>First, a good mental model for tracing is that it’s like logging, but with a <a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/CorrelationIdentifier.html">correlation identifier</a>, which makes it possible to combine said logs into a “trace”.
A trace like this can show us how, for example, a single request spans multiple services, how much time each step takes, and even how much time was spent on communication. All this can help uncover bugs and performance bottlenecks in a way that a simple performance monitoring tool, or just logs, can’t. Tracing will help you find bottleneck services, and sometimes even aid you in debugging distributed systems.</p>
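
<p>Stripped of all tooling, the core idea fits in a few lines: tag every log with the same identifier and pass it along on every outgoing call. A hand-rolled sketch, where <code class="language-plaintext highlighter-rouge">request</code> and <code class="language-plaintext highlighter-rouge">call_downstream</code> are placeholders, not how you’d do it with a real tracing library:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import uuid

logger = logging.getLogger("service-a")

def handle(request):
    # reuse the caller's identifier if present, otherwise start a new trace
    trace_id = request.headers.get("X-Trace-Id", str(uuid.uuid4()))
    logger.info("started handling", extra={"trace_id": trace_id})
    # propagate the identifier, so downstream logs land in the same trace
    call_downstream(headers={"X-Trace-Id": trace_id})
</code></pre></div></div>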

<p><img src="/_data/webp/Tracing.webp" alt="Tracing allows us to see the lineage of each request, and find potential bottlenecks in our systems" /></p>

<p>Traces should be thought of as an extension of performance monitoring tools rather than of logs. Traces’ primary purpose is to uncover performance issues, and sometimes to pinpoint the reason a specific operation failed. You could use them as logs, but don’t overload them with information; otherwise, your collection, storage, and analysis infrastructure will cry.</p>

<p>How to structure your traces? The easiest thing to do is to use tools that will automagically patch your dependencies, like database clients, web servers, and HTTP/RPC clients, and be done with it. Sensible defaults, you know. If you want more control, be prepared to write some boilerplate, especially if you want to manually control what gets propagated between services. When it comes to adding info to your spans (the pieces which, combined, form a trace), don’t add your whole application context, only the most important things, for example, the current configuration of your system.</p>
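
<p>With OpenTelemetry’s Python API, for example, a manual span with a couple of hand-picked attributes could look like this, assuming an SDK and an exporter are already configured elsewhere:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    # a few high-signal attributes, not the whole application context
    span.set_attribute("app.config_version", "2.3.1")
    span.set_attribute("order.items_count", 3)
    ...  # the actual work happens inside the span
</code></pre></div></div>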

<p>Side note: sometimes it is important to correlate traces with logs. For this, you can use yet another correlation identifier, combining traces with individual logs for a more in-depth analysis of your system. <!-- That's what Uber does, for example. LINK --></p>

<p>There are some existing Open Source tools with great support, like <a href="https://www.jaegertracing.io/">Jaeger</a> and <a href="https://zipkin.io/">Zipkin</a>, there are also industry initiatives like OpenTracing, OpenCensus and “their combination” OpenTelemetry, not to mention a few trace formats, like <a href="https://w3c.github.io/trace-context/">W3C Trace Context</a> and <a href="https://github.com/openzipkin/b3-propagation">Zipkin B3</a> formats.</p>

<p><img src="/_data/webp/TracingArch.webp" alt="Tracing looks like magic, but in fact can be achieved with special correlation identifiers, and a good clock" /></p>

<p>A common architecture for tracing subsystems is a combination of a sidecar, collector, storage, and “presenter” components, not to mention the client library. When it comes to using tracing in a serverless setup, things get tricky; one solution would be to bypass the sidecar and send data directly to the collector, <a href="https://www.jaegertracing.io/docs/1.22/faq/#do-i-need-to-run-jaeger-agent">but you will lose some nice features</a>.</p>

<p>Tracing, in general, is a huuuuge topic, and covering it would require at least one more long-read article. That’s why, for more information, I’d like to point you towards <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf">these</a> two <a href="https://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf">articles</a> and <a href="https://eng.uber.com/distributed-tracing/">this post from Uber</a>. In them you’ll find more “war stories” about how such systems were implemented (the first article and the Uber post) and also such important topics as trace sampling strategies and trace visualizations (the second article).</p>

<h1 id="final-act-welcome-to-observability">Final act: Welcome to observability!!!</h1>

<p>Observability, what?</p>

<p>Observability is the property of a system to be understood. It’s a measure of how well one can infer the internal state of something from its external outputs.
It’s a spectrum, and depending on where your system stands, you can use monitoring and alerting more or less efficiently.
In other words, if a system is observable, you can understand what is happening within it from its outputs.</p>

<p>We need to design our systems with observability in mind. And with all the stuff outlined above, that should become a doable task.</p>

<p>I prefer to think of observability, paired with a proper incident response procedure, of course, as a way to make a system anti-fragile (see the works of Nassim Taleb),
because with every failure and issue that happens, the system “learns”, on the organizational level, to be better. Or one could argue the contrary: the system becomes more fragile, because with every fix we grow more confident that it is now unkillable, which it never will be.</p>

<p>Pick for yourself, but don’t forget to use logging. At least you’ll know when and why things go south, and that’s something.</p>

<h1 id="epilogue">Epilogue</h1>

<p>You’ve made it! Congrats! Now you have some very important knowledge of how to be prepared when manure hits the proverbial fan in production.
This knowledge should help you debug even super-obscure bugs. Of course, it isn’t going to be easy, plus you now have an entire infrastructure to take care of,
but hey, if it helps reduce the time to solve an issue from a week (or more) to one, maybe two, days, it might be worth it.</p>

<p>I know for a fact that it was worth it for me: time and time again it helped me quickly identify edge cases, stupid misconfigurations, and performance bottlenecks.</p>

<p>So yeah, that’s it for now. Incredibly, it didn’t take much time since my last blog post.</p>

<p>Finally, if you’re reading this, I’d like to thank you. Let me know your thoughts about it via Twitter, for now, until I plug in some form of comment section. Your feedback is valuable to me.</p>

<!-- https://ferd.ca/erlang-otp-21-s-new-logger.html
https://iamondemand.com/blog/open-source-distributed-tracing-why-you-need-it-how-to-get-started/ -->]]></content><author><name></name></author><category term="posts" /><category term="logging," /><category term="logs," /><category term="tracing," /><category term="traces," /><category term="observability," /><category term="telemetry," /><category term="monitoring," /><category term="alerting," /><category term="distributed-systems," /><category term="debugging," /><category term="software" /><category term="engineering" /><summary type="html"><![CDATA[When it comes to production-ready systems we need a way to know what's going on in it, aiding us in debugging it, when the time comes.]]></summary></entry></feed>