<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://alexandruburlacu.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://alexandruburlacu.github.io/" rel="alternate" type="text/html" /><updated>2024-11-08T13:58:37+00:00</updated><id>https://alexandruburlacu.github.io/feed.xml</id><title type="html">Alexandru Burlacu</title><subtitle>A blog about advanced machine learning topics, MLOps, software engineering, distributed systems, and more.</subtitle><entry><title type="html">MLOps for independent research</title><link href="https://alexandruburlacu.github.io/posts/2023-01-12-mlops-for-independent-research" rel="alternate" type="text/html" title="MLOps for independent research" /><published>2023-01-12T21:00:00+00:00</published><updated>2023-01-12T21:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/mlops-for-independent-research</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2023-01-12-mlops-for-independent-research"><![CDATA[<p><strong>… Or how to run experiments on a budget.</strong></p>

<p>On December 5th, I presented online at the <a href="https://www.meetup.com/mlops-belgium/events/289639571/">Belgium MLOps meetup in Ghent</a>. I thought more people would benefit from the content of that presentation and my experience in general, so I decided to also publish it as an article on my blog. While working on that presentation, I found a few unexpected things, but more about that later.</p>

<p>Oh, by the way, one of the best alternative titles was:</p>

<center><img src="/_data/MLOpsBelgium/Alt-Title.webp" width="850" heigth="480" /></center>
<center><i>I went with food | Image based on the slides by the author</i></center>

<!-- **UPDATE**: Here's the recording from that presentation -->

<h2 id="prologue---some-context">Prologue - Some context</h2>

<p>I believe it’s important to outline my main research driver:</p>
<blockquote>
  <p><strong><em>I’m searching for methods to train strong neural networks from scratch with minimum annotated data. Ideally, with minimum data.</em></strong></p>
</blockquote>

<p>Why? Throughout my career, I had cases where data was scarce and expensive to acquire, and even a pre-trained model couldn’t help. So I had to create small bespoke models to tackle my problems. It was a huge pain; I never want to go through that hell again, and I wish no one else would have to either.</p>

<p>Besides, sometimes, using a pre-trained model can be restrictive, depending on its license. Currently, the most relevant type of restrictive license for AI is <a href="https://bigscience.huggingface.co/blog/the-bigscience-rail-license">RAIL</a>. If you wonder why such licenses are restrictive and don’t want to dive into the legal aspects, here are a few good links.</p>

<ul>
  <li><a href="https://blog.tidelift.com/evaluating-the-rail-license-family">Evaluating the RAIL license family</a></li>
  <li><a href="https://www.reddit.com/r/StableDiffusion/comments/z8x4k3/the_changes_between_the_creativeml_open_railm/">A Reddit discussion about various RAIL variants and their implications</a></li>
  <li><a href="https://www.youtube.com/watch?v=W5M-dvzpzSQ">The New AI Model Licenses have a Legal Loophole | Yannic Kilcher</a></li>
</ul>

<p>To form a more nuanced view of ML and licensing, see the two-part essay <a href="https://thegradient.pub/machine-learning-ethics-and-open-source-licensing/">by Christopher Moran on The Gradient</a>. We won’t dive any deeper into this rabbit hole, otherwise we’ll stray waaaaay too far from this blog’s scope.</p>

<!-- 
https://www.digitalocean.com/community/tutorials/understanding-open-source-software-licenses
https://fossa.com/developers-guide-open-source-software-licenses
https://www.digitalocean.com/community/conceptual-articles/free-vs-open-source-software
 -->

<p>So anyway, in the summer of 2021, I had a research internship at Université Paris Sorbonne Nord. I had my own research agenda, and my supervisor was super cool about it. My research project was about searching for more sample-efficient self-supervised learning (SSL) techniques. I was working with images, but the method had to be modality-agnostic.</p>

<p>The only downside, stemming from my not wanting to work on some existing, grant-covered project, was that I had no access to the necessary hardware.</p>

<p>But that’s alright. It is, isn’t it?</p>

<h2 id="you-want-to-do-some-independent-research">You want to do some independent research</h2>

<p>How do you proceed?</p>

<h3 id="solution-you-buy-a-gpu">Solution: You buy a GPU.</h3>

<!-- Emoji here -->
<p>🪄🪄 Or better yet, you buy many GPUs. 🪄🪄 <!-- Emoji here too --></p>

<p>Problem solved.</p>

<p>Bye.</p>

<p>Hold on, seriously. How do you proceed? A good GPU machine will set you back a few thousand USD, even with the crypto boom somewhat behind us.</p>

<p>Besides, my project was pretty short-term, so such an investment would be a net loss. And I’m not even counting the time I could spend on it playing games instead of training nets.</p>

<p>And if that wasn’t enough, depending on where you live and the quality of your electric wiring, such a machine will bring more pain and expenses than joy. Have you ever had your personal computer/workstation randomly shut down due to excessive power consumption, maybe even taking down all your desk appliances with it? I have.</p>

<h3 id="free-solution-google-colab">Free solution: Google Colab</h3>

<p>A popular alternative would be to use Google Colab. But not so fast. There are some limitations worth mentioning. Colab’s free tier will only allow you one GPU per account, you have to be mindful of the daily GPU quota (about 8 hours within 24h), and you can’t run the same notebook in parallel, even on the CPU runtime.</p>

<p>What about Colab Pro/Pro+?</p>

<ol>
  <li>You are not guaranteed any specific GPU. It could be a P100, a T4, or, once in a blue moon, a V100.</li>
  <li>It’s still a single notebook. What if I want multiple?</li>
  <li>What are “compute units”, and how much does each GPU cost?</li>
</ol>

<p>If I am to pay for a service, I’d like to understand what I am paying for and how I’m billed. The opacity of Colab Pro and Pro+ is something I’m not sure I’d be willing to accept.</p>

<h2 id="the-first-not-so-good-solution">The first (not so) good solution</h2>

<p>Given all that, my first solution was to rely on Colab because it has free access to some GPU resources. With the saved money, I indulged myself with over 20 different kinds of cheese and too many macaron flavors to count. Vive la France!</p>

<p>To run more experiments and somehow circumvent the limited access to GPUs, I used multiple Google accounts. Each account had a copy of the same Colab notebook; only the hyperparameters changed between them. If you wonder whether managing these identical-but-not-quite notebooks was a mess, I’ll answer you - it was an absolute mess.</p>

<p>As for my storage solution - I stored model checkpoints in a shared Google Drive. Given that a blob’s storage consumption is counted against the account that created it, not the account hosting the shared drive, in practice, the amount of available Google Drive storage doubled.</p>

<p>What about experiment tracking? - Google Sheets. Yes, it started to become a mess after the 3rd change of the experiment setup.</p>

<h2 id="towards-a-better-solution">Towards a better solution</h2>

<p>Of course, it was unsustainable and slow. And painful. And annoying. And somewhat challenging to replicate. So, I needed another solution, and by this time had outlined some constraints:</p>

<ul>
  <li><strong>Constraint One</strong>: Messy environment, mainly Jupyter, with relatively limited code refactoring</li>
  <li><strong>Constraint Two</strong>: Ideally, I wanted numerically replicable experiments</li>
  <li><strong>Constraint Three</strong>: Also, experiments take a long time, so I want to run many at the same time</li>
  <li><strong>Constraint Four</strong>: Cost is a big issue because the research is self-funded</li>
</ul>

<p>Based on these constraints, I had my core requirements: <strong>Cost-efficiency</strong>, <strong>Flexibility</strong>, and <strong>Reproducibility</strong>. I had some ideas in mind to accomplish these requirements, but I needed computing resources, so my next stop was to use a public cloud.</p>

<p>I picked GCP because I’m most familiar with it. I know about alternative GPU clouds like Paperspace or Linode, but <em>I felt</em> that they might be more expensive. Plus, again, I am most familiar with GCP.</p>

<center><img src="/_data/MLOpsBelgium/MLOps for independent research.gif" width="850" heigth="480" /></center>
<center><i>If you look long enough, you'll hear the song | Image based on the slides by the author</i></center>

<p>Initially, I provisioned stuff from the Web console. But it was tedious and error-prone; besides, I like CLIs better, and I had had Terraform and Ansible on my radar for a while.</p>

<h3 id="core-requirements-cost-efficiency">Core requirements: Cost-efficiency</h3>

<p>Here are some decisions that stemmed from this requirement.</p>

<ol>
  <li>I needed the cheapest powerful machines - Preemptible VMs with GPUs</li>
  <li>I also needed a simple way to quickly spin machines up and down, so that I don’t leave anything running by accident and don’t waste time setting up the environment - Terraform FTW, and Ansible too</li>
  <li>I had a hunch that by using the most powerful machine and maximizing its usage, I would have the best price-performance ratio - thus, I chose A100 GPUs. To be absolutely honest, another driver for this decision was the coolness factor</li>
  <li>I ran multiple experiments in parallel, as fast as possible - I used Papermill for the hands-off launch of multiple notebook-based experiments. Occasionally, I used tmux from the JupyterLab terminal window, but it was a total pain.</li>
  <li>The best cost optimization is not to run things at all - so I used hyperparameter optimization (HPO) to select which configurations to run. For HPO, I used Optuna.</li>
</ol>

<p>Of all the HPO tools out there, why did I choose Optuna, you may ask?</p>

<ul>
  <li>I like their API. It integrates nicely with Python control structures, like for-loops or if-elif-else; see the sketch right after this list.</li>
  <li>Optuna uses a Bayesian HPO approach. Bayesian methods are pretty accurate and more hands-off than random search, allowing me to launch the hyperparameter search sweep and not think about narrowing down the search space.</li>
  <li>A downside of Bayesian Optimization methods is that they are slow-ish / not very parallelizable. But that’s ok, my degree of parallelization is 2-5 parallel runs, and I didn’t intend to go multi-node.</li>
</ul>
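
<p>To make this concrete, here’s a minimal sketch of that API. The objective body is a stand-in for a real training run, not code from my notebooks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import optuna

def objective(trial):
    # Hyperparameters are proposed with plain Python calls,
    # so they compose naturally with if/else and loops
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optimizer = trial.suggest_categorical("optimizer", ["adam", "sgd"])
    if optimizer == "sgd":
        # Conditional hyperparameter: only suggested for SGD trials
        momentum = trial.suggest_float("momentum", 0.0, 0.99)
    # Stand-in for a real training run; return the validation metric here
    return (lr - 3e-4) ** 2

study = optuna.create_study(direction="minimize")
# n_jobs &gt; 1 runs trials concurrently, matching my 2-5 parallel runs
study.optimize(objective, n_trials=20, n_jobs=2)
print(study.best_params)
</code></pre></div></div>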

<p>These decisions converged in the following architecture.</p>

<center><img src="/_data/MLOpsBelgium/DeploymentDiag.drawio.webp" width="850" heigth="480" /></center>
<center><i>I'd get spanked by any half-decent security consultant for this architecture | Image based on the slides by the author</i></center>

<p>So, a lot of stuff going on here. Let me explain. On the left side, you’ll see the configuration files on the local machine, which are used to instantiate the infrastructure on the right side. Basically, it all starts with <code class="language-plaintext highlighter-rouge">terraform apply</code>, which reads and executes all the Terraform files in the project, like the snippet below.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">required_providers</span> <span class="p">{</span>
    <span class="nx">google</span> <span class="p">=</span> <span class="p">{</span>
      <span class="nx">source</span>  <span class="p">=</span> <span class="s2">"hashicorp/google"</span>
      <span class="nx">version</span> <span class="p">=</span> <span class="s2">"3.5.0"</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="nx">provider</span> <span class="s2">"google"</span> <span class="p">{</span>
  <span class="nx">credentials</span> <span class="p">=</span> <span class="nx">file</span><span class="err">(</span><span class="s2">"project-name-some-id.json"</span><span class="err">)</span>

  <span class="nx">project</span> <span class="p">=</span> <span class="s2">"project-name"</span>
  <span class="nx">region</span>  <span class="p">=</span> <span class="s2">"${var.region}"</span>
  <span class="nx">zone</span>    <span class="p">=</span> <span class="s2">"${var.region}-a"</span>
<span class="p">}</span>


<span class="nx">resource</span> <span class="s2">"google_compute_instance"</span> <span class="s2">"vm_instance_worker"</span> <span class="p">{</span>
  <span class="nx">name</span>         <span class="p">=</span> <span class="s2">"gcp-vm-instance-worker"</span>
  <span class="nx">machine_type</span> <span class="p">=</span> <span class="s2">"a2-highgpu-1g"</span>

  <span class="nx">boot_disk</span> <span class="p">{</span>
    <span class="nx">initialize_params</span> <span class="p">{</span>
      <span class="nx">image</span> <span class="p">=</span> <span class="s2">"deeplearning-platform-release/pytorch-latest-cu110"</span>
      <span class="nx">type</span>  <span class="p">=</span> <span class="s2">"pd-ssd"</span>
      <span class="nx">size</span>  <span class="p">=</span> <span class="mi">150</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="nx">metadata</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">ssh</span><span class="err">-</span><span class="nx">keys</span>              <span class="p">=</span> <span class="s2">"username:${file("</span><span class="err">~/.</span><span class="nx">ssh</span><span class="err">/</span><span class="nx">sshkey</span><span class="err">.</span><span class="nx">pub</span><span class="s2">")}"</span>
    <span class="nx">install</span><span class="err">-</span><span class="nx">nvidia</span><span class="err">-</span><span class="nx">driver</span> <span class="p">=</span> <span class="kc">true</span>
    <span class="nx">proxy</span><span class="err">-</span><span class="nx">mode</span>            <span class="p">=</span> <span class="s2">"project_editors"</span>
  <span class="p">}</span>

  <span class="nx">scheduling</span> <span class="p">{</span>
    <span class="nx">automatic_restart</span>   <span class="p">=</span> <span class="kc">false</span>
    <span class="nx">on_host_maintenance</span> <span class="p">=</span> <span class="s2">"TERMINATE"</span>
    <span class="nx">preemptible</span>         <span class="p">=</span> <span class="kc">true</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="nx">resource</span> <span class="s2">"null_resource"</span> <span class="s2">"provision_worker"</span> <span class="p">{</span>
  <span class="nx">provisioner</span> <span class="s2">"local-exec"</span> <span class="p">{</span>
    <span class="nx">command</span> <span class="p">=</span> <span class="o">&lt;&lt;</span><span class="no">EOF</span><span class="sh">
                ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook \
                -u username \
                -i "${google_compute_instance.vm_instance_worker.network_interface.0.access_config.0.nat_ip}," \
                --extra-vars "tracker_uri=${google_compute_instance.vm_instance_tracker.network_interface.0.access_config.0.nat_ip}" \
                ./config-compute.yml
</span><span class="no">            EOF
</span>  <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">.tf</code> files use the GCP provisioner, and as such, they need a service account key (<code class="language-plaintext highlighter-rouge">credentials</code> in <code class="language-plaintext highlighter-rouge">provider "google"</code>) to be able to provision resources like VMs, buckets, and networks.</p>

<!-- I don't know about you, but to me HCL (Hashicorp Configuration Language) looks a bit like JSON and Protobuf had a baby. -->

<p>Once the infrastructure provisioning part is done, the <code class="language-plaintext highlighter-rouge">local-exec</code> provisioner is triggered, which is responsible for running the Ansible playbook and configuring each provisioned VM. It installs drivers, sets env vars, and launches MLFlow or JupyterLab as background processes. See an example Ansible playbook below.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">jupyter-install</span>
  <span class="na">become</span><span class="pi">:</span> <span class="s">username</span>

  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install nvidia drivers</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s">sudo /opt/deeplearning/install-driver.sh</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">test nvidia drivers</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s">/opt/conda/bin/python -c 'import torch; print(torch.cuda.is_available())'</span>
      <span class="na">register</span><span class="pi">:</span> <span class="s">nvidia_test</span>

    <span class="pi">-</span> <span class="na">debug</span><span class="pi">:</span> <span class="s">msg="{{ nvidia_test.stdout }}"</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install mlflow</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s">/opt/conda/bin/pip install mlflow==1.20.2 google-cloud-storage==1.42.3 optuna==2.10.0 papermill==2.3.3</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">launch jupyterlab</span>
      <span class="na">environment</span><span class="pi">:</span>
        <span class="na">MLFLOW_TRACKING_URI</span><span class="pi">:</span> <span class="s1">'</span><span class="s">http://{{</span><span class="nv"> </span><span class="s">tracker_uri</span><span class="nv"> </span><span class="s">}}:5000'</span>
        <span class="na">MLFLOW_S3_ENDPOINT_URL</span><span class="pi">:</span> <span class="s">gs://some_bucket_address</span>
        <span class="na">PATH</span><span class="pi">:</span> <span class="s">/opt/conda/bin:{{ ansible_env.PATH}}</span>
      <span class="na">shell</span><span class="pi">:</span> <span class="s2">"</span><span class="s">nohup</span><span class="nv"> </span><span class="s">/opt/conda/bin/jupyter</span><span class="nv"> </span><span class="s">lab</span><span class="nv"> </span><span class="s">--NotebookApp.token=some_token</span><span class="nv"> </span><span class="s">--ip</span><span class="nv"> </span><span class="s">0.0.0.0</span><span class="nv"> </span><span class="s">--no-browser</span><span class="nv"> </span><span class="s">&amp;"</span>
</code></pre></div></div>

<p>I am provisioning two VMs, one for the experiment tracker and one for running experiments. I also need a firewall to allow TCP traffic on select ports, specifically 5000 (MLFlow), 8888 (JupyterLab), and 22 (SSH). Finally, I have a GCS bucket as the artifact repository for MLFlow.</p>

<p>Notice that my VMs receive a copy of my SSH public key. It’s necessary to allow SSH connections from my local machine because Ansible uses SSH to connect to its targets.</p>

<h3 id="core-requirements-flexibility-and-parallelism">Core requirements: Flexibility and Parallelism</h3>

<p>Research is quite messy. I try to fix the mess by extracting common code, maybe writing some utils, but sometimes I prioritize running experiments.
As mentioned, I was using Jupyter and Optuna. To make them work nicely together, I used Papermill.</p>

<p>Papermill allows for parametrized, programmatic execution of Jupyter notebooks. Let me explain with a table:</p>

<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Example Usage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Parametrizes notebooks</td>
      <td>Propose hyperparameters</td>
    </tr>
    <tr>
      <td>Can inspect them</td>
      <td>Extract final scores</td>
    </tr>
    <tr>
      <td>Executes them</td>
      <td>Run notebooks from the command line</td>
    </tr>
    <tr>
      <td>Stores them</td>
      <td>Save specific notebook variants</td>
    </tr>
  </tbody>
</table>

<p>So, in my setup, a Python CLI program with Optuna and Papermill launches multiple parallel experiments, something like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python notebook_hpo.py <span class="se">\</span>
  <span class="nt">-i</span> Test.ipynb <span class="se">\</span>
  <span class="nt">-o</span> <span class="s1">'./out/Test.{run_id}.ipynb'</span> <span class="se">\</span>
  <span class="nt">-p</span> ./parameters.yml <span class="se">\</span>
  <span class="nt">-j</span> 8
</code></pre></div></div>

<p>Or, if you prefer a diagram to a code snippet, here’s one:</p>

<center><img src="/_data/MLOpsBelgium/HPODiagram.drawio.webp" width="550" heigth="480" /></center>
<center><i>I'd get spanked by any half-decent UML aficionado for this diagram | Image based on the slides by the author</i></center>
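
<p>For reference, here’s a minimal sketch of what such a launcher could look like. The flags mirror the invocation above, but the search space and the score-file convention are my own illustrative assumptions, not my exact script:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># notebook_hpo.py - a minimal sketch, not the exact script I used
import argparse
import json

import optuna
import papermill as pm

def objective(trial, args):
    # Optuna proposes the hyperparameters...
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [256, 448, 512]),
        "run_id": trial.number,
    }
    # ...and Papermill injects them into the notebook's "parameters" cell
    # and executes it top to bottom
    pm.execute_notebook(args.input, args.output.format(run_id=trial.number),
                        parameters=params)
    # Assumes the notebook dumps its final validation score as JSON
    with open(f"./out/score.{trial.number}.json") as f:
        return json.load(f)["val_loss"]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input")
    parser.add_argument("-o", "--output")
    parser.add_argument("-p", "--params")
    parser.add_argument("-j", "--jobs", type=int, default=1)
    args = parser.parse_args()

    study = optuna.create_study(direction="minimize")
    # One notebook executes per trial, args.jobs of them concurrently
    study.optimize(lambda trial: objective(trial, args),
                   n_trials=30, n_jobs=args.jobs)
</code></pre></div></div>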

<h3 id="core-requirements-reproducibility">Core requirements: Reproducibility</h3>

<p>I have suffered enough in the industry from unreplicable training runs, so I needed to eliminate this issue in my research.</p>

<p>I needed <strong>tracking</strong> and <strong>determinism</strong>.</p>

<p>I won’t dive deep into the matter of running reproducible experiments. But I’ll allow myself to repeat some stuff. You can find a more detailed overview <a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable">here</a>, in the <code class="language-plaintext highlighter-rouge">The takeaways &gt; Replicable experiments</code> part.</p>

<p>The deterministic experiments checklist (for PyTorch):</p>
<ul>
  <li>The most important thing you can do is to seed your pseudo-random number generators (Python, Numpy, PyTorch, CUDA), aka PRNGs.</li>
  <li>Be reasonable about (non-)determinism: Calling <code class="language-plaintext highlighter-rouge">torch.use_deterministic_algorithms()</code> is a Nope for me because <a href="https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms">it will throw errors</a> when calling <code class="language-plaintext highlighter-rouge">.backward()</code> for some layers. On the other hand, setting <code class="language-plaintext highlighter-rouge">torch.backends.cudnn.{benchmark,deterministic}</code> properties is fine; they won’t throw errors.</li>
  <li>Parallel data loaders need special consideration: PyTorch users, don’t forget to also seed the PRNGs in each of your <code class="language-plaintext highlighter-rouge">DataLoader</code> workers, like this:</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">seed_worker</span><span class="p">(</span><span class="n">worker_id</span><span class="p">):</span>
    <span class="n">worker_seed</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">initial_seed</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>
    <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>

<span class="n">g</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Generator</span><span class="p">()</span>
<span class="n">g</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> 
<span class="n">dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="n">num_workers</span><span class="p">,</span> 
                <span class="n">worker_init_fn</span><span class="o">=</span><span class="n">seed_worker</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">g</span><span class="p">)</span>
</code></pre></div></div>

<p>That’s kind of it with the determinism part. How should I handle my experiment tracking infra?</p>
<ul>
  <li>I need a minimal, dedicated, non-preemptible VM (<code class="language-plaintext highlighter-rouge">n1-standard-2</code> works fine) because I don’t want my tracking server preempted without first having a DB backup on my laptop, and implementing a half-decent backup script wasn’t something I wanted to do</li>
  <li>The experiment tracking server is a self-hosted MLFlow; I am quite familiar with it</li>
  <li>The tracking database is SQLite. SQLite, being basically a single file, allows me to <code class="language-plaintext highlighter-rouge">scp</code> it to my local machine when done working and load it with Terraform <code class="language-plaintext highlighter-rouge">file-provisioner</code> on startup</li>
  <li>All my artifacts are checkpointed to GCS, or rather, I’m using GCS as an artifact repository for MLFlow</li>
</ul>

<p>My tracking strategy:</p>
<ul>
  <li>Track all modifiable hyper-parameters</li>
  <li>During fine-tuning, track loss, top-1 and top-5 accuracy on both training and validation splits</li>
  <li>During pre-training, only track loss</li>
  <li>No need to track data because I use standard datasets like CIFAR100 or STL10</li>
  <li>Based on my previous experience, I find it quite annoying working with nested runs, so I don’t use those</li>
  <li>I created a new experiment on every qualitative/untracked change (a different dataset, changed pre-processing code, a different SSL pre-training method)</li>
</ul>
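
<p>In code, this strategy boils down to a few MLFlow calls per run. Here’s a minimal sketch; the experiment name and the values are made up:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import mlflow

# MLFLOW_TRACKING_URI is already set by the Ansible playbook above.
# One experiment per qualitative change; this name is made up.
mlflow.set_experiment("ssl-cifar100-simclr")

with mlflow.start_run():
    # Track all modifiable hyperparameters
    mlflow.log_params({"lr": 3e-4, "batch_size": 512, "epochs": 10})
    for epoch in range(10):
        train_loss = 1.0 / (epoch + 1)  # stand-in for the real training loop
        # During pre-training, only the loss; during fine-tuning, also
        # top-1/top-5 accuracy on the training and validation splits
        mlflow.log_metric("train_loss", train_loss, step=epoch)
    # mlflow.log_artifact("checkpoint.pt")  # artifacts land in the GCS bucket
</code></pre></div></div>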

<p>Some of it is also explained in detail in that same article referenced above (<a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable">here it is</a>, for your convenience), in the <code class="language-plaintext highlighter-rouge">The takeaways &gt; Experiment tracking</code> part.</p>

<p>Tracking all this stuff with MLFlow also allows me to compare runs with parallel coordinate plots, which is the best way to look at your hyperparameter optimization runs, IMO!</p>

<p>By the way, if you’re not familiar with MLFlow, <a href="https://mlflow.org/docs/latest/quickstart.html">here’s a link</a>.</p>

<h2 id="was-it-all-worth-it">Was it all worth it?</h2>

<p><strong>TL;DR:</strong> Yes, let me show you why.</p>

<p>First, let’s assume the following setup: ResNet50, pre-training (PT) + fine-tuning (FT), for 10 epochs, with batch sizes 512 (PT) and 4096 (FT).</p>

<p>Let’s first do some benchmarks.</p>

<table>
  <thead>
    <tr>
      <th>GPU type</th>
      <th>pre-training time</th>
      <th>fine-tuning time</th>
      <th>compared to A100</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Colab K80 12GB</td>
      <td>965s</td>
      <td>310s</td>
      <td>5.1x slower</td>
    </tr>
    <tr>
      <td>T4 16GB</td>
      <td>420s</td>
      <td>122s</td>
      <td>2.2x slower</td>
    </tr>
    <tr>
      <td>A100 40GB</td>
      <td>166s</td>
      <td>80s</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<!-- A100 w\ FP32 - 190 + 95
A100 w\ 448 batch size - 169s -->

<p>Let’s do some simple math with the same setup.</p>

<p>A model takes 7.2GB of VRAM - except on the A100, where <strong>it uses 8.4GB</strong> for the same setup. No idea why.</p>

<table>
  <thead>
    <tr>
      <th>GPU Type</th>
      <th>Nr. of parallel runs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Colab K80 12GB</td>
      <td>1</td>
    </tr>
    <tr>
      <td>T4 16GB</td>
      <td>2</td>
    </tr>
    <tr>
      <td>V100 16GB</td>
      <td>2</td>
    </tr>
    <tr>
      <td>A100 40GB</td>
      <td>4 (<strong>5 runs with <code class="language-plaintext highlighter-rouge">batch_size</code> 448</strong>)</td>
    </tr>
  </tbody>
</table>

<p>Let’s do some more math.</p>

<p>GCP billed my A2 instance for 44h, meaning I ran experiments for almost 44h. Of course, I launched those experiments manually with my script, and there was some idle time, but it was minimal. Anyway, 44 billed hours on A2. For the same volume of work with a T4 GPU, I’d get billed for…</p>

<p><code class="language-plaintext highlighter-rouge">44h x (5 runs / 2 runs) x 2.2 speedup == 240h w/ T4</code></p>

<p>… for 240 hours. That is a lot more, even if T4 GPUs are considerably cheaper!</p>

<p>Hold on - 5 parallel runs on the A100 are only possible with a batch size of 448, not 512. That’s an almost 10% smaller batch size, so training should take roughly 10% more time in this setting. Well, based on a few experiments, changing the batch size from 512 to 448 results in just a 3-5% pre-training slowdown, plus there’s the fine-tuning part, which we don’t alter, so all in all, it’s still going to be roughly 2.2x faster than the T4.</p>

<p>Anyway, <strong>for that 44h I paid 48 USD</strong>.</p>

<p>Before we move forward, let’s make one thing clear: based on the information we have so far, <strong>Colab Pro/Pro+ is not worth it</strong>, compared with my setup, at least.</p>

<p>Colab Pro+ is 43 EUR/month. It does not guarantee the accelerator type, it uses an opaque “compute units” payment scheme, and 200+ hours on a T4 would consume those units in no time.</p>

<p>Let’s do some more math. How much would I have to pay for 240h of using a T4 GPU, with a decent VM instance, like an <code class="language-plaintext highlighter-rouge">n1-standard-8</code>?</p>

<p><code class="language-plaintext highlighter-rouge">240h x 3.15 USD/h / 17.381h = 43.5 USD</code></p>

<p>Based on these calculations, I paid a <strong>~5 USD premium for a ~6x speedup</strong>. Totally worth it.</p>

<p>In fact, I would have paid more than 43 USD for 240h on a T4, because it seems the <strong>network is 1.8-2x slower</strong> on N1 instances, resulting in a long wait to download the necessary dataset after each provisioning. A few test runs of A2 and N1-standard-8 instances averaged 9m 30s and 19m, respectively, to download CIFAR100. On a side note, I could have kept copies of the datasets in a GCS bucket, but I didn’t. Maybe I thought it would cost a little too much for its worth, and I’d be annoyed by it. But what’s done is done. Given that I would need to run a T4 instance for considerably longer to do the same amount of work, I’d also have to provision my infrastructure more often, leading to more waiting for my CIFAR100 or STL10 datasets to download. That would definitely add up to more than 43 USD.</p>

<p>So A2 is both faster <strong>and</strong> cheaper in my setup. I wish my gut feeling would always work this well.</p>

<center><img src="/_data/MLOpsBelgium/gpu_spot_comparision.webp" width="350" heigth="280" /></center>
<center><i>It might not seem like it, but A100 is the better deal | Image based on the slides by the author</i></center>

<p>So, I hope you can see that using the most expensive single GPU setup on GCP turned out to be the best decision. It costs roughly the same or even less than using the seemingly most cost-efficient one while being soooooo much faster. Even if running an A2 instance was 2x more expensive than N1 with T4 GPU, I’d still take that expense to be able to do 240+h of work in 44h.</p>

<h2 id="future-directions">Future directions</h2>

<p>It may seem like I have my setup optimized to the limit. But it has room for improvement. I’d say the room is the size of a nice large kitchen with an island in the middle and a terrace for summer dining.</p>

<p>The most impactful missed opportunity is using Mixed Precision. Surprisingly, I wasn’t using it. Maybe because of my old trauma installing APEX from scratch. But now <a href="https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/">it’s pretty easy</a>, or so <a href="https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994/9">they say</a>. Thankfully, A100 GPUs have a magic trick, which seems to be enabled by default in PyTorch. This trick is called the TF32 float number representation. It’s a reduced-precision floating-point format that runs on Nvidia’s Tensor Cores and allows for a transparent and easy switch to FP32 when necessary.</p>
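
<p>For completeness, here’s roughly what the switch looks like - a minimal sketch of the <code class="language-plaintext highlighter-rouge">torch.cuda.amp</code> pattern with a toy model, assuming a CUDA device, not code from my notebooks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from torch import nn

model = nn.Linear(32, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 underflow

for _ in range(10):
    inputs = torch.randn(64, 32, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # ops run in half precision where safe
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
</code></pre></div></div>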

<p>A trickier thing I’d like to do is to optimize the data loading. CPUs are underutilized in my setup. Given that my datasets are all standard, I’m considering using <a href="https://ffcv.io/">FFCV</a>.</p>

<p>A few more niche things, with lower priority than the stuff described above:</p>
<ul>
  <li>Threaded checkpoint saving, because it currently runs in the same thread as training and takes a few seconds at the end of each epoch.</li>
  <li>Try MosaicML for additional gains. I’m thinking specifically of the <a href="https://docs.mosaicml.com/en/latest/method_cards/channels_last.html">ChannelsLast</a> and <a href="https://docs.mosaicml.com/en/latest/method_cards/progressive_resizing.html">ProgressiveResizing</a> methods, but also PyTorch’s <a href="https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html">OneCycleLR</a>.</li>
  <li>Automatic restart from checkpoints (GCP MIGs + startup scripts) for longer training runs.</li>
</ul>

<h3 id="not-my-case-for-now">Not my case, for now:</h3>
<ul>
  <li>Model/Tensor/Pipeline parallelism - my largest model is a ResNet101</li>
  <li>Huge datasets - I’m not even planning to use ImageNet</li>
  <li>Collaboration - I was the only one working on it and only discussed the results with my supervisor</li>
</ul>

<h2 id="a-few-takeaways">A few takeaways</h2>

<ol>
  <li><strong><em>Automate stuff</em></strong> - I’m sure you’ll be glad you did when you can spin up a complete work setup in minutes with a single click. And shut it down with the same ease. Not to mention leaving an instance running will be a thing of the past.</li>
  <li><strong><em>Track your experiments</em></strong> - If you want to reproduce your excellent results or figure out what other tricks to try, keeping a log of what you did and how it went is essential.</li>
  <li><strong><em>Invest in maximizing resource utilization</em></strong> - Having powerful hardware means nothing if it stays idle or is underutilized. Make sure you feed it enough work, so your investment breaks even faster.</li>
  <li><strong><em>The most powerful hardware can be the most cost-effective</em></strong> - That said, using the newest, most advanced, and most powerful hardware can be not only fun but also cost-effective. And finally,</li>
  <li><strong><em>Moving faster costs money</em></strong> - but it’s worth it.</li>
</ol>

<h2 id="ps">P.S.</h2>

<p>“Eventually I will buy a GPU”, from the Director of “I will stop binge-playing PS5” and “I promise I’ll go to the gym consistently”.</p>

<!-- https://www.canva.com/design/DAFRd5NNBRc/TT475viVVE0ZtjVzEkgxDg/edit -->]]></content><author><name></name></author><category term="posts" /><category term="mlops," /><category term="devops," /><category term="ml," /><category term="research," /><category term="infrastructure," /><category term="machine" /><category term="learning" /><summary type="html"><![CDATA[Find out how working on an independent research project led me to apply my MLOps skills to create a performant and cost-effective experiment infrastructure]]></summary></entry><entry><title type="html">A fable about MLOps… and broken dreams</title><link href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable" rel="alternate" type="text/html" title="A fable about MLOps… and broken dreams" /><published>2022-11-21T22:12:00+00:00</published><updated>2022-11-21T22:12:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/mlops-fable</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable"><![CDATA[<p>For a while, I was considering presenting more often at conferences and meetups. I postponed it for quite some time, but this summer, I thought, “No more!” and applied to be a speaker at the <a href="https://mdc.md/">Moldova Developer Conference</a>. And I was accepted with a talk about MLOps! I thought I’d make the talk a kind of fairytale/fable story with blackjack and easter eggs. Fast forward to a few weeks ago: in the first half of November, I presented at the conference, and because not everyone could attend it, I also decided to write a blog post on the topic.</p>

<!-- **UPDATE**: Here's the recording from that presentation -->

<h2 id="intro">Intro</h2>

<p>This article is divided into two parts, <em>The Story</em> and <em>The Takeaways</em>. Let’s start with the story.</p>

<h2 id="the-fable-about-mlops">The fable about MLOps…</h2>

<p>Note that all the characters in the story are fictional. So is the setting in which the story happens. They are not inspired by concrete people or organizations but rather distilled from my many experiences and a few industry stories. Alright, story time.</p>

<h3 id="act-1-we-need-a-poc-to-prove-ml-is-a-good-investment">Act 1: We need a PoC to prove ML is a good investment</h3>

<center><img src="/_data/webp/a_long_time_ago.webp" width="850" heigth="480" /></center>
<center><i>I'm sure you can figure out this reference | Image based on the slides by the author</i></center>

<p>In an alternate reality, or maybe just another time and place, there was a company - <strong>Lupine Corp.</strong>  Lupine Corp. is a logistics company with a very long history,
dating back to the revolution. However, no one remembers which one; it could be the French or the Bolshevik. Like any respectable company, they have a set of values and principles they abide by. One of their core tenets is to be <em>cost-efficient</em>. The other one is - <em>no unnecessary risks</em>.</p>

<center><img src="/_data/webp/the_adoption_cycle.webp" width="850" heigth="480" /></center>
<center><i>They were hyped by Hadoop, in 2020. I mean... | Image based on the slides by the author</i></center>

<p>Lupine Corp. are also known for doing their due diligence. So they knew that before launching their ML initiative, they needed to have their prerequisites in place.</p>

<ol>
  <li>
    <p>They made sure to know their success metrics, meaning they established some KPIs and a way to report and track those.</p>
  </li>
  <li>
    <p>They also had their data easily accessible and discoverable, not just existing somewhere in their databases. They knew this would be very important for the data scientists they would hire.</p>
  </li>
  <li>
    <p>Finally, the leadership knew that Data Science and ML are much more unpredictable than traditional software engineering, and they adjusted their expectations accordingly.</p>
  </li>
</ol>

<blockquote>
  <p>Side note: With only these 3 points, Lupine Corp. were so much better prepared for ML than the majority of the companies out there.</p>
</blockquote>

<p>Lupine Corp. imposed some budget limitations because of the unpredictable nature of ML projects, so they only hired two people:</p>

<ul>
  <li><strong>Nifel Nifenson</strong> (image below, left), who previously worked for two years as a lone Data Scientist in a small company</li>
  <li><strong>Nafaela Nafarri, PhD</strong> (image below, right), a Senior Data Scientist with six years of experience</li>
</ul>

<p>Nifel Nifenson is a very results-oriented guy. One could say he’s the (rough) embodiment of the Lean Startup philosophy. Nafaela Nafarri has a strong analytical mind. When Lupine Corp. asked them to deliver some results ASAP, they did just that and then some. The results were very promising and delivered in record time.
Senior management was ecstatic, and more use cases were in discussion.</p>

<center><img src="/_data/webp/great_success.webp" width="850" heigth="480" /></center>
<center><i>Dream team. Left - Alexander the Great in the Battle of Issus Mosaic. Right - Pallas Athena by Rembrandt | Image based on the slides by the author</i></center>

<h3 id="act-2-expanding-the-team-signs-of-trouble">Act 2: Expanding the team. Signs of trouble.</h3>

<p>As with all things in business and life, at a larger scale, the cracks became more apparent.</p>

<p>Nifel, Nafaela, and the new team members got along very well. It was a very nice team to work with. Everyone was professional and friendly. Yet somehow, the team’s velocity (as per Scrum, or “throughput” as per Kanban) wasn’t scaling as expected. It even started to go down after a few months. More people and more time were required to complete the same work Nifel and Nafaela had done a few months before. But why was this happening?</p>

<p>There are many reasons why.
For example, many promising experiments couldn’t be replicated, even with all the notes the team took.
Also, they observed increasing complaints from some of the users of their deployed models. The first few weeks after the models were put in production, everyone was happy, but over time more and more bad feedback came in.</p>

<p>And if all that wasn’t enough, some of those productionized use cases started to receive a lot of traffic, sometimes up to two thousand concurrent users. They decided to horizontally scale their existing Docker containers to serve them all. It wasn’t resource-efficient. It was hard to manage. And the latency SLAs were thrown out of the window with worrying regularity…</p>

<h3 id="act-3-bringing-the-big-guns">Act 3: Bringing the big guns</h3>

<p>Lupine Corp. was upset with the prospect of their ML initiative imploding, so they hired <strong>Nuf Nufelman</strong> as the new Head of Data Science.</p>

<p>Previously he worked as a lead data scientist at a big non-FAANG company, similar in structure to Lupine Corp. but quite different culturally. His previous employer was basically a “throw money at the problem” type of company, and Nuf was shaped by this mentality too. Nuf was also a great DevOps believer.</p>

<center><img src="/_data/webp/nuf_intro.webp" width="850" heigth="480" /></center>
<center><i> Nuf was born and raised in Odessa, but lost his way, a bit | Zeus' statue at Versailles | Image based on the slides by the author</i></center>

<p>He understood that the problem Nifel’s and Nafaela’s team faced was a replicability problem.</p>

<p>… and a retraining problem.</p>

<p>….. and a scalability problem.</p>

<p>They needed a well-structured process to research, develop, evaluate and productionize their work consistently.</p>

<p>In a meeting with the higher-ups, Nuf told them that if Lupine Corp. was serious about their ML intentions, they had to adopt MLOps, <em>wholly and without question</em>. They accepted.</p>

<p>To streamline adoption, Nuf suggested they don’t develop all the tools in-house but instead pay for an ML-platform-as-a-service (MLPaaS) by All-You-Need-And-A-Kitchen-Sink ($AYN). All-You-Need-And-A-Kitchen-Sink is a recently IPO-ed startup that <em>“solves all the MLOps pains”</em>.</p>

<p>Surprisingly, it worked.</p>

<p>Most of the past problems went away.</p>

<p>But a lot of the internal processes still needed adjustments. Because it was quite a generic tool, a lot of glue code had to be written. Also, people didn’t like using it. The learning curve was steep. And some of the API design choices and documentation could have been more pleasant to work with.</p>

<p>And did I mention the Enterprise tier was a-seed-investment-grant-per-month expensive? If you ever complain about AWS bills, this one was probably even worse, but I digress.</p>

<h3 id="act-4-burning-cash-and-its-consequences">Act 4: Burning cash and its consequences</h3>

<p>The ML and Data Science initiative continued to grow at Lupine Corp. They hired more people and sometimes heard more complaints about their ML platform. It was slightly annoying but not that important for the upper management. They had different pains.</p>

<p>How could they ever be content when this new MLPaaS gizmo was burning cash like crazy? And recall their main tenets. Increasing their operational efficiency was a recurring topic during their meetings.</p>

<p>But as with anything in old, large corporations, it was a lot of talking and not so much doing.</p>

<p>And then, the earnings call day came…</p>

<center><img src="/_data/webp/earnings_call.webp" width="850" heigth="480" /></center>
<center><i>That day rang both the telephones and hell's bells | Christ in Limbo by a Follower of Jheronimus Bosch</i></center>

<p>Financials showed Lupine was burning a lot more cash than its competitors. They were no startup or scaleup. This showed financial recklessness. Shareholders didn’t like it. Neither did the stock market. Their stock plummeted 20% in a week. Something between Meta and Netflix.</p>

<p>To alleviate the issue, Lupine Corp. decided to optimize its operations. Now for real.</p>

<p>They laid off many employees working on non-critical aspects of the business. Where possible, they terminated said initiatives too.</p>

<p>It was clear one of the main reasons they were burning money was their ML platform. Obviously, the ML initiative was impacted. Nafaela and Nuf stayed, but Nifel was laid off. Layoff decisions were based on tenure and seniority.</p>

<center><img src="/_data/webp/goodbye_nifel.webp" width="850" heigth="480" /></center>
<center><i>Poor Nifel | Image based on the slides by the author</i></center>

<p>Cutting costs worked. But it wasn’t a good long-term strategy, and Lupine Corp. knew this all too well. They needed to optimize their OpEx. So now, Lupine Corp. was looking for someone who could help. And they found someone. Someone,</p>

<center><strong>Legen-      </strong></center>
<center><strong>waaaait for it</strong></center>
<!-- <center><strong></strong></center> -->
<center><strong>-dary</strong></center>

<p>Meet <strong>Nahum Nahreba</strong>.</p>

<p>He’s a platform engineer. He is known for thinking from first principles and building nimble, scalable solutions. He’s something of a Jeff Dean, although he might not be able <a href="http://www.neohope.com/2014/04/24/jeff-dean-facts/">to shift bits from one computer to the other</a>. He helped scale a few startups. It wasn’t the first time he had to work on ML platforms.</p>

<center><img src="/_data/webp/nahum_intro.webp" width="850" heigth="480" /></center>
<center><i>Truly a legend | Image based on the slides by the author</i></center>

<p><strong>TL;DR:</strong> He came. He saw. He solved the mess.</p>

<p>He persuaded Lupine Corp. to greenlight a major refactoring of the ML platform, pruning it of many unnecessary features, reducing the bill, and implementing a few features and tools internally, with a specific focus on developer experience and integration with the rest of the company’s infrastructure. It’s a fable, not a technical report, so I won’t dive deep into how he did it.</p>

<p>And so they lived relatively happily until Lupine Corp. management discovered IoT…</p>

<p>The end.</p>

<h2 id="the-takeaways">The takeaways</h2>

<p>So, how could Lupine Corp. avoid this mess? And how can other companies like them avoid it too?</p>

<p>First things first, we need to give credit where credit’s due. This fictional company did a lot of stuff others don’t, so their success chances were already pretty high. They knew what success looked like for them, they had their data available and discoverable, and they had a correct mindset about this initiative. In my practice, most companies don’t have that.</p>

<p>I would argue one of the reasons Lupine Corp. had such <del>fun</del> hard times was a well-known quote:</p>
<blockquote>
  <p><em>“Premature optimization is the root of all evil”</em> - Donald Knuth</p>
</blockquote>

<p>… as cited by Nifel Nifenson, and most SWEs. Nifel, in this story, had somewhat more software engineering experience, and it was his responsibility to use an SWE mindset when starting their ML journey. He knew by heart the quote above, the KISS principle, and many others. But he also, like most of us, didn’t quite understand the nuances behind said quotes. Nifel treated MLOps as overengineering. Under management’s tight deadline and pressure to show good results and prove himself a specialist, he created good ML models but not-so-good ML systems.</p>

<p>By the way, the “fuller” quote goes like this:</p>

<blockquote>
  <p><em>“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”</em></p>
</blockquote>

<p>If only Nifel knew it like this… <strong>So, takeaway #1: Start early with MLOps</strong>.</p>

<p>Nifel’s (counter-)example shows we must consider adopting MLOps practices early on. But it’s not so simple either.</p>

<p>Software and data people are an enthusiastic bunch. We want to use many tools to solve many problems. We’re very prone to over-engineering. If we were rockstars, I think this tendency towards abuse would have manifested a bit differently. Thankfully we aren’t rockstars.</p>

<blockquote>
  <p>By the way, it’s not my first piece on picking tools, so you’d like to check out the <a href="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool">other article about it</a>.</p>
</blockquote>

<p>When starting with MLOps, we can be overwhelmed by multiple tools, terms, concepts, and practices. We’ll hear from every corner how crucial it is to have pipeline orchestration, 17 types of ML and data tests, three types of observability, feature stores, model stores, metadata stores, stores to store stores… alright, I’m exaggerating now, but you got the idea.</p>

<p>You don’t need all this tooling, not from the start, even if it comes all bundled together, like AWS, GCP, or Azure offerings.</p>

<p>Using a fully-featured MLOps solution from the beginning usually doesn’t work.</p>

<p>Either because it’s too generic, or because there are too many upfront costs. Also, it takes a lot of work to onboard your users.</p>

<p>Going head-first into MLOps is a bad idea for most of the same reasons.</p>

<p>What you do need in the beginning is to…</p>
<ul>
  <li>quickly find and access your data</li>
  <li>seed that model training code</li>
  <li>record your experiment configuration</li>
</ul>

<p>Then make sure to</p>
<ul>
  <li>easily deploy your models</li>
  <li>have some tests</li>
</ul>

<p>The rest will come after. <strong>All that said, takeaway #2: Start small with MLOps</strong>.</p>

<p>Now onto more technical advice.</p>

<h3 id="simple-data-collection-and-discovery">Simple data collection and discovery</h3>

<p>Lupine Corp. had this, but I’m sure you don’t. So, what should you do? First, you need to understand <em>The Why?</em> We’re past the Big Data hype by almost ten years. Organizations now have lots of data… but it takes a lot of work to use it properly. It wouldn’t be an exaggeration to say that for the absolute majority of the projects I worked on, accessing datasets was my second most annoying problem. The first one was the lack of a baseline and success metrics. As I said, Lupine Corp. was in fact really good. Your company probably isn’t.</p>

<p>Alright, we know what “data collection” is. ETL pipelines and all that. Or a few scripts running as CRON jobs, dumping files into an S3 bucket. But what about data discovery?</p>

<p>A short googling session will reveal terms and technologies like data governance, data lineage, Amundsen from Lyft, Apache Atlas, Google Data Catalog… yeah, no. Not yet.</p>

<p>Have a shared spreadsheet. In it, each row is about a dataset. Name, short description, update frequency, contact person, and location in the object store. That’s it, at least in the beginning.</p>

<p>Do this, and your data scientists and ML engineers will be happy as hell. You’ll get recruits just by word of mouth.</p>

<p>Here’s a wacky architectural diagram for what you need for <strong>simple</strong> data collection and discovery.</p>

<center><img src="/_data/webp/simple_data_col_and_disco.webp" width="850" heigth="480" /></center>
<center><i>A few backup and automation scripts running on a schedule, S3 or something similar, a spreadsheet. If you can't do this, please don't hire ML engineers, you'll just waste money. | Image based on the slides by the author</i></center>

<p><strong>Pro tip:</strong> when you dump your raw data into those buckets, don’t overwrite your old data. You’ll see why later.</p>
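
<p>To illustrate, here’s a minimal sketch of such a dump job. The bucket name, key layout, and the <code class="language-plaintext highlighter-rouge">boto3</code> client are my assumptions - any object store and client will do:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import datetime

import boto3  # assumption: S3; any object-store client works the same way

s3 = boto3.client("s3")

def dump_table(table_name, payload):
    # Date-stamped keys mean old dumps are never overwritten, so you can
    # always get the data exactly as it looked on a given day
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d")
    key = f"raw/{table_name}/{stamp}/dump.parquet"  # hypothetical layout
    s3.put_object(Bucket="my-raw-data-bucket", Key=key, Body=payload)
</code></pre></div></div>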

<h3 id="replicable-experiments">Replicable experiments</h3>

<p>This one requires a few steps, but they’re relatively straightforward. First, you need to seed your pseudo-random number generators, aka PRNGs.</p>

<p>Not everyone knows this, or maybe not explicitly, but ML code is full of randomness. We need to initialize the parameters of our ML models - we use some random distributions. We also need to shuffle our data - also randomness. This is trivial for a machine learning practitioner. What is less trivial is how this randomness is “created”. You see, randomness in computers is not entirely random.</p>

<p><strong>(Optional Paragraph)</strong> We use <a href="https://www.cryptosys.net/rng_algorithms.html">special algorithms</a>, based on stuff like chaos theory, which, given an initial state, or a seed, and a set of usually recurrent rules, will generate a sequence of values. The rules are fixed, so the algorithm is deterministic, but the values are chaotic, meaning there’s no discernible pattern. Now, the seed value - the initial state used in these PRNGs - is usually a genuinely random number: it can be the exact current temperature of the CPU, the clock drift between multiple CPU cores, or some other value that is naturally random. But you can manually provide the initial state, and thus, when running the same sequence of operations multiple times, get the same sequence of values.</p>

<p>Back to our business. We can seed, or manually provide the initial states for our PRNGs so that running the same code will give us the same results - same models, same performance.</p>

<p>This is super important because if we can get the same results, we can properly validate and compare ML models and pick the best ones.</p>

<p>Python ML code has multiple sources of randomness, which can, and should, be seeded. This is because most numerical libraries in Python are written in C/C++/Fortran, and Python is a convenient wrapper to access these routines.</p>

<p>But there are a few more things between you and numerically replicable experiments besides PRNGs.</p>

<p>cuDNN is also standing in the way. cuDNN is Nvidia’s low-level set of primitives for deep learning. It has multiple GPU-optimized implementations for convolutions, pooling, linear layers, various activation functions, and so on. Now, cuDNN has a clever way of achieving maximum performance on different hardware for various scenarios. It tests multiple implementations of the same algorithm <em>at the start of the program</em> and picks the fastest one. <a href="https://discuss.pytorch.org/t/what-is-the-differenc-between-cudnn-deterministic-and-cudnn-benchmark/38054/2">This selection <em>can</em> be non-deterministic (read: random)</a>. Why? I am not sure, but as far as I understood, its heuristics might behave differently if there’s anything else running on the GPU. To disable this behavior, one has to set <code class="language-plaintext highlighter-rouge">torch.backends.cudnn.benchmark = False</code>. To my knowledge, there are also a <a href="https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#reproducibility">few other sources of randomness in cuDNN</a>, and you can disable (some of) these by setting <code class="language-plaintext highlighter-rouge">torch.backends.cudnn.deterministic = True</code>. And if you’re interested in finding out more on how to run replicable PyTorch experiments, <a href="https://pytorch.org/docs/stable/notes/randomness.html">check out this page from the docs</a>. And if you’re not, search if there are similar behaviors in your favorite framework.</p>

<!-- [eta-greedy/random search policy](https://rl-book.com/learn/bandits/e_greedy/) at its base.  -->

<p>Finally, most of the time, ML algorithms will try to take advantage of modern multi-core CPUs, and when designing replicable experiments, one has to think about it too.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span><span class="p">,</span> <span class="n">os</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch</span>

<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">backends</span><span class="p">.</span><span class="n">cudnn</span><span class="p">.</span><span class="n">deterministic</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">torch</span><span class="p">.</span><span class="n">backends</span><span class="p">.</span><span class="n">cudnn</span><span class="p">.</span><span class="n">benchmark</span> <span class="o">=</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">seed_worker</span><span class="p">(</span><span class="n">worker_id</span><span class="p">):</span>
    <span class="n">worker_seed</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">initial_seed</span><span class="p">()</span> <span class="o">%</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span>
    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>
    <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">worker_seed</span><span class="p">)</span>

<span class="n">g</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Generator</span><span class="p">()</span>
<span class="n">g</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> 
<span class="n">dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="n">num_workers</span><span class="p">,</span> 
                <span class="n">worker_init_fn</span><span class="o">=</span><span class="n">seed_worker</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">g</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Pro tip:</strong> when testing a machine learning model configuration, run it multiple times using different seed values. It will reduce the chance that you’re just lucky.</p>
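<p>A minimal sketch of what that could look like; <code class="language-plaintext highlighter-rouge">set_all_seeds</code> (the seeding routine above, wrapped in a function) and <code class="language-plaintext highlighter-rouge">train_and_evaluate</code> are hypothetical stand-ins for your own pipeline:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def run_with_seeds(config, seeds=(0, 1, 2, 13, 42)):
    # set_all_seeds and train_and_evaluate are hypothetical helpers:
    # the former seeds every PRNG as shown above, the latter trains a
    # model with the given config and returns a validation metric
    scores = []
    for seed in seeds:
        set_all_seeds(seed)
        scores.append(train_and_evaluate(config))
    return np.mean(scores), np.std(scores)

# report "metric = mean +/- std" instead of a single, possibly lucky, number
</code></pre></div></div>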

<p>But to replicate experiments, one needs to know all their parameters, which brings us to the next part…</p>

<h3 id="experiment-tracking">Experiment tracking</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mlflow</span>
<span class="kn">from</span> <span class="nn">mlflow.models.signature</span> <span class="kn">import</span> <span class="n">infer_signature</span>

<span class="k">with</span> <span class="n">mlflow</span><span class="p">.</span><span class="n">start_run</span><span class="p">():</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_param</span><span class="p">(</span><span class="s">"batch_size"</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
    <span class="c1"># Metrics can be updated throughout the run
</span>    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"accuracy"</span><span class="p">,</span> <span class="mf">0.973</span><span class="p">)</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_metric</span><span class="p">(</span><span class="s">"accuracy"</span><span class="p">,</span> <span class="mf">0.981</span><span class="p">)</span>

    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"outputs/test.txt"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"hello world!"</span><span class="p">)</span>

    <span class="n">mlflow</span><span class="p">.</span><span class="n">log_artifacts</span><span class="p">(</span><span class="s">"outputs"</span><span class="p">)</span>

    <span class="n">model_signature</span> <span class="o">=</span> <span class="n">infer_signature</span><span class="p">(</span><span class="n">example_inputs</span><span class="p">,</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">example_inputs</span><span class="p">))</span>
    <span class="n">mlflow</span><span class="p">.</span><span class="n">sklearn</span><span class="p">.</span><span class="n">log_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">artifact_path</span><span class="o">=</span><span class="s">"./sklearn-model"</span><span class="p">,</span> 
                             <span class="n">registered_model_name</span><span class="o">=</span><span class="s">"sklearn-rf-reg-model"</span><span class="p">,</span>
                             <span class="n">signature</span><span class="o">=</span><span class="n">model_signature</span><span class="p">)</span>

</code></pre></div></div>

<p>Just try to track as much as possible. I do. And it helped me a great deal. If you are ok with managing your own infra, use <a href="https://mlflow.org">MLflow</a>. If you would rather pay for a good managed solution, <a href="https://neptune.ai">Neptune.ai</a> and <a href="https://wandb.ai">Weights and Biases</a> are very nice.</p>

<p><strong>Pro tip 1:</strong> For maximum benefit, group similar algorithms together. It will make it easier to compare those with stuff like <a href="https://ai.facebook.com/blog/hiplot-high-dimensional-interactive-plots-made-easy/">parallel coordinate plots</a>.</p>

<p><strong>Pro tip 2:</strong> Also, try to track and version all your data, either with DVC or something else. That’s also why you shouldn’t overwrite the raw data in your buckets: if you do overwrite it, you won’t be able to replicate the results of your experiments.</p>

<p>So you have a trained ML model. You can also fully replicate it. Now what?</p>

<h3 id="ml-serving">ML Serving</h3>

<p>You need to deploy and serve it somehow! How? Use Docker and an app server! If you care about SLAs, consider Ray Serve, BentoML, or Seldon. These are specialized solutions that provide impactful features like adaptive batching, model pooling, and so on. If your SLAs are strict, try Triton Inference Server from NVidia. If you want to dive deeper into the details, <a href="https://alexandruburlacu.github.io/posts/2022-09-25-neptuneai-ml-serving">read my blog post on the topic</a>.</p>

<h3 id="ml-tests">ML Tests</h3>

<p>What about tests? ML code is still code. So it needs tests. ML testing is a big and hairy problem. I promise I will eventually write an article about it, but for now, think about it like this:</p>

<p>You need to have two types of tests,</p>
<ul>
  <li>Behavioral tests, which will measure predictions. These can become your regression suite, where you add various edge cases on which you don’t want to fail ever again.</li>
  <li>Unit/Integration tests, which will measure training, serving, and preprocessing code correctness. Stuff like “The model should reduce its loss after one iteration” or “The shape of the output should be [x,y,z] given that the input shape was [x,m,n]” and so on. These will spot bugs in your implementation. See the sketch after this list.</li>
</ul>
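<p>Here is a minimal pytest-style sketch of both kinds; <code class="language-plaintext highlighter-rouge">build_model</code>, <code class="language-plaintext highlighter-rouge">model_under_test</code>, and <code class="language-plaintext highlighter-rouge">preprocess</code> are hypothetical stand-ins for your own project code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def test_loss_decreases_after_one_step():
    # unit test: one optimization step should reduce the loss
    model = build_model()  # hypothetical factory, e.g. a small classifier
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))  # assumes 16 features, 2 classes

    loss_before = loss_fn(model(x), y)
    loss_before.backward()
    optimizer.step()

    loss_after = loss_fn(model(x), y)
    assert loss_after.item() &lt; loss_before.item()

def test_known_edge_case():
    # behavioral/regression test: an input we never want to get wrong again
    prediction = model_under_test.predict(preprocess("fr33 p!lls, buy now"))
    assert prediction == "spam"
</code></pre></div></div>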

<p>Depending on your application domain, here are a few links to help you with ML testing.</p>
<ul>
  <li><a href="https://docs.deepchecks.com/stable/getting-started/welcome.html">Deepchecks library</a></li>
  <li><a href="https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models">ML Model Testing: 4 Teams Share How They Test Their Models | Neptune.ai Blog</a></li>
  <li><a href="https://applyingml.com/resources/testing-ml/">Machine Learning in Production - Testing | ApplyingML</a></li>
  <li><a href="https://madewithml.com/courses/mlops/testing/#models">Made With ML Testing Machine Learning Systems: Code, Data and Models</a></li>
  <li><a href="https://www.jeremyjordan.me/testing-ml/">Effective testing for machine learning systems | By Jeremy Jordan</a></li>
</ul>

<h3 id="cicd">CI/CD</h3>

<p>If you have done everything until this point, having CI/CD should be easy. Bonus points for triggering the retraining steps conditionally, only when the training/model code changes. The conditional build behavior can be implemented with either something like <code class="language-plaintext highlighter-rouge">dvc repro</code> + some caching between runs or clever <code class="language-plaintext highlighter-rouge">git diff</code> manipulations.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># not the most production ready hack, but maybe it will help you</span>
<span class="nn">...</span>
<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">check</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-20.04</span>
    <span class="na">outputs</span><span class="pi">:</span>
      <span class="na">DIFFS</span><span class="pi">:</span> <span class="s">${{ steps.diffs.outputs.DIFFS }}</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">fetch-depth</span><span class="pi">:</span> <span class="m">0</span> <span class="c1"># actually will need some adjustments</span>
          <span class="c1"># fetch only as many as necessary: https://github.com/actions/checkout/issues/438</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Last good run commit</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">curl -s \</span>
          <span class="s">-H "Accept: application/vnd.github+json" \</span>
          <span class="s">-H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \</span>
          <span class="s">https://api.github.com/repos/{{ USER }}/{{ REPO_NAME }}/actions/workflows/training-trigger.yml/runs?status=success | jq \</span>
          <span class="s">-r ".workflow_runs[0].head_commit.id" &gt; last_good_commit.txt</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Show and set DIFFS</span>
        <span class="na">id</span><span class="pi">:</span> <span class="s">diffs</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">DIFFS=$(git diff HEAD $(cat last_good_commit.txt) --name-only | tr '\n' ' ')</span>
          <span class="s">echo "::set-output name=DIFFS::$DIFFS"</span>
          <span class="s">echo $DIFFS</span>

  <span class="na">train</span><span class="pi">:</span>
    <span class="na">needs</span><span class="pi">:</span> <span class="s">check</span>
    <span class="na">if</span><span class="pi">:</span> <span class="s">contains(needs.check.outputs.DIFFS, 'train.py')</span>
    <span class="na">uses</span><span class="pi">:</span> <span class="pi">{{</span> <span class="nv">USER</span> <span class="pi">}}</span><span class="s">/{{ REPO_NAME }}/.github/workflows/training.yml@master</span>
</code></pre></div></div>

<p>One important thing to note is somewhat related to CI. ML projects tend to have naturally tight coupling between EDA, data processing, training, and serving code. As a result, I highly recommend designing ML projects as monorepos and adopting monorepo-related practices and patterns for building, versioning, and code compatibility.</p>

<h3 id="epilogue">Epilogue</h3>

<p>All the advice above is focused on simplicity. You must understand that the solutions I suggest have a very clear scope. These are solutions you should only consider at <strong>the beginning</strong> of your MLOps journey.</p>

<p>Let me make it simpler with a table.</p>

<table>
  <thead>
    <tr>
      <th>Q</th>
      <th>A</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Is it going to scale?</td>
      <td>Nope</td>
    </tr>
    <tr>
      <td>Is it production-ready?</td>
      <td>It’s PoC-ready</td>
    </tr>
    <tr>
      <td>How quickly can I set it up?</td>
      <td>A few days at most</td>
    </tr>
    <tr>
      <td>Is it better than doing nothing?</td>
      <td>Yes!!!</td>
    </tr>
    <tr>
      <td>Is it cost-effective?</td>
      <td>Hell yes</td>
    </tr>
    <tr>
      <td>Is it more cost-effective than using a paid or even an existing OSS solution?</td>
      <td>IMO much more so</td>
    </tr>
  </tbody>
</table>

<p>These recipes are <strong>Maximum ROI - Minimum Effort</strong> solutions to get you started. Eventually, you will discover that they don’t quite suit you. Only then should you switch to something else. You’ll make a better-informed decision then.</p>

<h2 id="ps">P.S.</h2>

<p>I was serious about presenting more often at conferences and meetups. And that’s how I ended up presenting at the Belgium MLOps meetup on 5th December 2022. If you’d like to learn about my MLOps adventures in setting up my research environment, you can find the event details via <a href="https://www.meetup.com/mlops-belgium/events/289639571/">this link</a>.</p>

<h2 id="pps">P.P.S.</h2>

<p>The story is based on the “Three Little Pigs” one, in its Romanian/Russian variant, where the piglets are named Nif-Nif, Naf-Naf, and Nuf-Nuf. Now, the local, Russian-speaking population has a joke about the 4th piglet, whose name I’ll let you guess. Special kudos to those who also get the meaning/connotation of the fourth piglet’s name.</p>

<!-- https://www.unusual.vc/post/how-to-build-ml-products
 -->]]></content><author><name></name></author><category term="posts" /><category term="mlops," /><category term="devops," /><category term="ml" /><category term="deployment," /><category term="machine" /><category term="learning," /><category term="ml" /><category term="serving" /><summary type="html"><![CDATA[A fable about a company's journey through scaling their ML function, and some practical advice on how you should do it]]></summary></entry><entry><title type="html">How to Solve the Model Serving Component of the MLOps Stack</title><link href="https://alexandruburlacu.github.io/posts/2022-09-25-neptuneai-ml-serving" rel="alternate" type="text/html" title="How to Solve the Model Serving Component of the MLOps Stack" /><published>2022-09-24T22:00:00+00:00</published><updated>2022-09-24T22:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/neptuneai-ml-serving</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-09-25-neptuneai-ml-serving"><![CDATA[<blockquote>
  <p>This blog post was written by me and originally posted on <a href="https://neptune.ai/blog/model-serving-component-mlops-stack">Neptune.ai Blog</a>. Be sure to check them out. I like their blog posts about MLOps a lot.</p>
</blockquote>

<p>Model serving and deployment is one of the pillars of the MLOps stack. In this article, I’ll dive into it and talk about what basic, intermediate, and advanced setups for model serving look like.</p>

<p>Let’s start by covering some basics.</p>

<h2 id="what-is-model-serving">What is Model Serving?</h2>
<p>Training a machine learning model may seem like a great accomplishment, but in practice, it’s not even halfway to delivering business value. For a machine learning initiative to succeed, we need to deploy that model and ensure it meets our performance and reliability requirements. You may say, “But I can just pack it into a Docker image and be done with it”. In some scenarios, that could indeed be enough. But most of the time, it won’t be. When people talk about productionizing ML models, they use the term <strong>serving</strong> rather than simply deployment. So what does this mean?</p>

<p>To serve a model is to expose it to the real world and ensure it meets all your production requirements, aka your latency, accuracy, fault-tolerance, and throughput are all at the “business is happy” level. Just packaging a model into a Docker image is not “the solution” because you’re still left with how to run the model, scale the model, deploy new model updates, and so on. Don’t get me wrong, there’s a time and place for Flask-server-in-Docker-image style of serving; it’s just a limited tool for a limited number of use-cases, which I’ll outline later.</p>

<p>Now that we know what serving implies, let’s dive in.</p>

<h2 id="model-deployment-scenarios">Model Deployment scenarios</h2>

<p>When deciding how to serve our ML models, we must ask ourselves a few questions. Answering these should help us shape our model serving architecture.</p>

<h3 id="is-our-model-user-facing">Is our model user-facing?</h3>

<p>In other words, does the user trigger it through some action and need to see an effect dependent on our model outputs in real-time? If this sounds too abstract, how about an example? Are we creating an email autocomplete solution like the one in Gmail? Our user writes some text and expects a relevant completion. This kind of scenario needs an “interactive” deployment. This is probably the most common way to serve ML models. But it’s not the only way.</p>

<p>Suppose we don’t need the model’s predictions right away. We’re fine waiting even an hour or more to get what we need. How frequently do we need to get these predictions? Do we need something like a weekly Excel report or tagging some inventory item descriptions once per day? If this sounds about right, we can run a “batch” process as a way to serve our model. This setup would probably be the easiest to maintain and scale. But there’s another, 3rd way.</p>

<h3 id="does-the-latency-matter">Does the latency matter?</h3>

<p>You don’t need to “respond” to the user but still must act based on the user’s action. Something like a fraud detection model that gets triggered on a user’s transaction. This scenario asks for a “streaming” setup. A scenario like this is usually deemed the most complex to handle. Although it might sound like the interactive setup would be harder to build, streaming is generally harder to reason about and thus harder to implement properly.</p>

<p>Let’s dive into the details of each of these setups, the best time to use them, and the trade-offs.</p>

<h2 id="model-deployment-setups">Model Deployment setups</h2>

<p>We should consider a few general “setups” based on our business needs when it comes to exposing ML models to the outside world for consumption.</p>

<h3 id="batch-model-serving">Batch model serving</h3>

<p>This one is the easiest to implement and operate of all the possible setups. Batch processes are not interactive, i.e., they do not wait for some interaction with another user or process. They just run, start to finish. Because of this, there are mostly no latency requirements; all the setup needs is to scale to large dataset sizes.</p>

<p>Because of this latency insensitivity, you can use complex models – Kaggle-like ensembles, huge gradient boosted trees or neural networks, anything goes, because it is expected that these operations won’t be done in milliseconds anyway. To handle even multi-hundred GB datasets, all you need is something like CRON, a workstation/a relatively capable cloud VM, and to know how to develop out-of-core data processing scripts. Don’t believe me? Here’s <a href="https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94">an example</a> to prove my point.</p>
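<p>For illustration, here’s a minimal out-of-core batch scoring sketch with pandas; the file names, chunk size, <code class="language-plaintext highlighter-rouge">FEATURE_COLUMNS</code>, and <code class="language-plaintext highlighter-rouge">model</code> are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

FEATURE_COLUMNS = ["f1", "f2", "f3"]  # hypothetical feature names

def score_in_chunks(input_csv, output_csv, model, chunksize=100_000):
    # process the dataset chunk by chunk, so it never has to fit in memory
    first = True
    for chunk in pd.read_csv(input_csv, chunksize=chunksize):
        chunk["prediction"] = model.predict(chunk[FEATURE_COLUMNS])
        chunk.to_csv(output_csv, mode="w" if first else "a",
                     header=first, index=False)
        first = False
</code></pre></div></div>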

<p>It becomes a bit more challenging if you need to handle TBs of data. You will need to deal with multi-node Apache Spark, Apache Airflow, or something like it. You’ll have to think about potential node failure and how to maximize the resource utilization of said nodes.</p>

<p>Finally, if you’re operating at Google-size datasets, <a href="https://sre.google/sre-book/data-processing-pipelines/">check this link</a>. Operating at such a scale brings issues like “noisy neighbors”, straggling tasks/jobs, “thundering herds”, and timezones. Yeah, and congratulations on your gargantuan scale.</p>

<h3 id="streaming-model-serving">Streaming model serving</h3>

<p>As we already mentioned, batch processes are not the only ones that don’t need to wait on user interaction, i.e., they are not interactive. We can also have our models act on streams of data. These scenarios are much more latency-sensitive than batch processes.</p>

<p>Standard tools for streaming model serving are Apache Kafka, Apache Flink, and Akka. But if you need to operate your model as a streaming/event-driven infrastructure component, these are not your only options. You can create a component that will be a consumer of events on one side and a producer on the other. Whatever you do, be mindful of backpressure. Streaming setups care a lot about being able to process large volumes of continuously flowing data, so be sure not to make your deployed ML models the bottleneck of this setup.</p>
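<p>A minimal sketch of that consume-predict-produce pattern, assuming the kafka-python client; the topic names and broker address are made up, and <code class="language-plaintext highlighter-rouge">model</code> is assumed to be loaded elsewhere:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("transactions", bootstrap_servers="localhost:9092",
                         value_deserializer=json.loads)
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for message in consumer:
    # `model` is assumed to be in memory; score one event at a time
    score = float(model.predict([message.value["features"]])[0])
    producer.send("fraud-scores", {"id": message.value["id"], "score": score})
</code></pre></div></div>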

<p>Another thing to consider when developing streaming ML serving solutions is model serialization. Most streaming event processing systems are JVM-based, either Java or Scala native. As a result, you will likely discover that your model structure is limited by the capabilities of your serializer. For a story about how model serialization can become an issue, <a href="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl">check out this article’s sub-section</a> – the resulting models can be tedious to deploy.</p>

<p>Here are some useful links on the topic –</p>
<ul>
  <li><a href="https://towardsdatascience.com/deploying-ml-models-in-distributed-real-time-data-streaming-applications-217954a0b423">Deploying ML Models in Distributed Real-time Data Streaming Applications | TDS</a></li>
  <li><a href="https://www.lightbend.com/blog/akka-speculative-model-serving">Using Akka for leveraging speculative execution in model serving</a></li>
  <li><a href="https://aws.amazon.com/blogs/machine-learning/automated-model-refresh-with-streaming-data/">Automated model refresh with streaming data</a></li>
</ul>

<h3 id="interactive-model-serving-via-restgrpc">Interactive model serving (via REST/gRPC)</h3>

<p>The most popular way to serve ML models – using a server! In fact, when discussing ML serving, a lot of people refer to this specific setup rather than the broader category. An interactive setup means the user somehow triggers a model and is waiting for the output or something caused by the output. Basically, it’s a request-response interaction pattern.</p>

<p>There are many ways to serve ML models in this setup. From a Flask or FastAPI server with an in-memory loaded ML model to specialized solutions like TF Serving or NVIDIA Triton, and anything in between. In this article, we will mainly focus on this setup.</p>
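<p>For reference, a minimal FastAPI sketch of the in-memory-model flavor; the model file name and the input schema are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path; loaded once at startup

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
</code></pre></div></div>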

<p>I’ve seen people developing batch solutions where the ML component is actually a server being called by said batch program. Or components in a streaming event processing system calling HTTP servers that serve ML models. Because the interactive pattern is flexible, reasonably simple to reason about, and well-documented, many are “abusing” it.</p>

<h3 id="note-on-cloud-edge-and-client-side-serving">Note on Cloud, Edge and Client-side serving</h3>

<p>What if we are developing a mobile app and want our ML-enabled features to work without the internet? What if we want to provide our users with magical responsiveness? To make waiting for a response on a web page a thing of the past. Enter client-side serving and serving ML on edge.</p>

<h4 id="things-to-consider">Things to consider</h4>

<p>When designing ML systems, we need to be aware of this possibility and the challenges of such a deployment scenario.</p>
<ul>
  <li>Deployment on browser clients is straightforward using <a href="https://github.com/tensorflow/tfjs">TF.js</a>. <a href="https://github.com/microsoft/onnxruntime/tree/master/js/web">ONNX</a> can also be an option, albeit a bit more complicated.</li>
  <li>As for mobile, we have multiple variants, including CoreML from Apple, TFLite from Google, and ONNX.</li>
  <li>For edge devices, depending on their compute performance, we can either run ML models just like we’d do in the cloud or create custom TinyML solutions.</li>
</ul>

<p>Notice that, in theory, browsers and smartphones are edge devices. In practice, they are treated differently because of the wildly different programming models. More often than not, edge servers are classic computers, either running on ARM or x86 hardware, with traditional OSs, just much closer to the user, network-wise. Mobile devices need to be programmed differently because of the big difference between mobile and more common OSs. More recently, mobile devices have specialized DSPs or co-processors optimized for AI inference.</p>

<p>Browsers are even more different because browser code is usually architected around the idea of a sandboxed environment and the event loop. More recently, we have web workers, which make the creation of multi-process applications easier. Also, when serving an ML model in a browser, we can’t make any assumptions about the hardware on which the model will run, resulting in a potentially horrible user experience. It can very much be that a user opened our web app with the ML model on a low-end mobile device. Just imagine how laggy that site will be.</p>

<h3 id="trade-offs">Trade-offs</h3>

<p>There could be multiple reasons to move ML serving closer to the edge. The usual motives are latency sensitivity, bandwidth control, privacy concerns, and the capability to work offline. Keep in mind that we can have various hierarchical deployment targets, spanning from the user’s client device, to an IoT hub or router closest to the user, to a city or region-wide data center.</p>

<p>Deploying on edge devices or client devices usually trades off model size and performance for reduced network latency or the possibility of dramatically reducing the bandwidth. For example, deploying a model for automatic face recognition and classification on a mobile phone maybe isn’t such a good idea, but a tiny and simple one that can detect whether there’s a face in the scene or not is. The same goes for an automatic email response generator vs. an autocomplete keyboard model. The former usually isn’t needed on-device, while the latter must be deployed on-device.</p>

<p>In practice, it is possible to mix edge/on-device models with a cloud-deployed model for maximum predictive performance when online, but with the possibility to retain some AI-enabled features offline. This can mostly be done by writing custom code, but it is also possible to use something like <a href="https://github.com/kubeedge/sedna">Sedna</a> for <a href="https://kubeedge.io/en/">KubeEdge</a> if your edge devices are capable of running KubeEdge.</p>

<h3 id="a-real-world-use-case">A real-world use-case</h3>

<p>A common but less discussed scenario for deploying on edge – A retailer wants to use video analytics in their grocery stores. They developed a suite of powerful computer vision models to analyze the video feed from their in-store cameras and were met with a hard constraint. The internet provider couldn’t ensure the upload latency, and bandwidth from their locations couldn’t support multiple streaming video feeds. The solution? They bought a gaming PC per store, put it in the staff room, and did their video analysis locally without needing to stream videos from the stores. Yes, this is an edge ML scenario. Edge computing is not only about IoT.</p>

<h2 id="serving-ml-models-the-right-way">Serving ML models the right way</h2>

<p>Model serving has a tight relationship with metadata stores, ML model registries, monitoring components, and feature stores. That is quite a lot. Plus, depending on concrete organizational requirements, model serving might have to be integrated with CI/CD tooling. It might be necessary to either ensure a staging environment to test newly trained models or even continuously deploy to production environments, most likely as a shadow or canary deployment.</p>

<center><img src="/_data/webp/MLOps_process.webp" alt="End-to-end MLOps architecture and workflow with functional components and roles" /></center>
<center><i>End-to-end MLOps architecture and workflow with functional components and roles | Source: <a href="https://arxiv.org/abs/2205.02302">https://arxiv.org/abs/2205.02302</a></i></center>

<h3 id="what-makes-a-deployment-good">What makes a deployment good?</h3>
<p>Keep in mind that a good model serving solution isn’t only about cost-efficiency and latencies but also about how well it is integrated with the rest of the stack. If we have a high-performance server that is a nightmare to integrate with our observability, feature stores, and model registries, we have a terrible model serving component.</p>

<p>A common way to implement the whole model deployment/serving workflow is to have the model serving component fetch concrete models based on the information from the ML model registry and/or metadata store.</p>

<p>For example, using a tool like <a href="https://neptune.ai/">Neptune.ai</a>, we can track multiple experiments. At some point, if we decide we have a good candidate model, we tag it as a model ready for staging/canary. Remember, we’re still interacting with Neptune.ai, no need to use any other tool. Our ML serving component periodically checks in with the ML model registry, and if there’s a new model with the compatible tag, it will update the deployment like <a href="https://docs.neptune.ai/how-to-guides/model-registry/querying-and-downloading-models-and-metadata/accessing-production-ready-models">this</a>. This method allows for more accessible model updates without triggering image builds or other expensive and complex workflows. 
An alternative approach is to redeploy a pre-built serving component and only change its configuration to fetch a newer model, <a href="https://www.cloudskillsboost.google/focuses/17649?parent=catalog">something like this</a>. This approach is more common in cloud-native (Kubernetes) serving solutions.</p>

<p>Of course, as mentioned earlier, frequently, the model serving component has to interact with feature stores. To interact with feature stores, we need to be able to serve not just serialized ML models but also have support for custom IO-enabled components. In some cases, this can be a nightmare. A workaround is integrating the feature stores at the application-server level and not at the ML serving component level.</p>

<p>Finally, we also need to log and monitor our deployed ML models. Many custom solutions integrate with tools like the ELK stack for logs, OpenTelemetry for traces, and Prometheus for metrics. ML does bring some specific challenges, though.</p>

<blockquote>
  <p>For a dive into what a good observability setup consists of, be sure to check out <a href="https://alexandruburlacu.github.io/posts/2021-05-20-logs-traces-how-to">another blog post of mine</a>.</p>
</blockquote>

<p>First, we need to be able to collect new data for our datasets. This is mostly done either through custom infrastructure or ELK. 
Then, we need to be able to track ML-specific signals, like distribution shifts for input values and outputs. This is a highly un-optimized scenario for tools like Prometheus. To better understand these challenges, <a href="https://www.shreya-shankar.com/rethinking-ml-monitoring-3/">check out this blog post</a>. A few tools try to help with this, most prominently <a href="https://whylabs.ai/">WhyLabs</a> and <a href="https://arize.com/">Arize</a>.</p>

<h2 id="what-do-we-really-care-about">What do we really care about?</h2>

<p>Other than the usual suspects - tail latencies, number of requests per second, and application error rate, it is advisable to also track model performance. And here’s the tricky part. It’s rarely possible to obtain ground-truth labels in real-time or with a short delay. If the delay is significant, it will take longer to identify issues impacting our users’ experience.</p>

<p>Because of this, tracking the inputs and outputs distribution and triggering some action if these diverge significantly from what the model is expecting is pretty common. While this is useful, it doesn’t quite help track our predictive performance SLO (service-level objective).</p>
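<p>As an illustration, here’s a minimal drift check using a two-sample Kolmogorov-Smirnov test from SciPy; the threshold is arbitrary, and a real setup would tune it per feature and correct for multiple testing:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference_values, production_values, alpha=0.01):
    # a small p-value means the two samples likely come from different distributions
    _statistic, p_value = ks_2samp(reference_values, production_values)
    return p_value &lt; alpha

rng = np.random.default_rng(0)
print(feature_drifted(rng.normal(0, 1, 5_000), rng.normal(0.2, 1, 5_000)))  # True
</code></pre></div></div>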

<h3 id="the-problem-of-tracking-performance">The problem of tracking performance</h3>

<p>Let me explain. On one hand, we can reasonably assume that divergences in our input and output distributions can result in degraded performance, but on the other hand, we don’t actually know the exact relation between the two.</p>

<p>We can have scenarios where a distribution for a feature drifts a lot from the expected distribution but has no significant impact on our ML model performance. We will have a false alarm in this case. But these relations change over time. So next time, when the same feature drifts again, it can result in a significant loss of predictive power of our ML models. As you can imagine, this is a nightmare to manage. So what can be done?</p>

<h3 id="the-solution--detection-and-mitigation">The solution – detection and mitigation</h3>

<p>We deploy and update ML models to better our business. Ideally, we must “link” our model SLOs with business metrics. For example, if we notice that the ratio of users clicking on our recommendations drops, we know we are not doing well. For a text auto-correction solution, a similar business-derived model SLO could be the ratio of accepted suggestions. If it falls below some threshold, maybe our model is no better than the previous one. Regretfully, it isn’t always this easy to do.</p>

<p>Because this problem can be so hairy, we usually extract ML model performance monitoring into a separate component and only track the system-level metrics, traces, and logs at the ML serving component level. We hope that as the infrastructure for ML model monitoring becomes better, ML serving components will provide significantly better integrations with these tools to make the troubleshooting of deployed models significantly easier.</p>

<h2 id="evolving-model-serving">Evolving model serving</h2>
<p>Because the interactive serving setup is the most popular way to productionize ML models, we will discuss what a basic, intermediate and advanced setup looks like. What differentiates a good setup from a mediocre one is cost-effectiveness, scalability, and latency profile. Of course, the integration with the rest of the MLOps stack is also important. In general, deciding on what architecture and tools to use is always a tricky affair, with numerous trade-offs. If you’re interested in advancing your decision-making when it comes to making technical decisions, be sure to check out <a href="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool">this article</a> on what questions should you ask and some of the trade-offs you should expect. Don’t mind that it’s about programming languages, most questions apply to tools and frameworks too.</p>

<h3 id="basic-setup">Basic setup</h3>

<p>Recall, at the beginning of the article, I mentioned that there’s a time and place for an ML-model-in-Flask-server-in-a-Docker-container style of serving. A lot was said about this kind of serving, so I won’t dive into much detail. Note that the ML model can be either baked into the container or attached as a volume. If you are only creating a demo API or know for a fact that you won’t have much traffic (maybe it’s an internal application, which only 3-5 people will use), this can be an acceptable solution.</p>

<p>Or, if you can provision multiple very capable cloud VMs with powerful GPUs and CPUs and don’t mind having poor resource utilization and sub-optimal tail latencies, then it can also work. I mean, <a href="https://www.zdnet.com/article/why-facebook-doesnt-have-or-need-testers/">Facebook is doing very few tests for their software</a> and still manages to be a huge tech corporation, so it may not always make sense to follow all software engineering best practices.</p>

<p><strong>Pros</strong></p>
<ul>
  <li>This setup has the advantage of being very easy to implement and relatively scalable (need to handle more requests =&gt; run multiple replicas).</li>
</ul>

<p><strong>Cons</strong></p>
<ul>
  <li>The biggest issue is poor resource utilization because models are triggered on each request for a single input entry, and the web servers don’t need the same hardware as ML models.</li>
  <li>Then, there’s a huge lack of control over tail latencies, meaning you can’t enforce almost any SLO with this setup. The only hope to somewhat control your tail latencies is a good load balancer and enough powerful machines to run multiple replicas of your ML serving component.</li>
</ul>

<center><img src="/_data/webp/MLServing.drawio.webp" alt="Simple ML serving with a replicated container. The ML model can be either backed in or attached as a volume" /></center>
<center><i>Simple ML serving with a replicated container. The ML model can be either backed in or attached as a volume. | Source: author.</i></center>

<p>To improve this setup, we must move onto a medium-level configuration.</p>

<h3 id="intermediate-setup">Intermediate setup</h3>

<p>As mentioned above, we need to split the ML inference from the application server component to optimize the resource utilization and have better control over our latencies. One way to do it is using a publisher-subscriber asynchronous communication pattern, implemented with <a href="https://hanxiao.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/">ZeroMQ</a> or even <a href="https://pyimagesearch.com/2018/01/29/scalable-keras-deep-learning-rest-api/">Redis</a>, for example.</p>
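<p>To make the pattern concrete, here is a minimal sketch of the Redis-backed variant; the queue name, payload schema, and <code class="language-plaintext highlighter-rouge">model</code> are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

import redis

r = redis.Redis(host="localhost", port=6379)

# application server side: enqueue a request and return immediately
def enqueue_request(request_id, features):
    r.rpush("inference:queue", json.dumps({"id": request_id, "x": features}))

# ML worker side: drain up to a batch of requests, score them together
def worker_loop(model, batch_size=32):
    while True:
        batch = []
        while len(batch) &lt; batch_size:
            raw = r.lpop("inference:queue")
            if raw is None:
                break
            batch.append(json.loads(raw))
        if not batch:
            continue
        predictions = model.predict([item["x"] for item in batch])
        for item, pred in zip(batch, predictions):
            r.set(f"inference:result:{item['id']}", json.dumps(float(pred)))
</code></pre></div></div>

<p>The naive drain loop above is a crude cousin of adaptive batching: a production version would also wait up to a small timeout to fill the batch instead of busy-polling.</p>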

<p>So, after this “schism”, we can do a lot of cool tricks to perfect our serving component into an advanced one.</p>

<ul>
  <li>
    <p>First, we can enforce much more granular and fine-tuned timeouts and retries. With such a setup, it is possible to scale the ML servers independently from the application servers.</p>
  </li>
  <li>
    <p>Then, the most fantastic hack for this is to do <a href="https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf">adaptive batching</a>. In fact, it’s such a great technique that it would make a solution almost advanced-level, performance-wise.</p>
  </li>
</ul>

<p>A good model serving solution isn’t just about how good the server performance is but also about how easy it is to integrate the rest of the ML sub-systems. A machine learning serving component would need to provide at least some model management capabilities to easily update model versions without needing to rebuild the whole thing. For this kind of setup, the ML/MLOps team can design their ML workers to periodically check in with the model registry and, if there are any updates, fetch new models, something like <a href="https://docs.neptune.ai/how-to-guides/model-registry/querying-and-downloading-models-and-metadata/accessing-production-ready-models">this</a> or <a href="https://mlflow.org/docs/latest/model-registry.html#fetching-an-mlflow-model-from-the-model-registry">this</a>.</p>
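<p>A minimal sketch of that polling loop, using the MLflow registry and the model name from the tracking snippet earlier; the polling period is arbitrary:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

import mlflow.pyfunc
from mlflow.tracking import MlflowClient

client = MlflowClient()
current_version, model = None, None

while True:
    # check which version is currently marked as "Production" in the registry
    latest = client.get_latest_versions("sklearn-rf-reg-model",
                                        stages=["Production"])[0]
    if latest.version != current_version:
        model = mlflow.pyfunc.load_model(
            f"models:/sklearn-rf-reg-model/{latest.version}")
        current_version = latest.version
    time.sleep(300)  # check in with the registry every 5 minutes
</code></pre></div></div>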

<center><img src="/_data/webp/MLServingMedium.drawio.webp" alt="A medium ML serving blueprint, with both replicated application servers and ML servers. The solution also uses a feature store and a model registry" /></center>
<center><i>A medium ML serving blueprint, with both replicated application servers and ML servers. The solution also uses a feature store and a model registry. | Source: author.</i></center>

<p>I am sure you noticed that the moderate setup is considerably more complex than the basic one. This complexity brings major downsides to this approach. At this stage, one needs some form of container orchestration, usually K8s, and at least some system observability, for example, with Prometheus and ELK.</p>

<h3 id="advanced-setup">Advanced setup</h3>

<p>To be fair, a medium-level setup is enough for most ML serving scenarios. You shouldn’t consider the advanced ML serving setup as a necessary evolution of the last setup. The advanced setup is more like “heavy artillery”, which is required only in exceptional cases.</p>

<p>With all the bells and whistles proposed in the solution above, a question arises – “Why did we bother so much with all these tricks if there are pre-made solutions?”. And indeed, why? The answer would usually be – they needed something custom for their setup.</p>

<p>Specialized solutions like NVIDIA Triton, Tensorflow Serving, or TorchServe have solid selling points and pretty weak ones too.</p>

<p><strong>Pros</strong></p>

<ul>
  <li>First, these serving solutions are very well optimized and usually perform better than a “medium + bells and whistles” solution.</li>
  <li>Second, these solutions are straightforward to deploy; most provide a docker container or a Helm chart.</li>
  <li>Finally, these solutions usually contain relatively basic support for model management and A/B testing.</li>
</ul>

<p><strong>Cons</strong></p>

<ul>
  <li>Now the downsides. The biggest one is the awkward integration with the rest of the MLOps ecosystem.</li>
  <li>Second, related to the first, these solutions are hard to extend. The most convenient way to solve both these is to create custom application servers that act as proxies/decorators/adapters for the high-performing pre-built ML servers.</li>
  <li>Thirdly, and this is probably a thing that I personally don’t like, is that these solutions are very constraining in terms of what models can be deployed. I want to keep my options open, and having a serving solution that accepts only TF SavedModels or ONNX-serialized models isn’t aligned with my values. And yes, even ONNX can be limiting, for example, <a href="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl">when you have a custom model</a> (see the subsection – the resulting models can be tedious to deploy) which uses operations yet unsupported by ONNX.</li>
</ul>

<p>As you might have already guessed, I don’t use these solutions for the most part. I prefer PyTorch, so TF Serving is a no-go for me. Note, it’s just my context. If you use TF, consider using TF Serving. I tried it a few years ago for a TF project. It’s pretty good for serving, but a bit cumbersome for model management, if you ask me.</p>

<p>I said I use PyTorch primarily, so maybe TorchServe? To be frank, I haven’t even tried it. Seems good, but I’m afraid it has the same model management issues as TF Serving. What about Triton? I can speak of its older incarnation, TensorRT Inference Server. It was a nightmare to configure and then discover that because of a custom model head, it couldn’t be served properly. Plus model quantization issues, plus the same woes of model version management as the previous two candidates… To be fair, I’ve heard it got better, but I still am very skeptical of it. So, unless I know my model architecture is unchanged and I need maximum possible performance, I will not use it.</p>

<center><img src="/_data/adaptive-batching.svg" alt="Adaptive batching as a way to more efficiently use ML models" /></center>
<center><i>Adaptive batching as a way to more efficiently use ML models. Source: <a href="https://mlserver.readthedocs.io/en/latest/user-guide/adaptive-batching.html">Seldon MLServer docs</a></i></center>

<p>To summarize, specialized solutions like NVIDIA Triton or Tensorflow Serving are powerful tools, but if you opt to use them, you better have serious performance needs. Otherwise, I would advise against it. But that’s not all –</p>

<ul>
  <li>
    <p>Even if these solutions are feature-rich and performant, they still need extensive supporting infrastructure. Such servers are best suited as ML workers, so you still need application servers. To have a truly advanced ML serving component, you need to consider tight integration with your other systems and ML and data observability, custom-built or using services like <a href="https://arize.com/">Arize</a> and <a href="https://www.montecarlodata.com/">Montecarlo</a>.</p>
  </li>
  <li>
    <p>Also, you need to be able to perform advanced traffic management. The systems mentioned above provide some limited support for A/B testing. Still, in practice, you would have to implement it differently, either at the application server level, for more fine-grained control, or at the infrastructure level, with tools like <a href="https://istio.io/">Istio</a>. You usually need to be able to support gradual rollouts of new models, canary deployments, and traffic shadowing. No existing pre-built serving system provides all these traffic patterns. If you want to support these, be ready to get your hands, and whiteboards, dirty.</p>
  </li>
</ul>

<h2 id="note-on-cloud-offerings">Note on cloud offerings</h2>

<p><strong>TL;DR:</strong> Cloud offerings give you “full-lifecycle” solutions, meaning that the model serving is integrated with solutions for dataset management, training, hyperparameter tuning, monitoring, and model registries.</p>

<p>Cloud offerings try to give you the simplicity of the basic setup, with the feature-richness of the advanced setup and the performance of the moderate one. For most of us, this is a fantastic deal.</p>

<p>Common differentiators for cloud offerings are serverless and autoscaled inference, with GPUs and/or special chips support.</p>

<ul>
  <li>
    <p>Take Vertex AI from Google, for example. They provide you with a full MLOps experience and relatively easy model deployment, which can be served either as a cloud function or an autoscaled container, or even as a batch job. And because it’s Google, they have TPUs, which come in handy for really large-scale deployments.</p>
  </li>
  <li>
    <p>Or, with an even more complete solution, take AWS. Their SageMaker, just like Vertex AI, helps you along the whole MLOps lifecycle. Still, it also adds a simple and cost-efficient way to run models for inference with Elastic Inference accelerators, which seem to be fractional GPUs, possibly via NVIDIA’s Ampere-generation MIGs, or using a custom chip called Inferentia. Even better, SageMaker allows for post-training model optimizations for target hardware.</p>
  </li>
</ul>

<p>Yet neither offers adaptive batching, some form of speculative execution/request hedging, or other advanced techniques. Depending on your SLOs, you might still need to use systems like NVIDIA Triton or develop in-house solutions.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Running ML in production can be a daunting task. To truly master this, one has to optimize for many objectives – cost-efficiency, latency, throughput, and maintainability, to name a few. If there’s something you should get from this article, then let it be these three ideas –</p>
<ol>
  <li>Have a clear objective and priorities when serving your ML model</li>
  <li>Let the business requirements and constraints drive your ML serving component architecture, not the other way around.</li>
  <li>Think of the model serving as a component in the broader MLOps stack.</li>
</ol>

<p>Armed with these ideas, you should be able to filter subpar ML serving solutions from the good ones, thus maximizing the impact for your organization. But don’t make the mistake of trying to get everything right from the beginning. Start serving early, iterate on your solution, and let the knowledge from this article help you make your first few iterations somewhat better. Better to deploy something mediocre than not to deploy anything.</p>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://sre.google/sre-book/table-of-contents/">Google SRE Book</a></li>
  <li><a href="https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf">Clipper paper</a></li>
  <li><a href="http://learningsys.org/nips17/assets/papers/paper_1.pdf">TF Serving paper</a></li>
  <li><a href="https://arxiv.org/pdf/2205.02302.pdf">Some info about Serving within MLOps</a></li>
  <li><a href="https://www.tekhnoal.com/10-ways-to-deploy-an-ml-model.html">10 Ways to deploy an ML model</a></li>
  <li><a href="https://neptune.ai/blog/mlops-at-reasonable-scale">MLOps at reasonable scale</a></li>
  <li><a href="https://towardsdatascience.com/ml-latency-no-more-9176c434067b">ML Latency No More</a></li>
  <li><a href="https://youtu.be/YMtLI1Ub85s">How Cookpad Leverages Triton Inference Server To Boost Their Model Serving</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><category term="ml," /><category term="machine-learning," /><category term="deep-learning," /><category term="serving," /><category term="deployment," /><category term="inference" /><summary type="html"><![CDATA[It's important to be able to deploy a machine learning model when trained. But how do we approach serving ML models correctly?]]></summary></entry><entry><title type="html">Interviewing for a Senior ML Engineer position</title><link href="https://alexandruburlacu.github.io/posts/2022-07-23-senior-ml-interview" rel="alternate" type="text/html" title="Interviewing for a Senior ML Engineer position" /><published>2022-07-23T01:00:00+00:00</published><updated>2022-07-23T01:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/senior-ml-interview</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-07-23-senior-ml-interview"><![CDATA[<p>Interviewing is always a tiring and sometimes awkward process. Thankfully there are lots of resources online to help you prepare. But what if you need more specific advice for a more niche position?</p>

<p>This post is based on my personal experience going through the interviewing process at 5 not-FAANG companies. I also had some experience interviewing for not-senior ML Engineering roles at another 3 companies last year. So, I will also do a comparative analysis.</p>

<h2 id="before-we-begin">Before we begin…</h2>

<p>Let me start with a short prologue to explain why I’m writing this piece. In January 2022, I decided, again, it was time to search for another job outside of my home country. But this time, I decided to be sneaky/smart about it, so I changed my LinkedIn address to show that I’m in London. I also groomed my LinkedIn page a bit more to show some highlights of my recent experience. And then magic happened. For weeks, I had recruiters invite me to interviews. I didn’t even have to apply to anything myself, only to accept or reject opportunities arriving from recruiters. What surprised me was that the majority of options were senior or even lead roles. So, I felt like an imposter, but I still accepted a few of these and started the process. And then I searched for tips on how to nail senior ML engineering interviews… and found almost nothing. Sh*t. And that’s how I <del>met your mother</del> decided to write this blog post.</p>

<p>I brushed up my interviewing skills through mock interviews. I was also searching for technical questions for Senior ML roles. Surprisingly, I couldn’t find anything. All the info was only for MLE roles. It seemed a bit strange. In retrospect, it all makes sense now.</p>

<p>I know you are eager to find out why, so I’ll just give the TL;DR right away - <strong>ML and Senior ML have more or less the same complexity/hardness for technical questions</strong>. Surprise!</p>

<p>I bet you didn’t expect that. I know I didn’t. But then, what <strong>is different</strong>? And how does the interviewing process work for Senior ML Engineers?</p>

<h2 id="senior-vs-non-senior-ml-interviews">Senior vs non-senior ML interviews</h2>

<p>Based on my experience, I haven’t noticed much difference between senior ML and ML engineering interviews at the technical level.</p>

<p>What I did notice is the focus on soft skills for senior positions, and I don’t necessarily mean communication skills. Instead, how a candidate handled failures, team-level conflicts, cross-team communication, how they solved their most challenging problems, or how they handled a poor decision.</p>

<p>I recall the first technical interview for a Senior ML role I had. I was anxious about what kind of questions I would receive. It wasn’t so bad, I’ve had tougher questions than that, but the focus was undoubtedly higher on how I handled some scenarios or how I would handle them now.</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>ML engineer interview</th>
      <th>Senior ML engineer interview</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Coding</td>
      <td>Your usual leetcode-medium questions</td>
      <td>Same, haven’t seen dynamic programming at this stage</td>
    </tr>
    <tr>
      <td>Take-home assignment</td>
      <td>Either do EDA or deploy an ML model, focus on code quality, ease of use and tests</td>
      <td>Same, take-home assignments are not harder for senior positions</td>
    </tr>
    <tr>
      <td>ML Trivia</td>
      <td>How do the algorithms work? What would be the best solution for a given type of problem?</td>
      <td>On average, the same as for ML engineer</td>
    </tr>
    <tr>
      <td>System Design</td>
      <td>How to implement a system for a given scenario? Data collection issues?</td>
      <td>On average, same as for ML engineers, just be more conscious of budget constraints</td>
    </tr>
    <tr>
      <td><strong>Behavioral</strong></td>
      <td><strong>Focus on collaboration, individual growth, and adaptability</strong></td>
      <td><strong>Focus on failures, conflict management, and cross-team collaboration</strong></td>
    </tr>
  </tbody>
</table>

<p>One position for which I did notice some big differences when it comes to the technical questions is <strong>Research Engineer</strong>. I’m talking questions like <a href="https://www.image-engineering.de/library/technotes/745-how-does-the-jpeg-compression-work">how JPEG compresses</a> images, how to compute the <a href="https://baioc.github.io/blog/fibonacci/#fft-the-fast-fibonacci-transform">nth Fibonacci number in O(log n) time</a>, or <a href="https://drscotthawley.github.io/blog/2019/12/21/PCA-From-Scratch.html">how to compute PCA from scratch</a>. Now, for a research engineering position, these kinds of questions do make sense because of the innovative and research-oriented nature of the projects they have to work on. These frequently can involve a lot of <em>convert-math-to-code</em> or <em>let’s-break-it-down-and-then-improve</em> type of tasks.</p>

<p>Anyway, to give you a more detailed view, let’s see what is the general interviewing process when it comes to these kinds of roles.</p>

<!-- 
Graphcore      - Interviewer -> Take home project    -> Technical discussion -> Behavioral
ASOS           - Interviewer -> Take home project    -> Technical discussion + Behavioral
Yelp           - Interviewer -> Coding challenge     -> System design + Coding interview + 2 x Behavioral
Toptal         - Interviewer -> Coding challenge     -> Coding interview + Technical discussion -> Take home project + Technical discussion
Sprout.ai      - Interviewer -> Take home project    -> Technical discussion -> Behavioral
THG            - Interviewer -> Behavioral           -> Technical discussion
Hyperscience   - Interviewer -> Technical discussion -> Behavioral
Rasa.ai        - Interviewer -> Coding challenge     -> TBA
Tessian        - Interviewer -> Coding interview     -> Technical discussion -> System design + Behavioral
Audio Analytic - Interviewer -> Technical discussion + (Behavioral + Technical) + Behavioral -> Behavioral?
Zensors        - Behavioral/Interviewer -> Technical discussion + Coding interview -> ML Coding interview -> Behavioral
 -->

<h2 id="the-general-interviewing-flow">The general interviewing flow</h2>

<p>First, let’s go over the main steps in the process. Generally, there are at least 4 steps:</p>
<ol>
  <li>You have the first call with a recruiter or hiring manager. You get to know each other, go over your CV in general, discuss what makes you search for jobs, or accept invitations to interview, what you know about the company, what you are searching for, and so on. A pretty simple step if you ask me. Then, suppose the hiring manager thinks your goals and interests align with what the company seeks. In that case, you will be invited to the second, <strong>technical</strong> step. The dreaded one.</li>
  <li>I call this step just technical for a reason. Some companies split it into 2, a take-home assignment and then a discussion based on it. Others have the typical coding interview. And others yet just have a technical discussion. The technical discussion usually covers ML theory and some specifics, like what transfer learning is, or what transformer architectures are. It might also be a pen-and-paper exercise where you can be asked to derive how PCA works. The latter is more common for more research-oriented roles.</li>
  <li>Most of the time, there are two technical interviews, the second being more focused on system design. Or maybe some more technical challenges and discussions, YMMV, because this is very company- and team-specific.</li>
  <li>Finally, the last round of interviews is usually reserved for everything else that wasn’t covered in the previous steps, usually the behavioral interview. Some companies have three rounds, combining the 3rd step with the 4th.</li>
</ol>

<p>Now, let’s dive into details.</p>

<h2 id="1st-interview">1st interview</h2>

<p>Pretty simple. Make sure to learn about the company, even if you were invited to interview with them. At this point, the company searching for candidates has a few objectives:</p>
<ul>
  <li>to understand how interested you are in the company/position</li>
  <li>to find out whether there are any legal constraints that need to be acknowledged, like visa status</li>
  <li>or personal constraints, like the necessity to work remotely</li>
</ul>

<p>Also, at this stage, the recruiter is assessing whether you’d be a good fit based on your career aspirations, personal opinions, and past experiences.</p>

<p>But don’t be fooled, there’s a probability of failure even at this stage. For example, if the recruiter feels you’re not interested in the position or if your career plans don’t align with the responsibilities of this position.</p>

<h2 id="2nd3rd-interview">2nd/3rd interview</h2>

<p>As mentioned, different companies handle this stage differently. I’ve seen three types. And given that there are two steps here, most companies do a mix of these three methods.</p>

<h3 id="the-take-home-assignment-tribe">The “take-home-assignment tribe”</h3>

<p>Take home - either an ML serving solution or EDA + modeling. No one will expect you to deliver a robust, production-ready solution for the ML serving project, nor will anyone complain that your Jupyter notebook doesn’t contain a SotA ML model for a given dataset. For the former, the focus is on code quality, the presence of tests and features, and ease of running the code; for the latter, on reproducibility and the soundness of the solution.</p>

<p>Focus on quality over quantity. A good way to show professionalism is to follow up with clarifying questions once you receive the task. And please, read it carefully. Too often I have seen people do it all wrong, without even bothering to check the exact constraints of the homework.</p>

<h3 id="the-coding-challengers">The “coding challengers”</h3>

<p>Much has already been said about these. One point I consider worth reiterating is how important it is to actually talk through your problem-solving process and ask clarifying questions. I would argue that this could be even more important than solving the problem. Also, don’t forget about:</p>
<ul>
  <li>Asking about possible edge cases and then covering them.</li>
  <li>Explaining the time and space complexity of your solution.</li>
  <li>If you have the time, extra points for going through your code “debugger-style”. That is, step-by-step while telling what the current values of all your variables are.</li>
</ul>

<h3 id="the-technical-discussionists">The “technical discussionists”</h3>

<p>Discussion with a team of engineers. It usually goes like this: <code class="language-plaintext highlighter-rouge">Technical/ML Trivia + NotSoOptional[ML System Design] + Optional[Behavioral]</code>. ML questions are mostly one of:</p>
<ul>
  <li>“How would you handle X scenario”</li>
  <li>“What is Y? How does this work?”</li>
  <li>Occasionally, for research-heavy roles - “Could you compute Z from scratch, here’s a Google Doc”, as a follow-up to the previous questions.</li>
</ul>

<p>Where \(Y \in \{\text{BatchNorm}, \text{DropOut}, \text{SkipConnections}, \text{DataAugmentation}, \text{SGD}, \text{Transformers}, \text{Attention}, \ldots\}\) and
\(Z \in \{\text{PCA}, \text{Linear Regression}, \text{kNN}, \text{kMeans}\}\).</p>
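<p>To give you a taste of the “compute Z from scratch” part, here’s a barebones PCA sketch in NumPy, roughly what a pen-and-paper-turned-Google-Doc answer could look like: center the data, run SVD, and project onto the top principal directions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def pca(X, n_components):
    """Barebones PCA: center, SVD, project onto the top principal axes."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                     # principal directions
    explained_var = (S[:n_components] ** 2) / (len(X) - 1)
    return X_centered @ components.T, components, explained_var
</code></pre></div></div>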

<p>Sometimes technical discussions take a more ML-System-Design flavor.</p>

<p>It was COVID times, so system design was usually verbal-only, unless you could also text-draw a solution while sharing your screen. Pseudo-code also helps.
ML System Design seems not to be any different. It’s still one of “Design a Search Engine for X”, or “How are you going to design an X-which-is-actually-a-recommender-system”.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
---------   r/w  ----------    ----------   HTTP/2
|  DB   | &lt;------| API    |&lt;-- | NGINX  |  &lt;-------  Client
|       |        |        |    ---------- 
---------        ----------

</code></pre></div></div>
<center><i>Example of "text-drawing" #1</i></center>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                 /-------&gt; Users Service --&gt; MySQL
                                /
Client w/ Browser Cache ---&gt; Gateway -----&gt; Posts Service  --&gt; Cassandra x 6
                                                |                 write_to: 2
                                              Redis               read_from: 1

</code></pre></div></div>
<center><i>Example of "text-drawing" #2</i></center>

<p>Extra points for talking through efficiency/budget/business considerations at this step. For example, proposing to split the application in two, with ML logic on a GPU-enabled machine and business logic on a more conventional server. Or thinking out loud about a buy vs. build decision about some sub-component.</p>

<h2 id="some-personal-opinions">Some personal opinions</h2>

<p>I prefer take-home projects + technical discussions. This combination makes for a more meaningful technical discussion. It allows the candidate to express their ideas about how a proper production system should be designed based on the take-home assignment. Plus, a good take-home project can highlight candidates’ abilities to write code and how they handle logging, testing, documentation, and deployment. I would argue it’s much better than just solving leetcode problems.</p>

<p>I even used take-home assignments to filter candidates when we were hiring for my team. I know the main cons, but I believe that a well-defined problem can be solved in one or two evenings, a couple of hours each. Not great, but it feels much more relaxed than a 45-minute coding interview. Speak of the devil…</p>

<p>I don’t like coding challenges. IMO, it’s usually just lazy bs. These kinds of practices can be understandable for FAANG (<a href="https://www.reddit.com/r/csMajors/comments/qhtqre/faang_manga/">well, more like MANGA nowadays</a>) companies because of their scale*. But when coding challenges are used by small companies, I mostly find it in bad taste.</p>

<blockquote>
  <p>Disclaimer *: I don’t mean that at Google-scale, they need their devs to know very well how to sort an array or find 2 numbers that add up to something. I mean that they have to go through so many candidates that they need a standardized, time-efficient, and repeatable way to check their capabilities. It doesn’t seem realistic for companies this big to give take-home assignments and thoroughly check these without incurring significant time and productivity losses. That’s the sad reality.</p>
</blockquote>

<p>To add to the mess of coding interviews, companies are actually misusing them. Coding interviews are supposed to check for a candidate’s problem-solving <strong>and</strong> communication skills. You need to show the interviewer <em>what your thought process is</em> and <em>how you tackle a new problem</em>. Usually, it shouldn’t matter much if the solution you implemented is optimal or not. You need to be aware of this, though. Regrettably, interviewers usually just look for the “correct” answers, like it’s an exam and not a discussion, making the whole experience miserable.</p>

<p>In theory, coding tests are even worse, because there’s no way to see the candidate’s <em>thought process</em> and <em>the way they are tackling problems</em>. Thus, it becomes just a timed exam that has no actual value in assessing how good a candidate is. In practice, because most interviewers are no better, I would take a coding test over a coding interview almost any day of the week.</p>

<p>So, if I were to rank coding interviews, I would arrange them like this:</p>
<ol>
  <li>“Discussion” coding interview</li>
  <li>Coding test with no interviewer at all</li>
  <li>Exam-like coding interview, without much support from the interviewer</li>
</ol>

<p>Of course, there are exceptions. One time, <a href="https://www.youtube.com/watch?v=e-ftdcWqhUs">at band camp</a> (jk), I had a fantastic experience with a no-interviewer coding challenge. It was a 3.5h HackerRank challenge, in 3 stages, for a research engineering position. The questions ranged from probability to ML model serving, numerical stability, and basic ML theory. Then, for the second stage, it was a code review exercise! I was given a piece of code and had to identify a bug and suggest an improvement. How cool is that?! The final part was an actual coding challenge to implement a graph algorithm. It was exhausting, but at least it wasn’t generic, and because it was so diverse, I felt like it enabled people to show where their true strength lies.</p>

<p>Alright, I’ll stop complaining and move on to the next section of this post.</p>

<h2 id="4th-interview">4th interview</h2>

<p>This one is primarily behavioral. Although I would say candidates are asked behavioral questions throughout, it’s just that at this stage they become the primary focus.</p>

<p>I really like the questions about past experiences: how they could have been handled better, or, if something didn’t work, why.
I feel these questions correlate with actual skill better than generic theory questions do.</p>

<p>A few questions that I really liked were:</p>
<ul>
  <li>If I ask your manager what’s your greatest weakness, what would they tell me?</li>
  <li>What was a situation in which you made a mistake? How would you prevent it now, having more experience?</li>
  <li>Give me an example where you made a poor technical decision and then had to fix it. How did you do it?</li>
</ul>

<p>Generally, any question that asks you to reflect on past mistakes is especially valuable. Why? They help uncover how you have grown since then, how humble you are, and how your critical thinking works.</p>

<p>I have no recollection of such questions in a non-senior ML interview, but plenty of those for senior/lead positions. So maybe think about such scenarios before your next interview.</p>

<h2 id="some-final-tips-to-prepare">Some final tips to prepare</h2>

<p>To really nail that interview process, I like doing mock interviews. The best way to do it (that I found) is <a href="https://pramp.com">Pramp.com</a>. It’s not an advertisement, you can check the link - it has no referral code or anything. I just really find them helpful, especially for coding interviews and somewhat for system design interviews.</p>

<p>For ML system design, the best thing I have found so far is Chip Huyen’s booklet - <a href="https://huyenchip.com/machine-learning-systems-design/toc.html">Machine Learning Systems Design</a>. And of course, for generic system design - <a href="https://github.com/donnemartin/system-design-primer">The System Design Primer</a>.</p>

<p>And remember to really prepare for the behavioral interviews. Be ready to answer questions about how you failed and what you learned from it. Focus more on behavioral questions, specifically ones highlighting your leadership potential and learning-from-mistakes type of situations. For a good list of behavioral questions, see this <a href="https://business.linkedin.com/content/dam/me/business/en-us/talent-solutions/resources/pdfs/linkedin-30-questions-to-identify-high-potential-candidates-ebook-8-7-17-uk-en.pdf">PDF from LinkedIn</a>.</p>

<p>Throughout the process, ask questions and show your interviewers that you are engaged in conversations with them and are interested in the role. Ask them about their technical and business priorities, how specific processes are implemented in the organization, and their current pain points. <a href="https://github.com/viraptor/reverse-interview">Here’s a good list</a> of questions you can ask.</p>

<p>Interested in becoming a senior engineer? You’ll need both strong ML and superior soft skills to get that senior position. Also, maybe check my post <a href="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior"><em>Becoming a Senior Engineer</em></a>, which should help you define your own roadmap.</p>

<h4 id="a-little-disclaimer-last-one-in-this-post">A little disclaimer (last one in this post)</h4>

<p>These posts had been almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn’t be nice, to say the least, to post them back then. In Moldova, there’s a saying “Satu’ arde da baba sî chiaptănă”, which translates to something like “The (unreasonable) old lady is grooming herself while the whole village burns”. I didn’t want to be that lady, so I thought it would be better to wait until things became at least somewhat less chaotic.</p>

<p>#Слава Україні! #Героям слава!</p>]]></content><author><name></name></author><category term="posts" /><category term="machine" /><category term="learning," /><category term="career," /><category term="career" /><category term="advice," /><category term="senior" /><category term="engineer," /><category term="leadership," /><category term="programming," /><category term="interviews" /><summary type="html"><![CDATA[My experience interviewing for a few Senior ML and MLOps roles. You will learn what are the common steps, quirks, and tips how to nail an interview for senior ML engineer positions.]]></summary></entry><entry><title type="html">AutoML Solutions: What I Like and Don’t Like About AutoML as a Data Scientist</title><link href="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl" rel="alternate" type="text/html" title="AutoML Solutions: What I Like and Don’t Like About AutoML as a Data Scientist" /><published>2022-07-04T22:00:00+00:00</published><updated>2022-07-04T22:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/neptuneai-automl</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-07-05-neptuneai-automl"><![CDATA[<blockquote>
  <p>This blog post was written by me and originally posted on <a href="https://neptune.ai/blog/automl-solutions">Neptune.ai Blog</a>. Be sure to check them out. I like their blog posts about MLOps a lot.</p>
</blockquote>

<p>There’s a sentiment that AutoML could leave a lot of Data Scientists jobless. Will it? Short answer – Nope. In fact, even if AutoML solutions become 10x better, it will not make Machine Learning specialists of any trade irrelevant.</p>

<p>Why the optimism, you may ask? Because although a technical marvel, AutoML is no silver bullet. The bulk of work a data scientist does is not modeling, but rather data collection, domain understanding, figuring out how to design a good experiment, and what features can be most useful for a subsequent modeling/predictive problem. The same goes for most ML engineers and other data professionals.</p>

<center><img src="/_data/webp/FullDataScienceWorkflow.drawio.webp" alt="CRISP-DM process for data science projects" /></center>
<center><i>Inspired by <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining">CRISP-DM</a> workflow, but with all the real-world feedback loops | Image by author</i></center>

<p>Indeed, AutoML sounds like some sort of algorithmic magic, that upon receiving your labeled data, will output the best possible ML model for it. Truth be told, AutoML is a bit like interacting with a genie: “Be careful what you wish for”, or rather, what data you give it.</p>

<p>Remember the saying, garbage in – garbage out? Due to the additional feedback loops in an AutoML system, compared to a classic ML solution, the “garbage” will be amplified beyond your wildest imagination. I personally wasn’t careful enough and fell into this trap a few times, but more on that later.</p>

<center><img src="/_data/webp/FullDataScienceWorkflowTimeSpent.drawio.webp" alt="The time it takes to clean the data and create relevant features is significantly larger than to train ML models" /></center>
<center><i>Based on personal experience and the references at the end of the article | Image by author</i></center>

<p>Before making any more claims, we first need to understand what AutoML is, and what it isn’t.</p>

<h2 id="the-current-state-of-automl">The current state of AutoML</h2>

<p>In practice, AutoML can take quite different forms. Sometimes a relatively efficient hyperparameter optimization tool (HPO), which can pick different ML algorithms, can be called an AutoML tool. A few notable examples are <a href="http://epistasislab.github.io/tpot/">TPOT</a>, <a href="http://autokeras.com/">AutoKeras</a>, and <a href="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html">H2O.ai AutoML</a> (not to be confused with <a href="https://h2o.ai/products/h2o-driverless-ai/">Driverless.ai</a>). I could even speculate that given a GUI/Web interface to interact with these kinds of tools, and enough marketing budget, one can create a startup out of these.</p>

<center><img src="/_data/webp/tpot-ml-pipeline.webp" /></center>
<center><i>An example AutoML loop. Image by TPOT from Epistasis Labs | <a href="http://epistasislab.github.io/tpot/">Source</a></i></center>

<p>For some Deep Learning folks, AutoML would be about NAS, aka <strong>Neural Architecture Search</strong> algorithms or methods. These methods are actually a very interesting research direction, which brought us such computer vision architectures as EfficientNet, AmoebaNet, and methods like <a href="https://arxiv.org/abs/1806.09055">DARTS</a>, <a href="https://arxiv.org/abs/1802.03268">ENAS</a>, and <a href="https://arxiv.org/abs/1712.00559">PNAS</a>. A couple of notable open-source tools for NAS are <a href="https://nni.readthedocs.io/">Microsoft’s NNI</a> and <a href="https://auto.gluon.ai/">MXNet AutoGluon</a>.</p>

<p>Recall my speculation about <strong>HPO + nice interface == profit</strong>? It was more of a simplification, but some companies actually did this, of course adding features, scalability, security, and customer service. And it works; it indeed helps organizations enable their data scientists to solve a lot of problems. H2O’s Driverless.ai is probably the most well-known solution of this kind, but parts of <a href="https://www.datarobot.com/">DataRobot</a> and <a href="https://www.dataiku.com/">Dataiku</a>’s products are also managed AutoML behind an easy-to-use interface.</p>

<p>A special mention goes to the AutoML offerings from cloud giants like Google, Azure, and AWS. I don’t have much experience with Azure and AWS, but I can speak about my experience with <a href="https://cloud.google.com/automl">Google’s Vision AutoML</a>. From my experiments and knowledge, these solutions are some of the few that actually use NAS in a developer-oriented product, and this is amazing.</p>

<p>Note that NAS won’t be used for quick runs. The last time I checked, Google Vision AutoML specifically was using transfer learning for quick runs and NAS for 24-hour runs. It’s been a while since I checked, though.</p>

<p>Let’s structure all of this information a bit, shall we? The table below should give you a high-level sense of how different tools qualify as AutoML, in one way or another.</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Is it Open Source?</th>
      <th>On-prem/Managed?</th>
      <th>Features</th>
      <th>Kind</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Microsoft NNI</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>HPO + NAS + Some other interesting stuff</td>
      <td>NAS, has a Web UI</td>
    </tr>
    <tr>
      <td>AutoGluon</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>NAS, supports penalizing big models</td>
      <td>NAS</td>
    </tr>
    <tr>
      <td>AutoKeras</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>NAS, depending on scenario has baselines it tries first</td>
      <td>NAS</td>
    </tr>
    <tr>
      <td>TPOT</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>Builds pre-processing + algorithms + ensembles pipelines</td>
      <td>HPO++, actually uses genetic algorithms</td>
    </tr>
    <tr>
      <td>H2O.ai AutoML</td>
      <td>Yes</td>
      <td>On-premise</td>
      <td>Basically a free version of Driverless.ai</td>
      <td>HPO++, has a Web UI, w/ integrated evaluation</td>
    </tr>
    <tr>
      <td>H2O Driverless.ai</td>
      <td>No</td>
      <td>On-premise</td>
      <td>Uses many pre-processing, feature encoding and selection schemes</td>
      <td>HPO++ with a nicer UI, w/ integrated evaluation</td>
    </tr>
    <tr>
      <td>Google Vision AutoML</td>
      <td>No</td>
      <td>Managed</td>
      <td>Basically a managed, simple to use NAS</td>
      <td>Transfer learning + NAS, a minimalist UI and w/ integrated evaluation</td>
    </tr>
    <tr>
      <td>DataRobot</td>
      <td>No</td>
      <td>On-premise/Managed</td>
      <td>An integrated platform with XAI, Inference server, Model and Experiments management</td>
      <td>AutoML part seems to be an HPO++ w/ integrated evaluation and XAI and a lot of other stuff</td>
    </tr>
  </tbody>
</table>

<p>Fundamentally, AutoML is trading computational budget (or time) for expertise. If you have no idea how to solve a problem, you will opt for the largest possible search space and wait for the search to finish. On the other hand, if you want to cut your expenses for powerful servers, or don’t want to wait for a week until the results arrive, <strong>and know some things about your problem</strong>, you can reduce the search space and arrive at a solution faster.</p>
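<p>To make this trade-off concrete, here’s a hedged sketch with TPOT: the built-in “TPOT light” configuration restricts the search to fast, simple operators, so you get answers sooner at the cost of search breadth. The dataset is a stand-in, and exact results will vary with your TPOT version.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_digits(return_X_y=True), test_size=0.25, random_state=42)

# Smaller search space ("TPOT light") and few generations: cheaper and
# quicker, at the cost of possibly missing better, more exotic pipelines.
tpot = TPOTClassifier(generations=5, population_size=20,
                      config_dict="TPOT light", random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # dumps the winning pipeline as Python code
</code></pre></div></div>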

<p>AutoML should really be treated more like an exploration tool rather than an optimal model generation tool. It’s not an alternative to a data/ML professional.</p>

<h2 id="automl--the-good-parts-pros">AutoML – The good parts (pros)</h2>

<p>Alright, I think we have established that AutoML is not a panacea for all ML issues. Then what is AutoML good for?</p>

<h3 id="speeding-up-the-model-exploration-stage">Speeding up the model exploration stage</h3>

<p>Let’s be honest: more often than not, we are not especially experienced in the domains we’re working on. Note that by domain I don’t mean computer vision, NLP, or time series, but rather advertising, e-commerce, finance, cell biology, genomics, and the list can go on for much longer. To add to the challenge, businesses require quick and impactful results.</p>

<p>I have a semi-personal story on how AutoML can bridge the gap between those with expertise and those without. A few years ago, I was at a summer school about Deep Learning and Reinforcement Learning. The organizers arranged a Kaggle competition, basically trying to forecast some time series. I intentionally omit details, you know, it’s semi-personal so… Anyway, there were PhDs and postdocs, all trying to fit exceedingly complex models, while some others were focusing on creating meaningful features. Having somewhat shallow knowledge of working with time series, and out of pure laziness, I decided I could just use AutoML, namely TPOT. Without much EDA beforehand, and even less feature engineering. My result was in about the 50th percentile. Now, what do you think the winning submission was? Also TPOT, but with basic outlier removal, converting dates and times to categorical features like is_it_weekend and the likes of it, and running TPOT for 2 days.</p>

<blockquote>
  <p><strong>The moral of the story – if you lack subject matter expertise, or time to learn it, or are just lazy, AutoML is a fairly good starting point. It also frees up time to work on those features, and as seen from my story, features do indeed make a difference.</strong></p>
</blockquote>

<p>Although my story suggests it, it’s not always about delivering the final model; sometimes, analyzing the generated candidates for patterns can be of help too. For example, whether the best solutions use Naive Bayes, Decision Trees, Linear Classifiers, or maybe the AutoML tries to create increasingly complex ensembles, meaning you would also need a very expressive model to solve your problem.</p>

<h3 id="a-very-good-baseline">A very good baseline</h3>

<p>So, you’re working on a new ML project. The first thing you do, model-wise – you implement a simple heuristic baseline and see where you stand. Second, you try a simple ML solution and analyze how much it improves on the baseline. One thing you can try after this stage, and what I like to do, is to estimate your upper bound in terms of predictive performance by letting an AutoML solution squeeze the most out of your data and preprocessing.</p>

<blockquote>
  <p><strong>Not only does it sometimes deliver superior results quickly, but it also shifts your perception towards working on better features.</strong></p>
</blockquote>

<p>Note that sometimes you don’t have the resources or are constrained by some other factors. So YMMV, but do keep in mind this use case for AutoML when working on new projects.</p>
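<p>As a rough sketch of the baseline-then-AutoML progression described above, on a toy scikit-learn dataset (the AutoML run for the upper bound would be step three, following the same pattern):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: trivial heuristic baseline, predict the majority class.
baseline = DummyClassifier(strategy="most_frequent")
print("baseline:", cross_val_score(baseline, X, y, cv=5).mean())

# Step 2: a simple ML model, how much does it improve on the baseline?
simple = LogisticRegression(max_iter=5000)
print("simple model:", cross_val_score(simple, X, y, cv=5).mean())

# Step 3: an AutoML run on the same data to estimate the upper bound.
</code></pre></div></div>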

<h3 id="identify-quickly--what-works-and-what-doesnt">Identify quickly – what works and what doesn’t?</h3>

<p>The space of possible combinations of feature transformations, algorithms, their hyperparameters, and ways of ensembling said algorithms creates an immense search space of possible ML models. Even when you know what solutions can work and what can’t for a given problem, it’s still a vast search space. AutoML can help to fairly quickly test which configurations are more likely to work.</p>

<p>“How?” – you may ask. By running AutoML multiple times, and tracking:</p>

<ul>
  <li>what configurations get picked more often,</li>
  <li>how often,</li>
  <li>what is dropped,</li>
  <li>how quickly is it dropped,</li>
  <li>and so on.</li>
</ul>

<p>In a way, this is some kind of meta-EDA. One might say – Exploratory Model Analysis.</p>
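<p>A quick-and-dirty sketch of this idea with TPOT (reusing a train split like the one from the earlier snippet; <code class="language-plaintext highlighter-rouge">fitted_pipeline_</code> holds TPOT’s best scikit-learn pipeline):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter
from tpot import TPOTClassifier

# Assumes X_train, y_train are already defined, as in the earlier snippet.
winners = Counter()
for seed in range(5):
    tpot = TPOTClassifier(generations=3, population_size=20,
                          random_state=seed, verbosity=0)
    tpot.fit(X_train, y_train)
    final_step = tpot.fitted_pipeline_.steps[-1][1]
    winners[type(final_step).__name__] += 1   # which estimator won this run

print(winners)   # which model families keep being picked across seeds
</code></pre></div></div>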

<p>Now, why would you be interested in it? We want the best model, why not get straight to it? Because what we should aim for isn’t one good final model, but an understanding of what works, and what doesn’t. And based on this understanding, we can better solve problems further down the line. Even with AutoML, no one exempts you from such lovely issues as needing to periodically retrain your models on new data and also trying to reduce budget expenditure on ML.</p>

<h2 id="automl--the-bad-parts-cons">AutoML – The bad parts (cons)</h2>

<h3 id="a-false-sense-of-security">A false sense of security</h3>

<p>Honestly, this is the thing I hate the most about AutoML. It feels like magic and makes you lazy. And just like any automation, the more you use it, the more catastrophic it is when it fails.</p>

<p>Because of this, it’s easy to introduce data bugs. And due to AutoML’s sometimes opaque nature, these bugs are very hard to spot.</p>

<p>I have a personal anecdote about this, too – one that I will probably never get tired of recalling. We were working on a cell classification problem, where the distinction between the positive and negative classes was tough to observe even for a human. The images could be classified at least somewhat accurately only by SMEs. We had been trying for a few months to create a computer vision model to automate this task. The results weren’t good. Even with the most custom-built solution, which took into account various properties of our dataset and was capable of learning from small amounts of data without overfitting, the accuracy was close to 69%. On a binary classification problem.</p>

<p>At that stage, we had the opportunity to use Google Vision AutoML which was still in beta. The quick run results were a bit worse than ours. Eventually, we decided to run the full training, which was a bit pricey, and to make the most out of our data, we manually augmented the images to increase the dataset size. Lo and behold, 98.8% accuracy. Great success!</p>

<p>Only I was skeptical about it. After months of failed experiments, hundreds of hyperparameters tried, and dozens of methods used, I couldn’t believe some NAS could beat the problem, and do so by light-years. My superior was preparing to announce our outstanding results to the investors and other stakeholders. I insisted we inspect what was going on. A few weeks later, after a few dozen partially occluded images, total confusion, and despair, I figured it out.</p>

<p>We manually augmented the dataset before using it with Google Vision AutoML, but we didn’t manually specify the splits. As a result, augmented versions of the same image were in training, test, and validation splits. The model just memorized the images. Once we fixed it and ran it again, we got ~67%.</p>
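<p>The fix itself is a one-liner once you know it: split by source image, not by individual (augmented) sample. A minimal sketch with scikit-learn, assuming each augmented variant carries the id of its source image:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: 1000 source images, 5 augmented variants each.
X = np.random.rand(5000, 64 * 64)            # flattened images
y = np.random.randint(0, 2, size=5000)       # binary labels
source_ids = np.repeat(np.arange(1000), 5)   # id of the original image

# All variants of one source image land in the same split: no leakage.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=source_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
</code></pre></div></div>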

<blockquote>
  <p><strong>The moral of the story – don’t get comfortable with AutoML, it’ll bite you in the back.</strong></p>
</blockquote>

<h3 id="prone-to-over-optimizationover-fitting">Prone to over-optimization/over-fitting</h3>

<p>Depending on the nature of your data and your model validation setup, some AutoML solutions can easily overfit. By the nature of data I mean its properties, like label distributions, how many outliers you have, and the overall quality of your dataset. To be fair, often it’s not the tool’s fault, but yours, meaning most of the time the cause of overfitting is in your evaluation setup. So watch how you evaluate candidates and how you split your data, and if you’re working with time series – I don’t envy you. Treat the AutoML process like hyperparameter optimization, and split your data accordingly using something like <a href="https://weina.me/nested-cross-validation/">nested cross-validation</a>.</p>

<p>You can find a comprehensive guide on how to properly evaluate any machine learning model <a href="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis">here in this post</a>.</p>
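<p>For reference, a minimal nested cross-validation sketch in scikit-learn; the inner loop tunes hyperparameters, while the outer loop yields an unbiased performance estimate:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # for evaluation

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=inner_cv)

# Each outer fold re-runs the whole search, so the test folds stay untouched.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
</code></pre></div></div>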

<h3 id="too-much-emphasis-on-optimization">Too much emphasis on optimization</h3>

<p>As mentioned a few times already, the correct way to think of AutoML is as an enabler that lets you focus more on the data side of things. But in reality, many fall into the trap of thinking that model hyperparameters, and the model in general, are the most important factor in an ML project, because AutoML solutions can sometimes show excellent improvements, reinforcing this idea.</p>

<h3 id="the-resulting-models-can-be-tedious-to-deploy">The resulting models can be tedious to deploy</h3>

<p>I once had the opportunity, or misfortune, depending on when you ask me, to work on ad price forecasting. And eventually, I tried using AutoML, namely TPOT. It ran well and gave pretty good results, so we decided to have our best-performing model deployed. I was asked to convert the model into something that a Golang or, at least, a Java backend would understand because deploying Python services was a no-go.</p>

<p>After a few hours of research, I discovered <a href="https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language">PMML</a>, plus I already knew about <a href="https://onnx.ai/">ONNX</a>. Long story short, PMML-capable libs vary a lot in what models they can read. So, while my ensemble Python model generated by TPOT was somewhat unproblematic to convert to PMML format, making a Go program understand it was impossible. Why? Because the Go lib didn’t know how to work with ensembles, preprocessing, and most models except for some decision trees, linear classifiers, and maybe Naive Bayes. As for ONNX, it also proved problematic to convert a scikit-learn ensemble pipeline to ONNX.</p>

<p>Often AutoML candidate models grow very complex, and converting them into anything becomes a headache. That’s why a lot of production ML is based mostly on linear classifiers, Naive Bayes and random forests, and GBDTs. You will rarely if ever see some complex stacked ensemble of different classifiers. They are a priori slow and very hard to make fast or compatible with non-Python environments.</p>
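<p>For the curious, the conversion attempt itself is short; it’s the unsupported operators in complex ensembles that bite. A hedged sketch using skl2onnx, where <code class="language-plaintext highlighter-rouge">pipeline</code> and <code class="language-plaintext highlighter-rouge">n_features</code> are placeholders for your own fitted scikit-learn pipeline and its input width:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# 'pipeline' and 'n_features' come from your own training code.
# This is exactly where exotic AutoML-generated ensembles tend to fail:
# any operator the converter doesn't support raises an error here.
onnx_model = convert_sklearn(
    pipeline, initial_types=[("input", FloatTensorType([None, n_features]))])

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
</code></pre></div></div>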

<h3 id="hard-to-analyzedebug-the-model">Hard to analyze/debug the model</h3>

<p>Recall the Google Vision AutoML story. Google didn’t have any facilities to deeply inspect models, a la <a href="https://en.wikipedia.org/wiki/Explainable_artificial_intelligence">XAI</a>. Also, there was no way to obtain some kind of interpretability or explanations of predictions for individual images. As a result, I was stuck with obfuscating parts of input images and analyzing the predictions. Generally, explainability and debugging tools for AutoML are a special problem. AutoML-generated models tend to be quite complex, thus hard to analyze. Additionally, most of the time the complexity hits twice, because a complex model will take more time to run predictions, and this, in turn, makes obtaining explanations using black-box analysis tools even more burdensome.</p>

<p>If you’re interested in some of the most popular black-box XAI tools, check out <a href="https://alexandruburlacu.github.io/posts/2021-05-09-archive-understanding-a-black-box">this post</a>.</p>
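<p>If you’re wondering what “obfuscating parts of input images” looks like in practice, here’s a rough occlusion-sensitivity sketch, assuming a Keras-style <code class="language-plaintext highlighter-rouge">model.predict</code> that returns class probabilities:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def occlusion_map(model, image, target_class, patch=16, stride=16):
    """Slide a gray square over the image and record how much the
    target-class probability drops: a crude, black-box saliency map."""
    h, w = image.shape[:2]
    base = model.predict(image[np.newaxis])[0, target_class]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch] = 0.5   # gray box
            drop = base - model.predict(occluded[np.newaxis])[0, target_class]
            heat[i, j] = drop
    return heat   # high values mark regions the model relies on
</code></pre></div></div>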

<h2 id="automl-vs-data-scientists">AutoML vs Data Scientists</h2>

<p>Before I give you some numbers, just keep in mind that depending on the problem you’re trying to solve, your experience with AutoML will vary greatly. So, let’s dive in.</p>

<h3 id="a-word-on-automl-benchmarks">A word on AutoML benchmarks</h3>

<p>The literature on AutoML benchmarks is fairly scarce, and most often it compares the performance of AutoML solutions while omitting the performance of humans. Also, the studies are mostly about tabular datasets. Thankfully, we do have some work on establishing standardized ways to assess the performance of different AutoML solutions.</p>

<p>First, there’s the <a href="https://github.com/openml/automlbenchmark">AutoML benchmark</a>, and then there’s also a so-called Kaggle benchmark, which you can find examples of <a href="https://arxiv.org/pdf/2003.06505.pdf">in this paper</a> and in <a href="https://towardsdatascience.com/compare-popular-automl-frameworks-on-10-tabular-kaggle-competitions-9b1420e8942d">this Medium post</a>. For information on the use of AutoML/NAS in computer vision and text classification tasks, the easiest thing to do is to check the results of the <a href="https://github.com/google-research/nasbench">NAS Bench</a>(mark) and a <a href="https://www.automl.org/nas-overview/">few other competitions</a>. Still, not much comparative analysis between people-led and algorithm-led designs.</p>

<h3 id="is-all-hope-lost">Is all hope lost?</h3>

<p>No. On one hand, you can always try to run your models against the datasets mentioned above and see how good/bad you are against AutoML. But of course, this isn’t the answer you’re looking for. Enter <a href="https://arxiv.org/abs/2108.12193"><em>“Man versus Machine: AutoML and Human Experts’ Role in Phishing Detection”</em></a>. I’ll give you the gist of it, and a personal remark.</p>

<center><img src="/_data/webp/AutoMLvsNotAutoML.webp" alt="Comparisons of the AUC score and training duration of the best model built using AutoML and non-AutoML frameworks" /></center>
<center><i>Comparisons of the AUC score and training duration of the best model built using AutoML and non-AutoML frameworks* | See the article for more details</i></center>

<p>* One thing to note – Duration is calculated as the time it takes for a model to be trained on the given dataset.</p>

<ul>
  <li>
    <p>The authors conclude that AutoML models significantly outperform people when the datasets these solutions are applied to have some overlap in their classes and generally show high degrees of non-linearity. In other words, hard datasets. Otherwise, the performance is on par with not using AutoML. They also claim that AutoML solutions usually take much longer to create high-performing models compared to non-AutoML.</p>
  </li>
  <li>
    <p>And here’s the catch: the authors don’t mention the time it takes to come up with a high-performing model. Why, you may ask? Because for their non-AutoML solutions they take existing scikit-learn algorithms and don’t tune them at all. What does it all mean? First, take the duration conclusion with a grain of salt. Second, AutoML will only ever make sense for hard datasets, with noise, overlapping classes, and high degrees of non-linearity. Otherwise, you’ll be better off with the default settings of some off-the-shelf algorithm.</p>
  </li>
</ul>

<p>Their findings on the correlation between dataset complexity and AutoML advantage are quite in line with my personal experience and the results of the AutoML Benchmark, in which on more complex datasets some AutoML solutions have a 10%+ advantage in AUC and accuracy over manually created models. As you may recall from my story in the first part of AutoML cons, what took me a few months of work, Google’s AutoML almost matched in 24 hours.</p>

<p>How does all of this information help you? If you know your dataset is well-behaved, maybe don’t bother with AutoML. But how would you know? You can try running a few classic ML models, and see how their cross-validation performance varies. Or maybe just “look” at your data.</p>
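<p>A quick sketch of that first check, assuming a standard feature-matrix setup: fit a few classic algorithms and compare the spread of their cross-validation scores. Similar, stable scores hint at a well-behaved dataset; large gaps or high variance hint at a harder one.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for your own data

for model in (GaussianNB(),
              LogisticRegression(max_iter=5000),
              DecisionTreeClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: {scores.mean():.3f} +/- {scores.std():.3f}")
</code></pre></div></div>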

<p>Personally, I use AutoML first in the beginning as a quick exploration tool, and then when all hope is lost. Never in between. To help you make up your own mind about AutoML, check out the links below, and run experiments.</p>

<h3 id="further-reading--benchmarks-of-automl-methods-including-against-humans">Further reading – benchmarks of AutoML methods, including against humans:</h3>

<ul>
  <li><a href="https://towardsdatascience.com/automl-is-overhyped-1b5511ded65f">AutoML is Overhyped</a></li>
  <li><a href="https://towardsdatascience.com/automl-faceoff-2-machines-vs-15-humans-bfc9d03e590f">AutoML Faceoff: 15 Humans VS 2 Machines. Who won? | by Norm Niemer | Towards Data Science </a></li>
  <li><a href="https://towardsdatascience.com/compare-popular-automl-frameworks-on-10-tabular-kaggle-competitions-9b1420e8942d">Compare popular AutoML frameworks on 10 tabular Kaggle competitions | by Piotr Płoński | Towards Data Science</a></li>
  <li><a href="https://arxiv.org/abs/1907.00909">[1907.00909] An Open Source AutoML Benchmark</a></li>
  <li><a href="https://arxiv.org/abs/2108.12193">[2108.12193] Man versus Machine: AutoML and Human Experts’ Role in Phishing Detection</a></li>
  <li><a href="https://arxiv.org/abs/2003.06505">[2003.06505] AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data</a></li>
  <li><a href="https://arxiv.org/abs/1902.09635">[1902.09635] NAS-Bench-101: Towards Reproducible Neural Architecture Search</a></li>
</ul>

<h2 id="what-if-everyone-would-use-automl-always">What if… everyone would use AutoML, always?</h2>

<p>Before we dive into this thought experiment, recall that AutoML works by trading computation for expertise. If we are clueless and have tons of computing power, this is “The Tool”. Let’s analyze what would happen if we went all-in with AutoML in the case of a more classic, established business, and in the case of an innovative company.</p>

<h3 id="major-enterprises-like-ford">Major enterprises, like Ford</h3>

<p>Depending on which department would use AutoML instead of their existing ML/DS tools, we might have somewhat good results, for example in marketing and sales, somewhat worse results in logistics and planning, and probably absolutely rubbish results for stuff like ADAS (advanced driver-assistance systems) and simulation software. Besides, the increase in computing power required for the company to run these AutoML solutions would most certainly set them back by a non-trivial amount of cash.</p>

<p>And even if they had the money and irrationality to go all-in on AutoML, it would still be a bad idea, due to strict requirements for model interpretability, which a complex ensemble model resulting from AutoML just can’t give. Hard pass.</p>

<h3 id="innovative-companies-like-palantir">Innovative companies, like Palantir</h3>

<p>If we’re talking specifically about Palantir, I believe their software doesn’t really care whether AutoML is involved, because it’s about integrating and smartly using the data assets of an organization. Still, most of the analysis doesn’t require very advanced ML algorithms, so using AutoML would be a waste of money. Why use it when the best model is still going to be a linear regression or a decision tree? Because, again, their clientele consists of organizations that value model interpretability very much.</p>

<p>For any other innovative company, AutoML would have its place, but still within some serious limits. A lot of the time, the problems faced by these organizations can’t be simply formulated as supervised classification or regression, which makes it tricky to use AutoML.</p>

<p>The more innovative the use case, the harder it is to use off-the-shelf solutions. Can you imagine using an open-source AutoML tool to develop new drugs, or composite materials, or optimize the placement of transistors on a specialized chip? Me neither. These tasks can easily and should be treated as research directions. Is anyone in need of a startup idea?</p>

<h3 id="an-analysis">An analysis</h3>

<p>Maybe you noticed that a major problem for industry adoption of AutoML is interpretability. You might think “Oh, but maybe they haven’t heard about stuff like <a href="https://shap.readthedocs.io/en/latest/index.html">SHAP</a>, or XAI (Explainable AI) in general? That ought to change their minds”. I assure you, it won’t. Not soon, anyway.</p>

<p>You see, there’s a major difference between model interpretability and explainability. The former means that the model can be understood, as it is. The latter usually means either that there’s a way to infer why a certain prediction was made, or in more academic/cutting-edge cases, that a model will “tell you” the reasoning behind its prediction. And maybe you already see the problem here. No one can guarantee you that the explanation is correct.</p>

<p>This is the reason why, for example, there were thousands of people developing neural network-based computer vision models to detect if a patient has COVID based on their X-ray scans, and yet no major medical institution was using these. Doctors need to understand very well why the predictions were made. Likewise, legal, accounting, sales, marketing, and all the rest have different, sometimes non-negotiable requirements regarding model interpretability. And that’s why organizations are still big fans of linear models and decision trees and shy away from dense Neural Networks.</p>

<h2 id="so-what-would-be-a-good-use-case-for-automl">So what would be a good use case for AutoML?</h2>

<p>Now, let’s see some concrete use cases which can benefit the most from AutoML:</p>

<h3 id="batch-jobs">Batch jobs</h3>

<p>Most AutoML tools do not take into account model complexity/compute requirements, as a result giving you very well-tuned models which can be extremely slow or computationally demanding. Because of this, using such models is impossible in interactive or streaming scenarios, so what you’re left with is using them for batch jobs.</p>

<p>Maybe running ML as batch jobs doesn’t sound that exciting, especially after you read about incredible feats of engineering where ML models are deployed to interact directly with users, maybe even on edge devices, or how people are using ML models in streaming scenarios to process billions of events in near real-time. But trust me, a lot of businesses have processes that are absolutely fine with running on a schedule once every few hours, days, or even weeks. You’ve certainly heard that in business the quickest results beat the most accurate ones, but there are plenty of situations where accuracy is more critical than time.</p>

<h3 id="testing-the-waters-for-a-problem">Testing the waters for a problem</h3>

<p>I’ve said it before, and I will say it again – AutoML is best suited for quick prototyping. It’s my favorite use case for AutoML and one that helps me assess where an upper bound of performance might be, with my current dataset and pre-processing/feature engineering in place. When you adopt this mindset, you slowly turn towards a more data-centric ML/AI paradigm, because you just assume that you will always get an optimized model.</p>

<p>Keep in mind that this should be done <strong>after</strong> the EDA stage. Also, if possible, try to reduce the search space based on your EDA. If there are no significant correlations between attributes and the target variable, you can confidently drop linear classifiers from the search space. What I like doing is running a few quick experiments with a reduced search space using an AutoML tool, with only the simplest models and with different random seeds, for replicability, and seeing which models perform best; see the sketch below. Based on that, I can adjust the search space for the next runs.</p>
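<p>In TPOT, for instance, shrinking the search space to the simplest models is just a custom <code class="language-plaintext highlighter-rouge">config_dict</code>. A sketch, with operator paths following TPOT’s configuration format and the hyperparameter grids being placeholders:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from tpot import TPOTClassifier

# Only simple, interpretable models in the search space.
simple_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.linear_model.LogisticRegression": {"C": [0.01, 0.1, 1.0, 10.0]},
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": [3, 5, 10]},
}

# Assumes X_train, y_train are already defined, as in the earlier snippets.
for seed in (0, 1, 2):   # a few seeds, for replicability
    tpot = TPOTClassifier(generations=3, population_size=20,
                          config_dict=simple_config, random_state=seed)
    tpot.fit(X_train, y_train)
    print(seed, tpot.fitted_pipeline_)
</code></pre></div></div>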

<h2 id="takeaways">Takeaways</h2>

<p>AutoML is both a blessing and a curse. As with any tool, it can be used right to the greatest advantage, or it can be misused and then bad-mouthed.</p>

<blockquote>
  <p><strong>One thing to keep in mind is don’t abuse it.</strong></p>
</blockquote>

<p>It can be tempting to throw AutoML at any problem, even before analyzing your data or understanding your problem. Don’t be that person.</p>

<p>Another important thing you should get from this blog post: invest all the time you save using AutoML in feature engineering. Think of it this way: if you already had the best model for your dataset, what else could you do to improve the performance of your machine learning system? Obviously, you could fetch more data, ensure that the data is of higher quality, or build more informative features. Of course, AutoML won’t give you a perfect model, but the rationale holds. With modeling (almost) out of the way, and better performance still possible, you should focus on improving your data and features to reach those performance objectives. And if the results look too good – debug it.</p>

<p>Most importantly, make sure you understand very well the business requirements. So before running AutoML for hours on powerful CPUs and GPUs, take a few minutes to discuss whether your users will appreciate the slight increase in predictive performance, and won’t mind the lack of model interpretability.</p>

<p>As you can see, depending on who you ask, AutoML can mean quite different things. I recall the first time I realized that most of what is marketed as AutoML can be done with a multi-core workstation and a hyperparameter optimization library, all of it wrapped in a simple UI; I was somewhat disenchanted. As long as it works for you, I guess.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/">Data Prep Still Dominates Data Scientists’ Time, Survey Finds</a></li>
  <li><a href="https://blog.ldodds.com/2020/01/31/do-data-scientists-spend-80-of-their-time-cleaning-data-turns-out-no/">Do data scientists spend 80% of their time cleaning data? Turns out, no? – Lost Boy</a></li>
  <li><a href="https://medium.com/human-science-ai/how-i-spent-my-time-as-product-data-scientist-90e760044cd7">How I Spent My Time As Product Data Scientist | by andrew wong | Human Science AI | Medium </a></li>
  <li><a href="https://www.fast.ai/2018/07/12/auto-ml-1/">What do machine learning practitioners actually do?</a></li>
  <li><a href="https://doc.dataiku.com/dss/latest/">Dataiku Documentation</a></li>
  <li><a href="https://www.datarobot.com/platform/automated-machine-learning/">Automated Machine Learning – DataRobot AI Cloud </a></li>
  <li><a href="https://h2o.ai/platform/ai-cloud/make/h2o-driverless-ai/">H2O Driverless AI </a></li>
  <li><a href="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html">AutoML: Automatic Machine Learning — H2O 3.36.1.2 documentation </a></li>
  <li><a href="https://www.fast.ai/2018/07/16/auto-ml2/">An Opinionated Introduction to AutoML and Neural Architecture Search · fast.ai </a></li>
  <li><a href="https://arxiv.org/abs/1908.00709">[1908.00709] AutoML: A Survey of the State-of-the-Art </a></li>
  <li><a href="https://towardsdatascience.com/automl-is-overhyped-1b5511ded65f">AutoML is Overhyped</a></li>
  <li><a href="https://towardsdatascience.com/automl-faceoff-2-machines-vs-15-humans-bfc9d03e590f">AutoML Faceoff: 15 Humans VS 2 Machines. Who won? | by Norm Niemer | Towards Data Science </a></li>
  <li><a href="https://www.encora.com/insights/machine-learning-applied-to-medical-diagnosis">Machine Learning Applied to Medical Diagnosis </a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><category term="automl," /><category term="ml," /><category term="machine-learning," /><category term="deep-learning," /><category term="nas," /><category term="network-architecture-search," /><category term="hpo" /><summary type="html"><![CDATA[AutoML sounds like magic. But how effective is it? And when to better use a simpler approach?]]></summary></entry><entry><title type="html">Choosing programming languages for real-world projects</title><link href="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool" rel="alternate" type="text/html" title="Choosing programming languages for real-world projects" /><published>2022-06-17T22:00:00+00:00</published><updated>2022-06-17T22:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/choosing-a-tool</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-06-18-choosing-a-tool"><![CDATA[<p>A few years ago, when I was in my senior year at the university, during the distributed systems lecture our professor asked us a very nice question:</p>
<blockquote>
  <p>If we were to choose between a fancy new programming language, or Java/C#, for a greenfield commercial project, what would we choose and why?</p>
</blockquote>

<p>If you’re wondering what it has to do with distributed systems, I have to say - half of it was about software architecture.</p>

<p>The classroom was split into 2 camps, obviously. The fun and somewhat sad fact was that the Java camp won. I was part of that camp, even though I don’t like Java, to say the least. We had much better arguments. So, what were those winning arguments? Rich library and tooling ecosystem, and the relative availability of professionals in our local market, for a fair price too. Our professor deemed us project managers, not real programmers, then said we were right, and for a few seconds the atmosphere in the classroom turned sad and hopeless. Then we moved on with the lecture.</p>

<p><strong>TL;DR:</strong> We all want to play with the shiniest new toys, but when money is at stake, better stick to something tried and true.</p>

<p>So here are some questions to keep in mind when choosing a programming language, or any software tool for that matter, for a project. The focus will be on commercial projects, but some of the tips work for research projects and simple pet projects too.</p>

<h2 id="basic-level">Basic level</h2>

<p>Initially, the decision-making process is usually guided by a very narrow understanding of the consequences of choosing a specific tool. In increasing order of maturity, here are some basic reasons to make a choice:</p>
<ol>
  <li><em>I would like to learn this new tool/language/framework, people say it’s hot right now</em></li>
  <li><em>People say this is the best tool/language for this kind of problem</em></li>
  <li><em>I know this language/tool very well and can be very productive with it</em></li>
  <li><em>I and my team know this language/tool quite well and we can all be productive with it</em></li>
</ol>

<p>Reasons 1 and 2 are acceptable only for a pet project, with a small caveat, which I’ll explain later*. Although I would recommend sometimes taking a look at more niche, possibly peculiar tools to learn. Because, you know, <a href="https://www.goodreads.com/author/quotes/1164347.Alan_J_Perlis">if a language doesn’t change the way you think, it’s not worth learning</a>.</p>

<p>Reason 4 is a decent one, see Paul Graham’s post about <a href="http://www.paulgraham.com/avg.html">using LISP to build a startup</a>, but in the long run, it’s not that simple.</p>

<h2 id="higher-level-decision-making">Higher-level decision making</h2>

<p>The difference between programming, i.e. just getting stuff done, and software engineering is that the latter has significantly harder constraints (see <a href="https://abseil.io/resources/swe-book/html/toc.html">Software Engineering at Google</a>). Not just any code can be developed productively by a changing team of people and maintained over time. And most commercial software isn’t one-time scripts, but code that lives on for years, if not decades. Being a senior engineer, <a href="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior">among other things</a>, is also about making well-thought-out technical choices.</p>

<p>That’s why, when choosing a tool, language, or an entire stack, try to guide your decision-making with these questions, in no particular order:</p>

<ul>
  <li><em>How well documented this tool/language is?</em></li>
  <li><em>How actively used/developed is it?</em></li>
  <li><em>How many dependencies of any sort does it have?</em></li>
  <li><em>How stable this tool/language is?</em></li>
  <li><em>What is the size and quality of the ecosystem for this tool/language?</em></li>
  <li><em>How productive can someone be using this tool/language?</em></li>
</ul>

<p>More constraints, but doable.</p>

<h2 id="business-level-decision-making">Business-level decision-making</h2>

<p>Now we’ve reached the final frontier. Until now, it wasn’t particularly hard to make a choice, you just had to do your research. But now, we’re gonna have to enter the realm of never-ending trade-offs. Keep in mind that software is written by people, whom you have to employ and pay salaries to, while ideally keeping a positive return on investment.</p>

<ul>
  <li><em>How easy is it to teach someone, or how much time does it take to make someone productive with the given tool/language?</em></li>
  <li><em>How much reachable supply of professionals is out there for this tool/language? Is it sufficient for you?</em></li>
  <li><em>How much do professionals who are knowledgeable with this tool/language ask for (money, perks, whatever)?</em></li>
  <li><em>What is the quality of the supply? Are the engineers mostly newbies or seasoned professionals?</em></li>
  <li><em>How many people would like to work with the chosen tool/language? How excited are they?</em></li>
</ul>

<p>The raw performance of a tool or language is rarely a big issue. Some domains do care about that characteristic, like scientific computing, low-latency systems, and maybe embedded systems. More recently, how energy-efficient, or “green”, a language or tool is has been gaining importance. Yes, <a href="https://docente.ifsc.edu.br/mello/livros/java/paperSLE.pdf">I’m not kidding</a>. For example, <a href="https://aws.amazon.com/blogs/opensource/sustainability-with-rust/">Amazon cares</a> about such things, although like all things at this level, it’s <a href="https://news.ycombinator.com/item?id=30441771">not so simple</a>.</p>

<h3 id="an-example-of-picking-a-language">An example of picking a language</h3>

<p>Let’s do a “demo”. We will assume that we’re a remote-first startup and we want to build <del>a snowman</del> a serverless platform. How do we pick the programming stack? Well, at least the programming language. We will assume that the technical founders are capable of writing in any language. No, they are not <a href="https://en.wikipedia.org/wiki/Spherical_cow">spherical</a>.</p>

<p>An important technical constraint for our project is that serverless technology is especially effective when the startup time of a serverless function is short. If it’s not, why bother? Optionally, we might want to dive into serverless edge computing, meaning we need a programming language that can work even on resource-constrained devices. Maybe not microcontrollers, but something like a newer Raspberry Pi shouldn’t be considered unrealistic.</p>

<p>We are also budget-constrained because we’re a startup. We need to execute fast, or else we might not reach escape velocity, and no one will bother.</p>

<p>With that said, let’s prune some candidates. Because of our startup latency constraint, we can’t afford to run anything which needs a VM-like runtime. So no Java, C#, or even Erlang or Elixir. Although Erlang and Elixir have less severe problems with VM cold starts, they have another downside: a smaller talent pool. On yet another hand, this talent pool is usually very enthusiastic and professional. I personally love Elixir, it’s just a pleasure to write, <a href="https://alexandruburlacu.github.io/posts/2021-05-07-elixir-pattern-matching-magic">see why</a>. What a shame we’re not building a messaging system.</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Java</td>
      <td>No</td>
      <td>Very Large</td>
      <td>Very Good</td>
      <td>Can we go lower?</td>
      <td>Half of Java jokes are about this</td>
    </tr>
    <tr>
      <td>C#</td>
      <td>No</td>
      <td>Large</td>
      <td>Very Good</td>
      <td>A bit better than Java</td>
      <td>A bit better than Java</td>
    </tr>
    <tr>
      <td>Elixir/Erlang</td>
      <td>No</td>
      <td>Small</td>
      <td>Good</td>
      <td>Almost through the roof</td>
      <td>Good, for a VM-based language</td>
    </tr>
  </tbody>
</table>

<p>If we are planning for maximum efficiency, maybe we should use C++? Definitely not. C++ is quite dangerous. Besides, we need to keep in mind that we want to develop fast and preferably without much risk of segmentation faults, resource leaks, and other C++ surprises. Btw, a good C++ dev is quite expensive and hard to find nowadays.</p>


<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>C++</td>
      <td>No</td>
      <td>Moderate</td>
      <td>Moderate, hard to use IMO</td>
      <td>Depends on what kind of person you are</td>
      <td>Sonic the hedgehog approves</td>
    </tr>
  </tbody>
</table>

<p>We know that development speed is important. But we also want a performant language without VM cold-start problems. How about Python, or JS? These are popular, fast to work with, with a considerable talent pool, and JS can be speedy too. To be fair, this wouldn’t be the worst idea. Python, specifically CPython, can be slow, but with the right tooling, or by substituting it with <a href="https://www.pypy.org/">PyPy</a>, we can solve these problems. As for JS, one issue is that the language is not the most pleasant to debug, with its <a href="https://javascriptwtf.com/wtf/javascript-holy-trinity">unholy trinity of no-values</a> and subpar traceback messages. Regretfully, there are lots of not-so-good devs out there practicing these tools, so that’s an issue. Finally, these are not the best systems programming languages.</p>


<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>JS</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>Depends on what flavor you’re using</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (CPython)</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>It will be a bummer that it’s not used for DS/ML/AI</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (PyPy)</td>
      <td>Maybe/Yes</td>
      <td>Very Large (but there’s a catch)</td>
      <td>Good</td>
      <td>If you know, you know</td>
      <td>Good, and it’s very fast overall</td>
    </tr>
  </tbody>
</table>

<p>Ok, so I said it, systems programming languages. And we dropped C++. What do we have left? <a href="https://golangdocs.com/system-programming-in-go-1">Go</a>, <a href="https://msrc-blog.microsoft.com/2019/07/22/why-rust-for-safe-systems-programming/">Rust</a>, <a href="https://crystal-lang.org/">Crystal</a>. We drop Crystal right away due to the lack of a sizeable community, talent pool, and libraries. So, it’s Go vs Rust? Hold on, there’s another contestant - <a href="https://ocamlverse.github.io/content/systems_programming.html">OCaml</a>. So, why did it come down to these 3 languages? All of them are very suitable for systems programming, that is, interacting with lower-level OS constructs; they are quite efficient working closer to hardware, and in general are fast and resource-efficient. Of the 3, Go is the most mainstream, which is a plus. Also, it’s easy to onboard people to it. On the other hand, Rust and OCaml provide nicer guarantees for the programs you write, and although less popular, the quality of developers using them is usually pretty high. OCaml and Rust are pretty close idiomatically, but Rust’s syntax will be much more familiar to non-hardcore-FP people, aka common folk, so it’s probably 10 points to Rust. All in all, let’s see the final table.</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Verdict</th>
      <th>Talent Pool Size</th>
      <th>Tooling</th>
      <th>Excitement Factor</th>
      <th>Startup Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Java</td>
      <td>No</td>
      <td>Very Large</td>
      <td>Very Good</td>
      <td>Can we go lower?</td>
      <td>Half of Java jokes are about this</td>
    </tr>
    <tr>
      <td>C#</td>
      <td>No</td>
      <td>Large</td>
      <td>Very Good</td>
      <td>A bit better than Java</td>
      <td>A bit better than Java</td>
    </tr>
    <tr>
      <td>Elixir/Erlang</td>
      <td>No</td>
      <td>Small</td>
      <td>Good</td>
      <td>Almost through the roof</td>
      <td>Good, for a VM-based language</td>
    </tr>
    <tr>
      <td>C++</td>
      <td>No</td>
      <td>Moderate</td>
      <td>Moderate, hard to use IMO</td>
      <td>Depends on what kind of person you are</td>
      <td>Sonic the hedgehog approves</td>
    </tr>
    <tr>
      <td>JS</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>Depends on what flavor you’re using</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (CPython)</td>
      <td>Maybe/No</td>
      <td>Very Large</td>
      <td>Good</td>
      <td>It will be a bummer that it’s not used for DS/ML/AI</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Python (PyPy)</td>
      <td>Maybe/Yes</td>
      <td>Very Large (but there’s a catch)</td>
      <td>Good</td>
      <td>If you know, you know</td>
      <td>Good, and it’s very fast overall</td>
    </tr>
    <tr>
      <td>Crystal</td>
      <td>No</td>
      <td>Very Small</td>
      <td>So-so</td>
      <td>If you know, you know v2</td>
      <td>Very Good, and it’s blazing fast overall</td>
    </tr>
    <tr>
      <td>Rust</td>
      <td>Maybe/Strong Yes</td>
      <td>Small-Moderate</td>
      <td>Moderate</td>
      <td>Almost through the roof</td>
      <td>Very good, and it’s very fast overall</td>
    </tr>
    <tr>
      <td>Go</td>
      <td>Yes</td>
      <td>Large</td>
      <td>Good</td>
      <td>Pretty good</td>
      <td>Good, and it’s very fast overall</td>
    </tr>
    <tr>
      <td>OCaml</td>
      <td>Maybe/Yes</td>
      <td>Small</td>
      <td>Moderate</td>
      <td>Almost through the roof, but only for FP geeks</td>
      <td>Very good, and it’s very fast overall</td>
    </tr>
  </tbody>
</table>

<p>All things considered, probably the safest choice would be to use Go. The next best thing would be Rust. A very good option would be PyPy, IMO. It’s almost 1-to-1 equivalent to CPython, but considerably faster. If you like it more hardcore-FP, you could try OCaml. You could in fact go polyglot and pick 2 languages, but don’t escalate to more than that. There’s a reason most full-stack engineers write JS only.</p>

<h2 id="time-to-discuss-that-caveat">*Time to discuss that caveat.</h2>

<p>Yes, picking a tool only because it’s <em>hot</em> or seems interesting is risky and will rarely be a good idea, except when it is. You see, a tool is usually “hot” for a reason. Maybe it’s solving a common pain in the industry, and does so elegantly. Or maybe it boosts productivity, efficiency, or the long-term maintainability of a system. Still, this alone isn’t enough to justify such a risky move.</p>

<p>On the other hand, there’s an interesting aspect here. If a tool is hot, people will want to work with it. This phenomenon boosts the desire to work for your team/business because you’re using this New Hot Thing ©. Combined with the intrinsic qualities of the new tool, it might make sense to actually give it a try. It is just as risky to never take a risk. Failing to grow and innovate will leave your business hard to hire for, your talent pool shrinking, and your operational efficiency slowly dying.</p>

<center><img src="/_data/bell_curve_languages.jpg" /></center>
<center><i>Follow sage's advice 😏 Made with: imgflip.com</i></center>

<h2 id="a-substitute-for-a-conclusion">A substitute for a conclusion</h2>

<p>I hope I haven’t fried your brains with this many things to consider. Even I sometimes don’t do the whole process, or am being sloppy when assessing some of the aspects. Still, having a checklist of things to consider is always a good thing, so I hope you’ll benefit from this.</p>

<p>Maybe a bit anti-climactic, but consider this: if you picked the wrong tool, it will rarely doom your project to failure. What will doom it is not realizing you made a bad choice, and not trying to fix it. Technical stacks are problems which can be fixed with money, and that’s a good thing.</p>

<p>Not the ending you expected? 😏</p>

<h3 id="ps">P.S.</h3>
<p>I should add a clarification about Java. Don’t get me wrong - I don’t “hate” Java, I just like pointing at its flaws, sometimes vehemently 😀. Java’s unnecessary verbosity is the main issue that I have with it. It isn’t the only issue, but with the sped-up release cycle and a lot of ideas borrowed from other languages and communities, Java is becoming a better language. Brilliant engineers use Java for many important, actively developed projects with no plans to retire or rewrite them. Ergo, it can’t be an objectively “bad” language.</p>

<!-- Also, on a more philosophical note, keep in mind - Java was created for mass producing of software, where developers would become interchangeble. From a business point of view, this is a very good idea. But from a craftsman's point of view, this is sad and uninspiring. Also this thing become so popular because Sun marketed it as hell and people started to believe Java is good. -->

<h3 id="2022-11-09-update">2022-11-09 Update</h3>

<p>I came across <a href="https://boringtechnology.club/">this amazing presentation</a>. It’s closely related to the arguments I propose, although it puts greater importance on the <code class="language-plaintext highlighter-rouge">Basic Level &gt; 3rd point</code> decision factor. Even if the factor initially seems simplistic, there’s sophistication in simplicity, and the author of this presentation does a great job uncovering it. TL;DR, it’s good, on topic, and I recommend you check it out after reading my article 😀.</p>

<h4 id="a-little-disclaimer">A little disclaimer</h4>

<p>These posts were almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn’t be nice, to say the least, to post it back then. In Moldova, there’s a saying “Satu’ arde da baba sî chiaptănă” which translates to something like “The (unreasonable) old lady is grooming while the whole village burns”. I didn’t want to be that lady, so I thought it would be better to wait until things become at least somewhat less chaotic.</p>

<p>#Слава Україні! #Героям слава!</p>]]></content><author><name></name></author><category term="posts" /><category term="software" /><category term="engineering," /><category term="programming," /><category term="programming" /><category term="languages," /><category term="decision" /><category term="making," /><category term="frameworks," /><category term="java," /><category term="kotlin," /><category term="lisp," /><category term="python," /><category term="go," /><category term="golang," /><category term="rust," /><category term="rustlang," /><category term="erlang," /><category term="elixir," /><category term="ocaml," /><category term="software," /><category term="engineering," /><category term="senior," /><category term="leadership" /><summary type="html"><![CDATA[How to pick a tool, language, or framework when real money and the business is at stake. What to consider when faced with this kind of situation.]]></summary></entry><entry><title type="html">Becoming a Senior Engineer</title><link href="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior" rel="alternate" type="text/html" title="Becoming a Senior Engineer" /><published>2022-05-23T20:00:00+00:00</published><updated>2022-05-23T20:00:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/becoming-senior-engineer</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2022-05-23-becoming-senior"><![CDATA[<p>Disclaimer. This post is based on frequent discussions with many of my friends and acquaintances who work in IT/Software Engineering, in a lot of different places, like outsourcing companies, product companies, big organizations with established processes, and small ones where <a href="https://www.youtube.com/watch?v=4L2ooG_MX9E">chaos reigns</a>.</p>

<p>Anyway, based on all this tribal wisdom, anecdotes, and my own experience and observations, there are four base properties/skills/traits which are a sure-fire way to grow into a senior/leadership position at your organization.</p>

<h1 id="the-four-axes">The four axes</h1>

<p>Depending on the organization, a mix of these four traits is necessary to take the mantle of a “senior engineer”.</p>

<ol>
  <li>
    <p><strong>Business acumen</strong> - You know how stuff works in your organization. You know the processes, the people, and the relations among all. You understand the vision and priorities of your company. You also have a rough idea of the company’s risk tolerance, budget, and context in which it operates. You know why certain things are done the way they are.</p>
  </li>
  <li>
    <p><strong>Communication skills</strong> - You explain your thoughts crystal clear. And I can’t stress this enough! If you can’t explain your thoughts in a clear, accessible way, you will impede not only your career prospects but others’ productivity too. The better you talk and write, the better everyone will understand what needs to be done, how to fix things, or what the roadmap is… you get the idea. Besides, the more senior you get, and the bigger the organization you have to work in, the more you’ll have to write and communicate with people. Especially now with all the work done remotely, you need clear writing like never before. Chats, emails, JIRA tickets, code reviews, meeting notes, post-mortems; this list can go on forever.
Another subskill deserving a place here would be explaining technical stuff to non-technical people. This is especially important when you have to deal with non-technical stakeholders of your projects. They very much appreciate the effort you put into explaining what’s going on without much technical jargon. Imagine if nuclear physicists explained what they are doing using their jargon. You wouldn’t understand a thing. Been there, done that. So be empathetic, and talk to people in a way they can understand.</p>
  </li>
  <li>
    <p><strong>Being a force multiplier</strong> - You have good coaching/mentorship skills. You also always think of ways to enable people to do a better job. Maybe by creating a script to automate something, or by creating a shared document explicitly telling how some process is done and why, or just being a knowledgeable and pleasant colleague to discuss issues and ideas with.</p>
  </li>
  <li>
    <p><strong>Superior hard skills</strong> - You are one of the most knowledgeable people in your organization/community on some technology/practice/domain. You have superior skills, and for that, you are respected. Part of this is superior debugging skills. More often than we’d like, we have to fix code that’s not working. The quicker this can be done, the more time is left for feature development, which is so important for the business. You think beyond just lines of code and understand the architecture and the tradeoffs which lie at its foundations. You understand that sometimes DRY is not a good idea, where you should apply design patterns, and where it’s ok not to. Also, good coding skills are infectious. People will see your beautiful code and will want to do the same. In a way, you’ll be a force multiplier, by influencing others to write better code, which in turn will make the codebase a nicer environment.</p>
  </li>
</ol>

<p><img src="/_data/webp/skills_radar_simple.webp" alt="The four main skills axes for a senior engineer are business acumen, communication, hard-skills and being a force multiplier" /></p>

<h2 id="the-a-potential-path-to-senior-positions"><del>The</del> A potential path to senior positions</h2>

<p>Let’s say you were hired as a software engineer, maybe even a junior one. You aspire to become a senior. What do you do?</p>
<ul>
  <li>Learn your project.</li>
  <li>Learn why your project is important. Who are its users? What’s the roadmap? How does it make/save money?</li>
  <li>Learn more nuanced technical skills. Maybe read a few books. Iterate on this.</li>
  <li>Spot inefficiencies in your team’s processes, try to ease these through explicit processes, helper tools, or any other way. Iterate on this.</li>
  <li>Make friends with colleagues outside your team, maybe even outside your business function.</li>
</ul>

<p>Do all these, and you will certainly be allowed to lead some projects or initiatives.</p>

<h3 id="a-warning-note">A warning note</h3>

<p>Everything in excess becomes harmful. Depending on the organizational culture of your employer, being overly interested in the hows and whys of the business might seem nosy. And if your intentions are perceived this way, you might damage your reputation instead of growing it. The same goes for strong initiatives to help your colleagues or the business. This one is more nuanced. It might be (and usually is) that your manager or colleagues are not <a href="https://www.dictionary.com/browse/dicks">unpleasant, counterproductive, or trying to dismiss your genius</a>; they just know that some stuff has been tried already, or that the current priorities do not leave space for such initiatives. Remember to be respectful, not too annoying, and if all else fails, start searching for another job.</p>

<h2 id="some-misc-skills-youll-also-need">Some misc skills you’ll also need</h2>

<p>I would argue the four traits above are crucial to becoming a senior engineer in any organization. But I’d also like to include the following 3 skills too. Let’s label them as <em>very good to have</em>.</p>

<ul>
  <li><strong>Attention to detail</strong>. Sloppily done tasks take a big hit on your karma. Depending on your place of employment, this could range from writing code that works well without immediately visible issues, to writing high-quality code, with good tests and without breaking the CI.</li>
  <li><strong>Humility</strong>. You know, don’t be an <a href="https://www.dictionary.com/browse/dick">unpleasant, counterproductive, or dismissive</a> person. If no one wants to work with you, you will either be put on the worst projects in your company or straight up fired. Note, don’t confuse <a href="https://tomhazledine.com/humility-in-tech/">humility</a> with low self-esteem.</li>
  <li><strong>A growth mindset</strong>. If you learned something to land a job and, once there, decided to sit still on your ass, I’m afraid your only chance to become senior is by having the rest of your colleagues <a href="https://en.wikipedia.org/wiki/Bus_factor">hit by a bus</a>. Stagnation should never be an option.</li>
</ul>

<p><img src="/_data/webp/skills_radar_full.webp" alt="A more complete picture of the necessary skills for a senior engineer should also include attention to details, humility, and a growth mindset" /></p>

<p>Of course, there are always exceptions, people who hold senior or technical leadership positions without these skills, but they are that - exceptions. So, it’s better to also be humble, attentive, and with a growth mindset than not to be.</p>

<h1 id="some-edge-cases">Some edge cases</h1>

<ul>
  <li>
    <p><strong>Senior engineer is the one who stayed the most with the company</strong>. This distills down to business acumen. She/he knows how things are done in the organization, and knows the codebase very well. Some communication and hard skills are also necessary. This path is prone to the “old junior” problem. “Old juniors” are a case you wouldn’t want to be in. It happens when someone stays with a company/product for too long without substantially growing their skills, but only acquiring business acumen. People in this situation remain stuck in their companies because of a growing chasm between their title and their actual skills.</p>
  </li>
  <li>
    <p><strong>Team leads</strong>. They usually are strong on Communication skills/being a force multiplier, and most are pretty good on the hard skills side too, but YMMV. A good team leader is an important asset for any organization, they are like <a href="https://civilization.fandom.com/wiki/Great_General_(Civ6)">Great Generals</a> for their teams.</p>
  </li>
  <li>
    <p><strong>An outsider is hired as a senior/lead right away</strong>. This does happen, and is more common in smaller organizations, in freshly established departments, or in new and specialized teams. Such people are almost always strong in hard skills and usually in communication skills. Occasionally, they may have very good business acumen because they have worked in similar industries before.</p>
  </li>
</ul>

<p>Remember, you need a mix of these. Having only hard skills won’t cut it. You’ll be just a very good software engineer. Nor will business acumen alone help you; it will just turn you into a mediocre manager in the best-case scenario, or the terror of the engineering team in the worst case. And if you’re only good at being a force multiplier? Have you heard about the Scrum master position?</p>

<h1 id="takeaways">Takeaways</h1>

<ul>
  <li>Ask questions about the business/product. Show interest in how things are done within your organization.</li>
  <li>Level up your communication skills and help your team. Technical writing, working on enabling tasks, and mentorship are some of the most important ones. You can level these up by volunteering to document some nasty parts of the codebase, describing internal processes, and working on/proposing tools to increase the productivity of your team. Mentorship skills can be acquired by either asking to be the mentor for new hires, or you can try teaching outside of work, CoderDojo-like organizations being probably the best at this.</li>
  <li>Learn hard skills. Read books. Work on pet projects, to crystalize the knowledge you got from reading. Being part of a specialized community will also help you grow your hard skills, by learning advanced concepts you won’t find by just googling, because you wouldn’t even know what to google. Reddit is pretty good at this, sometimes. Also, slack/gitter/discord groups, interested in specific technology are good too. If you use it right, Twitter and YouTube are also excellent channels for this.</li>
</ul>

<p>By the way, notice that throughout the whole post, there was no mention of years of experience. Of course, some of the traits outlined above correlate with years of experience, but the correlation is not perfect, meaning you could have 10 YoE and still not be as good as someone with 4 YoE. So focus on skills, not on mileage.</p>

<h2 id="before-i-go">Before I go</h2>

<p>Maybe this will be news to someone, but being a Senior is not the end of the road. Of course, many know about the “move into management” path. But there’s another way. Becoming a Staff software engineer. How? I don’t know yet. When I do, I’ll certainly write another blog post. Until then, I’ll leave you with <a href="https://www.reddit.com/r/ExperiencedDevs/comments/ltsoao/how_do_you_differentiate_a_staff_engineer_from_a/">this Reddit thread</a> and <a href="https://staffeng.com/book">this book</a>.</p>

<h4 id="a-little-disclaimer">A little disclaimer</h4>

<p>These posts were almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn’t be nice, to say the least, to post it back then. In Moldova, there’s a saying “Satu’ arde da baba sî chiaptănă” which translates to something like “The (unreasonable) old lady is grooming while the whole village burns”. I didn’t want to be that lady, so I thought it would be better to wait until things become at least somewhat less chaotic.</p>

<p>#Слава Україні! #Героям слава!</p>]]></content><author><name></name></author><category term="posts" /><category term="career," /><category term="career" /><category term="advice," /><category term="senior" /><category term="engineer," /><category term="leadership," /><category term="staff" /><category term="engineer," /><category term="software" /><category term="engineer," /><category term="programming," /><category term="machine" /><category term="learning," /><category term="skills" /><summary type="html"><![CDATA[Some advice how to grow to a senior engineering role. What skills are most valuable for a senior software engineering career, and how to aquire them.]]></summary></entry><entry><title type="html">Going beyond simple error analysis of ML systems</title><link href="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis" rel="alternate" type="text/html" title="Going beyond simple error analysis of ML systems" /><published>2021-07-26T00:10:00+00:00</published><updated>2021-07-26T00:10:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/ml-error-analysis</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis"><![CDATA[<h1 id="first-there-was-a-story">First, there was a story…</h1>

<p>Imagine yourself working as an ML engineer… very cool my friend!</p>

<p>First of all, congratulations, pat yourself on the back, your family must be proud.</p>

<p>Second, depending on the company size, culture, and the maturity of the machine learning team, you’re most likely in for a wild ride through many computer science and software engineering domains.</p>

<p>Again, pat yourself on the back. Now, let’s get to the chase.</p>

<p>As an MLE, part of your work is to pick, tune, and deploy ML models. I believe I don’t need to explain to you that this is not so trivial. You probably believe that the hard part of this process is tuning the model, don’t you? Or maybe deploying the algorithm? Although these are indeed non-trivial, especially the latter, here’s <em>The Question ©</em> for you:</p>
<blockquote>
  <p><strong><em>How do you make sure you have a high-quality model in production?</em></strong></p>
</blockquote>

<p>If you’re gonna tell me that you just tested your model on a held-out dataset and that your metric of choice was something like accuracy, or the mean squared error, just run. Fast. Far away. If you didn’t run, be prepared to be questioned whether or not you:</p>
<ul>
  <li>had a baseline,</li>
  <li>balanced the dataset or adjusted your metrics,</li>
  <li>used the held-out dataset for tuning/hyperparameter search 
… and so on.</li>
</ul>

<center><img src="/_data/nested_anakin.jpg" alt="So many questions... Made with: imgflip.com" /></center>
<center><i>So many questions... Made with: imgflip.com</i></center>

<p>I guess you figured out by now that a simple train/test split and a few error metrics, like accuracy or maybe even F1*, are not nearly enough to answer <em>The Question ©</em>. But what <em>would</em> be enough? Well, it depends, like all things in software engineering. You need to understand that reducing your model’s characteristics to only one or a few scalars forfeits way too much information about the model.</p>

<p><em>* F1 score is a much better choice, btw</em></p>

<h1 id="-and-then-words-of-wisdom-followed">… and then words of wisdom* followed</h1>

<p><em>* - more like personal war stories</em></p>

<blockquote>
  <p>Disclaimer, this is a long post, so maybe brew some tea/coffee, get a snack, you know, something to help you get through the whole thing. Maybe taking notes would help you to stay focused. It certainly helps me when reading a lot of technical text.</p>
</blockquote>

<p>Another little disclaimer: I had <a href="https://alexandruburlacu.github.io/posts/2021-05-09-archive-understanding-a-black-box">an older post</a> tangential to this topic, but the focus in it was on interpretability/explainability methods. In this blog post, I focus more on how to assess the errors of machine learning models. If you think these topics are pretty close to each other, somewhat overlapping, you are right. To better evaluate a model, we sometimes need to understand the “reasoning” it puts into making a prediction.</p>

<!-- The motif of this article is **_understanding how, by how much, and (maybe) why a machine learning model fails?_** -->

<p>Keep in mind - depending on the domain you apply machine learning to, a subpar model could be anything from a little annoyance for your users to a complete dumpster fire that amplifies biases and makes your customers run away from your business. While it could be easy for said users to opt out from the former, the latter can ruin your business. We don’t want that. Your employer certainly doesn’t.</p>

<p>Ok, copy that. But how do you <em>know</em> that a machine learning model is good? Do you need to understand its predictions? Does your use case have a specific group of users that you care about the most? These questions can help you derive an evaluation strategy and in turn to make sure nothing goes south after you deploy an ML model.</p>

<p>You know what, let me first define a few ML evaluation maturity levels. It will be easier for me to explain and for you to follow along. For now, don’t bother about the meaning of some more advanced terms here, I will explain them right after this section.</p>

<ul>
  <li><strong>Level 0 (L0)</strong>: Having a train+test split and one or two generic metrics, like MSE or Accuracy. At this level, deploying the ML model is not advised (read: irresponsible at best).</li>
  <li><strong>Level 1 (L1)</strong>: Previous level, but using cross-validation if possible, or worst-case scenario, having a big and diverse test set. You will need to have per-class metrics for classification problems or multiple metrics for regression problems. For classification use cases, metrics like the ROC-AUC score or F1 score are considerably better than accuracy, so use these. Moreover, understanding your model’s precision and recall characteristics can prove crucial for a successful ML product. In the case of regression, <a href="https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e">MAPE+RMSE+Adjusted R^2</a> are a good combination; you can consider using <a href="https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other">AIC and/or BIC</a> too. For regression, try to have at least one metric robust to outliers (<a href="https://www.h2o.ai/blog/regression-metrics-guide">MAPE is robust to some types of outliers, but not others</a>).</li>
  <li><strong>Level 1.1 (L1.1)</strong>: Check most wrong predictions, that is, entries with high prediction confidence, but that are predicted wrong. It can help you uncover error patterns, maybe even biases.</li>
  <li><strong>Level 2 (L2)</strong>: Perturbation analysis using counterfactuals and random alterations of input values. Usually, such an approach permits an understanding of feature importance for each entry, but that is more like a bonus you have to work to get.</li>
  <li><strong>Level 2.1 (L2.1)</strong>: <a href="https://scikit-learn.org/stable/modules/partial_dependence.html">ICE/PDP</a>/<a href="https://christophm.github.io/interpretable-ml-book/ale.html">ALE</a> plots can be used to better understand feature importances. Keep in mind these are fairly compute power demanding.</li>
  <li><strong>Level 2.2 (L2.2)</strong>: Surrogate local explanations (usually LIME) and/or additive feature explanations (i.e. SHAP) to understand model predictions before approving the model for deployment. Also computationally demanding.</li>
  <li><strong>Level 3 (L3)</strong>: Cohort-based model inspection. One way to define cohorts is through <a href="https://github.com/uber/manifold">Manifold</a>-like error groupings.
    At this level, it’s important to acknowledge the changes in data distributions and if applicable, to evaluate on data from different periods. Believe me when I tell you this, sometimes feature and/or label distributions can change even in domains where you don’t expect them to. And not accounting for this will give you some royal headaches.</li>
  <li><strong>(Optional) Level 4 (L4)</strong>: Adversarial examples checking. Stuff like Anchors and TCAV are at this level too. In principle, any other advanced model interpretability/explainability or security auditing is at this level.</li>
</ul>

<center><img src="/_data/evolution.jpg" alt="Power levels. Don't be L0. Made with: imgflip.com" /></center>
<center><i>Power levels. Don't be L0. Made with: imgflip.com</i></center>

<p>You would want to be at Level 1 when launching a model in beta, Level 2 when it’s in production, and from there grow to Level 3. Level 4 is more specific, and not every use case requires it. Maybe you are using your ML algorithms internally and there’s a low risk of malicious agents trying to screw you; in this case, I doubt you need to examine the behavior of your model when fed adversarial examples, but use your own judgment.</p>

<p>Note that although I mention regression use-cases, I omitted a lot of info about time-series forecasting. This is done on purpose, because the topic is huge, and this post is already a long-read. But if you have a basic understanding of what’s going on here, you can map different time-series analysis tools onto these levels.</p>

<h1 id="methods">Methods</h1>

<p>Let’s roughly cluster evaluation/error analysis methods into three broad categories: (1) metrics, (2) groupings, and (3) interpretations. Metrics are kind of obvious. Groupings are probably the most abstract: we put train/test splits, cross-validation, input data cohorts, and error groupings in this… oh god… group (no pun intended). Finally, under the interpretation umbrella fall such things as surrogate local explanations, feature importance, and even analyzing the most wrong predictions, among other things.</p>

<h2 id="metrics">Metrics</h2>

<p>I won’t dive deep into metrics-based evaluations, but will mention that depending on your use case you might want to consider metrics that are non-linear in their relation to how wrong the prediction is. Maybe you’re fine with a bit of error, but if the model is very wrong, or frequently wrong, you want to penalize it disproportionately more. Or, on the contrary, as there are more wrong predictions, or as the total loss of the model grows, you want a log-like behavior for your metric, i.e. the metric attenuates its growth as the model gets more wrong.</p>

<p>Furthermore, on the matter of metrics that are robust to outliers: sometimes these are merely nice to have, if you do some outlier removal beforehand. Other times they are a necessity, in cases when you can’t, or specifically don’t, remove the outliers, for whatever reason. Keep that in mind.</p>
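<p>To make the outlier point concrete, here’s a tiny, hedged sketch with made-up numbers, comparing how MSE and MAE react to a single very wrong prediction:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true     = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred_ok  = np.array([2.8, 5.2, 2.4, 6.9, 4.1])   # small errors everywhere
y_pred_out = np.array([2.8, 5.2, 2.4, 6.9, 14.0])  # one very wrong prediction

for name, y_pred in [("small errors", y_pred_ok), ("one outlier", y_pred_out)]:
    print(name,
          "| MSE:", round(mean_squared_error(y_true, y_pred), 2),
          "| MAE:", round(mean_absolute_error(y_true, y_pred), 2))
# MSE blows up quadratically on the outlier, while MAE grows only linearly.
</code></pre></div></div>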

<center><img src="https://scikit-image.org/docs/dev/_images/sphx_glr_plot_ransac_001.png" alt="Effects of outliers on model fitness. Source: https://scikit-image.org" /></center>
<center><i>Effects of outliers on model fitness. Source: https://scikit-image.org</i></center>

<p>Usually, in production scenarios, you will want to assess your model performance on different cohorts, and maybe even use different models for different cohorts. A cohort means a group of entities sharing a specific grouping criterion, like an age bracket, a location, or maybe something else.</p>

<h2 id="groupings">Groupings</h2>

<p>I mentioned cohorts in the paragraph above, so it will make sense to follow up on this. Cohorts are important because your stakeholders are interested in these, sometimes you might be too, but the business is usually the number one “fan” of cohorts. Why? Well, it could be due to many reasons. Maybe they are especially interested in providing top-notch services for a special group of customers, or maybe they must comply with some regulations that ask them for a specific level of performance for all the users.</p>

<p>Moreover, your dataset is most certainly skewed, if it’s real-world data. Meaning, you will have underrepresented classes, all sorts of imbalances, and even different distributions for your features for each class/group of classes. For example, it wouldn’t be ok for any business to give subpar recommendations for users outside the North America region, or to predict that <a href="https://www.cnet.com/news/google-apologizes-for-algorithm-mistakenly-calling-black-people-gorillas/">a person of color is some kind of ape</a>.</p>

<p>We need to create cohorts, or groups, based on some characteristics, and track the performance of our machine learning systems across these. Often you will discover that the teams who are conscious about their cohorts will deploy different models for different user groups, to ensure high-quality service for everyone.</p>
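<p>As an illustration, here’s a minimal sketch of per-cohort metric tracking with pandas. The column names (<code class="language-plaintext highlighter-rouge">age_bracket</code>, <code class="language-plaintext highlighter-rouge">y_true</code>, <code class="language-plaintext highlighter-rouge">y_pred</code>) are hypothetical; substitute your own cohort keys and predictions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd
from sklearn.metrics import f1_score

# Toy data; in practice this would be your logged predictions.
df = pd.DataFrame({
    "age_bracket": ["18-25", "18-25", "26-40", "26-40", "40+", "40+"],
    "y_true":      [1, 0, 1, 1, 0, 1],
    "y_pred":      [1, 0, 1, 0, 1, 1],
})

per_cohort_f1 = df.groupby("age_bracket").apply(
    lambda g: f1_score(g["y_true"], g["y_pred"], average="macro")
)
print(per_cohort_f1)  # a cohort scoring much worse than the rest is a red flag
</code></pre></div></div>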

<p>But groupings aren’t just cohorts based on input data characteristics. Sometimes for model analysis, it makes sense to create groupings based on errors. Some kind of groupings by the error profile. Maybe for some inputs your model(s) gives low errors, for other inputs some very high errors, and for yet another group the error distribution is entirely different. To uncover and understand these, you could use <a href="https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick">K-Means</a> to cluster your losses and identify the reason your model might fail or just underperform. That’s what Manifold from Uber does, and that’s just brilliant!</p>
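<p>Here’s a rough, Manifold-inspired sketch of the idea (not Manifold’s actual API): compute a per-sample loss, cluster the losses with K-Means, and summarize each error group. The dataset and model are stand-ins for your own:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

# Per-sample log loss: negative log-probability of the true class.
proba = model.predict_proba(X)
per_sample_loss = -np.log(np.clip(proba[np.arange(len(y)), y], 1e-12, None))

groups = KMeans(n_clusters=3, random_state=0).fit_predict(
    per_sample_loss.reshape(-1, 1))
for g in range(3):
    mask = groups == g
    print(f"group {g}: n={mask.sum()}, mean loss={per_sample_loss[mask].mean():.3f}")
# Next step: compare feature distributions across the groups to figure out
# why some inputs are much harder for the model.
</code></pre></div></div>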

<center>
<span>
<img src="/_data/webp/error_dist_cluster.webp" alt="A violin plot to compare two ML models on error groups identified by a K-Means algorithm" />
<img src="/_data/webp/per_feat_dist_0_to_7.webp" alt="Per-feature distribution comparison of two ML models on different error groups" />
</span>
</center>
<center><i>(Top) 3 clusters of error distributions, and a comparision between 2 models. (Bottom) Once we have error groups, we'd like to find why are these happening. Visualizing differences in feature distribution between two of these clusters can help. <br /> Source: The author. Inspired by: <a href="http://manifold.mlvis.io/">http://manifold.mlvis.io/</a>.</i></center>

<p>Finally, groupings are also about how you arrange your data into training and testing splits. Or more splits, like evaluation during the training of your model. These help in noticing when the model starts to overfit, or whatever. Keep in mind, special care should be taken when doing a hyperparameter search. For fast-to-train models, a technique called <a href="https://weina.me/nested-cross-validation/">nested cross-validation</a> is an incredibly good way to ensure the model is really good. The nested part is necessary because when doing hyperparameter optimization (HPO) you’re optimizing against the evaluation set, so your results will be “optimistic”, to say the least. Having an additional split can give you a more unbiased evaluation of the final model.
What about slow models? Oh, boi. Try to have a big enough dataset such that you can have big splits for all your evaluation/testing stages. You don’t have this either? Have you heard about the <a href="https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007">AI hierarchy of needs</a>?</p>
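<p>A minimal nested cross-validation sketch with scikit-learn, assuming a fast-to-train model: the inner loop tunes hyperparameters, while the outer loop estimates the performance of the tuned model on data it never optimized against:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyperparameter search; outer loop: unbiased evaluation.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)

print(f"accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
</code></pre></div></div>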

<p>Also, an often overlooked issue is the target distribution of the dataset. It might be heavily imbalanced, and as a result, special care should be taken when sampling from it for train/validation/test splits. That’s why you should almost always search for a way to have your splits <em>stratified</em> (see scikit-learn’s <code class="language-plaintext highlighter-rouge">StratifiedKFold</code>; also, <code class="language-plaintext highlighter-rouge">train_test_split</code> has a <code class="language-plaintext highlighter-rouge">stratify=</code> parameter, and for multioutput datasets check out the <code class="language-plaintext highlighter-rouge">multioutput_crossvalidation</code> package). When a dataset is imbalanced you could try to do some sort of oversampling, a la SMOTE or ADASYN, but in my experience, it might not always work, so just experiment (a scikit-learn-like lib for this is <a href="https://imbalanced-learn.org/stable/index.html"><code class="language-plaintext highlighter-rouge">imbalanced-learn</code></a>).</p>
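<p>A short sketch of both ideas on a synthetic imbalanced dataset; the <code class="language-plaintext highlighter-rouge">SMOTE</code> part assumes the <code class="language-plaintext highlighter-rouge">imbalanced-learn</code> package is installed:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Keep the class ratio identical across the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class on the *training* split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# StratifiedKFold preserves the class ratio in every CV fold too.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
</code></pre></div></div>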

<h2 id="interpretations">Interpretations</h2>

<blockquote>
  <p>Disclaimer #2, this part of the blog post is maybe one of the most overwhelming. There’s quite a body of literature about ML interpretability/explainability and I will only briefly mention some methods, for a more in-depth overview, check out <a href="https://christophm.github.io/interpretable-ml-book/">Interpretable Machine Learning by Christoph Molnar</a>.</p>
</blockquote>

<p>This category is pretty abstract, and some might argue that these methods are not really related to model evaluation, but rather to ML interpretability/explainability. To which I say: these methods allow uncovering hidden errors and biases. Based on them, you can pick one model over another, which makes interpretations useful for evaluation. These tools excel at identifying “<strong>right answer - wrong method</strong>” scenarios, which pass metrics and groupings checks without any issue.</p>

<p>So, what things can you “interpret” about a model that can help you evaluate it? First, if your model/API allows for it, you could check feature importances. You might discover that a model puts too much weight on some obscure feature or one that doesn’t make sense. At this point, you should become a detective, and find out why is this the case. This kind of feature importance is called <strong><em>global feature importance</em></strong>, because it is inferred at the model level, from all training data.</p>
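<p>A quick sketch of inspecting global feature importances; <code class="language-plaintext highlighter-rouge">permutation_importance</code> works even for models that don’t expose a built-in <code class="language-plaintext highlighter-rouge">feature_importances_</code> attribute. The model and dataset here are illustrative stand-ins:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

result = permutation_importance(model, data.data, data.target,
                                n_repeats=5, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(data.feature_names[i], round(result.importances_mean[i], 4))
# If an obscure or nonsensical feature dominates, put on the detective hat.
</code></pre></div></div>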

<p>The next easy thing to do is <strong><em>perturbation analysis</em></strong>, of which there are multiple categories. Perturbation analysis means altering the input and seeing what’s going to happen. We can alter the input with different purposes, to assess different aspects of the model.</p>
<ul>
  <li>Counterfactuals, aka “What if I change this one feature, how will my model prediction change?”. We can check, for example, how sensitive the model is to changes that intuitively should change the prediction. A prominent tool for this is <a href="https://www.tensorflow.org/tensorboard/what_if_tool">Tensorboard’s What-If tool</a> (a minimal sketch follows this list).</li>
  <li>Adversarial examples, aka “Can I craft an input that, while similar to a normal one, results in a messed-up prediction?”. Checking these is usually important for external user-facing systems, where an attack can have very nasty consequences, and because this kind of verification is more specific, it is usually left for later in the project.</li>
  <li>Random alterations, to assess how robust the model is to unimportant changes, or how well it captures “common sense-ness”; these can also be used for local feature importance. In the case of a sentiment analysis problem, a random alteration could be swapping in synonyms for words that don’t have positive or negative semantics, aka neutral words. <!-- A colleague of mine actually was in such a situation, where it turned out that location information was useful in predicting the kind of document we were dealing with, which was either a grant/award or a project request. It turned out that poorer countries usually ask for projects, while richer ones were giving awards/grants. --></li>
  <li>Out-of-distribution data. Ok, this one isn’t really perturbation analysis, but sometimes you want to make sure the model can generalize to data that is similar but not quite. Or maybe you just want <a href="https://www.youtube.com/watch?v=yneJIxOdMX4">to have some fun</a> at work and pass german sentences to a sentiment analysis model trained on Spanish text.</li>
</ul>
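<p>Here’s the promised bare-bones counterfactual probe: nudge one feature at a time and watch how the predicted probability moves. The model, dataset, and the 10% nudge are all illustrative placeholders:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0].copy()
base = model.predict_proba(x.reshape(1, -1))[0, 1]
for feature_idx in range(5):   # probe the first few features
    x_cf = x.copy()
    x_cf[feature_idx] *= 1.10  # a 10% nudge; pick changes that make sense
    new = model.predict_proba(x_cf.reshape(1, -1))[0, 1]
    print(f"feature {feature_idx}: prediction moved by {new - base:+.4f}")
# Large swings caused by tiny, implausible changes are a red flag.
</code></pre></div></div>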

<!-- Perturbation analysis can be thought of as a subset of a larger class of methods - [example-based interpretability](https://christophm.github.io/interpretable-ml-book/example-based.html) methods. In this set of methods, we can also put searching for prototypes representing a group of inputs or predictions, or methods that allow to search for the most similar entries (nearest neighbor search). -->

<p>Another way to uncover error patterns is by checking the wrong predictions which have very high model confidence. In simpler terms, the royal fuck-ups. I learned this method relatively late, from the Deep Learning Book by Goodfellow et al. I’m lazy, and this method, although obvious in hindsight, was new to me. I prefer doing perturbation analysis, since there’s no need for pretty printing and/or plotting with that one. But while working on my research project I am now “forcing” myself (it’s not so bad, really) to also do this step.</p>
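<p>This check is easy to script. A minimal sketch: take the wrong predictions, rank them by the model’s confidence, and inspect the top offenders by hand:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)
pred = proba.argmax(axis=1)

wrong = np.where(pred != y_te)[0]      # indices of wrong predictions
confidence = proba[wrong].max(axis=1)  # how sure the model was
most_wrong = wrong[np.argsort(confidence)[::-1][:10]]
print(most_wrong)  # inspect these entries by hand and look for patterns
</code></pre></div></div>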

<p>I would recommend defining some sort of regression test suite made up of previously problematic input examples. It can help you make sure that future versions of the ML model are indeed an improvement over the previous ones. It can contain previously wrongly classified entries, or examples from different types of perturbation analysis. You will thank yourself later for this regression suite.</p>
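<p>A self-contained sketch of the idea: collect yesterday’s failures and check that a newer model version handles them better. In a real project you would persist the hard examples (e.g. with joblib) and run this check in CI:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model_v1 = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
miss = np.where(model_v1.predict(X_te) != y_te)[0]
hard_X, hard_y = X_te[miss], y_te[miss]  # the regression suite entries

def check_on_old_failures(new_model):
    recovered = (new_model.predict(hard_X) == hard_y).mean()
    print(f"recovered {recovered:.0%} of previously wrong predictions")

model_v2 = LogisticRegression(max_iter=5000, C=10.0).fit(X_tr, y_tr)
check_on_old_failures(model_v2)
</code></pre></div></div>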

<p>Surrogate local explanations, of which the most prominent tool is LIME, are another kind of interpretability tool. They try to approximate a complex machine learning model with a simple one, but only on a subset of the input data, or maybe just for a single instance.</p>
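<p>For the impatient, here’s a hedged sketch of LIME on tabular data, following the <code class="language-plaintext highlighter-rouge">lime</code> package’s documented API; the model and dataset are stand-ins:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(data.data[0], model.predict_proba,
                                 num_features=5)
print(exp.as_list())  # (feature condition, weight) pairs for this one instance
</code></pre></div></div>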

<p>FINALLY (now for sure), another notable class of ML interpretability methods is additive feature explanations, and in this category one of the most prominent tools is SHAP. SHAP is especially interesting, albeit harder to understand, given it’s based on game theory and uses Shapley values to define local feature importances. One issue with this method is that Shapley values, like almost any other additive feature explanation method, don’t account for feature interactions, which can be a deal-breaker.</p>
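<p>And a short SHAP sketch for a tree ensemble. <code class="language-plaintext highlighter-rouge">TreeExplainer</code> is the fast path for tree models (the model-agnostic <code class="language-plaintext highlighter-rouge">KernelExplainer</code> is much slower); note that the exact shape of the returned values may differ between SHAP versions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:100])  # subsample: SHAP is slow

# Summary of how each feature value pushes predictions up or down.
shap.summary_plot(shap_values, data.data[:100],
                  feature_names=list(data.feature_names))
</code></pre></div></div>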

<center><img src="/_data/shap_additive_features.png" alt="Additive features from SHAP package can show which feature values impacted how the final prediction" /></center>
<center><i>SHAP uses Shapley Values to explain the effect of each feature value on the prediction. Source: author.</i></center>

<p>There are even more advanced tools, tuned specifically for neural networks. These use different forms of saliency or activation maps. Tools like these are cool and helpful, but harder to use, and not as general. Trying to cover even a subset of these would require <a href="https://christophm.github.io/interpretable-ml-book/">an entire book</a>, so if you’re interested, you know what to do ;). In the book, you can find much more detailed explanations about modern tools like SHAP, LIME, Anchors, but also more classic approaches like PDP, ICE, and ALE plots. And even concept identification approaches like <a href="https://github.com/tensorflow/tcav">Tensorflow’s TCAV tool</a>.</p>

<p>One thing to keep in mind, interpretability tools are crucial for a proper model evaluation. Although not a direct mapping, you can think of these interpretation methods for a model like code review for code. And you don’t merge code without code review in a production system, now do you?</p>

<h2 id="personal-recommendations">Personal recommendations</h2>

<p>We’re nearing the end of this post, so I would like to give you some recommendations on how to proceed when evaluating ML models as if those maturity levels weren’t enough. These recommendations are more low-level and practical, some gotchas if you will.</p>

<ul>
  <li>Of course, start with a couple of appropriate evaluation metrics. Don’t use just one. If you can, cross-validate. If doing HPO, have two testing splits. For classification, I would recommend at least some loss and some score function + scikit-learn’s <code class="language-plaintext highlighter-rouge">classification_report</code>, and if you don’t have a ton of classes, the confusion matrix is your friend (see the sketch right after this list). Some people use AUC and Precision-Recall curves, which are nice, but I’m just not used to these. Maybe after this blog post, I will start using them. (do as I say, not as I do)</li>
  <li>I usually do perturbation analysis (random and counterfactuals) after this. Looking for the top-k most wrong predictions helps, but I rarely do it (do as I say, not as I do, #2).</li>
  <li>If I’m not satisfied yet, I will certainly check for error groups a la Manifold and/or surrogate local explanations (LIME-like, I mostly use the <code class="language-plaintext highlighter-rouge">eli5</code> package). I prefer not to do the latter because it takes a looooot of time, especially with bigger-sized inputs. Regarding local explanations with surrogate models, sometimes I find it necessary to adjust the surrogate, because using the default might be just too simplistic. I do NLP, so both points are a real issue for me.</li>
</ul>
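<p>Here’s the sketch promised in the first recommendation, the metrics “starter pack” for a classifier:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print(classification_report(y_te, pred))  # per-class precision/recall/F1
print(confusion_matrix(y_te, pred))       # where exactly the misses land
</code></pre></div></div>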

<p>Sometimes, especially in the early stages of development, I could do a kind of “exploratory testing” of model predictions, namely feed out-of-distribution data and look at what will happen.</p>

<p>For personal experiments, I can sometimes use SHAP but I find it a bit frustrating that it’s hard to export the graphics and that it works best when working from Jupyter. Moreover, it’s slow, but that’s a general issue for all surrogate explanations.</p>

<p>I am yet to play around with Anchors, adversarial examples, and doing stuff like “Find the most similar entry with a different class” or “Find the most similar entries to this one”. The latter two can be done using kNN in either feature, embedding, and/or prediction spaces. Microsoft Data Scientists seem to be asking these kinds of questions to assess their models.**</p>

<p>In the end, I am sure this amount of information is overwhelming. That’s why maybe the best recommendation I could give is to just use a simple model, one that is easy to understand. To make it performant you could also try to invest time in features that make sense. All in all, just be the data scientist your company needs you to be, not the one you want to be. Boring and rational beats hype-driven.</p>

<center><img src="/_data/data_scientists.jpg" /></center>
<center>Choose your hero wisely. Made with: imgflip.com</center>

<h1 id="epilogue">Epilogue</h1>

<p>Probably this post, like no other, helped me crystalize a lot of the tacit knowledge gained through the years. Maybe you’ve heard the quote “When one teaches, two learn”. I believe something like this happened here too.</p>

<p>I know my posts are usually long and dense, sorry, I guess, but on the other hand, now you don’t have to bookmark 5-10 pages, just this one 😀😀😀 jk. Anyway, thank you for your perseverance in reading this article, and if you want to leave some feedback or just have a question, you’ve got quite a menu of options (see the footer of this page for contacts + you have the Disqus comment section). Guess it will take a while until next time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
&gt; Until then, you can play around                                &lt;
&gt; with most of the methods described in this blog post            &lt;
&gt; by checking the link below                                      &lt;
&gt; https://github.com/AlexandruBurlacu/error_analysis_code_samples &lt;
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
</code></pre></div></div>

<p><a href="https://github.com/AlexandruBurlacu/error_analysis_code_samples">You can also click on it here.</a> All examples are seeded, so it should be possible to reproduce everything. Have fun.</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>Special thanks to <a href="https://twitter.com/dgaponcic">@dgaponcic</a> for style checks and content review, thank you again <a href="https://twitter.com/anisoara_ionela">@anisoara_ionela</a> for thorough grammar checks, and thank you <a href="https://twitter.com/dianaartiom">@dianaartiom</a> for the last bits of feedback on ML. You’re all the best &lt;3</p>

<h2 id="a-few-references">A few references</h2>
<ul>
  <li><a href="http://people.duke.edu/~rnau/compare.htm">A detailed overview of regression metrics</a></li>
  <li><a href="https://christophm.github.io/interpretable-ml-book/">Interpretable Machine Learning by Christoph Molnar</a>; amazing work, a lot of info, a lot of details</li>
  <li>**<a href="/_data/ml_debugging/19_gamut_chi.pdf">Gamut paper</a> to help you ask the right questions about a model</li>
  <li><a href="/_data/ml_debugging/1808.00196.pdf">Manifold paper</a> and <a href="https://github.com/uber/manifold">Manifold GitHub repo</a></li>
  <li><a href="https://neptune.ai/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning">A good overview on how to evaluate and select ML models</a></li>
  <li>Github repos which also contain links to their respective papers:
    <ul>
      <li><a href="https://github.com/marcotcr/lime">LIME GitHub repo</a></li>
      <li><a href="https://github.com/slundberg/shap">SHAP GitHub repo</a></li>
      <li><a href="https://github.com/marcotcr/anchor">Anchors GitHub repo</a></li>
    </ul>
  </li>
  <li>And an <a href="https://github.com/altamiracorp/awesome-xai#critiques">Awesome GitHub repo</a> on different XAI tools and papers.</li>
</ul>

<!-- # Annex A: A few words about increasing the predictive performance of mostly classifiers

Robustification
- adversarial training
- focal loss for tail errors
- label smoothing
- self-distillation -->]]></content><author><name></name></author><category term="posts" /><category term="machine" /><category term="learning," /><category term="machine" /><category term="learning" /><category term="debugging," /><category term="error" /><category term="analysis," /><category term="deep" /><category term="learning," /><category term="machine" /><category term="learning" /><category term="evaluation," /><category term="machine" /><category term="learning" /><category term="testing," /><category term="artificial" /><category term="intelligence," /><category term="fairness," /><category term="ml," /><category term="ai," /><category term="data" /><category term="science" /><summary type="html"><![CDATA[When deploying machine learning algorithms, the stakes are much higher than in any toy problem or competition. For this reason, we need a much more thorough evaluation of our models, to make sure it is indeed good.]]></summary></entry><entry><title type="html">K-Means tricks for fun and profit</title><link href="https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick" rel="alternate" type="text/html" title="K-Means tricks for fun and profit" /><published>2021-06-19T18:30:00+00:00</published><updated>2021-06-19T18:30:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/kmeans-trick</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick"><![CDATA[<h1 id="prologue">Prologue</h1>

<p>This will be a pretty short post, but an interesting one nevertheless.</p>

<p>K-Means is an elegant algorithm. It’s easy to understand (scatter some random points, then iteratively move them until they become the centers of existing clusters) and it works well in practice. When I first learned about it, I recall being fascinated by its elegance. But in time the interest faded, as I kept noticing its limitations: the spherical cluster prior, its linear nature, and, what I found especially annoying in EDA scenarios, the fact that it doesn’t find the optimal number of clusters by itself, so you need to tinker with that parameter too. Then, a couple of years ago, I found out about a few neat tricks for using K-Means. So here they are.</p>
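<p>On that last annoyance: the usual workaround is the elbow method. Below is a minimal sketch of it, using the inertia scikit-learn already computes; the dataset and the range of candidate cluster counts are arbitrary choices.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=17).fit(X).inertia_ for k in ks]

# look for the "elbow", the point where adding clusters stops paying off
plt.plot(ks, inertias, marker="o")
plt.xlabel("n_clusters")
plt.ylabel("inertia")
plt.show()
</code></pre></div></div>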

<h1 id="the-first-trick">The first trick</h1>

<p>First, we need to establish a baseline. I’ll mostly use the breast cancer dataset, but you can play around with any other dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.93
</span></code></pre></div></div>
<p>So, what’s this neat trick that reignited my interest in K-Means?</p>

<blockquote>
  <p><strong><em>K-Means can be used as a source of new features.</em></strong></p>
</blockquote>

<p>How, you might ask? Well, K-Means is a clustering algorithm, right? You can add the inferred cluster as a new categorical feature.</p>

<p>Now, let’s try this.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>

<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_clusters</span><span class="p">]),</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_test</span><span class="p">,</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.937
</span></code></pre></div></div>

<p><img src="https://i.kym-cdn.com/photos/images/newsfeed/001/551/546/7ae.png" alt="Source: knowyourmeme.com" /></p>

<p><em>Source: knowyourmeme.com</em></p>

<p>This feature is categorical, but we can instead ask the model to output the distances to all the centroids, thus obtaining (hopefully) more informative features.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="c1">#                       ^^^^^^^^^
#                       Notice the `transform` instead of `predict`
# Scikit-learn supports this method as early as version 0.15
</span>
<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_clusters</span><span class="p">]),</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_test</span><span class="p">,</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)]),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.727
</span></code></pre></div></div>

<p>Wait, what’s wrong? Could it be that there’s a correlation between existing features and the distances to the centroids?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'mean radius'</span><span class="p">,</span> <span class="s">'mean texture'</span><span class="p">,</span> <span class="s">'mean perimeter'</span><span class="p">,</span> <span class="s">'mean area'</span><span class="p">,</span>
       <span class="s">'mean smoothness'</span><span class="p">,</span> <span class="s">'mean compactness'</span><span class="p">,</span> <span class="s">'mean concavity'</span><span class="p">,</span>
       <span class="s">'mean concave points'</span><span class="p">,</span> <span class="s">'mean symmetry'</span><span class="p">,</span> <span class="s">'mean fractal dimension'</span><span class="p">,</span>
       <span class="s">'radius error'</span><span class="p">,</span> <span class="s">'texture error'</span><span class="p">,</span> <span class="s">'perimeter error'</span><span class="p">,</span> <span class="s">'area error'</span><span class="p">,</span>
       <span class="s">'smoothness error'</span><span class="p">,</span> <span class="s">'compactness error'</span><span class="p">,</span> <span class="s">'concavity error'</span><span class="p">,</span>
       <span class="s">'concave points error'</span><span class="p">,</span> <span class="s">'symmetry error'</span><span class="p">,</span>
       <span class="s">'fractal dimension error'</span><span class="p">,</span> <span class="s">'worst radius'</span><span class="p">,</span> <span class="s">'worst texture'</span><span class="p">,</span>
       <span class="s">'worst perimeter'</span><span class="p">,</span> <span class="s">'worst area'</span><span class="p">,</span> <span class="s">'worst smoothness'</span><span class="p">,</span>
       <span class="s">'worst compactness'</span><span class="p">,</span> <span class="s">'worst concavity'</span><span class="p">,</span> <span class="s">'worst concave points'</span><span class="p">,</span>
       <span class="s">'worst symmetry'</span><span class="p">,</span> <span class="s">'worst fractal dimension'</span><span class="p">,</span>
       <span class="s">'distance to cluster 1'</span><span class="p">,</span> <span class="s">'distance to cluster 2'</span><span class="p">,</span> <span class="s">'distance to cluster 3'</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_records</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_clusters</span><span class="p">]),</span> <span class="n">columns</span><span class="o">=</span><span class="n">columns</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">corr</span><span class="p">())</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=-</span><span class="mi">45</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="https://alexandruburlacu.github.io/_data/webp/corr_heatmap.webp" alt="The heatmap shows that our K-Means based features are most correlated with the target variable" /></p>

<p><em>Notice the last 3 columns, especially the last one, and their color on every row.</em></p>

<p>You have probably heard that we want the features in a dataset to be as independent as possible. The reason is that many machine learning models assume this independence in order to keep the algorithm simple. Some more info on this topic can be found <a href="https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features">here</a> and <a href="https://towardsdatascience.com/why-exclude-highly-correlated-features-when-building-regression-model-34d77a90ea8e">here</a>, but the gist of it is that redundant information destabilizes linear models, which in turn makes them more likely to mess up. On numerous occasions I have seen this problem, sometimes even with non-linear models, and purging the dataset of correlated features usually gives a slight boost to the model’s performance.</p>
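<p>As an aside, purging correlated features can be done mechanically from the same correlation matrix. A small sketch (the 0.95 threshold is an arbitrary choice):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    corr = df.corr().abs()
    # keep only the upper triangle, so each pair is inspected exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] &gt; threshold).any()]
    return df.drop(columns=to_drop)

# e.g. reusing the `data` DataFrame built above
# pruned = drop_correlated(data)
</code></pre></div></div>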

<p>Back to our main topic. Given that our new features are indeed correlated with some of the existing ones, what if we use only the distances to the cluster centroids as features? Will it work then?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>

<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_clusters</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.951
</span></code></pre></div></div>

<p>Much better. This example shows that we can also use K-Means for dimensionality reduction. Neat.</p>
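<p>Incidentally, since <code class="language-plaintext highlighter-rouge">KMeans</code> implements <code class="language-plaintext highlighter-rouge">transform</code>, the whole thing fits into a scikit-learn <code class="language-plaintext highlighter-rouge">Pipeline</code>, which removes the manual bookkeeping on the test set. A quick sketch, reusing the imports and data split from above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.pipeline import make_pipeline

# the pipeline calls fit_transform during fit and transform at predict time
clf = make_pipeline(KMeans(n_clusters=3, random_state=17),
                    LinearSVC(random_state=17))
clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # should match the ~0.951 above
</code></pre></div></div>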

<p>So far so good. But the pièce de résistance is yet to come.</p>

<h1 id="the-second-trick">The second trick</h1>

<blockquote>
  <p><strong><em>K-Means can be used as a substitute for the kernel trick</em></strong></p>
</blockquote>

<p>You heard me right. You can, for example, define <em>more</em> centroids for the K-Means algorithm to fit than there are features, much more.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports from the example above
</span>
<span class="n">svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">250</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">X_clusters</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>

<span class="n">svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_clusters</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span> <span class="c1"># should be ~0.944
</span></code></pre></div></div>

<p>Well, not as good, but pretty decent. In practice, the greatest benefit of this approach shows up when you have a lot of data. Predictive performance-wise, your mileage may vary: I, for one, have run this method with <code class="language-plaintext highlighter-rouge">n_clusters=1000</code> and it worked better than with only a few clusters.</p>

<p>Kernel SVMs are known to be slow to train on big datasets. Impossibly slow. Been there, done that. That’s why there are numerous techniques to approximate the kernel trick using far fewer computational resources.</p>

<p>By the way, let’s compare how this K-Means trick does against a classic SVM and some alternative kernel approximation methods.</p>

<p>The code below is inspired by <a href="https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_approximation.html">these</a> <a href="https://scikit-learn.org/stable/auto_examples/kernel_approximation/plot_scalable_poly_kernels.html">two</a> scikit-learn examples.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>

<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span><span class="p">,</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.kernel_approximation</span> <span class="kn">import</span> <span class="n">RBFSampler</span><span class="p">,</span> <span class="n">Nystroem</span><span class="p">,</span> <span class="n">PolynomialCountSketch</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span><span class="p">,</span> <span class="n">Normalizer</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">MiniBatchKMeans</span>


<span class="n">mm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">make_pipeline</span><span class="p">(</span><span class="n">MinMaxScaler</span><span class="p">(),</span> <span class="n">Normalizer</span><span class="p">())</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">mm</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>

<span class="n">data_train</span><span class="p">,</span> <span class="n">data_test</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">,</span> <span class="n">targets_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
</code></pre></div></div>

<p>We will test three kernel approximation methods available in scikit-learn against the K-Means trick, with a linear SVM and an SVM that uses the kernel trick as baselines.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a classifier: a support vector classifier
</span><span class="n">kernel_svm</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">linear_svm</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="c1"># create pipeline from kernel approximation and linear svm
</span><span class="n">feature_map_fourier</span> <span class="o">=</span> <span class="n">RBFSampler</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">feature_map_nystroem</span> <span class="o">=</span> <span class="n">Nystroem</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">feature_map_poly_cm</span> <span class="o">=</span> <span class="n">PolynomialCountSketch</span><span class="p">(</span><span class="n">degree</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">feature_map_kmeans</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">fourier_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_fourier</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

<span class="n">nystroem_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_nystroem</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

<span class="n">poly_cm_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_poly_cm</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

<span class="n">kmeans_approx_svm</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">Pipeline</span><span class="p">([(</span><span class="s">"feature_map"</span><span class="p">,</span> <span class="n">feature_map_kmeans</span><span class="p">),</span>
                                        <span class="p">(</span><span class="s">"svm"</span><span class="p">,</span> <span class="n">LinearSVC</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">))])</span>

</code></pre></div></div>

<p>Let’s collect the timing and score results for each of our configurations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fit and predict using linear and kernel svm:
</span><span class="n">kernel_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">kernel_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
<span class="n">kernel_svm_score</span> <span class="o">=</span> <span class="n">kernel_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
<span class="n">kernel_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">kernel_svm_time</span>

<span class="n">linear_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">linear_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
<span class="n">linear_svm_score</span> <span class="o">=</span> <span class="n">linear_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
<span class="n">linear_svm_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">linear_svm_time</span>

<span class="n">sample_sizes</span> <span class="o">=</span> <span class="mi">30</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">fourier_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">nystroem_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">poly_cm_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kmeans_scores</span> <span class="o">=</span> <span class="p">[]</span>

<span class="n">fourier_times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">nystroem_times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">poly_cm_times</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kmeans_times</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">D</span> <span class="ow">in</span> <span class="n">sample_sizes</span><span class="p">:</span>
    <span class="n">fourier_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_components</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">nystroem_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_components</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">poly_cm_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_components</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">kmeans_approx_svm</span><span class="p">.</span><span class="n">set_params</span><span class="p">(</span><span class="n">feature_map__n_clusters</span><span class="o">=</span><span class="n">D</span><span class="p">)</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">nystroem_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">nystroem_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">fourier_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">fourier_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">poly_cm_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">poly_cm_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">kmeans_approx_svm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data_train</span><span class="p">,</span> <span class="n">targets_train</span><span class="p">)</span>
    <span class="n">kmeans_times</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>

    <span class="n">fourier_score</span> <span class="o">=</span> <span class="n">fourier_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">fourier_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">fourier_score</span><span class="p">)</span>
    <span class="n">nystroem_score</span> <span class="o">=</span> <span class="n">nystroem_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">nystroem_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nystroem_score</span><span class="p">)</span>
    <span class="n">poly_cm_score</span> <span class="o">=</span> <span class="n">poly_cm_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">poly_cm_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">poly_cm_score</span><span class="p">)</span>
    <span class="n">kmeans_score</span> <span class="o">=</span> <span class="n">kmeans_approx_svm</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">data_test</span><span class="p">,</span> <span class="n">targets_test</span><span class="p">)</span>
    <span class="n">kmeans_scores</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">kmeans_score</span><span class="p">)</span>
</code></pre></div></div>

<p>Now let’s plot all the collected results.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">211</span><span class="p">)</span>
<span class="n">timescale</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">212</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">nystroem_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Nystroem approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">nystroem_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'Nystroem approx. kernel'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">fourier_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Fourier approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">fourier_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'Fourier approx. kernel'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">poly_cm_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Polynomial Count-Min approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">poly_cm_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'Polynomial Count-Min approx. kernel'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">kmeans_scores</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"K-Means approx. kernel"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">,</span> <span class="n">kmeans_times</span><span class="p">,</span> <span class="s">'--'</span><span class="p">,</span>
               <span class="n">label</span><span class="o">=</span><span class="s">'K-Means approx. kernel'</span><span class="p">)</span>

<span class="c1"># horizontal lines for exact rbf and linear kernels:
</span><span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
              <span class="p">[</span><span class="n">linear_svm_score</span><span class="p">,</span> <span class="n">linear_svm_score</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"linear svm"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
               <span class="p">[</span><span class="n">linear_svm_time</span><span class="p">,</span> <span class="n">linear_svm_time</span><span class="p">],</span> <span class="s">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'linear svm'</span><span class="p">)</span>

<span class="n">accuracy</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
              <span class="p">[</span><span class="n">kernel_svm_score</span><span class="p">,</span> <span class="n">kernel_svm_score</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"rbf svm"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
               <span class="p">[</span><span class="n">kernel_svm_time</span><span class="p">,</span> <span class="n">kernel_svm_time</span><span class="p">],</span> <span class="s">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'rbf svm'</span><span class="p">)</span>
</code></pre></div></div>

<p>And some more plot adjustments, to make it pretty.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># legends and labels
</span><span class="n">accuracy</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Classification accuracy"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Training times"</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">sample_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_sizes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(())</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">fourier_scores</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Sampling steps = transformed feature dimension"</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Classification accuracy"</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Training time in seconds"</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'best'</span><span class="p">)</span>
<span class="n">timescale</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'best'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/_data/webp/big_comparative_study_kmeans_svm.webp" alt="K-Means as a kernel approximator maybe is not the most performant solution, but it still has some special characteristics" /></p>

<p><em>Meh. So was it all for nothing?</em></p>

<p>You know what? Not in the slightest. Even if it’s the slowest, K-Means as an approximation of the RBF kernel is still a good option. I’m not kidding. Scikit-learn has a special kind of K-Means called <code class="language-plaintext highlighter-rouge">MiniBatchKMeans</code>, one of the few estimators that support the <code class="language-plaintext highlighter-rouge">.partial_fit</code> method. Combine it with a model that also has <code class="language-plaintext highlighter-rouge">.partial_fit</code>, like a <code class="language-plaintext highlighter-rouge">PassiveAggressiveClassifier</code>, and you can create a pretty interesting solution.</p>

<p>Note that the beauty of <code class="language-plaintext highlighter-rouge">.partial_fit</code> is twofold. First, it makes it possible to train algorithms out-of-core, that is, with more data than fits in RAM. Second, depending on your problem, if you in principle (very much in principle) never need to swap the model out, it can keep training right where it is deployed. That’s called online learning, and it’s super interesting. Something like this is <a href="https://huyenchip.com/2020/12/27/real-time-machine-learning.html">what some Chinese companies are doing</a>, and in general it can be pretty useful for <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf">AdTech</a>, because feedback on whether your ad recommendation was right or wrong arrives within seconds.</p>

<p>You know what, here’s a little example of this approach for out-of-core learning.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">MiniBatchKMeans</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">PassiveAggressiveClassifier</span>

<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_breast_cancer</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">batch</span><span class="p">(</span><span class="n">iterable</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="c1"># source: https://stackoverflow.com/a/8290508/5428334
</span>    <span class="n">l</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">iterable</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">ndx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="k">yield</span> <span class="n">iterable</span><span class="p">[</span><span class="n">ndx</span><span class="p">:</span><span class="nb">min</span><span class="p">(</span><span class="n">ndx</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="n">l</span><span class="p">)]</span>

<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_breast_cancer</span><span class="p">(</span><span class="n">return_X_y</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="n">kmeans</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span> <span class="c1"># K-Means has a constraint, n_clusters &lt;= n_samples to fit
</span><span class="n">pac</span> <span class="o">=</span> <span class="n">PassiveAggressiveClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">batch</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span> <span class="n">batch</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">100</span><span class="p">)):</span>
    <span class="n">kmeans</span><span class="p">.</span><span class="n">partial_fit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>       <span class="c1"># fit K-Means a bit
</span>    <span class="n">x_dist</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>   <span class="c1"># obtain distances
</span>    <span class="n">pac</span><span class="p">.</span><span class="n">partial_fit</span><span class="p">(</span><span class="n">x_dist</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>     <span class="c1"># learn a bit the classifier, we need to indicate the classes
</span>    <span class="k">print</span><span class="p">(</span><span class="n">pac</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">))</span>

<span class="c1"># 0.909 after 100 samples
# 0.951 after 200 samples
# 0.951 after 300 samples
# 0.944 after 400 samples
# 0.902 after 426 samples
</span>

<span class="c1"># VS
</span><span class="n">kmeans</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>
<span class="n">pac</span> <span class="o">=</span> <span class="n">PassiveAggressiveClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">17</span><span class="p">)</span>

<span class="n">pac</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">),</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pac</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">y_test</span><span class="p">)</span>
<span class="c1"># should be ~0.951
</span>
</code></pre></div></div>

<!-- Spherical k-means -->
<!-- https://sites.google.com/site/dataclusteringalgorithms/kernel-k-means-clustering-algorithm -->

<h1 id="epilogue">Epilogue</h1>

<p>So you’ve made it to the end. I hope your ML toolset is now richer. Maybe you’ve heard about the so-called “no free lunch” theorem; basically, there’s no silver bullet, in this case for ML problems. Maybe for the next project the methods outlined in this post won’t work, but for the one after that, they will. So just experiment and see for yourself. And if you need an online learning algorithm or method, well, there’s a good chance that K-Means as a kernel approximation is the right tool for you.</p>

<p>By the way, <a href="https://alexandruburlacu.github.io/posts/2021-07-26-ml-error-analysis">there’s another blog post</a>, also on ML, in the works right now. Among many other nice things, it describes a rather interesting way to use K-Means. But no spoilers for now. Stay tuned.</p>

<p>Finally, if you’re reading this, thank you! If you want to leave some feedback or just have a question, you’ve got quite a menu of options (see the footer of this page for contacts + you have the Disqus comment section).</p>

<h2 id="some-links-you-might-find-interesting">Some links you might find interesting</h2>

<ul>
  <li><a href="https://datascience.stackexchange.com/questions/24324/how-to-use-k-means-outputs-extracted-features-as-svm-inputs">A stackexchange discussion about using K-Means as a feature engineering tool</a></li>
  <li><a href="https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html">A more in-depth explanation of K-Means</a></li>
  <li><a href="http://www.jcomputers.us/vol8/jcp0810-25.pdf">A research paper that uses K-Means for an efficient SVM</a></li>
</ul>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>Special thanks to <a href="https://twitter.com/dgaponcic">@dgaponcic</a> for style checks and content review, and thank you <a href="https://twitter.com/anisoara_ionela">@anisoara_ionela</a> for grammar checking this article more thoroughly than any AI ever could. You’re the best &lt;3</p>

<p><strong>P.S.</strong> I believe you noticed all these <code class="language-plaintext highlighter-rouge">random_state</code>s in the code. If you’re wondering why I added them, it’s to make the code samples reproducible. Tutorials frequently don’t do this, which leaves room for cherry-picking, where the author presents only the best results, and the reader who tries to replicate them either can’t or wastes a lot of time. But know this: you can play around with the values of <code class="language-plaintext highlighter-rouge">random_state</code> and get wildly different results. For example, when running the snippet with original features and distances to the 3 centroids, the one with a 0.727 score, with a random seed of 41 instead of 17, you can get an accuracy score of 0.944. So yeah, <code class="language-plaintext highlighter-rouge">random_state</code>, or whatever the random seed is called in your framework of choice, is an important aspect to keep in mind, especially when doing research.</p>]]></content><author><name></name></author><category term="posts" /><category term="machine" /><category term="learning," /><category term="clustering," /><category term="artificial" /><category term="intelligence," /><category term="k-means," /><category term="svm," /><category term="kernel" /><category term="trick," /><category term="kmeans," /><category term="kmeans" /><category term="svm" /><category term="trick," /><category term="ml," /><category term="ai," /><category term="unsupervised" /><category term="ml," /><category term="classification" /><summary type="html"><![CDATA[K-Means is an interesting, simple, and pretty intuitive algorithm. It turns out it can do more than just clustering, for example classification.]]></summary></entry><entry><title type="html">Logging, Tracing, Monitoring, et al.</title><link href="https://alexandruburlacu.github.io/posts/2021-05-20-logs-traces-how-to" rel="alternate" type="text/html" title="Logging, Tracing, Monitoring, et al." /><published>2021-05-18T22:10:00+00:00</published><updated>2021-05-18T22:10:00+00:00</updated><id>https://alexandruburlacu.github.io/posts/logs-traces-how-to</id><content type="html" xml:base="https://alexandruburlacu.github.io/posts/2021-05-20-logs-traces-how-to"><![CDATA[<h1 id="so-you-want-to-launch-your-codeappsystem-in-production">So, you want to launch your code/app/system in production?</h1>

<p>Wait, before you do, ask yourself this question: <em>If something goes south, how will I know what <strong>exactly</strong> happened?</em></p>

<p>A good question, indeed.</p>

<p>A more seasoned engineer might say: <em><strong>I will use logs!!!</strong></em> But what if I tell you logs are only the beginning?</p>

<blockquote>
  <p>[Disclaimer Time] This article is not about some concrete technology, framework, or library, although it references a few. It’s more of an overview of, and tips about, what logging/tracing/et al. are and how to approach them when designing and operating software systems. The information here is based mostly on my own experience, but also on papers and industry blog posts. You might need to google some stuff while/after reading, especially if you’ve never operated a system running in production.</p>
</blockquote>

<h1 id="act-1-ill-set-up-logs-alright">Act 1: I’ll set up logs, alright…</h1>

<p>So, what exactly is a log?</p>

<p><img src="https://media.giphy.com/media/xUOxfbAOLZmR356YgM/giphy.gif" alt="We'll talk about logs, just not this kind of logs" /></p>

<p>Technically, this is a log, but I want to talk about other kinds of logs.</p>

<blockquote>
  <p><strong>Logs are a record about some event in a system</strong></p>
</blockquote>

<p>Pretty abstract, huh? A log is like an entry in a journal about something that happened, maybe with some context. Somewhat like the Twitter feed of an Apple-reporter during the WWDC event. You have time, you have a record of something that just happened, and maybe you have context too. Now, jokes aside, logs are necessary for a system running in production. They help you uncover what was happening moments before applications crash. Or malicious activity. Or other stuff. But how do we make <strong>good</strong> logs?</p>

<h2 id="tenets-of-a-good-log-message">Tenets of a good log message</h2>

<p>So, how should we design our logs? Here are some tenets:</p>

<ul>
  <li>
    <p>Thy logs must be <strong>hierarchical</strong>: we need to respect the distinction between <code class="language-plaintext highlighter-rouge">DEBUG/INFO/WARNING/ERROR</code> and possibly other levels. We should not crowd the system with <code class="language-plaintext highlighter-rouge">WARNING</code> logs when <code class="language-plaintext highlighter-rouge">INFO</code> or <code class="language-plaintext highlighter-rouge">DEBUG</code> logs are more appropriate. Crowding also refers to how much information a log contains. That said, a good idea for an <code class="language-plaintext highlighter-rouge">ERROR</code> log is to register as much information as possible to aid in debugging. Use <code class="language-plaintext highlighter-rouge">DEBUG</code>-level logs to register what settings the program is using, even how much time or resources some subroutine consumes, but don’t abuse this. As for <code class="language-plaintext highlighter-rouge">INFO</code> logs: anything in between, like a call to a top-level route handler in an HTTP server. Also, <code class="language-plaintext highlighter-rouge">INFO</code> logs are the proper replacement for print statements in a running system.</p>
  </li>
  <li>
    <p>Thy logs must be <strong>informative</strong>: A good rule of thumb is to log everything that might help you debug your system. If an error happens, you will want to log the traceback. Also, logging the context in which the error happened will prove useful. By context, I mean the surrounding variables that might have something to do with the failure. If your system runs with multiple processes, or is multithreaded, or multi-whatever, do yourself a favor and log the PIDs/thread IDs. Finally, be very careful with how you represent time; explaining why would require an entire blog post, but time in computer systems is a pain, <a href="https://www.youtube.com/watch?v=-5wpm-gesOY">see for yourself</a>.</p>
  </li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ERROR: Error name, message, traceback, variables in scope if possible
WARNING: Warning name, message
INFO: Calls to top-level functions/handlers, like: [2021-05-17 00:06:23] INFO: GET /posts 200 OK
DEBUG: Program setup/initialization info, possibly memory or performance information*

*: more on that later
</code></pre></div></div>
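
<p>To make the hierarchy concrete, here’s a minimal sketch using Python’s standard <code class="language-plaintext highlighter-rouge">logging</code> module; the logger name and the messages are, of course, made up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging

logging.basicConfig(level=logging.INFO)  # DEBUG messages get filtered out
logger = logging.getLogger("payments")   # hypothetical subsystem name

logger.debug("Using connection pool of size 10")    # setup info, hidden at INFO level
logger.info("GET /posts 200 OK")                    # a top-level handler call
logger.warning("Retrying request, attempt 2 of 3")  # something is off, but not fatal

try:
    1 / 0
except ZeroDivisionError:
    logger.error("Computation failed", exc_info=True)  # register the full traceback
</code></pre></div></div>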

<ul>
  <li>Thy logs must be <strong>filterable</strong>: logs are meant to be analyzed. Make them as searchable as possible. Consider formatting them as JSON documents, and don’t abuse nesting.</li>
</ul>

<p>Why not? If the JSON is too nested, it becomes hard to search/analyze, defeating its purpose.</p>

<p>For example, Elasticsearch can’t properly index JSONs with two or more levels of nesting. That is, something like the example below can be indexed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"timestamp": "2021-05-18T21:09:54Z", "level": "error", "msg": "bad thing happened"}
</code></pre></div></div>

<p>Even something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"timestamp": {"date": "17th May, 2021", "time": "11:30:30am"}, "level": "error", "msg": "bad thing happened"}
</code></pre></div></div>

<p>But do something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"timestamp": {
    "date": "17th May, 2021",
    "time": [11, 30, 30, 124]
    },
 "level": "error",
 "msg": "bad thing happened",
 "context": {
    "some_key_for_multiple_values": []
    }
}
</code></pre></div></div>

<p>And Elastic will treat your deeply nested elements as strings, and then good luck filtering and aggregating those logs. So keep it flat whenever possible.</p>

<p>Another good format is the NCSA Common log format, but if possible, choose JSON. Why? Most log analysis tools work with JSON. Something like the NCSA Common log format is better for smaller systems, where you can search your logs with <code class="language-plaintext highlighter-rouge">grep</code> and friends. Finally: <em>whatever format you choose, be consistent across your whole system.</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bad log (1): [2021-05-17 12:30:30] ERROR: KeyError // JSON version would be just as bad
Bad log (2): {"datetime": {"date": "17th May, 2021", "time": "11:30:30am"}, "type": "ERROR", "msg": "A KeyError error occured in function some_function"}
Better log: {"timestamp": "2021-05-18T21:09:54Z", "level": "error", "pid": 1201, "traceback": &lt;your traceback as a string&gt;, "msg": "KeyError: 'key_name'"}
</code></pre></div></div>
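
<p>If you want flat JSON logs like these in Python, a minimal hand-rolled formatter could look like the sketch below; note that libraries such as <code class="language-plaintext highlighter-rouge">python-json-logger</code> already do this for you, so treat it as an illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import logging
import sys
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "pid": record.process,
            "msg": record.getMessage(),
        }
        if record.exc_info:  # keep the traceback flat, as a single string
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
</code></pre></div></div>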

<h2 id="some-wisdom-on-logging-ops">Some wisdom on logging ops</h2>

<p>So you have well-written logs. That’s great!!</p>

<p>But now you have to decide how to access and analyze them. Funny thing, these decisions should also be guided by the stage and the scale of your system. In other words, I would advise against a complex infrastructure if you have one app serving a few hundred people.</p>

<p>Now we should dive into details.</p>

<p>You will roughly have three stages.</p>

<ul>
  <li>Log collection/shipment</li>
  <li>Log storage</li>
  <li>Log processing/analytics</li>
</ul>

<p>First, log collection. We want to save our logs somewhere and not just let them print to stderr/stdout. So now we have to think about where to write them. It could be a file, or Syslog, for example, or we could even write them into a TCP or UDP socket, shipping them off to some logging server. To be honest, all of these choices are reasonable. As long as you don’t block the thread where the action happens, you should be fine; otherwise, prepare for a performance hit.</p>
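
<p>In Python, for instance, the standard library already covers the “don’t block the hot path” part with a <code class="language-plaintext highlighter-rouge">QueueHandler</code>/<code class="language-plaintext highlighter-rouge">QueueListener</code> pair; here’s a minimal sketch:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import logging.handlers
import queue

log_queue = queue.SimpleQueue()

# the application thread only does a cheap put into an in-memory queue
logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))

# a background thread drains the queue and does the actual (slow) I/O
listener = logging.handlers.QueueListener(
    log_queue, logging.FileHandler("app.log"))
listener.start()
</code></pre></div></div>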

<p>Regarding storage, for a simple app leaving logs in files should work for a while, but eventually you’ll want a storage solution with indexing support, or really anything that helps you search your logs quickly.</p>

<p>Once you have multiple services, you can think of a centralized logging server, something like an ELK (Elasticsearch, Logstash, Kibana) stack, with one or a few Elasticsearch instances in a cluster setup.</p>

<p>So here comes my personal opinion: you should start by logging into a file, and make sure to set up log file rotation, because you don’t want a single 10GB text file. Believe me… you don’t. At some point, you will also have to think about log compression and possibly log shipping. Log shipping means transferring the logs from where they were created to where they will be analyzed and stored long-term.</p>
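
<p>Rotation is usually a one-liner in logging libraries; with Python’s standard library, size-based rotation could look like this (the 10MB / 5 backups numbers are purely illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import logging.handlers

# roll over at ~10MB, keep app.log.1 ... app.log.5, then drop the oldest
handler = logging.handlers.RotatingFileHandler(
    "app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
logging.getLogger().addHandler(handler)
</code></pre></div></div>

<p>There’s also <code class="language-plaintext highlighter-rouge">TimedRotatingFileHandler</code> if you’d rather rotate, say, daily.</p>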

<p><img src="/_data/webp/LoggingArch.webp" alt="An efficient logging architecture will try to offload log shipping to a separate component" /></p>

<p>When it comes to log shipping, I would strongly suggest using TCP or HTTP over UDP and other protocols. Why, you may ask? First of all, with UDP you might lose logs in transit, because there is (1) no retransmission of lost packets and (2) no flow control, which is itself a common cause of lost packets. On top of that, a UDP message is limited to 65KB of data, or even less depending on network settings, which quite frankly might not be nearly enough. Also, your company firewalls might block this kind of traffic. So, a lot of trouble.</p>
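
<p>For what it’s worth, Python’s standard library has a TCP option out of the box, <code class="language-plaintext highlighter-rouge">SocketHandler</code>, which even retries dropped connections with a backoff; the address below is a placeholder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import logging.handlers

# ships pickled log records over TCP and reconnects with a backoff
# if the (hypothetical) logging server is temporarily unreachable
tcp_handler = logging.handlers.SocketHandler("logs.internal.example", 9020)
logging.getLogger().addHandler(tcp_handler)
</code></pre></div></div>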

<p>With a centralized logging solution, you will now absolutely need to ship the logs, and having them first written to a file will prove a very nice idea: your logs won’t be lost in case of a network outage, server failure, logging system failure, or any of the above being too slow.</p>

<p>Nice.</p>

<p><img src="https://media.giphy.com/media/k0hKRTq5l9HByWNP1j/giphy.gif" alt="Borat approves" /></p>

<h1 id="act-11-hey-i-think-i-can-make-a-chatbot-to-notify-me-when-something-blows-up">Act 1.1: Hey, I think I can make a chatbot to notify me when something blows up</h1>

<p>Yup, you can. And if you want to reduce MTTR (mean time to recovery), you most likely should. Just take into account a few things.</p>

<ul>
  <li>First and foremost, if you have the option, set up alerting thresholds. You don’t want to be notified when something is even slightly off every. single. time. If it’s some unique (non-critical) event, there’s no need to be bothered, but if the issue happens frequently, you’d better be notified.</li>
  <li>Another consideration, when it comes to alerting, is the possibility of <strong>escalation alerting</strong>. First, send an alert via email. If no action was taken, send it to a chat group of the responsible team. Still no activity? Send it as a DM to an engineer, or even to a technical manager.</li>
  <li>Finally, just aggregate the stuff; there’s no need for 12, or a hundred, emails/Slack messages about the same issue. Something like one log message and then some text like <code class="language-plaintext highlighter-rouge">X occurred 25 times in the last Y seconds</code> should be good. See the sketch right after this list.</li>
</ul>
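
<p>Here’s a back-of-the-napkin sketch of the threshold + aggregation idea; everything in it, the window size, the threshold, the <code class="language-plaintext highlighter-rouge">send_alert</code> function, is hypothetical:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window size
THRESHOLD = 25        # alert only if an error repeats this often per window
events = defaultdict(deque)

def maybe_alert(error_key, send_alert):
    now = time.time()
    window = events[error_key]
    window.append(now)
    while window and now - window[0] &gt; WINDOW_SECONDS:  # evict old events
        window.popleft()
    if len(window) &gt;= THRESHOLD:
        send_alert(f"{error_key} occurred {len(window)} times "
                   f"in the last {WINDOW_SECONDS} seconds")
        window.clear()  # reset, so every following event doesn't re-alert
</code></pre></div></div>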

<p>When it comes to what tools to use for alerting, well, you have Sentry; also, to my knowledge, it is possible to set up alerting in Kibana, although I don’t know whether this is a paid option or free, and there are of course other tools.</p>

<p>This is by no means a definitive guide on how to do it, only some things to keep in mind. This whole blog post isn’t a definitive guide if you haven’t noticed yet.</p>

<h1 id="act-2-my-system-is-slow-i-guess-ill-log-execution-time-and--of-requests-and-">Act 2: My system is slow, I guess I’ll log execution time, and # of requests, and …</h1>

<p><img src="https://i.kym-cdn.com/photos/images/newsfeed/001/246/726/244.png" alt="" /></p>

<p>… just. Stop. Please. The fact that you <strong>can</strong> do it doesn’t mean you should. Welcome to the world of telemetry and performance monitoring, where you will initially wonder: why not just use logs? In principle you could, but it’s better to have a separate infrastructure, so as not to mess everything up.</p>

<p>Mess up how? Well, if you’re like me, you might want to set up performance monitoring not just at the route controller level, to see how long requests take to be handled and responded to (assuming a hypothetical server). You will also want to track how long database queries take to execute, even individual functions! And now you have a ton of very fine-grained info, which will for sure overload the logging infrastructure. You don’t want that. Besides, even if everything runs smoothly, your read and write patterns will be different. Log analysis queries can be much more complex than the analysis required for performance monitoring. Also, performance monitoring usually deals with smaller messages that need to be recorded with lower latency.
All in all, better to set up a dedicated infrastructure for this.</p>

<p>The easiest thing is of course to use <code class="language-plaintext highlighter-rouge">TRACE</code>-level logging and, as said earlier, a dedicated infrastructure for performance monitoring. But this works only at a small scale, where, frankly, you don’t even need performance monitoring.</p>

<p>As the system scales, you might start looking towards a more constrained kind of log, maybe some binary protocol, given that you will be sending small packets of information right away, very frequently.</p>

<p>Performance monitoring has somewhat different write and query patterns than log analytics (I know, I said it earlier), so different storage is recommended. Queries are simpler, mainly showing trends, time series, current values, or simple aggregates like counts, means, medians, and percentiles; writes are very frequent but carry little data, only a few metrics, compared with logging tracebacks, contexts, and stuff like that.</p>

<p>That’s why, for example, the ELK stack is more common in logging infrastructure, where Elasticsearch can index and analyze even very unstructured data, while tools like Grafana + Prometheus are more commonly used for performance monitoring. Prometheus, among other things, contains a time-series database, just the right thing to store and quickly query performance metrics.</p>
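
<p>To make it concrete, with the official <code class="language-plaintext highlighter-rouge">prometheus_client</code> library, instrumenting a handler takes only a few lines; the metric names here are made up:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request handling latency")

@LATENCY.time()          # records how long every call takes
def handle_request():
    REQUESTS.inc()       # tiny, frequent writes: just a counter bump
    ...                  # the actual work would happen here

start_http_server(8000)  # Prometheus then scrapes :8000/metrics periodically
</code></pre></div></div>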

<p>Also, when it comes to performance analysis, you will want to monitor your system utilization, not just the stuff intrinsic to your code. If you’re using Prometheus, that’s easy to do.</p>

<h1 id="act-3-my-microservice-system-is-slow-but-i-cant-figure-out-why">Act 3: My microservice system is slow, but I can’t figure out why</h1>

<hr />

<p><strong>First, a likbez (crash course) on networking and dynamic systems</strong>: Contrary to our intuition, a computer network is a shared resource with limited capacity. This basically means that if one service is very chatty, it will affect the throughput and latency of all the rest. Also, given that networks are a priori not 100% reliable and we mostly use TCP-based traffic, the network will carry plenty of packets (chunks of data, retransmissions, packets from administrative protocols). That’s only half the problem though. There’s more 😉</p>

<p>Our services depend on each other and on 3rd parties. So if one service is slow, it might affect other services, even ones not directly interacting with it. One metaphor to help you think of it is a spider web: when you touch it on one side, it ripples on the other. Kinda like a butterfly effect. And that’s not just a figure of speech; you can indeed see failures caused by some other service being just a bit slower.</p>

<hr />

<p>So, how do we monitor this?</p>

<p>Maybe logs? Or something like performance monitoring from the previous act?</p>

<p>Well, I mean, it’s a start, but logs alone won’t cut it, because we don’t see the full picture; specifically, we don’t see the interactions between services, only each service’s individual performance. We need something more. Enter <strong>tracing</strong>.</p>

<p>First, a good mental model for tracing is that it’s like logging, but with a <a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/CorrelationIdentifier.html">correlation identifier</a>, which makes it possible to combine said logs into a “trace”.
A trace like this can show us how, for example, a single request spans multiple services, how much time each step takes, and even how much time was spent on communication. All this can help uncover bugs and performance bottlenecks in a way that a simple performance monitoring tool, or just logs, can’t. Tracing will help you find bottleneck services, and sometimes even aid you in debugging distributed systems.</p>
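
<p>Stripped of all tooling, the core idea fits in a few lines: tag every log with the same identifier and pass it along on every outgoing call. A hand-rolled sketch, where <code class="language-plaintext highlighter-rouge">request</code> and <code class="language-plaintext highlighter-rouge">call_downstream</code> are placeholders, not how you’d do it with a real tracing library:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging
import uuid

logger = logging.getLogger("service-a")

def handle(request):
    # reuse the caller's identifier if present, otherwise start a new trace
    trace_id = request.headers.get("X-Trace-Id", str(uuid.uuid4()))
    logger.info("started handling", extra={"trace_id": trace_id})
    # propagate the identifier, so downstream logs land in the same trace
    call_downstream(headers={"X-Trace-Id": trace_id})
</code></pre></div></div>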

<p><img src="/_data/webp/Tracing.webp" alt="Tracing allows us to see the lineage of each request, and find potential bottlenecks in our systems" /></p>

<p>Traces should be thought of as an extension of performance monitoring tools rather than of logs. Traces’ primary purpose is to uncover performance issues, and sometimes to pinpoint the reason a specific operation failed. You could use them as logs, but don’t overload them with information; otherwise, your collection, storage, and analysis infrastructure will cry.</p>

<p>How to structure your traces? The easiest thing to do is to use tools that will automagically patch your dependencies, like database clients, web servers, and HTTP/RPC clients, and be done with it. Sensible defaults, you know. If you want more control, be prepared to write some boilerplate, especially if you want to manually control what gets propagated between services. When it comes to adding info to your spans (the pieces which, combined, form a trace), don’t add your whole application context, only the most important things, for example, the current configuration of your system.</p>
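
<p>With OpenTelemetry’s Python API, for example, a manual span with a couple of hand-picked attributes could look like this, assuming an SDK and an exporter are already configured elsewhere:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    # a few high-signal attributes, not the whole application context
    span.set_attribute("app.config_version", "2.3.1")
    span.set_attribute("order.items_count", 3)
    ...  # the actual work happens inside the span
</code></pre></div></div>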

<p>Side note: sometimes it is important to correlate traces with logs. For this, you can use yet another correlation identifier, combining traces with individual logs for a more in-depth analysis of your system. <!-- That's what Uber does, for example. LINK --></p>

<p>There are some existing Open Source tools with great support, like <a href="https://www.jaegertracing.io/">Jaeger</a> and <a href="https://zipkin.io/">Zipkin</a>, there are also industry initiatives like OpenTracing, OpenCensus and “their combination” OpenTelemetry, not to mention a few trace formats, like <a href="https://w3c.github.io/trace-context/">W3C Trace Context</a> and <a href="https://github.com/openzipkin/b3-propagation">Zipkin B3</a> formats.</p>

<p><img src="/_data/webp/TracingArch.webp" alt="Tracing looks like magic, but in fact can be achieved with special correlation identifiers, and a good clock" /></p>

<p>A common architecture for tracing subsystems is a combination of a sidecar, collector, storage, and “presenter” components, not to mention the client library. When it comes to using tracing in a serverless setup, things get tricky; one solution would be to bypass the sidecar and send data directly to the collector, <a href="https://www.jaegertracing.io/docs/1.22/faq/#do-i-need-to-run-jaeger-agent">but you will lose some nice features</a>.</p>

<p>Tracing, in general, is a huuuuge topic, and covering it would require at least one more long-read article. That’s why, for more information, I’d like to point you towards <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf">these</a> two <a href="https://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf">articles</a> and <a href="https://eng.uber.com/distributed-tracing/">this post from Uber</a>. In them you’ll find more “war stories” about how such systems were implemented (the first article and the Uber post) and also such important topics as trace sampling strategies and trace visualizations (the second article).</p>

<h1 id="final-act-welcome-to-observability">Final act: Welcome to observability!!!</h1>

<p>Observability, what?</p>

<p>Observability is the property of a system to be understood. It’s a measure of how well one can infer the internal state of something from its external outputs.
It’s a spectrum, and depending on where your system stands, you can use monitoring and alerting more or less efficiently.
In other words, if a system is observable, you can understand what is happening within it from its outputs.</p>

<p>We need to design our systems with observability in mind. And with all the stuff outlined above, that should become a doable task.</p>

<p>I prefer to think of observability, paired with a proper incident response procedure, of course, as a way to make a system anti-fragile (see the works of Nassim Taleb),
because with every failure and issue that happens, the system “learns”, on the organizational level, to be better. Or one could argue the contrary: the system becomes more fragile, because with every fix we grow more confident that it is now unkillable, which it never will be.</p>

<p>Pick for yourself, but don’t forget to use logging. At least you’ll know when and why things go south, and that’s something.</p>

<h1 id="epilogue">Epilogue</h1>

<p>You’ve made it! Congrats! Now you have some very important knowledge of how to be prepared when manure hits the proverbial fan in production.
This knowledge should help you debug even super-obscure bugs. Of course, it isn’t going to be easy, plus you now have an entire infrastructure to take care of,
but hey, if it helps reduce the time to solve an issue from a week (or more) to one, maybe two, days, it might be worth it.</p>

<p>I know for a fact that it was worth it for me: time and time again it helped me quickly identify edge cases, stupid misconfigurations, and performance bottlenecks.</p>

<p>So yeah, that’s it for now. Incredibly, it didn’t take much time since my last blog post.</p>

<p>Finally, if you’re reading this, I’d like to thank you. Let me know your thoughts about it via Twitter, for now, until I plug in some form of comment section. Your feedback is valuable to me.</p>

<!-- https://ferd.ca/erlang-otp-21-s-new-logger.html
https://iamondemand.com/blog/open-source-distributed-tracing-why-you-need-it-how-to-get-started/ -->]]></content><author><name></name></author><category term="posts" /><category term="logging," /><category term="logs," /><category term="tracing," /><category term="traces," /><category term="observability," /><category term="telemetry," /><category term="monitoring," /><category term="alerting," /><category term="distributed-systems," /><category term="debugging," /><category term="software" /><category term="engineering" /><summary type="html"><![CDATA[When it comes to production-ready systems we need a way to know what's going on in it, aiding us in debugging it, when the time comes.]]></summary></entry></feed>