Build it or buy it: a debate that certainly applies to data labeling, one of the first and most crucial steps in the machine learning development cycle and probably constantly on the mind of any data scientist. In this blog, Julian von der Goltz, Head of Machine Learning at Mainblades, discusses the considerations we made to build a fully customizable labeling interface for aircraft drone inspection data from scratch.
At Mainblades we deal, like any other AI/ML company, with a lot of data in the form of images that come from the drone after inspecting an aircraft or any other type of asset. Those images are processed by our deep learning damage detection algorithms (FasterRCNN-type architecture) to help our customers find issues with their assets.
However, for developing and evaluating the algorithms, or “models”, we need a large set of images that contain correct examples of all the objects that we want to detect in the images. Those examples are called labels. Additionally, we distinguish between annotations (labels made by a human) and predictions (labels made by the algorithm).
So how do we get the datasets? We need to gather images (relatively easy because the drone flight is fully automated) and label them (very time-consuming because it is manual work) in a labeling interface. Examples of labeling interfaces are:
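To make the terminology concrete, here is a minimal sketch of how a label could be represented so that annotations and predictions share one structure. The class name `Label`, its fields, and the example category names are assumptions made for illustration only, not the schema of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Label:
    """One labeled object in an image (hypothetical schema)."""
    image_id: str
    category: str                  # e.g. "dent" (made-up class name)
    x: float                       # top-left corner, in pixels
    y: float
    width: float
    height: float
    source: str                    # "annotation" (human) or "prediction" (model)
    score: Optional[float] = None  # model confidence; None for human labels

    def __post_init__(self):
        if self.source not in ("annotation", "prediction"):
            raise ValueError(f"unknown source: {self.source}")
        if self.source == "annotation" and self.score is not None:
            raise ValueError("human annotations carry no confidence score")


def split_by_source(labels):
    """Separate human annotations from model predictions."""
    annotations = [label for label in labels if label.source == "annotation"]
    predictions = [label for label in labels if label.source == "prediction"]
    return annotations, predictions
```

Keeping both kinds of label in one structure makes it straightforward to compare what the model predicted with what a human annotated on the same image.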
- Computer Vision Annotation Tool: https://cvat.org/
- Heartex / Labelstudio: https://heartex.com/
- Superannotate: https://www.superannotate.com/
- Labelbox: https://labelbox.com/
These solutions all support standard bounding box annotations and have a lot of other useful features like model-assisted labeling and analytics.
It is very tempting, especially as a small start-up, to just use one of those excellent existing tools. Labelbox has a very convincing page https://labelbox.com/learn/build-vs-buy where they discuss the pros and cons of building a training data platform (of which labeling is a part) yourself vs buying it.
This blog post is an answer to that page where we will present our view along with our experiences after two years of building a label interface and data platform ourselves!
To provide a little bit of context: We started the development process in 2020, when many of the above-mentioned services were not as mature as they are today. What is described in this article might therefore not be as relevant anymore, but many key points will hold true!
Before we start with the discussion, let us first define the requirements. Some of the requirements come directly from the product itself (rotated boxes, image resolution) while others are more related to optimizing the quality and annotator throughput:
Let’s go back to the beginning of 2020 and see how the existing tools stood against these requirements. The two top contenders were Labelbox and Labelstudio. The free versions met roughly all of the basic requirements, including some kind of model integration, but lacked advanced analytics for annotations and predictions and the ability to use separate accounts for the annotators.
The pricing for these premium features was very steep (e.g. it scaled linearly with the number of user accounts), which left us with several options:
- A) Use an existing premium solution and pay money to an outside company to satisfy all the critical requirements.
- B) Use an existing but free solution (free tier of a service like Labelbox or an open source solution like Heartex) and invest developer time to hack around it using APIs to add the crucial extra features we need.
- C) Invest our own time and money to build a custom solution in-house.
So either way, we had to invest time and/or money to get to a solution, and we decided, spoiler alert, to do C.
The off-the-shelf myth for data labeling interfaces
But even if we had chosen A (I guess it should be obvious why we didn’t go for B), here is why it is not as simple as “Pay money to someone else and all your problems will be solved”, as the article linked above suggests. To make a point, I changed the above graphic slightly (pardon my graphic design skills using the fantastic https://jspaint.app)
Let me go through each of those points one-by-one:
A labeling solution usually works like this: You import an unlabeled dataset from your own platform, a model performs pre-labeling, then your annotators go over the dataset, and finally you export the dataset from the solution back to your own platform to train your models on it. So obviously, you need to plan exactly how that whole process should work.
2.-4. Design, build and test import and export workflows
So we have seen in 1. that we need to import and export data between our own platform and the labeling platform. This has the following implications:
- Data has to be converted between the two platforms. These conversions could be trivial but still add more moving parts (and therefore bugs) to the overall architecture.
- If import and export should not be done manually, an automated pipeline has to be built, which requires additional engineering.
- Images have to be made securely available in the environment of the labeling solution. At the time we were looking at this, there was either not full support for leaving your images in your own cloud storage, or you needed to implement signed links to your files that expire. Neither is really desirable.
All of the above requires the typical development lifecycle of designing, building and testing (and repeat), so it is absolutely false that this wouldn’t be required when buying a labeling solution.
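To illustrate the kind of extra engineering the expiring signed links mentioned above involve, here is a minimal sketch using only the Python standard library. It is a generic HMAC-based scheme, not the API of any cloud provider or labeling platform; `SECRET_KEY`, `sign_url` and `verify_url` are hypothetical names.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"replace-me"  # hypothetical secret known only to the file server


def _signature(path, expires):
    """HMAC over the path and expiry time, so neither can be altered."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path, ttl_seconds, now=None):
    """Return the path with an expiry timestamp and signature appended."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url, now=None):
    """Accept the link only if the signature matches and it has not expired."""
    now = int(time.time()) if now is None else int(now)
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires)
    return hmac.compare_digest(signature, expected) and now <= expires
```

Even in this small form, the scheme adds a secret to manage, clocks to keep in sync and links that must be regenerated whenever an annotation session outlasts the expiry time.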
5. Integrate the solution into your own workflow
Integrate is actually not quite correct: A 3rd party solution is by definition separate and not integrated. We already had a database, data storage and a tailored user interface to view our images as part of the main product, so buying an external labeling solution basically meant duplicating everything. And doing so in a sub-optimal way: processing- and volume-heavy import and export pipelines, different formats, missing fields that you would like to have, other fields that are redundant, and so on.
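The problem of different formats and missing fields can be sketched with a rotated box that has to pass through a tool that only stores axis-aligned boxes. Both dictionary layouts below are hypothetical and chosen for illustration; they are not the format of any platform named in this post.

```python
import math


def to_external(internal):
    """Convert a rotated box into the axis-aligned box an external tool might expect."""
    cx, cy = internal["cx"], internal["cy"]
    w, h = internal["width"], internal["height"]
    angle = math.radians(internal["angle_deg"])
    # Half-extents of the axis-aligned box that encloses the rotated box.
    half_w = (abs(w * math.cos(angle)) + abs(h * math.sin(angle))) / 2
    half_h = (abs(w * math.sin(angle)) + abs(h * math.cos(angle))) / 2
    return {
        "label": internal["category"],
        "bbox": [cx - half_w, cy - half_h, 2 * half_w, 2 * half_h],  # x, y, w, h
    }


def to_internal(external):
    """Convert back; the original rotation is gone and cannot be recovered."""
    x, y, w, h = external["bbox"]
    return {
        "category": external["label"],
        "cx": x + w / 2,
        "cy": y + h / 2,
        "width": w,
        "height": h,
        "angle_deg": 0.0,  # the external format has no field for the angle
    }
```

A box exported at 90 degrees comes back with width and height swapped and an angle of zero, which is exactly the kind of silent information loss that a round trip between two data models can introduce.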
6. Maintaining interfaces
Imagine we have worked past the previous issues, have an external labeling platform and have set up all the pipelines to move the data to and from that platform. Everything works nicely in production and you, your data scientists and the annotators are happy.
But don’t forget: In 2020, the above-mentioned companies prided themselves on being “fast-moving” start-ups somewhere at the beginning of the alphabet when it comes to their funding round designations. They iterate and implement new features quickly. But that also means that existing features might be dropped or redesigned for various reasons. When that happens, all your precious pipelines and engineering go out the window, and you have to start maintaining them to keep them up to date.
We touched upon it before; duplicating your data platform is not scalable for these main reasons:
- Increased storage requirements and therefore cost because the data lives in two places
- Increased compute requirements & cost for data transfer
- Bandwidth requirements & cost for data transfer
So you definitely have to deal with scaling issues, even when buying a training data platform!
Summarizing the above paragraphs, we can see that buying also requires significant engineering effort if you already have a data platform yourself! Of course, if you are starting from scratch and you can build more
Time to market
The Labelbox article has a paragraph about the time to market. Quote:
That is true if the purpose-built platform is already feature complete. Back in the beginning of 2020, most of the platforms were just slightly more advanced than the available open source solutions. Things like more sophisticated model integration for pre-labeling, rotated bounding boxes and cloud storage support for image hosting were on the roadmap, but they weren’t exactly ready for market.
So we have a new feature we desperately need. We request that the feature be added to the roadmap, pretty please. Can we be sure that it will make it into the solution? No. Can we be sure that it will look the way we want it? No. Do we wait for them to implement the feature? Again no, because time to market.
So the decision was made. How difficult can it be to draw some boxes on an image? We integrated our own data platform and labeling interface and we don’t regret it!
So where are we at now?
After making the decision to build, it took us about 3–4 months to arrive at the first version of the data manager and labeling interface. It wasn’t one day, but it’s also not terrible in the grand scheme of things. We used our expertise in designing an app for automated aircraft inspections to develop a tailored labeling interface that is built on top of our existing stack.
The interface itself satisfies the above requirements, but there are more advantages: When we find bugs, we can fix and deploy immediately. When our annotators have suggestions about improving their productivity, we can implement and deploy immediately. When we want to get a certain statistic for analysis, we write a query and deploy immediately.
We can query our data manager for all sorts of edge cases, we can create training datasets, train models on them and evaluate the models on the test datasets, all within our own platform. There is no importing and exporting and data transfers going on.
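As an illustration of what such queries can look like, here is a small sketch with an in-memory SQLite database. The table layout, column names and sample rows are made up for this example and are not our actual data manager schema.

```python
import sqlite3

# Hypothetical, heavily simplified label table with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels (
    image_id TEXT, category TEXT, source TEXT, annotator TEXT
);
INSERT INTO labels VALUES
    ('img1', 'dent',    'annotation', 'alice'),
    ('img1', 'dent',    'prediction', NULL),
    ('img2', 'scratch', 'annotation', 'bob'),
    ('img3', 'dent',    'prediction', NULL);
""")


def images_predicted_but_not_annotated(conn):
    """Edge case: the model found something on images no human has labeled."""
    rows = conn.execute("""
        SELECT DISTINCT image_id FROM labels
        WHERE source = 'prediction'
          AND image_id NOT IN (
              SELECT image_id FROM labels WHERE source = 'annotation'
          )
        ORDER BY image_id
    """).fetchall()
    return [row[0] for row in rows]


def annotations_per_annotator(conn):
    """Simple throughput statistic: number of annotations per annotator."""
    rows = conn.execute("""
        SELECT annotator, COUNT(*) FROM labels
        WHERE source = 'annotation'
        GROUP BY annotator
        ORDER BY annotator
    """).fetchall()
    return dict(rows)
```

Because labels, images and predictions live in the same store, a question like this is a single query rather than an export, a conversion and a separate analysis script.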
We don’t have to pay $100 per annotator just for using the interface. When we need more annotators, we hire them, and they can get started immediately.
To conclude this post, here is a little decision aid:
If you decide to buy, we definitely won’t encourage you to reinvent the wheel, and the above platforms will give you excellent options for getting started!
Got questions? Feedback? Anything else? Do not hesitate to get in touch with me!