
Aiex.ai


Dataset Development Lifecycle


Anyone working in data analysis or machine learning has probably had to collect or create data at some point. A paper from Google provides a structured framework for data collection, inspired by software development concepts, in the form of a five-step cyclical process. The stages are requirements analysis, design, implementation, testing, and maintenance.

Requirements analysis: In this stage, we determine what data is required by considering the goals of the project, consulting with stakeholders, and analyzing use cases.

Design: This is where we find out whether the data requirements can be met and, if so, what the best way to meet them is, by researching the subject matter and consulting experts in the field.

Implementation: Design decisions are transformed into technologies such as software systems, annotator guidelines, and labeling platforms.

Testing: The data is evaluated, and a decision is made about whether or not to use it.

Maintenance: Once collected, a dataset requires a large set of affordances, including tools, policies, and designated owners.

One noteworthy aspect of Google’s approach is its emphasis on producing artifacts at each stage. A document, prepared from a provided template, is the required output of every stage. The paper identifies a set of critical document types for accountable dataset development, each directly analogous to a documentation type produced in the software development lifecycle.
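To make the idea of a cyclical process with one artifact per stage more concrete, here is a minimal Python sketch. It is only an illustration, not something taken from the paper: the names `Stage`, `DatasetLifecycle`, `record_artifact`, and `advance` are hypothetical, and the rule that a stage cannot be left until its document has been recorded is an assumption made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """The five stages named in the article, in order."""
    REQUIREMENTS_ANALYSIS = 1
    DESIGN = 2
    IMPLEMENTATION = 3
    TESTING = 4
    MAINTENANCE = 5

    def successor(self) -> "Stage":
        # The process is cyclical: maintenance leads back to requirements analysis.
        members = list(Stage)
        return members[(members.index(self) + 1) % len(members)]


@dataclass
class DatasetLifecycle:
    """Hypothetical tracker: one artifact (document) per stage per iteration."""
    stage: Stage = Stage.REQUIREMENTS_ANALYSIS
    iteration: int = 1
    artifacts: dict = field(default_factory=dict)

    def record_artifact(self, document: str) -> None:
        # Store the document produced as the output of the current stage.
        self.artifacts[(self.iteration, self.stage)] = document

    def advance(self) -> Stage:
        # Assumption for this sketch: a stage is complete only once its
        # artifact exists, so moving on without one is an error.
        if (self.iteration, self.stage) not in self.artifacts:
            raise ValueError(f"No artifact recorded for {self.stage.name}")
        if self.stage is Stage.MAINTENANCE:
            self.iteration += 1  # a new pass through the cycle begins
        self.stage = self.stage.successor()
        return self.stage
```

A typical use would be to call `record_artifact(...)` with the document written for the current stage and then `advance()`. After five such steps the tracker is back at requirements analysis with `iteration` equal to 2, which mirrors the cyclical nature of the process.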

Figure 1. Critical document types for accountable dataset development. Source

The paper also discusses three of Nissenbaum’s barriers to accountability, relates each one to specific dataset concerns, and proposes ways to mitigate them, as summarized in the figure below.

Figure 2. Three of Nissenbaum’s barriers to accountability, their specific dataset concerns, and proposals for mitigation in this paper. Source