Challenges you will face when building datasets for AI products

May 12, 2022
in BlueEyes Insight

Introduction

Artificial Intelligence (AI) is a field of computer science whose goal is to enable computers to automate intelligent behaviours the way humans do. Over the past decade, AI applications have been thriving: businesses large and small are building AI for every area of human life, including insurance, healthcare, and information security.

Savvycom forecasts that "AI will be worth $190 billion in 2025," and PwC estimates that AI solutions will contribute $15.7 trillion to the global economy by 2030. The opportunities for technology companies are therefore enormous, but so are the challenges: data quality, limited technology, and cost among them. After many years of building AI products, we have noticed the same problems recurring across projects, so in this blog we share some of the most common challenges and our experience overcoming them.

I. Lack of quality data sources

When we start a project, we first determine what the AI model will do, how fast it must run, and how it will be deployed to the system. Once that is settled, the first problem to solve is the dataset. In most AI applications, data is not readily available, so the team must collect it; and even when data does exist, it commonly suffers from duplicated records, incomplete fields, and human error. The following practices can help when preparing data for AI models.

First, define a standard for the training data: tagging conventions, metadata, and acceptance criteria. If you reuse existing data, pick the source that most closely matches the actual problem the business wants to solve. With a clear standard in place, the business can assess whether the collected data actually fits that problem.

In addition, visualization and data analytics help the business see whether the data it already owns suffers from the problems above, so they can be found and fixed quickly.
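As a rough illustration of the two quality checks mentioned above, here is a minimal sketch that flags exact duplicates and incomplete fields in collected records. The records and field names are made up for this example; a real pipeline would run checks like these over the actual collection schema.

```python
# Toy quality check: flag exact duplicates and records with empty fields.
records = [
    {"id": 1, "label": "car", "source": "cam_a"},
    {"id": 2, "label": "",    "source": "cam_a"},   # incomplete field
    {"id": 1, "label": "car", "source": "cam_a"},   # exact duplicate
]

seen, duplicates, incomplete = set(), [], []
for rec in records:
    key = tuple(sorted(rec.items()))   # order-independent record fingerprint
    if key in seen:
        duplicates.append(rec)
    seen.add(key)
    if any(v in ("", None) for v in rec.values()):
        incomplete.append(rec)

print(len(duplicates), len(incomplete))  # 1 1
```

Running checks like this before labeling starts is cheap, and catching duplicated or incomplete records early is far less costly than discovering them after the dataset has been annotated.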

II. Datasets have too much variation and require heavy cleaning effort

With data taken from open sources or collected previously, businesses always face a backlog of variation. Open-source datasets often follow no standard at all, and in some cases the available data has a completely different standard and quality from what the AI problem actually requires.

Labeling and standardizing data often takes upwards of 80% of the overall time in our projects. Datasets that demand meticulous, high-accuracy work, such as segmentation, take even longer and require a supervisor and label quality management after annotation. In practice, if one polygon corresponds to one object, a labeler needs about a minute to draw it accurately, so a large dataset means a great deal of labeling and monitoring. Few tools today support monitoring, checking, and evaluating data, which leaves supervisors waiting until labeling is finished before they can judge dataset quality. That slows the project down, and by the time the dataset is fully evaluated it may already be too late and the deadline missed. On top of that, supervisors are expensive, because the role demands knowledge of both data science and data pipelines.

To resolve this, we need a labeling tool that also provides data management, evaluation, and statistics. BlueEye is built for exactly that: it separates the Labeler and Reviewer roles, making it easier for managers to oversee the project's dataset. The tool also provides statistical charts of the labels produced and the data completed and verified, which surfaces problems such as data bias, missing labels, or slow completion, so a manager can catch them quickly and respond in time.

III. No existing research for the problem

AI can be applied in many fields and industries, but not all. Every year thousands of papers research and improve AI across domains, yet some areas still have no solution that can fully replace humans: the accuracy falls short, operating expenses and research-team costs are too high, and practical deployment takes a long time. Some studies even suggest that only about 10% of the AI research published each year reaches practical application.

So how can we work around this? AI does not have to replace humans entirely; it can assist them in a specific area. Start by letting AI support the simple tasks and optimize the workflow around it, so that human effort and process time are reduced as much as possible.

IV. Labelling costs and time keep skyrocketing

One of the biggest problems in building data for AI models is the difficulty of anticipating the cost of data collectors, labelers, and supervisors. Data collection can grow until money and time balloon into the majority of a project's investment. Suppose a car segmentation task averages 10 polygons per image, and labeling a photo costs $0.7 while reviewing it costs $0.3, so each image costs $1 and takes at least 5 minutes end to end. If the model requires 1 million images, labeling costs $1,000,000 and, at 5 minutes per image, amounts to roughly 112 months of continuous single-worker effort. That is neither a small cost nor an acceptable timeline.
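The arithmetic above can be sanity-checked with a few lines; the per-image prices, timing, and image count are the assumptions stated in the paragraph.

```python
# Back-of-envelope check of the labeling cost and time figures above.
LABEL_COST = 0.7        # $ to label one image
REVIEW_COST = 0.3       # $ to review one image
MINUTES_PER_IMAGE = 5   # end-to-end time per image
NUM_IMAGES = 1_000_000

total_cost = NUM_IMAGES * (LABEL_COST + REVIEW_COST)
total_minutes = NUM_IMAGES * MINUTES_PER_IMAGE
days = total_minutes / 60 / 24   # nonstop, single worker

print(f"cost: ${total_cost:,.0f}")            # cost: $1,000,000
print(f"days of nonstop work: {days:,.0f}")   # ~3,472 days
```

Roughly 3,500 days of round-the-clock single-worker effort is on the order of the "112 months" cited above, which is why parallelizing across many labelers, and reducing the label count in the first place, matters so much.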
To reduce these costs, hire workers in countries with cheaper labor, such as Vietnam. Use data labeling platforms that support monitoring and control of labels and labeling schedules. Finally, try AI techniques that reduce the number of labels a model needs, such as self-supervised learning and contrastive learning, which cut the number of images that must be annotated and, with it, the cost and time of labeling.

V. Lack of an efficient development and deployment pipeline

During their computer science studies and early careers, AI engineers often receive little or no training in good standards and practices for effective development and deployment. They mostly know how to do research in Python and build models with libraries such as TensorFlow or PyTorch, but most do not know how to deploy their systems to servers such as Amazon's to reach users. As a result, deploying the finished system is time-consuming, far from optimal, and wasteful of both resources and money.

Most AI engineers use Jupyter or Google Colab to build models and inspect their outputs. Tools like these do not help deploy the system, whether on a local machine or a server. Worse, data, code, and model weights (along with hyperparameters and learned parameters) end up scattered everywhere: there is no version control, no tracking, and nobody knows which problems are outstanding, what is missing, which issues need solving, or what their teammates are doing.

Back-end and front-end developers usually share common standards and formats for their code when they start a project or ship it to production, but AI developers have nothing comparable. There is no standard way to deploy AI models or expose them as APIs; each person arranges libraries and code to their own taste, so codebases are wildly diverse. That makes it hard to grasp what teammates are doing, and reading and learning someone else's code takes a long time. On top of that, operating an AI system often demands a powerful server and plenty of hardware, so the resource ratios that suit web applications rarely suit heavy AI workloads.

The best solution to these problems is to adopt an MLOps system and pipeline from the start, and make the AI team use it as the common standard for model development. The team may need some time to adapt, but once they are familiar with the system, coordination and scaling become remarkably smooth and effective.
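To make the "tracking and version control" part of this concrete, here is a deliberately tiny stand-in for what real MLOps tools (MLflow, for example) provide: every training run records its parameters and metrics under a reproducible run id, so teammates can compare runs instead of guessing. The function name and file layout here are purely illustrative.

```python
# Toy run tracker: log each training run's params and metrics to a JSON
# file keyed by a deterministic id derived from the params.
import hashlib
import json
import time
from pathlib import Path

def log_run(params: dict, metrics: dict, runs_dir: str = "runs") -> str:
    """Record one training run and return its reproducible run id."""
    run_id = hashlib.sha1(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:8]                     # same params -> same id
    payload = {"params": params, "metrics": metrics, "time": time.time()}
    out = Path(runs_dir)
    out.mkdir(exist_ok=True)
    (out / f"{run_id}.json").write_text(json.dumps(payload, indent=2))
    return run_id

run_id = log_run({"lr": 3e-4, "epochs": 10}, {"val_acc": 0.91})
print(run_id)
```

Even a sketch this small answers the questions the paragraph above raises: which runs exist, what settings produced them, and how they scored, all in a format any teammate can read.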

—
To learn more about BlueEye and our AI data training platform:
💡 Sign up for the Pro plan to get FREE ANNOTATION hours: shorturl.at/dzEGY
🤝 Visit: https://blueeye.ai/landing
🤝 BlueEye's insight: https://lnkd.in/gi5r9uCP
🤝 BlueEye's case studies: https://lnkd.in/gTn9KYYv
#datalabelling #AI #artificialintelligence
