It has never been easier to capture data than it is today. With millions of personal mobile devices in circulation, and cloud computing making data-sharing simple, companies can capture unprecedented insights to help them succeed.
However, the abundance of data presents challenges for organizations wishing to capture and utilize it, especially during custom software engineering. An increasing number of new digital business solutions apply machine learning, which is exceptionally data-hungry.
To overcome the difficulties, organizations can take advantage of data crowdsourcing: using an extensive network of individual people to gather a vast quantity of data.
Are you considering this approach to source data for an AI, machine learning or other software development project? If so, here are a few guidelines, best practices, and examples of successful projects to help you with your plans.
2 Proven Use Cases for Crowdsourced Data
Before exploring some best practices for data crowdsourcing, let’s look at a couple of technology niches in which businesses have used it successfully: natural language processing (NLP) and machine vision.
Both of these applications require vast sets of data samples on which to train the relevant software. Machine vision and natural language processing applications access thousands of images and audio samples to learn how to recognize and process the cues they will receive once active.
In both fields, developers have turned to crowdsourcing platforms to acquire those images and audio samples. It’s an approach that spares them the substantial costs of working with more traditional suppliers such as data collection and research organizations.
NLP Data Crowdsourcing
Some of the projects that have used crowdsourcing successfully to collect data for NLP software development include:
- A company that built a speech-enabled application and needed it to produce natural, accurate spoken Chinese with correct intonation.
- A car manufacturer in Japan that needed to train its satellite navigation software to understand non-native speakers of Japanese.
- Developers of an application to understand political discussions in Arabic, for use in sentiment analysis.
Machine Vision Crowdsourcing
In the machine vision arena, the automotive industry’s drive to develop autonomous vehicles is a notable example of extensive data crowdsourcing adoption. At the forefront of autonomous automobile development, companies like Tesla and Uber require millions of images to train their machine vision algorithms.
Each image must be annotated manually and in great detail. The annotations provide algorithms with the information to distinguish objects captured by machine-vision cameras, and to predict the actions of vehicles, pedestrians, and other objects in the field of view.
Because image annotation requires many hours of manual labor, autonomous-vehicle companies have found that crowdsourcing offers an economical way to label data in bulk.
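To make the annotation step more concrete, here is a minimal sketch of what a single annotated image might look like as a data structure. The schema is an assumption for illustration only: the class names `BoundingBox` and `ImageAnnotation`, the field names, and the sample values are hypothetical and not taken from any particular company or platform.

```python
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    """One labeled object in an image (hypothetical schema, pixel coordinates)."""
    label: str   # e.g. "pedestrian" or "vehicle"
    x: int       # left edge of the box
    y: int       # top edge of the box
    width: int
    height: int

    def area(self) -> int:
        return self.width * self.height


@dataclass
class ImageAnnotation:
    """All labels one crowd-worker attached to a single image."""
    image_id: str
    worker_id: str
    boxes: list[BoundingBox] = field(default_factory=list)

    def labels(self) -> set[str]:
        # Distinct object classes found in this image.
        return {box.label for box in self.boxes}


# Illustrative values only.
annotation = ImageAnnotation(
    image_id="frame_000123.jpg",
    worker_id="worker_42",
    boxes=[
        BoundingBox("pedestrian", x=310, y=140, width=40, height=110),
        BoundingBox("vehicle", x=520, y=180, width=200, height=120),
    ],
)
print(sorted(annotation.labels()))
```

Real annotation formats are usually richer (polygons, occlusion flags, object tracks across frames), which is part of why labeling each image takes so long.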
Some Best Practices in Crowdsourcing AI Data
Assuring data quality is perhaps one of the most significant challenges when crowdsourcing, so the following tips and best practices focus primarily on this aspect of acquiring and using data.
The choice of provider will be a crucial factor. Aside from its effect on quality, it determines how much of the workload falls on your business. Therefore, provider suitability is a good place to start.
Work with a Specialist in AI Data
Amazon’s Mechanical Turk (MTurk) is perhaps the best-known crowdsourcing platform, but it takes a hands-off approach to managing the crowd. Choosing a generalist provider like MTurk will require you to manage the following activities directly:
- Crowd-worker selection
- Creation of tasks and task descriptions/instructions
- Crowd training
- Specialized tool development
- Quality control
- Payment to crowd-workers
As the demand for data to train AI applications has grown, a new generation of specialized crowdsourcing platforms has emerged and is flourishing. Apart from their narrow field of focus, which endears them to clients with a need for high-quality training data, they typically offer more services to simplify engagement with the crowd.
For this reason, it will probably make sense for your organization to choose such a specialized platform for your data-sourcing project rather than opting for a generalist solution.
Allocate Some Internal Resources for Data Verification and Quality Checks
Selecting a specialist crowdsourcing platform will mean paying substantially higher rates than working through a general provider like MTurk. The rewards, though, are also higher, since you will have less to worry about in the way of crowd training, tool selection, and project management.
At the same time, you will also receive more attention to quality control. However, quality still may not be assured to the standard you need. Therefore, it will pay to assign some of your internal project team members to verify and check batches of crowdsourced data.
You might find that checking on a random basis is sufficient or that early efforts, combined with feedback to the crowdsourcing provider, will result in greater assurance later on. Whether you opt for full or incremental quality checks will depend on your project’s unique needs.
In any case, though, it would be a mistake to assume that the crowd, however well-trained, will automatically deliver quality to the standards you expect.
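The random spot check described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a prescribed process: the function names, the 5% acceptance threshold, and the simulated reviewer verdicts are all hypothetical, and your own project would set its own sample size and tolerance.

```python
import random


def sample_for_review(batch, sample_size, seed=None):
    """Pick a random subset of a crowdsourced batch for internal review."""
    rng = random.Random(seed)
    return rng.sample(batch, min(sample_size, len(batch)))


def estimated_error_rate(reviewed):
    """Share of reviewed items that the internal reviewer rejected.

    `reviewed` is a list of (item, is_correct) pairs.
    """
    if not reviewed:
        raise ValueError("no reviewed items")
    rejected = sum(1 for _, is_correct in reviewed if not is_correct)
    return rejected / len(reviewed)


def batch_passes(reviewed, max_error_rate=0.05):
    """Accept the batch only if the sampled error rate is within tolerance.

    The 5% default is an assumed threshold, not a recommendation.
    """
    return estimated_error_rate(reviewed) <= max_error_rate


# Hypothetical batch of 1,000 delivered items; review 50 of them.
batch = [f"item_{i}" for i in range(1000)]
sample = sample_for_review(batch, sample_size=50, seed=7)

# Simulated reviewer verdicts: every tenth sampled item is marked wrong.
reviewed = [(item, i % 10 != 0) for i, item in enumerate(sample)]

print(estimated_error_rate(reviewed))
print(batch_passes(reviewed))
```

A batch that fails the check is a natural trigger for feedback to the provider, or for a fuller review of that batch, in line with the incremental approach mentioned earlier.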
Be Clear and Keep It Simple
A specialized crowdsourcing provider will probably take some responsibility for creating instructions for crowd-workers and breaking activities down into appropriate micro-tasks. Nevertheless, you are more likely to receive results that match your quality needs if you put in the initial effort to give the provider clear and straightforward instructions.
Essentially, the shorter, clearer, and more prescriptive your instructions, and the more granular the tasks you set before passing them on, the greater the likelihood of high-quality outputs from your crowd.
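As a rough illustration of what "granular" can mean in practice, the sketch below splits a large labeling job into small micro-tasks that each carry one short instruction. The `MicroTask` structure, the function name, and the batch size of 10 are assumptions made for the example; a specialized provider may well handle this step for you in its own way.

```python
from dataclasses import dataclass


@dataclass
class MicroTask:
    """A small unit of work with one short, prescriptive instruction."""
    task_id: int
    instruction: str
    items: list[str]


def split_into_micro_tasks(items, instruction, items_per_task=10):
    """Break a long list of items into micro-tasks of at most `items_per_task`."""
    if items_per_task < 1:
        raise ValueError("items_per_task must be at least 1")
    return [
        MicroTask(
            task_id=number,
            instruction=instruction,
            items=items[start:start + items_per_task],
        )
        for number, start in enumerate(
            range(0, len(items), items_per_task), start=1
        )
    ]


# Hypothetical job: 25 images, one instruction, 10 images per task.
images = [f"img_{i:04d}.jpg" for i in range(25)]
tasks = split_into_micro_tasks(
    images, "Draw a box around every pedestrian.", items_per_task=10
)
print([len(task.items) for task in tasks])
```

Keeping each task small and tied to a single instruction also makes it easier to trace a quality problem back to a specific task or instruction.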
Crowdsource Your Way to Successful Data Use
Crowdsourcing can be an effective and cost-efficient way to capture data for your next AI or machine-learning software development project. However, the successful use of that data depends on its quality.
Choose the right crowdsourcing platform, maintain a focus on quality checking and verification, and articulate clear expectations and instructions for the crowd. Those three actions will help you secure crowdsourced data of a quality otherwise attainable only by engaging conventional, and expensive, data collection service providers or research agencies.