Blog > June 2019 > The 3 Cs of Big Data — Curation, Crowdsourcing and Collaboration

The 3 Cs of Big Data — Curation, Crowdsourcing and Collaboration

When people talk about turning Big Data into bottom-line value, they often conflate two concepts: curation and crowdsourcing. It is important to understand how these two distinct data-driven capabilities can each serve the business. In this first part of a four-part blog series, we distinguish between curation and crowdsourcing and show how a third “C”, collaboration, contributes to an organization’s bottom line.
Data Curation
Data curation is how subject matter experts (SMEs) make a collection of diverse data sources useful and integrated so that other users have rationalized data sets readily available to drive understanding and decision-making.
Data curation requires a structured approach:
  • It starts with the identification of sources that might be valuable
  • The next step is some initial investigation to see which of the sources really are usable and useful
  • Next, the data may need to be cleansed and transformed so that it’s ready for integration
  • Finally, after integration, the resultant dataset might need rationalization before publishing
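The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the source names, the sample records and the rules used for "usable", "cleansed" and "rationalized" are all hypothetical, and a real curation effort would rely on SME judgment at each step.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    rows: list  # list of dict records


def identify_sources():
    # Step 1: list candidate sources (hypothetical sample data).
    return [
        Source("crm", [{"id": "1", "email": " Ann@Example.com "},
                       {"id": "2", "email": "bob@example.com"}]),
        Source("web_signups", [{"id": "2", "email": "BOB@example.com"},
                               {"id": "3", "email": ""}]),
        Source("empty_feed", []),
    ]


def is_usable(source):
    # Step 2: a crude usability check; keep only sources that hold records.
    return len(source.rows) > 0


def cleanse(row):
    # Step 3: normalize values so records from different sources line up.
    return {"id": row["id"].strip(), "email": row["email"].strip().lower()}


def integrate(sources):
    # Step 4a: combine the cleansed records from every usable source.
    return [cleanse(row) for source in sources for row in source.rows]


def rationalize(rows):
    # Step 4b: drop records without an email and keep one record per id.
    by_id = {}
    for row in rows:
        if row["email"]:
            by_id.setdefault(row["id"], row)
    return sorted(by_id.values(), key=lambda r: r["id"])


def curate():
    usable = [s for s in identify_sources() if is_usable(s)]
    return rationalize(integrate(usable))


curated = curate()
for record in curated:
    print(record)
```

Keeping each step as its own function mirrors the structured approach: any one rule can be replaced by an SME's own logic without touching the rest.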
Curation is the foundation for turning many data sources into value. But it has a significant drawback: curation that depends on SMEs doesn’t scale. The more data sources there are to review, the bigger the bottleneck becomes. That’s where crowdsourcing comes in.
Crowdsourcing
While curation is a long-established idea (there were museum curators as far back as the 18th century), crowdsourcing is a relatively new term for a problem-solving model that has been in use for a long time. Jeff Howe coined the word in a 2006 article in Wired magazine to describe a sourcing model in which materials, services, ideas and even money are gathered from a broad community, dividing a problem among many contributors to accelerate, improve and broaden solutions. Crowdsourcing in the Big Data world takes at least four forms:
  • Crowdsourced data collection, often called “polling”, requests answers to questions from widespread populations, usually internet communities
  • Crowdsourced data gathering involves the “crowd” in nominating data sources
  • Crowdsourced analytics comes in various flavors: using the crowd to identify significant “features” in data, sponsoring competitions (as Kaggle does) to determine the “best” approach, or analyzing data to answer specific questions
  • Crowdsourced data trust is about collecting user opinions on the accuracy, value and validity of data
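As an illustration of the last form, crowd opinions about data trust can be reduced to a per-dataset score. The sketch below is a minimal example under assumptions: the dataset names, the 1-to-5 rating scale and the `min_votes` threshold are hypothetical, and a plain average is only one of many possible aggregation rules.

```python
from collections import defaultdict

# Hypothetical (dataset, rating) pairs collected from users, on a 1-5 scale.
ratings = [
    ("sales_2018.csv", 5), ("sales_2018.csv", 4), ("sales_2018.csv", 5),
    ("vendor_feed.json", 2), ("vendor_feed.json", 1), ("vendor_feed.json", 3),
    ("survey_raw.xlsx", 5),
]


def trust_scores(ratings, min_votes=3):
    """Average the ratings per dataset.

    Datasets with fewer than `min_votes` ratings get None, because a
    single opinion is not yet the wisdom of a crowd.
    """
    votes = defaultdict(list)
    for dataset, score in ratings:
        votes[dataset].append(score)
    return {
        dataset: round(sum(scores) / len(scores), 2)
        if len(scores) >= min_votes else None
        for dataset, scores in votes.items()
    }


scores = trust_scores(ratings)
for dataset, score in scores.items():
    print(dataset, "not enough votes" if score is None else score)
```

Withholding a score until enough votes arrive is a deliberate choice here: it keeps one enthusiastic or disgruntled user from deciding how much a dataset is trusted.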
Crowdsourcing is an excellent way to reduce the challenge of Big Data scale, expand analytic capabilities and improve the harvesting of value from data. However, it’s not a perfect solution! Some problems are just too complicated for crowdsourcing. Some questions can’t be asked broadly, for reasons of confidentiality. And sometimes finding the right crowd, keeping its members motivated and making sure they stay focused on the right problem takes too much effort or too much time. In those situations, increasing data science productivity through intelligent curation is a better approach.
It’s important to note that crowdsourcing and curation are not the same, but they can be complementary. Crowdsourced data gathering can help the curator build the dataset, and the wisdom of the crowd about data trust can be a great help in judging which sources to rely on. That’s where the third of the three Cs, collaboration, comes in.
Collaboration
A collaborative environment has several elements:
  • The ability to define an issue and manage its progress
  • Workflow to transmit work from one participant to the next
  • Notifications to communicate status and availability
  • A sharing environment to allow data, information and commentary to be shared
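To make these four elements concrete, here is a minimal sketch of how they might fit together in code. Everything in it is an assumption made for illustration: the `Issue` class, its fields and the participant names are hypothetical and do not describe any particular collaboration product.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str                       # the defined issue
    steps: list                      # ordered participants in the workflow
    position: int = 0                # index of the participant holding the work
    comments: list = field(default_factory=list)       # shared commentary
    notifications: list = field(default_factory=list)  # status messages

    @property
    def assignee(self):
        # None once the work has passed the last participant.
        if self.position < len(self.steps):
            return self.steps[self.position]
        return None

    @property
    def status(self):
        return "done" if self.assignee is None else f"with {self.assignee}"

    def comment(self, author, text):
        # Sharing: commentary is visible to everyone working on the issue.
        self.comments.append((author, text))

    def hand_off(self):
        # Workflow: pass the work to the next participant and notify.
        sender = self.assignee
        if sender is None:
            raise ValueError("issue is already done")
        self.position += 1
        if self.assignee is None:
            message = f"'{self.title}' completed by {sender}"
        else:
            message = f"'{self.title}' passed from {sender} to {self.assignee}"
        self.notifications.append(message)


issue = Issue("Validate vendor feed", ["curator", "analyst"])
issue.comment("curator", "Two columns look mislabeled.")
issue.hand_off()
print(issue.status)
issue.hand_off()
print(issue.status)
print(issue.notifications)
```

The single `Issue` object carries all four elements: the issue definition and its progress, the hand-off workflow, the notification log and the shared comments.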