Data Considerations when Building a Cognitive Solution

As they embark on their cognitive journey, many teams struggle to decide which kinds of data are best suited to cognitive solutions. In this blog, I will share some thoughts and suggestions on how to identify data for cognitive solutions.

Kinds of Data:

The key ingredient for any cognitive solution is data. Specifically, two kinds of data are needed:

  • The knowledge base, which is the corpus of documents that contain the information relevant to the use case.
  • The ground truth, which is the training data used to teach the Machine Learning system how to identify and extract relevant knowledge/insights from the knowledge base (corpus).

For example, in a question/answer cognitive use case, it is critical that the answers to end-user questions reside somewhere in the knowledge base, and it is equally critical that the ground truth consists of representative end-user questions.

When considering what data to use for a cognitive use case, it is helpful to think of how you’d learn a new subject, for example, Economics. The typical learning process involves getting one or more books on the subject (Economics), which, in this analogy, would constitute the knowledge base. Then, to prepare for an exam on the subject, you’d go through several practice tests that are representative of the exam; these tests would map to the ground truth needed for training.

Data Format:

The other question to consider is data format. For this question, it is helpful to go back to the definition of cognitive systems as systems that learn and interact naturally with humans to help them extract knowledge and insight from big data. Humans commonly interact with one another using text, speech, and images. As such, IBM’s Watson Developer Cloud services are designed to handle text, speech, and images.

For instance, the AlchemyLanguage, Natural Language Classifier, Conversation, Retrieve & Rank, Personality Insights, and Tone Analyzer services mainly accept text as input.

Visual Recognition, on the other hand, accepts images, photographs, and drawings as input and returns insights about their contents, such as which objects are in the image, whether the image includes a face and, if so, whether it is the face of a male or female and of what age group. Furthermore, the Visual Recognition service can be trained to identify objects in an image based on custom classifiers designed for a particular task. For example, Visual Recognition can be used to differentiate sports cars from trucks and leopards from tigers.

For audio, Watson’s Speech to Text service accepts speech as input and produces text as output. Speech to Text currently supports multiple languages (US English, UK English, Spanish, Japanese, Mandarin Chinese, Brazilian Portuguese, and Arabic), with a roadmap for supporting more languages in the future. Watson’s Speech to Text also has some unique capabilities, such as keyword spotting, profanity filtering, and word alternatives/confidence/timestamps.

Data Cleansing:

Teams planning to build cognitive solutions should include in their plans the effort and resources required for data cleansing. Data cleansing is a necessity because real data is rarely available without errors. The task involves detecting and removing errors and inconsistencies within a single data source as well as across a variety of data sources. Common errors include misspellings, duplicate information, conflicting information, formatting problems, and invalid character encoding.
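To make the character-encoding point concrete, here is a minimal sketch in Python of how a team might flag documents with invalid encoding before ingestion. It assumes the source documents are expected to be UTF-8; the function names decode_document and normalize_text are my own illustrations and not part of any Watson service.

```python
import unicodedata


def decode_document(raw: bytes) -> tuple[str, bool]:
    """Decode raw bytes as UTF-8 (assumed encoding).

    Returns the decoded text and a flag that is True when the bytes
    were not valid UTF-8, so the document can be reviewed by hand.
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Keep going, but mark undecodable bytes with U+FFFD.
        return raw.decode("utf-8", errors="replace"), True


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())
```

Flagging rather than silently dropping a document leaves the decision with the people who know the data; whether to repair, re-export, or discard a flagged document depends on the source.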

A variety of tools and techniques have been developed to help with the task of data cleansing including data analysis (sanity checking), defining data transformations (to match a specific schema), and identification and removal of duplicate and/or conflicting information. Data ingested into a cognitive system without being properly cleansed can lead to inaccuracies in retrieving results from the system.
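As one illustration of duplicate identification, the sketch below groups documents whose text is identical after lowercasing and whitespace normalization. It is a simple exact-match approach under my own assumptions (documents held in memory as a dictionary of ID to text); it will not catch near-duplicates, which need fuzzier techniques.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Return groups of document IDs that share the same fingerprint."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[fingerprint(text)].append(doc_id)
    # Only groups with more than one member are duplicates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    corpus = {
        "doc1": "Interest rates  rose in Q3.",
        "doc2": "interest rates rose in q3.",
        "doc3": "Inflation was flat.",
    }
    print(find_duplicates(corpus))
```

Each group returned is a candidate for review: keep one copy, or merge the copies, before the corpus is ingested.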

Consider the example where two duplicate documents are ingested into the cognitive system. When training the system, a subject matter expert may select only one of the documents as containing the information relevant to a query. The system would then learn that only one of the two identical documents is relevant, which gives it conflicting and inconsistent training data and leads to inaccurate results.
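If duplicates cannot be removed before training, the ground truth can at least be checked for this kind of conflict. The sketch below is a hypothetical check, assuming the ground truth is a mapping from each query to the set of document IDs marked relevant, and that groups of duplicate document IDs have already been identified by some other means. The data layout and the name inconsistent_labels are my own assumptions.

```python
def inconsistent_labels(
    ground_truth: dict[str, set[str]],
    duplicate_groups: list[list[str]],
) -> list[tuple[str, list[str]]]:
    """Find queries where only part of a duplicate group is marked relevant.

    Returns (query, unlabeled_duplicate_ids) pairs for review.
    """
    issues = []
    for query, relevant in ground_truth.items():
        for group in duplicate_groups:
            labeled = [d for d in group if d in relevant]
            # A conflict exists when some, but not all, copies are labeled.
            if labeled and len(labeled) < len(group):
                missing = [d for d in group if d not in relevant]
                issues.append((query, missing))
    return issues


if __name__ == "__main__":
    truth = {"What moved rates?": {"doc1"}}
    print(inconsistent_labels(truth, [["doc1", "doc2"]]))
```

A subject matter expert can then decide whether to label the remaining copies as relevant too or to drop them from the corpus.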


In conclusion, when starting to build a cognitive solution, teams need to plan for the effort and resources required to identify the relevant data for creating the knowledge base, as well as for the subject matter experts required to provide training data to the system. Understanding what data formats are supported by the cognitive system you’re planning to leverage is also critical. Lastly, teams need to plan for data cleansing, as it is a necessity for building successful cognitive applications.

