Structuring and Cleaning Data

How to Structure Data

Put data into a format that can be accepted by Watson Analytics.

  1. Only one title column at the top of the page
  2. Must be vertical columns
  3.  Follow all recommendations from the Watson Analytics Badge
  4. select the data type for each column. Right click on the data select format cells,  select column then the type (number, text, time/date etc)

How to Clean Data

Note: The recommended program for this class is Excel, the next choice is Google sheets. These programs are supported with videos on how to do all of these tools and more. The program Numbers is less supported and not recommended. All of the spreadsheets can be used; however, if  you use Numbers, you may need to find how to do several of the following steps on your own.

  1. Make sure you have a good reliable data source. You must cite this source so record where you obtained it. Remember “Junk in results in Junk out”. Cleaning data does not change bad data into good data. You may have  multiple (reliable) data sources. In these cases, we may have to have an extra effort looking for “inconsistencies”.
  2. Make a copy of your data using Save as. Always structure and clean the copy. Leave the original data as a reference.
  3. Fill in missing gaps if possible. For example, if a gap means zero, replace it with a zero. If a gap means that no data was collected at that time, leave it blank. If you need to replace dashes with zeros In Excel go to Format then Number then Automatic. In many cases, you can fill missing gaps based on the information that you have or in some logical “rule”. For example,  suppose that you know the “country” column but you don’t have any value on the “international phone code”. In this case, for example,  if the value of the country is “Vietnam” you can fill the column “international phone code” by “+84”. However, you should always avoid/remove redundant columns.
  4. Separate data. For example, if the date is together in a single column, separate it into different columns. To do this in Excel, go to Data then text to columns then Fixed width then Break apart the columns
  5. Don’t include columns that are summaries (such as averages) of other columns. This type of column throws off the computations and can lead to wrong conclusions.
  6. Eliminate data that is not useful. Structuring data is a very important step because is when we decide what are the (relevant) attributes of our data. Removing redundant/irrelevant attributes will help Watson focus on finding helpful relationships among the data.
  7. Add more columns of information if possible.  For example, If you are studying crime data, add weather data etc. Note: this can be a long, slow process.
  8.  “Tag” data if necessary. If two items are really the same thing then tag them as the same. For example, if a store sold brand A of inexpensive reflectors then changed to brand B of inexpensive reflectors, your program will not identify these items as the same. Copy and paste the name column and make a new column. Then use Find  and Replace within a column to replace the different names as “Inexpensive reflector.” This sounds complicated but should not take that long. Delete the original column and keep only the tag.
  9. Eliminate obviously wrong data. For example, if  you are working with AQI  and there is a result that is negative or 999 because the machine did not record data in that time period, then delete these entries. Sometimes you can identify and fix data entry errors, e.g. mistyping. For example, the word “Stret” instead of “Street”.
  10. Format specific types of data. If the number represents currency, you need to format it as currency. In Excel go to format then currency then type.
  11. Joining spreadsheets. Of course, you can copy and paste to put things together into a single spreadsheet, but this is slow and spreadsheets have a size limit. If the column names are exactly the same, then you can append spreadsheets in Watson. To do this click on the 3 dots next to the data set. Please note that when you analyze data in Watson, when asking questions you can change the source of the data and compare information from different spread sheets.
  12. If you use Google Sheets, you must export it as an Excel document, before uploading to Watson Analytics
  13. Don’t try to get a particular answer or relationship. “Listen” to the data in a neutral way. For example, looking for trends is AQI are good. Trying to show that the AQI is increasing while you are doing your project, will lead to bias.

More Powerful Tools

Use a program such as Data Wrangler . You can select this tutorial and certification  on Data Wrangler as one of the badges for this class.

More Information

The Best Way to Prepare Data  Note: the coding portion of this class goes beyond the level 1 Big Data class, however the parts on cleaning are helpful.