CS725-F13: Homework 2: Data Foundations

Assigned: Tuesday, Sep 3, 2013
Due: Tuesday, Sep 10, 2013 before class begins

Description

There is a written part and a hands-on part to this assignment. The purpose is to assess your understanding of the material and to give you practice with data cleaning.

Part 1 - Written Assignment

1. Give examples, other than the ones listed in Ch 2, of data sets with the following characteristics:
   a. with an ordering relationship
   b. with a distance metric
   c. with an absolute zero
   For each example, include the URL where you found the data and a description of which part of the data set has the desired characteristic.
2. Describe the difference between a data attribute and a value. Use examples to clarify your response.
3. There are numerous strategies for dealing with missing data in a data set. These include deleting the row containing the missing value, replacing the missing value with a special number (such as -999), replacing the value with the average value for that data dimension, and replacing the value with the corresponding entry from the nearest neighbor (using some distance calculation). Comment on the strengths and weaknesses of each of these strategies. What is gained or lost by following one approach over the others?
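
To make the four missing-data strategies concrete, here is a minimal Python sketch on a made-up two-column table. The rows, the column positions, and the function names are illustrative assumptions and not part of the course materials. The nearest-neighbor version measures distance on a single column only, which is one of many possible distance calculations.

```python
from statistics import mean

# Hypothetical toy rows: (age, measurement). None marks a missing value.
# These are NOT rows from any real data set.
rows = [(66, 2.30), (71, None), (72, 2.40), (85, 2.10)]

def drop_rows(rows):
    """Strategy 1: delete every row that contains a missing value."""
    return [r for r in rows if None not in r]

def fill_sentinel(rows, sentinel=-999):
    """Strategy 2: replace each missing value with a special number."""
    return [tuple(sentinel if v is None else v for v in r) for r in rows]

def fill_mean(rows, col):
    """Strategy 3: replace missing values in column `col` with the column mean."""
    col_mean = mean(r[col] for r in rows if r[col] is not None)
    return [r[:col] + (col_mean,) + r[col + 1:] if r[col] is None else r
            for r in rows]

def fill_nearest(rows, col, key_col):
    """Strategy 4: copy the value from the nearest complete row.

    Distance is the absolute difference in `key_col` (an assumption made
    for this sketch; any distance calculation could be substituted).
    """
    complete = [r for r in rows if r[col] is not None]
    filled = []
    for r in rows:
        if r[col] is None:
            nearest = min(complete, key=lambda c: abs(c[key_col] - r[key_col]))
            r = r[:col] + (nearest[col],) + r[col + 1:]
        filled.append(r)
    return filled

print(drop_rows(rows))
print(fill_sentinel(rows))
print(fill_mean(rows, col=1))
print(fill_nearest(rows, col=1, key_col=0))
```

Running each function on the same rows and comparing the column mean afterwards is a quick way to see how much each strategy distorts the data.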

Submission: Submit a hard copy at the beginning of class.

Part 2 - Hands-On Assignment

First, create a webpage (`http://www.cs.odu.edu/~username/cs725f13/hw2.html` would be suitable) to hold the results of this part of the assignment.

In this part of the assignment, you will be analyzing data from a real study. The objective of the study was to determine if significant gender differences existed between subjects 65 years of age and older with regard to calcium, phosphorus, and alkaline phosphatase levels. The researchers performed a retrospective chart review of laboratory procedures performed in 6 different physician practices. The data consisted of 178 subjects representing 92 males and 86 females age 65 or older.

The data set is available at http://academic.csuohio.edu/holcombj/clean/calcium.xls. In the data set, there are three discrete variables: sex, lab, and agegroup.

The coding is as follows:

• Sex: 1=Male; 2=Female
• Lab: 1=Metpath; 2=Deyor; 3=St. Elizabeth's; 4=CB Rouche; 5=YOH; 6=Horizon
• Agegroup: 1=65-69; 2=70-74; 3=75-79; 4=80-84; 5=85-89 years

The other variables are continuous: age, alkphos (alkaline phosphatase, IU/L), cammol (calcium, mmol/L), and phosmmol (inorganic phosphorus, mmol/L).
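
If you work with the data in a scripting language, the coding can be kept as lookup tables so that summaries show labels rather than numbers. The following is a minimal Python sketch. The field names `sex`, `lab`, and `agegroup` are assumed to match the column headers in the file (check the actual spelling and case), and `decode` is a hypothetical helper name.

```python
# Code-to-label lookups transcribed from the coding list in the assignment.
SEX = {1: "Male", 2: "Female"}
LAB = {1: "Metpath", 2: "Deyor", 3: "St. Elizabeth's",
       4: "CB Rouche", 5: "YOH", 6: "Horizon"}
AGEGROUP = {1: "65-69", 2: "70-74", 3: "75-79", 4: "80-84", 5: "85-89"}

def decode(record):
    """Return a copy of `record` with the three coded fields replaced by labels.

    Codes that are not in the lookup tables are reported rather than
    silently dropped, so that suspicious values stay visible.
    """
    out = dict(record)
    for field, table in (("sex", SEX), ("lab", LAB), ("agegroup", AGEGROUP)):
        code = record[field]
        out[field] = table.get(code, f"unknown code {code!r}")
    return out

# Hypothetical record for illustration; not a row from the data set.
example = {"sex": 2, "lab": 3, "agegroup": 1, "age": 67}
print(decode(example))
```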

You are not expected to have knowledge of physiology for this -- you're just analyzing data.

1. First, look at the format of the data file. List and explain any changes you make to the format of the dataset. (You are not required to make changes, but if you do, you must document them.)
2. Then, check the validity of the data. Determine if this is a "messy" data set with variable values that appear incorrect. Once you find incorrect values, attempt to recover the correct values by looking up the true values from the actual data records, available at http://academic.csuohio.edu/holcombj/clean/bigtable.htm.
   List and explain the steps you used to check the data. The goal is to find ways to discover the erroneous data without manually comparing each record against the actual data records. List each error and what you did to correct it.
3. Once the data is "clean", perform a summary analysis (i.e., counts and distribution) of the three discrete variables (sex, lab, and agegroup). For the variables alkphos, cammol, and phosmmol, report the mean, median, standard deviation, min, and max broken down by sex. Then summarize alkphos, cammol, and phosmmol in the same way with lab as the factor variable.
4. Using Tableau Desktop, construct side-by-side box plots of the variables alkphos, cammol, and phosmmol with sex as the factor variable. Then construct side-by-side box plots of the same three variables with lab as the factor variable.
   The box plot is not a built-in plot type in Tableau, but it is a useful graph type for showing distributions. Tableau has a tutorial on how to create box plots, including a 6 min video that explains box plots and how to create them in Tableau, at http://kb.tableausoftware.com/articles/knowledgebase/box-plot-analog
   Put a snapshot (png or jpg image) of your plots on your webpage.
5. Compare the mean and standard deviation of age, alkphos, cammol, and phosmmol from the messy dataset with the mean and standard deviation from your cleaned dataset. Does cleaning the data make a difference? Explain.
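
The steps above do not prescribe a tool for the checking and summary work. As one possible approach, the Python sketch below flags values that break the stated coding and computes per-group summary statistics. It assumes the records have already been loaded as dictionaries keyed by the variable names used in this assignment and that all values are numeric. The helper names and the sample records are hypothetical, and the checks shown are examples only, not a complete list of what to look for.

```python
from statistics import mean, median, stdev

# Allowed codes for the discrete variables, taken from the coding list.
VALID_CODES = {"sex": {1, 2}, "lab": set(range(1, 7)), "agegroup": set(range(1, 6))}

def expected_agegroup(age):
    """Map an age in [65, 90) to its agegroup code (1=65-69 ... 5=85-89)."""
    return (age - 65) // 5 + 1 if 65 <= age < 90 else None

def find_problems(records):
    """Return (row_index, message) pairs for values that break the coding."""
    problems = []
    for i, rec in enumerate(records):
        for field, allowed in VALID_CODES.items():
            if rec[field] not in allowed:
                problems.append((i, f"{field}={rec[field]!r} is not a valid code"))
        if expected_agegroup(rec["age"]) != rec["agegroup"]:
            problems.append(
                (i, f"age {rec['age']} does not match agegroup {rec['agegroup']}"))
    return problems

def summarize(records, variable, factor):
    """Mean, median, standard deviation, min, and max of `variable` per `factor` level."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[factor], []).append(rec[variable])
    return {
        level: {
            "n": len(vals),
            "mean": mean(vals),
            "median": median(vals),
            # The sample standard deviation needs at least two values.
            "sd": stdev(vals) if len(vals) > 1 else float("nan"),
            "min": min(vals),
            "max": max(vals),
        }
        for level, vals in groups.items()
    }

# Hypothetical records for illustration only; NOT rows from calcium.xls.
records = [
    {"sex": 1, "lab": 1, "agegroup": 1, "age": 66, "cammol": 2.30},
    {"sex": 2, "lab": 2, "agegroup": 2, "age": 73, "cammol": 2.40},
    {"sex": 2, "lab": 9, "agegroup": 3, "age": 76, "cammol": 2.20},  # invalid lab code
    {"sex": 1, "lab": 4, "agegroup": 5, "age": 71, "cammol": 2.50},  # age/agegroup mismatch
]

for row, message in find_problems(records):
    print(f"row {row}: {message}")
print(summarize(records, "cammol", "sex"))
```

Calling `summarize` once on the messy records and once on the cleaned records gives the two sets of means and standard deviations to compare. Note that implausible values in the continuous variables will not be caught by code checks like these and need their own range or outlier checks.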

Grading Rubric: Rubric for Programming Assignments. This rubric doesn't exactly fit the assignment, so here's what I'll be looking for:

• you find all of the errors in the data - I will not tell you how many there are
• you have high quality explanations of what you did
• your Tableau box plot visualizations are clear and appropriate

Submission: Email me the URL of your webpage before class begins. We will discuss the problem in class. I may show one or two solutions as examples.

Credit: Part 2 based on an assignment from John Holcomb, Cleveland State Univ.