**Assigned:** Tuesday, Sep 3, 2013 **Due:** Tuesday, Sep 10, 2013 *before class begins*

There is a written part and a hands-on part to this assignment. The purpose is to assess your understanding of the material and to give you practice with data cleaning.

- Give examples, other than the ones listed in Ch 2, of data sets with the following characteristics:
  - with an ordering relationship
  - with a distance metric
  - with an absolute zero

For each example, include the URL where you found the data and a description of which part of the data set has the desired characteristic.

- Describe the difference between a data attribute and a value. Use examples to clarify your response.
- There are numerous strategies for dealing with missing data in a data set. These include deleting the row containing the missing value, replacing the missing value with a special number (such as -999), replacing the value with the average value for that data dimension, and replacing the value with the corresponding entry from the nearest neighbor (using some distance calculation). Comment on the strengths and weaknesses of each of these strategies. What is gained or lost by following one approach over the others?
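To make the four strategies in the last question concrete, here is a minimal Python sketch. The toy rows, the column layout, and all function names are hypothetical and chosen only for illustration; the nearest-neighbor version assumes distance is measured on a single other column, which is just one of many possible distance calculations.

```python
from statistics import mean

# Hypothetical toy data: each row is (age, calcium); None marks a missing value.
rows = [(65, 2.30), (70, None), (72, 2.50), (80, 2.10)]

def drop_missing(rows):
    """Strategy 1: delete any row that contains a missing value."""
    return [r for r in rows if None not in r]

def fill_sentinel(rows, sentinel=-999):
    """Strategy 2: replace every missing value with a special number."""
    return [tuple(sentinel if v is None else v for v in r) for r in rows]

def fill_mean(rows, col):
    """Strategy 3: replace missing values in column `col` with that column's mean."""
    col_mean = mean(r[col] for r in rows if r[col] is not None)
    return [
        tuple(col_mean if (i == col and v is None) else v for i, v in enumerate(r))
        for r in rows
    ]

def fill_nearest(rows, col, key_col):
    """Strategy 4: copy the value from the complete row nearest in `key_col`."""
    complete = [r for r in rows if r[col] is not None]
    filled = []
    for r in rows:
        if r[col] is None:
            donor = min(complete, key=lambda c: abs(c[key_col] - r[key_col]))
            r = tuple(donor[col] if i == col else v for i, v in enumerate(r))
        filled.append(r)
    return filled

print(drop_missing(rows))
print(fill_sentinel(rows))
print(fill_mean(rows, col=1))
print(fill_nearest(rows, col=1, key_col=0))
```

Running each function on the same rows and comparing the resulting row count and column mean is one way to start thinking about what each strategy gains or loses.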

**Grading Rubric:** Rubric for Short Answer/Essay Questions

**Submission:** Submit a hard copy *at the beginning of class*.

First, create a webpage (`http://www.cs.odu.edu/~`*username*`/cs725f13/hw2.html` would be suitable) to hold the results of this part of the assignment.

In this part of the assignment, you will be analyzing data from a real study. The objective of the study was to determine if significant gender differences existed between subjects 65 years of age and older with regard to calcium, phosphorus, and alkaline phosphatase levels. The researchers performed a retrospective chart review of laboratory procedures performed in 6 different physician practices. The data consisted of 178 subjects representing 92 males and 86 females age 65 or older.

The data set is available at http://academic.csuohio.edu/holcombj/clean/calcium.xls. In the data set, there are three discrete variables: **sex**, **lab**, and **agegroup**.

The coding is as follows:

- Sex: 1=Male; 2=Female
- Lab: 1=Metpath; 2=Deyor; 3=St. Elizabeth's; 4=CB Rouche; 5=YOH; 6=Horizon
- Agegroup: 1=65-69; 2=70-74; 3=75-79; 4=80-84; 5=85-89 years

The other variables, **age**, **alkphos** - alkaline phosphatase (IU/L), **cammol** - calcium (mmol/L), and **phosmmol** - inorganic phosphorus (mmol/L), are continuous.

You are not expected to have knowledge of physiology for this -- you're just analyzing data.

- First, look at the format of the data file. List and explain any changes you make to the format of the dataset. (You are not required to make changes, but if you do, you must document them.)
- Then, check the validity of the data. Determine if this is a "messy" data set with variable values that appear incorrect. Once you find incorrect values, attempt to recover the correct values by looking up the true values from the actual data records, available at http://academic.csuohio.edu/holcombj/clean/bigtable.htm.

List and explain the steps you used to check the data. The goal is to find ways to discover the erroneous data *without* manually comparing each record against the actual data records. List each error and what you did to correct it.
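One way to flag suspicious records automatically is to test each record against the coding scheme given earlier. The sketch below is only an illustration: the records are made up (they are not rows from calcium.xls), the `obsno` field and the function name are hypothetical, and the age check assumes whole-number ages.

```python
# Allowed codes, taken from the coding scheme listed in the assignment.
VALID_SEX = {1, 2}
VALID_LAB = {1, 2, 3, 4, 5, 6}
AGEGROUP_RANGES = {1: (65, 69), 2: (70, 74), 3: (75, 79), 4: (80, 84), 5: (85, 89)}

def find_problems(record):
    """Return a list of human-readable problems found in one record (a dict)."""
    problems = []
    if record["sex"] not in VALID_SEX:
        problems.append(f"sex code {record['sex']!r} is not 1 or 2")
    if record["lab"] not in VALID_LAB:
        problems.append(f"lab code {record['lab']!r} is not in 1-6")
    bounds = AGEGROUP_RANGES.get(record["agegroup"])
    if bounds is None:
        problems.append(f"agegroup code {record['agegroup']!r} is not in 1-5")
    elif not bounds[0] <= record["age"] <= bounds[1]:
        problems.append(
            f"age {record['age']!r} does not fall in agegroup {record['agegroup']}"
        )
    return problems

# Hypothetical records for illustration only.
records = [
    {"obsno": 1, "sex": 1, "lab": 3, "age": 66, "agegroup": 1},
    {"obsno": 2, "sex": 12, "lab": 3, "age": 71, "agegroup": 2},
    {"obsno": 3, "sex": 2, "lab": 6, "age": 771, "agegroup": 3},
]

for rec in records:
    for problem in find_problems(rec):
        print(f"record {rec['obsno']}: {problem}")
```

Checks like these cover only the coded variables and the age/agegroup consistency; the continuous lab values need a different approach, such as looking for extreme values in each column.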

- Once the data is "clean", perform a summary analysis (i.e., counts, distribution) of the three discrete variables (**sex**, **lab**, and **agegroup**). For the variables **alkphos**, **cammol**, and **phosmmol**, report the mean, median, standard deviation, min, and max broken down by **sex**. Also summarize the variables **alkphos**, **cammol**, and **phosmmol** in a similar way with the factor variable as **lab**.
- Construct side-by-side box plots using Tableau Desktop of the variables **alkphos**, **cammol**, and **phosmmol** with the factor variable as **sex**. Then construct side-by-side box plots of the **alkphos**, **cammol**, and **phosmmol** continuous variables with the factor variable as **lab**.
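The grouped summary requested above (mean, median, standard deviation, min, and max of a continuous variable, broken down by a factor such as **sex** or **lab**) could be computed along these lines. This is a minimal sketch using made-up values, not data from the study; the function name is hypothetical, and it reports the sample standard deviation, which requires at least two values per group.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def summarize_by(records, factor, variable):
    """Group `variable` by the levels of `factor` and summarize each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[factor]].append(rec[variable])
    return {
        level: {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "sd": stdev(values),  # sample standard deviation
            "min": min(values),
            "max": max(values),
        }
        for level, values in sorted(groups.items())
    }

# Hypothetical values for illustration only.
records = [
    {"sex": 1, "cammol": 2.2},
    {"sex": 1, "cammol": 2.4},
    {"sex": 1, "cammol": 2.6},
    {"sex": 2, "cammol": 2.3},
    {"sex": 2, "cammol": 2.5},
]

summary = summarize_by(records, factor="sex", variable="cammol")
for level, stats in summary.items():
    print(level, stats)
```

The same function works for the **lab** breakdown by passing `factor="lab"`, provided each record carries that field.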

The box plot is not a built-in plot type in Tableau, but it is a useful graph type for showing distributions. Tableau has a tutorial on how to create box plots, including a 6-minute video that explains box plots and how to create them in Tableau, at http://kb.tableausoftware.com/articles/knowledgebase/box-plot-analog.

Put a snapshot (png or jpg image) of your plots on your webpage.
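As a cross-check on what a box plot displays, the underlying numbers can be computed directly. The sketch below uses one common convention (quartiles by inclusive interpolation, whiskers at the most extreme points within 1.5 times the interquartile range); this is an assumption, and Tableau's calculation may use a different quartile method. The function name and data are hypothetical.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Compute quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical values: nine ordinary readings and one suspicious one.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Points that land in `outliers` are the ones a box plot draws individually, which makes them natural candidates for a closer look during data cleaning.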

- Compare the mean and standard deviation of **age**, **alkphos**, **cammol**, and **phosmmol** from the messy dataset with the mean and standard deviation from your cleaned dataset. Does cleaning the data make a difference? Explain.
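The comparison itself is a matter of computing the same two statistics on both versions of a column. Here is a minimal sketch with made-up ages, in which the "messy" list contains one value with an extra digit keyed in; the lists and the function name are hypothetical and are not taken from the study data.

```python
from statistics import mean, stdev

def mean_and_sd(values):
    """Return the mean and sample standard deviation of a list of numbers."""
    return mean(values), stdev(values)

# Hypothetical ages for illustration only.
messy = [66, 71, 68, 771, 74]  # 771 stands in for a data-entry error
clean = [66, 71, 68, 77, 74]   # the same list after correcting that entry

for label, ages in (("messy", messy), ("clean", clean)):
    m, sd = mean_and_sd(ages)
    print(f"{label}: mean={m:.1f}, sd={sd:.1f}")
```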

**Grading Rubric:** Rubric for Programming Assignments. This rubric doesn't *exactly* fit the assignment, so here's what I'll be looking for:

- you find all of the errors in the data - I will not tell you how many there are
- you have high quality explanations of what you did
- your Tableau box plot visualizations are clear and appropriate

**Submission:** Email me the URL of your webpage *before class begins*. We will discuss the problem in class. I may show one or two solutions as examples.

Credit: Part 2 based on an assignment from John Holcomb, Cleveland State Univ.

Retrieved from https://www.cs.odu.edu/~mweigle/CS725-F13/HW2

Page last modified on September 03, 2013, at 12:32 PM