Homework#2


Data Format:

The AGE column doesn't need to be stored as a continuous value; it can be stored as integers, since the data source contains only whole numbers for this field.

Data Validity (31 errors detected)

In this section I use OpenRefine to find as many problems as possible in the dataset.


First: Filter the columns according to their expected values. For example, SEX is supposed to hold either 1 or 2, for Male and Female respectively. Any other value is treated as an error and checked against the original data. This process applies to the SEX, LAB, and AGEGroup columns, as each of them holds a fixed set of groups.

  • Gender:

  • Errors: 3 records.
    The data documentation states that the SEX field can hold the values 1 and 2, representing Male and Female respectively. The previous image depicts the 3 records holding incorrect values and how they were detected using OpenRefine facets.


  • LAB:

  • Errors: 3 records.
    The LAB field holds values from 1 to 6, as per the data documentation. OpenRefine facets made it easy to detect out-of-bounds values, as the figure shows.


  • Age Group:

  • Errors: 1 record.
    OpenRefine detected a blank field in the AGEGroup data, as shown in the figure.
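The checks in this first step were done with OpenRefine facets. As a rough programmatic counterpart, the Python sketch below flags values outside an allowed set and blank cells. The allowed codes for SEX and LAB come from the data documentation quoted above; the function names and the sample rows are hypothetical and are not taken from the homework dataset.

```python
# Allowed codes per column. The SEX and LAB codes come from the report;
# AGEGroup is only checked for blanks because its codes are not listed.
ALLOWED = {
    "SEX": {"1", "2"},
    "LAB": {"1", "2", "3", "4", "5", "6"},
}


def find_invalid(rows, column, allowed):
    """Return (row_index, value) pairs whose value is not in the allowed set."""
    return [
        (i, row.get(column, ""))
        for i, row in enumerate(rows)
        if row.get(column, "").strip() not in allowed
    ]


def find_blank(rows, column):
    """Return the indices of rows where the column is missing or empty."""
    return [i for i, row in enumerate(rows) if not row.get(column, "").strip()]


# Hypothetical sample rows, read as text the way a CSV reader would return them.
sample_rows = [
    {"SEX": "1", "LAB": "3", "AGEGroup": "2"},
    {"SEX": "12", "LAB": "6", "AGEGroup": "1"},
    {"SEX": "2", "LAB": "21", "AGEGroup": ""},
]

bad_sex = find_invalid(sample_rows, "SEX", ALLOWED["SEX"])
bad_lab = find_invalid(sample_rows, "LAB", ALLOWED["LAB"])
blank_age_group = find_blank(sample_rows, "AGEGroup")
print(bad_sex, bad_lab, blank_age_group)
```

Each flagged row index can then be looked up in the original data, which mirrors the manual check described above.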

Second: Check for outliers in the other columns using scatter plots to spot incoherent data points.

  • AGE:

  • Errors: 3 records.

  • CAMMOL:

  • Errors: 14 records (1 mistyped and 13 recorded in different units).

  • PHOSMMOL:

  • Errors: 3 records, caused by typing the values in a different order.
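The outliers in this second step were spotted visually on scatter plots. A numeric screen can complement the visual check; the sketch below uses the common 1.5 × IQR rule, which is a swapped-in technique and not the method used in this report. The readings are hypothetical, with one value standing in for a record entered in a different unit.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical calcium readings; 9.60 stands in for a value recorded
# in a different unit than the rest.
cammol = [2.30, 2.40, 2.35, 2.45, 2.38, 2.42, 9.60, 2.33]
outliers = iqr_outliers(cammol)
print(outliers)
```

A flagged value still needs a manual look, since the rule cannot tell a unit mismatch from a typing mistake.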

Blank Records: 6 records.
  • AGE: 1 record that was empty in the original data. The discrete nature of the variable made averaging impractical, so I omitted the record from further use.
  • ALKPHOS: 1 record, restored from the original data.
  • LAB: 1 record that was empty in the original data. The discrete nature of the variable made averaging impractical, so I omitted the record from further use.
  • CAMMOL: 1 record that was blank in the original data. The value was restored from the record with the nearest AGE, PHOSMMOL, and ALKPHOS values, in that order.
  • PHOSMMOL: 1 record that was blank in the original data. The value was restored from the record with the nearest AGE, CAMMOL, and ALKPHOS values, in that order.
  • AGEGroup: 1 record (empty in the original data).
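The restoration of the blank CAMMOL and PHOSMMOL values can be sketched as a nearest-record lookup. The Python below is one possible reading of "nearest values in order": it assumes the closest record is chosen by AGE first, with the remaining fields used only to break ties. The function name and the sample records are hypothetical.

```python
def restore_from_nearest(records, target, missing, keys):
    """Copy `missing` from the complete record closest to `target`.

    Closeness is compared key by key, in the order given, so a later key
    only breaks ties left by the earlier ones.
    """
    candidates = [
        r for r in records
        if r is not target
        and r.get(missing) is not None
        and all(r.get(k) is not None for k in keys)
    ]
    if not candidates:
        return None
    nearest = min(
        candidates,
        key=lambda r: tuple(abs(r[k] - target[k]) for k in keys),
    )
    return nearest[missing]


# Hypothetical records, not taken from the homework dataset.
records = [
    {"AGE": 70, "PHOSMMOL": 1.10, "ALKPHOS": 80, "CAMMOL": 2.30},
    {"AGE": 72, "PHOSMMOL": 1.20, "ALKPHOS": 90, "CAMMOL": 2.40},
    {"AGE": 72, "PHOSMMOL": 1.05, "ALKPHOS": 85, "CAMMOL": 2.35},
]
target = {"AGE": 72, "PHOSMMOL": 1.07, "ALKPHOS": 88, "CAMMOL": None}

restored = restore_from_nearest(records, target, "CAMMOL", ["AGE", "PHOSMMOL", "ALKPHOS"])
print(restored)
```

Two records share the target's AGE here, so PHOSMMOL decides between them and the third record supplies the value.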

Summary Analysis (using R)

Discrete Values

  • SEX: 178 records
  • LAB: 177 records
  • AGEGroup: 177 records

Continuous Values

By Gender:
  • Male
                     ALKPHOS     CAMMOL      PHOSMMOL
    Mean             99.21839    2.393837    1.148605
    Median           83          2.4         1.145
    St. deviation    35.74438    0.1402868   0.1599718
    Min              43          2           0.81
    Max              219         2.75        1.61

  • Female
                     ALKPHOS     CAMMOL      PHOSMMOL
    Mean             84.84615    2.308242    1.05044
    Median           83          2.33        1.07
    St. deviation    24.32188    0.1778676   0.2082995
    Min              9           1.05        0.09
    Max              168         2.58        1.42

By LABs:
  • ALKPHOS
                     LAB 1      LAB 2      LAB 3      LAB 4      LAB 5      LAB 6
    Mean             94.79545   86.83      83.38      118.43     70.63      84.33
    Median           85.5       84.5       72.5       111        67         78.5
    St. deviation    31.34281   27.03      30.94      36.58      18.78      26.26
    Min              50         9          42         83         45         57
    Max              219        168        138        213        111        122

  • CAMMOL
                     LAB 1      LAB 2      LAB 3      LAB 4      LAB 5      LAB 6
    Mean             2.303908   2.422381   2.351875   2.445      2.31091    2.36
    Median           2.3        2.4        2.375      2.465      2.3        2.375
    St. deviation    0.188      0.101      0.148      0.137      0.155      0.087
    Min              1.05       2.23       2          2.13       2.15       2.23
    Max              2.7        2.75       2.53       2.65       2.63       2.48

  • PHOSMMOL
                     LAB 1      LAB 2      LAB 3      LAB 4      LAB 5      LAB 6
    Mean             1.096136   1.129268   1.036250   1.034286   1.084545   1.21
    Median           1.130      1.130      1.020      1.085      1.130      1.190
    St. deviation    0.204      0.158      0.158      0.195      0.262      0.0899
    Min              0.09       0.87       0.84       0.65       0.52       1.10
    Max              1.61       1.42       1.42       1.32       1.49       1.36
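The summary figures in this section were computed in R. For readers who want to reproduce the same kind of grouped summary elsewhere, the sketch below does it in Python with the standard library. It uses the sample standard deviation (dividing by n - 1), which is also what R's `sd` computes. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarize_by(rows, factor, measure):
    """Group rows by `factor` and summarize the `measure` values of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[measure])
    return {
        level: {
            "mean": mean(vals),
            "median": median(vals),
            # The sample standard deviation needs at least two values.
            "sd": stdev(vals) if len(vals) > 1 else float("nan"),
            "min": min(vals),
            "max": max(vals),
        }
        for level, vals in groups.items()
    }


# Hypothetical rows, not taken from the homework dataset.
rows = [
    {"SEX": 1, "ALKPHOS": 80},
    {"SEX": 1, "ALKPHOS": 90},
    {"SEX": 1, "ALKPHOS": 100},
    {"SEX": 2, "ALKPHOS": 70},
    {"SEX": 2, "ALKPHOS": 90},
]

summary = summarize_by(rows, "SEX", "ALKPHOS")
for level, stats in sorted(summary.items()):
    print(level, stats)
```

Passing "LAB" as the factor instead of "SEX" would give the per-laboratory breakdown in the same shape.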

Tableau:

By Gender:

The previous images show the box plots produced in Tableau for ALKPHOS, CAMMOL, and PHOSMMOL respectively, using SEX as the factor.

By Labs:

The previous images show the box plots produced in Tableau for ALKPHOS, CAMMOL, and PHOSMMOL respectively, using LAB as the factor.

Messy vs. Clean Data:

Mean
           AGE     ALKPHOS   CAMMOL   PHOSMMOL
Clean      72.3    91.9      2.35     1.1
Messy      82.4    92        3.92     1.2

The mean varies most in the AGE and CAMMOL fields, where outliers far from the true values shifted the mean upward. PHOSMMOL showed only a slight change in the mean, as its detected errors were values in the same unit. ALKPHOS also showed only a slight change in the mean, as the only error detected in it was a blank field that was restored from the original data.


St. Deviation
           AGE     ALKPHOS   CAMMOL   PHOSMMOL
Clean      4.8     31.2      0.17     0.2
Messy      85.8    31.22     5.6      0.64

The standard deviation in AGE and CAMMOL confirms what the mean showed: both fields contained significant outliers. PHOSMMOL showed a smaller change in the standard deviation because its detected errors were not far from the true values. Again, ALKPHOS showed only a slight change in the standard deviation, due to the error described before.
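To illustrate why a few bad records move both statistics so much, the sketch below compares a small clean sample with a copy in which one value is mistyped. The ages are hypothetical and chosen only for the arithmetic; they are not taken from the homework dataset.

```python
from statistics import mean, stdev

# Hypothetical ages: the messy copy has 720 typed in place of 72.
clean_age = [70, 72, 74, 71, 73]
messy_age = [70, 720, 74, 71, 73]

# How far the single bad value moves the mean.
mean_shift = mean(messy_age) - mean(clean_age)

# How many times larger the sample standard deviation becomes.
sd_ratio = stdev(messy_age) / stdev(clean_age)

print(mean(clean_age), mean(messy_age), mean_shift)
print(stdev(clean_age), stdev(messy_age), sd_ratio)
```

The standard deviation grows by a much larger factor than the mean because each deviation is squared before it is averaged.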

Written Part:file.
