CS 795/895 - Information Visualization
Fall 2012: Tues/Thurs 3-4:15pm, E&CS 2120

Homework 5: Visualization of Real-World Data

Assigned: Thursday, Oct 25, 2012
Due: Tuesday, Nov 20, 2012 (3 1/2 weeks)

You may work in groups (2 persons max) on this assignment. You may work with your final project partner or form a different group for this assignment.


The purpose of this assignment is to introduce you to some real-world problems and allow you to come up with an interactive visualization to address one of them.


Record Linkage

Dataset: about 2000 voter registration records from two different dates, with different amounts of added error (see me for the data files)

Our colleagues at UNC have gathered voter registration data (~2000 records) from the same zip code. Voter registration data is updated weekly, and we have datasets from two different dates (July 9, 2012 and Aug 29, 2012).

Develop an interactive visualization that would assist an analyst in working with this data. Specifically, your visualization should allow an analyst to complete the following tasks:

  1. determine what information is in the Aug dataset that doesn't exist in the July dataset (i.e., newly registered voters)
  2. determine what information is in the July dataset that doesn't exist in the Aug dataset (i.e., purged voters)
  3. identify the duplicate records in both datasets (i.e., the same voter appearing more than once in a single dataset)
  4. identify the twins in the dataset
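As a starting point for the four tasks above, here is a minimal sketch of exact-match comparison between two snapshots. The field names (`last_name`, `first_name`, `birth_date`) and the sample records are hypothetical and not taken from the actual data files. Exact matching is only expected to work on the clean version; the versions with added error would need approximate matching from a record linkage tool.

```python
from collections import Counter


def record_key(record):
    """Build a comparison key from a record (hypothetical field names)."""
    return (
        record["last_name"].strip().lower(),
        record["first_name"].strip().lower(),
        record["birth_date"],
    )


def compare_snapshots(july, august):
    """Exact-match comparison of two snapshots (tasks 1-3)."""
    july_keys = Counter(record_key(r) for r in july)
    august_keys = Counter(record_key(r) for r in august)
    return {
        "new": sorted(set(august_keys) - set(july_keys)),
        "purged": sorted(set(july_keys) - set(august_keys)),
        "july_duplicates": sorted(k for k, n in july_keys.items() if n > 1),
        "august_duplicates": sorted(k for k, n in august_keys.items() if n > 1),
    }


def twin_candidates(records):
    """Flag groups sharing last name and birth date but not first name (task 4)."""
    groups = {}
    for r in records:
        key = (r["last_name"].strip().lower(), r["birth_date"])
        groups.setdefault(key, set()).add(r["first_name"].strip().lower())
    return sorted(k for k, names in groups.items() if len(names) > 1)


# Made-up sample records, for illustration only.
july = [
    {"last_name": "Smith", "first_name": "Ann", "birth_date": "1980-01-02"},
    {"last_name": "Smith", "first_name": "Ann ", "birth_date": "1980-01-02"},
    {"last_name": "Jones", "first_name": "Bo", "birth_date": "1975-05-05"},
]
august = [
    {"last_name": "Smith", "first_name": "Ann", "birth_date": "1980-01-02"},
    {"last_name": "Lee", "first_name": "Cy", "birth_date": "1990-09-09"},
    {"last_name": "Lee", "first_name": "Di", "birth_date": "1990-09-09"},
]

result = compare_snapshots(july, august)
twins = twin_candidates(august)
```

Note that the twin check is only a heuristic: a typo in a first name would make a duplicate look like a twin, so the candidates still need review by the analyst.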

Your visualization should be able to help the analyst with these tasks for all versions of the datasets:

  • clean (no errors introduced, but may have duplicates)
  • 10% error (30% duplicate error)
  • 20% error (60% duplicate error)

Note about the error: New (10/26 -MCW)

  • 10% error - total % of errors in the data set (i.e., 10% of the total records in the data set were altered)
  • 30% duplicate error - total % of errors introduced in records marked as duplicate - 30% of the total duplicate records

For 10% error (30% duplicate error), if there are 100 total records with 20 duplicates, then 10 errors were introduced in any rows (duplicate or not), and 6 additional errors (30% of 20) were then generated in the duplicate records.
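The arithmetic in this example can be sketched as a small helper. The function name and its rounding are assumptions for illustration; the notes do not say how fractional counts are handled or whether one record can be altered more than once.

```python
def expected_errors(total_records, duplicate_records, error_rate, duplicate_error_rate):
    """Return (general errors, extra duplicate errors, total errors introduced)."""
    # Errors introduced in any row, duplicate or not.
    general = round(total_records * error_rate)
    # Additional errors introduced only in records marked as duplicate.
    extra = round(duplicate_records * duplicate_error_rate)
    return general, extra, general + extra


# The example from the note: 100 records, 20 duplicates, 10% / 30%.
ten_percent = expected_errors(100, 20, 0.10, 0.30)
# The same records under the 20% / 60% version.
twenty_percent = expected_errors(100, 20, 0.20, 0.60)
```

For the 10% version this gives 10 general errors plus 6 duplicate errors; for the 20% version, 20 plus 12.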

Here's a bit more information about the voter data - HW5-voterdata

It is suggested that you use a tool such as the CDC's LinkPlus (Windows) to perform the record linkage and focus your efforts on visualizing the results of the linkage.

Navy Medical Data (Hearing Conservation)

Dataset: 700k+ records in a Microsoft Access database (see me for the datafile)

The goal of the Navy Hearing Conservation Program is to protect hearing and prevent hearing loss.

Develop an interactive visualization that allows the doctors to analyze at least the following factors (add more if you find interesting relationships):

  • hearing over time
  • hearing vs. age
  • hearing vs. amount of time in the program

Here's a bit more background information about the program and how the data was collected.


  • a beep is played at different frequencies with increasing volume; the participant raises a hand when they hear the sound
  • audiograms taken at different times over service period
  • data for over 20 years
  • measurements are in dBA
  • higher value means the volume was louder before the participant heard the beep


  • differences of 5 dB are not significant (limit of what the device can detect)
  • 3-4 dB increase means a doubled impulse noise
  • must speak about twice as loud as background noise to be understood
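Because decibels are logarithmic, small differences in the data correspond to large physical ratios. The sketch below uses the standard decibel definition for a power (intensity) ratio; treating the "doubled" figure in the notes as a power ratio is an assumption, and the function names are made up for illustration.

```python
import math


def power_ratio(db_change):
    """Power (intensity) ratio corresponding to a change in decibels."""
    return 10 ** (db_change / 10)


def db_for_ratio(ratio):
    """Decibel change corresponding to a power (intensity) ratio."""
    return 10 * math.log10(ratio)


ratio_3db = power_ratio(3)    # close to 2: about 3 dB doubles the power
db_double = db_for_ratio(2)   # close to 3.01 dB
ratio_5db = power_ratio(5)    # ratio at the 5 dB detection limit noted above
```

A visualization might use this to decide whether a linear or logarithmic axis better conveys changes in hearing thresholds.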

Noise Induced Hearing Loss (NIHL)

  • The theory is that exposure to a massive noise causes a large, immediate hearing loss, whereas with long-term noise exposure there could be a 5-year lag before you notice hearing loss.
    • Note that only people exposed to 85-100 dB are in the program.
  • 4000-6000 Hz range is most affected by NIHL
    • Hearing loss in this range particularly affects language and speech recognition because consonants are often in the higher frequencies
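One way to bring out the 4000-6000 Hz range in a visualization is to average the thresholds in that band for each audiogram and compare them with the lower frequencies. This is a minimal sketch: the audiogram layout (a dict mapping frequency in Hz to threshold) and the sample values are hypothetical and do not reflect the actual Access database schema.

```python
def band_average(audiogram, low_hz=4000, high_hz=6000):
    """Average threshold over the frequencies within [low_hz, high_hz]."""
    values = [level for freq, level in audiogram.items() if low_hz <= freq <= high_hz]
    if not values:
        return None  # no measurements in this band
    return sum(values) / len(values)


# Made-up audiogram: frequency (Hz) -> threshold, higher means worse hearing.
audiogram = {500: 10, 1000: 10, 2000: 15, 3000: 20, 4000: 35, 6000: 45}

nihl_band = band_average(audiogram)             # 4000-6000 Hz
speech_band = band_average(audiogram, 500, 2000)
```

Tracking the gap between the two averages over a participant's service period is one possible derived measure; whether it is clinically meaningful is for the doctors to judge.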

Apps4VA Open Competition

Develop an interactive visualization targeted towards the Apps4VA competition. You do not have to enter the competition, but you are welcome to do so (deadline is Nov 15).

Competition website - http://www.apps4va.org/apps4va-open-competition.html

Detailed instructions - http://www.weebly.com/uploads/1/1/1/0/11104538/overview_entry_instructions091012.pdf

Data sets - http://www.apps4va.org/data.html

Your Own Research Data

If you choose this option, you must get approval from me (soon!) before starting.


Put an electronic version online at http://www.cs.odu.edu/~username/cs795f12/hw5.html. Include a link to your report on the web page, and submit a hard copy in class on the due date.

You must also write a report (posted on the webpage), detailing how you developed the visualization and how the visualization can be used. Describe what you did for all 7 steps of data visualization: acquire, parse, filter, mine, represent, refine, interact. (See Intro to Info Vis lecture and Visualizing Data text for more info on the steps.) If you chose the Record Linkage project, you must also include answers to the 4 tasks (provide screenshots that indicate the answer) and discuss how you can use your visualization to get the answer.


I will grade the assignment based on the quality of the developed visualization and how well it would help an analyst complete the stated tasks. I will also grade the quality of your report.