CS 725/825 - Information Visualization
Spring 2019: Wednesdays, 9:30am-12:15pm, Dragas 1102

Homework 2

Due: January 30, 2019 before 9:30am

The goal of this week's assignment is to gain experience using OpenRefine for data cleaning. Later in the semester, you will work on assignments using real-world data. This data will be messy. Learning how to use this tool now will save you a ton of time later in the semester.

Setup

Create your project

  • Follow the instructions at How to create a project in GitLab to create a new project.
  • Create a README.md file (see Markdown) that includes your name, CS 725/825, and Spring 2019. For this assignment, you'll write your solution in this project's README.md. Make sure to use Markdown formatting to make your write-up neat and easy to read.
  • Add me (mweigle) as a Reporter on your project -- if this step is not done, I will not be able to grade your assignment

Tasks

Install

Download and install OpenRefine

Tutorial

Work through the tutorial at http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial through "Export Data". Put the answers to the questions I ask below in your project README.md.

  • Under "Clean up country names", what other countries had issues with spelling? List the variations and explain how you discovered them in your project README.md. Be specific about the string comparison methods and keying functions you used. Think about the information that someone else would need to replicate your work.
  • Under "Clean up values for the endowment", report the number of entries that used the term "million" or "Million" in the endowment column. Also report the number that used the word "billion" or "Billion".
  • Under "Finding issues in other columns", identify and report on issues that you find in at least one other column (other than the country column shown in the tutorial).
  • Under "Exploring the data with scatter plots", export the endowment (x) vs. numStudents (y) plot, save it into your GitLab project, and insert the image into your project README.md. Is there a correlation between endowment and number of students?
  • After completing the "Geocoding names and addresses" section, export your cleaned data file as a CSV (comma-separated values) file and add this file to your GitLab project. How many rows did you end up with?

You can skip the "Geocoding names and addresses" section, but still export your cleaned data file. -MCW 1/29/19
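
To make the idea of a "keying function" more concrete, here is a minimal Python sketch of a fingerprint-style key, similar in spirit to the fingerprint method offered in OpenRefine's clustering dialog. It is an approximation written for illustration, not OpenRefine's actual implementation, and the function names `fingerprint` and `cluster` and the sample values are hypothetical (they are not taken from the tutorial dataset). Use OpenRefine itself for the assignment; this only shows why differently typed values can end up in the same cluster.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Approximate a fingerprint-style clustering key for a string.

    Illustrative sketch only; OpenRefine's own method may differ in details.
    """
    # Trim surrounding whitespace and lowercase the text.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a key; keep only groups with 2+ variants."""
    groups = {}
    for v in values:
        groups.setdefault(fingerprint(v), set()).add(v)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}


# Hypothetical sample values, not from the tutorial dataset.
samples = ["United States", "united states ", "States, United", "Canada"]
print(cluster(samples))
```

Values that differ only in case, spacing, punctuation, or word order map to the same key, which is the kind of detail worth reporting when you describe which method found each variation.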

Exercise

The last part of the tutorial is the section "More Data Sets - Is the 27 Club Real?". Use OpenRefine to determine how many musicians in the dataset died at age 27. Only use OpenRefine for this -- creating a chart in Excel or another tool is not necessary. Export your final data file as CSV and add it to your GitLab project.

In your project README.md, explain the steps you took to clean and analyze the data to reach your conclusion.

Important: Your write-up for this section is the most important part of this assignment. You need to include enough detail to convince me that you understand how to use OpenRefine. In addition, you will lose points if there are many spelling or grammatical errors or if your write-up does not use appropriate Markdown markup for clarity and neatness.

Submission

Submit the URL of your solution GitLab project in Blackboard:

  • Click on HW2 under Homeworks
  • Under "Assignment Submission", click the "Write Submission" button.
  • Copy/paste the URL of your project (should be something like https://git-community.cs.odu.edu/username/projectName/) into the edit box and make the link clickable
  • Double-check your link to make sure that it is working properly -- do not link directly to a file ending in .git
  • Make sure to "Submit" your assignment.