CS 725/825 - Information Visualization
Spring 2017: Wednesdays, 9:30am-12:15pm, E&CS 2120


Visualization Implementation 2

Due: January 25, 2017 before 9:30am

The goal of this week's assignment is to have you use Trifacta Wrangler or OpenRefine to clean up some messy data. Later in the semester, you will work on a project using real-world data. This data will be messy. Learning how to use tools like these now will save you a ton of time later in the semester.

Background

Academics typically publish their research findings in journals. Most journals are subscription-based and require readers (or their university libraries) to pay for a subscription in order to read an article. Some journals now allow authors to pay an article processing charge (APC) that lets readers access the article free of charge, making the article "open access" (OA). The data you will examine for this assignment comes from a list of articles for which this APC was paid.

Data

The data in vi2-data-journals.xlsx contains information about fees paid for publishing 2128 articles in academic journals. The accompanying file, vi2-data-README.pdf, provides information on the fields in the spreadsheet. Note that the data comes from a UK organization (the Wellcome Trust), so the amounts are in British pounds.

Original data source: https://figshare.com/articles/Wellcome_Trust_APC_spend_2012_13_data_file/963054
Blog post about the data: http://web.archive.org/web/20150816201006/https://biomickwatson.wordpress.com/2014/03/25/biologists-this-is-why-bioinformaticians-hate-you/

Assignment

Use either Trifacta Wrangler or OpenRefine to clean this data. Cleaning the data should include making the publisher and journal names uniform and ensuring that the formatting of all of the data fields is consistent. Then use another tool (or write your own) to convert the cleaned spreadsheet into a JSON data file. We will use this JSON data file in a future assignment.
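If you decide to write your own converter, the following is one minimal sketch of the conversion step, not a required approach. It assumes you have first exported the cleaned data from your cleaning tool as a CSV file; the file names vi2-data-clean.csv and vi2-data-clean.json are hypothetical placeholders, and the column names are taken from whatever header row your export contains.

```python
import csv
import io
import json


def csv_text_to_records(csv_text):
    """Parse CSV text into a list of dicts, one dict per article row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Trim stray whitespace from headers and values; skip overflow
        # cells that DictReader stores under the key None.
        records.append({
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        })
    return records


def convert_file(csv_path, json_path):
    """Read a CSV export and write it out as a JSON array of objects."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        records = csv_text_to_records(src.read())
    with open(json_path, "w", encoding="utf-8") as dst:
        # ensure_ascii=False keeps characters such as the pound sign readable.
        json.dump(records, dst, indent=2, ensure_ascii=False)
    return len(records)


if __name__ == "__main__":
    import os

    # Hypothetical file names; replace them with the names of your own files.
    if os.path.exists("vi2-data-clean.csv"):
        count = convert_file("vi2-data-clean.csv", "vi2-data-clean.json")
        print(f"Wrote {count} records")
```

Every value is written as a string here. If a later assignment needs numeric amounts, you would convert those fields before calling json.dump.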

Create a project in GitLab to hold the original data file, the cleaned data file, and your final JSON data file.

The README.md in that project should describe what you did to clean the data. In particular, include

  • a description of the tool you chose
  • the inconsistencies you found in the data
  • the steps you took to clean the data
  • the tool you used (or wrote) to convert the data to JSON
  • a paragraph on the benefits of the tool you used over manually cleaning the data using something like Excel
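
When documenting the inconsistencies you found, it can help to see why spellings that look different to a computer are really the same name. The sketch below is loosely modeled on the "fingerprint" keying idea that OpenRefine offers for clustering; it is an illustration of the idea, not a reproduction of the tool's implementation. The function names and the sample spellings are made up for this example and are not taken from the data file.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Reduce a name to a key that ignores case, accents, punctuation, and word order."""
    # Fold accented characters to plain ASCII, then lowercase and trim.
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    folded = folded.lower().strip()
    # Drop punctuation, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", "", folded).split()
    # Sort and de-duplicate the tokens so word order does not matter.
    return " ".join(sorted(set(tokens)))


def cluster_names(names):
    """Group names by fingerprint and keep only groups with several distinct spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Made-up spellings for illustration only.
    sample = ["Elsevier", "ELSEVIER ", "elsevier.", "Wiley"]
    for key, variants in cluster_names(sample).items():
        print(key, "<-", variants)
```

A keying approach like this only catches variants that differ in case, spacing, punctuation, accents, or word order. Abbreviations and misspellings need other methods, or a manual pass.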

Your project must be private (from the Settings icon, choose 'Edit Project', then change 'Visibility Level' to 'Private'), and you must add me as a guest member (from the Settings icon, choose 'Members'; my username is mweigle).

Submission: Submit the URL of your GitLab project in Blackboard (under "Visualization Implementations" > VI2).