Effective Visualization Techniques: Developing experience with selecting data and presentations that are effective for the goals
Primary Language
R
We used R for processing the data as well as most of the visualization work. It is very intuitive to do the visualization directly after processing the data in a single language. Unlike Google Charts and other popular online mapping tools, which cannot draw county maps (we need access to the boundaries of the city of San Francisco), R has packages such as ggmap and ggplot2 that can conveniently draw county boundaries backed by US Census data. We also referenced the WebMaps package in R for drawing interactive maps online.
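As a rough illustration of the county-boundary drawing mentioned above, the sketch below pulls San Francisco's outline from the Census-derived county data shipped with the maps package and draws it with ggplot2; ggmap can additionally overlay such layers on map tiles. The snippet is illustrative rather than the exact code we used.

```r
library(ggplot2)   # map_data() wraps the county outlines from the maps package
library(maps)

# County boundary for the city and county of San Francisco
sf_outline <- subset(map_data("county", "california"), subregion == "san francisco")

ggplot(sf_outline, aes(long, lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "grey40") +
  coord_quickmap()
```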
Javascript
We also used JavaScript to work with Google's developer interface for Motion Charts.
R + Google Maps API
R also has interfaces to the Google Maps API, so we used this combination to illustrate our findings on a Google Maps canvas. This gives users the most intuitive way to zoom in and out and to access detailed address information for the film data we summarized.
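A minimal sketch of this combination using the googleVis gMap interface is shown below; it assumes the transformed data frame described later in this report, and the option values are illustrative.

```r
library(googleVis)

# gvisMap expects positions as a single "lat:long" text column
films$LatLng <- paste(films$Lat, films$Long, sep = ":")

sf_map <- gvisMap(films,
                  locationvar = "LatLng",   # marker positions
                  tipvar = "Title",         # text shown when a marker is clicked
                  options = list(showTip = TRUE, enableScrolling = TRUE, zoomLevel = 12))
plot(sf_map)   # renders the markers on a zoomable Google Maps canvas
```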
Time-based Motion Chart
We have ordinal data (ratings) and temporal data (release year), as well as a lot of textual information (address, genre, director). So we looked at the data through different combinations of these fields and plotted them against time. The motion chart is a good fit because it can reveal underlying patterns over time, and it suited our objective of exploring different combinations of dimensions.
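The sketch below shows how such a motion chart can be produced from R with the googleVis wrapper around the Google Motion Chart component mentioned earlier; the variable mapping is only illustrative, and the column names follow the transformed data set described later.

```r
library(googleVis)

mc <- gvisMotionChart(films,
                      idvar = "Title",                 # one bubble per movie
                      timevar = "Timeid",              # time axis (release-year encoding)
                      xvar = "Runtime",
                      yvar = "Rating_IMDB",
                      sizevar = "Popularity_IMDB.Votes",
                      colorvar = "Genre")
plot(mc)   # opens the interactive chart in the browser
```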
Location-based Marking Graph
We used the latitude and longitude information to plot locations within a designated area (the city of San Francisco), with markers reflecting how many movies were shot at each location. For comparison over time, we also split the data into several time periods.
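A sketch of this graph is shown below; the period breaks and the ReleaseYear column name are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)
library(maps)

# Count how many movies were shot at each location within each time period
films_binned <- films %>%
  mutate(Period = cut(ReleaseYear, breaks = c(1915, 1960, 1990, 2015),
                      include.lowest = TRUE, dig.lab = 4)) %>%
  count(Period, Lat, Long, name = "Movies")

sf_outline <- subset(map_data("county", "california"), subregion == "san francisco")

ggplot() +
  geom_polygon(data = sf_outline, aes(long, lat, group = group),
               fill = "grey95", colour = "grey50") +
  geom_point(data = films_binned, aes(Long, Lat, size = Movies),
             colour = "steelblue", alpha = 0.6) +
  coord_quickmap() +
  facet_wrap(~ Period)
```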
Source and Approach: Developing experience with visually documenting the data
Data Set
Film Locations in San Francisco
Content
Listing of filming locations of movies shot in San Francisco from 1915 to 2015.
Columns
Raw data from the website:
| Column Name | Type | Comment |
| --- | --- | --- |
| Title | Text | The title of the movie |
| Release Year | Date | Year of release (yyyy) |
| Locations | Text | Description of the location. Mostly not geo-encoded, but the name people commonly call that place. |
| Fun Facts | Text | Some interesting additional information |
| Production Company | Text | Production company of the movie |
| Distributor | Text | Distributor of the movie |
| Director | Text | Director of the movie |
| Writer | Text | Writer of the movie |
| Actor 1 | Text | Main actor 1 |
| Actor 2 | Text | Main actor 2 |
| Actor 3 | Text | Main actor 3 |
Cleaning and Transforming
First, we removed duplicated rows. Only 3 of the 1,067 rows were duplicates, which makes them hard to detect, but removing them matters for the motion chart, which manipulates many dimensions: we need to ensure the primary key uniquely identifies every row in the data set.
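A minimal sketch of this step is shown below; the raw file name and the choice of key columns are assumptions for illustration.

```r
raw <- read.csv("Film_Locations_In_San_Francisco.csv", stringsAsFactors = FALSE)

# Treat Title + Release Year + Locations as the candidate primary key
key_cols <- c("Title", "Release.Year", "Locations")
films <- raw[!duplicated(raw[, key_cols]), ]

nrow(raw) - nrow(films)   # number of duplicate rows dropped (3 in our data)
```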
Second, we used the location description as the query keyword to get precise latitude and longitude via the Google Places text search API. The location descriptions come in two formats. If a description contains no parentheses, we use the whole description as the query. If it contains parentheses, we use the substring before the parenthesis as query 1 and the substring after the parenthesis as query 2, which gives two candidate coordinate pairs. We then compare which result is closer to the center of San Francisco (latitude 37.77493, longitude -122.4194) and keep the closer one.
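A minimal sketch of this geocoding logic is shown below; it assumes the httr and jsonlite packages and a valid Google API key, and the helper names are illustrative rather than the exact code we used.

```r
library(httr)
library(jsonlite)

sf_center <- c(37.77493, -122.4194)

geocode_place <- function(query, api_key) {
  resp <- GET("https://maps.googleapis.com/maps/api/place/textsearch/json",
              query = list(query = query, key = api_key))
  res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  if (!identical(res$status, "OK")) return(c(NA_real_, NA_real_))
  c(res$results$geometry$location$lat[1], res$results$geometry$location$lng[1])
}

# Of two candidate coordinates, keep the one closer to the center of San Francisco
closer_to_sf <- function(a, b) {
  da <- sum((a - sf_center)^2)
  db <- sum((b - sf_center)^2)
  if (is.na(db) || (!is.na(da) && da <= db)) a else b
}

geocode_location <- function(loc, api_key) {
  if (!grepl("\\(", loc)) return(geocode_place(loc, api_key))
  before <- trimws(sub("\\(.*$", "", loc))           # substring before the parenthesis
  inside <- trimws(gsub("^.*\\(|\\).*$", "", loc))   # substring after the parenthesis
  closer_to_sf(geocode_place(before, api_key), geocode_place(inside, api_key))
}
```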
Third, we removed results with abnormal latitude or longitude, i.e., coordinates outside the bounds of San Francisco. Since all of the locations should be in San Francisco, such results most likely came back wrong from the place search API, so we treated them as outliers and removed them.
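The filter itself can be as simple as a bounding-box check; the cutoff values below are approximate and illustrative, not the exact thresholds we used.

```r
# Drop geocoded points that fall outside an approximate San Francisco bounding box
in_sf <- with(films, Lat > 37.70 & Lat < 37.84 & Long > -122.52 & Long < -122.35)
films <- films[!is.na(in_sf) & in_sf, ]
```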
Finally, we used the OMDb API to get additional information about each movie: runtime, genre, IMDb rating, and IMDb votes. This concludes the back-end data processing. The result is then imported into R, where we further remove any NA or null values as noted above, and sort and aggregate the data to ensure a working data set for the visualization dashboard.
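A sketch of the per-title OMDb lookup is shown below; it assumes the httr and jsonlite packages and an OMDb API key, and the helper is illustrative of the enrichment step rather than our exact implementation.

```r
library(httr)
library(jsonlite)

omdb_info <- function(title, year, api_key) {
  resp <- GET("http://www.omdbapi.com/",
              query = list(t = title, y = year, apikey = api_key))
  res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  if (identical(res$Response, "False")) {
    return(c(Runtime = NA, Genre = NA, Rating_IMDB = NA, Popularity_IMDB.Votes = NA))
  }
  c(Runtime = res$Runtime,                 # e.g. "105 min"; stripped to minutes later
    Genre = res$Genre,
    Rating_IMDB = res$imdbRating,
    Popularity_IMDB.Votes = res$imdbVotes) # e.g. "123,456"; commas removed later
}
```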
The final version of the data set keeps 1,010 of the 1,067 rows from the raw data, and it has the following fields.

Workable data after transformation:
| Column Name | Type | Comment |
| --- | --- | --- |
| Title | Text | The title of the movie |
| Timeid | Date | A special id that encodes each combination of "Release Year" and "Title" so that it uniquely identifies a movie; this satisfies the motion chart's timeid requirement. |
| Location | Text | Same as in the raw data set: a description of the location (whether geo-encoded information is included or just the name people call the place). |
| Lat | Double | Latitude, specified to 0.00001 |
| Long | Double | Longitude, specified to 0.0001 |
| LatLng | Text | Latitude and longitude combined with a colon separator ("Latitude:Longitude"), to comply with the Google gMap visualization package |
| Runtime | Integer | Total running time of the movie in minutes |
| Genre | Text | Genre of the movie |
| Rating_IMDB | Number | IMDb rating of the film, retrieved via the OMDb API |
| Popularity_IMDB.Votes | Number | IMDb vote count for the film, retrieved via the OMDb API |
| Director | Text | Director of the movie; same as the raw data |
* Data constructed from extra resources or through additional transformation, rather than taken directly from the primary data set, is marked in orange.
What We Learned & Future Work: Insights for the future
A deep understanding of the data is required to develop effective visualizations. It sounds general, but our experience in this project drove it home, especially when we needed to integrate the data with sources that were still to be defined: what will work with the original ecosystem of the data matters a lot.
Explore the data as much as possible to understand its implications. This means understanding data integrity, uniqueness, primary keys, and the dependencies across columns. Understanding the data also helped greatly in designing the interactive visualization, where users can properly customize the dimensions to gain insight.
Plan ahead for the infrastructure, especially the back-end processing, so that it works properly. This leads to a robust dashboard product that shows off the insights the data can offer.
In the future, we can apply our approach to more cities beyond San Francisco. We could even help develop a richer tourism experience for people for whom movies are an inevitable part of life!