Effective Visualization Techniques: Developing experience with selecting data and presentations that are effective for the goals
Primary Language
R
We used R for processing the data as well as most of the visualization work. It is very intuitive to do the visualization directly after processing the data in a single language. Unlike Google Charts and other popular online mapping tools, which cannot draw county maps (we need access to the boundaries of the city of San Francisco), R has packages such as ggmap and ggplot2 that can conveniently draw county boundaries backed by US Census data. We also referenced the WebMaps package in R for drawing interactive maps online.
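As a rough illustration of the county-boundary drawing mentioned above, the sketch below pulls San Francisco's outline from the Census-derived county data shipped with the maps package and draws it with ggplot2; ggmap can additionally overlay such layers on map tiles. The snippet is illustrative rather than the exact code we used.

```r
library(ggplot2)   # map_data() wraps the county outlines from the maps package
library(maps)

# County boundary for the city and county of San Francisco
sf_outline <- subset(map_data("county", "california"), subregion == "san francisco")

ggplot(sf_outline, aes(long, lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "grey40") +
  coord_quickmap()
```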
Javascript
We also used JavaScript to work with Google's developer interface for Motion Charts.
R + Google Maps API
R also has interfaces to the Google Maps API, so we used this combination to illustrate our findings on a Google Maps canvas. This gives users the most intuitive way to zoom in and out and to access detailed address information for the film data we summarized.
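A minimal sketch of this combination using the googleVis gMap interface is shown below; it assumes the transformed data frame described later in this report, and the option values are illustrative.

```r
library(googleVis)

# gvisMap expects positions as a single "lat:long" text column
films$LatLng <- paste(films$Lat, films$Long, sep = ":")

sf_map <- gvisMap(films,
                  locationvar = "LatLng",   # marker positions
                  tipvar = "Title",         # text shown when a marker is clicked
                  options = list(showTip = TRUE, enableScrolling = TRUE, zoomLevel = 12))
plot(sf_map)   # renders the markers on a zoomable Google Maps canvas
```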
Time-based Motion Chart
We have ordinal data (ratings) and temporal data (release year), as well as a lot of textual information (address, genre, director). So we looked at the data through different combinations of these fields and plotted them against time. The motion chart is a good fit because it can reveal underlying patterns over time, and it suited our objective of exploring different combinations of dimensions.
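The sketch below shows how such a motion chart can be produced from R with the googleVis wrapper around the Google Motion Chart component mentioned earlier; the variable mapping is only illustrative, and the column names follow the transformed data set described later.

```r
library(googleVis)

mc <- gvisMotionChart(films,
                      idvar = "Title",                 # one bubble per movie
                      timevar = "Timeid",              # time axis (release-year encoding)
                      xvar = "Runtime",
                      yvar = "Rating_IMDB",
                      sizevar = "Popularity_IMDB.Votes",
                      colorvar = "Genre")
plot(mc)   # opens the interactive chart in the browser
```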
Location-based Marking Graph
We used the latitude and longitude information to plot locations within a designated area (the city of San Francisco), with markers reflecting how many movies were shot at each location. For comparison over time, we also split the data into several time periods.
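A sketch of this graph is shown below; the period breaks and the ReleaseYear column name are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)
library(maps)

# Count how many movies were shot at each location within each time period
films_binned <- films %>%
  mutate(Period = cut(ReleaseYear, breaks = c(1915, 1960, 1990, 2015),
                      include.lowest = TRUE, dig.lab = 4)) %>%
  count(Period, Lat, Long, name = "Movies")

sf_outline <- subset(map_data("county", "california"), subregion == "san francisco")

ggplot() +
  geom_polygon(data = sf_outline, aes(long, lat, group = group),
               fill = "grey95", colour = "grey50") +
  geom_point(data = films_binned, aes(Long, Lat, size = Movies),
             colour = "steelblue", alpha = 0.6) +
  coord_quickmap() +
  facet_wrap(~ Period)
```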
Source and Approach: Developing experience with visually documenting the data
Data Set
Film Locations in San Francisco
Content
Listing of filming locations of movies shot in San Francisco from 1915 to 2015.
Columns
Raw data from the website:
| Column Name | Type | Comment |
| --- | --- | --- |
| Title | Text | The title of the movie |
| Release Year | Date | Year of release (yyyy) |
| Locations | Text | Description of the location. Mostly not geo-encoded, but the name people commonly call that place. |
| Fun Facts | Text | Some interesting additional information |
| Production Company | Text | Production company of the movie |
| Distributor | Text | Distributor of the movie |
| Director | Text | Director of the movie |
| Writer | Text | Writer of the movie |
| Actor 1 | Text | Main actor 1 |
| Actor 2 | Text | Main actor 2 |
| Actor 3 | Text | Main actor 3 |
Cleaning and Transforming
First, we removed duplicated rows. Only 3 of the 1,067 rows were duplicates, which makes them hard to detect, but removing them matters for the motion chart, which manipulates many dimensions: we need to ensure the primary key uniquely identifies every row in the data set.
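A minimal sketch of this step is shown below; the raw file name and the choice of key columns are assumptions for illustration.

```r
raw <- read.csv("Film_Locations_In_San_Francisco.csv", stringsAsFactors = FALSE)

# Treat Title + Release Year + Locations as the candidate primary key
key_cols <- c("Title", "Release.Year", "Locations")
films <- raw[!duplicated(raw[, key_cols]), ]

nrow(raw) - nrow(films)   # number of duplicate rows dropped (3 in our data)
```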
Second, we used the location description as the query keyword to get precise latitude and longitude via the Google Places text search API. The location descriptions come in two formats. If a description contains no parentheses, we use the whole description as the query. If it contains parentheses, we use the substring before the parenthesis as query 1 and the substring after the parenthesis as query 2, which gives two candidate coordinate pairs. We then compare which result is closer to the center of San Francisco (latitude 37.77493, longitude -122.4194) and keep the closer one.
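A minimal sketch of this geocoding logic is shown below; it assumes the httr and jsonlite packages and a valid Google API key, and the helper names are illustrative rather than the exact code we used.

```r
library(httr)
library(jsonlite)

sf_center <- c(37.77493, -122.4194)

geocode_place <- function(query, api_key) {
  resp <- GET("https://maps.googleapis.com/maps/api/place/textsearch/json",
              query = list(query = query, key = api_key))
  res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  if (!identical(res$status, "OK")) return(c(NA_real_, NA_real_))
  c(res$results$geometry$location$lat[1], res$results$geometry$location$lng[1])
}

# Of two candidate coordinates, keep the one closer to the center of San Francisco
closer_to_sf <- function(a, b) {
  da <- sum((a - sf_center)^2)
  db <- sum((b - sf_center)^2)
  if (is.na(db) || (!is.na(da) && da <= db)) a else b
}

geocode_location <- function(loc, api_key) {
  if (!grepl("\\(", loc)) return(geocode_place(loc, api_key))
  before <- trimws(sub("\\(.*$", "", loc))           # substring before the parenthesis
  inside <- trimws(gsub("^.*\\(|\\).*$", "", loc))   # substring after the parenthesis
  closer_to_sf(geocode_place(before, api_key), geocode_place(inside, api_key))
}
```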
Third, we removed results with abnormal latitude or longitude, i.e., coordinates outside the bounds of San Francisco. Since all of the locations should be in San Francisco, such results most likely came back wrong from the place search API, so we treated them as outliers and removed them.
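The filter itself can be as simple as a bounding-box check; the cutoff values below are approximate and illustrative, not the exact thresholds we used.

```r
# Drop geocoded points that fall outside an approximate San Francisco bounding box
in_sf <- with(films, Lat > 37.70 & Lat < 37.84 & Long > -122.52 & Long < -122.35)
films <- films[!is.na(in_sf) & in_sf, ]
```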
Finally, we used the OMDb API to get additional information about each movie: runtime, genre, IMDb rating, and IMDb votes. This concludes the back-end data processing. The result is then imported into R, where we further remove any NA or null values as noted above, and sort and aggregate the data to ensure a working data set for the visualization dashboard.
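A sketch of the per-title OMDb lookup is shown below; it assumes the httr and jsonlite packages and an OMDb API key, and the helper is illustrative of the enrichment step rather than our exact implementation.

```r
library(httr)
library(jsonlite)

omdb_info <- function(title, year, api_key) {
  resp <- GET("http://www.omdbapi.com/",
              query = list(t = title, y = year, apikey = api_key))
  res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  if (identical(res$Response, "False")) {
    return(c(Runtime = NA, Genre = NA, Rating_IMDB = NA, Popularity_IMDB.Votes = NA))
  }
  c(Runtime = res$Runtime,                 # e.g. "105 min"; stripped to minutes later
    Genre = res$Genre,
    Rating_IMDB = res$imdbRating,
    Popularity_IMDB.Votes = res$imdbVotes) # e.g. "123,456"; commas removed later
}
```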
The final version of the data set keeps 1,010 of the 1,067 rows from the raw data, and it has the following fields.

Workable data after transformation:
| Column Name | Type | Comment |
| --- | --- | --- |
| Title | Text | The title of the movie |
| Timeid | Date | A special id that encodes each combination of "Release Year" and "Title" so that it uniquely identifies a movie; this satisfies the motion chart's timeid requirement. |
| Location | Text | Same as in the raw data set: a description of the location (whether geo-encoded information is included or just the name people call the place). |
| Lat | Double | Latitude, specified to 0.00001 |
| Long | Double | Longitude, specified to 0.0001 |
| LatLng | Text | Latitude and longitude combined with a colon separator ("Latitude:Longitude"), to comply with the Google gMap visualization package |
| Runtime | Integer | Total running time of the movie in minutes |
| Genre | Text | Genre of the movie |
| Rating_IMDB | Number | IMDb rating of the film, retrieved via the OMDb API |
| Popularity_IMDB.Votes | Number | IMDb vote count for the film, retrieved via the OMDb API |
| Director | Text | Director of the movie; same as the raw data |
* Data constructed from extra resources or through additional transformation, rather than taken directly from the primary data set, is marked in orange.
What We Learned & Future Work: Insights for the future
A deep understanding of the data is required to develop effective visualizations. It sounds general, but our experience in this project drove it home, especially when we needed to integrate the data with sources that were still to be defined: what will work with the original ecosystem of the data matters a lot.
Explore the data as much as possible to understand its implications. This means understanding data integrity, uniqueness, primary keys, and the dependencies across columns. Understanding the data also helped greatly in designing the interactive visualization, where users can properly customize the dimensions to gain insight.
Plan ahead for the infrastructure, especially the back-end processing, so that it works properly. This leads to a robust dashboard product that shows off the insights the data can offer.
In the future, we can apply our approach to more cities beyond San Francisco. We could even help develop a richer tourism experience for people for whom movies are an inevitable part of life!