CUtunes Project Report


Updated December '05

This document is a complete reference for the CUtunes system. The entire report has been updated to reflect the changes made in the last semester. Major changes have been marked in red. The original report can be found here.


Blake Shaw (bs2018@columbia.edu)
Columbia University
New York, NY 10027

Work from the original report was contributed by:
Hart Lambur (hal2001@columbia.edu)
Lawrence Wang (levity@gmail.com)

Table of Contents

Abstract

CUtunes is a tool for exploring the music listening habits of the Columbia University community. We provide the following services:

New services:

Introduction to CUtunes

With iPods now holding over 60 gigs of music (more than 25,000 songs) and the average college student having more than 1500 MP3s on their computer, music is becoming more accessible than ever before. No longer do college students have to fill their rooms with boxes of vinyl records or racks of CDs -- new technologies allow us to access a tremendous amount of music instantaneously. But this 'overload' of music presents an interesting problem: how is one supposed to organize, explore, or know how to expand one's music collection?

To help solve this problem, we propose a new way to explore music in a small community, such as Columbia University. CUtunes offers a real-time 'Top 40' list, showing the most popular songs, albums and artists being played by students on their computers and iPods. We allow users to browse through this community by looking at profile pages for users, artists, albums, and songs. We compare the music libraries of our users to calculate a 'compatibility' rating identifying users with similar musical tastes and providing recommendations for new music based on this similarity. Furthermore, by analyzing the similarity between users and artists, we provide a tool for intelligent playlist generation, allowing the user to specify a group of users and/or artists, and say "make me a playlist that is like these items from my own music collection."

Accessibility is not the only benefit of a digital music format. For the first time we can collect precise data about the listening habits of a community. With this information, we can provide intelligent tools which can give music-listeners a better understanding of their musical tastes in a fashion which develops naturally out of a community of people exhibiting their musical preferences. The discovery of new music can be shifted away from the hands of advertisers, TV, and radio stations and can come from a more genuine source of information, the listening habits of the people of our community.

To demo the service, please log on to CUtunes.com and sign up. To read more about our implementation and how our service works, read on below.

Architecture

The CUtunes service involves four basic components:

New additions: A discussion of the design and implementation decisions made in creating the analysis components which produce recommendations and compatibility ratings, as well as allow for visualization, and smart playlist generation can be found in the Analysis section.

Client View

There are 4 main sections of the website: Me, Neighbors, Community, and Information pages.

Technologies Used

When a user goes to CUtunes.com, the following components and technologies are involved:

The web pages are served by Apache 2 with mod_python, a module that embeds the interpreter for the Python programming language within Apache. Among other features, mod_python provides a way to mix HTML and Python in a single page, called Python Server Pages (PSP), using syntax similar to JSP. It also provides a session object, which we use to handle user login and logout. All this is entirely server-side, of course, requiring no special support from the end-user's web browser.

We use embedded Python in the pages of CUtunes.com to retrieve data on songs and users from our MySQL database, using the Python module MySQLdb, which makes connecting to MySQL a simple matter. The data is then laid out in a readable manner with HTML, JavaScript, and CSS.

More detail on the PSP files is in the Program Documentation section.

Design Decisions

When initially designing the client architecture, we briefly considered writing a client that would display the user's top songs and other information directly on the desktop. However, it soon became apparent that writing a web-based system was the better option, for several reasons:

The system is fully usable by any Columbia student with a modern browser that supports Java applets and Flash.

Python was chosen due to existing familiarity with the language and because it has an extensive collection of third-party modules for interfacing with many different data formats easily and quickly. This turned out to be quite true in the case of python-ldap, which required only a small amount of time to understand and use, even with no prior experience using LDAP.

It is also interesting to note, though perhaps common enough now that it is taken for granted, that all the technologies used to implement this project are freely available. As mentioned earlier, the web front-end is constructed from a LAMP (Linux-Apache-MySQL-Python) stack, all of which are open source projects.

Client Update

The CUtunes client is the foundation of our system, as it is the sole means by which we acquire music data. In designing the client we knew it was essential that the software work reliably and be easy to use, so that as many people as possible would sign up for our service and keep using it.

iTunes MP3 player and the iTunes XML file

Early on we made the decision that CUtunes would support Apple's iTunes MP3 player exclusively. We made this choice for a number of important reasons. We believe that iTunes is the best and most popular MP3 player used in the Columbia community. Furthermore, the number of students with iPods (which require iTunes) continues to grow, and we wanted to be sure to capture the musical taste of those users. iTunes stores its music database (including play counts and last played dates) in an easily accessible XML file, and this XML file is updated with play counts from iPods whenever an iPod is 'synced' with its host computer. Apple's use of vanilla XML makes acquiring iTunes data relatively simple.

Choosing a SAX parser

To read the iTunes XML file, we had to decide on an XML parser and implementation. Due to the potentially large size of the XML file (users with ~10,000 songs have XML files of ~15 MB) we knew using a DOM parser to load the XML into memory didn't make much sense, especially since we didn't need to modify the DOM tree. Instead, we chose the standard SAX (Simple API for XML) parser. The SAX parser allows for only serial access (which is all we need), allowing for much faster reads with lower memory overhead. Performance comparisons showed the SAX parser to be much more effective than DOM.
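To illustrate the serial, low-memory style of reading described above, here is a minimal sketch. It is written in Python with the standard xml.sax module rather than the Java used by the actual client, and it assumes the property-list layout of the iTunes XML file (a top-level 'Tracks' dictionary holding one dictionary per track). The class name, function name, and sample data are hypothetical.

```python
import xml.sax


class ITunesTrackHandler(xml.sax.ContentHandler):
    """Collects one Python dict per track while streaming through the file."""

    def __init__(self):
        super().__init__()
        self.tracks = []
        self._dict_depth = 0      # nesting depth of <dict> elements
        self._in_tracks = False   # True while inside the top-level "Tracks" entry
        self._current = None      # the track currently being read, if any
        self._key = None          # most recent <key> text
        self._text = []           # character chunks for the current element

    def startElement(self, name, attrs):
        self._text = []
        if name == "dict":
            self._dict_depth += 1
            # Assumed layout: root dict (1) > Tracks dict (2) > track dict (3)
            if self._in_tracks and self._dict_depth == 3:
                self._current = {}

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text)
        if name == "key":
            self._key = text
            if self._dict_depth == 1:
                self._in_tracks = (text == "Tracks")
        elif name == "dict":
            if self._current is not None and self._dict_depth == 3:
                self.tracks.append(self._current)
                self._current = None
            self._dict_depth -= 1
        elif self._current is not None and self._dict_depth == 3 and self._key:
            if name == "integer":
                self._current[self._key] = int(text)
            elif name in ("string", "date"):
                self._current[self._key] = text
            self._key = None


def read_tracks(xml_bytes):
    """Parse iTunes-style XML bytes and return the list of track dicts."""
    handler = ITunesTrackHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.tracks


# Hypothetical, heavily trimmed sample of the file layout
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>1001</key>
    <dict>
      <key>Track ID</key><integer>1001</integer>
      <key>Name</key><string>Example Song</string>
      <key>Artist</key><string>Example Artist</string>
      <key>Play Count</key><integer>12</integer>
    </dict>
  </dict>
  <key>Playlists</key>
  <array>
    <dict>
      <key>Name</key><string>Library</string>
    </dict>
  </array>
</dict>
</plist>
"""

tracks = read_tracks(SAMPLE)
```

Because the handler keeps only the track currently being read plus the finished results, memory use does not depend on building a full document tree, which is the property that made SAX attractive for large libraries.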

The SAX parser has efficient, well-supported implementations in all the major languages. For quite a while we debated between using Java or C/C++ for our client software. Initially, we thought that C/C++ would allow for the cleanest, easiest-to-use client software with the nicest GUI, even though using C/C++ would require that we support two codebases, one for Mac OS X and one for Windows XP. However, we were more comfortable in Java, and quite liked Java's SAX implementation. Java has worked quite well and, due to other design considerations discussed below, we have since dropped our plans for a C/C++ client.

Choosing a messaging protocol (XML-RPC)

We needed a flexible system to transfer the information that our client program gathers to the CUtunes server. Initially we looked at many options: Java ServerSockets, J2EE, CGI GET and POST, REST, FTP, XML-RPC, and SOAP. Because we potentially wanted to move to a C/C++ client at a later date, we ruled out the Java-specific solutions. Furthermore, we didn't think GET/POST or REST would be robust enough for the amount of data being transferred. In the end, XML-RPC seemed to be the right lightweight and flexible protocol, and easier to implement than SOAP.

Nice XML-RPC packages exist for both C/C++ and Java, although we were particularly impressed with Apache XML-RPC and its robust yet simple Java implementation.

CUtunes Mac OS X client

When the idea of using a Java applet first surfaced back in the project's infancy, we were turned off by the idea because of bad applet experiences, ugly GUIs, and slow load times. We also thought applet security issues would restrict our ability to read files on the user's computer. Instead we chose to write a Java application that could be wrapped in OS-specific GUI code, and produced a pretty Mac OS X version.

CUtunes Java applet

When it came time to write a PC version, however, we played with the applet idea again. After investing considerable time researching the security restrictions, we realized applets can read from a user's computer if the JAR file is signed with a security certificate that the user must accept. We also found the applet experience remarkably better than we had previously experienced -- modern JVMs are much faster and new browsers cache the applet. We have now implemented a very simple and clean applet, which works quite well on both the Mac OS X and Windows XP platforms, and integrates seamlessly into the web interface, as shown below. The Mac OS X application is no longer available, and the applet is now the main way users update their data.

Client software implementation

After we made our design choices, the client implementation was rather simple. Three Java classes form the com.cutunes.client package:

Playlist Generation

The CUtunes playlist creator leverages the large amount of data we analyze to provide a more intelligent way to create playlists. The tool in essence allows the user to say, "Make me a playlist that is like these musical artists, and/or these CUtunes users."

Currently, the playlist creator is only provided as a Mac OS X application. Here is a simplified view of how it works: For a given query of artists and users, the application asks the CUtunes server for the top artists for each user, and the most similar artists for each artist. From these it constructs a list of artists with weights. The program then finds the N most popular songs by each artist that the user has, where N is determined by the weight of that artist. The playlist is then constructed in iTunes using AppleScript.
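The selection step described above can be sketched as follows. This is an illustrative Python sketch, not the Mac OS X application's code. It assumes the weighted artist list has already been obtained from the server, that "popular" means the user's own play count, and that N is the artist's share of a target playlist size; the function name, parameters, and data are all hypothetical.

```python
def build_playlist(artist_weights, library, playlist_size=20):
    """Pick songs from the user's library in proportion to artist weights.

    artist_weights: {artist_name: weight}, higher weight = more relevant
    library: list of (artist, title, play_count) tuples the user owns
    """
    total = sum(artist_weights.values())
    if total <= 0:
        return []
    playlist = []
    # Visit the most heavily weighted artists first
    for artist, weight in sorted(artist_weights.items(), key=lambda kv: -kv[1]):
        # Assumption: N is the artist's proportional share, with at least one song
        n = max(1, round(playlist_size * weight / total))
        owned = [song for song in library if song[0] == artist]
        owned.sort(key=lambda song: song[2], reverse=True)
        playlist.extend(owned[:n])
    return playlist


# Hypothetical data: artist "A" is weighted three times as heavily as "B"
weights = {"A": 3.0, "B": 1.0}
library = [
    ("A", "a1", 10), ("A", "a2", 5), ("A", "a3", 7), ("A", "a4", 1),
    ("B", "b1", 2), ("B", "b2", 9),
    ("C", "c1", 50),
]
playlist = build_playlist(weights, library, playlist_size=4)
```

Artists the user does not own contribute no songs, so the resulting playlist can be shorter than the requested size.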

Playlist creator implementation

Note: Currently the playlist creator is available only to a limited group of beta-testers.

Server Programs

The CUtunes server software is the backbone of our system. The main component is an XML-RPC server that exchanges data with both the client update application and the playlist generating application. Below we detail some of the major design choices and implementation details.

XML-RPC Server

For our server backend we used the same Apache XML-RPC package that the client software uses. The Apache package offers a number of flexible options for starting an XML-RPC server -- the package can be used within a J2EE/JSP Application Server, or run on its own as a flexible, lightweight XML-RPC server. We chose to run the server as a stand-alone Java application, and use JDBC to connect to our database (running on the same machine).

XML-RPC API

The XML-RPC server is started in a very simple ServerCommunicator Java class and uses a RequestHandler to receive XML-RPC messages:

New additions:

The main method for loading songs is bulkLoadSongs(), which sends a group of SongData objects to the server. SongData is a simple abstraction which holds all relevant data pulled from the iTunes XML file for each track.
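To give a feel for what such a call looks like on the wire, here is a small sketch using Python's standard xmlrpc.client module instead of the Apache XML-RPC Java package. Only the method name bulkLoadSongs comes from this report; the argument list and the field names standing in for SongData are hypothetical.

```python
import xmlrpc.client

# Hypothetical stand-ins for SongData objects: XML-RPC carries them as structs
songs = [
    {"name": "Example Song", "artist": "Example Artist",
     "album": "Example Album", "playCount": 12},
    {"name": "Another Song", "artist": "Example Artist",
     "album": "Example Album", "playCount": 3},
]

# Serialize a bulkLoadSongs request exactly as it would be sent over HTTP
request_xml = xmlrpc.client.dumps((songs,), methodname="bulkLoadSongs")

# Decoding it again shows what a server-side handler would receive
params, method = xmlrpc.client.loads(request_xml)

# Against a live server the same call would be (URL is an assumption):
#   proxy = xmlrpc.client.ServerProxy("http://example.invalid:8080/")
#   proxy.bulkLoadSongs(songs)
```

Sending the songs as one batch keeps the number of round trips small, which matters for libraries with thousands of tracks.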

Cron Jobs for Updates and Analysis

We use cron to schedule a variety of tasks to be run throughout the day:

New additions:

Server Implementation

The following classes define the com.cutunes.server package. Class names below link to the Javadoc.

Analysis

Introduction

CUtunes analyzes the data collected to provide a 'compatibility' rating between all users, similarity information between artists, recommendations for new music for each user, as well as a tool for visualizing the community.

The data can be thought of as an N x D matrix, where N is the number of users and D is the number of songs/albums/artists. Each value of this matrix corresponds to how much a user plays a given item.
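As a concrete illustration of this matrix view, the following Python sketch builds a small user-by-item matrix from play-count records. The record format, function name, and sample data are assumptions made for the example and do not reflect the actual database schema.

```python
def build_play_matrix(plays):
    """Turn (user, item, play_count) records into an N x D matrix.

    Returns (users, items, matrix), where matrix[i][j] is how many times
    users[i] played items[j].
    """
    plays = list(plays)
    users = sorted({user for user, _, _ in plays})
    items = sorted({item for _, item, _ in plays})
    row = {user: i for i, user in enumerate(users)}
    col = {item: j for j, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in users]
    for user, item, count in plays:
        matrix[row[user]][col[item]] += count
    return users, items, matrix


# Hypothetical records: two users (N = 2), two artists (D = 2)
records = [
    ("alice", "Artist A", 5),
    ("alice", "Artist B", 1),
    ("bob", "Artist B", 3),
]
users, items, matrix = build_play_matrix(records)
```

Each row of the result is one user's listening profile; the compatibility metrics operate on pairs of such rows.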

Compatibility

Our current implementation (in use now) focuses on calculating a similarity rating for each pair of users using the following method:

Where:

However, the problem of calculating compatibility between users can more accurately be thought of as calculating distances between two users in a D dimensional space. The closer two users are, the more compatible they are.

We have been experimenting with other metrics focusing on distance:

Intuitively, although it is less mathematically rigorous, our current implementation seems to provide more accurate results than the Euclidean distance and KL-divergence metrics. However, for calculating the distances between artists for use in the playlist generator application, we use KL divergence.
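For readers unfamiliar with the two distance metrics named above, here is a minimal Python sketch of their textbook definitions. It is not the CUtunes code: the additive smoothing used to turn play counts into probability distributions is an assumption of this example (KL divergence is undefined when the second distribution has a zero where the first does not), and all names are hypothetical.

```python
import math


def to_distribution(counts, smoothing=1.0):
    """Normalize play counts to probabilities, with additive smoothing."""
    total = sum(counts) + smoothing * len(counts)
    return [(c + smoothing) / total for c in counts]


def euclidean_distance(p, q):
    """Straight-line distance between two points in D-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """KL divergence D(p || q); asymmetric, and zero only when p equals q.

    Assumes q has no zero entries wherever p is positive, which the
    smoothing in to_distribution guarantees.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical play counts for two users (or artists) over three items
p = to_distribution([8, 1, 1])
q = to_distribution([1, 1, 8])
d_euclid = euclidean_distance(p, q)
d_kl = kl_divergence(p, q)
```

Note that KL divergence is not a true distance, since D(p || q) and D(q || p) generally differ; a symmetric variant can be obtained by averaging the two directions.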

We have also been experimenting with ways to visualize compatibility of all users using nonlinear dimensionality reduction techniques such as Semidefinite Embedding. Here is a 'map' of CUtunes' users:

For more information about our experimental metrics for compatibility and visualizing this data, please see the following papers that I wrote on the topic:

Semidefinite Embedding: Applied to Visualizing Folksonomies

Machine Learning Techniques for Visualizing High-Dimensional Data

Recommendations

To provide recommendations for a user, we find 10 items (songs, albums, and artists) not in the user's current library which have the highest recommendation score. The recommendation score for an item is calculated by multiplying another user's similarity rating by the percentage of that user's listening devoted to the item. In effect, the algorithm finds music which you don't have that users similar to yourself play often.
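The scoring rule above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the production algorithm: the report does not say how scores from several similar users are combined, so this sketch simply sums them, and the data structures and names are hypothetical.

```python
def recommend(target, libraries, similarity, top_n=10):
    """Rank items the target user does not own.

    libraries: {user: {item: play_count}}
    similarity: {other_user: similarity rating with respect to the target}
    Returns a list of (item, score) pairs, best first.
    """
    owned = set(libraries[target])
    scores = {}
    for other, plays in libraries.items():
        if other == target:
            continue
        total = sum(plays.values())
        if total == 0:
            continue
        for item, count in plays.items():
            if item in owned:
                continue
            # similarity rating times the share of this user's plays
            share = count / total
            scores[item] = scores.get(item, 0.0) + similarity.get(other, 0.0) * share
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical community of three users
libraries = {
    "me": {"A": 10},
    "u1": {"A": 5, "B": 5},
    "u2": {"C": 8, "B": 2},
}
similarity = {"u1": 0.8, "u2": 0.5}
recommendations = recommend("me", libraries, similarity)
```

In this example item "B" outranks "C" because it is played by both neighbors, including the more similar one, even though "C" dominates the listening of the less similar user.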

Future Directions

We have only recently begun to archive snapshots of our data. This new archival system offers exciting possibilities for studying the dynamics of the CUtunes system, allowing us to gain insight into how the popularity of certain songs, albums, and artists changes over time.

Program Documentation

Click here to view the Program Documentation.

Task List

Here is a comprehensive list of all of the changes and additions made this semester:

References

Thank you...

...Professor Schulzrinne, for all your advice and support this semester, as well as helping us host CUtunes.