Parallel Coordinates for Multidimensional Data Visualization

Parallel coordinates were invented back in 1885 by the French engineer and mathematician Philbert Maurice d’Ocagne. When I discovered this visualization technique, I was really impressed by how it lets you show something as complicated as multidimensional data in a simple and intuitive way.

This is how I visualized a few dimensions of the Mobile App Store data set from Kaggle with the help of Plotly and some Python code (the plot is interactive, so you can apply filters to the axes or change their order):

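import pandas as pd
import plotly
# Note: in plotly 4 and later these online helpers moved to the separate
# chart_studio package (chart_studio.plotly, chart_studio.tools).
import plotly.plotly as py
import plotly.graph_objs as go

df = pd.read_csv('AppleStore.csv')
# Replace the API key placeholder with your own key from Plotly
plotly.tools.set_credentials_file(username='PuzyrovSerhii', api_key='your_API_key_from_plotly')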
data = [go.Parcoords(
        line = dict(color = df['user_rating'],
        colorscale = [[0,'#D7C16B'],[0.5,'#23D8C3'],[1,'#F3F10F']]),
        dimensions = list([
            dict(range = [0,max(df.user_rating)],
                constraintrange = [min(df.user_rating),max(df.user_rating)],
                label = 'Average User Rating', values = df['user_rating']),
            dict(range = [min(df.price),max(df.price)],
                label = 'Price', values = df['price']),
            dict(range = [min(df.size_bytes),max(df.size_bytes)],
                label = 'Size (in Bytes)', values = df['size_bytes']),
            dict(range = [min(df['lang.num']),max(df['lang.num'])],
                label = 'Number of supported languages', values = df['lang.num']),
            dict(range = [min(df['rating_count_tot']),max(df['rating_count_tot'])],
                label = 'User Rating counts', values = df['rating_count_tot']),
            dict(range = [min(df['sup_devices.num']),max(df['sup_devices.num'])],
                label = 'Number of supporting devices', values = df['sup_devices.num'])]))]
layout = go.Layout(
    plot_bgcolor = '#E5E5E5',
    paper_bgcolor = '#E5E5E5')
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = 'parcoords-basic')

As you can see, in a parallel coordinates plot every numerical dimension has its own axis, and the axes are placed parallel to each other. Each record is drawn as a line that connects its values across all the axes, so we can see the relationships and connections between our dimensions.
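To make that mapping concrete, here is a minimal pure-Python sketch of how a single record could be turned into the vertices of its line: axis number i sits at horizontal position i, and the value is rescaled to that axis's own range. The function name `to_polyline` and the sample record are my own illustration and are not part of Plotly or pandas.

```python
def to_polyline(record, ranges):
    """Map one record to the (x, y) vertices of its line.

    record: list of numeric values, one per dimension.
    ranges: list of (low, high) pairs, one per axis.
    """
    points = []
    for axis_index, (value, (low, high)) in enumerate(zip(record, ranges)):
        span = high - low
        # Each axis is scaled independently; a zero-width axis maps to the bottom.
        height = (value - low) / span if span else 0.0
        points.append((axis_index, height))
    return points


# Made-up app: rating 4.5 (axis 0..5), price 2.5 (axis 0..10), 10 languages (axis 0..40)
print(to_polyline([4.5, 2.5, 10], [(0, 5), (0, 10), (0, 40)]))
```

Drawing one such polyline per row, all over the same set of vertical axes, gives the plot above.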
There is also a way to create a parallel coordinates plot with the help of the free pandas library:

import pandas as pd
from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('AppleStore.csv')
features = ['user_rating', 'price', 'size_bytes', 'lang.num', 'rating_count_tot', 'sup_devices.num']
# Work on a copy so the original DataFrame is left untouched
df_scal = df.copy()
# Keep the unscaled rating as the class label that colors the lines
df_scal['user_rating_label'] = df['user_rating']
# Scale every feature to the [0, 1] range with MinMaxScaler
scaler = MinMaxScaler()
df_scal[features] = scaler.fit_transform(df[features])
# Parallel coordinates plot
plt.figure(figsize=(26, 8))
df_scal = df_scal.sort_values(['user_rating_label'], ascending=False)
parallel_coordinates(df_scal[['user_rating_label'] + features], 'user_rating_label')
plt.show()

However, the pandas version of this plot has two big disadvantages: it is not interactive, and by default it does not scale each axis separately, so you have to scale all your dimensions before plotting. I did the scaling with MinMaxScaler, which maps each feature to the range from 0 to 1. There are certainly other tools and libraries for drawing parallel coordinates; I have shown here only the ones I tried myself.
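For readers who wonder what this scaling does to a single column, here is a small pure-Python sketch of min-max scaling: subtract the column minimum and divide by the column range. The helper `min_max_scale` and the sample prices are made up for illustration, and sending a constant column to 0.0 is an assumption of this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Constant column: map every value to 0.0 (assumption for this sketch)
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


prices = [0.0, 0.99, 2.99, 4.99]  # made-up sample prices
print(min_max_scale(prices))
```

After this step the smallest value of every feature sits at 0 and the largest at 1, so all features can share one vertical scale in the pandas plot.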

One of my favorite examples of an interactive parallel coordinates plot is the Nutrient Database Explorer. I highly recommend checking it out.

I personally like parallel coordinates not only for their simplicity, but also because they sometimes remind me of something closer to art than to math.
Here are some other good examples of parallel coordinates I have found on the web:
