Data visualization is at the heart of data science! It is an essential task in data exploration and analysis. Choosing the right visualization is vital for understanding the data, uncovering patterns, and communicating insights.
Matplotlib is a popular and widely used Python plotting library. It is possibly the easiest way to plot data in Python. It also provides some interactive features such as zoom, pan and update. The functionality of Matplotlib can be extended with many third-party packages such as Cartopy and Seaborn. Matplotlib is very powerful for creating aesthetic, publication-quality plots, but the figures are usually static.
Plotly is a Python library for interactive plotting. The value of interactive data visualization is apparent when analyzing large datasets with numerous features. Another advantage of Plotly over Matplotlib is that aesthetically pleasing plots can be created with a few lines of code. With Plotly, over 40 beautiful interactive web-based visualizations can be displayed in a Jupyter notebook or saved to HTML files.
This notebook provides code-based examples of how to create interactive plots using Plotly.
The hilarious image below describes some fundamental types of plots for data visualization.
Three datasets are employed in this notebook to demonstrate the different plot types.
We will be using the in situ snow depth data collected during the SnowEx 2020 Intensive Operation Period (IOP) in Grand Mesa, Colorado. Snow depth was measured using one of three instruments: Magnaprobe, Mesa 2, or pit ruler. Pit ruler data were collected from 150 snow pits identified for the Grand Mesa IOP. Check the SnowEx20 Depth Probe Landing Page and the User's Guide for more info.
We will also use the gapminder dataset. The third dataset will be scraped from Wikipedia.
# import necessary packages
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
# Read snow depth data
snowDepth = pd.read_csv('SnowEx2020_SnowDepths_COGM_alldepths_v01.csv', parse_dates= {'Datetime': [2,3]})
# rename some columns
snowDepth.rename(columns= {'Measurement Tool (MP = Magnaprobe; M2 = Mesa 2; PR = Pit Ruler)': 'Tool'}, inplace = True)
snowDepth
Datetime | Tool | ID | PitID | Longitude | Latitude | Easting | Northing | Depth (cm) | elevation (m) | equipment | Version Number | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-01-28 11:48:00 | MP | 100000 | 8N58 | -108.13515 | 39.03045 | 747987.62 | 4324061.71 | 94 | 3148.200000 | CRREL_B | 1 |
1 | 2020-01-28 11:48:00 | MP | 100001 | 8N58 | -108.13516 | 39.03045 | 747986.75 | 4324061.68 | 74 | 3148.300000 | CRREL_B | 1 |
2 | 2020-01-28 11:48:00 | MP | 100002 | 8N58 | -108.13517 | 39.03045 | 747985.89 | 4324061.65 | 90 | 3148.200000 | CRREL_B | 1 |
3 | 2020-01-28 11:48:00 | MP | 100003 | 8N58 | -108.13519 | 39.03044 | 747984.19 | 4324060.49 | 87 | 3148.600000 | CRREL_B | 1 |
4 | 2020-01-28 11:48:00 | MP | 100004 | 8N58 | -108.13519 | 39.03042 | 747984.26 | 4324058.27 | 90 | 3150.100000 | CRREL_B | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
37916 | 2020-02-04 13:40:00 | PR | 300162 | 5S29 | -108.16532 | 39.01801 | 745419.00 | 4322599.00 | 110 | 3094.260010 | ruler | 1 |
37917 | 2020-01-29 14:00:00 | PR | 300163 | 6S19 | -108.18073 | 39.01846 | 744083.00 | 4322607.00 | 139 | 3051.560059 | ruler | 1 |
37918 | 2020-02-11 15:04:00 | PR | 300164 | 1N5 | -108.21137 | 39.03618 | 741369.00 | 4324492.00 | 88 | 3031.800049 | ruler | 1 |
37919 | 2020-02-01 08:40:00 | PR | 300165 | 2S37 | -108.15929 | 39.01926 | 745936.51 | 4322753.96 | 104 | 3102.780029 | ruler | 1 |
37920 | 2020-02-08 13:25:00 | PR | 300166 | 3N26 | -108.18423 | 39.03341 | 743728.00 | 4324258.00 | 107 | 3066.909912 | ruler | 1 |
37921 rows × 12 columns
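Recent pandas releases flag the dictionary form of parse_dates (which merges several columns into one) as deprecated. One possible alternative, sketched below, is to read the date and time columns as text and combine them with pd.to_datetime. The column names Date and Time, the date format, and the inline CSV are assumptions for illustration; the real file may use different names and formats.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for the snow depth CSV
csv_text = """ID,PitID,Date,Time,Depth (cm)
100000,8N58,1/28/2020,11:48,94
100001,8N58,1/28/2020,11:48,74
300162,5S29,2/4/2020,13:40,110
"""

# Read the date and time columns as plain strings
df = pd.read_csv(io.StringIO(csv_text), dtype={"Date": str, "Time": str})

# Combine them into a single datetime column, then drop the originals
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%m/%d/%Y %H:%M")
df = df.drop(columns=["Date", "Time"])
print(df.dtypes)
```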
The data has 37921 records of snow depth and 12 columns. Let's check the data types.
snowDepth.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37921 entries, 0 to 37920
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Datetime        37921 non-null  datetime64[ns]
 1   Tool            37921 non-null  object
 2   ID              37921 non-null  int64
 3   PitID           37921 non-null  object
 4   Longitude       37921 non-null  float64
 5   Latitude        37921 non-null  float64
 6   Easting         37921 non-null  float64
 7   Northing        37921 non-null  float64
 8   Depth (cm)      37921 non-null  int64
 9   elevation (m)   37921 non-null  float64
 10  equipment       37921 non-null  object
 11  Version Number  37921 non-null  int64
dtypes: datetime64[ns](1), float64(5), int64(3), object(3)
memory usage: 3.5+ MB
# Select records within the period of the GM IOP campaign
GMP_SnowDepth = snowDepth[(snowDepth['Datetime']>='1/28/2020') & (snowDepth['Datetime'] <='2/12/2020')]
GMP_SnowDepth
Datetime | Tool | ID | PitID | Longitude | Latitude | Easting | Northing | Depth (cm) | elevation (m) | equipment | Version Number | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-01-28 11:48:00 | MP | 100000 | 8N58 | -108.13515 | 39.03045 | 747987.62 | 4324061.71 | 94 | 3148.200000 | CRREL_B | 1 |
1 | 2020-01-28 11:48:00 | MP | 100001 | 8N58 | -108.13516 | 39.03045 | 747986.75 | 4324061.68 | 74 | 3148.300000 | CRREL_B | 1 |
2 | 2020-01-28 11:48:00 | MP | 100002 | 8N58 | -108.13517 | 39.03045 | 747985.89 | 4324061.65 | 90 | 3148.200000 | CRREL_B | 1 |
3 | 2020-01-28 11:48:00 | MP | 100003 | 8N58 | -108.13519 | 39.03044 | 747984.19 | 4324060.49 | 87 | 3148.600000 | CRREL_B | 1 |
4 | 2020-01-28 11:48:00 | MP | 100004 | 8N58 | -108.13519 | 39.03042 | 747984.26 | 4324058.27 | 90 | 3150.100000 | CRREL_B | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
37916 | 2020-02-04 13:40:00 | PR | 300162 | 5S29 | -108.16532 | 39.01801 | 745419.00 | 4322599.00 | 110 | 3094.260010 | ruler | 1 |
37917 | 2020-01-29 14:00:00 | PR | 300163 | 6S19 | -108.18073 | 39.01846 | 744083.00 | 4322607.00 | 139 | 3051.560059 | ruler | 1 |
37918 | 2020-02-11 15:04:00 | PR | 300164 | 1N5 | -108.21137 | 39.03618 | 741369.00 | 4324492.00 | 88 | 3031.800049 | ruler | 1 |
37919 | 2020-02-01 08:40:00 | PR | 300165 | 2S37 | -108.15929 | 39.01926 | 745936.51 | 4322753.96 | 104 | 3102.780029 | ruler | 1 |
37920 | 2020-02-08 13:25:00 | PR | 300166 | 3N26 | -108.18423 | 39.03341 | 743728.00 | 4324258.00 | 107 | 3066.909912 | ruler | 1 |
36388 rows × 12 columns
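One detail of the date filter is worth checking. When a datetime column is compared with a date-only string such as '2/12/2020', pandas interprets the string as midnight at the start of that day, so records taken later on the end date fall outside a <= comparison. The sketch below illustrates this on made-up timestamps and shows a half-open interval as one possible way to keep the whole last day. Whether the last day should be included is an assumption to verify against the SnowEx documentation.

```python
import pandas as pd

# Made-up timestamps around the edges of the campaign window
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2020-01-27 16:00",
        "2020-01-28 11:48",
        "2020-02-12 00:00",
        "2020-02-12 10:30",
        "2020-02-13 09:00",
    ]),
    "Depth (cm)": [80, 94, 101, 88, 97],
})

# Same style of filter as in the notebook: the end string means 2020-02-12 00:00
closed = toy[(toy["Datetime"] >= "1/28/2020") & (toy["Datetime"] <= "2/12/2020")]

# Half-open interval that keeps every record taken on the end date
start = pd.Timestamp("2020-01-28")
end = pd.Timestamp("2020-02-12")
whole_day = toy[(toy["Datetime"] >= start)
                & (toy["Datetime"] < end + pd.Timedelta(days=1))]

print(len(closed), len(whole_day))
```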
Snow depth at the 150 snow pits was measured with a pit ruler. The other two measurement tools were used to collect snow depth along spiral tracks moving outwards from each snow pit location. Let's check the number of records for each tool.
GMP_SnowDepth.Tool.value_counts()
MP    31850
M2     4390
PR      148
Name: Tool, dtype: int64
It appears there are 148 snow pits, not 150, although NSIDC states that there are 150. Let's select the Pit Ruler (PR) records.
# Select records associated with a Pit ruler
pit_ruler_depth = GMP_SnowDepth[GMP_SnowDepth['Tool'] == 'PR']
pit_ruler_depth
Datetime | Tool | ID | PitID | Longitude | Latitude | Easting | Northing | Depth (cm) | elevation (m) | equipment | Version Number | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
37755 | 2020-01-30 11:24:00 | PR | 300001 | 7C15 | -108.19593 | 39.04563 | 742673.94 | 4325582.37 | 100 | 3048.699951 | ruler | 1 |
37756 | 2020-01-29 15:00:00 | PR | 300002 | 6C37 | -108.14791 | 39.00760 | 746962.00 | 4321491.00 | 117 | 3087.709961 | ruler | 1 |
37757 | 2020-02-09 12:30:00 | PR | 300003 | 8C31 | -108.16401 | 39.02144 | 745520.00 | 4322983.00 | 98 | 3099.639893 | ruler | 1 |
37758 | 2020-01-28 09:13:00 | PR | 300004 | 6N18 | -108.19103 | 39.03404 | 743137.23 | 4324309.43 | 92 | 3055.590088 | ruler | 1 |
37760 | 2020-02-10 10:30:00 | PR | 300006 | 8S41 | -108.14962 | 39.01659 | 746783.00 | 4322484.00 | 95 | 3113.870117 | ruler | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
37916 | 2020-02-04 13:40:00 | PR | 300162 | 5S29 | -108.16532 | 39.01801 | 745419.00 | 4322599.00 | 110 | 3094.260010 | ruler | 1 |
37917 | 2020-01-29 14:00:00 | PR | 300163 | 6S19 | -108.18073 | 39.01846 | 744083.00 | 4322607.00 | 139 | 3051.560059 | ruler | 1 |
37918 | 2020-02-11 15:04:00 | PR | 300164 | 1N5 | -108.21137 | 39.03618 | 741369.00 | 4324492.00 | 88 | 3031.800049 | ruler | 1 |
37919 | 2020-02-01 08:40:00 | PR | 300165 | 2S37 | -108.15929 | 39.01926 | 745936.51 | 4322753.96 | 104 | 3102.780029 | ruler | 1 |
37920 | 2020-02-08 13:25:00 | PR | 300166 | 3N26 | -108.18423 | 39.03341 | 743728.00 | 4324258.00 | 107 | 3066.909912 | ruler | 1 |
148 rows × 12 columns
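The row labels in the table above jump from 37758 to 37760, which suggests that at least one record near the pit ruler rows was removed by the date filter. One way to investigate the difference between 148 and 150 is to compare the per-tool counts before and after filtering and to list the pits that dropped out. The sketch below does this on a small made-up table; the timestamps and the pit ID 9C16 are hypothetical.

```python
import pandas as pd

# Made-up records: one PR measurement falls outside the campaign window
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2020-01-28 09:13", "2020-01-30 11:24", "2020-02-13 10:00",
        "2020-01-29 15:00", "2020-01-29 15:05",
    ]),
    "Tool": ["PR", "PR", "PR", "MP", "MP"],
    "PitID": ["6N18", "7C15", "9C16", "6C37", "6C37"],
})

in_period = toy[(toy["Datetime"] >= "2020-01-28")
                & (toy["Datetime"] <= "2020-02-12")]

# Side-by-side record counts per tool, before and after the date filter
counts = pd.DataFrame({
    "all": toy["Tool"].value_counts(),
    "in_period": in_period["Tool"].value_counts(),
})
print(counts)

# Pits with a PR record in the full table but none in the filtered table
pr_all = set(toy.loc[toy["Tool"] == "PR", "PitID"])
pr_kept = set(in_period.loc[in_period["Tool"] == "PR", "PitID"])
dropped_pits = sorted(pr_all - pr_kept)
print(dropped_pits)
```

Applied to snowDepth and GMP_SnowDepth, the same comparison would show whether the missing pits were excluded by the date range or are absent from the file altogether.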
Let's see what plotting with the plot method of a DataFrame and with Matplotlib looks like. Let's recall the types of data visualization charts.
ax = pit_ruler_depth.plot(x='Easting', y='Northing',
c='Depth (cm)', kind='scatter', alpha=0.7,
colorbar=True, colormap='PuBu', legend=True,
figsize=(10,6))
ax.set_title('Grand Mesa Pit Ruler Depths')
ax.set_xlabel('Easting [m]')
ax.set_ylabel('Northing [m]')
plt.show()
f, ax = plt.subplots(figsize = (10,6))
# keep the scatter handle so that a colorbar can be attached to it
sc = ax.scatter(pit_ruler_depth.Easting, pit_ruler_depth.Northing, c= pit_ruler_depth['Depth (cm)'], alpha=0.7, cmap= 'Reds')
f.colorbar(sc, ax=ax, label='Depth (cm)')
ax.set(title = 'Grand Mesa Pit Ruler Depths', xlabel = 'Easting (m)', ylabel = 'Northing (m)')
plt.show()
As we can see in the above plots, there is no way to interactively check the value of each depth plotted. At least four lines of code are required to produce each plot. Let's see what we can do with Plotly.
# Read the second dataset: gapminder
Data2 = px.data.gapminder()
Data2
#house_price = pd.read_csv('house-prices/train.csv')
#house_price
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 | ZWE | 716 |
1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 | ZWE | 716 |
1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 | ZWE | 716 |
1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 | ZWE | 716 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 | ZWE | 716 |
1704 rows × 8 columns
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed [1].
fig = px.scatter(pit_ruler_depth, x = 'Easting', y = 'Northing', color = 'Depth (cm)')
fig.show()
As we can see from the cell above, an aesthetic interactive plot is produced with just two lines of code. We can also change the hover name.
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. A bar graph shows comparisons among discrete categories. One axis of the chart shows the specific categories being compared, and the other axis represents a measured value [1].
Let's visualize the growth in the population of Nigeria over time.
df_Nig = Data2[Data2['country'] == 'Nigeria']
df_Nig
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
1128 | Nigeria | Africa | 1952 | 36.324 | 33119096 | 1077.281856 | NGA | 566 |
1129 | Nigeria | Africa | 1957 | 37.802 | 37173340 | 1100.592563 | NGA | 566 |
1130 | Nigeria | Africa | 1962 | 39.360 | 41871351 | 1150.927478 | NGA | 566 |
1131 | Nigeria | Africa | 1967 | 41.040 | 47287752 | 1014.514104 | NGA | 566 |
1132 | Nigeria | Africa | 1972 | 42.821 | 53740085 | 1698.388838 | NGA | 566 |
1133 | Nigeria | Africa | 1977 | 44.514 | 62209173 | 1981.951806 | NGA | 566 |
1134 | Nigeria | Africa | 1982 | 45.826 | 73039376 | 1576.973750 | NGA | 566 |
1135 | Nigeria | Africa | 1987 | 46.886 | 81551520 | 1385.029563 | NGA | 566 |
1136 | Nigeria | Africa | 1992 | 47.472 | 93364244 | 1619.848217 | NGA | 566 |
1137 | Nigeria | Africa | 1997 | 47.464 | 106207839 | 1624.941275 | NGA | 566 |
1138 | Nigeria | Africa | 2002 | 46.608 | 119901274 | 1615.286395 | NGA | 566 |
1139 | Nigeria | Africa | 2007 | 46.859 | 135031164 | 2013.977305 | NGA | 566 |
fig = px.bar(df_Nig, x = 'year', y = 'pop')
fig.show()
A line chart is a type of graph that shows the relationship between two variables with a line that connects a series of successive data points. It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time (a time series), so the line is often drawn chronologically. [1]
Data2.loc[Data2['continent'] == 'Oceania'].country.nunique()
2
There are two unique countries in Oceania - Australia and New Zealand! Let's compare the life expectancy of these countries over the years
df_line = Data2.query('continent =="Oceania"')
fig = px.line(df_line, x = 'year', y = 'lifeExp', color = 'country')
fig.show()
A pie chart is a circular statistical chart, which is divided into sectors to illustrate numerical proportion. [1]
Data2.columns
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap', 'iso_alpha', 'iso_num'], dtype='object')
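# Select African countries in 2007; copy so the relabelling below does not touch Data2
df_Africa_Pop2007 = Data2.query('year == 2007').query('continent == "Africa"').copy()
# Group countries with fewer than 10 million people under a single label
df_Africa_Pop2007.loc[df_Africa_Pop2007['pop'] < 10e6, 'country'] = 'Other countries'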
fig = px.pie(df_Africa_Pop2007, values = 'pop', names = 'country', hover_data=['lifeExp'])
fig.show()
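The relabelling step above relies on the pie chart adding up the values of rows that share the same name. An alternative sketch, which makes that aggregation explicit in pandas before plotting, is shown below on made-up numbers. The country names, populations, and threshold are illustrative assumptions.

```python
import pandas as pd

# Made-up populations; two of the four fall below the threshold
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "pop": [120e6, 45e6, 8e6, 3e6],
})
threshold = 10e6

# Keep the country name where pop >= threshold, otherwise use a shared label
label = toy["country"].where(toy["pop"] >= threshold, "Other countries")

# Sum the population per label so each pie slice has exactly one row
lumped = toy.groupby(label, sort=False)["pop"].sum().reset_index()
print(lumped)
```

A table like lumped could then be passed to px.pie with values = 'pop' and names = 'country'. Note that per-country columns such as lifeExp cannot be carried over to the combined row without choosing how to aggregate them.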
A histogram is an approximate representation of the distribution of numerical data.
px.histogram(GMP_SnowDepth, x = 'Depth (cm)')
As seen above, the snow depth distribution is roughly symmetric. A histogram can also be used to show the counts of a categorical feature.
px.histogram(GMP_SnowDepth, x = 'Tool')
The MP tool was used to record most of the points measured outward from the 148 snow pits where PR measurements were recorded.
A box plot is a statistical representation of numerical data through their quartiles. The ends of the box represent the lower and upper quartiles, while the median (second quartile) is marked by a line inside the box.
fig = px.box(GMP_SnowDepth, y = 'Depth (cm)')
fig.show()
If we hover over the above plot, we see that the maximum, median and minimum values of snow depth are 260 cm, 96 cm and 17 cm. We can plot the distribution of measurements for each measurement tool by passing the column of interest as the x argument.
fig = px.box(GMP_SnowDepth, x = 'Tool', y = 'Depth (cm)')
fig.show()
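The values shown when hovering over a box plot can be cross-checked numerically. The sketch below computes the quartiles, the interquartile range, and the usual 1.5 × IQR fences with pandas on a made-up series of depths. The numbers are illustrative only, and Plotly's quartile method may differ slightly from the pandas default, so small discrepancies are possible.

```python
import pandas as pd

# Made-up snow depths in cm (not the SnowEx data)
depth = pd.Series([17, 60, 85, 96, 100, 120, 260], name="Depth (cm)")

# Count, mean, min, quartiles and max in one call
summary = depth.describe()

# Quartiles and the 1.5 * IQR fences commonly used to flag outliers
q1, median, q3 = depth.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn as individual markers in a box plot
outliers = depth[(depth < lower_fence) | (depth > upper_fence)]
print(summary)
print(outliers.tolist())
```

For the per-tool box plot, the same summary could be obtained with GMP_SnowDepth.groupby('Tool')['Depth (cm)'].describe().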
A violin plot is a statistical representation of numerical data. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
fig = px.violin(GMP_SnowDepth, x = 'Tool', y = 'Depth (cm)')
fig.show()
3D scatter plots are used to plot data points on three axes in an attempt to show the relationship between three variables. Here we will show the relationships between life expectancy, population and GDP per capita for Oceanian and African countries.
# select Oceania countries
df_oceania = Data2.loc[Data2.continent == 'Oceania']
# select African countries with population greater than 50 million
df_africa = Data2.loc[Data2.continent == 'Africa']
df_africa2 = df_africa[df_africa['pop'] > 50.e6]
fig = px.scatter_3d(df_oceania, x = 'lifeExp', y = 'pop', z = 'gdpPercap', color = 'country' )
fig.show()
fig = px.scatter_3d(df_africa2, x = 'lifeExp', y = 'pop', z = 'gdpPercap', color = 'country' )
fig.show()
fig = px.line_3d(df_africa2, x = 'lifeExp', y = 'pop', z = 'gdpPercap', color = 'country' )
fig.show()
We will be putting the total population of each country on the map. Let's sum the population of each country over all the years in the data.
country_population = Data2.groupby('country')[['pop']].sum()
country_population.reset_index(inplace= True)
country_population
country | pop | |
---|---|---|
0 | Afghanistan | 189884585 |
1 | Albania | 30962990 |
2 | Algeria | 238504874 |
3 | Angola | 87712681 |
4 | Argentina | 343226879 |
... | ... | ... |
137 | Vietnam | 654822851 |
138 | West Bank and Gaza | 22183278 |
139 | Yemen, Rep. | 130118302 |
140 | Zambia | 76245658 |
141 | Zimbabwe | 91703593 |
142 rows × 2 columns
# Here, I scraped Wikipedia to get the country codes
url = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
data3 = pd.read_html(url, match='ISO 3166', header = 1)[0]
data3.set_index('Country name[5]', inplace= True)
data3
Country name[5] | Official state name[6] | Sovereignty[6][7][8] | Alpha-2 code[5] | Alpha-3 code[5] | Numeric code[5] | Subdivision code links[3] | Internet ccTLD[9] | |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | The Islamic Republic of Afghanistan | UN member state | .mw-parser-output .monospaced{font-family:mono... | AFG | 004 | ISO 3166-2:AF | .af |
1 | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The | Akrotiri and Dhekelia – See United Kingdom, The |
2 | Åland Islands | Åland | Finland | AX | ALA | 248 | ISO 3166-2:AX | .ax |
3 | Albania | The Republic of Albania | UN member state | AL | ALB | 008 | ISO 3166-2:AL | .al |
4 | Algeria | The People's Democratic Republic of Algeria | UN member state | DZ | DZA | 012 | ISO 3166-2:DZ | .dz |
... | ... | ... | ... | ... | ... | ... | ... | ... |
275 | Wallis and Futuna | The Territory of the Wallis and Futuna Islands | France | WF | WLF | 876 | ISO 3166-2:WF | .wf |
276 | Western Sahara [ah] | The Sahrawi Arab Democratic Republic | disputed [ai] | EH | ESH | 732 | ISO 3166-2:EH | [aj] |
277 | Yemen | The Republic of Yemen | UN member state | YE | YEM | 887 | ISO 3166-2:YE | .ye |
278 | Zambia | The Republic of Zambia | UN member state | ZM | ZMB | 894 | ISO 3166-2:ZM | .zm |
279 | Zimbabwe | The Republic of Zimbabwe | UN member state | ZW | ZWE | 716 | ISO 3166-2:ZW | .zw |
280 rows × 8 columns
# Let's convert the country codes to a dictionary so we can map them onto the country_population dataframe
country_dic = data3['Alpha-3 code[5]'].to_dict()
country_population['alpha_code']= country_population['country'].map(country_dic)
country_population
country | pop | alpha_code | |
---|---|---|---|
0 | Afghanistan | 189884585 | AFG |
1 | Albania | 30962990 | ALB |
2 | Algeria | 238504874 | DZA |
3 | Angola | 87712681 | AGO |
4 | Argentina | 343226879 | ARG |
... | ... | ... | ... |
137 | Vietnam | 654822851 | NaN |
138 | West Bank and Gaza | 22183278 | NaN |
139 | Yemen, Rep. | 130118302 | NaN |
140 | Zambia | 76245658 | ZMB |
141 | Zimbabwe | 91703593 | ZWE |
142 rows × 3 columns
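Some rows above end up with NaN codes (for example Vietnam and Yemen, Rep.) because the country names in gapminder are not spelled the same way as in the Wikipedia table, and points without a code will not appear on the map. The sketch below shows one possible way to list the unmatched names and patch them with a small manual lookup. The dictionary entries, including the spelling Viet Nam, are assumptions for illustration rather than a copy of the scraped table.

```python
import pandas as pd

# Three rows copied from country_population above
population = pd.DataFrame({
    "country": ["Zambia", "Vietnam", "Yemen, Rep."],
    "pop": [76245658, 654822851, 130118302],
})

# Hypothetical excerpt of the name-to-code dictionary built from the scraped table
wiki_codes = {"Zambia": "ZMB", "Viet Nam": "VNM", "Yemen": "YEM"}

# First pass: direct mapping leaves NaN where the spellings differ
population["alpha_code"] = population["country"].map(wiki_codes)
unmatched = population.loc[population["alpha_code"].isna(), "country"].tolist()
print(unmatched)

# Second pass: translate gapminder spellings before looking up the code
name_fixes = {"Vietnam": "Viet Nam", "Yemen, Rep.": "Yemen"}
lookup_name = population["country"].replace(name_fixes)
population["alpha_code"] = lookup_name.map(wiki_codes)
print(population)
```

Another route that avoids the name matching altogether is to use the iso_alpha column that already ships with the gapminder data, as seen in the Data2 output earlier.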
fig = px.scatter_geo(country_population, locations= 'alpha_code',
size = 'pop', hover_name= 'country', projection= 'natural earth')
fig.show()