Pandas Python Library = Excel On Steroids (tutorial for beginners)

People that never tried programming often think, that programming is something very complicated and hard to start. Honestly I had the same thinking until I tried python – one of the most intuitive programming languages with a low entry barrier. While learning programming it’s important to be able instantly try your knowledge in practice. In case of python it’s really easy – just start using python’s library “pandas” instead of excel in your everyday tasks.

In order to run python code – you need an environment (program) where you will be able to execute your code. For beginners i would definitely recommend Jupyter Notebook – an open-source application where you can execute your code, create visualisations and notes. The easies way to install Jupyter Notebook on your computer – download it as a part of Anaconda distribution. You can download it here:
https://www.anaconda.com/distribution/

Download Anaconda for your OS with 3 python’s version, and launch it – you should see panel like this:

Run Jupyter Notebook

Click “Launch” button under “Jupyter Notebook” logo – and Anaconda should open Jupyter Notebook in your browser:

Here I would suggest you to create a specific folder for your pyton scripts:

Create a new folder and then rename it – I called it “Python Scripts”:

Now navigate to that folder and create a new Python 3 file there:

If you see something like this – congratulations, it was the hardest part:

You can also rename your “Untitled” notebook to something like “My first script”, and start coding. =)

Pandas Data Frame

In order to use pandas library we need to import it at first – just add
“import pandas as pd” at the beginning of your script.
In pandas the main object we usually interact with – called “data frame”, think of it like an excel table. Let’s try to create our own simple data frame:

import pandas as pd
df = pd.DataFrame({'col1': ["a",  "b", "c", "c"], 'col2': [1, 2, 3, 3]})
display(df)

Paste this code to Jupyter cell and click “Run” button:

An output should be similar to this:

As you can see in code – we’ve created a variable called “df” that contains our table. The table consist of two columns – “col1” and “col2”. At the left side of it you can see indexes started from 0 – think of them like row numbers.

Now let’s try to perform simple operations with our table. Let’s find out the sum of column “col2”:

import pandas as pd
df = pd.DataFrame({'col1': ["a",  "b", "c", "c"], 'col2': [1, 2, 3, 3]})
print(sum(df['col2']))

As you can see – our table have two similar last rows. We also can make a deduplication in order to delete duplicated rows before calculating the sum of column “col2”:

import pandas as pd
df = pd.DataFrame({'col1': ["a",  "b", "c", "c"], 'col2': [1, 2, 3, 3]})
df = df.drop_duplicates()
print(sum(df['col2']))

Also we can group values in “col1” and find the sum of values in “col2” for each group:

import pandas as pd
df = pd.DataFrame({'col1': ["a",  "b", "c", "c"], 'col2': [1, 2, 3, 3]})
print(df.groupby(['col1']).sum())

Try with your own Excel / CSV files

If you want to open your own files in pandas – place them in the same folder as your Jupyter Notebook:

As you can see – I have two files: “names.csv” and “names.xlsx”. Let’s try to open csv first:

import pandas as pd
df = pd.read_csv('names.csv')
display(df)

Remember that some csv (comma separated values) files actually are not actually separated with commas. My csv file “names.csv” was separated with ;

If you have the same problem – just add “sep” parameter to read_csv:

import pandas as pd
df = pd.read_csv('names.csv', sep=';')
display(df)

Now your file should be opened properly and transformed to pandas data frame:

With this code I’ve opened my “names.xlsx” file:

import pandas as pd
df = pd.read_excel('names.xlsx')
display(df)

Now let’s open “name.csv”, remove rows where column “Name” contains “Jane” value, and save the result as a new csv.

import pandas as pd
df = pd.read_csv('names.csv', sep=';')
display(df)
df = df[df["Name"] != "Jane"]
display(df)
df.to_csv('names_2.csv')

As we can see the new csv “names_2.csv” was created and saved to the folder where we run our code:

Try to do similar manipulations with your data. The possibilities of pandas library are huge, I just wanted to show you were to start and some basic syntax.

Why I should use Pandas instead of Excel?

Pandas library is much more faster than Excel, especially on big amounts of data.
Pandas library is much more flexible than Excel, provided you’ll learn some syntax.
There are various data visualisation libraries like Seaborn or Plotly which will allow you to visualise your pandas data frames in an efficient manner.
It is a great opportunity to start learning python and instantly use it in your everyday work.