People that never tried programming often think, that programming is something very complicated and hard to start. Honestly I had the same thinking until I tried python – one of the most intuitive programming languages with a low entry barrier. While learning programming it’s important to be able instantly try your knowledge in practice. In case of python it’s really easy – just start using python’s library “pandas” instead of excel in your everyday tasks.
data:image/s3,"s3://crabby-images/7f804/7f80469de0df39e249818e895c03160edc878d6c" alt=""
In order to run python code – you need an environment (program) where you will be able to execute your code. For beginners i would definitely recommend Jupyter Notebook – an open-source application where you can execute your code, create visualisations and notes. The easies way to install Jupyter Notebook on your computer – download it as a part of Anaconda distribution. You can download it here:
https://www.anaconda.com/distribution/
Download Anaconda for your OS with 3 python’s version, and launch it – you should see panel like this:
data:image/s3,"s3://crabby-images/86791/86791da4f677c430a08909e8ad674545f99a6fda" alt=""
Run Jupyter Notebook
Click “Launch” button under “Jupyter Notebook” logo – and Anaconda should open Jupyter Notebook in your browser:
data:image/s3,"s3://crabby-images/9ec50/9ec5040819e2944c0611f081868b2a8d72023110" alt=""
Here I would suggest you to create a specific folder for your pyton scripts:
data:image/s3,"s3://crabby-images/e13fc/e13fc6a268964b8a90f734231bf2fe892076e194" alt=""
Create a new folder and then rename it – I called it “Python Scripts”:
data:image/s3,"s3://crabby-images/f1138/f1138c5c615bc9d65d14c19d61ee2411657af36f" alt=""
Now navigate to that folder and create a new Python 3 file there:
data:image/s3,"s3://crabby-images/0d092/0d092cd5e6de2ebb1ac76622cae1723d3cae4f37" alt=""
If you see something like this – congratulations, it was the hardest part:
data:image/s3,"s3://crabby-images/b7aee/b7aee19c3cd4821ce0d803daa126f97331042629" alt=""
You can also rename your “Untitled” notebook to something like “My first script”, and start coding. =)
Pandas Data Frame
In order to use pandas library we need to import it at first – just add
“import pandas as pd” at the beginning of your script.
In pandas the main object we usually interact with – called “data frame”, think of it like an excel table. Let’s try to create our own simple data frame:
import pandas as pd
df = pd.DataFrame({'col1': ["a", "b", "c", "c"], 'col2': [1, 2, 3, 3]})
display(df)
Paste this code to Jupyter cell and click “Run” button:
data:image/s3,"s3://crabby-images/4e170/4e1704d7849889bf9dd7e0b62acfc6a96df63749" alt=""
An output should be similar to this:
data:image/s3,"s3://crabby-images/b4824/b48242a4b12bb46a48613bb784b63add0a5a04c8" alt=""
As you can see in code – we’ve created a variable called “df” that contains our table. The table consist of two columns – “col1” and “col2”. At the left side of it you can see indexes started from 0 – think of them like row numbers.
Now let’s try to perform simple operations with our table. Let’s find out the sum of column “col2”:
import pandas as pd
df = pd.DataFrame({'col1': ["a", "b", "c", "c"], 'col2': [1, 2, 3, 3]})
print(sum(df['col2']))
As you can see – our table have two similar last rows. We also can make a deduplication in order to delete duplicated rows before calculating the sum of column “col2”:
import pandas as pd
df = pd.DataFrame({'col1': ["a", "b", "c", "c"], 'col2': [1, 2, 3, 3]})
df = df.drop_duplicates()
print(sum(df['col2']))
Also we can group values in “col1” and find the sum of values in “col2” for each group:
import pandas as pd
df = pd.DataFrame({'col1': ["a", "b", "c", "c"], 'col2': [1, 2, 3, 3]})
print(df.groupby(['col1']).sum())
Try with your own Excel / CSV files
If you want to open your own files in pandas – place them in the same folder as your Jupyter Notebook:
data:image/s3,"s3://crabby-images/071e4/071e485d95053cd6b49909cfc4e5269064feb438" alt=""
As you can see – I have two files: “names.csv” and “names.xlsx”. Let’s try to open csv first:
import pandas as pd
df = pd.read_csv('names.csv')
display(df)
Remember that some csv (comma separated values) files actually are not actually separated with commas. My csv file “names.csv” was separated with ;
data:image/s3,"s3://crabby-images/2cdbe/2cdbe2ab13453621b7f6419cd6e74458e82fab18" alt=""
If you have the same problem – just add “sep” parameter to read_csv:
import pandas as pd
df = pd.read_csv('names.csv', sep=';')
display(df)
Now your file should be opened properly and transformed to pandas data frame:
data:image/s3,"s3://crabby-images/c0924/c09246ed7425ba24f66962c4d16c959bf507ba3c" alt=""
With this code I’ve opened my “names.xlsx” file:
import pandas as pd
df = pd.read_excel('names.xlsx')
display(df)
Now let’s open “name.csv”, remove rows where column “Name” contains “Jane” value, and save the result as a new csv.
import pandas as pd
df = pd.read_csv('names.csv', sep=';')
display(df)
df = df[df["Name"] != "Jane"]
display(df)
df.to_csv('names_2.csv')
As we can see the new csv “names_2.csv” was created and saved to the folder where we run our code:
data:image/s3,"s3://crabby-images/30a5b/30a5b04dad30d38fef03ab204592d32d8b78ae1b" alt=""
Try to do similar manipulations with your data. The possibilities of pandas library are huge, I just wanted to show you were to start and some basic syntax.
Why I should use Pandas instead of Excel?
- Pandas library is much more faster than Excel, especially on big amounts of data.
- Pandas library is much more flexible than Excel, provided you’ll learn some syntax.
- There are various data visualisation libraries like Seaborn or Plotly which will allow you to visualise your pandas data frames in an efficient manner.
- It is a great opportunity to start learning python and instantly use it in your everyday work.
data:image/s3,"s3://crabby-images/6d512/6d512e130a7255b40302e38269b236e9f4f131ce" alt=""