When working with Pandas dataframes, you often need to reshape the data to prepare it for further analysis, visualization, or a library that requires a particular data form.
I've decided to share some functions that I used a lot when working with customer journey paths exported from Google Analytics. For an experienced data scientist these may be basics, but for a beginner in the Python world they can be really helpful.
Ungrouping
This function ungroups grouped data: every row is repeated as many times as its count says, and the count is then reset to 1. Parameters of this function: the dataframe and the name of the column that holds the value counts.
Actual function:
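import numpy as np

def ungrouping(df, cnt):
    # Work on a copy so the caller's dataframe is left untouched
    df = df.copy(deep=True)
    # Repeat every index label as many times as the count column says
    df = df.loc[np.repeat(df.index.values, df[cnt])].reset_index(drop=True)
    # Every row now stands for a single observation
    df[cnt] = 1
    return df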
Example of use:
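import pandas as pd
import numpy as np

def ungrouping(df, cnt):
    # Work on a copy so the caller's dataframe is left untouched
    df = df.copy(deep=True)
    # Repeat every index label as many times as the count column says
    df = df.loc[np.repeat(df.index.values, df[cnt])].reset_index(drop=True)
    # Every row now stands for a single observation
    df[cnt] = 1
    return df

d = {'name': ['a', 'b', 'c'], 'cnt': [1, 2, 2]}
df = pd.DataFrame(d)
print(df)
df2 = ungrouping(df, 'cnt')
print(df2)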
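A quick way to convince yourself that the ungrouping did what you expect is to group the result back and compare it with the input. The snippet below is only a sketch of such a check; the variable names grouped, ungrouped and regrouped are mine, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def ungrouping(df, cnt):
    # Same idea as above: repeat each row as many times as its count
    df = df.loc[np.repeat(df.index.values, df[cnt])].reset_index(drop=True)
    df[cnt] = 1
    return df

grouped = pd.DataFrame({'name': ['a', 'b', 'c'], 'cnt': [1, 2, 2]})
ungrouped = ungrouping(grouped, 'cnt')

# Summing the per-row counts of 1 should give back the original counts
regrouped = ungrouped.groupby('name')['cnt'].sum().reset_index()

# The number of ungrouped rows should equal the sum of the original counts
print(len(ungrouped), grouped['cnt'].sum())
print(regrouped)
```

If the two numbers differ, or the regrouped counts do not match the input, the count column probably contained something other than non-negative integers.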
Dimensions explode
This function creates one row per separated value in a particular column and keeps the original row number in a new id column. Parameters: the dataframe, the column you want to split, and the separator.
Actual function:
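def dimensions_explode(df, col, sep):
    # Work on a copy so the caller's dataframe is left untouched
    df = df.copy(deep=True)
    # Remember the original row label of every split row
    df['id'] = df.index
    # Split the column, stack the pieces into one value per row
    # (indexed by the original row label) and trim whitespace
    parts = (df[col].str.split(sep, expand=True).stack().dropna()
             .reset_index(level=1, drop=True)
             .str.strip().rename(col))
    # Replace the original column with the split values
    return df.drop(columns=col).join(parts).reset_index(drop=True)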
Example of use:
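import pandas as pd

def dimensions_explode(df, col, sep):
    # Work on a copy so the caller's dataframe is left untouched
    df = df.copy(deep=True)
    # Remember the original row label of every split row
    df['id'] = df.index
    # Split the column, stack the pieces into one value per row
    # (indexed by the original row label) and trim whitespace
    parts = (df[col].str.split(sep, expand=True).stack().dropna()
             .reset_index(level=1, drop=True)
             .str.strip().rename(col))
    # Replace the original column with the split values
    return df.drop(columns=col).join(parts).reset_index(drop=True)

d = {'source': ['a, b', 'b, c', 'c'], 'cnt': [1, 2, 2]}
df = pd.DataFrame(d)
print(df)
df2 = dimensions_explode(df, 'source', ',')
print(df2)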
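If your pandas version is 0.25 or newer, the same reshaping can probably be written more briefly with DataFrame.explode. This is only a sketch of an alternative, not the function from this post: the name dimensions_explode_alt is made up for illustration, and the column order of the result differs (the split column stays in its original position).

```python
import pandas as pd

def dimensions_explode_alt(df, col, sep):
    # Hypothetical alternative to dimensions_explode, based on explode()
    out = df.copy()
    out['id'] = out.index
    # Turn every cell into a list of pieces ...
    out[col] = out[col].str.split(sep)
    # ... then emit one row per list element
    out = out.explode(col).reset_index(drop=True)
    # Trim the whitespace left around the separator
    out[col] = out[col].str.strip()
    return out

df = pd.DataFrame({'source': ['a, b', 'b, c', 'c'], 'cnt': [1, 2, 2]})
df2 = dimensions_explode_alt(df, 'source', ',')
print(df2)
```

Both versions leave the input dataframe unchanged and keep the id column, so you can still trace every row back to the row it came from.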
Dimensions concatenation
The last tip is the opposite of the previous one: concatenating the dimension values of several rows back into a single row. It is not a function, but it is still very useful when working with customer journey paths.
Example of use:
import pandas as pd

d = {'cnt': [1, 1, 2, 2, 2], 'id': [0, 0, 1, 1, 2],
     'source': ['a', 'b', 'b', 'c', 'c']}
df = pd.DataFrame(d)
print(df)
# Join the sources of every (id, cnt) group back into one string
df2 = df.groupby(['id', 'cnt'])['source'].apply(', '.join).reset_index()
print(df2)