3 Helpful Functions for Data Manipulation (Python)

When working with Pandas dataframes, you often need to reshape the data to prepare it for further analysis, visualization, or a library that requires a particular data format.
I've decided to share some functions that I use a lot when working with customer journey paths exported from Google Analytics. For an experienced data scientist these may be basics, but for a beginner in the Python world they can be really helpful.

Ungrouping

This function ungroups grouped data: each row is repeated as many times as its count column says, and the count is then set to 1. Its parameters are the dataframe and the name of the column that holds the count.

Actual function:

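def ungrouping(df, cnt):
    df = df.copy(deep=True)  # leave the caller's dataframe untouched
    # Repeat every index label as many times as its count, then select
    # those labels so each row appears once per counted unit.
    df = df.loc[np.repeat(df.index.values, df[cnt])
               ].reset_index(drop=True)
    df[cnt] = 1  # every row now stands for a single unit
    return df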

Example of use:

import pandas as pd
import numpy as np

def ungrouping(df, cnt):
    df = df.copy(deep=True)
    df = df.loc[np.repeat(df.index.values,df[cnt])
               ].reset_index(drop=True)
    df[cnt] = 1
    return df

d = {'name': ['a', 'b', 'c'], 'cnt': [1, 2, 2]}
df = pd.DataFrame(d)
print(df)

df2 = ungrouping(df, 'cnt')
print(df2)
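
As a quick sanity check on the example above, the ungrouped dataframe should have as many rows as the sum of the original counts, and grouping it again should give back the original counts. This is a minimal sketch, assuming the default unique index; the name `regrouped` is only used for illustration.

```python
import numpy as np
import pandas as pd

def ungrouping(df, cnt):
    df = df.copy(deep=True)
    df = df.loc[np.repeat(df.index.values, df[cnt])].reset_index(drop=True)
    df[cnt] = 1
    return df

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'cnt': [1, 2, 2]})
df2 = ungrouping(df, 'cnt')

# One output row per counted unit: 1 + 2 + 2 = 5 rows.
print(len(df2) == df['cnt'].sum())

# Summing the 1s per name restores the original counts.
regrouped = df2.groupby('name', as_index=False)['cnt'].sum()
print(regrouped)
```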

Dimensions explode

This function creates one row per value from a column that holds several separated values. Its parameters are the dataframe, the column that you want to split, and the separator.

Actual function:

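def dimensions_explode(df, col, sep):
    df['id'] = df.index  # remember which original row each value came from
    # Split col on sep, stack the pieces into one long column (dropping the
    # padding that expand=True adds to shorter rows) and join it back under
    # a temporary name, col + ' ', so it does not clash with col itself.
    df = df.join(pd.DataFrame(df[col].str
        .split(sep, expand=True).stack().dropna()
        .reset_index(level=1, drop=True)
        ,columns=[col + ' '])
        ).drop(col, axis=1).rename(columns=str.strip
        ).reset_index(drop=True)
    df[col] = df[col].str.strip()  # remove spaces left around each value
    return df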

Example of use:

import pandas as pd
import numpy as np

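def dimensions_explode(df, col, sep):
    df['id'] = df.index  # remember which original row each value came from
    # Split, stack into one long column, and join back under a temporary name.
    df = df.join(pd.DataFrame(df[col].str
        .split(sep, expand=True).stack().dropna()
        .reset_index(level=1, drop=True)
        ,columns=[col + ' '])
        ).drop(col, axis=1).rename(columns=str.strip
        ).reset_index(drop=True)
    df[col] = df[col].str.strip()  # remove spaces left around each value
    return df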

d = {'source': ['a, b', 'b, c', 'c'], 'cnt': [1, 2, 2]}
df = pd.DataFrame(d)
print(df)

df2 = dimensions_explode(df, 'source', ',')
print(df2)
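
In newer pandas versions (0.25 and later) the built-in `DataFrame.explode` method can do the same reshaping. The following is a sketch of that alternative, not the function above; the name `dimensions_explode_builtin` is hypothetical.

```python
import pandas as pd

def dimensions_explode_builtin(df, col, sep):
    # Hypothetical helper name; relies on DataFrame.explode (pandas >= 0.25).
    out = df.copy()                     # do not modify the caller's dataframe
    out['id'] = out.index               # original row number
    out[col] = out[col].str.split(sep)  # each string becomes a list of pieces
    out = out.explode(col).reset_index(drop=True)  # one row per list element
    out[col] = out[col].str.strip()     # remove spaces around each piece
    return out

df = pd.DataFrame({'source': ['a, b', 'b, c', 'c'], 'cnt': [1, 2, 2]})
df2 = dimensions_explode_builtin(df, 'source', ',')
print(df2)
```

The column order differs from the function above (`source` stays first), but the rows carry the same values.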

Dimensions concatenation

The last tip is the opposite of the previous one: concatenating dimension values from several rows into a single row. It is not a function, but it is still very useful when working with customer journey paths.

Example of use:

import pandas as pd
import numpy as np

d = {'cnt': [1, 1, 2, 2, 2], 'id': [0, 0, 1, 1, 2], 'source': ['a', 'b', 'b', 'c', 'c'], }
df = pd.DataFrame(d)
print(df)

df2 = df.groupby(['id','cnt'])['source'
    ].apply(lambda x: ', '.join(x)).reset_index()
print(df2)
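
If you use this pattern often, it can be wrapped in a small helper. This is a minimal sketch; the name `dimensions_concat` and its parameters are my own assumption, not part of the snippet above.

```python
import pandas as pd

def dimensions_concat(df, keys, col, sep=', '):
    # Hypothetical helper: join the values of col within each group of keys.
    return df.groupby(keys)[col].agg(sep.join).reset_index()

df = pd.DataFrame({'cnt': [1, 1, 2, 2, 2],
                   'id': [0, 0, 1, 1, 2],
                   'source': ['a', 'b', 'b', 'c', 'c']})
df2 = dimensions_concat(df, ['id', 'cnt'], 'source')
print(df2)
```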
