Regular expressions are great helpers for every data analyst because they allows to transform dimensions applying quite complicated logic rules. In Google Data Studio I often use REGEXP text functions – so I decided to share with you some my most often use cases. In addition – we’ll find out the difference between REGEXP_REPLACE, REGEXP_EXTRACT and REGEXP_MATCH based on examples.
REGEXP_REPLACE – is often used for text cleaning. It replaces values that matches our rule to any other value we want. For example we want to remove symbols “-“, “_” and numbers from our text. We can do it by applying this regex:
REGEXP_REPLACE(raw_text, "-|_|[0-9]", "")
It says: “Find all _, -, and numbers, and replace it with empty text”. This is what we’ll get in result:
Another case – for example we want to get the part of text before the dash. We could use REGEXP_EXTRACT but also it can be done with REGEXP_REPLACE:
REGEXP_REPLACE(raw_text, "- .*", "")
It says: “Get all text after “- ” and replace it with empty text”. As you can see – combination of symbols “.*” means “everything”.
REGEXP_EXTRACT – is used for extracting data that matches our rule. For example we want to extract the domain name from url which have this pattern:
We can use this formula in order to do that:
With this regex we say something like: “Get all text between // and the first occurrence of /”. This is how this regex affects different types of links in Data Studio:
Another use case – we have a solid text “first|second|third” which we want to split on separate dimensions: “first”, “second” and “third”.
In order to do that – we can use this formulas:
REGEXP_EXTRACT(regex orig, "(.*)\\|.*\\|.*") #first REGEXP_EXTRACT(regex orig, ".*\\|(.*)\\|.*") #second REGEXP_EXTRACT(regex orig, ".*\\|.*\\|(.*)") #third
Some of them can be written shorter, but I want you to see visually the difference between these three formulas. As you can see – we only changing the position of (.*) – this way we let regex know which logical part of text we want to extract.
Symbols “\\” – helps us to transform special characters to usual text, so in formulas above we let regex know that “|” is not a logical operator “or”, but a simple text. It usually called the escape character, and it may differ a bit in regular expressions of different languages/environments. In case of Google Data Studio it’s two backslash characters \\.
REGEXP_MATCH – function that returns True when text matches the rule, or False if not. For example we want to know if given word contains letter n – we can use this regex formula for that:
With symbols “.*” we say that there can be anything before and after “n”.
REGEXP_MATCH is often used in Data Studio case formulas for dimensions grouping. This is an example of such formula which copies Google Analytics Default Channel Grouping:
CASE WHEN ((Source="direct" AND Medium="(not set)") OR Medium="(none)") THEN "Direct" WHEN Medium="organic" THEN "Organic Search" WHEN (Social Source Referral="Yes" OR REGEXP_MATCH(Medium,"^(social|social-network|social-media|sm|social network|social media)$")) THEN "Social" WHEN Medium="email" THEN "Email" WHEN Medium="affiliate" THEN "Affiliates" WHEN Medium="referral" THEN "Referral" WHEN (REGEXP_MATCH(Medium,"^(cpc|ppc|paidsearch)$") AND Ad Distribution Network!="Content") THEN "Paid Search" WHEN REGEXP_MATCH(Medium," ^(cpv|cpa|cpp|content-text)$") THEN "Other Advertising" WHEN (REGEXP_MATCH(Medium,"^(display|cpm|banner)$") OR Ad Distribution Network="Content") THEN "Display" ELSE "(Other)" END
You can read more about Data Studio case formulas in my article:
Hope my examples will help you to understand regex text formulas in Google Data Studio better. I encourage you to get familiar with them and I promise – they will save you lots of time and will take your dashboards to the new level.