You want to become a data scientist? Hooray!
As a data scientist at Intuit, I use Structured Query Language (SQL) on a daily basis to extract data. In this post, I will highlight a few tweaks that tackle frequent data science challenges.
Why is SQL still relevant?
- Data sources are massive, so data should be carefully picked. We try to move the computation as close as possible to the source of the data and take exactly what we need.
- Data servers possess great computing power, so it is very efficient to use SQL for simple calculations.
- SQL allows fast delegation to data engineers when moving to production. SQL is very common among developers, so it is easy to delegate your code to other people in the organization to maintain and run on a periodic basis.
SQL is straightforward and there are dozens of guides, books and courses online. I recommend w3schools, it has great guides and a very light interface to try your code. I will use the w3schools database throughout the post to give examples that you can try yourself.
Write readable code
This advice always rings true, and especially for SQL. I follow this styling guide. My top rules are:
- Use indentation. Indentation helps us understand the code.
- Keep the river aligned (the vertical channel of whitespace between keywords and their arguments).
- Use aliases.
- For complex queries, use multiple temp tables and stitch them up in the last one.
Use temp tables (“with” statements, also known as common table expressions) to store interim calculations. In each subtable, try to keep to no more than one join, then join all the temp tables together at the end. If you keep the join key names consistent across tables, you can use the concise syntax “join <table_name> using (<join_keys_names>)”.
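As a runnable sketch of this pattern (the tables and data here are made up for illustration; sqlite3 is used so you can run it anywhere), each “with” subtable holds one interim step with at most one join, and the final select stitches them together:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema standing in for a real data source.
con.executescript("""
    create table orders (order_id integer, customer_id integer, order_date text);
    create table customers (customer_id integer, country text);
    insert into orders values (1, 10, '2023-01-05'), (2, 11, '2023-01-06'), (3, 10, '2023-02-01');
    insert into customers values (10, 'Norway'), (11, 'Brazil');
""")

# Each temp table does one thing; consistent key names let us write
# "using (customer_id)" instead of a full "on a.x = b.x" clause.
query = """
with orders_per_customer as (
    select customer_id,
           count(*) as n_orders
    from   orders
    group  by customer_id
),
enriched as (
    select customer_id,
           n_orders,
           country
    from   orders_per_customer
    join   customers using (customer_id)
)
select * from enriched order by customer_id
"""
rows = con.execute(query).fetchall()
print(rows)  # [(10, 2, 'Norway'), (11, 1, 'Brazil')]
```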
Common queries for data scientists
1. Finding the current / most recent status or event.
Databases tend to contain a log of events. To deduce the current situation, you need to find the most recent event that happened. This could be the last balance we have for a user, how many children she currently has, or the value of the latest transaction she made. All of these require you to extract the information as of a specific point in time.
The challenge is that most SQL engines require you to apply a group by or an aggregation function (e.g., max, sum, average) to every column you retrieve. This means that you cannot simply write:
```sql
select user, max(date), status
from myTable
group by user
```
You will need to first fish out the max date for each user, and then find the value on that exact date for that user.
For example, suppose you have a table that records orders. It contains the employee that made the order, the date, and the customer in that transaction.
Suppose you want to find who the customer was in each employee’s last order. There are several ways to solve this. Here is my recommendation:
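A minimal runnable sketch of the recommendation (the rows here are made up, standing in for the w3schools Orders table): a temp table first finds each employee’s latest order date, and the outer query joins back to recover the customer on that exact date.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Made-up rows standing in for the w3schools Orders table.
con.executescript("""
    create table Orders (EmployeeID integer, CustomerID integer, OrderDate text);
    insert into Orders values
        (1, 90, '2023-01-05'),
        (1, 77, '2023-03-12'),
        (2, 85, '2023-02-20');
""")

# Step 1: the "with" subtable finds each employee's latest order date.
# Step 2: join back to Orders on both keys to get the customer on that date.
query = """
with last_order as (
    select EmployeeID,
           max(OrderDate) as OrderDate
    from   Orders
    group  by EmployeeID
)
select EmployeeID,
       CustomerID
from   Orders
join   last_order using (EmployeeID, OrderDate)
order  by EmployeeID
"""
rows = con.execute(query).fetchall()
print(rows)  # [(1, 77), (2, 85)] -- the customer in each employee's last order
```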
2. Calculate measures as they were known at a specific time.
As data scientists, we usually need to train our models on data that was known at each specific point in time. If you don’t pay close attention, you may introduce data leakage, which means data from the future is linked into the training set and makes your model look amazing on the test set. Of course, this will only happen in your lab; the results will not hold once you try to predict real-life events.
Here is a cool tweak to calculate, for each week, the number of orders we had until that week:
The trick is to remember that you are not obligated to use ‘=’ in the ‘on’ clause of a join. Here I use ‘>=’, which means: first join every record whose week is less than or equal to the current one, and then count the orders.
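The tweak above can be sketched like this (a runnable example with toy data; the table and week bucketing are assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy data: each order is tagged with the week it occurred in.
con.executescript("""
    create table Orders (OrderID integer, week integer);
    insert into Orders values (1, 1), (2, 1), (3, 2), (4, 3), (5, 3);
""")

# The ">=" in the "on" clause joins each week to every order at or before it,
# so count(*) yields the number of orders known up until that week.
query = """
with weeks as (
    select distinct week from Orders
)
select weeks.week,
       count(*) as orders_until_week
from   weeks
join   Orders
  on   weeks.week >= Orders.week
group  by weeks.week
order  by weeks.week
"""
rows = con.execute(query).fetchall()
print(rows)  # [(1, 2), (2, 3), (3, 5)] -- a running total per week
```

Because each week is paired only with orders that already existed, the count reflects exactly what was known at that point in time, which avoids leaking future data into the training set.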
I hope this made SQL a little easier! Be sure you use your new skills to make our world a better place.
All code and more are available at my git repo.