TLDR: This tutorial (taken from my post here) is meant to be an entry point into SQL: we show you how to create a table in a database and load data from a CSV file into it using basic SQL commands. We use the CREATE DATABASE, CREATE TABLE and COPY commands to create a table with the appropriate data types and then load our climate change data into it. We then answer some preliminary questions about that data to get a feel for basic SQL syntax, as well as for what to look for in a new dataset.
From the climate change dataset and the scope of the project we defined previously (go here please if this makes no sense), we are now going to dive into a file from the data and find our way around dissecting it, one step at a time. In this particular post, we’re going to load a CSV file from the climate change dataset into a table in a database and run some rudimentary SQL queries against it.
SQL, or Structured Query Language, is the standard for interacting with relational databases (tabular datasets whose rows relate to each other and to other tables in the database). Among the many dialects of SQL (go here for a comprehensive discussion along with recommendations), I use PostgreSQL on Linux in this demonstration (installation instructions here, and here for Windows). It is worth noting that all of them support at least the major commands in a similar manner.
A. Create a database on local host, create a table and load data into it
- The CREATE DATABASE statement creates a new SQL database, the \connect (or \c) meta-command connects to it, and \dt lists all tables in the current PostgreSQL database. Note: SQL statements are terminated with a semicolon; commands beginning with a backslash, like \dt, are psql meta-commands rather than SQL.
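As a minimal sketch of that first session (the database name climate is an assumption; substitute your own), run inside psql:

```sql
-- Create a new database and connect to it.
CREATE DATABASE climate;
\connect climate
-- List all tables in the current database (none yet).
\dt
```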
- Before importing data into our table, we need to define the table with the required data types; the CREATE TABLE command creates a table of the chosen name in the database with the indicated column names and corresponding data types (for the full syntax, go here, and here for some SQL data types).
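For the city-level temperatures file, the definition might look like the sketch below; the column names and types here are an assumption based on the CSV's headers, so adjust them to match your file:

```sql
-- One row per city per month; dates and temperatures typed accordingly.
CREATE TABLE GlobalLandTemperaturesByCity (
    dt DATE,
    AverageTemperature NUMERIC,
    AverageTemperatureUncertainty NUMERIC,
    City VARCHAR(100),
    Country VARCHAR(100),
    Latitude VARCHAR(10),
    Longitude VARCHAR(10)
);
```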
- With the data structure now in our database, we use the COPY command to load the CSV file. Below is the command we used.
\COPY <table> FROM 'file name with full path' DELIMITER ',' CSV HEADER;
COPY and FROM are keywords, the rest are to be filled in as follows: the table name, the full path of the CSV file, and the delimiter (usually a comma in a CSV file). CSV HEADER indicates that the file is comma-separated and that the first row (the headers) should be skipped (notice that we defined our own column names in our table). Note that \COPY is a psql meta-command that reads the file on the client side, whereas the plain SQL COPY statement reads it on the server.
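After loading, a quick sanity check (table name as assumed above) confirms the rows came in as expected:

```sql
-- Peek at the first few rows to verify the import.
SELECT * FROM GlobalLandTemperaturesByCity LIMIT 5;
```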
B. Run simple SQL queries and get a sense of our loaded data
- An important initial question to ask is the size of the climate change dataset we loaded. In the previous section, note that we used the SELECT statement to select data from the table to make sure that our CSV file was read appropriately (read about the SELECT statement here). We now use the COUNT() function with the SELECT statement to retrieve the number of rows in our table. Note that this number matches the result of our COPY statement in the previous section, which reported how many rows were inserted from the CSV file.
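The row count can be retrieved like so (table name as assumed above):

```sql
-- Total number of rows loaded from the CSV file.
SELECT COUNT(*) FROM GlobalLandTemperaturesByCity;
```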
- From our data types printed out in the previous section (‘\d GlobalLandTemperaturesbyCity’), another rudimentary question to ask might be the range of dates and temperatures and the number of unique values. The MAX() and MIN() functions allow us to extract the maximum and minimum of our dates and temperatures. The DISTINCT keyword extracts unique values, which in combination with the COUNT() function gives us the number of unique values.
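For example (column names as assumed in the table definition above):

```sql
-- Date range covered by the dataset.
SELECT MIN(dt), MAX(dt) FROM GlobalLandTemperaturesByCity;

-- Temperature range.
SELECT MIN(AverageTemperature), MAX(AverageTemperature)
FROM GlobalLandTemperaturesByCity;

-- Number of unique cities.
SELECT COUNT(DISTINCT City) FROM GlobalLandTemperaturesByCity;
```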
- Given that we now know the maximum temperature in our dataset (39.651 degrees Celsius), what if we want to extract the row that contains this value? Or what if I want to know the maximum temperature in Berlin (where I live)? The WHERE clause comes to the rescue, allowing us to filter data (read about it in detail here); multiple conditions can be combined using the AND/OR operators. Finally, to end our little tour, we want to order our data by temperature; the ORDER BY clause does just this. We limit our output to 5 rows using the LIMIT clause.
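Putting these together (the 39.651 value comes from the MAX() query above; Berlin is used as the example city):

```sql
-- Row(s) containing the overall maximum temperature.
SELECT * FROM GlobalLandTemperaturesByCity
WHERE AverageTemperature = 39.651;

-- The five hottest recorded months in Berlin.
SELECT * FROM GlobalLandTemperaturesByCity
WHERE City = 'Berlin'
ORDER BY AverageTemperature DESC
LIMIT 5;
```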
Alright folks, that is it. Well done for sticking around. We’ve seen how to load our data into a database and got a feel for what kind of questions to ask (and how to write simple queries to answer them). In the next series of SQL posts, we will dig deeper and see how we can use SQL to build insights into our data.