313 views
Special Homework SPR23: Extending SQL Island with Window Functions === *Group N3 Authors: Fiona Tiefenbacher, Maurus Fritsche, Noah Lüchinger, Pablo Kleger, Thomas Haas* ## Motivation and Overview This is a document about SQL Window Functions as part of the special homework of MSE module Data Management in spring 2023. The document begins with an overview of Window Functions, including examples and limitations. This is followed by an introduction to SQL Island, and afterwards it will focus on how SQL Window Functions can be used to extend the SQL Island story. The conclusion is the final part of the document. ## What are Window Functions? SQL Window Functions, also known as Analytic Functions, are a powerful feature of SQL that allow you to perform calculations over a subset of rows in a query result, called a window. Window Functions are used to perform tasks like running totals, moving averages, and ranking data without having to use self-joins or subqueries. [1,6] **PostgreSQL SQL Syntax:** ```sql <window function> OVER ([PARTITION BY <partition column(s)>] [ORDER BY <sort column(s)>] [ROWS <frame specification>]) ``` The `OVER` clause defines the window for which the window function will be applied. The `PARTITION BY` clause is used to partition the result set into smaller groups based on one or more columns, and the `ORDER BY` clause is used to sort the rows within each partition. The `ROWS` clause specifies a frame of rows within each partition to include in the calculation of the window function. [1] There are various window functions available, shortly described in Figure 1. ![im:window functions](https://md.coredump.ch/uploads/upload_730f47ff7a300429838221d3cabf6930.png) *Figure 1: Available Window Functions [3]* In addition to these functions in Figure 2, any built-in aggregate functions, like `avg()`, `sum()`, `max()` can be used also as a window function - except *"ordered-set aggregate functions"* [8] >When an aggregate function is used as a window function, it aggregates over the rows within the current row's window frame. An aggregate used with ORDER BY and the default window frame definition produces a “running sum” type of behavior, which may or may not be what's wanted. To obtain aggregation over the whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain other effects. [7] ![](https://md.coredump.ch/uploads/upload_65c55caf819dcc9e59b0d9a7909b495b.png) *Figure 2: difference of aggregate and window function [5]* ### Window Function Calls >"A window function call represents the application of an aggregate-like function over some portion of the rows selected by a query. Unlike regular aggregate function calls, this is not tied to grouping of the selected rows into a single output row — each row remains separate in the query output. However the window function is able to scan all the rows that would be part of the current row's group according to the grouping specification (PARTITION BY list) of the window function call. The syntax of a window function call is one of the following:" ```sql function_name ([expression [, expression ... ]]) OVER window_name function_name ([expression [, expression ... ]]) OVER ( window_definition ) function_name ( * ) OVER window_name function_name ( * ) OVER ( window_definition ) ``` [4] Window functions are permitted only in the `SELECT` list and the `ORDER BY` clause of the query. They are forbidden elsewhere, such as in `GROUP BY`, `HAVING` and `WHERE` clauses. This is because they logically execute after the processing of those clauses. Also, window functions execute after regular aggregate functions. This means it is valid to include an aggregate function call in the arguments of a window function, but not vice versa. [1, 6] ### Example Window Function query ```sql SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary; ``` ```sql= depname | empno | salary | avg -----------+-------+--------+---------- develop | 11 | 5200 | 5020.0000 develop | 7 | 4200 | 5020.0000 develop | 9 | 4500 | 5020.0000 develop | 8 | 6000 | 5020.0000 develop | 10 | 5200 | 5020.0000 personnel | 5 | 3500 | 3700.0000 personnel | 2 | 3900 | 3700.0000 sales | 3 | 4800 | 4866.6667 sales | 1 | 5000 | 4866.6667 sales | 4 | 4800 | 4866.6667 (10 rows) ``` >"The first three output columns come directly from the table empsalary, and there is one output row for each row in the table. The fourth column represents an average taken across all the table rows that have the same depname value as the current row." [1] ### Limitations It should be noted that while PostgreSQL's implementation currently limits the use of *"ordered-set aggregate functions"* with window function syntax, this is a limitation that may be improved with each new release. Additionally, it is important to note that functions such as `width_bucket()` are mathematical functions and not considered to be aggregate functions or window functions. However, even though `width_bucket()` is not an aggregate function, it can still accept an aggregate function as a parameter. For example, `max()` can be used as a parameter for `width_bucket()`, as demonstrated below. **Query:** ```sql= with emp(dname, ename, sal) as (values ('Research', 'Smith', 801), ('Sales', 'James', 950), ('Research', 'Adams', 1100), ('Sales', 'Ward', 1250), ('Sales', 'Martin', 1250), ('Accounting', 'Miller', 1300), ('Sales', 'Turner', 1500), ('Sales', 'Allen', 1600), ('Accounting', 'Clark', 2450), ('Research', 'Jones', 2975), ('Research', 'Ford', 3000), ('Research', 'Scott', 3000), ('Sales', 'Blake', 3850), ('Accounting', 'King', 5000) ) select dname, ename, sal, round(avg(sal) over (partition by dname)) as average, round(var_samp(sal) over (partition by dname)) as variance, ntile (3) over (partition by dname order by sal) as ntile_3, width_bucket(sal, 0, max(sal+1) over (partition by dname), 3) as width_bucket_3 from emp where dname = 'Research' order by 3,2; ``` **Result:** | dname | ename | sal | average | variance | ntile_3 | width_bucket_3 | | -------- | ----- | ---- | ------- | -------- | ------- | -------------- | | Research | Smith | 801 | 2175 | 1261188 | 1 | 1 | | Research | Adams | 1100 | 2175 | 1261188 | 1 | 2 | | Research | Jones | 2975 | 2175 | 1261188 | 2 | 3 | | Research | Ford | 3000 | 2175 | 1261188 | 2 | 3 | | Research | Scott | 3000 | 2175 | 1261188 | 3 | 3 | [View on DB Fiddle](https://www.db-fiddle.com/f/btPVGEGQdZVcs1DUznTr5n/0) --- ## SQL Island - the SQL Adventure “The adventure game [SQL Island](https://sql-island.informatik.uni-kl.de/) is controlled by sending SQL queries which teach the fundamentals of the language in an entertaining and funny way. The player learns how to search and manipulate data in a relational database. Prior knowledge of the SQL language is not required.” [2] ### Storyline The player crashes on an island with various villages. The player needs to get gold to buy a sword in order to protect oneself from the bad villagers. Thus he collects items, sells them and additionally bakes bread to earn enough gold. The player then wants to leave the island, needing therefore a pilot who can escort him. Unfortunately this pilot is held by a bad villager. The player need to get rid of the bad villager and his sister in order to free the pilot. ## SQL Island Storyline continued... ```mermaid flowchart LR A[free Pilot]-.->B{Is there a fa:fa-plane?}; B -->|No| C[Get money /<br> Get fa:fa-plane -items] B -->|Yes| D[find it] subgraph build[build fa:fa-plane] C --> E{Who helps?} E --> L[Person1] E --> K[Person2] E --> M[...] end subgraph find[find fa:fa-plane] D --> F{Where is it?} F --> G[Person1] F --> J[Person2] F --> N[...] end build --> I[Teamwork fa:fa-users - makes the dream work fa:fa-star] find --> I I --> O[flee the island] style A fill:lightgrey,stroke:grey,stroke-width:1px style B fill:#ccd5ae,stroke:grey,stroke-width:1px style C fill:#e9edc9,stroke:grey,stroke-width:1px style D fill:#e9edc9,stroke:grey,stroke-width:1px style L fill:#e9edc9,stroke:grey,stroke-width:1px style F fill:#ccd5ae,stroke:grey,stroke-width:1px style E fill:#ccd5ae,stroke:grey,stroke-width:1px style K fill:#e9edc9,stroke:grey,stroke-width:1px style G fill:#e9edc9,stroke:grey,stroke-width:1px style J fill:#e9edc9,stroke:grey,stroke-width:1px style M fill:#e9edc9,stroke:grey,stroke-width:1px style N fill:#e9edc9,stroke:grey,stroke-width:1px style I fill:#faedcd,stroke:grey,stroke-width:1px style O fill:#d4a373,stroke:grey,stroke-width:1px style build fill:#e9edc9 style find fill:#e9edc9 ``` ### Get a Plane After the pilot is freed, the castaway discovers that the plane is missing. To find the plane, they need to analyze the data they have collected about the island. They have a database that contains information about the plane's last known location, as well as data from various sensors and surveillance equipment scattered around the island. Assuming that the data for the plane's location was being recorded by GPS sensors, the castaway uses a window function to identify the last known location of the plane. For example, let's say we have a table called sensor_data with columns sensor_id, reading_time, sensor_type, and sensor_value. The castaway can use the following query to find the last known location of the plane: <!-- ```sql CREATE TABLE sensor_data ( sensor_id SERIAL PRIMARY KEY, reading_time TIMESTAMP, sensor_type VARCHAR(50), sensor_value VARCHAR(50) ); INSERT INTO sensor_data (reading_time, sensor_type, sensor_value) VALUES ('2023-05-09 12:00:00', 'Altitude', '30000'), ('2023-05-09 12:00:00', 'Temperature', '-30'), ('2023-05-09 12:00:00', 'Velocity', '500'), ('2023-05-09 12:00:00', 'Heading', '27'), ('2023-05-09 12:00:00', 'GPS', '19.238649, 166.629635'), ('2023-05-09 12:09:00', 'Altitude', '8000'), ('2023-05-09 12:09:00', 'Temperature', '-15'), ('2023-05-09 12:09:00', 'Velocity', '500'), ('2023-05-09 12:09:00', 'Heading', '25'), ('2023-05-09 12:09:00', 'GPS', '19.259846, 166.647957'), ('2023-05-09 12:18:00', 'Altitude', '2000'), ('2023-05-09 12:18:00', 'Temperature', '0'), ('2023-05-09 12:18:00', 'Velocity', '200'), ('2023-05-09 12:18:00', 'Heading', '8'), ('2023-05-09 12:18:00', 'GPS', '19.272048, 166.655232'), ('2023-05-09 12:27:00', 'Altitude', '0'), ('2023-05-09 12:27:00', 'Temperature', '35'), ('2023-05-09 12:27:00', 'Velocity', '0'), ('2023-05-09 12:27:00', 'Heading', '5'), ('2023-05-09 12:27:00', 'GPS', '19.273692, 166.655202'), ('2023-05-09 22:00:00', 'Altitude', '10'), ('2023-05-09 22:00:00', 'Temperature', '28'), ('2023-05-09 22:00:00', 'Velocity', '2'), ('2023-05-09 22:00:00', 'Heading', '276'), ('2023-05-09 22:00:00', 'GPS', '19.278431, 166.652381'), ('2023-05-09 23:00:00', 'Altitude', '8'), ('2023-05-09 23:00:00', 'Temperature', '26'), ('2023-05-09 23:00:00', 'Velocity', '0'), ('2023-05-09 23:00:00', 'Heading', '186'), ('2023-05-09 23:00:00', 'GPS', '19.278850, 166.648340'); ``` --> ```sql SELECT * FROM ( SELECT sensor_id, sensor_type, reading_time, sensor_type, sensor_value, ROW_NUMBER() OVER (PARTITION BY sensor_type ORDER BY reading_time DESC) AS row_num FROM sensor_data WHERE sensor_type = 'GPS' ) AS subquery WHERE subquery.row_num = 1; ``` [View on DB Fiddle](https://www.db-fiddle.com/f/5y5axgtApTcozLHmkA36AS/2) This query first filters for sensor data of type 'GPS', and then uses a window function to assign a row number to each row within each partition (group) of sensor_type. The rows are ordered by sensor_reading_time in descending order, so the row with the most recent reading for each sensor_type will have row_num equal to 1. The outer query then filters for rows where row_num is equal to 1, giving us the last known location of the plane. ### Who is Involved If the plane is not found, the castaway may need to find people to build a plane. They can use SQL window functions to identify and count the people with the necessary skills and experience: <!-- ```sql CREATE TABLE inhabitant ( person_id SERIAL PRIMARY KEY, name VARCHAR(50), village_id VARCHAR(50), gender VARCHAR(50), job VARCHAR(50), skill_type VARCHAR(50), gold INTEGER, state VARCHAR(50) ); INSERT INTO inhabitant (name, village_id, gender, job, skill_type, gold, state) VALUES ('Paul Bakerman', 1, 'm', 'baker', 'food', 850, 'friendly'), ('Ernest Perry', 3, 'm', 'weapon smith', 'mechanical', 280, 'friendly'), ('Rita Ox', 1, 'f', 'baker', 'food', 350, 'friendly'), ('Carl Ox', 1, 'm', 'merchant', 'sales', 250, 'friendly'), ('Dirty Dieter', 3, 'm', 'smith', 'mechanical', 650, 'evil'), ('Gerry Slaughterer', 2, 'm', 'butcher', 'food', 4850, 'evil'), ('Peter Slaughterer', 3, 'm', 'butcher', 'food', 3250, 'evil'), ('Arthur Tailor', 2, 'm', 'pilot', 'aviation', 490, 'kidnapped'), ('Tiffany Drummer', 1, 'f', 'baker', 'food', 550, 'evil'), ('Peter Drummer', 1, 'm', 'smith', 'mechanical', 600, 'friendly'), ('Dirty Diane', 3, 'f', 'farmer', 'mechanical', 10, 'evil'), ('Otto Alexander', 2, 'm', 'dealer', 'sales', 680, 'friendly'), ('Fred Dix', 3, 'm', 'author', 'creation', 420, 'friendly'), ('Enrico Carpenter', 3, 'm', 'weapon smith', 'mechanical', 510, 'evil'), ('Helen Grasshead', 2, 'f', 'dealer', 'sales', 680, 'friendly'), ('Ivy Hatter', 1, 'f', 'dealer', 'sales', 770, 'evil'), ('Edward Grasshead', 3, 'm', 'butcher', 'food', 990, 'friendly'), ('Ryan Horse', 3, 'm', 'black smith', 'mechanical', 390, 'friendly'), ('Ann Meaty', 2, 'f', 'butcher', 'food', 2280, 'friendly'); ``` --> ```sql SELECT person_id, name, job, COUNT(*) OVER (PARTITION BY skill_type) AS skill_count FROM inhabitant WHERE skill_type IN ('mechanical', 'aviation') ``` [View on DB Fiddle](https://www.db-fiddle.com/f/7fmqDv522Y8ZBJXmp9sJgd/0) This query returns a table that shows the number of people with mechanical or aviation skills. The castaway can use this information to identify the most skilled and qualified people to help build a plane. ### Flee the Island Once the plane is ready, the castaway can use SQL window functions to plan their escape from the island. For example, they could use the ROW_NUMBER function to rank the available flight paths based on distance, fuel consumption, and potential other factors: <!-- ```sql CREATE TABLE available_flight_paths ( flight_id SERIAL PRIMARY KEY, flight_path VARCHAR(50), distance INTEGER, fuel_consumption INTEGER ); INSERT INTO available_flight_paths (flight_path, distance, fuel_consumption) VALUES ('Hawaii', 3695, 44400), ('Fiji', 4283, 51400), ('New Guinea', 3838, 46100), ('Cairns', 4623, 55500), ('Auckland', 6317, 75800), ('Tokio', 3192, 38300), ('Los Angeles', 7470, 89700), ('Philippines', 4628, 55600); ``` --> ```sql SELECT flight_path, ROW_NUMBER() OVER (ORDER BY distance ASC, fuel_consumption ASC) AS path_rank, distance, fuel_consumption FROM available_flight_paths ``` [View on DB Fiddle](https://www.db-fiddle.com/f/khVJju2gnmgGFD727SBjQX/2) This query returns a table that shows the available flight paths ranked by distance and fuel consumption. The castaway can use this information to choose the best flight path and escape the island safely. --- ## Conclusion In conclusion, window functions in PostgreSQL are powerful tools for advanced data analysis, offering versatile capabilities for calculations, aggregations, and querying. They provide valuable insights, enable informed decision-making, and facilitate efficient data processing. Window functions are essential for extracting meaningful information and solving complex problems, making them a valuable asset in PostgreSQL's data manipulation arsenal. In the given story, the castaway relied on window functions to extract valuable information from their database and make critical decisions, such as locate the missing plane and identify skilled individuals. Later, they used these functions to plan their escape by ranking flight paths based on various factors. This highlights the versatility and usefulness of window functions in analyzing data and making informed decisions. ## Quiz ### 1. Select all optional clauses of the window function - [ ] OVER - [ ] PARTITION BY - [ ] ORDER BY <details> <summary style='color: green'>Solution</summary> - [ ] OVER - [x] PARTITION BY - [x] ORDER BY </details> ### 2. Select all clauses of the query in which the window functions are allowed - [ ] SELECT - [ ] GROUP BY - [ ] ORDER BY - [ ] WHERE <details> <summary style='color: green'>Solution</summary> - [x] SELECT - [ ] GROUP BY - [x] ORDER BY - [ ] WHERE </details> ### 3. which query returns the cumulative sum of of all salaries - [ ] SELECT sum(salary) OVER () FROM empsalary; - [ ] SELECT sum(salary) OVER (PARTITION BY depname) FROM empsalary; - [ ] SELECT sum(salary) OVER (ORDER BY salary) FROM empsalary; <details> <summary style='color: green'>Solution</summary> - [ ] SELECT sum(salary) OVER () FROM empsalary; - [ ] SELECT sum(salary) OVER (PARTITION BY depname) FROM empsalary; - [x] SELECT sum(salary) OVER (ORDER BY salary) FROM empsalary; </details> ### 4. How will the result of this query be ordered? ```sql SELECT * FROM empsalary ORDER BY rank() OVER ( PARTITION BY depname ORDER by depname, salary DESC); ``` <details> <summary style='color: green'>Solution</summary> The highest salaries of all departments are listed first, followed by the second highest... </details> ## References [1] The PostgreSQL Global Development Group (2023). PostgreSQL Documentation. Tutorial Window Function. https://www.postgresql.org/docs/current/tutorial-window.html [28.05.2023] [2] Schildgen, Johannes (2014). SQL Island: An Adventure Game to Learn the Database Language SQL. [3] Data Geekery GmbH (2016). SQL Masterclass. [4] The PostgreSQL Global Development Group (2023). PostgreSQL Documentation. Window Function Calls. https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS [28.05.2023] [5] SQL Tutorial (2022). SQL Window Functions. https://www.sqltutorial.org/sql-window-functions/ [28.05.2023] [6] Learn SQL (2017). SQL Window Function Example With Explanations https://learnsql.com/blog/sql-window-functions-examples/ [28.05.2023] [7] The PostgreSQL Global Development Group (2023). PostgreSQL Documentation. Window Functino https://www.postgresql.org/docs/current/functions-window.html [28.05.2023] [8] The PostgreSQL Global Development Group (2023). PostgreSQL Documentation. Aggregate Functions https://www.postgresql.org/docs/current/functions-aggregate.html [28.05.2023] ## Bibliography - Zhiyanov, Anton (2023). SQL Window Functions Explained. https://antonz.org/sql-window-functions-book/ [28.05.2023] - Microsoft (2023). SELECT - WINDOW - (Transact-SQL). https://learn.microsoft.com/en-us/sql/t-sql/queries/select-window-transact-sql?view=sql-server-ver16 [28.05.2023] - GeeksForGeeks (2023). Window functions in SQL. https://www.geeksforgeeks.org/window-functions-in-sql/ [28.05.2023] ![](https://openschoolmaps.ch/bilder/license.png) Free to use under license (c) [CC0 1.0](http://creativecommons.org/publicdomain/zero/1.0/) unless otherwise noted.