SQL Interview Questions for Data Analysts in 2025 – Top Questions (and Answers)

Sat, Mar 22, 2025

This is Part 1 of this SQL interview questions series. Do check out Part 2 where we discuss SQL test cases, behavioral SQL questions and how to ace your SQL interview.

Preparing for a data analyst interview in 2024? One of the key skills you'll be tested on is SQL. Whether you're a fresher stepping into your first data role or an experienced analyst brushing up your skills, mastering SQL is crucial for extracting insights from data.

SQL questions for a data analyst interview can range from simple definitions to writing complex queries and solving real-world problems. Interviewers typically focus on a mix of theoretical questions and practical tasks to ensure you can handle data scenarios end-to-end.

In this 2024 guide, we'll cover a wide range of SQL interview questions and answers for data analyst roles, complete with clear explanations and examples.

We'll start with basic SQL interview questions to make sure your fundamentals are solid, then move on to advanced SQL questions and scenario-based challenges that test deeper knowledge.

You'll also encounter real-world SQL scenarios for interviews, where we discuss how to approach typical tasks a data analyst might face..

Pro Tip: As you prepare, don't just memorize queries—practice writing and running them. Many candidates find it helpful to use online SQL editors or take mock tests.

For instance, Refonte Learning offers excellent programs for upskilling in SQL and data analytics, where you can work on projects and get hands-on practice.

Now, let's dive into these SQL interview questions for data analyst interviews in 2024 – from basic to advanced!

Basic SQL Interview Questions

Let's start with the fundamentals. These basic questions cover concepts every data analyst should know, especially if you're a beginner.

1. What is SQL and why is it important for data analysts?
SQL stands for Structured Query Language. It's the standard language used to communicate with relational databases – allowing you to query and manipulate data. For a data analyst, SQL is essential because most organizational data lives in databases.

Knowing SQL means you can extract exactly the data you need for analysis, join multiple tables to gather insights, and perform aggregations (sums, averages, counts) to answer business questions.

2. What are the different types of SQL statements (DDL, DML, DCL, TCL)?
SQL commands are grouped into categories based on their purpose:

DDL (Data Definition Language): Defines or alters the structure of the database (e.g., CREATE, ALTER, DROP).
DML (Data Manipulation Language): Manipulates the data in the database (e.g., SELECT, INSERT, UPDATE, DELETE).
DCL (Data Control Language): Controls access to data (e.g., GRANT, REVOKE permissions).
TCL (Transaction Control Language): Manages transactions (e.g., COMMIT, ROLLBACK for saving or undoing groups of operations).

3. What is a primary key in SQL?
A primary key is a column (or a set of columns) in a table that uniquely identifies each row in that table. Primary keys must be unique for every record and they cannot be NULL.

For example, a customer_id column can serve as a primary key in a Customers table to ensure each customer row is distinct.

The primary key allows efficient look-ups and is often used to relate that table to others (via foreign keys).

4. What is a foreign key in SQL?
A foreign key is a column in one table that references the primary key of another table. This link between tables enforces referential integrity – it ensures that the relationship between two tables remains consistent.

For example, an Orders table might have a customer_id column that is a foreign key referencing the Customers table's primary key.

This means an order’s customer_id must match an existing customer in the Customers table (preventing, say, an order for a non-existent customer).

5. Explain the different types of JOINs in SQL.
JOINs are used to combine data from two or more tables based on a related column between them:

INNER JOIN: Returns only the rows where there is a match in both tables (the “intersection” of the tables).
LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table, and the matching rows from the right table. If there's no match, the result will have NULLs for the right table's columns.
RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table, and the matching rows from the left table. If no match, NULLs for the left table's columns. (Right joins are less common; you can usually swap the table order and use a left join instead.)
FULL JOIN (FULL OUTER JOIN): Returns all rows from both tables, matching them where possible. Unmatched rows from either side will still appear (NULLs will fill in for missing data).
CROSS JOIN: Returns the Cartesian product of two tables – every combination of a row from the first table with a row from the second table. (This doesn’t require a matching key and is rarely used unless you need all combinations.)

6. What is the difference between the WHERE and HAVING clauses?
Both WHERE and HAVING are used to filter query results, but at different stages of the query execution:

WHERE filters rows before any grouping or aggregation occurs. It applies to individual rows of the table (you can’t use aggregate functions here).
HAVING filters groups after the GROUP BY has been applied. It applies to aggregated results (you use HAVING to filter based on aggregate values, like HAVING COUNT(*) > 5).

In short, use WHERE to restrict which rows are considered in the query (e.g., WHERE region = 'EU'), and use HAVING to restrict which groups appear in the results (e.g., HAVING COUNT(*) > 5 to get only groups with more than 5 items).

7. How does the GROUP BY clause work in SQL?
The GROUP BY clause is used to arrange identical data into groups. In a query with aggregation, GROUP BY collates rows that have the same values in specified column(s) into a single group, so that aggregate functions (COUNT, SUM, AVG, MAX, MIN) can be applied to each group.

For example, to get total sales per region, you'd write:

sql

Copy

SELECT region, SUM(amount)

FROM Sales

GROUP BY region;

This groups the Sales records by region and produces one summary row per region with the total sales amount. Every column in your SELECT that’s not inside an aggregate function must be listed in the GROUP BY. Essentially, GROUP BY + aggregates will give you one result per group.

8. What is a NULL value in SQL and how do you handle NULLs?
NULL represents the absence of a value (unknown or missing data) in SQL. It is not the same as zero or an empty string. Handling NULLs properly is important:

To check if a value is NULL, use IS NULL or IS NOT NULL (e.g., WHERE address IS NULL finds rows with no address).
Any comparison with NULL (like = NULL or > NULL) will result in neither true nor false (it's "unknown"), so those comparisons won’t match anything. That's why the special IS NULL syntax is needed.
Use functions like COALESCE(column, default_value) to replace NULL with a default. For example, COALESCE(phone, 'N/A') would return "N/A" when phone is NULL, otherwise it returns the phone value.
Be careful with NULLs in JOINs and aggregates. For instance, in an OUTER JOIN, if a match isn’t found the joined columns will be NULL. And COUNT(column) counts only non-NULL values in that column.

9. What is an index in SQL and why would you use it?
An index is like a lookup that makes data retrieval faster. By indexing a column (or set of columns), the database can find matching rows without scanning the entire table.

For example, an index on customer_id in Orders lets the database quickly locate all orders for a given customer. In practice, you add indexes on columns that are frequently filtered or joined on, to improve query performance.

10. What is the difference between DELETE, TRUNCATE, and DROP commands?

DELETE: Removes specific rows from a table (with an optional WHERE clause). The data is gone, but the table remains. (Can often be rolled back if within a transaction.)
TRUNCATE: Removes all rows from a table very quickly, without logging each delete. It's like a bulk delete of the whole table. The table remains (now empty). You can’t use a WHERE clause with TRUNCATE, and it usually cannot be undone with rollback.
DROP: Completely deletes the table (schema and data) from the database. Use this to entirely remove a table’s definition and its contents.

11. What is a view in SQL and why might you use one?
A view is a virtual table defined by a query. It doesn’t store data itself; it shows data from underlying tables according to the query. You create a view with a statement like CREATE VIEW view_name AS SELECT ....

Why use a view? Mainly to simplify or secure data access. For example, you can create a view to simplify a complex query (the view encapsulates the JOINs/subqueries, so others can select from the view directly) or to limit access to data (the view can show only certain columns or filter out sensitive rows, enforcing a form of security). Views essentially provide a saved query that you can treat as a table for convenience.

Advanced SQL Questions & Scenarios

Now let's tackle some advanced SQL interview questions. These delve into trickier concepts or scenarios you might encounter as a data analyst.

Interviewers ask these to see if you can write complex queries, handle large datasets, and optimize performance when needed. Many are scenario-based, requiring you to think through how to solve a problem with SQL.

1. What are window functions in SQL, and can you give an example?
Window functions allow you to perform calculations across rows without collapsing them into groups. In other words, you can compute values like running totals, averages, or rankings over a "window" of rows that relates to the current row. They use an OVER clause to define that window (often partitioned by some column and ordered by another).

For example, you could show each employee's salary and also the average salary of their department on the same row using a window function:

sql

Copy

SELECT name, department, salary,

AVG(salary) OVER(PARTITION BY department) AS dept_avg_salary

FROM Employees;

This will list each employee, and the window function computes the department average salary and displays it alongside each employee (instead of one result per department like GROUP BY would do).

Window functions are great for tasks like running totals, moving averages, or ranking. They let you answer questions like "what is this employee's salary rank within the company?" or "what is this order's total compared to the sum of all orders so far this month?" in a straightforward way.

2. What is a Common Table Expression (CTE), and when would you use it?
A Common Table Expression (CTE) is a way to define a temporary result set within your query using the WITH clause. It's essentially a named subquery that you can reference multiple times in the main query, which helps break complex queries into parts and improves readability.

For example, instead of writing a complicated subquery inline, you could do:

sql

Copy

WITH SalesByDept AS (

SELECT department_id, SUM(amount) AS total_sales

FROM Sales

GROUP BY department_id

)

SELECT department_id, total_sales

FROM SalesByDept

WHERE total_sales > 100000;

Here, SalesByDept is a CTE that computes total sales per department, and then the main query uses that result to filter departments with sales over 100k.

CTEs make long queries easier to understand (especially if you need to use the same derived data multiple times or do recursive queries). In terms of performance, they often behave like normal subqueries, but they enable a cleaner query structure.

3. What is a self join, and can you give an example scenario for using one?
A self join is when a table is joined to itself. This is useful when the table has a hierarchical or relational structure within itself.

A common scenario is an Employee table where each employee has a manager_id that references another row in the same table (the manager). To list employees and their manager names, you'd self join the table:

sql

Copy

SELECT e.employee_name AS employee,

m.employee_name AS manager

FROM Employees e

JOIN Employees m

ON e.manager_id = m.employee_id;

Here, Employees is used twice (aliased as e for the employee and m for the manager). The join matches an employee’s manager_id to the manager’s employee_id in the same table.

Self joins can also be used for comparing rows in a table to other rows in the same table. Whenever you have a relationship of a row to another row in the same table, a self join is a go-to solution.

4. What is the difference between UNION and UNION ALL?
Both UNION and UNION ALL combine the results of two or more SELECT queries. The difference is:

UNION: Eliminates duplicate rows in the combined result. It returns only distinct rows across both query results.
UNION ALL: Includes all rows from both queries, including duplicates. It does no deduplication.

Use UNION when you need to avoid duplicates in the result set. Use UNION ALL when you want all results combined (and it’s faster since it doesn’t have to remove duplicates). Remember, the SELECT statements being unioned must have the same number of columns with compatible data types.

5. How can you find duplicate records in a table?
To find duplicates, determine which column or combination of columns should be unique, and then find where that uniqueness is violated. The typical approach is:

sql

Copy

SELECT column1, column2, ..., COUNT(*) as cnt

FROM YourTable

GROUP BY column1, column2, ...

HAVING COUNT(*) > 1;

Replace column1, column2, ... with the fields that should be unique (for a single-column key, just that column). This query groups by the identifying fields and uses HAVING COUNT(*) > 1 to filter to groups of rows that appear more than once, indicating duplicates.

For example, to find duplicate emails in a Users table:

sql

Copy

SELECT email, COUNT(*)

FROM Users

GROUP BY email

HAVING COUNT(*) > 1;

This will list any email that appears more than once, along with how many times it appears.

If you need to actually delete duplicates or see the full rows that are duplicated, you'd use this result as a subquery or join it back to the main table to pinpoint the rows. But in terms of interview answer, explaining the GROUP BY + HAVING approach is usually sufficient.

6. How would you get the second highest (or Nth highest) value from a table?
Finding the Nth highest value can be done in a few ways. For example, to get the second highest salary from an Employees table:

Using a subquery: One approach is:

sql
Copy
SELECT MAX(salary)

FROM Employees

WHERE salary < (SELECT MAX(salary) FROM Employees);

Here, the inner subquery finds the maximum salary overall, and the outer query finds the highest salary less than that (i.e., the second highest). For the third highest, you could nest another layer or adjust the logic (e.g., find max less than the second highest).

Using ORDER BY with OFFSET: Another method (in databases that support it) is:

sql
Copy
SELECT salary

FROM Employees

ORDER BY salary DESC

LIMIT 1 OFFSET 1;

This sorts salaries in descending order and skips the first row (highest salary) and then takes the next row, which is the second highest. Similarly, OFFSET 2 would get the third highest, and so on.
Using a window function: Another approach is to use ranking functions (e.g., DENSE_RANK()) in a subquery or CTE to pick the Nth highest value. This method can handle ties properly (for example, second highest distinct value).

Any of these methods are acceptable in an interview; what's important is showing you understand how to approach the problem. The subquery method is a common answer since it’s straightforward for explaining the second highest scenario.

7. How do you optimize a slow-running SQL query?
To optimize a slow query, consider:

Indexing: Make sure appropriate columns (used in WHERE filters or JOINs) are indexed so the database can quickly look up matches instead of scanning full tables.
Query simplification: Only select the data you need (avoid SELECT *), and filter data early in the query (in the WHERE clause) to avoid processing unnecessary rows. If a query is overly complex, break it into smaller parts or use a CTE to make it more efficient or easier to understand. Sometimes simplifying a subquery or removing an unnecessary JOIN can speed things up.
Check the execution plan: This will tell you how the database is executing the query (e.g., using indexes or doing full scans). If you see a full table scan where you expected an index seek, you might need to add an index or rewrite the query. Tools like EXPLAIN in MySQL/PostgreSQL or the execution plan in SQL Server can guide you to the bottleneck.

In an interview, mentioning these strategies (especially indexing and filtering) shows you understand performance. And if you've encountered slow queries before, you can talk about how you sped them up (e.g., adding an index reduced a query from 10 seconds to 1 second). Remember, as data grows, writing efficient SQL becomes increasingly important, so interviewers value insight into optimization.

8. What is a correlated subquery, and how does it differ from a regular subquery?
A correlated subquery is a subquery that depends on the outer query for some of its values. It means the subquery is executed for each row of the outer query. A regular subquery (non-correlated) can run independently of the outer query and is usually executed once, with the result fed into the outer query.

Example of a correlated subquery:

sql

Copy

SELECT e.name

FROM Employees e

WHERE e.salary > (

SELECT AVG(x.salary)

FROM Employees x

WHERE x.department = e.department

);

Here, the subquery calculates the average salary for the department of the current employee e. Because it refers to e.department from the outer query, it must be re-run for each employee. Each employee is compared to their department’s average. This is correlated because the inner query uses a value (e.department) from the outer query.

Example of a non-correlated subquery:

sql

Copy

SELECT name

FROM Employees

WHERE salary > (SELECT AVG(salary) FROM Employees);

The inner query here (SELECT AVG(salary) FROM Employees) computes one value (the overall average salary) independently. The outer query then compares each employee's salary to that single value. The subquery doesn’t need to know about the individual employee; it’s not correlated.

In summary, a correlated subquery runs repeatedly (once for each row of the outer query), using data from that row, whereas a regular subquery runs once and doesn’t depend on individual outer rows. Correlated subqueries can be less efficient if the outer result set is large, but sometimes they’re the most straightforward way to express a query. Often, you could rewrite a correlated subquery as a JOIN for performance.

9. What is normalization in databases, and what are the benefits of normalizing data?
Normalization means organizing data into multiple related tables to minimize duplication. In a normalized database, each piece of information is stored only once (instead of repeating, you link tables via keys).

Benefits: It eliminates redundant data and ensures consistency (e.g., a customer's info is stored in one table and referenced elsewhere, so updating it in one place updates it everywhere). This also prevents anomalies when inserting or deleting data, because dependencies are properly managed.

For example, instead of one big table with orders and customer details (with the same customer info repeated on each order), you'd have a Customers table and an Orders table related by a customer_id.

The trade-off is that you may need to JOIN tables more often to get a full picture, which can make queries a bit more complex or slightly slower in read-heavy scenarios.

However, for most transactional systems, normalization (usually up to 3NF) is ideal for maintaining clean and consistent data.

With SQL, there's often more than one way to solve a problem – if you choose a correct approach and explain it well, that's a success. And if you get to a solution efficiently, even better.

Read Part 2 of this article series for more insights on how to ace SQL interview and more.