01. What is the difference between WITH and SELECT?
The SELECT clause retrieves specific columns (or all columns) and rows from a table. The WITH clause, on the other hand, defines a Common Table Expression (a.k.a. CTE): a named, temporary result set that the main query can reference like a table. A CTE helps break a complex query into readable steps and avoids repeating the same subquery.
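A minimal sketch of a CTE, shown here through Python’s built-in sqlite3 module so it is runnable as-is (the orders table and its values are made up for illustration):

import sqlite3

# In-memory database with a small sample table (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (1, 250), (2, 40)])

# The WITH clause names an intermediate result set (the CTE "order_totals"),
# and the final SELECT queries it like a regular table.
query = """
WITH order_totals AS (
    SELECT id, SUM(amount) AS total
    FROM orders
    GROUP BY id
)
SELECT id, total FROM order_totals WHERE total > 100;
"""
for row in conn.execute(query):
    print(row)   # -> (1, 350)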
02. What’s the difference between a clustered index and a non-clustered index?
A clustered index determines the physical order in which the table’s rows are stored, so reads that follow the index order are faster. A non-clustered index, in contrast, is stored separately from the data: the index structure holds pointers to the rows, so the physical storage of the data is independent of the index order. A table can have only one clustered index but many non-clustered indexes.
03. How do you write an SQL query to find the IDs that occur exactly once (count = 1) in a table?
T1
===
id
===
1
2
1
1
3
SELECT id FROM T1
GROUP BY id
HAVING COUNT(id) = 1;
Output
===
2
3

04. What’s the star schema?
The star schema is a multi-dimensional data model built around a central fact table that references surrounding dimension tables. Star schemas are used in data warehouses and data marts, and the schema is useful for querying large datasets.
05. What’s the snowflake schema?
The snowflake schema is also a multi-dimensional data model and is an extension of the star schema: here, the dimensions are further broken down into sub-dimensions (normalized into separate tables). The schema is useful for OLAP workloads on relational databases.
06. How do you create a trigger, and what is it used for?
A trigger is a database object that fires automatically when a particular event occurs. In the example below, when we insert a new row into the “employees” table, the trigger sets the hire date to the system date for the newly inserted row. Triggers should be used judiciously; otherwise, they may lead to performance issues.
-- Creating a table for demonstration purposes
CREATE TABLE employees (
    employee_id   NUMBER PRIMARY KEY,
    employee_name VARCHAR2(50),
    hire_date     DATE
);

-- Creating a trigger
CREATE OR REPLACE TRIGGER trg_employee_insert
BEFORE INSERT ON employees
FOR EACH ROW
BEGIN
    -- Setting the hire_date to the current date/time when a new record is inserted
    :NEW.hire_date := SYSDATE;
END trg_employee_insert;
/

07. What’s ACID compliance?
ACID (Atomicity, Consistency, Isolation, Durability) compliance means the database handles transactions reliably: each transaction completes fully or not at all, leaves the data in a consistent state, runs in isolation from concurrent transactions, and, once committed, survives failures without data loss.
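A minimal sketch of the atomicity part, again with Python’s sqlite3 module (the accounts table and the simulated failure are made up for illustration): the two updates inside the transaction are either both committed or both rolled back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # Transfer 50 from alice to bob as one transaction
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        raise RuntimeError("simulated failure")  # forces a rollback
except RuntimeError:
    pass

# Neither update was applied, so the balances are unchanged
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]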
Python Interview Questions
08. Why is a tuple faster than a List?
A tuple is faster than a list because of immutability: tuples support fewer operations than lists, the interpreter can optimize them more aggressively, and creating a tuple is faster than creating a list.
Also, tuples are internally stored in a single block of memory. Lists, on the other hand, are stored in two blocks: a fixed block holding the Python object’s information and a variable block holding references to the actual data, which is what allows a list to grow and shrink.
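A quick way to see the difference is to time literal creation with the standard timeit module; the exact numbers depend on the interpreter and the machine, but the tuple is consistently cheaper:

import timeit

# Creating a tuple literal vs. a list literal with the same elements
tuple_time = timeit.timeit("(1, 2, 3, 4, 5)", number=1_000_000)
list_time = timeit.timeit("[1, 2, 3, 4, 5]", number=1_000_000)

print(f"tuple: {tuple_time:.3f}s, list: {list_time:.3f}s")
# The tuple wins because CPython can build the constant tuple once and reuse it,
# while the list must be rebuilt on every iteration.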
09. Can we have a list in a tuple?
Yes, you can have a list inside a tuple, since tuples support heterogeneous data types. The tuple itself remains immutable, but the list stored inside it can still be modified in place.
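A short illustration (the variable names are arbitrary):

t = (1, "two", [3, 4])   # a tuple holding an int, a str and a list

t[2].append(5)           # the list inside the tuple is still mutable
print(t)                 # -> (1, 'two', [3, 4, 5])

# t[2] = [9]             # rebinding a tuple slot, however, raises TypeError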
10. What’s a context manager in Python?
Context managers ensure resource cleanup and proper exception handling. With the “with” statement, __exit__() is called automatically when the block ends, reducing the need for try-finally blocks.
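A minimal sketch of a hand-written context manager (the class and file names are just for illustration):

class ManagedFile:
    """Opens a file on entry and guarantees it is closed on exit."""

    def __init__(self, path, mode="r"):
        self.path = path
        self.mode = mode

    def __enter__(self):
        self.file = open(self.path, self.mode)
        return self.file

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()   # runs even if the with-block raised
        return False        # do not suppress exceptions

with ManagedFile("notes.txt", "w") as f:
    f.write("hello")
# The file is closed here, whether or not the write raised an exception.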
11. Why are arrays faster? Can they hold different data types or collection types?
Arrays are stored contiguously in memory, so accessing array elements is fast. Secondly, an array can only hold homogeneous numeric items: the data type of all the elements must be the same.
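A short sketch with Python’s built-in array module, which enforces a single numeric type per array:

from array import array

a = array("i", [1, 2, 3, 4])   # "i" = signed int; elements are packed contiguously
a.append(5)
print(a[2], sum(a))            # -> 3 15

# a.append("six")              # would raise TypeError: an integer is required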
PySpark Interview Questions
12. Can we write a PySpark query to pull the ids with count = 1?
In PySpark, you can find the ids that occur exactly once (count = 1) in the following way:
from pyspark.sql.functions import count, col
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()
# Sample data
data = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("D", 6)]
# Create a DataFrame
columns = ["id", "value"]
df = spark.createDataFrame(data, columns)
# Perform the aggregation and filter
result = df.groupBy("id") \
    .agg(count("id").alias("count_ids")) \
    .where(col("count_ids") == 1)
# Show the result
result.show(truncate=False)

13. Can we use SQL queries in PySpark, and how?
Yes, you can write SQL queries directly. To do this, you first create a temporary view with a name (here, “people”) and then run SQL queries against that view to get rows back. Here’s an example:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()
# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
# Create a DataFrame
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Create a temporary SQL table/view
df.createOrReplaceTempView("people")
# Execute SQL queries
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
# Show the result
result.show()

Q.1) How do we transform the data below using PySpark?
1|A, B, C, D, E
2|E, F, G
As
1A
1B
1C
1D
1E
2E
2F
2G
Solution
%python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, concat
from pyspark.sql.types import StringType, StructType, StructField

# Create a Spark session
spark = SparkSession.builder.appName("data_transformation").getOrCreate()

# Given data
data = ["1|A, B, C, D, E", "2|E, F, G"]

# Define the schema
schema = StructType([StructField("column", StringType(), True)])

# Convert data to a DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd.map(lambda x: (x,)), schema)
df.show()

# Split the data based on the pipe (|) and explode the array of values
df_transformed = (
    df.withColumn("id", split("column", "\\|")[0])
      .withColumn("values", split("column", "\\|")[1])
      .withColumn("value", explode(split("values", ", ")))
      .selectExpr("id", "trim(value) as value")
)

# Show the transformed data
df_concat = df_transformed.select(concat("id", "value").alias("Concated"))
df_concat.show()
Output
+---------------+
|         column|
+---------------+
|1|A, B, C, D, E|
|      2|E, F, G|
+---------------+

+--------+
|Concated|
+--------+
|      1A|
|      1B|
|      1C|
|      1D|
|      1E|
|      2E|
|      2F|
|      2G|
+--------+
Q.2) How many rows are displayed in an inner join of the two tables below (joined on their single column)?
Table T1
1
1
1
2
null
Table T2
1
1
null
null
2
Output
The answer is 7 rows: each of the three 1s in T1 matches both 1s in T2 (3 × 2 = 6 rows), the 2 in T1 matches the single 2 in T2 (1 row), and the NULLs never match each other in a join.
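A quick sketch that reproduces this count in PySpark (the column name id is assumed; None stands for NULL):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join_count").getOrCreate()

# The two single-column tables from the question; None represents NULL
t1 = spark.createDataFrame([(1,), (1,), (1,), (2,), (None,)], "id int")
t2 = spark.createDataFrame([(1,), (1,), (None,), (None,), (2,)], "id int")

# Inner join: 1 matches 3 x 2 = 6 times, 2 matches once, NULLs never match
joined = t1.join(t2, on="id", how="inner")
print(joined.count())   # -> 7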
Q.3) How many rows will be displayed with a UNION of the same two tables?
Output
Only 3 rows, because UNION removes duplicates: the distinct values across both tables are 1, 2 and NULL (the NULL row is shown as “-” below).
ID
-----
1
2
-
3 rows selected.

Q.1) What’s the role of a leader?
A Data Engineer in a leadership role manages the design, development, and maintenance of robust, scalable data systems. The position requires technical skill, project management, and effective communication to ensure data projects align with organizational goals.
01. How do you sum all the digits of an integer?
# Code to sum all the digits of an integer
a = 12345
d = 0
for i in str(a):
    d += int(i)
print(d)

Output
15
02. How do you reverse a string?
s="srinivas"
k=s[::-1]
print(k)Output
savinirs
03. How do you write PySpark code to create a DataFrame from data and a schema, and then extract only the date from a DateTime column?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, TimestampType
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("Date extract").getOrCreate()

schema = StructType([
    StructField("date_column", StringType(), True)
])

# Sample data
data = [("2024-02-28 08:30:00",),
        ("2024-02-28 12:45:00",),
        ("2024-02-29 10:15:00",)]

df = spark.createDataFrame(data, schema)
df = df.withColumn("date_column", col("date_column").cast(TimestampType()))
df = df.withColumn("Date_only", to_date("date_column"))
df.show()

Output
+-------------------+----------+
| date_column| Date_only|
+-------------------+----------+
|2024-02-28 08:30:00|2024-02-28|
|2024-02-28 12:45:00|2024-02-28|
|2024-02-29 10:15:00|2024-02-29|
+-------------------+----------+

04. What is the “get root directory” in PySpark?
In PySpark, the “root directory” typically refers to the directory from which your Spark application is running, i.e., the base directory where your SparkContext or SparkSession was created.
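If the question is about the SparkFiles API specifically, SparkFiles.getRootDirectory() returns the root directory holding files that were distributed with SparkContext.addFile(). A minimal sketch (the CSV path is made up for illustration):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("root_dir").getOrCreate()

# Distribute a (hypothetical) local file to every executor
spark.sparkContext.addFile("/tmp/lookup.csv")

# Root directory where files added via addFile() are placed
print(SparkFiles.getRootDirectory())

# Full path to the distributed copy of the file on the current node
print(SparkFiles.get("lookup.csv"))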
05. Can we modify Tuples and Strings in Python?
We cannot modify tuples or strings in place. However, we can build a new tuple or a new string from pieces of the original and assign the result to a new name.
my_tuple = (1, 2, 3, 4)
new_tuple = my_tuple[0:1] + (7,) + my_tuple[2:]
print(new_tuple)

Output
(1, 7, 3, 4)
my_string = "Amazon Operations"
new_string="India " + my_string[0:]
print(new_string)Output
India Amazon Operations

