
PySpark Interview Questions


01. What is the difference between WITH and SELECT?

The SELECT statement retrieves specific (or all) columns and rows from a table. The WITH clause, by contrast, defines a Common Table Expression (CTE), a named temporary result set with only the columns you need, which the query that follows can reference like a table instead of repeating the same subquery.
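
Since the post is PySpark-focused, here is a minimal sketch of a CTE written through Spark SQL; the view name, columns, and data are made up for illustration:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("cte_example").getOrCreate()

# Register a small, hypothetical table as a temporary view
spark.createDataFrame(
    [("A", 100), ("B", 250), ("A", 300)], ["region", "amount"]
).createOrReplaceTempView("sales")

# The WITH clause names an intermediate result (the CTE) that the
# final SELECT references like a table
result = spark.sql("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total FROM region_totals WHERE total > 200
""")
result.show()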

02. What’s the difference between a clustered index and a non-clustered index?

A clustered index determines the physical order in which the table's rows are stored, so lookups and range scans are fast because the data itself is kept in index order. A non-clustered index is a separate structure that holds the index keys along with pointers to the actual rows, so the physical storage of the data is independent of the index order.

03. How to write an SQL query to find the IDs that appear exactly once in a Table?

T1
===
id
===
1
2
1
1
3

SELECT ID FROM T1
GROUP BY ID
HAVING COUNT(ID) = 1;

Output
===
2
3

04. What’s star schema?

A star schema is a multi-dimensional relational database model built around a central fact table joined to denormalized dimension tables. Star schemas are used in data warehouses and data marts, and the schema is useful for querying large datasets.

05. What’s the snowflake schema?

It is also a multi-dimensional data model and an extension of the star schema: here, dimension tables are normalized and broken down into sub-dimensions. The schema is useful for OLAP workloads on relational databases.

06. How to create a Trigger and the use of it?

A trigger is a database object that fires automatically when a particular event occurs. In the example below, when we insert a new row into the “employees” table, the trigger sets the hire date to the system date for the newly inserted row. Triggers should be used judiciously; otherwise, they may lead to performance issues.

-- Creating a table for demonstration purposes
CREATE TABLE employees (
    employee_id   NUMBER PRIMARY KEY,
    employee_name VARCHAR2(50),
    hire_date     DATE
);

-- Creating a trigger
CREATE OR REPLACE TRIGGER trg_employee_insert
BEFORE INSERT ON employees
FOR EACH ROW
BEGIN
    -- Setting the hire_date to the current timestamp when a new record is inserted
    :NEW.hire_date := SYSDATE;
END trg_employee_insert;
/

07. What’s ACID compliance?

ACID (Atomicity, Consistency, Isolation, Durability) compliance means the database processes transactions reliably: each transaction completes fully or not at all, leaves the data in a consistent state, runs in isolation from concurrent transactions, and, once committed, survives failures without data loss.
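
As a small, hedged illustration of atomicity, here is a sketch using Python's built-in sqlite3 module with an in-memory database and a made-up accounts table:

import sqlite3

# In-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 100)")
conn.commit()

try:
    # Debit A as the first half of a transfer
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
    # Simulate a crash before the matching credit is applied
    raise RuntimeError("simulated failure")
except RuntimeError:
    # Atomicity: rolling back undoes the debit, so no partial transfer is visible
    conn.rollback()

print(conn.execute("SELECT * FROM accounts").fetchall())  # [('A', 100), ('B', 100)]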

Python Interview Questions 

08. Why is a tuple faster than a List?

A tuple is faster than a list because tuples are immutable: there is no resizing machinery, so operations on tuples are simpler than on lists, and creating a tuple is faster.

Also, a tuple is stored internally in a single block of memory. A list, on the other hand, is stored in two blocks: a fixed block holding the Python object's bookkeeping information and a variable block holding references to the actual data, which must be able to grow.
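
As a rough, hedged check, you can time construction of equivalent literals; exact numbers vary by machine and Python version, so treat them as illustrative only:

import timeit

# Build the same five elements a million times as a tuple and as a list.
# CPython can even cache the constant tuple, so the gap is often large.
tuple_time = timeit.timeit("(1, 2, 3, 4, 5)", number=1_000_000)
list_time = timeit.timeit("[1, 2, 3, 4, 5]", number=1_000_000)

print(f"tuple: {tuple_time:.3f}s  list: {list_time:.3f}s")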

09. Can we have a List in Tuple?

Yes, a tuple can contain a list, since tuples support heterogeneous data types. The tuple itself stays immutable, but the list inside it can still be modified in place.
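
A small illustration (the values are arbitrary):

# A tuple holding heterogeneous items, including a mutable list
t = (1, "two", [3, 4])

# The tuple's slots cannot be reassigned, but the nested list is still mutable
t[2].append(5)
print(t)  # (1, 'two', [3, 4, 5])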

10. What’s context manager in Python?

Context managers ensure resource cleanup and proper exception handling. With the “with” statement, __enter__() runs on entry and __exit__() is called automatically on exit, even when an exception is raised, which reduces the need for try-finally blocks.
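
A minimal sketch of a class-based context manager; the file name is just an example:

class ManagedFile:
    def __init__(self, path):
        self.path = path
        self.file = None

    def __enter__(self):
        # Acquire the resource when the with-block starts
        self.file = open(self.path, "w")
        return self.file

    def __exit__(self, exc_type, exc_value, traceback):
        # Called automatically, even if the with-block raised an exception
        if self.file:
            self.file.close()
        return False  # do not suppress exceptions


with ManagedFile("example.txt") as f:
    f.write("hello")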

11. Why are arrays faster? Can they hold elements of different data types?

Arrays are stored internally in contiguous memory, so accessing array elements is fast. Secondly, arrays are homogeneous: all elements must share the same (typically numeric) data type.
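
A short sketch with the standard-library array module:

from array import array

# 'i' declares a homogeneous array of signed integers; appending a float
# or a string would raise a TypeError
nums = array("i", [1, 2, 3, 4, 5])

print(nums[2])        # fast access into a contiguous block
print(nums.itemsize)  # bytes per element, fixed for the whole array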

PySpark Interview Questions

12. Can we write a PySpark query to pull the ids with count = 1?

In PySpark, to find the ids that appear exactly once in the table, you can write the query in the following way:

from pyspark.sql.functions import count, col
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()
# Sample data
data = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("D", 6)]
# Create a DataFrame
columns = ["id", "value"]
df = spark.createDataFrame(data, columns)
# Perform the aggregation and filter
result = df.groupBy("id") \
    .agg(count("id").alias("count_ids")) \
    .where(col("count_ids") == 1)
# Show the result
result.show(truncate=False)

13. Can we use SQL queries in PySpark, and how?

Yes, you can write SQL queries directly. For this, you first register the DataFrame as a temporary view with some name (here, “people”); then you can run SQL queries against that view. Here’s an example:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()
# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
# Create a DataFrame
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Create a temporary SQL table/view
df.createOrReplaceTempView("people")
# Execute SQL queries
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
# Show the result
result.show()

01. How do we count the positive, negative, and zero values in a Table using SQL and PySpark?

Here are the SQL queries and the PySpark equivalent that return the counts of positive, negative, and zero values.

Table1
-----
NUMBER1
-------
1
2
3
-1
-2
0
0
1

-- SQL Queries
SELECT COUNT(*) FROM TABLE1 WHERE NUMBER1 < 0;
SELECT COUNT(*) FROM TABLE1 WHERE NUMBER1 > 0;
SELECT COUNT(*) FROM TABLE1 WHERE NUMBER1 = 0;

-- PySpark queries
# Assuming df is your DataFrame representing TABLE1

# Perform the counts
count_negative = df.filter(df['NUMBER1'] < 0).count()
count_positive = df.filter(df['NUMBER1'] > 0).count()
count_zero = df.filter(df['NUMBER1'] == 0).count()

# Print the counts
print("Count of negative numbers:", count_negative)
print("Count of positive numbers:", count_positive)
print("Count of zeros:", count_zero)

02. How do we remove list duplicates in Python?

# 01. Using the set built-in (does not preserve the original order)
my_list = [1, 2, 3, 1, 2, 3, 5, 6]
output = list(set(my_list))
print(output)

# 02.using for loop
my_list = [1, 2, 2, 3, 4, 4, 5]

unique_elements = []
for item in my_list:
    if item not in unique_elements:
        unique_elements.append(item)

# 03. Using collections.OrderedDict (preserves the original order)
from collections import OrderedDict

my_list = [1, 2, 2, 3, 4, 4, 5]
unique_elements = list(OrderedDict.fromkeys(my_list))

# 04. Using NumPy unique (returns a sorted NumPy array)
import numpy as np

my_list = [1, 2, 2, 3, 4, 4, 5]
unique_elements = np.unique(my_list)

03. How do we join two data frames in PySpark?

from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("DataFrameJoinExample") \
    .getOrCreate()

# Assuming df1 and df2 are your DataFrames.
# Joining df1 and df2 on a common column. Joining on a column expression
# keeps both copies of common_column; passing the column name itself,
# e.g. df1.join(df2, "common_column", "inner"), keeps only one.
joined_df = df1.join(df2, df1['common_column'] == df2['common_column'], 'inner')

# Show the joined DataFrame
joined_df.show()

04. How do we get the top n rows from the joined dataframe?

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each department by salary, highest first
windowDept = Window.partitionBy("department").orderBy(col("salary").desc())

# Keep the top n rows per department (df and n are assumed to be defined)
df.withColumn("row", row_number().over(windowDept)) \
  .filter(col("row") <= n) \
  .drop("row") \
  .show()

05. How do you decide whether a SQL query or a Spark transformation is narrow or wide?

Narrow transformations: Operations like filter and adding a column with withColumn can be performed within a single RDD/DataFrame partition. They do not need to shuffle data across partitions, so no data moves between executor or worker nodes, which makes them relatively cheap.

Wide transformations: These are operations that need to shuffle data across partitions, meaning data must be moved between executor or worker nodes. Examples of wide transformations in Spark include joins, repartitioning, and groupBy. A small sketch contrasting the two follows below.
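
A minimal sketch (the column names and data are made up) showing one narrow and one wide transformation; the physical plan of the wide one contains a shuffle (Exchange) while the narrow one does not:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("narrow_vs_wide").getOrCreate()
df = spark.createDataFrame([("A", 10), ("B", 20), ("A", 30)], ["key", "amount"])

# Narrow: filter and withColumn work partition by partition, no shuffle
narrow = df.filter(col("amount") > 15).withColumn("doubled", col("amount") * 2)

# Wide: groupBy must bring rows with the same key together, so Spark shuffles
wide = df.groupBy("key").sum("amount")

# Compare the physical plans: only the wide plan shows an Exchange operator
narrow.explain()
wide.explain()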


Q.1). How do we transform the data below into the output that follows, using PySpark?

1|A, B, C, D, E
2|E, F, G

As
1A
1B
1C
1D
1E
2E
2F
2G

Solution

%python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, concat
from pyspark.sql.types import StringType, StructType, StructField
 
# Create a Spark session
spark = SparkSession.builder.appName("data_transformation").getOrCreate()
 
# Given data
data = ["1|A, B, C, D, E", "2|E, F, G"]
 
# Define the schema
schema = StructType([StructField("column", StringType(), True)])
 
# Convert data to a DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd.map(lambda x: (x,)), schema)
 
df.show()
 
# Split the data based on the pipe (|) and explode the array of values
df_transformed = (
    df.withColumn("id", split("column", "\\|")[0])
    .withColumn("values", split("column", "\\|")[1])
    .withColumn("value", explode(split("values", ", ")))
    .selectExpr("id", "trim(value) as value")
)
 
# Show the transformed data
df_concat = df_transformed.select(concat("id", "value").alias("Concated"))
df_concat.show()

Output

+---------------+
|         column|
+---------------+
|1|A, B, C, D, E|
|      2|E, F, G|
+---------------+
 
+--------+
|Concated|
+--------+
|      1A|
|      1B|
|      1C|
|      1D|
|      1E|
|      2E|
|      2F|
|      2G|
+--------+

Q.2). How many rows are displayed in an inner join?

Table T1
1
1
1
2
null

Table T2
1
1
null
null
2

Output

The answer is 7 rows: the value 1 matches 3 × 2 = 6 times, the value 2 matches 1 × 1 = 1 time, and NULLs never satisfy an equality join condition.
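
A quick PySpark sketch to verify the count, with the NULLs modelled as None:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.appName("inner_join_count").getOrCreate()
schema = StructType([StructField("id", IntegerType(), True)])

t1 = spark.createDataFrame([(1,), (1,), (1,), (2,), (None,)], schema)
t2 = spark.createDataFrame([(1,), (1,), (None,), (None,), (2,)], schema)

# NULL = NULL is not true in SQL, so the None rows never match
joined = t1.join(t2, t1["id"] == t2["id"], "inner")
print(joined.count())  # 7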

Q.3). How many rows will be displayed with a UNION of T1 and T2?

Output

Only 3 rows: UNION removes duplicates, so the distinct values 1, 2, and NULL remain.

ID
--
1
2
(null)

3 rows selected.
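
A hedged PySpark equivalent; note that DataFrame.union() behaves like SQL UNION ALL, so distinct() is needed to match UNION semantics:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.appName("union_example").getOrCreate()
schema = StructType([StructField("id", IntegerType(), True)])

t1 = spark.createDataFrame([(1,), (1,), (1,), (2,), (None,)], schema)
t2 = spark.createDataFrame([(1,), (1,), (None,), (None,), (2,)], schema)

# union() keeps duplicates; distinct() collapses them like SQL UNION
result = t1.union(t2).distinct()
print(result.count())  # 3  (1, 2 and NULL)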

Q.1) What’s the role of a leader in data engineering?

A Data Engineer in a leadership role manages the design, development, and maintenance of robust and scalable data systems. The position requires technical skills, project management, and effective communication to ensure data projects align with organizational goals.


01. How do you sum all the digits of an integer?

# Sum the digits of an integer
a = 12345

d = 0
for i in str(a):
    d += int(i)
print(d)

Output

15


02. How to reverse a string?

s="srinivas"
k=s[::-1]
print(k)

Output

savinirs


03. How do you write PySpark code to create a DataFrame from data and a schema, and extract only the date from a datetime column?

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, TimestampType
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("Date extract").getOrCreate()

schema = StructType([StructField("date_column", StringType(), True)])

# Sample data
data = [("2024-02-28 08:30:00",),
        ("2024-02-28 12:45:00",),
        ("2024-02-29 10:15:00",)]

df = spark.createDataFrame(data, schema)
df = df.withColumn("date_column", col("date_column").cast(TimestampType()))
df = df.withColumn("Date_only", to_date("date_column"))
df.show()

Output

+-------------------+----------+
|        date_column| Date_only|
+-------------------+----------+
|2024-02-28 08:30:00|2024-02-28|
|2024-02-28 12:45:00|2024-02-28|
|2024-02-29 10:15:00|2024-02-29|
+-------------------+----------+

04. What is the “get root directory” in PySpark?

In PySpark, the “root directory” typically refers to the working directory from which your Spark application is launched, i.e., the base directory in which the SparkContext or SparkSession is created; relative file paths are resolved against it.
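
A small, hedged way to inspect it from inside a running application, using plain Python plus the environment:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("root_dir_check").getOrCreate()

# The process working directory; relative input/output paths resolve here
print(os.getcwd())

# Where Spark itself is installed, if the environment provides it
print(os.environ.get("SPARK_HOME"))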

05. Can we modify Tuples and Strings in Python?

We cannot modify tuples and strings in place because they are immutable. However, we can build a new tuple or a new string from the existing one and assign it to a variable.

my_tuple=(1,2,3,4)
new_tuple=my_tuple[0:1] + (7,) + my_tuple[2:]
print(new_tuple)

Output

(1, 7, 3, 4)

my_string="Amazon Operations"
new_string="India " + my_string[0:]
print(new_string)

Output

India Amazon Operations
