1. Describe the relationship between the data lakehouse and the data warehouse.
A data lakehouse combines the scalability and flexibility of a data lake with the data management and ACID transaction support of a data warehouse. It allows both structured and unstructured data to coexist while supporting BI and ML workloads with strong governance.
2. Identify the improvement in data quality in the data lakehouse over the data lake.
Data lakehouses improve quality by supporting ACID transactions, schema enforcement, and data governance, which traditional data lakes lack. This leads to more reliable, consistent, and queryable data.
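The effect of schema enforcement can be shown in miniature without a Spark cluster. The sketch below is a plain-Python analogy, not the Delta Lake API: `SCHEMA` and `enforce_schema` are invented here to illustrate how a write that doesn't match the declared schema gets rejected instead of silently landing in the table.

```python
# Toy analogy for schema enforcement: reject records whose columns or
# types don't match the declared schema, as a Delta table would on write.
SCHEMA = {"id": int, "event": str, "amount": float}

def enforce_schema(record: dict) -> dict:
    """Return the record if it matches SCHEMA, else raise ValueError."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"Unexpected columns: {set(record) ^ set(SCHEMA)}")
    for col, typ in SCHEMA.items():
        if not isinstance(record[col], typ):
            raise ValueError(f"Column {col!r} expects {typ.__name__}")
    return record

good = enforce_schema({"id": 1, "event": "click", "amount": 0.99})

try:
    enforce_schema({"id": "oops", "event": "click", "amount": 0.99})
except ValueError as e:
    rejected = str(e)  # bad write is refused, keeping the table consistent
```

In a real lakehouse this check (plus ACID transactions) is what keeps half-written or malformed data out of downstream queries.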
3. Compare and contrast silver and gold tables, and which workloads use bronze vs. gold tables.
| Table Layer | Purpose | Used In Workloads |
|---|---|---|
| Bronze | Raw ingestion layer; stores unprocessed data. | Data ingestion, archiving, and audit-trail workloads. |
| Silver | Cleansed and enriched data; applies transformations. | Analytics and operational reporting. |
| Gold | Business-level aggregates; optimized for BI. | Dashboards, ML models, and executive reports. |
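The three layers can be sketched as successive transformations over the same data. This is a toy pure-Python illustration (the names `to_silver` and `to_gold` are invented; a real pipeline would use Spark and Delta tables):

```python
# Toy medallion pipeline: bronze (raw) -> silver (cleaned) -> gold (aggregated).
bronze = [  # raw ingestion: duplicates and malformed rows land here as-is
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 1, "region": "EU", "amount": 10.0},   # duplicate
    {"order_id": 2, "region": "US", "amount": None},   # malformed
    {"order_id": 3, "region": "US", "amount": 5.0},
]

def to_silver(rows):
    """Cleanse: deduplicate on order_id and drop rows missing an amount."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is not None and r["order_id"] not in seen:
            seen.add(r["order_id"])
            out.append(r)
    return out

def to_gold(rows):
    """Business-level aggregate: total revenue per region, ready for BI."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)  # {"EU": 10.0, "US": 5.0}
```

Note how bronze keeps everything (supporting audit and replay), while each later layer narrows the data toward a specific business question.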
4. Identify elements of the Databricks Platform Architecture (data plane vs. control plane).
| Plane | Description | Location |
|---|---|---|
| Control Plane | Manages metadata, notebooks, jobs, cluster configs, etc. | Databricks-managed cloud account. |
| Data Plane | Runs user code and processes data; contains customer data. | Customer’s cloud account. |
5. Differentiate between all-purpose clusters and jobs clusters.
| Cluster Type | Purpose | Use Case |
|---|---|---|
| All-purpose | Interactive and multi-user collaboration. | Ad hoc analysis, notebooks. |
| Jobs | Optimized for automated, scheduled workloads. | Production jobs, ETL pipelines. |
6. Identify how cluster software is versioned using the Databricks Runtime.
Cluster software is versioned using Databricks Runtime versions (e.g., 11.3 LTS, 13.0). Each version includes specific Spark versions, libraries, and optimizations.
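The version string itself encodes the release line and support tier. A small sketch of reading one apart (`parse_runtime` is invented here for illustration; it is not a Databricks API):

```python
def parse_runtime(version: str) -> dict:
    """Split a Databricks Runtime version string such as '13.3 LTS'
    into its numeric release and optional long-term-support marker."""
    parts = version.split()
    major, minor = (int(x) for x in parts[0].split("."))
    return {"major": major, "minor": minor, "lts": "LTS" in parts[1:]}

parse_runtime("13.3 LTS")  # {'major': 13, 'minor': 3, 'lts': True}
parse_runtime("14.0")      # {'major': 14, 'minor': 0, 'lts': False}
```

LTS releases receive extended support, so production jobs clusters are typically pinned to an LTS runtime.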
7. Identify how clusters can be filtered to view those accessible by the user.
In the Compute (Clusters) UI, users can apply list filters (for example, to clusters they created or clusters they can access) so only clusters visible under their permissions are shown.
8. Describe how clusters are terminated and the impact.
Clusters can be terminated manually or automatically (idle timeout). Termination releases compute resources, but session state and in-memory data are lost unless persisted.
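The impact on unsaved state can be shown in miniature: anything held only in memory disappears with the cluster, while data written out survives. A toy sketch, using a temp file to stand in for cloud storage:

```python
import json
import os
import tempfile

cache = {"rows_processed": 1000}  # in-memory session state on the cluster

# Persist results to "storage" before the cluster goes away.
path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as f:
    json.dump(cache, f)

del cache  # termination: all in-memory state is gone

with open(path) as f:  # the persisted copy survives termination
    recovered = json.load(f)
```

This is why long-running results should be written to tables or files rather than kept only in notebook variables.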
9. Identify a scenario in which restarting the cluster will be useful.
Restarting a cluster is useful if:
- Libraries are updated
- Environment variables or cluster configs change
- Memory leaks or performance issues occur
10. Describe how to use multiple languages within the same notebook.
Use language magic commands at the start of a cell:

`%python`, `%sql`, `%scala`, `%r` (plus `%md` for Markdown and `%sh` for shell commands)

Each cell runs in the context of the specified language; cells without a magic command use the notebook’s default language.
11. Identify how to run one notebook from within another notebook.
Use the `%run` magic command:

`%run ./path/to/another_notebook`

This runs the target notebook inline, so its variables and functions become available in the calling notebook.
12. Identify how notebooks can be shared with others.
Notebooks can be shared by:
- Sharing links
- Setting workspace permissions
- Adding collaborators with view/edit rights
13. Describe how Databricks Repos enables CI/CD workflows in Databricks.
Repos integrate with Git providers (GitHub, GitLab, etc.), enabling source control, branch management, pull requests, and automated CI/CD pipelines directly within Databricks.
14. Identify Git operations available via Databricks Repos.
Available Git operations include:
- Clone repo
- Pull, commit, push changes
- Create and switch branches
- View commit history
- Resolve merge conflicts
15. Identify limitations in Databricks Notebooks version control functionality relative to Repos.
Databricks Notebooks’ built-in versioning:
- Is automatic but limited
- Lacks branching, merging, and pull requests
- Is less suitable for collaboration than full Git-based Repos

