1. Describe the relationship between the data lakehouse and the data warehouse.
A data lakehouse combines the scalability and flexibility of a data lake with the data management and ACID transaction support of a data warehouse. It allows both structured and unstructured data to coexist while supporting BI and ML workloads with strong governance.
2. Identify the improvement in data quality in the data lakehouse over the data lake.
Data lakehouses improve quality by supporting ACID transactions, schema enforcement, and data governance, which traditional data lakes lack. This leads to more reliable, consistent, and queryable data.
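The effect of schema enforcement can be shown in miniature without a Spark cluster. The sketch below is a plain-Python analogy, not the Delta Lake API: `SCHEMA` and `enforce_schema` are invented here to illustrate how a write that doesn't match the declared schema gets rejected instead of silently landing in the table.

```python
# Toy analogy for schema enforcement: reject records whose columns or
# types don't match the declared schema, as a Delta table would on write.
SCHEMA = {"id": int, "event": str, "amount": float}

def enforce_schema(record: dict) -> dict:
    """Return the record if it matches SCHEMA, else raise ValueError."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"Unexpected columns: {set(record) ^ set(SCHEMA)}")
    for col, typ in SCHEMA.items():
        if not isinstance(record[col], typ):
            raise ValueError(f"Column {col!r} expects {typ.__name__}")
    return record

good = enforce_schema({"id": 1, "event": "click", "amount": 0.99})

try:
    enforce_schema({"id": "oops", "event": "click", "amount": 0.99})
except ValueError as e:
    rejected = str(e)  # bad write is refused, keeping the table consistent
```

In a real lakehouse this check (plus ACID transactions) is what keeps half-written or malformed data out of downstream queries.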
3. Compare and contrast silver and gold tables, and which workloads use bronze vs. gold tables.
| Table Layer | Purpose | Used In Workloads |
|---|---|---|
| Bronze | Raw ingestion layer; stores unprocessed data. | Data ingestion, archiving, and audit-trail workloads. |
| Silver | Cleansed and enriched data; applies transformations. | Analytics and operational reporting. |
| Gold | Business-level aggregates; optimized for BI. | Dashboards, ML models, and executive reports. |
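The three layers can be sketched as successive transformations over the same data. This is a toy pure-Python illustration (the names `to_silver` and `to_gold` are invented; a real pipeline would use Spark and Delta tables):

```python
# Toy medallion pipeline: bronze (raw) -> silver (cleaned) -> gold (aggregated).
bronze = [  # raw ingestion: duplicates and malformed rows land here as-is
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 1, "region": "EU", "amount": 10.0},   # duplicate
    {"order_id": 2, "region": "US", "amount": None},   # malformed
    {"order_id": 3, "region": "US", "amount": 5.0},
]

def to_silver(rows):
    """Cleanse: deduplicate on order_id and drop rows missing an amount."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is not None and r["order_id"] not in seen:
            seen.add(r["order_id"])
            out.append(r)
    return out

def to_gold(rows):
    """Business-level aggregate: total revenue per region, ready for BI."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)  # {"EU": 10.0, "US": 5.0}
```

Note how bronze keeps everything (supporting audit and replay), while each later layer narrows the data toward a specific business question.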
4. Identify elements of the Databricks Platform Architecture (data plane vs. control plane).
| Plane | Description | Location |
|---|---|---|
| Control Plane | Manages metadata, notebooks, jobs, cluster configs, etc. | Databricks-managed cloud account. |
| Data Plane | Runs user code and processes data; contains customer data. | Customer’s cloud account. |
5. Differentiate between all-purpose clusters and jobs clusters.
| Cluster Type | Purpose | Use Case |
|---|---|---|
| All-purpose | Interactive and multi-user collaboration. | Ad hoc analysis, notebooks. |
| Jobs | Optimized for automated, scheduled workloads. | Production jobs, ETL pipelines. |
6. Identify how cluster software is versioned using the Databricks Runtime.
Cluster software is versioned using Databricks Runtime versions (e.g., 11.3 LTS, 13.0). Each version includes specific Spark versions, libraries, and optimizations.
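The version string itself encodes the release line and support tier. A small sketch of reading one apart (`parse_runtime` is invented here for illustration; it is not a Databricks API):

```python
def parse_runtime(version: str) -> dict:
    """Split a Databricks Runtime version string such as '13.3 LTS'
    into its numeric release and optional long-term-support marker."""
    parts = version.split()
    major, minor = (int(x) for x in parts[0].split("."))
    return {"major": major, "minor": minor, "lts": "LTS" in parts[1:]}

parse_runtime("13.3 LTS")  # {'major': 13, 'minor': 3, 'lts': True}
parse_runtime("14.0")      # {'major': 14, 'minor': 0, 'lts': False}
```

LTS releases receive extended support, so production jobs clusters are typically pinned to an LTS runtime.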
7. Identify how clusters can be filtered to view those accessible by the user.
In the Compute (Clusters) UI, users can apply list filters (for example, to clusters they created or clusters they can access) so only clusters visible under their permissions are shown.
8. Describe how clusters are terminated and the impact.
Clusters can be terminated manually or automatically (idle timeout). Termination releases compute resources, but session state and in-memory data are lost unless persisted.
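The impact on unsaved state can be shown in miniature: anything held only in memory disappears with the cluster, while data written out survives. A toy sketch, using a temp file to stand in for cloud storage:

```python
import json
import os
import tempfile

cache = {"rows_processed": 1000}  # in-memory session state on the cluster

# Persist results to "storage" before the cluster goes away.
path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as f:
    json.dump(cache, f)

del cache  # termination: all in-memory state is gone

with open(path) as f:  # the persisted copy survives termination
    recovered = json.load(f)
```

This is why long-running results should be written to tables or files rather than kept only in notebook variables.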
9. Identify a scenario in which restarting the cluster will be useful.
Restarting a cluster is useful if:
- Libraries are updated
- Environment variables or cluster configs change
- Memory leaks or performance issues occur
10. Describe how to use multiple languages within the same notebook.
Use language magic commands at the start of a cell:

`%python`, `%sql`, `%scala`, `%r` (plus `%md` for Markdown and `%sh` for shell commands)

Each cell runs in the context of the specified language; cells without a magic command use the notebook’s default language.
11. Identify how to run one notebook from within another notebook.
Use the `%run` magic command:

`%run ./path/to/another_notebook`

This runs the target notebook inline, so its variables and functions become available in the calling notebook.
12. Identify how notebooks can be shared with others.
Notebooks can be shared by:
- Sharing links
- Setting workspace permissions
- Adding collaborators with view/edit rights
13. Describe how Databricks Repos enables CI/CD workflows in Databricks.
Repos integrate with Git providers (GitHub, GitLab, etc.), enabling source control, branch management, pull requests, and automated CI/CD pipelines directly within Databricks.
14. Identify Git operations available via Databricks Repos.
Available Git operations include:
- Clone repo
- Pull, commit, push changes
- Create and switch branches
- View commit history
- Resolve merge conflicts
15. Identify limitations in Databricks Notebooks version control functionality relative to Repos.
Databricks Notebooks’ built-in versioning:
- Is automatic but limited
- Lacks branching, merging, and pull requests
- Is less suitable for collaboration than full Git-based Repos

