As any data practitioner in data science, analytics, or engineering will tell you, the industry is advancing at a lightning pace, with algorithmic development leading the charge in what is possible in the digital era. If anything, the COVID pandemic accelerated that pace as business migrated to the digital space, and Databricks is one of many hyper-scaling vendors riding the crest of this data industry boom. As an Infrastructure and DevOps Engineer, I can tell you that Databricks is in an exciting phase of growth, maturing at different rates depending on where in the stack you work.
Databricks on Azure is more mature than Databricks on AWS, thanks to the focus and investment Microsoft has put into its integration with the platform. That said, Databricks on AWS was of more interest to me given its capabilities, which led me to the Databricks Lakehouse Essentials accreditation upon completing the Databricks Platform Administrator course via Databricks Academy. I recommend it to anyone entering the data industry, noting that the content is basic but solid. The delivery, however, is somewhat rough, which I am sure will improve over time given how newly minted the course is.

The nice part was its very wide breadth, covering all things Databricks. It starts with the cost of querying a traditional data warehouse versus the technical drawbacks of the cheaper data lake, which can easily turn into a data swamp through common mistakes around hierarchies and indexing. The management aspects of the Databricks abstraction, centred on Delta Lake, featured strongly, alongside the impressive cluster-related features that integrate well with AWS. Databricks is deployable via the account user interface, the API, or the CLI, and a range of machine learning features and libraries are nicely spun into the managed service. These are just some of the areas the platform administration course touches on briefly; the deeper dives live in other courses at Databricks Academy. The governance and data integrity capabilities Delta Lake adds over the data lake were covered in somewhat more detail, which is understandable given the problems that single layer of technology solves in the platform.
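To make the API route concrete, here is a minimal sketch, assuming only Python's standard library, that calls the Databricks Clusters API's `clusters/list` endpoint. The workspace host and personal access token shown in comments are placeholders you would supply from your own account:

```python
import json
import urllib.request


def build_list_clusters_request(host: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the Clusters API (2.0).

    host  -- your workspace URL, e.g. "https://dbc-xxxx.cloud.databricks.com"
             (placeholder -- substitute your own workspace)
    token -- a Databricks personal access token
    """
    return urllib.request.Request(
        f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )


def list_clusters(host: str, token: str) -> dict:
    """Call the endpoint and return the decoded JSON payload."""
    with urllib.request.urlopen(build_list_clusters_request(host, token)) as resp:
        return json.load(resp)
```

The CLI's `databricks clusters list` command exercises the same REST surface, so whichever route you pick, you land on the same underlying API.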
As someone who has worked with Databricks in the past, I can tell you it is a versatile ETL tool that uses Parquet as its default storage format and integrates nicely with the likes of TensorFlow, MLflow, Tableau, Power BI, and other key tools in the data industry. It also works well with many cloud IdPs for SSO, and supports SCIM provisioning with Azure AD, for example. If you are in the industry and looking for a data lakehouse ETL/analytics solution with ACID guarantees, I would advise stopping by Databricks and checking out what they have to offer. Stay tuned for more on infrastructure in this blog, along with articles on other areas of interest in the writing and DevOps arenas. To not miss out on updates on my availability, tips on related areas, or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!