December 22, 2021
Data Engineering Skills
Even in the context of the tech industry, data engineering is a relatively new set of responsibilities. Companies recognize the value of promptly collected and well-managed data, so data engineering skills are in high demand. The number of available vacancies is snowballing, and the offered compensation is higher than ever. This demand, among other reasons, makes the role attractive to software engineers looking for a change of focus or simply for new challenges.
I was once in this position myself; now I am leading a team of data engineers. In the past 12 months, I have been hiring engineers for my team and other teams within the Data Engineering department. I have reviewed hundreds of CVs and LinkedIn profiles, had dozens of interviews with applicants and was lucky enough to find several excellent developers who are now my colleagues.
In this post I will outline a list of skills that will get you a data engineering position and help you succeed in this role. We will also discuss what defines data engineering and makes it different from broader software development and related fields of data analytics and data science.
Data Engineering vs Software Development
Data engineering is a specialized subset of software development. Data engineers use the same core skills: designing systems, writing code, testing, and deploying software. The distinction is that most data engineering projects focus on managing data: retrieving it, transforming it into the formats the business requires, storing it, and exposing it to interested parties. Many data engineers were software developers earlier in their careers; I am one example!
Data Engineering vs Data Analytics
The difference between these two roles used to be more pronounced. Data analysts would query the data prepared by engineers and build dashboards using UI-based tools. Recently the situation has changed: modern data analysts not only write complicated SQL queries but also use Python and R to build advanced reports and dashboards. One thing is still true, though: data analysts are your best friends if you work as a data engineer. Of all the teams in the company, you will probably talk to them the most.
Data Engineering vs Data Science and Machine Learning Engineering
These domains received much attention in recent years, both from businesses and from people interested in building relevant skillsets. Data scientists apply statistical methods of varying complexity to create prediction and classification models. Data scientists need access to datasets specially prepared for the task, and usually, a team of data engineers helps them by setting up these datasets. In some companies, this role is called Machine Learning engineer.
Data Engineering skillset
Data Warehousing
This is the original data skill, the one where it all began. A data warehouse is a specialized database configured to store large amounts of data and run analytical queries on subsets of this data. Data engineers are responsible for loading the data into the warehouse, setting up the correct schema, and, usually, helping users write efficient queries.
Typical tools: Snowflake, Databricks, Presto, Amazon Redshift, Google BigQuery, Apache Hive.
Concepts to learn: SQL, OLAP vs OLTP databases, star schema and snowflake schema.
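To make the star schema idea concrete, here is a minimal sketch in Python using SQLite as a stand-in for a real warehouse such as Snowflake or BigQuery. The table and column names are invented for illustration: a fact table of sales joined to a product dimension, queried in a typical OLAP style.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A tiny star schema: one fact table surrounded by a dimension table.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Keyboard", "Hardware"), (2, "Monitor", "Hardware"), (3, "License", "Software")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 30.0), (2, 2, 150.0), (3, 1, 30.0), (4, 3, 99.0)])

# A typical analytical query: aggregate facts, grouped by a dimension attribute.
rows = cur.execute("""
    SELECT p.category, SUM(s.amount) AS revenue
    FROM fact_sales s
    JOIN dim_product p ON p.product_id = s.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # [('Hardware', 210.0), ('Software', 99.0)]
```

In a snowflake schema, the `category` column would itself be normalized out into its own dimension table; the query pattern stays the same.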
Object Storage and Data Lake
Object storage allows engineers to store large volumes of data in various file formats. This way of storing data is usually not optimized for frequent random queries; its primary goal is to store the data reliably. Object storage is frequently used for backups and as a staging area where raw data resides before being loaded into a data warehouse. When object storage performs the latter role, it is called a data lake.
Common tools: Amazon S3, Google Cloud Storage, HDFS.
Serialization Formats
There are several file formats used for storing and sharing data. The most common formats data engineers use are: Parquet, Avro, CSV, JSON.
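Parquet and Avro require third-party libraries (such as pyarrow and fastavro), so as a sketch of the underlying trade-offs, here is the same set of records serialized with the standard library as CSV and JSON. The records are invented for illustration.

```python
import csv
import io
import json

records = [
    {"user_id": 1, "event": "login", "ts": "2021-12-01T10:00:00"},
    {"user_id": 2, "event": "purchase", "ts": "2021-12-01T10:05:00"},
]

# CSV: compact and row-oriented, but flat, and every value becomes text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "event", "ts"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing and supports nesting, but more verbose.
json_text = json.dumps(records)
round_tripped = json.loads(json_text)
print(round_tripped[0]["event"])  # login
```

Parquet and Avro build on these ideas with binary encoding, compression, and explicit schemas; Parquet is columnar (good for analytical scans), while Avro is row-oriented (good for record-at-a-time pipelines).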
Distributed Data Processing
Once the size of your datasets starts to grow beyond the capabilities of a single computer, you need tools that can distribute processing onto a cluster of servers. Standard tools: Apache Spark, Apache Beam, Google Dataflow, Hadoop.
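The core pattern these frameworks implement is partition, map, and reduce. Here is a toy single-machine sketch of that pattern, with threads standing in for cluster nodes; Spark and its peers generalize this to many servers, with fault tolerance and data shuffling on top.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

lines = [
    "data engineering is software engineering",
    "data pipelines move data",
    "engineering data at scale",
]

def count_words(partition):
    # The "map" step: each worker counts words in its partition independently.
    counter = Counter()
    for line in partition:
        counter.update(line.split())
    return counter

# Split the dataset into partitions, one per worker.
partitions = [lines[0:1], lines[1:2], lines[2:3]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, partitions))

# The "reduce" step: merge the partial results into the final answer.
total = sum(partial_counts, Counter())
print(total["data"])  # 4
```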
Streaming Data Processing
In modern tech companies engineers have to deal with data that arrives as a continuous flow. Businesses benefit from the reduced reaction time that streaming processing allows. Common tools: Apache Kafka, Apache Flink, Apache Spark, Apache Beam.
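A central concept in streaming systems is windowing: instead of aggregating one big batch, events are grouped into fixed time windows as they arrive. Here is a toy tumbling-window aggregation in plain Python; the timestamps and values are invented, and engines like Flink add watermarks, state management, and fault tolerance on top of this idea.

```python
from collections import defaultdict

events = [  # (timestamp_seconds, value), arriving as a continuous flow
    (1, 10), (4, 20), (7, 5), (12, 7), (14, 3), (21, 100),
]

WINDOW_SIZE = 10  # seconds per tumbling window

# Assign each event to the window containing its timestamp and aggregate.
windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    windows[window_start] += value

print(dict(windows))  # {0: 35, 10: 10, 20: 100}
```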
Cloud
Most companies these days choose to run workloads in the cloud. Common platforms: AWS, GCP, Azure.
Programming Languages
The most commonly used: Python (if you have to choose one, choose Python), Scala, Java.
Data Management
Common concepts: Privacy and security of business and personal data, data governance, data lineage, data quality and observability.
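Data quality in particular lends itself to automation: a pipeline can run checks and refuse to publish a dataset that fails them. Here is a minimal sketch with invented rules and records; in production, teams typically reach for frameworks such as Great Expectations or dbt tests.

```python
records = [
    {"user_id": 1, "email": "a@example.com", "age": 34},
    {"user_id": 2, "email": None, "age": 28},
    {"user_id": 3, "email": "c@example.com", "age": -5},
]

def run_quality_checks(rows):
    # Collect human-readable descriptions of every rule violation.
    issues = []
    seen_ids = set()
    for row in rows:
        if row["user_id"] in seen_ids:
            issues.append(f"duplicate user_id {row['user_id']}")
        seen_ids.add(row["user_id"])
        if row["email"] is None:
            issues.append(f"null email for user_id {row['user_id']}")
        if not (0 <= row["age"] <= 130):
            issues.append(f"age out of range for user_id {row['user_id']}")
    return issues

problems = run_quality_checks(records)
print(problems)
# ['null email for user_id 2', 'age out of range for user_id 3']
```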
This concludes the list of skills that form the foundation of data engineering. In smaller companies engineers are expected to cover most, if not all, items on this list. In larger companies engineers usually specialize.
You don’t have to know every tool I’ve listed above to land a job in data engineering, especially an entry-level one. But it helps to be familiar with the core concepts from all of these areas and to try one or two tools from each.
If you have questions or would like to ask for advice concerning a career in data engineering, please hit me up on Twitter @Heliocene or email me at theelderscripts@gmail.com.