3 Steps to Set Up a Data Engineering Project

Igor Comune
4 min read · Apr 15, 2024

How to set up a standard data engineering project in 3 simple steps using Cookiecutter.

Made with ideogram.ai. Prompt: “background modern technology data engineering”

Intro

Setting up a standard project framework for data engineering is paramount in ensuring efficiency, scalability, and maintainability in data-driven endeavors. A standardized project structure establishes consistency across projects, facilitating seamless collaboration among team members and streamlining the development process. It enables data engineers to adhere to best practices in data ingestion, processing, storage, and analysis, ensuring data integrity and reliability. Moreover, a standardized framework simplifies troubleshooting and debugging, as it provides a clear blueprint for project organization and workflow. By implementing consistent standards, data engineers can accelerate project delivery, reduce errors, and enhance the overall quality of data solutions, ultimately driving informed decision-making and unlocking the full potential of data assets.

Without standards, there can be no improvement - Taiichi Ohno

Index

  1. Tools
  2. How To

Tools

  1. VS Code
  2. Git and GitHub Desktop
  3. Anaconda
  4. Cookiecutter

How To

I’ll be pretty straightforward in this post. I won’t show how to install these tools, because there are hundreds of tutorials all over the web that teach it. So, let’s go!

1. Create a local repo

Open GitHub Desktop and create a repo wherever you want.

Open Anaconda and launch the CMD.exe Prompt.

In the prompt, run pip install cookiecutter
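
If you want to confirm that the install landed in the environment you have active, a quick check from Python can help. This is only a small sketch; the helper name is_installed is my own and not part of cookiecutter.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "cookiecutter" is the import name of the package installed with pip
    if is_installed("cookiecutter"):
        print("cookiecutter is available")
    else:
        print("cookiecutter not found; run: pip install cookiecutter")
```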

2. Set Up Cookiecutter

Open VS Code in the folder you just created, or type code . in that folder's address bar.

Open the terminal and type: cookiecutter https://github.com/drivendata/cookiecutter-data-science

Type Y and press ENTER.

Answer the questions however you like, pressing ENTER after each one.

And that’s it. The project is ready.
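
If you would rather script this step than type it, the same command can be launched from Python. The sketch below assumes cookiecutter is installed in the interpreter that runs it; the function names and the TEMPLATE_URL constant are my own, and --no-input is the cookiecutter option that accepts the template defaults instead of asking questions.

```python
import subprocess
import sys

# Template used in this post
TEMPLATE_URL = "https://github.com/drivendata/cookiecutter-data-science"


def build_cookiecutter_command(template_url: str, no_input: bool = False) -> list[str]:
    """Build the command that runs cookiecutter with the current interpreter."""
    command = [sys.executable, "-m", "cookiecutter", template_url]
    if no_input:
        # Accept the template defaults instead of prompting
        command.append("--no-input")
    return command


def generate_project(template_url: str = TEMPLATE_URL, no_input: bool = False) -> None:
    """Run cookiecutter in the current working directory."""
    subprocess.run(build_cookiecutter_command(template_url, no_input), check=True)
```

Calling generate_project() from inside the repo folder should behave like typing the command in the terminal, prompts included.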

3. Set Up a Python Virtual Environment

I always set up a Python venv inside the project folder to make things easier.

To do that, I wrote a Python script called create_venv.py.

import sys
import subprocess
from pathlib import Path


def create_virtual_environment(venv_path):
    """Create a virtual environment at venv_path using the current interpreter."""
    try:
        # check=True raises an error if the venv command fails,
        # so the success message is only printed when it really worked
        subprocess.run([sys.executable, '-m', 'venv', str(venv_path)], check=True)
        print(f"Virtual environment created successfully at {venv_path}")
    except Exception as e:
        print(f"Error creating virtual environment: {e}")


if __name__ == "__main__":
    # Get the desired virtual environment name
    venv_name = input("Enter the name for the virtual environment: ")
    venv_path = Path.cwd() / venv_name

    # Check if the virtual environment already exists
    if venv_path.exists():
        print(f"Virtual environment '{venv_name}' already exists. Exiting.")
    else:
        # Create the virtual environment
        create_virtual_environment(venv_path)


# Commands to run in the terminal afterwards (Windows):
# venv\Scripts\activate
# python -m pip install --upgrade pip
# pip freeze > requirements.txt
# pip install -r requirements.txt
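
As a side note, Python's standard library also exposes the venv module directly, so the subprocess call is not strictly needed. Here is a minimal sketch of that alternative; the function name create_venv and its return value are my own choices, not something the script above uses.

```python
import venv
from pathlib import Path


def create_venv(venv_path: Path, with_pip: bool = True) -> bool:
    """Create a virtual environment at venv_path.

    Returns False without touching anything if the path already exists.
    """
    if venv_path.exists():
        return False
    # venv.create is the standard-library equivalent of "python -m venv <path>"
    venv.create(venv_path, with_pip=with_pip)
    return True
```

For example, create_venv(Path.cwd() / "venv") would create the same venv folder as the script, and calling it a second time would simply return False.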

Move create_venv.py into the project folder.

In the terminal, type cd project_name (in my case, cd medium_tutorial).

Open create_venv.py and run it.

It will ask you for a name for the venv; I always call it venv. Afterwards, you can delete create_venv.py if you want.

Now, in the same terminal you used to create the venv, type venv\Scripts\activate to activate it. Then type python -m pip install --upgrade pip to upgrade pip. Whenever you want to save the requirements, type pip freeze > requirements.txt.
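
The commands above are for Windows; on macOS or Linux the venv's executables live in venv/bin instead of venv\Scripts. If you prefer to run the pip steps from Python without activating anything, a rough sketch could look like this. All function names here are hypothetical helpers of mine, and the code assumes the venv was created with pip available.

```python
import os
import subprocess
from pathlib import Path


def venv_python(venv_path: Path) -> Path:
    """Return the Python executable inside a virtual environment."""
    if os.name == "nt":
        # Windows layout: venv\Scripts\python.exe
        return venv_path / "Scripts" / "python.exe"
    # macOS/Linux layout: venv/bin/python
    return venv_path / "bin" / "python"


def upgrade_pip(venv_path: Path) -> None:
    """Same effect as running: python -m pip install --upgrade pip"""
    subprocess.run(
        [str(venv_python(venv_path)), "-m", "pip", "install", "--upgrade", "pip"],
        check=True,
    )


def save_requirements(venv_path: Path, output: Path = Path("requirements.txt")) -> None:
    """Same effect as running: pip freeze > requirements.txt"""
    result = subprocess.run(
        [str(venv_python(venv_path)), "-m", "pip", "freeze"],
        check=True,
        capture_output=True,
        text=True,
    )
    output.write_text(result.stdout)
```

Calling the venv's own interpreter like this installs into and reads from that venv even when it is not activated in the terminal.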

The .gitignore file already comes configured.

Now you can access the documentation and learn how to use it:

Home — Cookiecutter Data Science (drivendata.github.io)

If you want to see the results:

medium_data_engineer_tutorial/medium_tutorial at main · IgorComune/medium_data_engineer_tutorial (github.com)

If I helped you, don’t forget to comment, share and clap.

Igor Comune | LinkedIn

Igor Comune

An "under construction" Data Scientist with a love for numbers and analysis!