Our Blog
Welcome to our blog! We'll be sharing insights on data engineering, software development, and the latest trends in technology.
The Challenge of "Works on My Machine"
By Kalle Siukola | September 20, 2025
Every developer has faced it: the application works perfectly on their local machine, but when it's deployed to another environment, it breaks. This is often due to mismatched dependencies, operating system differences, or configuration issues. The solution to this problem is containerization, and the most popular tool for it is Docker.
A Docker image is a static, read-only template that contains your application and all its dependencies, from libraries and system tools to your own code. It's the blueprint. A container is a runnable instance of an image. You can have multiple containers running from the same image, each completely isolated from the others.
To build an image, you use a Dockerfile, a simple text file with a list of instructions. A basic Dockerfile might look like this:
# Start from a base image
FROM node:18-alpine
# Set the working directory
WORKDIR /app
# Copy the dependency manifests first so the install step is cached between builds
COPY package*.json ./
RUN npm install
# Copy the rest of the application code
COPY . .
# Expose a port
EXPOSE 3000
# Run the application
CMD ["node", "app.js"]
For security, it is critical to avoid running your application as the root user inside the container. The official Node images ship with a non-root user named node, so in the Dockerfile above it is enough to add a USER node instruction before the CMD. This practice, known as dropping privileges, limits the damage a compromised application can do and makes it much harder for a vulnerability to escalate from the container to the host machine.
Scaling and Orchestration with ECS and Fargate
Once your application is packaged in a container, you need a way to manage it at scale. Manually running and managing hundreds of containers is not feasible. This is where a container orchestrator comes in. On AWS, Elastic Container Service (ECS) is a fully managed service that simplifies the deployment, management, and scaling of containers.
ECS has several core components. A Task Definition describes your container, including its image and resource requirements. A Task is a running instance of that Task Definition. A Service is a long-running group of tasks: it keeps the desired number of tasks running and automatically replaces any that fail.
When you use ECS, you can choose between two launch types: EC2 or Fargate. With the EC2 launch type, you provision and manage the underlying virtual machines where your containers run. This offers fine-grained control but requires more operational overhead. Fargate, on the other hand, is a serverless compute engine for containers. You simply specify your CPU and memory requirements, and AWS handles the provisioning, scaling, and maintenance of the underlying infrastructure. It's the ideal choice for developers who want to focus on their applications, not the servers.
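To make these pieces concrete, here is a minimal sketch using the AWS CDK in TypeScript (the tool covered in the next section). It assumes CDK v2; the construct names, the resource sizes, and the my-app:latest image reference are placeholders rather than values from a real deployment.
import { App, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

const app = new App();
const stack = new Stack(app, 'EcsSketchStack');

// A cluster is a logical grouping of tasks; here it gets its own small VPC
const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

// Task Definition: the container image and its resource requirements
const taskDefinition = new ecs.FargateTaskDefinition(stack, 'TaskDef', {
  cpu: 256,
  memoryLimitMiB: 512,
});
taskDefinition.addContainer('web', {
  image: ecs.ContainerImage.fromRegistry('my-app:latest'), // placeholder image
  portMappings: [{ containerPort: 3000 }],
});

// Service: keeps the desired number of tasks running and replaces failed ones
new ecs.FargateService(stack, 'Service', {
  cluster,
  taskDefinition,
  desiredCount: 2,
});

app.synth();
Because the launch type here is Fargate, there is no EC2 capacity to define anywhere in this sketch.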
Defining Infrastructure with the AWS CDK
To provision and manage your cloud resources in a repeatable and efficient way, you use Infrastructure as Code (IaC). The AWS Cloud Development Kit (CDK) is a modern IaC framework that lets you define your cloud resources using familiar programming languages like TypeScript or Python.
With the CDK, you can define your entire ECS infrastructure, including clusters, services, load balancers, and security groups, in a single codebase. This is a significant improvement over manual configuration or even JSON/YAML-based templates, as it provides a higher level of abstraction and allows for code reusability. For example, a single CDK file can define an entire scalable, containerized application and its supporting infrastructure.
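For instance, here is a minimal TypeScript sketch along those lines, again assuming CDK v2 and placeholder names, with the Dockerfile shown earlier sitting at the project root. The higher-level ecs-patterns module collapses the networking, cluster, Fargate service, load balancer, and security groups into a single construct:
import { App, Stack } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

const app = new App();
const stack = new Stack(app, 'NodeAppStack');

// One construct provisions the VPC, cluster, Fargate service,
// application load balancer, and security groups
new ecs_patterns.ApplicationLoadBalancedFargateService(stack, 'WebService', {
  cpu: 256,
  memoryLimitMiB: 512,
  desiredCount: 2,
  taskImageOptions: {
    // Builds the Dockerfile shown earlier into an image asset
    image: ecs.ContainerImage.fromAsset('.'),
    containerPort: 3000,
  },
  publicLoadBalancer: true,
});

app.synth();
Running cdk deploy against a stack like this builds the container image, publishes it as a CDK asset, and provisions the rest of the infrastructure in one step.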
By combining containers, ECS (with Fargate), and the CDK, you achieve a deployment pipeline that is:
- Fast and Reliable: Developers can deploy applications more quickly and with fewer errors because the environments are consistent.
- Cost-Effective: With Fargate, you only pay for the resources your containers use, without the wasted cost of idle servers.
- Scalable: Your applications can automatically scale to handle increased traffic without manual intervention.
- Maintainable: Your infrastructure is managed like code, making it version-controlled, reusable, and much easier to maintain over time.
This approach makes deployment a simple, repeatable process, allowing your team to focus on building features and innovating instead of battling complex infrastructure. It is the gold standard for modern, cloud-native application deployment.
The principles of DevOps have revolutionized software development, but data teams have been slow to adopt them...
Why Data Teams Need DevOps
Data pipelines are often complex and brittle. Without proper version control and automated testing, a small change in one part of the pipeline can cause a cascade of failures. Adopting DevOps principles helps data teams manage their code and infrastructure with the same rigor as a software development team.
Key Practices
A key tool is dbt (data build tool), which allows you to define your data transformations as code. This means you can version control your data models, run tests to ensure data quality, and easily deploy changes using a CI/CD pipeline.
CI/CD (Continuous Integration/Continuous Deployment) pipelines automate the process of building, testing, and deploying your data models. This ensures that every change is tested and that only validated code makes it into production.
Bridging the gap between data engineering and DevOps is crucial for building scalable, reliable, and trustworthy data platforms. By treating data pipelines as software, teams can move faster and deliver higher-quality data products.