The skills a data engineer should learn — in order. From Data Modeling and SQL, through OLAP systems, dbt, and data formats, to processing engines, orchestration, Kafka, and stream processing. A practical roadmap built from real experience.
🎓 Student or 🇻🇳 Vietnamese user? Only $5 a month ↓
From junior engineers to veterans — here's why they subscribe.
"You have an amazing Substack, full of in-depth knowledge. I've been doing data for decades, and I still learn new things from you. Keep up the good work."
"I've been reading your writings for a while, and I really appreciate the level of depth you are exploring. On top of that the illustrations are top notch and support a visual understanding of the tech behind."
"We're building out our company's first data lakehouse and we're a team of mostly juniors. Your writing and insight has so far been invaluable, and I hope we can use what we learn from your articles to steward a better culture of data at the org!"
"I love your articles about DE that go deep, like I've never seen before from other people."
"You break things down to the level that I find easy to comprehend. You deserve to be paid for your content. You help me in my professional life."
"I'm practically using your articles as a roadmap to become a better data engineer."
Right on your laptop or in your browser. Like playing a game: read, code, verify, move on.
A web-based app that lets you practice both Spark SQL and the DataFrame API without any setup — helping you prepare for interviews and sharpen your Spark skills at your own pace.
👉 You can visit spark.vutrinh.net to try the first 5 problems — no sign-up required.
💡 Already a paid member or just joined? Visit spark.vutrinh.net and sign up with your Substack email to unlock all problems.
Everything above is included for $7/month.
Get started for $7/mo →
A taste of what's inside: sharp, technical, no fluff.
RDD, architecture, execution modes, planning, scheduling, resource allocation, memory management, cache, and joins. In short: everything about Spark.
How does Parquet organize data? Why the hybrid format? How do read/write processes work? And how does it help with OLAP workloads?
In this article, I sat down and relearned Git. It covers not only the Git commands themselves, but also what happens under the hood.
The sheer variety of cloud services can overwhelm a new user. If you're a data engineer who already has too much to learn, entering the cloud without prior experience can feel twice as daunting. Here's a vendor-agnostic guide to get started.
Data architecture 101 — warehouse, lake, lakehouse, data mesh. Plus clarifications on Medallion, data modeling, and the Modern Data Stack.
📚 200+ more articles
Deep dives on Spark, data formats, orchestration, cloud, Git, data modeling, and more.
Browse all →
One subscription. Every article, every tool, everything I build next.
🎓 Student with a university email? Get 50% off the annual plan →
🇻🇳 Vietnamese user having payment issues? DM me on Substack or LinkedIn for 50% off the annual plan.
billed annually
What's included
Already a subscriber? Activate your GitHub access here.