Career Dish · Real jobs, real talk

Day in the Life of a Data Scientist: Three Real Days

~20 min read · 3 voices

Three data scientists wrote down everything they did on one ordinary workday. Not a model launch. Not an outage. A normal Wednesday in the middle of a regular sprint. The pattern that emerges: more waiting for queries, more explaining numbers to people who wanted different numbers, and less machine learning than any of them expected when they accepted the job.

These characters are composites, built from dozens of real accounts, interviews, and community threads. The people aren't real. The experiences are.

Thalia, 27
Data Scientist at a 120-person healthcare analytics startup in Pittsburgh, Pennsylvania · Wednesday · 18 months in the role · Biostatistics master's from Pitt, first industry job
7:20 AM
I wake up to my phone buzzing. It's a PagerDuty alert from the churn prediction pipeline. The Airflow DAG that retrains the model nightly failed at 3:14 AM. I check the error log from bed. It's a Snowflake connection timeout. Not a code problem, an infrastructure problem. I tag it for our data engineer Kellen in Slack and close my eyes for another twenty minutes. This happens maybe twice a month. Kellen usually fixes it before I finish breakfast.
8:05 AM
Coffee. I use a French press because it's the only thing I can do in the morning that doesn't involve a screen. Five minutes of standing in my kitchen watching water darken. My roommate Jess, she's a nurse at UPMC, is already gone. Her shift started at 6. I used to think my job was intense and then Jess told me about her Tuesday and I never used the word "intense" about data science again.
8:25 AM
Open laptop. Kellen already fixed the pipeline. Retraining completed at 7:48 AM. The churn model is current. I pull up the model monitoring dashboard in Datadog and check the feature drift metrics. Everything within normal bounds. AUC is 0.81, which is where it's been for six weeks. I note the date and the metric in a tracking spreadsheet I keep. My manager Prashant doesn't require this, but I started doing it after a model degraded slowly over three months and nobody noticed until a VP asked why our churn predictions were wrong. Now I check every morning. Takes four minutes.
8:40 AM
Standup. Seven people on the call, it takes eleven minutes. Prashant asks me where I am on the patient readmission model. I tell him I'm blocked on getting access to the ADT data from our hospital partner. The data sharing agreement was signed two weeks ago but the SFTP credentials haven't been set up yet. This is the third standup in a row where I've reported this same blocker. Prashant says he'll escalate. I believe him. I also know that "escalate" in a 120-person company means he sends a Slack message to someone who sends an email to someone at the hospital.
9:00 AM
Since I'm blocked on the readmission model, I pivot to the ad-hoc request queue. There are four requests in the Jira backlog. I pick the one from Elena, our head of customer success: "Can you pull the average time-to-value for customers onboarded in Q1 vs Q4 of last year?" Time-to-value is defined as the number of days between contract signing and the customer's first logged analysis in our platform. I know where this data lives because I've pulled it three times before. I open DataGrip and start writing the query.
9:35 AM
The query runs in 38 seconds, which is fast for this table. Q1 cohort: median 14 days. Q4 cohort: median 23 days. The Q4 number is higher because we onboarded a cluster of large enterprise customers in November who took longer to configure. I know this because I remember the onboarding calls. But Elena doesn't know that context. If I just send her "14 vs 23" she'll think Q4 onboarding got worse. So I segment by customer size: for SMB customers, it's 12 vs 13. For enterprise, it's 31 vs 34. For the November cluster specifically, it's 42. The story is "enterprise takes longer, and we had more enterprise in Q4," not "onboarding got worse." Writing this up in a Slack message that's clear without being condescending takes me about fifteen minutes. The query took 38 seconds. The explanation took fifteen minutes. That ratio is roughly representative of my job.
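For readers who want to see the mechanics, the segmented comparison is a few lines of pandas. This is a minimal sketch, not Thalia's actual query results; the frame `ttv` and its columns are hypothetical stand-ins.

```python
# ttv is a hypothetical frame with one row per customer:
# "cohort" ("Q4" / "Q1"), "segment" ("SMB" / "Enterprise"), and
# "days_to_value" (days from contract signing to first logged analysis).
topline = ttv.groupby("cohort")["days_to_value"].median()

# The segmented view that keeps "14 vs 23" from telling the wrong story.
by_segment = (
    ttv.groupby(["segment", "cohort"])["days_to_value"]
       .median()
       .unstack("cohort")
)
print(topline, by_segment, sep="\n\n")
```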
The query took 38 seconds. The explanation took fifteen minutes. That ratio is roughly representative of my job.
— Thalia
10:10 AM
Next ad-hoc: "What's the correlation between feature usage depth and renewal rate?" This one is more interesting. I pull the usage data from our product analytics warehouse and the renewal data from Salesforce. The join key is account_id, but the product analytics table uses a hashed version and Salesforce uses the raw ID. I have a mapping table but it hasn't been updated since February. I spend 25 minutes reconciling the two, find 14 accounts that are in Salesforce but not in the product analytics data, which means they have contracts but have never logged in. That's a finding worth flagging. I note it separately.
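Here's a rough sketch of that reconciliation in pandas. Every name in it (`sf_df`, `usage_df`, `id_map`, `account_id`, `hashed_account_id`) is a hypothetical stand-in, not the company's actual schema.

```python
import pandas as pd

# Hypothetical inputs: Salesforce accounts, product analytics usage, and the
# (possibly stale) mapping between raw and hashed account IDs.
sf_df = pd.read_csv("salesforce_accounts.csv")    # account_id, renewal fields, ...
usage_df = pd.read_csv("product_usage.csv")       # hashed_account_id, usage fields, ...
id_map = pd.read_csv("account_id_map.csv")        # account_id <-> hashed_account_id

# Bring the raw ID onto the usage side via the mapping table.
usage_raw = usage_df.merge(id_map, on="hashed_account_id", how="left")

# Anti-join: accounts that exist in Salesforce but never show up in usage data.
check = sf_df.merge(
    usage_raw[["account_id"]].drop_duplicates(),
    on="account_id",
    how="left",
    indicator=True,
)
never_logged_in = check[check["_merge"] == "left_only"]
print(f"{len(never_logged_in)} accounts have contracts but no product usage")
```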
11:00 AM
I run the correlation analysis. Pearson r = 0.43 between feature usage depth (measured as distinct features used per month) and 12-month renewal probability. Moderate positive correlation, not surprising. I build a quick scatter plot in Python, add a trend line, and note the outliers: three accounts with high feature usage that churned anyway. I look into them. Two had leadership changes. One had a budget cut. Usage depth doesn't protect against organizational change. I include that caveat in my writeup.
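The analysis itself is only a few lines. A minimal sketch, assuming a hypothetical frame `df` with one row per account:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# "usage_depth" = distinct features used per month, "renewed" = 1/0 at 12 months.
r, p = pearsonr(df["usage_depth"], df["renewed"])
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# Quick scatter with a least-squares trend line; outliers get inspected by hand.
x, y = df["usage_depth"], df["renewed"]
slope, intercept = np.polyfit(x, y, deg=1)
plt.scatter(x, y, alpha=0.4)
plt.plot(x, slope * x + intercept, color="red")
plt.xlabel("Distinct features used per month")
plt.ylabel("Renewed within 12 months")
plt.show()
```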
11:45 AM
Lunch. I walk to a place two blocks away that does Korean rice bowls. I eat at a table by the window and read a blog post about causal inference that my grad school friend Yun sent me. It's about instrumental variables. I understand about 70% of it and save the rest for later. This is how I learn now. Not textbooks. Blog posts over lunch.
12:40 PM
Back at my desk. I have two hours of unscheduled time before my 2:45 meeting. This is rare and I protect it. I open my Jupyter notebook for the churn model v2 experiment. The current model uses logistic regression because it's interpretable and our clinical advisory board can audit the coefficients. But I want to test whether a gradient-boosted model would improve accuracy enough to justify the interpretability tradeoff. I've been running this experiment in 30-minute increments for two weeks. Today I tune the hyperparameters using Optuna. I set up 100 trials and let it run.
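Here's roughly what that kind of Optuna experiment looks like, sketched with scikit-learn's HistGradientBoostingClassifier as a stand-in gradient-boosted model. The search ranges and the `X`/`y` names are hypothetical, not Thalia's actual setup.

```python
import optuna
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Hypothetical search space; X and y are the churn features and labels.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "max_leaf_nodes": trial.suggest_int("max_leaf_nodes", 15, 63),
        "l2_regularization": trial.suggest_float("l2_regularization", 1e-3, 1.0, log=True),
    }
    model = HistGradientBoostingClassifier(**params, random_state=42)
    # Cross-validated AUC so the comparison with the logistic baseline is apples to apples.
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_value, study.best_trial.params)
```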
1:15 PM
Optuna finishes. Best trial: AUC 0.86 vs the current model's 0.81. Five percentage points. That's meaningful. But the clinical advisory board meets quarterly, and Prashant warned me that they pushed back hard the last time someone proposed a non-interpretable model. The board chair is a physician named Dr. Lourdes who said, and I'm quoting from the meeting notes, "If I can't explain why the model flagged this patient, I won't use it." I need to think about how to present this. SHAP values might bridge the gap. I make a note to build SHAP explanations for the top 20 features and bring those to the next advisory meeting.
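The SHAP piece is mechanically simple, which is part of why it's a plausible bridge. A minimal sketch, assuming the tuned tree model (`model`) and an audit sample (`X_holdout`), both hypothetical names:

```python
import shap

explainer = shap.Explainer(model, X_holdout)
shap_values = explainer(X_holdout)

# Global view: the 20 features that matter most, for the advisory board deck.
shap.plots.beeswarm(shap_values, max_display=20)

# Local view: why the model flagged this one patient, which is the question
# the board chair actually cares about.
shap.plots.waterfall(shap_values[0])
```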
2:00 PM
Slack from Elena: "That usage analysis was really helpful, thank you. Quick follow-up: can you break it down by industry vertical?" I can. It will take about 45 minutes because the industry tags are in yet another table that needs reconciliation. I tell her I'll have it by end of day tomorrow. She says "perfect." The follow-up request is always where the real time goes.
2:45 PM
Weekly product sync. Twelve people in the room. The product manager, two engineers, the design lead, customer success, and me. I present the churn model's latest performance metrics: precision 0.74, recall 0.69 at the current threshold. The product manager asks if we can lower the threshold to catch more at-risk accounts. I explain that lowering the threshold would increase recall to 0.82 but drop precision to 0.58, which means 42% of the accounts flagged as at-risk would actually be fine, and customer success doesn't have the bandwidth to follow up on that many false positives. This is the conversation I have every three weeks. The answer is always the same. But each time someone new in the room hears it for the first time, so I explain it again.
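That tradeoff comes straight off the precision-recall curve. A hedged sketch, with hypothetical `y_true` labels and `y_scores` model probabilities from a holdout set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# What each candidate threshold buys (read at the nearest threshold at or above t).
for t in [0.5, 0.4, 0.3]:
    idx = min(np.searchsorted(thresholds, t), len(thresholds) - 1)
    print(f"threshold {t:.2f}: precision {precision[idx]:.2f}, recall {recall[idx]:.2f}")
```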
3:30 PM
Meeting ends. I spend 40 minutes documenting the hyperparameter experiment in Notion. Model version, dataset hash, feature list, best parameters, performance metrics, next steps. Prashant reads these. Nobody else does. But six months from now when someone asks "did we try a GBM for churn," the answer will be findable.
4:15 PM
Start on Elena's industry breakdown. I pull the vertical tags from the CRM export, join to the usage data, and realize that 23% of accounts don't have an industry tag. I message Elena about this. She says "tag the untagged ones based on their website." I do not want to do this. But I also know that presenting results with a 23% "unknown" category will undermine the analysis. I pull up 15 of the untagged accounts, look at their websites, and assign verticals. It takes 35 minutes. This is the least scientifically rigorous thing I do on a regular basis, and it's also probably the most practically useful.
5:10 PM
I close my laptop. The readmission model is still blocked. The churn v2 experiment needs a presentation strategy. Elena's follow-up is half done. The ad-hoc queue still has two items. I made progress on exactly none of the things I was supposed to prioritize this sprint and meaningful progress on three things that weren't on the sprint board. This is a normal Wednesday.
6:30 PM
Jess gets home. She asks how my day was. I say "fine, I explained the same precision-recall tradeoff for the fourth time and manually tagged 15 companies based on their websites." She says "I had a patient code in the elevator." We eat pad see ew on the couch and watch a renovation show. Neither of us talks about work again.

Kieran, 33
Senior Data Scientist at a regional banking institution in Charlotte, North Carolina · Tuesday · 4 years in the role · Economics PhD dropout (ABD) from Duke, joined the bank straight from the program
7:45 AM
I get to the office at 7:45. The bank still does in-person five days a week. There's a coffee machine in the breakroom that produces something between espresso and dishwater. I drink it because it's free and because the ritual of walking to the breakroom and standing there for 90 seconds is the only thing separating "arriving" from "working." My badge still says "Analyst III" because the bank's HR system doesn't have a "data scientist" title. My actual job is building credit risk models. My HR title is the same as someone who reconciles spreadsheets. This bothers me less than it used to.
8:00 AM
Open SAS. Yes, SAS. The bank's model validation team only accepts models built in SAS or validated against a SAS benchmark. I also use Python for exploration and prototyping, but anything going into production runs through SAS Enterprise Miner. My friend Anika, she's a data scientist at a startup in Raleigh, asks me sometimes how I tolerate SAS. I tell her it's like driving a car from 1997. It works. It's slow. The radio only gets AM stations. But it's paid off and nobody questions whether it's safe.
8:20 AM
I pull up the credit card delinquency model I've been rebuilding for three months. This model predicts the probability that a credit card holder will go 30+ days past due in the next 90 days. The current production model was built in 2019 and its Gini coefficient has degraded from 0.62 to 0.51 because consumer behavior shifted during and after the pandemic. People who used to pay on time started carrying balances. People who used to carry balances paid off their cards with stimulus money and now have different spending patterns. The model can't see any of that because it was trained on 2016 to 2018 data.
9:00 AM
Model governance meeting. Four people: me, my manager Delphine, the model risk officer Len, and a compliance analyst named Pilar. Len reviews my validation report for the rebuilt delinquency model. He has seventeen comments. Twelve are formatting: he wants the tables in a specific template, the coefficients rounded to four decimal places, the p-values listed even for variables that were removed during selection. Four are substantive: he questions why I removed a feature called "months_since_last_address_change" and whether removing it could create fair lending concerns. One is a genuine catch: he noticed that my holdout test set has a slightly different class balance than the development set, which could bias my lift chart. That one's valid. I need to resample. The other sixteen range from useful to bureaucratic but they all have to be addressed before the model can go to production. This is what model governance looks like at a bank.
10:15 AM
Back at my desk. I fix the holdout sampling issue first because it's the substantive one. Re-stratify the test set to match the development set's 4.2% delinquency rate. Rerun the validation suite. Gini goes from 0.64 to 0.63. Minor change but I need to update every table in the 47-page validation document. I open the Word template and start updating.
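Re-stratifying a holdout to a target event rate is a few lines of pandas. A sketch with a hypothetical `holdout` frame; it keeps every delinquent account and downsamples the rest, which assumes the holdout has more non-delinquent rows than the target rate requires.

```python
import pandas as pd

# "holdout" is a hypothetical frame of test accounts with a 0/1 "delinquent" flag.
target_rate = 0.042  # match the development set's 4.2% delinquency rate

pos = holdout[holdout["delinquent"] == 1]
neg = holdout[holdout["delinquent"] == 0]

# Keep every delinquent account and downsample the rest so positives land at 4.2%.
n_neg = int(round(len(pos) * (1 - target_rate) / target_rate))
restratified = pd.concat([pos, neg.sample(n=n_neg, random_state=7)])
restratified = restratified.sample(frac=1, random_state=7)  # shuffle
```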
11:00 AM
Delphine stops by my desk. She says the Chief Risk Officer wants to know if we can build a model that predicts which customers are likely to close their accounts in the next 60 days. I ask if this is attrition modeling or if there's a regulatory driver. She says "Conrad asked for it in the executive committee meeting yesterday, I don't know the context." Conrad is the CRO. He asks for things in meetings and then they become projects. I tell Delphine I need to scope it, which means I need to understand what data we have on account closures, whether the data is labeled, and whether this is a supervised or unsupervised problem. She says "just send me a one-pager by Friday." I add it to my list.
11:30 AM
I eat lunch at my desk. Turkey sandwich from home. I read through the OCC's updated model risk management guidance that came out last month. Thirty-eight pages. I'm on page 14. There's a section about AI and machine learning models that basically says "if you use ML, your validation burden increases proportionally to the model's complexity." This is why the bank uses logistic regression for almost everything. Not because logistic regression is the best model. Because it's the most defensible model to a regulator.
12:15 PM
I spend 90 minutes addressing Len's formatting comments on the validation document. Moving decimal places. Reorganizing tables. Adding a section header he requested. Reformatting the lift chart legend. The substance of the model hasn't changed. The presentation of the model has changed seventeen times. I think about what my PhD advisor would say about this. He'd say "this is why you should have finished the dissertation." He might be right. But I also make $127,000 and he makes $94,000, so.
The substance of the model hasn't changed. The presentation of the model has changed seventeen times. That's model governance at a bank.
— Kieran
2:00 PM
I start scoping the account closure model for Delphine. Pull a sample of closed accounts from the past two years. 8,400 closures out of roughly 340,000 active accounts. That's a 2.5% event rate, which is low but workable. The data is labeled: I know which accounts closed and when. I join account-level features: tenure, balance trends, transaction frequency, product holdings, number of service calls. Thirty-seven features before selection. I run a quick correlation matrix and immediately spot a problem: "zero_balance_flag" and "days_since_last_transaction" are 0.89 correlated. One of them has to go. I keep "days_since_last_transaction" because it's more granular. This kind of decision takes five minutes and I've made three like it already and I haven't written a single line of model code yet.
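The collinearity check behind that five-minute decision is small. A sketch assuming `X` is the frame of candidate features:

```python
import numpy as np

corr = X.corr().abs()

# Look only above the diagonal so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear_pairs = (
    upper.stack()
         .loc[lambda s: s > 0.85]
         .sort_values(ascending=False)
)
print(collinear_pairs)  # e.g. (zero_balance_flag, days_since_last_transaction)  0.89
```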
3:30 PM
I've built a rough logistic regression on the account closure data. In-sample Gini: 0.58. Not bad for a first pass. The top three predictive features are days since last transaction, declining monthly balance trend, and number of service complaints in the past 90 days. This makes intuitive sense, which is good because Delphine will ask me "does it make sense" before she asks me about the Gini. I write it up in a one-page summary. Feature importance, top-line accuracy, recommended next steps (add more features, test on a holdout, formal validation if we go to production). I send it to Delphine at 3:52.
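For reference, the rough first pass Kieran describes amounts to something like this in Python (his production work runs through SAS). The `X` and `y` names are hypothetical, and Gini here is just rescaled AUC.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X holds the selected account-level features, y the 0/1 closed-within-60-days label.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X, y)

# Gini = 2 * AUC - 1.
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print(f"In-sample Gini: {2 * auc - 1:.2f}")

# Rough feature importance for the "does it make sense" conversation.
importance = pd.Series(model.coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
print(importance.head(3))
```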
4:00 PM
Delphine responds in eight minutes: "This is great. Can you present it to Conrad on Thursday?" Thursday. Two days. A rough first-pass model, no validation, no holdout testing, presented to the Chief Risk Officer. I type "sure" and then sit with the dissonance for a moment. In my PhD program, presenting unvalidated results would have been unthinkable. Here, the executive wants directional signal fast, not methodological perfection. I've learned to hold both standards in my head. The Thursday version will have caveats. The production version, if it gets there, will have rigor. They're different deliverables for different audiences at different stages. I'm still getting used to that.
4:45 PM
I pack up. Drive home. 22 minutes. The drive is the only time in my day when I'm not looking at a screen or talking to someone. I listen to a podcast about economic history. Not data science. I have heard enough about data science today.
6:00 PM
My wife Renata asks how it went. I say "I spent ninety minutes reformatting a document and then I built a model in two hours that I'll present to the CRO in two days." She says "is the model good?" I say "it's directional." She says "what does that mean?" I say "it means it's probably right but I can't prove it yet." She says "that sounds like half of medicine." She's a pharmacist. She has a point.

Emery, 36
Senior Data Scientist at a mid-size e-commerce company in Los Angeles, California · Wednesday · 6 years in role across two companies · Applied math undergrad from UCLA, self-taught ML through Coursera and Kaggle
8:50 AM
I'm remote. Always have been at this company. I open Slack at my kitchen table with a bowl of granola and there's a message from our VP of Marketing, Lorraine. Wait, not Lorraine. She left last month. The new VP is somebody named Gretchen. The message says: "Hi Emery! I'm looking at the attribution dashboard and the numbers look off. Can we chat today?" I've been through four marketing VPs in six years. Every single one's first week involves questioning the attribution numbers. I reply "sure, I'm open at 11."
9:10 AM
I open our dbt project in VS Code. I've been migrating the company's analytics from a nest of stored procedures to dbt models for the past eight months. There are 147 models in the project. Forty-two of them are what I call "trust models," the ones that feed dashboards that executives look at. The rest are intermediate transformations and staging tables. Today I'm working on the customer lifetime value model, which breaks each customer's revenue into first-purchase, repeat-purchase, and subscription components. The current SQL was written by a contractor two years ago and there's a CASE WHEN statement that's 94 lines long. I need to understand it before I can refactor it.
9:45 AM
I trace through the 94-line CASE WHEN. It's handling fifteen product categories and assigning each to a revenue bucket. Seven of the categories have been renamed since the SQL was written. The SQL still references the old names, which means it's silently miscategorizing about 12% of transactions. They're falling through to the ELSE clause and being labeled "Other." Nobody noticed because "Other" has always been a bucket. It just grew from 3% to 15% and nobody asked why. I found it because the CLV numbers for the "Home Essentials" category seemed too low when I compared them to order volume. This is how data bugs work. They don't break. They degrade quietly and the dashboard still loads fine.
10:30 AM
I fix the CASE WHEN. Map the old category names to the new ones. Run the updated model in dbt against last month's data to verify. The CLV for "Home Essentials" jumps from $47 to $62. That's a meaningful difference for a category that represents 18% of revenue. I document the fix in a pull request with a table showing the before-and-after category distributions. Our analytics engineer, a junior person named Wren, will review it. I try to make my PRs educational because Wren is good but new and this kind of silent data bug is something she'll need to catch on her own eventually.
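Emery's actual fix lives in SQL inside the dbt model, but the idea translates to a few lines of pandas. The category names in the mapping below are hypothetical; the point is the explicit rename plus a guardrail on the catch-all bucket.

```python
# Hypothetical old-name -> new-name mapping for the renamed categories.
RENAMED = {
    "Home Goods": "Home Essentials",
    "Electronics & Gadgets": "Electronics",
    # ...and the other renamed categories
}

# Normalize before bucketing revenue, instead of letting unrecognized names
# fall through to the "Other" catch-all.
orders["category"] = orders["category"].replace(RENAMED)

# Guardrail: fail loudly if the catch-all bucket grows past a sanity threshold.
other_share = (orders["category"] == "Other").mean()
assert other_share < 0.05, f"'Other' is {other_share:.0%} of orders; check for unmapped categories"
```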
11:00 AM
Call with Gretchen, the new marketing VP. She shares her screen and points at the attribution dashboard. "This says email drove $2.1 million last month but my team's internal tracking says $3.4 million. Which one is right?" I explain that our dashboard uses a last-touch attribution model and her team's tracking uses first-touch. In last-touch, if someone clicks an email but then comes back through Google and buys, the sale goes to Google. In first-touch, it goes to email. Neither is wrong. They're answering different questions. She says "can you build a multi-touch model?" I say "yes, and I built one two years ago for your predecessor, but she decided last-touch was simpler to explain to the board and asked me to shelve it." Gretchen says "unshelf it." I say OK and make a note. This is the third time this model has been requested, shelved, and re-requested. I still have the code.
This is the third time this attribution model has been requested, shelved, and re-requested. I still have the code. I always keep the code.
— Emery
11:40 AM
I dig up the multi-touch attribution model from a private GitHub repo where I keep my work-in-progress code. Last modified fourteen months ago. I read through the notebook. It's a Shapley value-based approach that distributes credit across all touchpoints in a customer's journey weighted by their marginal contribution. The math is sound. The data pipeline it depends on is probably broken because the marketing team changed their UTM taxonomy in September. I check. Yes, broken. I need to update the regex that parses UTM parameters. This is always what happens. The model is fine. The plumbing changes underneath it and the model stops working. Not because the model is wrong but because someone renamed "utm_source=facebook" to "utm_source=meta" and didn't tell anyone downstream.
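The Shapley idea itself fits in a short function. This is a toy illustration of the concept, not Emery's notebook: it assumes you already have a `coalition_value` estimate (conversion value for each subset of channels, built elsewhere from journey data), and it only stays cheap for the handful of channels a single journey actually touches.

```python
from collections import defaultdict
from itertools import permutations

def shapley_attribution(channels, coalition_value):
    """Split conversion credit across channels by average marginal contribution.

    coalition_value maps a frozenset of channels to the conversion value that
    subset is estimated to generate.
    """
    credit = defaultdict(float)
    orderings = list(permutations(channels))
    for order in orderings:
        seen = frozenset()
        for ch in order:
            with_ch = seen | {ch}
            credit[ch] += coalition_value.get(with_ch, 0.0) - coalition_value.get(seen, 0.0)
            seen = with_ch
    return {ch: credit[ch] / len(orderings) for ch in channels}

# Toy example with made-up values: email and search split the 0.12 conversion
# value by average marginal contribution, not by first or last touch.
v = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.05,
    frozenset({"search"}): 0.08,
    frozenset({"email", "search"}): 0.12,
}
print(shapley_attribution(["email", "search"], v))  # {'email': 0.045, 'search': 0.075}
```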
12:30 PM
Lunch. I make a sandwich and eat it on my balcony. My neighbor's dog is barking at a squirrel. I think about the CLV bug. Twelve percent of transactions miscategorized for probably eighteen months. The dashboard never broke. The numbers were always there, always wrong, always trusted. I wonder how many other bugs like that are sitting in the other 146 dbt models. The answer is probably "several" and the only reason I found this one is because I happened to compare two numbers that should have been closer. You don't find data bugs by looking for data bugs. You find them by accident while doing something else. That's a problem that I don't know how to systematize.
1:15 PM
I spend two hours updating the UTM parsing logic for the multi-touch model. The old regex handled five sources. The current UTM taxonomy has eleven sources and four new medium types. I update the mapping, test it against last month's marketing data, and verify that every campaign resolves to a known source/medium pair. Seven campaigns don't resolve. I check the raw data. They're from an influencer program that uses a custom UTM structure that doesn't match any convention. I add a handler for it. This is the kind of work that never shows up in a job description. "Updated a regex to handle influencer UTM parameters" is not a line item anyone evaluates in a performance review. But if I don't do it, the model produces wrong numbers, and wrong numbers in a dashboard erode trust, and once trust erodes you're not a data scientist anymore, you're just someone with a Python notebook that nobody looks at.
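The plumbing in question is unglamorous. Here's a minimal sketch of a UTM parser with an alias map and a handler for non-standard influencer tags; the alias table, the known-source list, and the `inf_` prefix are all hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical alias table: sources renamed upstream that still need to resolve
# to one canonical name downstream.
SOURCE_ALIASES = {"facebook": "meta", "fb": "meta", "google-ads": "google"}
KNOWN_SOURCES = {"meta", "google", "email", "tiktok", "influencer"}

def parse_utm(url: str) -> dict:
    params = parse_qs(urlparse(url).query)
    source = params.get("utm_source", ["unknown"])[0].lower()
    source = SOURCE_ALIASES.get(source, source)

    # Handler for the influencer program's non-standard tags, e.g. utm_source=inf_<handle>.
    if re.match(r"^inf_", source):
        source = "influencer"

    if source not in KNOWN_SOURCES:
        source = "unknown"  # surface unknowns rather than silently mislabeling them
    return {"source": source, "medium": params.get("utm_medium", ["unknown"])[0]}
```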
3:30 PM
Wren reviews my CLV pull request. She catches a typo in one of the category mappings, "Home Essential" instead of "Home Essentials," which would have created a new phantom category. Good catch. I fix it, she approves, and I merge. The updated CLV model will run overnight and the new numbers will appear in the dashboard tomorrow morning. Nobody will notice the change because nobody memorized the old numbers. The dashboard will just be slightly more right than it was yesterday, and the world will proceed.
4:00 PM
I write up a brief doc for Gretchen summarizing the multi-touch attribution model: what it does, what data it needs, what the timeline looks like to get it running again. I estimate two weeks. One week to update the pipeline, one week to validate the outputs against known campaigns. I send it. She responds in three minutes: "Can we do it in one week?" I say "the validation week is what keeps the numbers trustworthy." She says "fair." This is the correct answer and I appreciate that she accepted it. The last VP would have pushed for one week and then questioned the numbers when they didn't match her intuition.
5:15 PM
I close my laptop. I fixed a bug that nobody knew existed, resurrected a model that's been requested three times, and spent two hours updating a regex. My LinkedIn says "Senior Data Scientist, specializing in machine learning and predictive analytics." My actual day was SQL, regex, Slack, and a conversation about what "attribution" means. The gap between the title and the work is a canyon. But the work is real and it matters and I'm weirdly good at finding phantom categories in 94-line CASE WHEN statements. So.

Frequently Asked Questions

What does a data scientist do all day?
A typical data scientist's day includes some combination of: writing and debugging SQL queries, cleaning and transforming data, building or maintaining models, attending meetings to explain analysis results, responding to ad-hoc data requests, and documenting their work. The actual modeling work is often a small fraction of the day. Most data scientists report spending 50 to 70 percent of their time on data preparation, stakeholder communication, and pipeline debugging.
How many hours do data scientists work?
Most data scientists at established companies work 40 to 50 hours per week. At startups, hours can stretch to 50 to 60 during critical periods. Remote and hybrid arrangements are common, which gives flexibility in when those hours happen, but on-call responsibilities for production models can mean occasional off-hours work.