How many seasons of data do I need before I can defend a factory allocation change?

Four seasons of consistently scored data is the threshold where a trend line becomes defensible. One season is recency bias, two can be coincidence, and three start to show a pattern. Single-season outliers within that window should drive root cause conversations rather than volume changes. A multi-season drift, even if no single season looked like a crisis, justifies cutting volume.

Should I score factories against the original ex-factory date or the revised one?

Always the original. Scoring against revised dates hides the drift that the metric exists to capture, because dates get moved to accommodate the factory. Track revisions in a separate field with reason codes so you can see how often dates moved and why, but on-time-in-full should always be measured against the date locked at PO issuance.

What metrics belong on a factory scorecard beyond on-time delivery?

At minimum: on-time-in-full against the original date, PO acknowledgment latency, sample-to-bulk accuracy, defect rate at inbound QC, short-ship percentage, markdown attribution for factory-controllable causes, communication responsiveness, and compliance pass rate. Unit cost is a sourcing input, not a performance metric. A cheaper factory that drives markdowns is not actually cheaper, and three seasons of attribution data is what makes that visible.

Why do most factory scorecards fall apart after one or two seasons?

Because they live in spreadsheets separate from the production record, and the data has to be re-entered manually from email, QC reports, and the 3PL portal each season. For a $15M brand already losing 6 to 9 hours a week to inventory reconciliation, a separate scorecard process gets attempted in Q1 and abandoned by Q2. It only survives if scoring is a byproduct of how the PO moves through the system.

Who should populate the factory scorecard fields?

The brand's own systems, not the factory or the sourcing agent. PO issuance dates should come from the production module, receipt dates from the warehouse module, and defect rates from the inspection record. If the factory grades its own homework, the grades will be fine and the trend line will be flat. The data has to come from records the factory does not control.

How often should I run a formal factory review?

Per-PO scoring should happen continuously as POs move through their lifecycle. Formal vendor reviews should happen at the close of each season's delivery window, with conversations held within two weeks while the data is fresh. A four-season rolling view drives annual allocation decisions. Annual-only reviews are too slow because two more seasons of POs will already be placed before the data is acted on.


How to Track Factory Performance Across Multiple Seasons

By Shubham Singh · Reviewed by Isabelle Feyerabend · May 30, 2026 · 10 min read

A production manager at a $15M contemporary brand opens her season wrap-up deck the Monday after market. Three factories. Eleven POs. She has on-time data in a Google Sheet her coordinator updates from email, defect notes in a shared Dropbox folder the QC contractor uploads to, and a separate tab where she tracks the styles that came in 2 cm short on the body length. Her CEO asks which factory to give the bigger Fall buy to. She knows the answer in her gut. She cannot defend it with anything that would survive a second question. Last season’s data lives in a different sheet. The season before that, in an inbox.

What does it actually mean to track factory performance across multiple seasons?

To track factory performance, apparel brands need to do something most are not doing: score the same vendor on the same metrics, the same way, across at least four consecutive seasons, with the data living in one production record per PO rather than scattered across email threads, spreadsheets, and QC reports.

The phrase gets used loosely. Most brands mean something closer to “I remember who was a problem last season.” That is not tracking. That is recall, weighted heavily toward the most recent and most painful events. A factory that missed one ship window in Spring 25 will be remembered as unreliable even if its three preceding seasons were clean. A factory that quietly drifted from 94 percent on-time in Fall 23 to 81 percent in Fall 24 will not be flagged, because no one is looking at the trend line.

Real multi-season tracking is structural. It requires that every PO carries the same set of scored fields, that those fields are populated on the same cadence, and that the scorecard can be pulled for any vendor across any range of seasons in under a minute. If pulling it takes a week of coordinator time, it will not get pulled, and sourcing decisions will continue to be made on feel.

Why does factory performance data fragment so quickly?

This is Breakpoint 2 of the 6 Breakpoints of Apparel Operations: production and supply execution drift from the plan, and the system of record fragments alongside it. The original PO sits in one place. The vendor’s acknowledgment sits in email. The revised ex-factory date sits in a WhatsApp thread with the sourcing agent. The inspection report sits in a PDF the QC firm uploaded to a shared drive. The actual receipt date sits in the 3PL’s portal. By the time the season closes, reconstructing what happened on a single PO takes 40 minutes of clicking.

When I am sitting across from a buyer comparing vendors, the question I ask first is not which ERP they are evaluating. It is where the production record lives today. Nine times out of ten, the answer is “it depends on the PO.” That answer is the diagnosis. Until there is one production record per PO that carries every state change from issuance through receipt, no scorecard built on top will hold up across seasons, because the underlying data was never collected consistently in the first place.

The fragmentation compounds. Each season, the team intends to start tracking more rigorously. Each season, market hits, samples are late, and the discipline slips. By Fall, the Spring data has gaps. By Spring of the next year, the Fall data has gaps. The trend line everyone wants does not exist because the number on the y-axis was never measured the same way two seasons running.

What metrics actually matter for a multi-season factory scorecard?

The industry default is a vague “on-time delivery percentage” pulled from whoever updated the sheet most recently. That is not enough to make an allocation decision worth six figures. A defensible scorecard tracks at minimum these eight fields per PO, scored the same way every season:

On-time-in-full against the original ex-factory date (not the revised one). Revisions hide drift.
PO acknowledgment latency. Days between PO issuance and signed vendor confirmation.
Sample-to-bulk accuracy. Percentage of approved samples that match the bulk on fit, color, and construction at first inspection.
Defect rate at inbound QC. Units rejected or reworked as a percentage of the PO quantity.
Short-ship percentage. Units actually received versus units on the original PO.
Markdown attribution. Styles from this PO that took unplanned markdowns within 60 days of landing, weighted to the factory’s controllable causes (late delivery, wrong color, fit issues) rather than market causes.
Communication responsiveness. Median hours to respond to a production query during business hours in the factory’s timezone.
Compliance pass rate. Audits passed without major findings, across whatever framework the brand operates under.

Notice what is not on that list. Unit cost is not a performance metric. It is a sourcing input. A factory that is 8 percent cheaper but drives a 3 percent markdown rate on the styles it produces is not cheaper. You only know that if markdown attribution is on the scorecard and you have at least three seasons of data to look at.
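To make the eight fields concrete, here is a minimal Python sketch of what a per-PO score record and a vendor-by-season rollup could look like. Everything in it is an assumption for illustration: the `POScore` class, its field names, the `pct` and `rollup` helpers, and the choice to report OTIF as the share of POs that were on time and in full. It is not the schema of any particular system. The one idea it is meant to show is that raw counts are stored per PO and the percentages are derived the same way for every vendor and every season.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class POScore:
    """One scored PO. All field names are illustrative assumptions."""
    vendor: str
    season: str                      # e.g. "FW24"
    otif_vs_original: bool           # on time and in full vs. the ORIGINAL ex-factory date
    ack_latency_days: int            # PO issuance to signed vendor confirmation
    samples_approved: int
    samples_matching_bulk: int       # matched bulk at first inspection
    units_ordered: int               # quantity on the original PO
    units_received: int
    units_rejected_or_reworked: int  # at inbound QC
    styles_total: int
    styles_marked_down: int          # unplanned, within 60 days, factory-controllable causes
    median_response_hours: float
    audits_total: int
    audits_passed: int               # passed without major findings


def pct(numerator, denominator):
    """Percentage, or None when there is nothing to measure."""
    return 100.0 * numerator / denominator if denominator else None


def rollup(scores):
    """Group PO scores by (vendor, season) and derive the eight metrics."""
    groups = defaultdict(list)
    for s in scores:
        groups[(s.vendor, s.season)].append(s)

    result = {}
    for key, pos in groups.items():
        ordered = sum(p.units_ordered for p in pos)
        # Over-shipments do not offset short shipments on other POs.
        short = sum(max(p.units_ordered - p.units_received, 0) for p in pos)
        result[key] = {
            "otif_pct": pct(sum(p.otif_vs_original for p in pos), len(pos)),
            "ack_latency_days": mean([p.ack_latency_days for p in pos]),
            "sample_to_bulk_pct": pct(sum(p.samples_matching_bulk for p in pos),
                                      sum(p.samples_approved for p in pos)),
            "defect_pct": pct(sum(p.units_rejected_or_reworked for p in pos), ordered),
            "short_ship_pct": pct(short, ordered),
            "markdown_attribution_pct": pct(sum(p.styles_marked_down for p in pos),
                                            sum(p.styles_total for p in pos)),
            # Median of per-PO medians: an approximation, not a true pooled median.
            "median_response_hours": median([p.median_response_hours for p in pos]),
            "compliance_pass_pct": pct(sum(p.audits_passed for p in pos),
                                       sum(p.audits_total for p in pos)),
        }
    return result


# Two made-up POs for one vendor in one season.
scores = [
    POScore("Factory A", "FW24", True, 2, 10, 9, 1000, 980, 20, 5, 1, 6.0, 1, 1),
    POScore("Factory A", "FW24", False, 4, 10, 8, 1000, 1000, 10, 5, 0, 10.0, 1, 1),
]
card = rollup(scores)[("Factory A", "FW24")]
print(card["otif_pct"], card["defect_pct"], card["short_ship_pct"])
```

Unit cost is deliberately absent from the record. In a sketch like this it would sit in the sourcing data, alongside the scorecard rather than inside it.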

How do you build a scorecard that survives across seasons?

The scorecard has to live where the PO lives. If the PO is in a system and the scorecard is in a spreadsheet, the spreadsheet will rot. What I see from prospects who have already shortlisted three vendors is that they often start by asking which tool has the prettiest vendor scorecard dashboard. That is the wrong question. The right question is which system forces the PO record to carry the scoring fields as native attributes, populated automatically as the PO moves through its lifecycle.

For on-time-in-full, the original ex-factory date has to be locked at PO issuance and never overwritten. Revisions go into a separate field with reason codes. For sample-to-bulk accuracy, the inspection result has to be tied to the same style record the sample sat against in PLM, not a free-text style name. For defect rate, the QC report has to post units back to the PO line, not a summary number to an email. None of this is exotic. All of it requires that PLM, production, and inventory are connected at the record level, which is exactly the structural fix the 6 Breakpoints framework points at.
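A minimal Python sketch of the "lock the original date, log revisions separately" rule follows. The `PurchaseOrder` and `DateRevision` classes, the `actual_ex_factory` field, and the `FABRIC_DELAY` reason code are hypothetical names chosen for the example. A frozen dataclass only prevents in-place reassignment, so treat this as an illustration of the record shape, not as a substitute for enforcing the lock in the system of record itself.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class DateRevision:
    new_date: date
    reason_code: str  # hypothetical code, e.g. "FABRIC_DELAY"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PurchaseOrder:
    po_number: str
    original_ex_factory: date  # set once at PO issuance
    units_ordered: int
    revisions: Tuple[DateRevision, ...] = ()
    actual_ex_factory: Optional[date] = None
    units_shipped: int = 0

    def with_revision(self, new_date: date, reason_code: str) -> "PurchaseOrder":
        """Return a copy with the revision appended; the original date is untouched."""
        revision = DateRevision(new_date, reason_code)
        return replace(self, revisions=self.revisions + (revision,))

    @property
    def current_ex_factory(self) -> date:
        """Latest agreed date: the last revision, or the original if none exist."""
        return self.revisions[-1].new_date if self.revisions else self.original_ex_factory

    def is_otif(self, against_original: bool = True) -> bool:
        """On time and in full, scored against the original date by default."""
        if self.actual_ex_factory is None:
            return False
        target = self.original_ex_factory if against_original else self.current_ex_factory
        return self.actual_ex_factory <= target and self.units_shipped >= self.units_ordered


po = PurchaseOrder("PO-1001", date(2025, 3, 1), units_ordered=500)
po = po.with_revision(date(2025, 3, 15), "FABRIC_DELAY")
po = replace(po, actual_ex_factory=date(2025, 3, 14), units_shipped=500)

print(po.is_otif())                        # False: late against the locked date
print(po.is_otif(against_original=False))  # True: the revised date hides the slip
```

The same PO passes or fails depending on which date it is scored against, which is why the original has to stay on the record and the revisions have to sit beside it with their reason codes.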

For a $15M brand running wholesale and DTC with a 3PL, the team is already losing 6 to 9 hours a week reconciling inventory across Shopify, the 3PL, and wholesale orders. Adding a manual factory scorecard process on top of that load does not happen. It gets attempted in Q1, abandoned by Q2, and the season ends without it. The scorecard only gets maintained if it is a byproduct of how the PO moves, not a separate task assigned to a coordinator.

What cadence should you score factories on?

Annual reviews are too slow. By the time the year-end review happens, two more seasons of POs have been placed with the same factory, and the decisions for the next year are already locked. Quarterly reviews are better but still lag the operational rhythm of an apparel brand, where the meaningful unit of time is a season, not a quarter.

The right cadence is per-PO scoring continuously, with a formal vendor review at the close of each delivery window for each season. Spring deliveries close, Spring scorecards get pulled, conversations happen with vendors in the next two weeks while the data is fresh and before Fall production ramps. Fall deliveries close, Fall scorecards get pulled, same cycle. Once a year, the four-season rolling view drives allocation decisions for the following calendar year.

This is the same logic as running OTB weekly during selling season rather than monthly. The decisions get made at the cadence the data changes, not at the cadence the calendar suggests. Monthly OTB is too slow during a buying window. Annual factory reviews are too slow during a sourcing cycle.

How many seasons of data do you actually need before changing allocation?

One season is recency bias. Two seasons can be coincidence. Three seasons start to show a pattern. Four seasons is where the trend line is defensible enough to move volume.

The practical implication is that a brand starting to track factory performance properly today is making allocation decisions on real data 18 to 24 months from now, depending on whether they run two or four deliveries a year. That timeline horrifies most operators when they first hear it. It should not. The alternative is making allocation decisions on no data, which is what is happening today. Starting the clock late is better than not starting it.

Within that window, single-season outliers should drive conversations, not allocation changes. A factory that has been at 92 percent OTIF for three seasons and drops to 78 percent in one season needs a root cause conversation, not a volume cut. A factory that drifts from 92 to 88 to 84 to 80 across four seasons needs a volume cut, even though no single season looked like a crisis.
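The outlier-versus-drift distinction can be sketched as a simple rule over the last four seasons of OTIF. The function name `classify_trend` and both thresholds (a 10-point single-season drop, an 8-point cumulative decline) are assumptions picked so the two examples above land where the text says they should; a real team would tune them. The sketch also only checks whether the most recent season is the outlier.

```python
from statistics import mean


def classify_trend(otif_by_season, drop_threshold=10.0, drift_threshold=8.0):
    """Classify OTIF percentages ordered oldest to newest.

    Thresholds are illustrative assumptions, in percentage points.
    """
    if len(otif_by_season) < 4:
        return "insufficient data"

    window = otif_by_season[-4:]
    deltas = [later - earlier for earlier, later in zip(window, window[1:])]

    # Sustained drift: every season is lower than the one before it,
    # and the cumulative decline across the window is material.
    if all(d < 0 for d in deltas) and window[0] - window[-1] >= drift_threshold:
        return "volume cut"

    # Single-season outlier: the latest season sits far below the
    # average of the three seasons before it.
    if mean(window[:-1]) - window[-1] >= drop_threshold:
        return "root cause conversation"

    return "no action"


print(classify_trend([92, 92, 92, 78]))  # root cause conversation
print(classify_trend([92, 88, 84, 80]))  # volume cut
print(classify_trend([92, 91, 93, 92]))  # no action
```

Note the ordering of the checks: the drifting factory ends at 80 percent, higher than the outlier factory's 78, yet it is the one that gets the volume cut, because the rule looks at the shape of the four seasons and not at the latest number alone.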

What are the anti-patterns that kill multi-season tracking?

Three show up repeatedly across the brands I talk to.

First, scoring against the revised ex-factory date instead of the original. This makes every factory look like they hit their dates, because the dates moved to accommodate them. The whole point of the metric is to capture drift. Hiding the drift defeats the metric.

Second, letting the sourcing agent or the factory itself populate the scorecard fields. The data has to come from the brand’s own systems: the PO issuance date from the production module, the receipt date from the warehouse module, the defect rate from the QC inspection record. If the factory is grading its own homework, the grades will be fine and the trend line will be flat.

Third, treating the scorecard as a sourcing artifact rather than an operational one. The scorecard does not just inform who to buy from next season. It informs which POs to escalate this week, which factories need a production visit before the next delivery window, and which styles should be dual-sourced because the historical risk on the primary factory is too high. If the scorecard only comes out at year-end, it is doing one job out of four.

What this means for an apparel operations team

Factory performance tracking is not a sourcing project. It is a production data project, and it sits squarely inside Breakpoint 2. The reason most brands cannot answer the CEO’s question about which factory to give the bigger Fall buy to is not that they lack analytical sophistication. It is that the underlying PO records were never structured to carry the scoring fields in the first place, so the data does not exist to analyze.

The practical first step is to audit a single recent PO end to end. Pull the original PO, the acknowledgment, every date revision, the inspection report, the receipt record, and any markdown attributable to that PO. If reconstructing that single PO takes more than 15 minutes, the system of record is the problem, and no scorecard built on top will be defensible. Fixing the record structure has to come before fixing the scoring.

The brands that get this right do not have prettier dashboards than the ones that do not. They have a single production record per PO, scored consistently, rolling up across enough seasons that the trend line means something. That is the asset. The dashboard is just how the asset is read.



Written by

Shubham Singh

Solutions Consultant, Apparel Operations, Uphance

Shubham writes about evaluating ERP fit, assessing operational complexity, and how apparel brands can tell whether their current systems are helping or holding them back. As a Solutions Consultant at Uphance, he runs discovery conversations and fit assessments for apparel brands moving off patchwork stacks of PLM, PIM, inventory, and B2B tools. His articles cover ERP selection, vendor RFPs, comparison frameworks, and the operational signals that tell a brand it has outgrown spreadsheets and point solutions. He focuses on how mid-market apparel teams evaluate connected platforms against the cost of staying with what they have.

Reviewed by

Isabelle Feyerabend

Customer Success and Onboarding Manager, Uphance

Isabelle writes about onboarding, workflow enablement, and how apparel teams build confidence in connected operations during rollout and beyond. As a Customer Success and Onboarding Manager at Uphance, she partners with apparel brands through their first three weeks on the platform: configuration, training, and the tactical playbooks that get day-to-day workflows running. Her articles focus on how-to guidance for product, inventory, and order operations, written for the people who actually run the workflows. She covers when to use which configuration, how to write the training docs, and what the first thirty days inside a connected platform look like in practice.