Sixsense.ai

In Defect Classification, Data Quality Outweighs Model Complexity

By Prakriti Chaturvedi
Last updated: 24th Jun 2026

On a wafer-level bumping and RDL line, a small, carefully selected image set produced a more accurate model than one trained on a hundred times more data. This article examines why that occurred and what it means for how defect-classification projects should be run.

A result worth pausing on

On one bumping and RDL line, three defect classifiers were trained from the same pool of historical images. The first used 200 images, selected by hand. The second used 2,000, drawn at random. The third used 20,000, taken directly from the tool with no curation. The conventional expectation is that the largest dataset performs best. It did not.

The 200-image model reached 98% accuracy, with zero escapes and 0.1% overkill.
The 20,000-image model reached only 89%, with 1.8% of defects escaping and 2.4% of good die discarded.

One hundred times more data produced a less accurate model. This outcome is not specific to a single production line. It reflects a broader pattern in how defect-classification projects succeed.
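The three figures quoted for each model can be made concrete with a small calculation. The sketch below assumes a simplified two-way decision per die ("accept" or "reject"), which is narrower than a full multi-class defect classifier, and it reads the article's wording literally: escapes are counted against truly defective die, overkill against truly good die. The function and field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class InspectionMetrics:
    accuracy: float       # fraction of all die judged correctly
    escape_rate: float    # fraction of truly defective die that were accepted
    overkill_rate: float  # fraction of truly good die that were rejected


def inspection_metrics(actual, predicted):
    """Compare per-die decisions; each entry is 'accept' or 'reject'."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    defective = sum(a == "reject" for a in actual)
    good = len(actual) - defective
    escapes = sum(a == "reject" and p == "accept" for a, p in pairs)
    overkill = sum(a == "accept" and p == "reject" for a, p in pairs)

    return InspectionMetrics(
        accuracy=correct / len(actual),
        escape_rate=escapes / defective if defective else 0.0,
        overkill_rate=overkill / good if good else 0.0,
    )
```

Keeping the two error rates separate matters because they carry different costs: an escape ships a defective die, while overkill discards a good one.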

The effort is often directed at the wrong problem

When a classifier underperforms on a bumping line, the common response is to change the model: a different architecture, additional layers, or further tuning. A great deal of time is spent here. The factor that most limits the model, however, usually goes unaddressed — the images it was trained on.
The cost of this misdiagnosis is straightforward to quantify. A typical automatic defect classification (ADC) project involves:

four to six weeks to release a first model into production;
five to twenty retrains a year to keep it effective;
accuracy that plateaus between 91% and 93%;
escape rates of 0.5–1.5%; and
overkill of 1.5–2.5%.

The last two appear directly in the yield report. None of these are limitations of the model. They are limitations of the data.

Why bumping and RDL is difficult for AI

Bumping and RDL is a demanding environment for defect classification because the boundary between accept and reject is narrow and shifts with context. The same defect can appear very different from one wafer to the next, and the decision to reject a die often depends on a detail that the image conveys only faintly: a small difference in size, the position of the defect, or the area it covers.
Several cases illustrate this:

An undersized bump that is still within tolerance is acceptable; a near-identical bump just below the tolerance threshold is a reject.
A particle beside a bump can resemble a bump with a deformed shape. The particle may be harmless, whereas the deformed bump is a critical defect.
Foreign material is not a single category. It ranges from isolated particles to bump-attached material, bridging material, long strands, and large-area contamination, each with its own accept or reject criterion. The model must see all of these to learn the boundary correctly.
A barely visible die crack must be rejected, while a larger and more obvious piece of dirt is often acceptable. Here severity, not size, governs the decision.
For bump surface damage, the difficulty is not detecting damage but identifying the narrow range in which a small mark is acceptable and a slightly larger one is not.

Standard data preparation makes this worse.
Selecting 5,000 images at random from one million mostly returns the most common cases. The borderline examples that define the decision boundary are rare, so they are underrepresented, and a model trained on such a set never observes the boundary clearly.
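A rough calculation shows how few borderline images a random draw contains. The article does not state how rare borderline cases are, so the 0.2% share used below (2,000 in one million) is purely an assumption for illustration, as are the function names.

```python
import random


def expected_rare_in_sample(population, rare_count, sample_size):
    """Mean number of rare items in a simple random sample (hypergeometric mean)."""
    return sample_size * rare_count / population


def simulate_rare_in_sample(population, rare_count, sample_size, seed=0):
    """Draw one random sample and count how many rare items it contains."""
    rng = random.Random(seed)
    # Treat indices 0 .. rare_count - 1 as the rare, borderline images.
    picked = rng.sample(range(population), sample_size)
    return sum(index < rare_count for index in picked)


# Assumed share: 2,000 borderline images in a history of one million.
print(expected_rare_in_sample(1_000_000, 2_000, 5_000))  # 10.0
print(simulate_rare_in_sample(1_000_000, 2_000, 5_000))
```

Under that assumption, about ten of the 5,000 sampled images would be borderline, spread across every defect type.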
Inconsistent labelling adds a second limit. When the same defect is labelled differently across shifts, accuracy is capped at 91–93% before training begins.
The model is not underpowered. It is given too little of the right data, and too many conflicting labels.
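The ceiling imposed by conflicting labels can be illustrated with a short simulation. It scores an idealised classifier that always predicts the true class against labels in which a fraction of images were recorded under a different class. The 8% disagreement rate and the five-class setup are assumptions chosen for illustration; the article does not report the actual rate.

```python
import random


def measured_accuracy_with_noisy_labels(n_images, disagreement_rate,
                                        n_classes=5, seed=0):
    """Accuracy of a perfect classifier when scored against inconsistent labels."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_images):
        true_class = rng.randrange(n_classes)
        prediction = true_class  # idealised model: never wrong
        recorded = true_class
        if rng.random() < disagreement_rate:
            # A reviewer recorded this image under some other class.
            recorded = rng.choice([c for c in range(n_classes) if c != true_class])
        correct += prediction == recorded
    return correct / n_images


# Assumed disagreement rate of 8%; the result lands close to 0.92.
print(measured_accuracy_with_noisy_labels(20_000, 0.08))
```

Even with no modelling error at all, measured accuracy cannot exceed roughly one minus the disagreement rate, which is why architecture changes alone do not move the plateau.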

A different approach: preparing the data first

SixSense Defect Classification concentrates effort on the data before focussing on the model — specifically, on selecting the right images and ensuring their labels are correct. Two tools carry out most of this work, and the engineer operates them directly.

Selecting the images

The Unique Images Selector reviews the full image history and decides two things: how many images are required, and which specific ones to use. It deliberately includes the borderline accept-versus-reject examples that random sampling misses.
On one advanced packaging line, it reduced more than one million images to approximately 4,000 (less than half a percent of the original volume) without losing defect variety.
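The article does not describe how the Unique Images Selector works internally. As a generic stand-in for the idea of keeping variety while cutting volume, the sketch below uses greedy farthest-point selection over image feature vectors: each step keeps the image that is least like anything already kept. The feature vectors and the function name are hypothetical.

```python
import math


def farthest_point_selection(features, k):
    """Return indices of k mutually distant items (greedy k-centre heuristic)."""
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of items")

    selected = [0]  # arbitrary starting image
    chosen = {0}
    # Distance from every item to its nearest already-selected item.
    nearest = [math.dist(f, features[0]) for f in features]

    while len(selected) < k:
        # Pick the unselected item that is farthest from everything kept so far.
        idx = max((i for i in range(n) if i not in chosen),
                  key=nearest.__getitem__)
        selected.append(idx)
        chosen.add(idx)
        for i, f in enumerate(features):
            nearest[i] = min(nearest[i], math.dist(f, features[idx]))
    return selected


# Three near-duplicates plus two unusual images: the unusual ones are kept first.
features = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (10, 0)]
print(farthest_point_selection(features, 3))  # [0, 4, 3]
```

Unlike random sampling, a selection of this kind does not favour common cases, so rare images are not crowded out by near-duplicates.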
It also matches the number of samples to how variable each defect type is:

High-variability defects (foreign material, bump surface damage, RDL foreign material): 200–300 images each, to cover the range of accept and reject scenarios.
Moderate-variability defects (undersized bump, bump shape anomaly): 150–200 images each.
Consistent defects (missing bump): 50–100 images each.

Defects with greater variation receive more samples, because the model must see both sides of the boundary; simpler defects require fewer. The engineer does not estimate this distribution — the tool determines it.
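The per-defect ranges above can be written down as a small lookup table. The tier names and image ranges come from the list; the data structure and function are illustrative and only restate those ranges rather than reproduce how the tool derives them.

```python
# Image ranges per variability tier, as listed above (min, max).
IMAGES_PER_TIER = {
    "high": (200, 300),
    "moderate": (150, 200),
    "consistent": (50, 100),
}

# Defect types named in the article, with their tier.
DEFECT_TIERS = {
    "foreign material": "high",
    "bump surface damage": "high",
    "RDL foreign material": "high",
    "undersized bump": "moderate",
    "bump shape anomaly": "moderate",
    "missing bump": "consistent",
}


def image_budget(defect_tiers):
    """Return the per-defect (min, max) ranges and the overall (min, max) total."""
    budget = {defect: IMAGES_PER_TIER[tier] for defect, tier in defect_tiers.items()}
    total_min = sum(low for low, _ in budget.values())
    total_max = sum(high for _, high in budget.values())
    return budget, (total_min, total_max)


budget, totals = image_budget(DEFECT_TIERS)
print(totals)  # (950, 1400)
```

For the six defect types named here, the ranges add up to between 950 and 1,400 images; a line with more defect types would need proportionally more.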

Correcting the labels

Once the 4,000 images are selected, the next question is whether their labels are reliable. The clustering tool answers this visually.
It places every image on a two-dimensional map: similar images sit close together, and each defect type is given a distinct colour. Consistent labelling appears as well-separated, single-colour groups. Where two colours overlap, the labelling is inconsistent.
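The same overlap can be checked numerically. The sketch below is a simple analogue of the visual map, not the tool's actual method: given two-dimensional coordinates and a label per image, it flags any image whose nearest neighbours mostly carry a different label. The coordinates, the choice of k = 3, and the function name are assumptions.

```python
import math
from collections import Counter


def flag_inconsistent_labels(points, labels, k=3):
    """Return indices of images whose k nearest neighbours mostly disagree with them."""
    flagged = []
    for i, point in enumerate(points):
        # Brute-force neighbour search; adequate for a sketch, slow for large sets.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        if not neighbours:
            continue
        majority_label, votes = Counter(labels[j] for j in neighbours).most_common(1)[0]
        # Flag only when a strict majority of neighbours carries another label.
        if majority_label != labels[i] and votes * 2 > len(neighbours):
            flagged.append(i)
    return flagged


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["contamination", "contamination", "contamination", "passivation defect",
          "passivation defect", "passivation defect", "passivation defect",
          "passivation defect"]
print(flag_inconsistent_labels(points, labels))  # [3]
```

Only the flagged images need a second look, which is what keeps the review to a small fraction of the dataset.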

[Figure: Embedding space, clusters and confusions]

On this line, two overlaps were apparent at once:

approximately 320 images labelled inconsistently between passivation defect and contamination; and
approximately 210 images where foreign material and bump shape anomaly were used interchangeably, depending on the reviewer.

Together, these accounted for roughly 680 incorrect labels out of 4,500 — about 15% of the set, which is enough to reduce accuracy on its own.
Rather than reviewing the entire dataset, the team examined only the flagged images, agreed on clear accept and reject criteria for the ambiguous cases, and corrected the labels in a single pass. The work took about five hours, against the four to five days needed to review images individually.

Before and after on the same line

The following comparison uses the same line and the same starting images.

The earlier experiment fits this pattern. The 200-image model did not succeed because 200 is an optimal number; it succeeded because those images were well chosen, correctly labelled, and representative of both sides of each boundary.
The 20,000-image model performed worse because volume without curation conceals the cases that matter beneath many that do not.

The approach generalises across the fab

Bumping and RDL is one example. However, a large historical image volume, inconsistent labels, and frequent borderline cases describe nearly every inspection step in the fab.
The same two tools already operate at scale:

more than 100 inspection steps and 200 defect types;
across black-and-white and colour optical images, SEM, X-ray, and wafer maps; and
on more than 20 tool types, from vendors including KLA, Applied Materials, Onto (NSX/Dragonfly), and Camtek.

The workflow that resolved one step extends to the rest.

Conclusion

Most of the work described as "training a model" for defect classification is, in practice, data work: selecting images, correcting labels, and adding new data whenever production presents an unfamiliar case. Two tools, one to select the images and one to verify the labels, handle the majority of it.
For this customer, more than one million images and four to six weeks of work were reduced to 4,500 images and two days. The resulting model was more accurate and required substantially less retraining. The decisive factor was not a more sophisticated model, but the quality of the data provided to it.