
SemanticE2E: Solving the Fragility Problem in Large-Scale B2B Web E2E Testing

Preface

This article is for teams running large-scale E2E test suites on complex B2B web products — where a small QA team needs to maintain meaningful coverage across dozens of business modules, and where the real cost isn't writing tests, it's keeping them alive.

If your team has ever spent a sprint fixing broken tests that had nothing to do with actual bugs, or watched an entire test suite become obsolete after a framework migration, the problem this article addresses will be familiar.

The root cause is straightforward: standard E2E tooling couples test logic to implementation details — CSS selectors, XPath, DOM hierarchies, class names — that were never meant to be stable. SemanticE2E is our attempt to fix this at the layer where the problem lives, so that a small team can own a large test suite with confidence, today and as the product grows.

All technical concepts, architectural patterns, and figures cited in this article are generalized for public sharing. No proprietary systems, internal tooling names, or confidential business data are disclosed.

The Problem We Were Trying to Solve

At Meituan's Hotel & Travel business unit, we operate a large-scale B2B web platform — a merchant system used by hundreds of thousands of hotel partners to manage their listings, pricing, inventory, and promotions. The frontend is a complex, data-intensive PC web application with dozens of business modules and hundreds of interactive pages.

Like many teams at this scale, we relied on E2E automated testing to gate releases. Our test suite ran on an internal Cypress-based CI platform. Over 2024 and into early 2025, as the merchant system underwent a major consolidation, we invested heavily in expanding our E2E coverage to more than 500 test cases.

But the more tests we wrote, the more we felt the pain of a fundamental problem: test fragility.

The Numbers Tell the Story

After analyzing our test execution data, we found that roughly 20% of test runs were being skipped at the CI gate — meaning engineers were bypassing the automated checks entirely. Of those skips, about 40% were caused by unstable or unmaintained test cases, not by actual product bugs. We were spending enormous effort maintaining tests that were supposed to save us effort.

Two specific failure patterns kept recurring:

Pattern 1 — Routine iteration breaks tests at scale. A product manager changed the class name of a button inside a modal dialog. The business logic was untouched. But because our test cases used CSS selectors like:

div.vxe-table--body-wrapper.body--wrapper > tr:nth-child(2)

div.inventory-box > div.display-count.has-goods-event > span.display-hover-text

that single class name change caused 30+ test cases to fail simultaneously. Each one had to be manually re-recorded.

Pattern 2 — Framework migrations render entire test suites obsolete. When we migrated a major module from Vue to React, the DOM structure changed so fundamentally that our existing E2E tests were simply useless. We couldn't use them to verify that the rewritten pages still worked correctly — the very scenario where automated regression testing is most valuable.


Why Existing Solutions Didn't Fit

Before building our own solution, we evaluated three directions:

Coordinate-based tests — Record element positions (x, y) instead of DOM selectors. This decouples tests from DOM structure, but the resulting test code is completely unreadable, impossible to debug, and breaks whenever the UI layout changes even slightly.

data-test-id attributes — A well-known industry practice: embed stable test identifiers directly in the source code. This works well in theory, but in our context it required every business team to adopt a new discipline of maintaining test IDs in their code, and our shared component library didn't natively support it. The coordination cost across dozens of teams was prohibitive.

AI Vision / Natural Language agents — Tools like visual regression testing or natural-language-driven browser agents. These are exciting, but as of our evaluation period, they were unreliable on complex, dynamic PC web applications with dense data tables, multi-level modals, and context-sensitive UI states. The failure rate was too high for production use.

None of these options gave us what we needed: stability without requiring business code changes, and maintainability without sacrificing readability.


The Insight: A Semantic Tag Middleware Layer

The breakthrough idea came from an unexpected source: Vimium, a Chrome extension that lets you navigate web pages using only the keyboard. When you press F in Vimium, it scans the page, identifies all interactive elements, and overlays a short label on each one. You type the label to click the element.

What struck us about Vimium was this: you never interact with DOM elements directly. You interact with abstract tags. The labels are generated at runtime, based on what's currently visible and interactive on the page.

We asked: what if we applied this same principle to E2E testing — but made the labels semantic and stable, powered by an LLM?

The core idea of SemanticE2E is:

Instead of test cases selecting elements via CSS selectors or XPath, they select elements via semantic labels — human-readable identifiers like "btn-modify-inventory" or "input-check-in-date" — that are generated at runtime by an AI model and cached for reuse.

A test case written with SemanticE2E looks like this:

cy.initSemanticSdk({ projectId: "hotel-product", schemaUid: "inventory-mgmt", env: "prod" });

cy.getElementBySemanticId("btn-expand-all-room-types").click();

cy.getElementBySemanticId("btn-modify-inventory-by-product").click();

cy.getElementBySemanticId("checkbox-advanced-settings").click();

cy.getElementBySemanticId("input-select-date-range").click();

cy.getElementBySemanticId("option-one-month").click();

cy.getElementBySemanticId("checkbox-differentiate-by-weekday").click();

cy.getElementBySemanticId("radio-open-room").click();

cy.getElementBySemanticId("input-room-count").clear().type("199");

cy.getElementBySemanticId("btn-confirm").click();

Compare this to the equivalent traditional Cypress test:

cy.get("img.room-icon").first().click();

cy.get("td.col_6 > div.c--ellipsis > div.invalid > div.content > div.price-box")

.first().click();

cy.wait(7000);

cy.get('input[type="text"][placeholder="Please enter."]').click();

cy.get('input[type="text"][placeholder="Please enter."]').clear().type("200");

cy.get("div.new-single-price-change-modal > div.lz-modal-center > div.lz-modal > div.lz-modal-footer")

.click();

The semantic version reads like a specification. The traditional version is a fragile implementation detail that will break the moment any of those class names change.

The Contrast in Plain Terms


How It Works: Architecture Deep Dive

SemanticE2E consists of three main components: a JavaScript SDK injected into the test runner, a backend labeling service, and a configuration management platform. A Chrome DevTools extension handles the authoring side.
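To make the SDK side more concrete, here is a minimal sketch of how a command like getElementBySemanticId could be registered in Cypress. It assumes the SDK has already stamped each label onto its element as a data-semantic-id attribute at runtime. That attribute name, the id pattern, and the semanticSelector helper are illustrative assumptions, not the actual SDK internals.

```javascript
// Hypothetical attribute the SDK is assumed to stamp onto labeled elements at runtime.
const SEMANTIC_ATTR = "data-semantic-id";

// Semantic ids in this article look like "btn-modify-inventory":
// lowercase words or digits joined by hyphens.
const SEMANTIC_ID_PATTERN = /^[a-z0-9]+(-[a-z0-9]+)+$/;

// Turn a semantic id into an attribute selector, rejecting malformed ids early.
function semanticSelector(semanticId) {
  if (!SEMANTIC_ID_PATTERN.test(semanticId)) {
    throw new Error(`Malformed semantic id: ${semanticId}`);
  }
  return `[${SEMANTIC_ATTR}="${semanticId}"]`;
}

// Register the command only when running inside Cypress,
// so the helper above stays usable (and testable) elsewhere.
if (typeof Cypress !== "undefined") {
  Cypress.Commands.add("getElementBySemanticId", (semanticId) =>
    cy.get(semanticSelector(semanticId))
  );
}
```

Keeping the selector logic in a plain function means it can be checked without a browser; the Cypress.Commands.add registration is only a thin wrapper around it.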


The Labeling Pipeline

When cy.getElementBySemanticId("btn-modify-inventory") is called during test execution, the following happens:
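The exact steps are not spelled out in the text, so the following is only a plausible sketch built from the pieces the article does mention: a whitelist of interactive elements, an LLM-backed labeler, and lookup by label. It works over plain objects rather than a real DOM, the stubLabeler function is a hypothetical stand-in for the labeling service, and caching is left out to keep the flow readable.

```javascript
// Hypothetical stand-in for the labeling service.
// A real one would call an LLM with the scenario's labeling rules.
function stubLabeler(element) {
  if (element.tag === "button" && element.text === "Confirm") return "btn-confirm";
  if (element.tag === "input" && element.placeholder === "Room count") return "input-room-count";
  return null; // no confident label for this element
}

// Steps 1-3: keep whitelisted elements, label them, and index them by semantic id.
function labelPage(elements, whitelistTags, labeler) {
  const index = new Map();
  for (const element of elements) {
    if (!whitelistTags.includes(element.tag)) continue; // step 1: whitelist filter
    const label = labeler(element);                      // step 2: labeling
    if (label !== null) index.set(label, element);       // step 3: index by label
  }
  return index;
}

// Step 4: resolve the id the test asked for, failing loudly when it is missing.
function resolveSemanticId(index, semanticId) {
  const element = index.get(semanticId);
  if (!element) throw new Error(`No element labeled ${semanticId}`);
  return element;
}

// Usage: the price cell is ignored because "div" is not whitelisted.
const page = [
  { tag: "div", text: "199.99" },
  { tag: "button", text: "Confirm" },
  { tag: "input", placeholder: "Room count" },
];
const index = labelPage(page, ["button", "input"], stubLabeler);
const confirmButton = resolveSemanticId(index, "btn-confirm");
```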


The Configuration System

A key design decision was to make the labeling rules configurable per business domain, not hardcoded. Configuration is organized in a three-level hierarchy: Project → Business Module → Feature Scenario.

Each Feature Scenario configuration contains:

Interactive element whitelist: Specifies which elements on the page should be labeled. Without this, the LLM would attempt to label every element on a dense data table — slow and noisy.

Always-visible element whitelist: For elements that are technically hidden (off-screen, behind overlays) but need to be interacted with in tests.

Semantic labeling rules: Business-specific prompts that guide the LLM. For example: "An element with class price-box whose text matches a price pattern like 199.99 should be labeled btn-modify-price." This is essential for handling dynamic content where the element's displayed text changes with live business data.
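As an illustration of the three parts listed above, a Feature Scenario configuration might be shaped roughly like the object below. Every field name and the module and scenario values are hypothetical. The applyPriceBoxRule function is a deterministic stand-in for the example rule, included only to show what the prompt asks the LLM to decide; in the system described here that decision is made by the model, not by a regular expression.

```javascript
// Hypothetical shape for one Feature Scenario's configuration; all field names are illustrative.
const scenarioConfig = {
  project: "hotel-product",
  module: "pricing",
  scenario: "modify-price",
  interactiveWhitelist: ["button", ".price-box"],
  alwaysVisibleWhitelist: [".batch-toolbar"],
  labelingRules: [
    "An element with class price-box whose text matches a price pattern like 199.99 " +
      "should be labeled btn-modify-price.",
  ],
};

// Assumed price format for illustration: digits, a dot, exactly two decimals.
const PRICE_PATTERN = /^\d+\.\d{2}$/;

// Deterministic stand-in for the rule above.
function applyPriceBoxRule(element) {
  const hasClass = element.classes.includes("price-box");
  const looksLikePrice = PRICE_PATTERN.test(element.text.trim());
  return hasClass && looksLikePrice ? "btn-modify-price" : null;
}
```

Because the rule keys on the shape of the text rather than its value, the label stays the same when the displayed price moves from 199.99 to 209.00.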

The minimum responsibility principle governs how Feature Scenarios are divided: each scenario should be owned by engineers who understand all the test cases within it. Merging unrelated business flows into one scenario creates a maintenance risk — if the labeling rules break, the owner may not know how to fix tests for unfamiliar business logic.

The Chrome Extension Recorder

To make test authoring practical, we built a Chrome DevTools extension that integrates directly into the browser's developer tools panel. The recorder:

Connects to the configuration platform for real-time label preview

Records click and input events, generating semantic test code automatically

Supports live debugging: after recording, you can replay the test immediately in the browser to verify correctness

Allows switching between staging and production configurations during recording
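The code-generation step of the recorder can be pictured as a small mapping from recorded events to lines of test code. The sketch below assumes a hypothetical event shape ({ type, semanticId, value }) and covers only the click and input events mentioned above; it is not the extension's actual implementation.

```javascript
// Quote a value as a double-quoted JavaScript string literal.
function quote(value) {
  return JSON.stringify(String(value));
}

// Convert one recorded event into one line of semantic test code.
function eventToLine(event) {
  const target = `cy.getElementBySemanticId(${quote(event.semanticId)})`;
  switch (event.type) {
    case "click":
      return `${target}.click();`;
    case "input":
      return `${target}.clear().type(${quote(event.value)});`;
    default:
      throw new Error(`Unsupported event type: ${event.type}`);
  }
}

// Convert a whole recording into a test body, one line per event.
function generateTestCode(events) {
  return events.map(eventToLine).join("\n");
}

// Usage with a hypothetical two-step recording.
const recorded = [
  { type: "click", semanticId: "radio-open-room" },
  { type: "input", semanticId: "input-room-count", value: "199" },
];
const generated = generateTestCode(recorded);
```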

The test case creation workflow looks like this:

[Figure: test case creation workflow]

Results and Benchmarks

Stability

We measured LLM labeling stability on complex pages in the hotel product domain. With well-maintained configuration rules, overall labeling stability exceeded 99.8% — meaning that across thousands of label generation calls, fewer than 0.2% produced an incorrect or inconsistent label.

Performance

Execution speed is the main trade-off. Compared to traditional Cypress tests:


Cache hit (DOM structure unchanged since last run): milliseconds to a few seconds per labeling call, depending on page element count.

Cache miss (first run, or after DOM changes): 10–40 seconds per labeling call.

End-to-end, semantic test cases in the hotel product domain run in 30–75 seconds, compared to 15–50 seconds for traditional tests. This is a meaningful slowdown, but in our experience it's acceptable for a test suite that runs as a pre-release gate rather than in a tight development loop.
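The gap between a cache hit and a cache miss comes from skipping the LLM call whenever the page structure is unchanged. A minimal sketch of that idea follows, assuming the fingerprint is built from each element's tag and class list (an assumption; the article does not say which features are hashed) and using a simple FNV-1a hash for brevity.

```javascript
// FNV-1a hash over a string, returned as 8 hex digits. Fine for a sketch, not for production.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Fingerprint only structural features, so live data (prices, counts) does not bust the cache.
function structureFingerprint(elements) {
  const features = elements.map((el) => `${el.tag}|${[...el.classes].sort().join(".")}`);
  return fnv1a(features.join("\n"));
}

// Wrap an expensive page labeler so it only runs when the fingerprint is new.
function createCachedLabeler(labelPage) {
  const cache = new Map();
  let misses = 0;
  return {
    label(elements) {
      const key = structureFingerprint(elements);
      if (!cache.has(key)) {
        misses += 1; // cache miss: this is where the slow LLM call would happen
        cache.set(key, labelPage(elements));
      }
      return cache.get(key);
    },
    get misses() {
      return misses;
    },
  };
}
```

With this shape, a price changing from 199.99 to 209.00 is a cache hit, while a renamed class produces a new fingerprint and triggers one fresh labeling call.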

Business Impact Targets

For projects that have adopted SemanticE2E, our targets are:

CI skip rate reduced by 40% in relative terms (from ~20% to ~12% of runs)

Test maintenance cost reduced by 80%

100% test case reuse rate during code-layer refactors (framework migrations, component library upgrades)


Comparison with Industry Approaches

| Dimension | SemanticE2E | Traditional Cypress (cy.get) | Playwright getByRole | Visual / Screenshot Testing | Natural Language Agents |
| --- | --- | --- | --- | --- | --- |
| Core idea | Business semantics → dynamic runtime test-id | DOM selector driven | ARIA role + accessible name | UI visual diffing | Natural language → AI infers DOM |
| Stability | High: semantics don't change with UI | Low: class names and DOM structure break tests | Medium-High: stable until ARIA attributes or component library changes | Low: style/position changes cause false positives | Medium: AI inference is non-deterministic |
| Maintenance cost | Low: tests survive UI changes | High: selectors need updating | Low-Medium: readable, but breaks on component library upgrades or ARIA changes | | Medium: descriptions must be precise |
| Business code changes required | None: runtime injection | None | None, but relies on components having correct ARIA attributes | | None |
| Auditability | High: rules are version-controlled | Medium: technically auditable | High: deterministic and traceable | Low: hard to trace | Low: black-box AI |
| Cross-team readability | High: business language | Low: technical selectors | Medium: role-based, but not always business-meaningful | | High: natural language |
| Handles dynamic content | Yes: LLM rules handle context-dependent labels | | Partial: works if accessible name is stable | | Partial |
| Best fit | Large projects with frequent UI changes or framework migrations | Stable, rarely-refactored codebases | Projects with strong accessibility discipline and stable component libraries | UI consistency verification | Recording lightweight, linear flows |

It's worth dwelling on Playwright's getByRole, since it's the closest industry analogue to what SemanticE2E is trying to achieve. getByRole selects elements by their ARIA role and accessible name — for example, getByRole('button', { name: 'Confirm' }) — which is far more readable and resilient than a raw CSS selector. We considered this direction seriously.

The limitation we ran into is that getByRole is still a static, authoring-time declaration. It works well when your component library consistently exposes correct ARIA roles and when button labels are stable strings. In our case, neither condition held reliably: our internal component library had inconsistent ARIA support, many interactive elements had labels that were dynamically computed from live business data (prices, room counts, dates), and during framework migrations the accessible name of an element could change even when its visual appearance and business function stayed the same.

SemanticE2E's runtime labeling approach handles these cases by letting the LLM reason about what an element does in the context of the page, rather than relying on a pre-declared accessible name. The trade-off is execution speed and the need to maintain a labeling knowledge base — costs that getByRole doesn't have. For projects with strong accessibility discipline and a stable component library, getByRole is likely the better choice. SemanticE2E is designed for the messier reality of large, long-lived B2B applications.

Cypress also released cy.prompt (natural language test authoring) in October 2025. It's an exciting development, but it operates differently from SemanticE2E: cy.prompt uses AI to infer DOM selectors at authoring time, while SemanticE2E generates semantic labels at execution time. The runtime approach means SemanticE2E's stability guarantees hold even when the page changes after the test was written.


Organizational Design for Long-Term Test Maintenance

Most discussions of E2E testing focus on the tooling. But in practice, the harder problem is organizational: who owns the tests, who fixes them when they break, and how do you scale coverage without scaling the QA headcount proportionally?

This is especially acute in large B2B products. Our merchant system spans dozens of business modules, each owned by a different engineering team. A central QA team cannot realistically author and maintain deep test coverage across all of them — the domain knowledge required is simply too distributed.

SemanticE2E's configuration model was designed with this constraint in mind, and it enables an ownership structure that we've found to work well in practice.

The Division of Responsibility

The model separates two distinct concerns:

QA's responsibility: define the critical user journeys. The QA team owns the high-level test strategy. They identify the core business flows that must be covered — the paths through the product that, if broken, would cause real merchant harm. They document these flows in plain language and communicate them to the engineering teams. They do not need to know how to implement the tests.

Engineering's responsibility: implement and maintain the tests for flows they own. Each Feature Scenario in the configuration platform is owned by the engineering team responsible for that business domain. They record the test cases, maintain the labeling rules, and fix failures when they occur. Because they built the feature, they have the domain knowledge to do this correctly and efficiently — something a central QA team could never replicate at scale.

This mirrors how code ownership works in most engineering organizations. Just as engineers are expected to write unit tests for the code they ship, they are expected to maintain E2E coverage for the user flows they own.

Isolation by Design

A critical enabler of this model is that Feature Scenario configurations are fully isolated from each other. A change to the labeling rules for the "modify pricing" flow has no effect on the "manage inventory" flow, even if both flows exist on the same page. Each scenario's configuration is versioned and deployed independently, with its own approval process.

This isolation has two important consequences. First, it contains the blast radius of configuration changes — a mistake in one team's rules cannot break another team's tests. Second, it makes ownership unambiguous. When a test fails, the responsible team is immediately clear from the scenario it belongs to.
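The isolation property can be sketched as a registry in which every scenario keeps its own version history, so publishing for one scenario cannot touch another. This is a hypothetical in-memory model with invented names (createConfigRegistry, publish, latest); the approval process and the real storage layer are not modeled.

```javascript
// Hypothetical in-memory registry: each Feature Scenario keeps its own version history.
function createConfigRegistry() {
  const histories = new Map(); // scenarioId -> array of frozen rule sets, oldest first

  return {
    // Publish a new version for one scenario; other scenarios are never touched.
    publish(scenarioId, rules) {
      const history = histories.get(scenarioId) ?? [];
      history.push(Object.freeze([...rules]));
      histories.set(scenarioId, history);
      return history.length; // 1-based version number
    },
    // Read the latest rules and version for a scenario.
    latest(scenarioId) {
      const history = histories.get(scenarioId);
      if (!history) throw new Error(`Unknown scenario: ${scenarioId}`);
      return { version: history.length, rules: history[history.length - 1] };
    },
  };
}
```

Because lookups are keyed by scenario, a failing test points directly at one scenario's history, which is what makes the owning team unambiguous.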


Why Readability Is Vital

This entire model depends on QA being able to review test cases written by engineers they don't sit with, covering business logic they may not know deeply. In a traditional Cypress test suite, this review is effectively impossible without running the tests and watching the execution video — a slow, asynchronous process that doesn't scale.

Semantic test cases change this equation entirely. Consider the difference:

Traditional test case (what QA has to review):

cy.get("td.col_6 > div.c--ellipsis > div.invalid > div.content > div.price-box")

.first().click();

cy.wait(7000);

cy.get('input[type="text"][placeholder="Please enter."]').clear().type("200");

cy.get("div.new-single-price-change-modal > div.lz-modal-footer").click();

A QA engineer reading this cannot tell what business action is being performed without running it. Is this modifying a price? Changing inventory? Submitting a form? The intent is completely opaque.

Semantic test case (what QA has to review):

cy.getElementBySemanticId("btn-open-calendar-pricing").click();

cy.getElementBySemanticId("btn-modify-price-for-room-type-1").click();

cy.getElementBySemanticId("input-new-price").clear().type("200");

cy.getElementBySemanticId("btn-confirm-price-change").click();

A QA engineer can read this in seconds and immediately verify: does this test cover the right steps? Does it match the user journey I defined? Is anything missing? The review becomes a reading task, not an execution task.

In our experience, this shift reduces the per-test-case review cost by roughly an order of magnitude. It's what makes it realistic for a small QA team to maintain oversight of a large, distributed test suite without becoming a bottleneck.

Scaling the Model

The practical result of this structure is that test coverage scales with engineering headcount, not QA headcount. As new features are built and new teams are formed, each team takes on ownership of their own scenarios. The QA team's role remains constant: define journeys, review cases, and monitor overall coverage health.

In our merchant backend, a QA team of fewer than five engineers maintains oversight of a test suite spanning more than ten business modules, with test cases authored and maintained by the engineering teams themselves. The configuration isolation ensures that this distributed ownership doesn't create coordination overhead — teams work independently, and the blast radius of any mistake is contained to the scenarios they own.

This is, we think, the right long-term model for E2E testing in large B2B products: QA as architects of quality strategy, engineering as owners of quality execution.

Lessons Learned

1. The minimum responsibility principle for configuration ownership is non-negotiable. Early in the project, we experimented with broader Feature Scenario groupings. When labeling rules broke, engineers couldn't fix tests for business flows they didn't own. Strict ownership boundaries are essential at scale.

2. Cache design is the key to acceptable performance. The LLM is only called when the page's DOM structure actually changes. For stable pages, the cache hit rate is very high and the overhead is negligible. Investing in a robust feature-hashing and caching layer paid off significantly.

3. Semantic tests are dramatically easier to review. One unexpected benefit: QA engineers can now review test cases by reading the source code, without needing to watch execution videos. The semantic labels make the test's intent immediately clear, meaningfully reducing the time cost of test acceptance.

4. The configuration knowledge base is a living artifact. The LLM's labeling quality is only as good as the business rules you provide. Teams that invest in maintaining detailed, accurate labeling rules see much better stability. This is a new kind of maintenance burden, but it's far lighter than maintaining CSS selectors — and it's shared across all tests in a scenario rather than duplicated in each individual test case.

5. Zero business code changes is a genuine competitive advantage. The data-test-id approach is theoretically sound, but in a large organization with many teams and a shared component library, getting everyone to adopt a new practice is a multi-quarter effort. SemanticE2E's runtime injection model meant we could roll it out to existing test suites without any coordination with product engineering teams.


What's Next

The semantic label infrastructure opens up possibilities beyond just stabilizing existing tests:

AI-assisted test generation: Because semantic labels are stable, human-readable, and exportable, it becomes feasible to use an LLM to generate new test cases from a description of a user flow — without manual recording.

Cross-framework test portability: When a module migrates from one framework to another, semantic tests can potentially be reused with only configuration adjustments, not full re-recording.

Richer interaction support: The current implementation supports click and input events. We're extending the recorder and SDK to support hover events and semantic assertions (e.g., verify that element text-current-price displays a value matching a price pattern).
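A semantic assertion of the kind described in the last item might look something like the sketch below. The assertSemanticText command name and the price format are assumptions made for illustration, since this feature is described as still in progress; the command also assumes getElementBySemanticId yields the matched element, as it does in the examples earlier in this article.

```javascript
// Assumed price format for illustration: digits, a dot, exactly two decimals.
const PRICE_PATTERN = /^\d+\.\d{2}$/;

// Pure check behind the assertion: does the displayed text look like a price?
function displaysPrice(text) {
  return PRICE_PATTERN.test(text.trim());
}

// Inside Cypress, the same check could back a hypothetical semantic assertion command.
if (typeof Cypress !== "undefined") {
  Cypress.Commands.add("assertSemanticText", (semanticId, predicate) => {
    cy.getElementBySemanticId(semanticId)
      .invoke("text")
      .should((text) => {
        expect(predicate(text)).to.equal(true);
      });
  });
  // Example call in a test: cy.assertSemanticText("text-current-price", displaysPrice);
}
```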


Conclusion

SemanticE2E is our answer to a problem that we believe is common across large-scale B2B web products: E2E tests that are too expensive to maintain and too fragile to trust. By introducing a semantic label middleware layer — powered by an LLM, cached aggressively, and configured per business domain — we've decoupled test cases from the implementation details of the pages they test.

The result is tests that read like business requirements, survive UI refactors, and can be understood and maintained by the whole team — not just the engineer who originally recorded them.

The core insight is simple: don't test against how a page is built. Test against what it means.


About the Author

Louis K is a frontend engineer at Meituan's Hotel & Travel business unit, where he focuses on test infrastructure, developer tooling, and the long-term maintainability of large-scale B2B web systems.

SemanticE2E is a project Louis is genuinely proud of — not because the idea is complicated, but because it isn't. The core insight (treat test element selection as a labeling problem, not a selector problem) took a while to arrive at, but once it did, the path forward felt clear. Watching the skip rate drop and hearing engineers say they actually enjoy reading the test cases now is the kind of feedback that makes the work feel worthwhile.

That said, this project was never a solo effort. Several excellent engineers worked closely together to get it done. The Chrome recorder extension went through multiple rounds of iteration based on feedback from the engineers who used it daily; their patience with early rough edges made the final tool significantly better. And none of it would have reached production without the QA team's willingness to rethink how they collaborate with engineering, and to trust a new workflow before it had a long track record.

Good infrastructure is a team sport. The best outcome of this project isn't the tool itself — it's that the team now has a shared vocabulary for talking about test quality, a clear ownership model that scales, and a little more confidence that the tests they write today will still be working six months from now.

Louis K — Frontend Quality Engineering, Meituan Hotel & Travel

Appendix

A demo of the SemanticE2E recorder

Record cases

[Video: screen recording 2025-11-28 17.29.54.mov]

Replay cases

[Video: screen recording 2025-11-28 18.00.32.mov]
