
The Million Claim Challenge

One million synthetic claims. Pre-computed expected outcomes. A single question: does your adjudication engine get them all right?

What Is It

The Million Claim Challenge (MCC) is an open benchmarking tool that generates a deterministic corpus of 1,000,000 synthetic healthcare claims — each with a pre-computed expected adjudication result. Feed them into any claims engine, compare the output to the answer key, and get a score.

Think of it as a unit test suite for your entire adjudication pipeline, except the test suite is a million claims long and covers professional, institutional, dental, and every edge case we could think of at 2 a.m. on a Wednesday.

Open by design

The generator, the corpus profile, and the scoring methodology are all open source. We built MCC to benchmark Cloud Health Office against itself — and then decided the industry could use a shared yardstick. Use it on our platform, use it on someone else's, use it on that mainframe in the basement that nobody wants to touch. We don't mind.

Why We Built It

Healthcare claims adjudication is one of those domains where “works on my machine” can mean a payer eats a six-figure overpayment or a member gets an incorrect denial on a surgery they already had. The stakes are too high for vibes-based QA.

We wanted a benchmark that was:

  • Comprehensive — not just the happy path, but COB birthday rule disputes, retro-eligibility swaps, newborn auto-adjudication, and Medicaid spend-down edge cases
  • Deterministic — same seed, same corpus, every time, on every machine
  • Fast — generates the full million on an M-series Mac in under five minutes
  • Honest — we run it on CHO too, and we publish the results

If you're evaluating payer platforms, MCC gives you something concrete to compare against. If you're building one, it tells you where your blind spots are before a real claim finds them for you.

Corpus Anatomy

The default corpus profile generates exactly 1,000,000 claims with a stratified distribution designed to mirror real-world payer volume:

| Claim Type | Count | Share | What It Tests |
|---|---|---|---|
| Professional | 600,000 | 60% | CMS-1500: E/M visits, modifier stacking, global surgery, bilateral, telehealth, lab/path |
| Institutional | 250,000 | 25% | UB-04: inpatient DRG, outpatient per diem, ED, observation, stop-loss outliers, SNF |
| Dental | 100,000 | 10% | ADA: preventive, restorative, endodontics, periodontics, orthodontics (lifetime max), oral surgery |
| Edge Cases | 50,000 | 5% | The hard stuff: 29 named scenarios across 7 categories (see below) |

The 60/25/10/5 split isn't arbitrary. It roughly tracks the volume distribution of a mid-size commercial payer with a dental rider and a Medicaid line of business. We tuned it so the benchmark hurts in the right places.
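For smaller runs the split has to be applied to an arbitrary claim count. The sketch below shows one way that could work; the `Split` helper and its largest-remainder rounding are assumptions for illustration, not the generator's confirmed behavior.

```csharp
using System;
using System.Linq;

// Shares from the default corpus profile (60/25/10/5).
var shares = new (string Type, int Percent)[]
{
    ("Professional", 60), ("Institutional", 25), ("Dental", 10), ("EdgeCase", 5),
};

// Hypothetical helper: splits `total` by share with largest-remainder rounding,
// so the per-type counts always add up to `total`. The rounding rule is an assumption.
(string Type, int Count)[] Split(int total)
{
    var rows = shares
        .Select(s => (Type: s.Type,
                      Count: (int)((long)total * s.Percent / 100),
                      Remainder: (int)((long)total * s.Percent % 100)))
        .ToArray();
    int leftover = total - rows.Sum(r => r.Count);
    // Hand the leftover claims to the strata with the largest remainders.
    var bump = rows.OrderByDescending(r => r.Remainder).Take(leftover)
                   .Select(r => r.Type).ToHashSet();
    return rows
        .Select(r => (Type: r.Type, Count: bump.Contains(r.Type) ? r.Count + 1 : r.Count))
        .ToArray();
}

foreach (var (type, count) in Split(1_000_000))
    Console.WriteLine($"{type,-13} {count,9}");
```

At one million claims the split is exact (600,000 / 250,000 / 100,000 / 50,000); the rounding rule only matters for counts that do not divide evenly.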

Professional Claims — 600K

Seven sub-types, weighted by frequency in real-world claims data:

| Sub-Type | Share | Key Challenge |
|---|---|---|
| Office Visits (E/M 99211–99215) | 40% | Basic fee schedule lookup and cost-sharing |
| Multi-Line Procedures | 20% | Modifier stacking (25, 59, 76, 77, RT/LT) |
| Global Surgery Packages | 10% | 10/90-day global period follow-up bundling |
| Bilateral (Modifier 50) | 5% | 150% payment rule |
| Assistant Surgeon (80/82) | 5% | Reduced fee schedule application |
| Telemedicine (POS 02) | 10% | Modifier 95, place-of-service override |
| Lab/Pathology (80000–89999) | 10% | Panel bundling, fee-for-service pricing |

Institutional Claims — 250K

| Sub-Type | Share | Key Challenge |
|---|---|---|
| Inpatient with DRG | 40% | DRG grouper assignment, per-case pricing |
| Outpatient Per Diem | 25% | Per diem rate application |
| Emergency Department | 15% | ED copay rules, prudent layperson |
| Observation Stays | 10% | Outpatient status, observation-to-inpatient conversion |
| Stop-Loss / Outlier | 5% | Outlier thresholds, cost-to-charge ratios |
| Skilled Nursing Facility | 5% | SNF day limits, coinsurance progression |

Dental Claims — 100K

| Category | Share | Key Challenge |
|---|---|---|
| Preventive (D0100–D1999) | 40% | 100% coverage, frequency limitations |
| Restorative (D2000–D2999) | 25% | Basic/major classification, downcoding |
| Endodontics (D3000–D3999) | 10% | Tooth-specific history |
| Periodontics (D4000–D4999) | 10% | Quadrant-based billing |
| Orthodontics (D8000–D8999) | 10% | Lifetime maximum tracking |
| Oral Surgery (D7000–D7999) | 5% | Medical/dental crossover |

The Edge Cases

Fifty thousand claims. Twenty-nine named scenarios. Seven categories of pain. These are the claims that make senior adjudication analysts reach for their coffee and say “oh, this one.”

| Category | Count | Scenarios |
|---|---|---|
| Coordination of Benefits | 12,000 | Primary, secondary, and tertiary payer; birthday rule; gender rule; Medicare Secondary Payer |
| Retro-Eligibility | 8,000 | Retroactive add, retroactive termination, retroactive coverage change |
| Newborn | 6,000 | Auto-adjudication under mother's coverage, mother-claim linkage, first 30 days |
| Prior Authorization | 8,000 | Auth on file, no auth, expired auth, wrong provider, wrong procedure |
| Subrogation | 4,000 | Accident-related, workers' comp, third-party liability |
| Behavioral Health | 6,000 | Carve-out, carve-in, mental health parity quantitative limit check |
| Medicaid Subprogram | 6,000 | TANF, SSI, CHIP, dual eligible, spend-down |

Why 5% edge cases?

In production, edge cases make up roughly 3–5% of volume, yet they account for 40–60% of claim disputes, member complaints, and late-night pages. MCC deliberately sits at the top of that range. If your engine aces the 600K professional claims but fumbles a COB birthday rule, you've got a problem that no amount of throughput will fix.

Expected Outcomes — The Answer Key

Every claim in the corpus ships with a pre-computed ExpectedOutcome that serves as the answer key. This includes:

| Field | Description |
|---|---|
| Disposition | Paid, Denied, or Pended |
| DenialReasonCode | CARC/RARC code (e.g., 197 for missing prior auth, 22 for COB) |
| ExpectedAllowedAmount | After fee schedule application |
| ExpectedPaidAmount | After member cost-sharing |
| ExpectedCopay | Member copay amount |
| ExpectedCoinsurance | Member coinsurance amount |
| ExpectedDeductible | Deductible applied |
| ExpectedDrgCode | DRG assignment (institutional only) |
| LineOutcomes | Per-line disposition, allowed, paid, and reason codes |

Disposition matching alone catches the big bugs. Amount matching catches the subtle ones — the off-by-a-penny coinsurance calculation, the deductible that applied when it shouldn't have, the bilateral modifier that paid at 100% instead of 150%.
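As a rough illustration of the two kinds of matching, the sketch below compares one expected outcome against one engine result. The tuple shape and the `AmountMatches` helper are hypothetical; the one-cent tolerance mirrors the Amount Accuracy tier in the Scoring section, and the dollar values are made up to echo the bilateral example above.

```csharp
using System;

// Hypothetical helper: amounts match when they differ by no more than the tolerance.
// decimal avoids the binary rounding noise that double would add to cent comparisons.
bool AmountMatches(decimal expectedAmount, decimal actualAmount, decimal tolerance = 0.01m)
    => Math.Abs(expectedAmount - actualAmount) <= tolerance;

// Illustrative values only: answer key priced the bilateral line at 150%,
// the engine under test paid it at 100%.
var expected = (Disposition: "Paid", PaidAmount: 150.00m);
var actual   = (Disposition: "Paid", PaidAmount: 100.00m);

bool dispositionOk = string.Equals(expected.Disposition, actual.Disposition, StringComparison.Ordinal);
bool amountOk = AmountMatches(expected.PaidAmount, actual.PaidAmount);

Console.WriteLine($"disposition match: {dispositionOk}, amount match: {amountOk}");
// prints: disposition match: True, amount match: False
```

This claim would pass a disposition-only check and still count as an amount miss, which is exactly the class of bug the second tier is meant to surface.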

CMS-0057-F Compliance Coverage

Every claim in the corpus includes CMS-0057-F compliance fields, making MCC double as a compliance readiness smoke test:

  • Prior Authorization Status — every claim carries a PriorAuthStatus (Required, OnFile, NotRequired, Expired) and an ExpectedPriorAuthDecision in the answer key
  • FHIR R4 Readiness — the FhirBundleWriter output format maps each claim to a FHIR Claim resource, so you can validate your Patient Access and Provider Access APIs against real-shaped data
  • Payer-to-Payer Exchange — the PayerToPayerReady flag tracks whether each claim meets the data completeness requirements for P2P exchange
Two birds, one corpus

Run MCC once and you get adjudication accuracy and a CMS-0057-F compliance artifact inventory. The prior auth edge cases alone (auth on file, no auth, expired auth, wrong provider, wrong procedure) map directly to the scenarios CMS will test during enforcement.

Running the Benchmark

Quick Start — CLI

The fastest way to generate a corpus is with the mcc-runner console tool:

# Clone the repo (if you haven't already)
git clone https://github.com/aurelianware/cloudhealthoffice.git
cd cloudhealthoffice

# Generate a 1,000-claim test corpus (takes < 1 second)
dotnet run --project src/tools/mcc-runner -- --claims 1000

# Generate the full million (takes ~5 minutes)
dotnet run --project src/tools/mcc-runner -- --claims 1000000

# Custom output directory and FHIR format
dotnet run --project src/tools/mcc-runner -- -n 5000 -f fhir -o ./fhir-output

# Custom seed for reproducibility
dotnet run --project src/tools/mcc-runner -- -n 10000 --seed 123

CLI Options

| Flag | Default | Description |
|---|---|---|
| -n, --claims | 1,000 | Number of claims to generate. Distribution scales proportionally. |
| -s, --seed | 42 | Random seed. Same seed = same corpus, every time, any machine. |
| -o, --output | ./mcc-output | Output directory. Created automatically. |
| -f, --format | json | Output format: json (partitioned) or fhir (R4 Bundles). |

What the output looks like

The generator creates a directory per claim type (Professional/, Institutional/, Dental/, EdgeCase/) plus a corpus-manifest.json with generation stats. Each JSON file contains a batch of fully-populated claims with expected outcomes — ready to feed into any claims engine via API or batch import.

Using the Library Directly

If you want to integrate corpus generation into your own tooling or test harness:

// Reference CloudHealthOffice.BenchmarkClaimGenerator
var profile = DefaultCorpusProfile.Create();     // 1M claims, 60/25/10/5
var refData = new InMemoryReferenceDataProvider();
var writer = new JsonCorpusWriter("/output/corpus");
var generator = new ClaimCorpusGenerator(refData);

var result = await generator.GenerateCorpusAsync(profile, writer);

Console.WriteLine($"Generated {result.TotalClaims:N0} claims in {result.ElapsedTime}");
Console.WriteLine($"  Professional: {result.ProfessionalCount:N0}");
Console.WriteLine($"  Institutional: {result.InstitutionalCount:N0}");
Console.WriteLine($"  Dental: {result.DentalCount:N0}");
Console.WriteLine($"  Edge Cases: {result.EdgeCaseCount:N0}");

Output Formats

| Writer | Format | Use Case |
|---|---|---|
| JsonCorpusWriter | Partitioned JSON files | Primary: feed into any claims engine via API or batch import |
| FhirBundleWriter | FHIR R4 Bundle JSON | CMS-0057-F compliance testing: validate FHIR APIs directly |

Deterministic Reproducibility

The generator uses seeded randomness throughout. Same seed, same corpus, byte-for-byte, on any machine. The default seed is 42 because of course it is.
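One way to check the byte-for-byte claim yourself is to generate twice with the same seed into two directories and compare a digest of each. The sketch below is a minimal version of that idea; `HashDirectory` is a hypothetical helper, and skipping corpus-manifest.json is an assumption (a manifest holding generation stats such as elapsed time could legitimately differ between runs).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: one SHA-256 digest over every file under `root`.
// Files are visited in ordinal relative-path order so the digest does not
// depend on how the file system happens to enumerate them.
string HashDirectory(string root, string skipFileName = "corpus-manifest.json")
{
    using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
    var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
        .Where(f => Path.GetFileName(f) != skipFileName)
        .Select(f => (Full: f, Rel: Path.GetRelativePath(root, f).Replace('\\', '/')))
        .OrderBy(f => f.Rel, StringComparer.Ordinal);
    foreach (var (full, rel) in files)
    {
        hasher.AppendData(Encoding.UTF8.GetBytes(rel)); // bind content to its path
        hasher.AppendData(File.ReadAllBytes(full));
    }
    return Convert.ToHexString(hasher.GetHashAndReset());
}

// Usage: pass the two output directories produced by two runs with the same seed.
if (args.Length == 2)
    Console.WriteLine(HashDirectory(args[0]) == HashDirectory(args[1]) ? "identical" : "DIFFERENT");
```

Comparing digests from a laptop run and a CI run is a cheap way to confirm the "any machine" part of the guarantee before trusting score comparisons across environments.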

Seeding Prerequisites

Generating a million claims is only half the story. To actually adjudicate them, your claims engine needs reference data: members to look up, providers to pay, benefit plans to apply, fee schedules to price against, and accumulators to track cost-sharing. As of April 2026, MCC ships with a complete synthetic data generation framework that seeds all of these in one shot.

What Gets Seeded

The SeedingOrchestrator generates seven data pools in strict dependency order. Each pool is production-shaped — the documents map directly to the Cosmos DB containers used by Cloud Health Office microservices.

| Step | Data Pool | Default Count | What It Produces |
|---|---|---|---|
| 1 | Benefit Plans | 10 | Texas Medicaid plan templates: STAR, STAR+PLUS, STAR Kids, CHIP (A & B), STAR Health, dental, vision. Realistic cost-sharing rules per program. |
| 2 | Fee Schedules | 3 | Medicaid (70% of Medicare RBRVS), Out-of-Network (150% of Medicaid), and Capitation (PMPM by program). 50+ CPT codes and 50 DRG rates each. |
| 3 | Providers | 5,500 | 5,000 individual + 500 organizational. Luhn-10 valid NPIs, NUCC taxonomy codes, DFW-region addresses. 80% in-network, 10% OON, 10% terminated. |
| 4 | Provider Contracts | ~4,400 | One contract per in-network provider. Links to Medicaid or Capitation fee schedule. 80% fee-for-service, 15% capitation, 5% per-diem. |
| 5 | Members | 75,000 | 50,000 subscribers + ~25,000 dependents. Realistic demographics, DFW addresses, PCP assignments, coverage records with plan year dates. |
| 6 | Coverage | ~75,000 | One coverage record per member with plan assignment, coverage tier (EMP/ESP/FAM/ECH), and insurance lines (HLT, DEN, VIS). |
| 7 | Accumulators | ~71,000 | Individual and family deductible/OOP tracking per active member. 70% at $0 (Medicaid), 20% partial, 10% near max. |

Dependency ordering matters

The orchestrator enforces a strict pipeline: plans and fee schedules first (no dependencies), then providers, then contracts (needs providers + fee schedules), then members (needs plans + providers for PCP), then coverage, then accumulators (needs members + plans). Skip a step and the downstream data won’t make sense.
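The ordering described above is a topological sort over the seven pools. The sketch below derives a valid order from the stated dependencies using a simple layered variant of Kahn's algorithm. It is an illustration, not the orchestrator's actual code; the pool names are shorthand, and Coverage's dependencies (Members and BenefitPlans) are an assumption because the text only says it comes after members.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Dependencies as described in the text. The Coverage entry is an assumption.
var dependsOn = new Dictionary<string, string[]>
{
    ["BenefitPlans"]      = Array.Empty<string>(),
    ["FeeSchedules"]      = Array.Empty<string>(),
    ["Providers"]         = Array.Empty<string>(),
    ["ProviderContracts"] = new[] { "Providers", "FeeSchedules" },
    ["Members"]           = new[] { "BenefitPlans", "Providers" },
    ["Coverage"]          = new[] { "Members", "BenefitPlans" },
    ["Accumulators"]      = new[] { "Members", "BenefitPlans" },
};

// Repeatedly emit every pool whose dependencies have all been emitted already.
List<string> SeedOrder(Dictionary<string, string[]> graph)
{
    var done = new List<string>();
    var remaining = new List<string>(graph.Keys);
    while (remaining.Count > 0)
    {
        var ready = remaining.Where(p => graph[p].All(done.Contains)).ToList();
        if (ready.Count == 0)
            throw new InvalidOperationException("Cycle or missing dependency in the seeding graph.");
        done.AddRange(ready);
        remaining.RemoveAll(ready.Contains);
    }
    return done;
}

Console.WriteLine(string.Join(" -> ", SeedOrder(dependsOn)));
```

Removing a pool that others depend on (for example, dropping Providers) makes the sort throw instead of silently producing contracts that point at nothing, which is the failure mode the callout warns about.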

Configuration

Two profile classes control the shape of the generated data. The defaults produce a mid-size Texas Medicaid managed care organization.

| Profile | Key Setting | Default | What It Controls |
|---|---|---|---|
| MemberPoolProfile | SubscriberCount | 50,000 | Primary members (heads of household) |
| MemberPoolProfile | TargetTotalMembers | 75,000 | Total including dependents |
| MemberPoolProfile | ActiveRate | 0.95 | Fraction with active enrollment (rest are terminated) |
| MemberPoolProfile | DependentDistribution | 60/20/15/5 | 0, 1, 2–3, 4+ dependents per subscriber |
| MemberPoolProfile | AgeDistribution | 25/35/30/10 | <18 / 18–44 / 45–64 / 65+ age brackets |
| ProviderPoolProfile | IndividualProviderCount | 5,000 | Type 1 NPI physicians (40% are PCPs) |
| ProviderPoolProfile | OrganizationalProviderCount | 500 | Type 2 NPI facilities (hospitals, clinics, SNFs, behavioral health) |
| ProviderPoolProfile | InNetworkRate | 0.80 | Participating providers with contracts |
| ProviderPoolProfile | ContractTypes | 80/15/5 | Fee-for-service / capitation / per-diem split |
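The approximate counts in the seeding table follow from these defaults. The arithmetic below is a back-of-the-envelope check, assuming contracts are issued to exactly the in-network share of providers and accumulators are created only for active members; the generator samples these rates, so real counts will land near these figures rather than on them.

```csharp
using System;

// Defaults from the profile table.
int subscribers = 50_000, targetTotalMembers = 75_000;
int individualProviders = 5_000, organizationalProviders = 500;
decimal activeRate = 0.95m, inNetworkRate = 0.80m;

int dependents    = targetTotalMembers - subscribers;               // 25,000
int providers     = individualProviders + organizationalProviders;  // 5,500
int contracts     = (int)(providers * inNetworkRate);               // 4,400 (table: ~4,400)
int activeMembers = (int)(targetTotalMembers * activeRate);         // 71,250 (table: ~71,000)

Console.WriteLine($"dependents={dependents} providers={providers} " +
                  $"contracts={contracts} activeMembers={activeMembers}");
```

Rerunning this arithmetic after changing a profile value gives a quick expectation to compare against the counts the seeding run reports.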

Running Against Cloud Health Office

Here’s the end-to-end workflow for benchmarking the full Cloud Health Office adjudication pipeline — from seeding prerequisite data through scoring results.

Step 1 — Seed Prerequisite Data

The SeedingOrchestrator generates all reference data in dependency order and optionally persists it to Cosmos DB:

// Reference CloudHealthOffice.BenchmarkClaimGenerator
using CloudHealthOffice.BenchmarkClaimGenerator.Configuration;
using CloudHealthOffice.BenchmarkClaimGenerator.Seeding;

// Assumes a `logger` instance from your host's logging setup is already in scope.
var orchestrator = new SeedingOrchestrator(logger);

// Configure pool sizes (defaults shown — adjust to taste)
var memberProfile = new MemberPoolProfile
{
    SubscriberCount = 50_000,        // 50K subscribers
    TargetTotalMembers = 75_000,     // ~75K with dependents
    Seed = 42
};

var providerProfile = new ProviderPoolProfile
{
    IndividualProviderCount = 5_000, // 5K physicians
    OrganizationalProviderCount = 500, // 500 facilities
    Seed = 42
};

// Options A and B are alternatives: keep only one, since both declare `result`.
// Option A: Generate in-memory only (for file-based import)
var result = await orchestrator.GenerateAndSeedAsync(
    memberProfile, providerProfile, seed: 42);

// Option B: Generate and seed to Cosmos DB
var seeder = new CosmosDbBenchmarkSeeder(
    connectionString: "AccountEndpoint=https://...",
    tenantId: "mcc-benchmark",
    databaseName: "cloudhealthoffice");

var result = await orchestrator.GenerateAndSeedAsync(
    memberProfile, providerProfile, seed: 42, seeder: seeder);

Step 2 — Generate Output Files

After seeding, generate import-ready files for the enrollment, provider, and fee schedule pipelines:

// Generate 834 EDI, provider CSVs, and fee schedule CSVs
await orchestrator.GenerateOutputFilesAsync(result, "./mcc-seed-output");

// Output structure:
//   ./mcc-seed-output/834/          ← X12 834 enrollment files (5K members/file)
//   ./mcc-seed-output/providers/    ← Provider import CSVs
//   ./mcc-seed-output/fee-schedules/ ← Fee schedule CSVs

Step 3 — Generate the Claim Corpus

With reference data seeded, generate the claims:

# Generate the full million claims
dotnet run --project src/tools/mcc-runner -- --claims 1000000 --seed 42

# Or a smaller test run first
dotnet run --project src/tools/mcc-runner -- --claims 10000 --seed 42

Step 4 — Submit Claims & Score

Feed the generated claims into the Cloud Health Office adjudication pipeline (via the Claims API or batch import), then compare each result against the pre-computed ExpectedOutcome in the corpus. The scoring tiers are defined in the Scoring section below.

Same seed, same universe

Use the same seed (42 by default) for both the prerequisite seeding and the claim corpus. This ensures the members, providers, and plans referenced in the claims match exactly what was seeded — deterministic end to end.

Output Formats & Import Pipelines

The seeding framework produces import-ready files for each Cloud Health Office microservice, so you can benchmark the full intake pipeline — not just the adjudication engine.

| Writer | Format | Target Service | Details |
|---|---|---|---|
| X12_834Writer | X12 834 EDI files | enrollment-import-service | Properly formatted ISA/GS/ST envelopes with INS, REF, DTP, NM1, HD, and PCP (LX/NM1 P3) segments. 5,000 members per file. |
| ProviderImportCsvWriter | CSV | Provider data (Cosmos DB seeding or custom import tooling) | NPI, TaxId, taxonomy, network status, credentialing, contract type, fee schedule linkage. Separate files for individuals and organizations. Use with CosmosDbSeeder or a custom loader; provider-service itself exposes a JSON API, not a CSV endpoint. |
| FeeScheduleImportCsvWriter | CSV | Fee schedule data (Cosmos DB seeding or custom import tooling) | Procedure code, modifier, place of service, allowed amount. One file per fee schedule (Medicaid, OON, Capitation). FeeScheduleEngine is a library consumed by the adjudication pipeline; seed its backing Cosmos DB container directly or use these CSVs with a custom loader. |
| CosmosDbSeeder | Cosmos DB documents | All microservices | Bulk write to Members, Coverage, Providers, ProviderContracts, BenefitPlans, FeeSchedules, and Accumulators containers. Note: the base WriteDocumentsAsync is a no-op stub; subclass CosmosDbSeeder and override it with your Azure.Cosmos SDK bulk-write implementation, then pass that instance to CosmosDbBenchmarkSeeder. |

Two paths to seeding

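File-based: Generate 834 EDI and CSV files, then feed the 834 files through the enrollment-import-service and load the provider and fee schedule CSVs with a custom loader (for example, one that calls the provider-service JSON API). This benchmarks the full intake pipeline including X12 parsing, validation, and persistence.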

Direct Cosmos DB: Skip the intake pipeline and seed documents directly. This is faster for benchmarking adjudication in isolation, but doesn't exercise the import path.
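For planning the file-based path, it helps to know how many 834 files to expect. The sketch below only illustrates the batching arithmetic for the 5,000-members-per-file limit using `Enumerable.Chunk` (.NET 6+); it does not reproduce how X12_834Writer itself batches, and the integer range stands in for real member records.

```csharp
using System;
using System.Linq;

const int MembersPerFile = 5_000; // per-file limit quoted for the 834 writer
int totalMembers = 75_000;        // default TargetTotalMembers

// Chunk yields fixed-size batches; the final batch may be smaller.
var batches = Enumerable.Range(1, totalMembers).Chunk(MembersPerFile).ToList();

Console.WriteLine($"{batches.Count} files, last file holds {batches[^1].Length} members");
// prints: 15 files, last file holds 5000 members
```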

Benchmarking Methodology

Here’s the playbook for running a rigorous end-to-end benchmark against Cloud Health Office. The same approach works for any claims engine — adapt the intake format and API calls.

Environment Setup

| Component | Recommendation |
|---|---|
| Infrastructure | Deploy CHO via Kubernetes Quick Start or Docker Compose. Use production-equivalent node sizes for meaningful throughput numbers. |
| Cosmos DB | Provision with sufficient RU/s for the seed volume. 10,000 RU/s handles the default 75K member seed comfortably. Scale to 50,000+ RU/s for load testing the adjudication pipeline. |
| Isolation | Use a dedicated tenantId (default: mcc-benchmark) to isolate benchmark data from production tenants. |

Benchmark Phases

| Phase | Command / Action | What to Measure |
|---|---|---|
| 1. Seed | Run SeedingOrchestrator.GenerateAndSeedAsync() with Cosmos DB seeder | Seeding throughput (records/sec), total elapsed time, Cosmos DB RU consumption |
| 2. Generate | dotnet run --project src/tools/mcc-runner -- -n 1000000 | Corpus generation time (expect < 5 min on M-series, < 10 min on CI) |
| 3. Ingest | POST claims to /api/claims or batch import via 837 EDI | Sustained claims/second, P50/P95/P99 ingestion latency, error rate |
| 4. Adjudicate | Monitor claims-examiner-service processing | Adjudication throughput, queue depth, autoscaler behavior (KEDA pod count) |
| 5. Score | Compare adjudicated results to ExpectedOutcome answer key | Disposition accuracy, amount accuracy, edge case accuracy (see Scoring) |

Key Metrics to Capture

| Metric | Target | How to Capture |
|---|---|---|
| Seeding Time | < 5 min (75K members) | SeedingResult.ElapsedTime + RecordsSeeded dictionary |
| Corpus Generation | < 5 min (1M claims) | CorpusResult.ElapsedTime |
| Claims/Second (sustained) | Platform-dependent | Total claims / wall-clock time over the full corpus, not burst |
| P99 Adjudication Latency | < 500ms | Per-claim processing time at 99th percentile |
| Cost Per Claim | Platform-dependent | Total Azure spend during benchmark / 1,000,000 |
| Disposition Accuracy | ≥ 99.5% | Paid/Denied/Pended match rate |
| Edge Case Accuracy | ≥ 95.0% | Correct disposition + amount on the 50K edge case claims |

Deterministic = repeatable

Every generator in the framework is seeded. Same seed, same data, every time. This means you can run the benchmark before and after a code change and get a directly comparable result. No need to control for data variance — there isn’t any.
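The latency and throughput rows in the metrics table reduce to two small calculations. The sketch below uses the nearest-rank percentile definition, which is one common choice and an assumption here (the source does not say which definition to use); the sample latencies and wall-clock time are placeholders.

```csharp
using System;
using System.Linq;

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it. Input must be sorted ascending.
double Percentile(double[] sortedAscending, double p)
{
    if (sortedAscending.Length == 0) throw new ArgumentException("No samples.");
    int rank = (int)Math.Ceiling(p / 100.0 * sortedAscending.Length);
    return sortedAscending[Math.Clamp(rank, 1, sortedAscending.Length) - 1];
}

// Placeholder per-claim latencies in milliseconds; substitute measured values.
double[] latenciesMs = { 12, 15, 18, 22, 25, 31, 40, 55, 120, 610 };
Array.Sort(latenciesMs);

// Placeholder: elapsed time for the whole run, not a burst window.
double wallClockSeconds = 2.5;

Console.WriteLine($"P50={Percentile(latenciesMs, 50)} P95={Percentile(latenciesMs, 95)} P99={Percentile(latenciesMs, 99)}");
Console.WriteLine($"sustained claims/sec={latenciesMs.Length / wallClockSeconds}");
```

In this toy sample a single 610 ms outlier sets both P95 and P99 while the median stays at 25 ms, which is why the table asks for tail percentiles rather than an average.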

Evaluating a Payer Platform

If you're evaluating a claims adjudication platform — whether that's Cloud Health Office, a legacy system like QNXT or Facets, or a newer entrant like HealthEdge — MCC gives you three things that vendor demos and slide decks can't: accuracy under volume, throughput at scale, and edge case coverage you didn't have to invent yourself.

Accuracy Under Volume

Any engine can adjudicate 10 claims correctly on a demo call. The question is whether it still gets the right answer on claim 847,293 — a COB birthday rule dispute for a dual-eligible member with an expired prior auth. MCC answers that question with data, not promises.

Throughput & Load Testing

The corpus is large enough to stress-test a production-grade pipeline. Feed it in at pace and you'll learn how the engine behaves under sustained load — where the bottlenecks are, whether accuracy degrades at volume, and what the actual per-claim cost looks like when the cluster is under real pressure.

| Metric | What to Measure | Why It Matters |
|---|---|---|
| Claims/Second | Sustained throughput over the full million, not a burst | Burst numbers look great in demos. Sustained throughput predicts production reality. |
| P99 Latency | 99th percentile single-claim adjudication time | Outlier latency shows where the engine chokes: complex DRG, COB cascades, etc. |
| Cost Per Claim | Total infrastructure cost / 1,000,000 | Vendor pricing decks quote per-member-per-month. MCC gives you the real per-claim compute cost. |
| Accuracy at Scale | Does accuracy hold steady or drift as volume increases? | Some engines cache aggressively and return stale results under load. MCC catches that. |

Ask the vendor to run it

When evaluating a payer platform, hand them the MCC corpus and ask for results. A vendor confident in their engine will welcome the test. A vendor that declines just told you something important.

RFP-Ready Artifact

The MCC scoring report — disposition accuracy, amount accuracy, edge case accuracy, throughput, and cost-per-claim — is a ready-made evaluation artifact for RFP scoring committees. It's objective, reproducible, and directly comparable across platforms. No more "our engine is accurate" without receipts.

Scoring

Run the corpus through your adjudication engine and compare against the answer key. We recommend three tiers:

| Metric | What It Measures | Target |
|---|---|---|
| Disposition Accuracy | Paid/Denied/Pended matches the expected disposition | ≥ 99.5% |
| Amount Accuracy | Paid amount within ±$0.01 of expected | ≥ 98.0% |
| Edge Case Accuracy | Disposition + amount correct on the 50K edge cases | ≥ 95.0% |

Why are the targets different? Because getting office visits right is table stakes. Getting a Medicaid dual-eligible spend-down claim right while a COB birthday rule is in play — that's where engines separate themselves.
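Rolling per-claim results up into the three tiers could look like the sketch below. The tuple shape and the `Rate` helper are hypothetical, the five rows are toy data, and computing Amount Accuracy over all claims (rather than, say, paid claims only) is an assumption the source does not settle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One row per adjudicated claim. How the two flags are derived depends on
// your engine's output format; this shape is hypothetical.
var results = new List<(string ClaimId, bool DispositionOk, bool AmountOk)>
{
    ("MCC-P-0000001", true,  true),
    ("MCC-P-0000002", true,  false),
    ("MCC-I-0000001", true,  true),
    ("MCC-E-0000001", false, false),
    ("MCC-E-0000002", true,  true),
};

// Percentage of true flags; an empty set scores 0 rather than dividing by zero.
double Rate(IEnumerable<bool> flags)
{
    var list = flags.ToList();
    return list.Count == 0 ? 0 : 100.0 * list.Count(f => f) / list.Count;
}

// Edge cases are identified by the E type code in the claim ID.
var edge = results.Where(r => r.ClaimId.StartsWith("MCC-E-", StringComparison.Ordinal));

double disposition = Rate(results.Select(r => r.DispositionOk));
double amount      = Rate(results.Select(r => r.AmountOk));
double edgeCase    = Rate(edge.Select(r => r.DispositionOk && r.AmountOk));

Console.WriteLine($"Disposition {disposition}% (target 99.5) pass={disposition >= 99.5}");
Console.WriteLine($"Amount      {amount}% (target 98.0) pass={amount >= 98.0}");
Console.WriteLine($"Edge case   {edgeCase}% (target 95.0) pass={edgeCase >= 95.0}");
```

With the toy rows this reports 80%, 60%, and 50%, so all three tiers fail; a real run would feed the same aggregation one row per corpus claim.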

How does Cloud Health Office score?

We run MCC against CHO on every release. Our current scores are published in the repository README. We believe in showing our work. If we're going to build a benchmark, we should be willing to stand in front of it.

FAQ

Can I use MCC to benchmark a competitor's platform?

Yes. That's the point. MCC is open source and not coupled to Cloud Health Office. Generate a corpus, convert it to whatever intake format the other platform expects, run it, and compare. We'd genuinely love to see the results.

Are the ICD-10 / CPT / CDT codes real?

The codes are drawn from real code ranges with simplified descriptions. 99213 is really CPT 99213 (established patient, low complexity). D2750 is really CDT D2750 (porcelain fused to high noble metal crown). The charges are realistic but synthetic — no PHI, no real patients, nothing to worry about.

Why are expected outcomes simplified in v1?

V1 uses straightforward pricing rules (e.g., allowed = charges × 0.65 for professional claims, DRG flat rates for inpatient). This lets you validate your engine's logic — does it pick the right disposition, apply the right cost-sharing, catch the right edge cases? V2 will plug in the real BenefitEngine and FeeScheduleEngine for contract-accurate expected amounts.
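As a worked example of the v1 professional rule, the sketch below applies allowed = charges × 0.65 and then one possible cost-sharing sequence. Only the 0.65 factor comes from the text above; the order (copay, then deductible, then coinsurance on the remainder), the rounding mode, and the `Price` helper are assumptions for illustration, so check the answer key's own fields before relying on them.

```csharp
using System;

// v1 rule quoted above: allowed = charges × 0.65 for professional claims.
decimal Allowed(decimal charges)
    => Math.Round(charges * 0.65m, 2, MidpointRounding.AwayFromZero);

// Assumed cost-sharing order (not confirmed by the source):
// copay first, then remaining deductible, then coinsurance on what is left.
(decimal Paid, decimal MemberOwes) Price(
    decimal charges, decimal copay, decimal deductibleRemaining, decimal coinsuranceRate)
{
    decimal allowed = Allowed(charges);
    decimal copayApplied = Math.Min(allowed, copay);
    decimal afterCopay = allowed - copayApplied;
    decimal deductible = Math.Min(afterCopay, deductibleRemaining);
    decimal coinsurance = Math.Round(
        (afterCopay - deductible) * coinsuranceRate, 2, MidpointRounding.AwayFromZero);
    decimal member = copayApplied + deductible + coinsurance;
    return (allowed - member, member);
}

var (paid, owes) = Price(charges: 200.00m, copay: 20.00m,
                         deductibleRemaining: 50.00m, coinsuranceRate: 0.20m);
Console.WriteLine($"allowed={Allowed(200.00m)} paid={paid} member={owes}");
```

For a 200.00 charge this gives an allowed amount of 130.00, a member share of 82.00 (20 copay + 50 deductible + 12 coinsurance), and a plan payment of 48.00.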

How long does generation take?

On an Apple M-series MacBook: under 5 minutes for the full million. The generator uses System.Threading.Channels to pipeline four parallel producers (one per claim type) into a single writer consumer. It's fast enough that you won't have time to get coffee, but slow enough that you can watch the progress counter and feel productive.
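The producer/consumer shape described above can be sketched with `System.Threading.Channels` as follows. This is a scaled-down illustration, not the generator's code: the `ProduceAsync` helper, the channel capacity, and the per-type counts are made up, and the consumer only counts items where a real writer would serialize claims.

```csharp
using System;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

// Bounded channel: producers wait when the single consumer falls behind.
var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(1_000)
{
    SingleReader = true,   // one writer-consumer
    SingleWriter = false,  // four producers
});

// Hypothetical producer: emits `count` placeholder claim IDs for one claim type.
async Task ProduceAsync(char type, int count)
{
    for (int i = 1; i <= count; i++)
        await channel.Writer.WriteAsync($"MCC-{type}-{i:D7}");
}

var producers = new[] { ('P', 600), ('I', 250), ('D', 100), ('E', 50) }
    .Select(p => Task.Run(() => ProduceAsync(p.Item1, p.Item2)))
    .ToArray();

// Complete the channel once every producer is done so the consumer loop ends,
// even if a producer throws.
var closer = Task.Run(async () =>
{
    try { await Task.WhenAll(producers); }
    finally { channel.Writer.Complete(); }
});

int written = 0;
await foreach (var claimId in channel.Reader.ReadAllAsync())
    written++; // a real consumer would write the claim to disk here

await closer; // surfaces any producer exception
Console.WriteLine($"consumer received {written} claims");
// prints: consumer received 1000 claims
```

Arrival order across producers is not deterministic in a setup like this, so a reproducible corpus presumably relies on each claim's content being fixed by the seed and on output being partitioned by claim type rather than on interleaving order.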

What's the claim ID format?

MCC-{type}-{sequence} where type is P (Professional), I (Institutional), D (Dental), or E (Edge Case), and sequence is a zero-padded 7-digit number. Example: MCC-P-0000001, MCC-E-0042000.
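A small formatter and parser for that ID shape might look like this. The helper names are hypothetical; the pattern is taken directly from the format described above.

```csharp
using System;
using System.Text.RegularExpressions;

// MCC-{P|I|D|E}-{7 ASCII digits}, anchored so partial matches are rejected.
var claimIdPattern = new Regex(@"^MCC-(?<type>[PIDE])-(?<seq>[0-9]{7})$", RegexOptions.CultureInvariant);

string FormatClaimId(char type, int sequence) => $"MCC-{type}-{sequence:D7}";

bool TryParseClaimId(string claimId, out char type, out int sequence)
{
    var m = claimIdPattern.Match(claimId);
    type = m.Success ? m.Groups["type"].Value[0] : '\0';
    sequence = m.Success ? int.Parse(m.Groups["seq"].Value) : 0;
    return m.Success;
}

Console.WriteLine(FormatClaimId('E', 42000));                 // MCC-E-0042000
Console.WriteLine(TryParseClaimId("MCC-P-0000001", out var t, out var s)
    ? $"{t} {s}" : "invalid");                                // P 1
Console.WriteLine(TryParseClaimId("MCC-X-1", out _, out _));  // False
```

Parsing the type code out of the ID is a convenient way to slice results by claim type, for example to isolate the edge cases when scoring.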

Can I use MCC for load testing and capacity planning?

Absolutely. A million claims is a meaningful volume for stress-testing a production-grade adjudication pipeline. Feed them in at pace to measure sustained throughput, P99 latency, autoscaler behavior, and per-claim infrastructure cost. The deterministic corpus means you can run the same load test repeatedly and get directly comparable results as you tune.

I'm evaluating payer platforms for an RFP. How do I use this?

Generate the corpus, hand it to each vendor under evaluation, and ask them to run it. Compare disposition accuracy, amount accuracy, edge case accuracy, throughput, and cost-per-claim side by side. It's the same million claims, so the comparison is apples-to-apples. We've seen this approach cut weeks off payer platform evaluation cycles because it replaces subjective demo impressions with objective, reproducible data.

Does Cloud Health Office publish its own MCC results?

Yes. We run MCC against CHO on every release and publish the results. We built the benchmark, so it would be strange to hide from it. You can find our current scores in the repository README.

What is the SeedingOrchestrator and why do I need it?

The SeedingOrchestrator generates all the prerequisite data a claims engine needs before it can adjudicate: members, providers, benefit plans, fee schedules, provider contracts, coverage records, and accumulators. Without this data, claims have nothing to adjudicate against. The orchestrator enforces dependency ordering (plans before members, providers before contracts) and produces production-shaped documents that map to the Cosmos DB containers CHO uses.

How many members/providers does the seeding framework generate?

By default: 50,000 subscribers with ~25,000 dependents (75,000 total members), 5,000 individual providers, and 500 organizational providers (hospitals, clinics, SNFs). Both pools are configurable via MemberPoolProfile and ProviderPoolProfile. Scale down for quick iteration; scale up for production-grade load testing.

What output formats does the seeding framework support?

Four writers: X12 834 EDI files for the enrollment-import-service (5,000 members per file, properly formatted ISA/GS/ST envelopes), provider CSVs and fee schedule CSVs for Cosmos DB seeding or custom import tooling (provider-service exposes a JSON API rather than a CSV endpoint, and FeeScheduleEngine is a library rather than a service), and direct Cosmos DB bulk seeding for bypassing intake pipelines. Use file-based import to benchmark the full pipeline; use Cosmos DB seeding to benchmark adjudication in isolation.

Can I seed to a database other than Cosmos DB?

Yes. The seeding framework uses an IBenchmarkDataSeeder interface. Implement that interface for your target database (Postgres, SQL Server, DynamoDB, etc.) and pass it to SeedingOrchestrator.GenerateAndSeedAsync(). The built-in CosmosDbBenchmarkSeeder is one implementation; the file-based writers (834, CSV) work with any backend that accepts those formats.

Are the generated NPIs valid?

Yes. Every NPI passes Luhn-10 check digit validation with the standard “80840” prefix. This means provider lookups, NPPES cross-references, and NPI format validators will accept them. The NPIs are synthetic — they don’t map to real providers.
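If you want to verify that claim on your side of the import, the check is short. The sketch below is a generic NPI validator (Luhn over the 10-digit NPI with the "80840" prefix prepended), not code taken from the generator; `IsValidNpi` is a hypothetical helper name.

```csharp
using System;
using System.Linq;

// Validates a 10-digit NPI: prepend the "80840" prefix, then run the Luhn check.
bool IsValidNpi(string npi)
{
    if (npi.Length != 10 || !npi.All(c => c >= '0' && c <= '9')) return false;
    string digits = "80840" + npi;
    int sum = 0;
    for (int i = 0; i < digits.Length; i++)
    {
        int d = digits[digits.Length - 1 - i] - '0';
        if (i % 2 == 1) // double every second digit, counting from the right
        {
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
    }
    return sum % 10 == 0;
}

Console.WriteLine(IsValidNpi("1234567893")); // True
Console.WriteLine(IsValidNpi("1234567890")); // False
```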