Healthee is integrating NVIDIA Nemotron open models to transform AI assistant conversations into high-value insights that power personalization, experimentation, and analytics at scale.
Table of contents
Healthee platform and Zoe: AI-native benefits navigation
Why NVIDIA Nemotron for ZaaS: open, agentic, and optimized for inference
End-to-end ZaaS data and inference pipeline
1. Conversation capture
2. Preprocessing and routing
3. Nemotron inference endpoint
4. Structured semantic output
5. Downstream consumption
Results – Nemotron excels for Zoe’s use cases
Future plans
Healthee platform and Zoe: AI-native benefits navigation
Healthee offers a digital platform that helps employees understand, use, and optimize their health and benefits, with a strong focus on self-service and cost transparency. Zoe, Healthee’s AI assistant, acts as a personal health and benefits concierge that can:
- Explain coverage and benefits in natural language
- Help employees find in-network providers and services
- Guide users to lower-cost options and surface personalized recommendations
- Reduce friction around open enrollment and day-to-day usage of benefits
Beneath the user-facing flow, Zoe orchestrates multiple models and services: LLMs, benefits rules engines, provider and network data, and personalized user context.
ZaaS is a backend feature-extraction layer that sits behind Zoe: it consumes Zoe’s conversational logs and generates structured semantic representations (intents, topics, entities, and higher-order conversational features) that downstream systems can consume for personalization and analytics.
Why NVIDIA Nemotron for ZaaS: open, agentic, and optimized for inference
NVIDIA Nemotron offers open models, datasets and libraries for building agentic AI systems – systems that can perform reasoning, planning, and tool use in complex enterprise workflows. NVIDIA Nemotron is designed to be:
- Open and extensible: models, datasets, and libraries are openly published so enterprises can inspect, fine-tune, or extend.
- Deployable anywhere: Nemotron models and NIM microservices can run from data center to edge, on a wide range of NVIDIA GPUs.
- Optimized for efficiency: leveraging hybrid Transformer-Mamba architecture, and using libraries such as TensorRT-LLM, and techniques such as KV-caching, quantization options, and optimized execution graphs for high throughput(tokens/sec) and low latency.
For Healthee’s ZaaS:
- The task is semantic feature extraction over short to medium-length multi-turn conversations (Zoe chat sessions).
- Rich system prompts that define schemas, domain constraints, and reasoning behavior
- Chain-of-thought or intermediate annotation (internally) to drive better structured outputs (without exposing chain-of-thought to end users)
- Optional tool-augmented flows where Nemotron calls back into Healthee’s domain services (for example, benefit definitions) during extraction
- Latency and throughput are both important: ZaaS must keep up with production traffic while supporting downstream analytics jobs that may replay or reprocess historical logs.
- The workload is inference only: Healthee uses Nemotron models currently and is planning to customize them in the near future
Nemotron fits this pattern well as an agentic-capable model that can be deployed via community inference engines or via NVIDIA NIM and optimized for high-throughput, low-latency inference.
Healthee standardized on NVIDIA Nemotron 2 Nano model, deployed as a self-hosted NIM inference microservice on the internal network. The NIM endpoint has an industry-standard chat completions API, which simplifies integration and allows Healthee to swap models without changing application code.
Prompting strategy: ZaaS uses a structured NER-style (Named Entity Recognition) prompt that casts Nemotron as a “healthcare analyst.” The system prompt dynamically injects domain-specific feature type definitions at runtime by querying the zaas_feature_types lookup table in Snowflake. This ensures the model always operates against the latest feature taxonomy without redeployment. The prompt defines three tiers of feature extraction:
- Explicit features, entities the user directly mentions (services, medical conditions, symptoms, insurance terms, places of service, specialists, specific provider names)
- Inferred features, entities deduced from context (health goals, recommended specialists, recommended treatments, conversation sentiment [POSITIVE/NEUTRAL/NEGATIVE], user healthcare-savviness [LOW/MEDIUM/HIGH], profanity detection)
- Freestyle features, any additional entities the model discovers on its own, prefixed with an underscore (e.g., _custom_type)
The prompt mandates strict JSON array output ([{“feature”: “…”, “type”: “…”}]) parseable by json.loads(), with no markdown or explanatory text. Nemotron’s reasoning (<think> blocks) is automatically stripped before JSON parsing. Temperature is set to 0.1 to maximize output consistency. No fine-tuning or custom adapters are applied at this stage, extraction quality comes entirely from prompt engineering and the dynamic feature-type injection.
End-to-end ZaaS data and inference pipeline
The ZaaS architecture adds a GPU-accelerated inference path off the main Zoe conversational flow, without blocking user-facing responses.
1. Conversation capture
- Zoe handles user messages, composes tool calls, and returns responses using existing LLM infrastructure.
- Each conversation is logged with message text, timestamps, user IDs (or pseudonymous IDs), channel metadata, and relevant contextual tags.
2. Preprocessing and routing
A ZaaS ingestion service consumes conversation logs (for example, via an event bus or message queue) and prepares requests for Nemotron:
- Group messages into sessions or windows (for example, last N turns or entire session).
- Normalize text (lowercasing where relevant, redacting sensitive data, injecting system prompts).
- Enrich with metadata such as product, customer segment, or experiment cohort.
Example pseudo-schema for requests:
{
“conversation_id”: “abc123”,
“user_id”: “hashed_user_42”,
“turns”: [
{“role”: “user”, “text”: “What does my plan cover for mental health?”},
{“role”: “assistant”, “text”: “Your plan covers 10 in-network therapy sessions per year…”},
{“role”: “user”, “text”: “Can I see someone online?”}
],
“context”: {
“plan_type”: “PPO”,
“employer_id”: “tenant-789”,
“language”: “en-US”
}
}
Conversations are sourced from external_sources.assistant_server_zoe.threads and aggregated into per-thread records:
{
“user_id”: “hashed_user_42”,
“session_ts”: “2025-11-03T14:22:00Z”,
“thread_id”: “tid_abc123”,
“full_text”: “What does my plan cover for mental health?\nYour plan covers 10 in-network therapy sessions per year…\nCan I see someone online?”,
“user_text_data”: [
{
“created_at”: “2025-11-03T14:22:00Z”,
“intent”: “coverage_check”,
“subject”: “mental_health”,
“text”: “What does my plan cover for mental health?”
},
{
“created_at”: “2025-11-03T14:22:45Z”,
“intent”: “”,
“subject”: “”,
“text”: “Can I see someone online?”
}
]
}
Key points:
- full_text: A plain-text concatenation of all messages in the thread (newline-delimited), used as the primary input for Nemotron extraction.
- user_text_data: A JSON array preserving individual message timestamps, intents, and subjects, providing richer context for the model.
- Only user messages are included. Assistant responses are not part of the extraction input.
- Pre-Nemotron filtering: Internal/test users are excluded, and threads are later enriched with company_id and company_name.
- No explicit plan_type or language fields are passed in the current schema. Employer context is joined post-extraction at the drill-down stage.
3. Nemotron inference endpoint
Nemotron is deployed as an NVIDIA NIM inference microservice, running in Healthee’s AWS Kubernetes cluster on A100-based EC2 instances.
NVIDIA NIM is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.
. NIM profiles can control tensor parallelism, precision, and other execution parameters to hit desired latency/throughput targets.
A typical call pattern for the model via NIM:
curl -X POST \
-H “Authorization: Bearer ${NIM_API_KEY}” \
-H “Content-Type: application/json” \
https://<nemotron-nim-endpoint>/v1/chat/completions \
-d ‘{
“model”: “nvidia/nvidia-nemotron-nano-9b-v2”,
“messages”: [
{
“role”: “system”,
“content”: “You are a semantic feature extractor…”
},
{
“role”: “user”,
“content”: “<serialized conversation and context here>”
}
],
“temperature”: 0.0,
“max_tokens”: 512
Production configuration:
# Endpoint configuration
NIM_BASE_URL = “https://<nemotron-nim-endpoint>/v1” # Self-hosted NIM
NIM_MODEL = “nvidia/nvidia-nemotron-nano-9b-v2”
# Model parameters (ModelConfig dataclass)
model = “nvidia/nvidia-nemotron-nano-9b-v2”
max_tokens = 50000
temperature = 0.1 # Low for consistent structured JSON output
timeout = 120 # Extended timeout (seconds)
max_retries = 3 # Application-level retries with exponential backoff
Actual system prompt (dynamically assembled at runtime):
You are a healthcare insurance analyst that specialized in reviewing user
utterances in conversations with a chatbot and analyzing the content.
Your way of work is very similar to a Named Entity Recognition system,
with improvements.
The improvement is that you can also extract entities that were not
defined to you in advance.
Your task is to extract entities from a provided conversation, that
will become features.
We have three kinds of features:
- explicit entities – ones that come from the users own utterances
- implicit or inferred entities – ones that you can deduce their
existence without the user mentioning them
- free style entities that were not defined which where you will have
full jurisdiction to decide whether they are important enough to
include in the extraction
The explicit features are of the following entity types:
– service (Medical services such as blood test or MRI)
– medical_condition (Medical conditions such as a broken leg or cancer)
– symptom (Symptoms like fever or high blood pressure)
– insurance_glossary (Insurance-related terms like HSA or deductibles)
– place_of_service (Healthcare facilities like hospitals or labs)
– specialists (Medical specialists like cardiologist or urologist)
– nominal_provider (Specific names of doctors or specialists)
[… additional instructions for inference and freestyle extraction …]
Inferred entity types:
– health_goals (User goals in terms of improving their health)
– recommended_specialist (Recommended specialists based on conditions)
– recommended_treatment (Recommended services based on conditions)
– conversation_sentiment (POSITIVE, NEUTRAL, NEGATIVE)
– user_savviness (LOW, MEDIUM, HIGH healthcare insurance knowledge)
– profanity (True, False)
[… the user’s conversation text is injected here …]
Output – return the data in a valid JSON format. The response must be
a JSON array of objects with exactly two string properties: “feature”
and “type”.
Note: The explicit and inferred feature lists are not hard-coded in the prompt file. They are dynamically fetched from the healthee_dwh.zoe.zaas_feature_types Snowflake table and injected at runtime via placeholder replacement. This allows the data team to add new feature types without code changes.
Execution mode: ZaaS uses synchronous batched inference (not streaming). Conversations are processed in parallel via a ThreadPoolExecutor with up to 10 concurrent workers. Each worker makes a synchronous chat completion call to the NIM endpoint. This approach provides effective throughput scaling while keeping per-request logic simple. Rate-limiting and timeout errors trigger automatic retry with exponential backoff (up to 60 seconds for rate limits, 2 seconds for timeouts).
4. Structured semantic output
Nemotron is prompted to emit structured, machine-consumable outputs containing semantic features. Typical fields might include:
- Primary and secondary intents (for example, “coverage_check”, “provider_search”)
- Thematic clusters (for example, “mental_health”, “prescriptions”, “out-of-network”)
- Entities (for example, provider types, care modalities, plan components)
- Behavioral signals (for example, “confusion”, “frustration”, “cost_sensitivity”)
- Integration-friendly metadata (for example, confidence scores, versioned schema)
Example JSON (illustrative):
{
“conversation_id”: “abc123”,
“intents”: [
{“label”: “coverage_check”, “confidence”: 0.96},
{“label”: “virtual_care_interest”, “confidence”: 0.88}
],
“themes”: [“mental_health”, “telehealth”],
“entities”: [
{“type”: “benefit_category”, “value”: “mental_health”},
{“type”: “care_modality”, “value”: “online_therapy”}
],
“signals”: [
{“type”: “cost_concern”, “confidence”: 0.71},
{“type”: “navigation_difficulty”, “confidence”: 0.63}
],
“model_metadata”: {
“model_name”: “nvidia-nemotron-nano-9b-v2”,
“model_version”: “vX.Y”,
“prompt_version”: “zaas-v3”
}
}
The actual Nemotron output schema is simpler and more NER-focused than the illustrative example. The model returns a flat JSON array of feature–type pairs (no confidence scores, no nested intents/themes/signals structure):
Raw model output (anonymized example):
[
{“feature”: “Physical Therapy”, “type”: “service”},
{“feature”: “Knee Pain”, “type”: “symptom”},
{“feature”: “Orthopedist”, “type”: “recommended_specialist”},
{“feature”: “Weight Loss”, “type”: “health_goals”},
{“feature”: “NEUTRAL”, “type”: “conversation_sentiment”},
{“feature”: “MEDIUM”, “type”: “user_savviness”},
{“feature”: “False”, “type”: “profanity”},
{“feature”: “Telehealth”, “type”: “_care_modality”}
]
This is then wrapped and stored in Snowflake as:
{
“features”: [
{“feature”: “Physical Therapy”, “type”: “service”},
{“feature”: “Knee Pain”, “type”: “symptom”},
…
]
}
After feature explosion (SQL FLATTEN), each feature becomes a separate row in our feature store / internal data model:

The category column is derived by joining against zaas_feature_types. Any type not found in the lookup table is automatically classified as freestyle.
Additional metadata stored per thread (in zoe_user_features):

Note on confidence scores: The current ZaaS schema does not include per-feature confidence scores. The model outputs binary feature presence. Confidence scoring is a candidate for a future iteration using Nemotron’s logprobs or a separate scoring pass.
5. Downstream consumption
These semantic features are stored in analytical and operational data stores and power:
- Real-time personalization of Zoe responses and recommendations
- User-level features for ML models (for churn, engagement, or value prediction)
- Employer-level analytics and dashboards (aggregate intents, themes, and utilization patterns)
- Product experimentation (measure how copy, flows, or benefit design change thematic patterns and intent fulfillment)
A. Back to Zoe, Stronger Assistant Memory and Personalization
ZaaS features are fed back into Zoe itself to improve the user experience in real time:
1. Search Recommendation
ZaaS-extracted features serve as input signals to a dedicated recommendation model that also incorporates other feature sources (user profile, browsing history, plan data). The model blends ZaaS-derived intents and entities with these complementary signals to produce contextually relevant search suggestions within Zoe.
2. Service Ranking
Extracted features (e.g., service, medical_condition, recommended_treatment) are used to re-rank benefit services presented to the user, prioritizing results that match the user’s inferred needs from the conversation.
3. Main Page Personalization
The Zoe main page / home screen surfaces personalized content based on the user’s historical conversation themes. for example, highlighting mental health resources if a user has repeatedly asked about therapy coverage.
B. Personalization & Content Generation
4. Dynamic Content Generation
- DCG is Healthee’s content generation system that produces personalized health and benefits articles for users.
- It consumes three input signals: (1) user profile demographics, (2) search history from Find Care, and (3) ZaaS-extracted events/features from conversations.
- ZaaS features enable DCG to generate articles that are topically relevant to what users are actually asking about, e.g., generating an article about telehealth options when ZaaS detects high _care_modality: Telehealth frequency.
- DCG operates at both levels: at the individual user level, it generates articles tailored to a specific user’s health interests and benefit usage patterns as surfaced by ZaaS; at the aggregate/cohort level, it identifies trending topics across user segments (e.g., a spike in mental_health features across a given employer’s workforce) and produces broadly relevant content that addresses emerging needs at scale.
5. Podcast
- Healthee generates personalized audio podcasts for each user on a biweekly cadence. ZaaS features drive the topic selection and narrative arc: each user’s extracted intents, conditions, and benefit interests are used to compose a podcast episode that speaks directly to their health and benefits situation. For example, a user whose recent conversations surfaced mental_health, telehealth, and cost_concern features would receive an episode covering virtual therapy options, in-network provider availability, and cost-saving strategies under their plan.
- This creates a scalable, high-touch engagement channel where every user receives content that feels curated for them, powered entirely by the semantic signals ZaaS extracts from their conversational history.
C. Analytics & Insights
6. Stats & Dashboards
- The zaas_drill_down table serves as the primary interface for BI tools, enabling dashboards that show feature distributions by company, time period, and feature category.
- Employer-level analytics aggregate ZaaS features across all employees of a given company, surfacing patterns like “80% of conversations for Company X involve prescription drug coverage.”
- Supports product experimentation by allowing teams to measure how different benefit designs, copy changes, or conversational flows affect the distribution of extracted intents and themes.
7. Features to Threadz
- ZaaS features are linked back to the original conversation threads (via the zaas_drill_down join), enabling analysts to drill from an aggregate feature trend down to the specific conversations that produced it.
- This “features to threads” traceability is critical for validating extraction quality, investigating anomalies, and grounding analytics in real user language.
D. User Similarity & Segmentation
8. User Similarity / Segmentation
- ZaaS enables a representation-learning approach to user understanding. For each user, the system constructs a multi-dimensional semantic profile by aggregating their extracted features over time, a sparse vector where dimensions correspond to feature types (services, conditions, symptoms, specialists, health goals, sentiment, savviness) and values encode frequency, recency, and co-occurrence patterns.
- These user-level feature vectors are then projected into a shared embedding space where semantic similarity can be computed efficiently. Users who occupy nearby regions of this space share not just surface-level keyword overlap, but deeper behavioral and intent patterns: similar health concerns, similar levels of benefit engagement, similar navigational friction, and similar sentiment trajectories over time.
- The resulting similarity graph supports both hard clustering (discrete user segments for cohort analysis and employer reporting) and soft similarity (continuous nearest-neighbor retrieval for recommendation). Segments can be defined along multiple axes, by health interest profile, by engagement maturity, by sentiment trend, or by benefit utilization pattern, and updated dynamically as new ZaaS features are extracted from ongoing conversations.
9. “Users like you are interested in…”
- The user similarity graph powers collaborative-filtering-style recommendations: given a target user, the system retrieves the K nearest neighbors in the ZaaS embedding space and surfaces benefits, providers, or content that similar users have engaged with but the target user has not yet explored.
- This enables proactive, intent-aware suggestions, “Users with a similar health profile also looked into telehealth therapy options”, that go beyond content-based filtering by leveraging the collective conversational intelligence captured by ZaaS across the user base.
10. Email Campaigns & Push Notifications
- ZaaS-derived user segments and individual feature profiles power targeted outbound engagement campaigns. Campaigns can be triggered at both the individual level (a user whose recent conversations surfaced cost_concern + mental_health features receives a targeted email about lower-cost therapy options) and the segment level (a cohort of users clustered around prescription + insurance_glossary features receives content about formulary navigation during open enrollment).
- This closes the loop between conversational intelligence and proactive outreach: ZaaS extracts the signal, segmentation identifies the audience, and campaigns deliver the right message at the right time.
E. Data Infrastructure
11. Snowflake Data Warehouse (primary analytical store)
- zoe_user_features_exploded, normalized feature table for direct SQL analytics
- zaas_drill_down, enriched view joining features with original conversation text and company context
12. Snowflake Cortex Search (semantic search)
- A Cortex Search Service (ZOE_THREADS_SEARCH) indexes the full text of Zoe conversations using snowflake-arctic-embed-m-v1.5 embeddings
- Powering an internal “Zoe Threads AI Agent” in Snowflake Intelligence that lets internal teams ask questions and get answers grounded in actual conversation data
13. Customer Health & Churn Scoring
- The CUSTOMER_HEALTH_METRICS table aggregates Zoe engagement metrics per company alongside Gong, Intercom, and deal metrics
- Zoe inactivity (>30 days since last question) contributes +15 to the churn risk score
14. Unified Customer Interaction Search
- CUSTOMER_INTERACTIONS_UNIFIED unions Zoe conversations with Gong calls, Intercom support, and product feedback into a single searchable corpus
15. Operational Logging & Cost Analytics
- zoe_tokens tracks per-request token usage, model, execution time, and estimated cost
- benefits_filtering and services_retrieved log Zoe’s internal RAG retrieval and benefit-matching operations
16. User-Level Engagement & Product Analytics
- Zoe engagement metrics (zoe_pv, zoe_provider_click, zoe_benefits_intent, zoe_find_care_intent, zoe_telehealth_intent, zoe_telehealth_click) are propagated into users_overview_unified
- These metrics classify users into discovery, activation, and stickiness stages in product_engagement_mart
- The compass DAG uses Zoe chat counts to compute support_intent_rate and overall_intent_score
Results – Nemotron excels for Zoe’s use cases
Healthee is already live with its platform and has reported thousands of companies leveraging Zoe and its underlying infrastructure for benefits navigation.
Based on an evaluation performed by the Healthee team, Nemotron outperforms other leading models on extraction in feature coverage and consistency, validating it as a strong foundation for ZaaS and downstream personalization and analytics.
The team ran an EDA comparing Nemotron and other leading models for extraction on user conversations, focusing on the volume, consistency, and structure of extracted features (intents, themes, and semantic signals).
Key takeaways from the analysis:
- Nemotron consistently extracts a higher number of structured features per user interaction, indicating broader semantic coverage.
- Strong performance across feature types (intents, keywords, themes), not driven by outliers.
- High overlap with extracted signals from other leading models, while also surfacing additional unique signals, suggesting complementary and additive value rather than noise.
- Statistical tests show a clear and consistent advantage in feature volume for Nemotron across users.
- Feature distributions are stable and well-behaved, making the outputs suitable for downstream ML and analytics.
Overall, the results validate Nemotron as a strong foundation for ZaaS as our centralized semantic and intent layer, with immediate applicability to personalization, recommendations, and behavioral analytics.
Performance highlights:
- Nemotron extracts features across all three tiers (explicit, inferred, and freestyle) consistently, with particular strength in surfacing freestyle features, novel entity types not predefined in the taxonomy, demonstrating broad semantic coverage beyond the predefined feature taxonomy.
- Nemotron’s self-hosted NIM deployment on internal infrastructure gives Healthee full control over the inference stack, no external API dependency, data never leaves the VPC, and per-token cost is a fraction of commercial alternatives.
- The pipeline processes threads in parallel (10 concurrent workers), with built-in retry logic and extended timeouts (120s) tuned for the self-hosted endpoint.
- Feature distributions are stable and well-behaved across users and time periods, making the outputs suitable for downstream ML models, segmentation, and production analytics without additional normalization.
Future Plans
As ZaaS matures, the Nemotron-based semantic layer can:
- Power more granular personalization across the Healthee user journey
- Enable richer analytics and decision support for employers
- Support new ML products (for example, predictive models on top of ZaaS features)
- Serve as a reusable semantic substrate for future agentic workflows across the platform
On NVIDIA’s side, Nemotron continues to evolve as a family of models, datasets, and best practices for building agentic AI systems, with a focus on openness, efficiency, and deployment flexibility across NVIDIA GPUs.
Next steps:
- Fine-tuning on domain data: Leverage Nemotron’s open weights to fine-tune on Healthee’s annotated conversation corpus, targeting higher precision on healthcare-specific entity types (insurance terms, provider specialties, benefit categories).
- Confidence scoring: Introduce per-feature confidence scores using Nemotron’s logprob outputs, enabling downstream systems to threshold or weight features by extraction confidence.
- Multi-lingual support: Extend ZaaS to support Spanish and other languages prevalent in Healthee’s user base, leveraging Nemotron’s multilingual capabilities.
- Tighter Cortex integration: Feed ZaaS-extracted features directly into the Cortex Search index to enable faceted semantic search (e.g., “show me all conversations where users expressed cost concern about mental health services”).
- Real-time streaming extraction: Move from batch-oriented parallel processing to a streaming architecture where features are extracted in near-real-time as conversations complete, enabling in-session personalization.
- Feature store integration: Publish user-level feature vectors (aggregated ZaaS features over time) to a feature store for consumption by ML models (churn prediction, engagement scoring, recommendation systems).
About the authors

Inception Partner Manager, NVIDIA

Senior AI Solutions Architect, NVIDIA