Contract-First Modular Workflow Tutorial
Introduction
This tutorial demonstrates the Contract-First Modular Workflow - a best practice approach for building scalable, maintainable data science projects in DataCards. This pattern emphasizes defining clear interfaces between components before implementation, ensuring better collaboration, testing, and system evolution.
📚 Learn More: For a deeper understanding of the theoretical foundations and design principles behind this approach, see our Contract-First Modular Workflow best practices guide.
The Problem: Monolithic Notebooks
Traditional data science workflows often use monolithic notebooks - single, extensive scripts that combine all stages of analysis from data loading to results visualization. For example, consider this Student Stress Analysis notebook on Kaggle, which contains:
- Data loading and preprocessing
- Exploratory data analysis (EDA)
- Feature engineering
- Model training and evaluation
- Visualization and reporting
All in one massive notebook! While this approach works for quick prototyping, it creates several problems:
- Hard to maintain: Changes in one section can break others
- Difficult to collaborate: Multiple people can't work on different parts simultaneously
- Poor reusability: Can't easily reuse individual components
- Slow iteration: Must rerun entire notebook for small changes
- Memory inefficient: All code runs even when only one part is needed
The Solution: Modular DataCards Workflow
The Contract-First Modular Workflow breaks this monolithic approach into five distinct, interconnected components:
- Installation & Data Upload - Centralized environment setup
- Data Preparation - Cleaning and feature engineering
- Input/Filter Cards - User interaction controls
- Business Logic/Model - Core analytical computations
- Visualization/Results - Presentation and communication
Each component has a clear responsibility and well-defined interfaces, enabling:
- Independent development: Teams can work on different components simultaneously
- Selective execution: Only run the components you need
- Easy maintenance: Changes are isolated to specific components
- Better testing: Each component can be tested independently
- Improved collaboration: Clear contracts between components
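To make these contracts concrete, here is a minimal sketch of the interfaces this tutorial builds, written as an ordinary Python dictionary that lists the DataCards variables each notebook publishes and consumes. The keys match the ones used in the steps below; the dictionary itself is documentation, not a DataCards API call.
# Contract between notebooks, expressed as the DataCards variable keys
# each one publishes and consumes. Documentation only, not an API call.
WORKFLOW_CONTRACT = {
    "Installation & Data Load": {
        "publishes": ["raw_stress_data"],
        "consumes": [],
    },
    "Data Preparation": {
        "publishes": ["clean_stress_data"],
        "consumes": ["raw_stress_data"],
    },
    "Input Controls": {
        "publishes": ["selected_stress_level", "min_academic_performance",
                      "max_academic_performance", "social_support_threshold",
                      "include_financial_stress"],
        "consumes": ["clean_stress_data"],
    },
    "Business Logic & Modeling": {
        "publishes": ["model_r2_score", "model_mae", "selected_model",
                      "filtered_data_count"],
        "consumes": ["clean_stress_data", "selected_stress_level",
                     "min_academic_performance", "max_academic_performance",
                     "social_support_threshold", "include_financial_stress"],
    },
    "Visualization & Results": {
        "publishes": [],  # this notebook publishes cards rather than variables
        "consumes": ["clean_stress_data", "filtered_data_count",
                     "model_r2_score", "selected_model", "selected_stress_level"],
    },
}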
Step 1: Installation & Data Upload Notebook
Objective
- Install required dependencies & data files (once, in a central place)
- Ensure reproducibility across all other notebooks
- Avoid having to re-install in every card
- Establish a single source of truth for data
Exercise Instructions
- Create a new notebook and name it Installation & Data Load.
- Install the required libraries (this step only needs to be run once per environment):
!pip install pandas numpy matplotlib seaborn scikit-learn joblib kaggle
!pip install --upgrade setuptools
- Import the necessary libraries for subsequent operations:
import pandas as pd
import numpy as np
import kagglehub
- Download the dataset from Kaggle:
# Download latest version
path = kagglehub.dataset_download("poushal02/student-academic-stress-real-world-dataset")
print("Path to dataset files:", path)- Load the dataset:
# Load the main dataset
df = pd.read_csv(f"{path}/academic Stress level - maintainance 1.csv")Note: In the original monolithic notebookĀ , this data loading would be mixed with preprocessing, EDA, and modeling all in one cell. Here, we isolate it to establish a clean data contract.
- Publish the raw dataset as a DataCards variable, making it available to other notebooks:
dc.publish.variable(key="raw_stress_data", value=df)- Preview the loaded data to confirm successful loading:
df.head()
df.info()
Conceptual Takeaways
- Why centralize environment setup? Having a dedicated notebook for installation and data uploading ensures all team members use the same versions of dependencies and data, improving consistency and minimizing setup errors.
- Reproducibility: Storing data in shared variables allows the project to recover from kernel crashes or VM resets quickly without repeated data loading or reprocessing (see the short recovery sketch after this list).
- Modularity: Separating environment setup and data ingestion from data preparation and modeling fosters a clean, maintainable workflow, reducing duplication of effort and enhancing collaboration.
- Scalability: This approach serves as the foundation for larger projects where multiple notebooks interact, helping manage computational resources more effectively.
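As a concrete example of that recovery path (a minimal sketch, assuming the Installation & Data Load notebook has already published raw_stress_data), any notebook can re-consume the shared variable after a restart instead of repeating the Kaggle download:
# After a kernel crash or VM reset, re-consume the published variable
# instead of re-running the Kaggle download (assumes raw_stress_data exists).
df = dc.consume.variable.raw_stress_data()
print("Recovered dataset shape:", df.shape)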
Step 2: Data Preparation Notebook
Objective
- Load, clean, and prepare the dataset for further analysis or modeling
- Keep environment setup and data logic modular
- Enable faster reruns after crashes or kernel restarts
- Apply data quality checks and transformations
Exercise Instructions
- Create a new notebook and name it Data Preparation.
- Load the raw dataset published in the previous step and import the necessary libraries:
import pandas as pd
import numpy as np
df = dc.consume.variable.raw_stress_data()
- Explore the data structure:
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nFirst few rows:")
print(df.head())
- Clean the data by handling missing values and inconsistencies:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
# Drop rows with missing values
df_clean = df.dropna()
# Reset index after dropping rows
df_clean = df_clean.reset_index(drop=True)
- Create derived features for better analysis:
# First, let's see what columns we have to work with
print("Available columns:")
print(df_clean.columns.tolist())
# Create a stress level score using available numeric columns
# Adjust these column names based on your actual dataset
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns available: {numeric_cols}")
# Use the first few numeric columns for stress score calculation
# (adjust this based on what makes sense for your dataset)
if len(numeric_cols) >= 3:
    df_clean['stress_score'] = df_clean[numeric_cols[:3]].mean(axis=1)
    print(f"Created stress_score using columns: {numeric_cols[:3]}")
else:
    # Fallback: use all available numeric columns
    df_clean['stress_score'] = df_clean[numeric_cols].mean(axis=1)
    print(f"Created stress_score using all numeric columns: {numeric_cols}")
# Create categorical stress level
df_clean['stress_level'] = pd.cut(
df_clean['stress_score'],
bins=3,
labels=['Low', 'Medium', 'High']
)
print(f"Stress score range: {df_clean['stress_score'].min():.2f} to {df_clean['stress_score'].max():.2f}")
print(f"Stress level distribution:")
print(df_clean['stress_level'].value_counts())
Contrast with Monolithic Approach: In the original monolithic notebook, feature engineering would be scattered throughout the analysis. Here, we centralize all data transformations in one dedicated notebook, making them reusable and maintainable.
- Publish the cleaned dataset as a new DataCards variable:
dc.publish.variable(key="clean_stress_data", value=df_clean)- Display summary statistics:
df_clean.describe()
Conceptual Takeaways
- Modularity: Separating data cleaning and feature engineering into its own notebook keeps the workflow organized and logical, making it easier to maintain and debug.
- Efficiency: By publishing cleaned data as a separate variable, other notebooks can consume prepared data quickly without repeating time-consuming cleaning steps.
- Robustness: This structure supports smoother recovery from interruptions, since the environment setup and raw data loading happen independently from data transformation.
- Extensibility: The prepared data can be extended with additional features or transformations in this notebook before it feeds into modeling or visualization stages.
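For example, a minimal sketch of one such extension, adding a standardized version of the derived stress score before republishing. The column name stress_score_z is illustrative and not part of the original dataset:
# Illustrative extra feature: a z-scored stress score (hypothetical column name).
df_clean['stress_score_z'] = (
    (df_clean['stress_score'] - df_clean['stress_score'].mean())
    / df_clean['stress_score'].std()
)
# Republish so downstream notebooks pick up the extended schema.
dc.publish.variable(key="clean_stress_data", value=df_clean)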
Step 3: Input/Filter Cards Notebook
Objective
- Separate filtering logic from data cleaning and final presentation
- Enable dynamic filter generation and flexible card layout
- Optimize memory usage by consolidating filter controls in a single notebook
- Create interactive user controls for data exploration
Exercise Instructions
- Create a new notebook and name it Input Controls.
- Initialize filter variables with default values:
dc.publish.variable("selected_stress_level", "Medium") # Default stress level
dc.publish.variable("min_academic_performance", 3.0)
dc.publish.variable("max_academic_performance", 5.0)
dc.publish.variable("social_support_threshold", 3.0)
dc.publish.variable("include_financial_stress", True) # True = Include, False = Exclude- Load the cleaned data from the previous notebook:
df = dc.consume.variable.clean_stress_data()
- Generate options for the stress level combobox dynamically:
options = [str(level) for level in df["stress_level"].unique()]
print("Available stress levels:", options)- Publish a combobox card for stress level selection:
dc.publish.card(
type='combobox',
label='Stress Level',
options=options,
variable_key='selected_stress_level',
logic_view_size=(2,2),
layout=[{"size": (3,2), "position": (0,0), "deck": "default-deck"}]
)
- Publish slider cards for academic performance range:
dc.publish.card(
type='floatSlider',
label='Min Academic Performance',
unit='[1-5 scale]',
min=1.0,
max=5.0,
step=0.1,
variable_key='min_academic_performance',
logic_view_size=(3,1),
layout=[{"size": (3,2), "position": (0,2), "deck": "default-deck"}]
)
dc.publish.card(
type='floatSlider',
label='Max Academic Performance',
unit='[1-5 scale]',
min=1.0,
max=5.0,
step=0.1,
variable_key='max_academic_performance',
logic_view_size=(3,1),
layout=[{"size": (3,2), "position": (0,4), "deck": "default-deck"}]
)
- Publish additional filter controls:
dc.publish.card(
type='floatSlider',
label='Social Support Threshold',
unit='[1-5 scale]',
min=1.0,
max=5.0,
step=0.1,
variable_key='social_support_threshold',
logic_view_size=(3,1),
layout=[{"size": (3,2), "position": (3,0), "deck": "default-deck"}]
)
dc.publish.card(
type='toggle',
label='Include Financial Stress',
variable_key='include_financial_stress',
logic_view_size=(2,1),
layout=[{"size": (3,2), "position": (3,2), "deck": "default-deck"}]
)
Conceptual Takeaways
- Separation of concerns: Keeping filtering controls distinct from data cleaning and modeling promotes a modular, maintainable architecture.
- Dynamic UI generation: Filter options can adapt automatically based on the dataset, which improves flexibility across projects (see the sketch after this list).
- Memory optimization: Combining all filters in a single notebook reduces the number of active notebooks, saving RAM and improving performance.
- User interaction: Creating interactive filter cards provides a user-friendly way for stakeholders or analysts to explore and customize data views.
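As a small sketch of that idea, slider bounds can also be derived from the published data instead of being hard-coded to the 1-5 scale. The variable key stress_score_threshold and the card position below are illustrative; the card parameters mirror the floatSlider examples above:
# Derive slider bounds from the data instead of hard-coding them.
# Assumes clean_stress_data (with the derived stress_score column) is published.
df = dc.consume.variable.clean_stress_data()
score_min = float(df['stress_score'].min())
score_max = float(df['stress_score'].max())
dc.publish.variable("stress_score_threshold", score_min)  # hypothetical filter variable
dc.publish.card(
    type='floatSlider',
    label='Stress Score Threshold',
    unit='',
    min=score_min,
    max=score_max,
    step=0.1,
    variable_key='stress_score_threshold',
    logic_view_size=(3,1),
    layout=[{"size": (3,2), "position": (3,4), "deck": "default-deck"}]  # illustrative position
)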
Step 4: Business Logic / Model Notebook
Objective
- Isolate core analytical and predictive computations apart from data preparation and visualization
- Enable iterative experimentation with different modeling approaches without repeating data loading or plotting
- Improve maintainability by allowing changes in business logic or model architecture without impacting other workflow components
- Apply contract-first principles by defining clear input/output interfaces
Exercise Instructions
- Create a new notebook and name it Business Logic & Modeling.
- Import required libraries:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
- Load prepared data and user inputs published from previous notebooks:
df = dc.consume.variable.clean_stress_data()
selected_stress_level = dc.consume.variable.selected_stress_level()
min_academic = dc.consume.variable.min_academic_performance()
max_academic = dc.consume.variable.max_academic_performance()
social_threshold = dc.consume.variable.social_support_threshold()
include_financial = dc.consume.variable.include_financial_stress()
- Apply user filters to the dataset:
# Filter by stress level
filtered_df = df[df['stress_level'] == selected_stress_level].copy()
# Get numeric columns for filtering (excluding stress_score and stress_level)
numeric_cols = filtered_df.select_dtypes(include=[np.number]).columns.tolist()
filterable_cols = [col for col in numeric_cols if col not in ['stress_score', 'stress_level']]
print(f"Available numeric columns for filtering: {filterable_cols}")
# Apply filters using available columns (adjust based on your dataset)
if len(filterable_cols) >= 1:
    # Use first numeric column for min/max filtering
    col1 = filterable_cols[0]
    filtered_df = filtered_df[
        (filtered_df[col1] >= min_academic) &
        (filtered_df[col1] <= max_academic)
    ]
    print(f"Filtered by {col1} range: {min_academic} - {max_academic}")
if len(filterable_cols) >= 2:
    # Use second numeric column for threshold filtering
    col2 = filterable_cols[1]
    filtered_df = filtered_df[filtered_df[col2] >= social_threshold]
    print(f"Filtered by {col2} threshold: >= {social_threshold}")
if len(filterable_cols) >= 3 and not include_financial:
    # Use third numeric column for conditional filtering
    col3 = filterable_cols[2]
    filtered_df = filtered_df[filtered_df[col3] < 3.0]
    print(f"Filtered by {col3} < 3.0 (exclude high values)")
- Print filtering summary:
print(f"Filtered dataset shape: {filtered_df.shape}")
print(f"Original dataset shape: {df.shape}")
print(f"Filtered for stress level: {selected_stress_level}")
print(f"Academic performance range: {min_academic} - {max_academic}")
print(f"Social support threshold: {social_threshold}")
print(f"Include financial stress: {include_financial}")- Prepare features and target for modeling:
# Select features for modeling (use numeric columns, excluding stress_score)
numeric_cols = filtered_df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in numeric_cols if col != 'stress_score']
print(f"Using features: {feature_cols}")
X = filtered_df[feature_cols]
y = filtered_df['stress_score']
print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- Train multiple models for comparison:
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_r2 = r2_score(y_test, lr_pred)
lr_mae = mean_absolute_error(y_test, lr_pred)
# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_r2 = r2_score(y_test, rf_pred)
rf_mae = mean_absolute_error(y_test, rf_pred)
- Compare model performance:
print("=" * 60)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 60)
print(f"Linear Regression - R²: {lr_r2:.4f}, MAE: {lr_mae:.4f}")
print(f"Random Forest - R²: {rf_r2:.4f}, MAE: {rf_mae:.4f}")
print("=" * 60)- Select the best model and publish results:
# Use the model with better R² score
if rf_r2 > lr_r2:
    best_model = rf_model
    best_pred = rf_pred
    best_r2 = rf_r2
    best_mae = rf_mae
    model_name = "Random Forest"
else:
    best_model = lr_model
    best_pred = lr_pred
    best_r2 = lr_r2
    best_mae = lr_mae
    model_name = "Linear Regression"
# Publish model performance metrics
dc.publish.variable("model_r2_score", best_r2)
dc.publish.variable("model_mae", best_mae)
dc.publish.variable("selected_model", model_name)
dc.publish.variable("filtered_data_count", len(filtered_df))- Publish a number card displaying model performance:
dc.publish.card(
type='number',
value=best_r2,
label=f'{model_name} R² Score',
unit='',
decimals=3,
logic_view_size=(2, 1),
layout=[{"size": (3,2), "position": (6,0), "deck": "default-deck"}]
)
Contrast with Monolithic Approach: In the original monolithic notebook, model training, evaluation, and visualization would all be mixed together. Here, we isolate the modeling logic, making it easier to experiment with different algorithms and hyperparameters.
Conceptual Takeaways
- Separation of concerns: Complex modeling logic is isolated from data preparation and visualization to enhance clarity and maintainability.
- Iterative modeling: This setup facilitates trying out different models or analytical approaches without redoing upstream steps.
- Model transparency: Publishing performance metrics and model selection criteria helps understand model behavior and fit quality.
- Contract-first approach: Clear input/output interfaces between notebooks enable independent development and testing of components (see the sketch after this list).
- Publishing results: Sharing predictions and metrics as variables and cards supports smooth integration into dashboards or further analysis steps.
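As a minimal sketch of what testing against the contract could look like, the cell below checks the consumed inputs before any modeling runs. It assumes the column names from Step 2 and the 1-5 slider scale from Step 3; the checks themselves are illustrative, not a DataCards feature:
# Illustrative contract check: validate consumed inputs before the modeling logic runs.
df = dc.consume.variable.clean_stress_data()
required_columns = {'stress_score', 'stress_level'}
missing = required_columns - set(df.columns)
assert not missing, f"Upstream contract violated; missing columns: {missing}"
# Filter inputs should respect the 1-5 scale declared on the slider cards.
min_academic = dc.consume.variable.min_academic_performance()
max_academic = dc.consume.variable.max_academic_performance()
assert 1.0 <= min_academic <= max_academic <= 5.0, "Academic performance range is inconsistent"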
Step 5: Visualization Notebook / Result Display
Objective
- Communicate findings effectively by turning complex insights into clear, intuitive visuals
- Support decision-making with business-ready presentations suitable for non-technical stakeholders
- Separate presentation from computation to enable independent updating of visuals without rerunning heavy data processing
- Demonstrate the power of modular visualization components
Exercise Instructions
- Create a new notebook and name it Visualization & Results.
- Import required libraries and enable dark mode styling:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.style.use('datacards-dark-mode')
- Consume required variables from previous notebooks:
df = dc.consume.variable.clean_stress_data()
filtered_count = dc.consume.variable.filtered_data_count()
model_r2 = dc.consume.variable.model_r2_score()
selected_model = dc.consume.variable.selected_model()
selected_stress_level = dc.consume.variable.selected_stress_level()
print(f"Creating visualization for {selected_stress_level} stress level")
print(f"Filtered data points: {filtered_count}")
print(f"Selected model: {selected_model} (R² = {model_r2:.3f})")- Create a comprehensive dashboard with multiple visualizations:
# Create figure with subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=datacards.utilities.plotting.plot_area(rows=4, columns=6))
- Plot 1: Stress Score Distribution:
# Filter data for selected stress level
stress_data = df[df['stress_level'] == selected_stress_level]['stress_score']
ax1.hist(stress_data, bins=20, alpha=0.7, color='#ff6b9d', edgecolor='white', linewidth=0.5)
ax1.set_xlabel('Stress Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Stress Score Distribution')
ax1.grid(True, alpha=0.3)
- Plot 2: Feature Comparison:
scatter_data = df[df['stress_level'] == selected_stress_level]
numeric_cols = scatter_data.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in numeric_cols if col not in ['stress_score', 'stress_level']]
if len(feature_cols) >= 2:
    ax2.scatter(scatter_data[feature_cols[0]], scatter_data[feature_cols[1]],
                c=scatter_data['stress_score'], cmap='viridis', alpha=0.6)
    ax2.set_xlabel(feature_cols[0])
    ax2.set_ylabel(feature_cols[1])
    ax2.set_title(f'{feature_cols[0]} vs {feature_cols[1]}')
else:
    ax2.text(0.5, 0.5, 'Not enough numeric features\nfor scatter plot',
             ha='center', va='center', transform=ax2.transAxes)
    ax2.set_title('Feature Comparison')
ax2.grid(True, alpha=0.3)
- Plot 3: Feature Importance (if Random Forest was selected):
if selected_model == "Random Forest":
    # This would require the model to be published as a variable
    # For demonstration, create a mock feature importance
    features = ['Academic Performance', 'Social Support', 'Financial Stress']
    importance = [0.4, 0.35, 0.25] # Mock values
    ax3.bar(features, importance, color='#4ecdc4')
    ax3.set_ylabel('Feature Importance')
    ax3.set_title('Feature Importance')
    ax3.tick_params(axis='x', rotation=45)
else:
    ax3.text(0.5, 0.5, f'{selected_model}\nSelected', ha='center', va='center', transform=ax3.transAxes)
    ax3.set_title('Model Type')
- Plot 4: Model Performance Summary:
metrics = ['R² Score', 'Data Points']
values = [model_r2, filtered_count]
bars = ax4.bar(metrics, values, color=['#ffd93d', '#4ecdc4'])
ax4.set_ylabel('Value')
ax4.set_title('Model Performance')
ax4.tick_params(axis='x', rotation=45)
# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{value:.3f}' if isinstance(value, float) else f'{value}',
             ha='center', va='bottom')
- Adjust layout and publish the visualization:
plt.tight_layout()
dc.publish.card(
type='matplotlib',
fig=fig,
label=f'Stress Analysis Dashboard - {selected_stress_level}',
logic_view_size=(6, 4),
layout=[{"size": (15, 8), "position": (10, 0), "deck": "default-deck"}]
)
- Create additional summary cards:
# Data summary card
dc.publish.card(
type='number',
value=filtered_count,
label='Filtered Data Points',
unit='records',
decimals=0,
logic_view_size=(2, 1),
layout=[{"size": (3,2), "position": (6,2), "deck": "default-deck"}]
)
# Model performance card
dc.publish.card(
type='number',
value=model_r2,
label=f'{selected_model} R²',
unit='',
decimals=3,
logic_view_size=(2, 1),
layout=[{"size": (3,2), "position": (6,4), "deck": "default-deck"}]
)
Contrast with Monolithic Approach: In the original monolithic notebook, visualizations would be scattered throughout the analysis, making it hard to update or modify them. Here, we centralize all visualization logic, making it easy to iterate on the presentation layer.
Conceptual Takeaways
- Effective communication: Visualizing distributions and relationships helps stakeholders quickly grasp the data and understand model performance in context.
- Separation of concerns: This notebook focuses solely on presentation; analytical computations are kept separate for modularity and performance.
- Interactivity readiness: Published cards can be integrated into dashboards, enabling dynamic interactions with filtered or predicted data.
- Design principles: Following DataCards best practices ensures clean, professional dashboards that communicate insights effectively.
- Modular visualization: Each visualization component can be updated independently, allowing for rapid iteration and customization.
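As a small illustration of that independence, a single chart can be rebuilt and republished on its own without rerunning the modeling notebook or the full dashboard figure. This is a minimal sketch reusing the variables and matplotlib card type shown above; the layout position is illustrative:
# Rebuild and republish a single chart; upstream computations stay untouched.
import matplotlib.pyplot as plt
df = dc.consume.variable.clean_stress_data()
selected_stress_level = dc.consume.variable.selected_stress_level()
fig_single, ax = plt.subplots()
scores = df[df['stress_level'] == selected_stress_level]['stress_score']
ax.hist(scores, bins=20, color='#ff6b9d', edgecolor='white')
ax.set_xlabel('Stress Score')
ax.set_ylabel('Frequency')
ax.set_title(f'Stress Score Distribution - {selected_stress_level}')
dc.publish.card(
    type='matplotlib',
    fig=fig_single,
    label='Stress Score Distribution (standalone)',
    logic_view_size=(3, 2),
    layout=[{"size": (6, 4), "position": (18, 0), "deck": "default-deck"}]  # illustrative position
)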
Conclusion
This tutorial demonstrated how to refactor the monolithic approach used in notebooks like the Student Stress Analysis into the Contract-First Modular Workflow pattern, showing how to:
- Centralize setup in a dedicated installation notebook
- Separate data preparation from analysis and presentation
- Create interactive controls for user input and filtering
- Isolate business logic for maintainable modeling
- Build modular visualizations for effective communication
Key Benefits Over Monolithic Approach
This modular approach promotes:
- Maintainability: Each component has a clear responsibility
- Scalability: Components can be developed and tested independently
- Collaboration: Clear interfaces enable team development
- Flexibility: Easy to modify or extend individual components
- Performance: Optimized resource usage and faster iteration cycles
- Reusability: Components can be reused across different projects
By following this pattern, you can build robust, scalable data science applications that are easy to maintain, extend, and collaborate on - a significant improvement over traditional monolithic notebook approaches.