Contract-First Modular Workflow Tutorial
Introduction
This tutorial demonstrates the Contract-First Modular Workflow - a best practice approach for building scalable, maintainable data science projects in DataCards. This pattern emphasizes defining clear interfaces between components before implementation, ensuring better collaboration, testing, and system evolution.
📚 Learn More: For a deeper understanding of the theoretical foundations and design principles behind this approach, see our Contract-First Modular Workflow best practices guide.
The Problem: Monolithic Notebooks
Traditional data science workflows often use monolithic notebooks - single, extensive scripts that combine all stages of analysis from data loading to results visualization. For example, consider this Student Stress Analysis notebook on Kaggle, which contains:
- Data loading and preprocessing
- Exploratory data analysis (EDA)
- Feature engineering
- Model training and evaluation
- Visualization and reporting
All in one massive notebook! While this approach works for quick prototyping, it creates several problems:
- Hard to maintain: Changes in one section can break others
- Difficult to collaborate: Multiple people can't work on different parts simultaneously
- Poor reusability: Can't easily reuse individual components
- Slow iteration: Must rerun entire notebook for small changes
- Memory inefficient: All code runs even when only one part is needed
The Solution: Modular DataCards Workflow
The Contract-First Modular Workflow breaks this monolithic approach into five distinct, interconnected components:
- Installation & Data Upload - Centralized environment setup
- Data Preparation - Cleaning and feature engineering
- Input/Filter Cards - User interaction controls
- Business Logic/Model - Core analytical computations
- Visualization/Results - Presentation and communication
Each component has a clear responsibility and well-defined interfaces, enabling:
- Independent development: Teams can work on different components simultaneously
- Selective execution: Only run the components you need
- Easy maintenance: Changes are isolated to specific components
- Better testing: Each component can be tested independently
- Improved collaboration: Clear contracts between components
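To make these contracts concrete, here is a minimal sketch of the interfaces this tutorial builds, written as an ordinary Python dictionary that lists the DataCards variables each notebook publishes and consumes. The keys match the ones used in the steps below; the dictionary itself is documentation, not a DataCards API call.
# Contract between notebooks, expressed as the DataCards variable keys
# each one publishes and consumes. Documentation only, not an API call.
WORKFLOW_CONTRACT = {
    "Installation & Data Load": {
        "publishes": ["raw_stress_data"],
        "consumes": [],
    },
    "Data Preparation": {
        "publishes": ["clean_stress_data"],
        "consumes": ["raw_stress_data"],
    },
    "Input Controls": {
        "publishes": ["selected_stress_level", "min_academic_performance",
                      "max_academic_performance", "social_support_threshold",
                      "include_financial_stress"],
        "consumes": ["clean_stress_data"],
    },
    "Business Logic & Modeling": {
        "publishes": ["model_r2_score", "model_mae", "selected_model",
                      "filtered_data_count"],
        "consumes": ["clean_stress_data", "selected_stress_level",
                     "min_academic_performance", "max_academic_performance",
                     "social_support_threshold", "include_financial_stress"],
    },
    "Visualization & Results": {
        "publishes": [],  # this notebook publishes cards rather than variables
        "consumes": ["clean_stress_data", "filtered_data_count",
                     "model_r2_score", "selected_model", "selected_stress_level"],
    },
}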
Step 1: Installation & Data Upload Notebook
Objective
- Install required dependencies & data files (once, in a central place)
- Ensure reproducibility across all other notebooks
- Avoid having to re-install in every card
- Establish a single source of truth for data
Exercise Instructions
- Create a new notebook and name it Installation & Data Load.
- Install the required libraries (this step only needs to be run once per environment):
!pip install pandas numpy matplotlib seaborn scikit-learn joblib kaggle
!pip install --upgrade setuptools
- Import the necessary libraries for subsequent operations:
import pandas as pd
import numpy as np
import kagglehub
- Download the dataset from Kaggle:
# Download latest version
path = kagglehub.dataset_download("poushal02/student-academic-stress-real-world-dataset")
print("Path to dataset files:", path)- Load the dataset:
# Load the main dataset
df = pd.read_csv(f"{path}/academic Stress level - maintainance 1.csv")Note: In the original monolithic notebookĀ , this data loading would be mixed with preprocessing, EDA, and modeling all in one cell. Here, we isolate it to establish a clean data contract.
- Publish the raw dataset as a DataCards variable, making it available to other notebooks:
dc.publish.variable(key="raw_stress_data", value=df)- Preview the loaded data to confirm successful loading:
df.head()
df.info()
Conceptual Takeaways
- Why centralize environment setup? Having a dedicated notebook for installation and data uploading ensures all team members use the same versions of dependencies and data, improving consistency and minimizing setup errors.
- Reproducibility: Storing data in shared variables allows the project to recover from kernel crashes or VM resets quickly without repeated data loading or reprocessing (see the short recovery sketch after this list).
- Modularity: Separating environment setup and data ingestion from data preparation and modeling fosters a clean, maintainable workflow, reducing duplication of effort and enhancing collaboration.
- Scalability: This approach serves as the foundation for larger projects where multiple notebooks interact, helping manage computational resources more effectively.
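As a concrete example of that recovery path (a minimal sketch, assuming the Installation & Data Load notebook has already published raw_stress_data), any notebook can re-consume the shared variable after a restart instead of repeating the Kaggle download:
# After a kernel crash or VM reset, re-consume the published variable
# instead of re-running the Kaggle download (assumes raw_stress_data exists).
df = dc.consume.variable.raw_stress_data()
print("Recovered dataset shape:", df.shape)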
Step 2: Data Preparation Notebook
Objective
- Load, clean, and prepare the dataset for further analysis or modeling
- Keep environment setup and data logic modular
- Enable faster reruns after crashes or kernel restarts
- Apply data quality checks and transformations
Exercise Instructions
- Create a new notebook and name it Data Preparation.
- Load the raw dataset published in the previous step and import the necessary libraries:
import pandas as pd
import numpy as np
df = dc.consume.variable.raw_stress_data()
- Explore the data structure:
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nFirst few rows:")
print(df.head())
- Clean the data by handling missing values and inconsistencies:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
# Drop rows with missing values
df_clean = df.dropna()
# Reset index after dropping rows
df_clean = df_clean.reset_index(drop=True)
- Create derived features for better analysis:
# First, let's see what columns we have to work with
print("Available columns:")
print(df_clean.columns.tolist())
# Create a stress level score using available numeric columns
# Adjust these column names based on your actual dataset
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns available: {numeric_cols}")
# Use the first few numeric columns for stress score calculation
# (adjust this based on what makes sense for your dataset)
if len(numeric_cols) >= 3:
    df_clean['stress_score'] = df_clean[numeric_cols[:3]].mean(axis=1)
    print(f"Created stress_score using columns: {numeric_cols[:3]}")
else:
    # Fallback: use all available numeric columns
    df_clean['stress_score'] = df_clean[numeric_cols].mean(axis=1)
    print(f"Created stress_score using all numeric columns: {numeric_cols}")
# Create categorical stress level
df_clean['stress_level'] = pd.cut(
df_clean['stress_score'],
bins=3,
labels=['Low', 'Medium', 'High']
)
print(f"Stress score range: {df_clean['stress_score'].min():.2f} to {df_clean['stress_score'].max():.2f}")
print(f"Stress level distribution:")
print(df_clean['stress_level'].value_counts())
Contrast with Monolithic Approach: In the original monolithic notebook, feature engineering would be scattered throughout the analysis. Here, we centralize all data transformations in one dedicated notebook, making them reusable and maintainable.
- Publish the cleaned dataset as a new DataCards variable:
dc.publish.variable(key="clean_stress_data", value=df_clean)- Display summary statistics:
df_clean.describe()
Conceptual Takeaways
- Modularity: Separating data cleaning and feature engineering into its own notebook keeps the workflow organized and logical, making it easier to maintain and debug.
- Efficiency: By publishing cleaned data as a separate variable, other notebooks can consume prepared data quickly without repeating time-consuming cleaning steps.
- Robustness: This structure supports smoother recovery from interruptions, since the environment setup and raw data loading happen independently from data transformation.
- Extensibility: The prepared data can be extended with additional features or transformations in this notebook before it feeds into modeling or visualization stages.
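For example, a minimal sketch of one such extension, adding a standardized version of the derived stress score before republishing. The column name stress_score_z is illustrative and not part of the original dataset:
# Illustrative extra feature: a z-scored stress score (hypothetical column name).
df_clean['stress_score_z'] = (
    (df_clean['stress_score'] - df_clean['stress_score'].mean())
    / df_clean['stress_score'].std()
)
# Republish so downstream notebooks pick up the extended schema.
dc.publish.variable(key="clean_stress_data", value=df_clean)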
Step 3: Input/Filter Cards Notebook
Objective
- Separate filtering logic from data cleaning and final presentation
- Enable dynamic filter generation and flexible card layout
- Optimize memory usage by consolidating filter controls in a single notebook
- Create interactive user controls for data exploration
Exercise Instructions
- Create a new notebook and name it Input Controls.
- Initialize filter variables with default values:
dc.publish.variable("selected_stress_level", "Medium") # Default stress level
dc.publish.variable("min_academic_performance", 3.0)
dc.publish.variable("max_academic_performance", 5.0)
dc.publish.variable("social_support_threshold", 3.0)
dc.publish.variable("include_financial_stress", True) # True = Include, False = Exclude- Load the cleaned data from the previous notebook:
df = dc.consume.variable.clean_stress_data()
- Generate options for the stress level combobox dynamically:
options = [str(level) for level in df["stress_level"].unique()]
print("Available stress levels:", options)- Publish a combobox card for stress level selection:
dc.publish.card(
type='combobox',
label='Stress Level',
options=options,
variable_key='selected_stress_level',
logic_view_size=(2,2),
layout=[{"size": (3,2), "position": (0,0), "deck": "default-deck"}]
)
- Publish slider cards for academic performance range:
dc.publish.card(
type='floatSlider',
label='Min Academic Performance',
unit='[1-5 scale]',
min=1.0,
max=5.0,
step=0.1,
variable_key='min_academic_performance',
logic_view_size=(3,1),
layout=[{"size": (3,2), "position": (0,2), "deck": "default-deck"}]
)
dc.publish.card(
type='floatSlider',
label='Max Academic Performance',
unit='[1-5 scale]',
min=1.0,
max=5.0,
step=0.1,
variable_key='max_academic_performance',
logic_view_size=(3,1),
layout=[{"size": (3,2), "position": (0,4), "deck": "default-deck"}]
)
- Publish additional filter controls:
dc.publish.card(
type='floatSlider',
label='Social Support Threshold',
unit='[1-5 scale]',
min=1.0,
max=5.0,
step=0.1,
variable_key='social_support_threshold',
logic_view_size=(3,1),
layout=[{"size": (3,2), "position": (3,0), "deck": "default-deck"}]
)
dc.publish.card(
type='toggle',
label='Include Financial Stress',
variable_key='include_financial_stress',
logic_view_size=(2,1),
layout=[{"size": (3,2), "position": (3,2), "deck": "default-deck"}]
)
Conceptual Takeaways
- Separation of concerns: Keeping filtering controls distinct from data cleaning and modeling promotes a modular, maintainable architecture.
- Dynamic UI generation: Filter options can adapt automatically based on the dataset, which improves flexibility across projects (see the sketch after this list).
- Memory optimization: Combining all filters in a single notebook reduces the number of active notebooks, saving RAM and improving performance.
- User interaction: Creating interactive filter cards provides a user-friendly way for stakeholders or analysts to explore and customize data views.
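As a small sketch of that idea, slider bounds can also be derived from the published data instead of being hard-coded to the 1-5 scale. The variable key stress_score_threshold and the card position below are illustrative; the card parameters mirror the floatSlider examples above:
# Derive slider bounds from the data instead of hard-coding them.
# Assumes clean_stress_data (with the derived stress_score column) is published.
df = dc.consume.variable.clean_stress_data()
score_min = float(df['stress_score'].min())
score_max = float(df['stress_score'].max())
dc.publish.variable("stress_score_threshold", score_min)  # hypothetical filter variable
dc.publish.card(
    type='floatSlider',
    label='Stress Score Threshold',
    unit='',
    min=score_min,
    max=score_max,
    step=0.1,
    variable_key='stress_score_threshold',
    logic_view_size=(3,1),
    layout=[{"size": (3,2), "position": (3,4), "deck": "default-deck"}]  # illustrative position
)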
Step 4: Business Logic / Model Notebook
Objective
- Isolate core analytical and predictive computations apart from data preparation and visualization
- Enable iterative experimentation with different modeling approaches without repeating data loading or plotting
- Improve maintainability by allowing changes in business logic or model architecture without impacting other workflow components
- Apply contract-first principles by defining clear input/output interfaces
Exercise Instructions
- Create a new notebook and name it Business Logic & Modeling.
- Import required libraries:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
- Load prepared data and user inputs published from previous notebooks:
df = dc.consume.variable.clean_stress_data()
selected_stress_level = dc.consume.variable.selected_stress_level()
min_academic = dc.consume.variable.min_academic_performance()
max_academic = dc.consume.variable.max_academic_performance()
social_threshold = dc.consume.variable.social_support_threshold()
include_financial = dc.consume.variable.include_financial_stress()
- Apply user filters to the dataset:
# Filter by stress level
filtered_df = df[df['stress_level'] == selected_stress_level].copy()
# Get numeric columns for filtering (excluding stress_score and stress_level)
numeric_cols = filtered_df.select_dtypes(include=[np.number]).columns.tolist()
filterable_cols = [col for col in numeric_cols if col not in ['stress_score', 'stress_level']]
print(f"Available numeric columns for filtering: {filterable_cols}")
# Apply filters using available columns (adjust based on your dataset)
if len(filterable_cols) >= 1:
    # Use first numeric column for min/max filtering
    col1 = filterable_cols[0]
    filtered_df = filtered_df[
        (filtered_df[col1] >= min_academic) &
        (filtered_df[col1] <= max_academic)
    ]
    print(f"Filtered by {col1} range: {min_academic} - {max_academic}")
if len(filterable_cols) >= 2:
    # Use second numeric column for threshold filtering
    col2 = filterable_cols[1]
    filtered_df = filtered_df[filtered_df[col2] >= social_threshold]
    print(f"Filtered by {col2} threshold: >= {social_threshold}")
if len(filterable_cols) >= 3 and not include_financial:
    # Use third numeric column for conditional filtering
    col3 = filterable_cols[2]
    filtered_df = filtered_df[filtered_df[col3] < 3.0]
    print(f"Filtered by {col3} < 3.0 (exclude high values)")
- Print filtering summary:
print(f"Filtered dataset shape: {filtered_df.shape}")
print(f"Original dataset shape: {df.shape}")
print(f"Filtered for stress level: {selected_stress_level}")
print(f"Academic performance range: {min_academic} - {max_academic}")
print(f"Social support threshold: {social_threshold}")
print(f"Include financial stress: {include_financial}")- Prepare features and target for modeling:
# Select features for modeling (use numeric columns, excluding stress_score)
numeric_cols = filtered_df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in numeric_cols if col != 'stress_score']
print(f"Using features: {feature_cols}")
X = filtered_df[feature_cols]
y = filtered_df['stress_score']
print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- Train multiple models for comparison:
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_r2 = r2_score(y_test, lr_pred)
lr_mae = mean_absolute_error(y_test, lr_pred)
# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_r2 = r2_score(y_test, rf_pred)
rf_mae = mean_absolute_error(y_test, rf_pred)
- Compare model performance:
print("=" * 60)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 60)
print(f"Linear Regression - R²: {lr_r2:.4f}, MAE: {lr_mae:.4f}")
print(f"Random Forest - R²: {rf_r2:.4f}, MAE: {rf_mae:.4f}")
print("=" * 60)- Select the best model and publish results:
# Use the model with better R² score
if rf_r2 > lr_r2:
    best_model = rf_model
    best_pred = rf_pred
    best_r2 = rf_r2
    best_mae = rf_mae
    model_name = "Random Forest"
else:
    best_model = lr_model
    best_pred = lr_pred
    best_r2 = lr_r2
    best_mae = lr_mae
    model_name = "Linear Regression"
# Publish model performance metrics
dc.publish.variable("model_r2_score", best_r2)
dc.publish.variable("model_mae", best_mae)
dc.publish.variable("selected_model", model_name)
dc.publish.variable("filtered_data_count", len(filtered_df))- Publish a number card displaying model performance:
dc.publish.card(
type='number',
value=best_r2,
label=f'{model_name} R² Score',
unit='',
decimals=3,
logic_view_size=(2, 1),
layout=[{"size": (3,2), "position": (6,0), "deck": "default-deck"}]
)
Contrast with Monolithic Approach: In the original monolithic notebook, model training, evaluation, and visualization would all be mixed together. Here, we isolate the modeling logic, making it easier to experiment with different algorithms and hyperparameters.
Conceptual Takeaways
- Separation of concerns: Complex modeling logic is isolated from data preparation and visualization to enhance clarity and maintainability.
- Iterative modeling: This setup facilitates trying out different models or analytical approaches without redoing upstream steps.
- Model transparency: Publishing performance metrics and model selection criteria helps understand model behavior and fit quality.
- Contract-first approach: Clear input/output interfaces between notebooks enable independent development and testing of components (see the sketch after this list).
- Publishing results: Sharing predictions and metrics as variables and cards supports smooth integration into dashboards or further analysis steps.
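As a minimal sketch of what testing against the contract could look like, the cell below checks the consumed inputs before any modeling runs. It assumes the column names from Step 2 and the 1-5 slider scale from Step 3; the checks themselves are illustrative, not a DataCards feature:
# Illustrative contract check: validate consumed inputs before the modeling logic runs.
df = dc.consume.variable.clean_stress_data()
required_columns = {'stress_score', 'stress_level'}
missing = required_columns - set(df.columns)
assert not missing, f"Upstream contract violated; missing columns: {missing}"
# Filter inputs should respect the 1-5 scale declared on the slider cards.
min_academic = dc.consume.variable.min_academic_performance()
max_academic = dc.consume.variable.max_academic_performance()
assert 1.0 <= min_academic <= max_academic <= 5.0, "Academic performance range is inconsistent"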
Step 5: Visualization Notebook / Result Display
Objective
- Communicate findings effectively by turning complex insights into clear, intuitive visuals
- Support decision-making with business-ready presentations suitable for non-technical stakeholders
- Separate presentation from computation to enable independent updating of visuals without rerunning heavy data processing
- Demonstrate the power of modular visualization components
Exercise Instructions
- Create a new notebook and name it Visualization & Results.
- Import required libraries and enable dark mode styling:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.style.use('datacards-dark-mode')
- Consume required variables from previous notebooks:
df = dc.consume.variable.clean_stress_data()
filtered_count = dc.consume.variable.filtered_data_count()
model_r2 = dc.consume.variable.model_r2_score()
selected_model = dc.consume.variable.selected_model()
selected_stress_level = dc.consume.variable.selected_stress_level()
print(f"Creating visualization for {selected_stress_level} stress level")
print(f"Filtered data points: {filtered_count}")
print(f"Selected model: {selected_model} (R² = {model_r2:.3f})")- Create a comprehensive dashboard with multiple visualizations:
# Create figure with subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=datacards.utilities.plotting.plot_area(rows=4, columns=6))
- Plot 1: Stress Score Distribution:
# Filter data for selected stress level
stress_data = df[df['stress_level'] == selected_stress_level]['stress_score']
ax1.hist(stress_data, bins=20, alpha=0.7, color='#ff6b9d', edgecolor='white', linewidth=0.5)
ax1.set_xlabel('Stress Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Stress Score Distribution')
ax1.grid(True, alpha=0.3)
- Plot 2: Feature Comparison:
scatter_data = df[df['stress_level'] == selected_stress_level]
numeric_cols = scatter_data.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in numeric_cols if col not in ['stress_score', 'stress_level']]
if len(feature_cols) >= 2:
    ax2.scatter(scatter_data[feature_cols[0]], scatter_data[feature_cols[1]],
                c=scatter_data['stress_score'], cmap='viridis', alpha=0.6)
    ax2.set_xlabel(feature_cols[0])
    ax2.set_ylabel(feature_cols[1])
    ax2.set_title(f'{feature_cols[0]} vs {feature_cols[1]}')
else:
    ax2.text(0.5, 0.5, 'Not enough numeric features\nfor scatter plot',
             ha='center', va='center', transform=ax2.transAxes)
    ax2.set_title('Feature Comparison')
ax2.grid(True, alpha=0.3)
- Plot 3: Feature Importance (if Random Forest was selected):
if selected_model == "Random Forest":
    # This would require the model to be published as a variable
    # For demonstration, create a mock feature importance
    features = ['Academic Performance', 'Social Support', 'Financial Stress']
    importance = [0.4, 0.35, 0.25] # Mock values
    ax3.bar(features, importance, color='#4ecdc4')
    ax3.set_ylabel('Feature Importance')
    ax3.set_title('Feature Importance')
    ax3.tick_params(axis='x', rotation=45)
else:
    ax3.text(0.5, 0.5, f'{selected_model}\nSelected', ha='center', va='center', transform=ax3.transAxes)
    ax3.set_title('Model Type')
- Plot 4: Model Performance Summary:
metrics = ['R² Score', 'Data Points']
values = [model_r2, filtered_count]
bars = ax4.bar(metrics, values, color=['#ffd93d', '#4ecdc4'])
ax4.set_ylabel('Value')
ax4.set_title('Model Performance')
ax4.tick_params(axis='x', rotation=45)
# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{value:.3f}' if isinstance(value, float) else f'{value}',
             ha='center', va='bottom')
- Adjust layout and publish the visualization:
plt.tight_layout()
dc.publish.card(
type='matplotlib',
fig=fig,
label=f'Stress Analysis Dashboard - {selected_stress_level}',
logic_view_size=(6, 4),
layout=[{"size": (15, 8), "position": (10, 0), "deck": "default-deck"}]
)
- Create additional summary cards:
# Data summary card
dc.publish.card(
type='number',
value=filtered_count,
label='Filtered Data Points',
unit='records',
decimals=0,
logic_view_size=(2, 1),
layout=[{"size": (3,2), "position": (6,2), "deck": "default-deck"}]
)
# Model performance card
dc.publish.card(
type='number',
value=model_r2,
label=f'{selected_model} R²',
unit='',
decimals=3,
logic_view_size=(2, 1),
layout=[{"size": (3,2), "position": (6,4), "deck": "default-deck"}]
)
Contrast with Monolithic Approach: In the original monolithic notebook, visualizations would be scattered throughout the analysis, making it hard to update or modify them. Here, we centralize all visualization logic, making it easy to iterate on the presentation layer.
Conceptual Takeaways
- Effective communication: Visualizing distributions and relationships helps stakeholders quickly grasp the data and understand model performance in context.
- Separation of concerns: This notebook focuses solely on presentation; analytical computations are kept separate for modularity and performance.
- Interactivity readiness: Published cards can be integrated into dashboards, enabling dynamic interactions with filtered or predicted data.
- Design principles: Following DataCards best practices ensures clean, professional dashboards that communicate insights effectively.
- Modular visualization: Each visualization component can be updated independently, allowing for rapid iteration and customization.
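As a small illustration of that independence, a single chart can be rebuilt and republished on its own without rerunning the modeling notebook or the full dashboard figure. This is a minimal sketch reusing the variables and matplotlib card type shown above; the layout position is illustrative:
# Rebuild and republish a single chart; upstream computations stay untouched.
import matplotlib.pyplot as plt
df = dc.consume.variable.clean_stress_data()
selected_stress_level = dc.consume.variable.selected_stress_level()
fig_single, ax = plt.subplots()
scores = df[df['stress_level'] == selected_stress_level]['stress_score']
ax.hist(scores, bins=20, color='#ff6b9d', edgecolor='white')
ax.set_xlabel('Stress Score')
ax.set_ylabel('Frequency')
ax.set_title(f'Stress Score Distribution - {selected_stress_level}')
dc.publish.card(
    type='matplotlib',
    fig=fig_single,
    label='Stress Score Distribution (standalone)',
    logic_view_size=(3, 2),
    layout=[{"size": (6, 4), "position": (18, 0), "deck": "default-deck"}]  # illustrative position
)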
Conclusion
This tutorial demonstrated how to refactor the monolithic approach used in notebooks like the Student Stress Analysis into the Contract-First Modular Workflow pattern, showing how to:
- Centralize setup in a dedicated installation notebook
- Separate data preparation from analysis and presentation
- Create interactive controls for user input and filtering
- Isolate business logic for maintainable modeling
- Build modular visualizations for effective communication
Key Benefits Over Monolithic Approach
This modular approach promotes:
- Maintainability: Each component has a clear responsibility
- Scalability: Components can be developed and tested independently
- Collaboration: Clear interfaces enable team development
- Flexibility: Easy to modify or extend individual components
- Performance: Optimized resource usage and faster iteration cycles
- Reusability: Components can be reused across different projects
By following this pattern, you can build robust, scalable data science applications that are easy to maintain, extend, and collaborate on - a significant improvement over traditional monolithic notebook approaches.