Skip to content

CA Biositing Data Models

This package contains the SQLModel-based database models for the CA Biositing project. It is implemented as a PEP 420 namespace package that can be shared across multiple components of the application (ETL pipelines, API services, etc.).

Overview

The ca_biositing.datamodels package provides:

  • Hand-Written SQLModel Classes: Models organized across domain subdirectories, combining SQLAlchemy ORM and Pydantic validation in a single class hierarchy.
  • Materialized Views: Analytical views defined as SQLAlchemy Core select() expressions (plus one SQL-based aggregate view), managed via Alembic migrations.
  • Database Configuration: SQLModel-based engine and session management with Docker-aware URL adjustment.
  • Model Configuration: Shared configuration for model behavior using Pydantic Settings.
  • Alembic Migrations: Version-controlled database schema changes.

Schema Management Workflow

All schema changes are managed through Alembic migrations generated from SQLModel class definitions.

Making Schema Changes

  1. Edit Models: Modify or add SQLModel classes in the appropriate subdirectory under ca_biositing/datamodels/models/.
  2. Auto-Generate Migration: Run the autogenerate command to create a new migration script:
    pixi run migrate-autogenerate -m "Description of your changes"
    
  3. Review Migration: Check the generated script in alembic/versions/ to ensure it accurately reflects your changes.
  4. Apply Migration: Apply the changes to the database:
    pixi run migrate
    

Managing Materialized Views

Materialized views are defined in ca_biositing/datamodels/views.py and created/modified through manual Alembic migration scripts (not autogenerated). After loading new data into the base tables, refresh the views:

pixi run refresh-views

To check the status of materialized views:

pixi run schema-analytics-list

Validation with pgschema (Optional)

The pgschema tool can still be used for validation (diffing the live DB against reference SQL), but it is no longer used for applying schema changes:

# Diff public schema
pixi run schema-plan

# Diff analytics schema (materialized views)
pixi run schema-analytics-plan

Structure

src/ca_biositing/datamodels/
├── ca_biositing/
│   └── datamodels/
│       ├── __init__.py              # Package initialization and version
│       ├── config.py                # Model configuration (Pydantic Settings)
│       ├── database.py              # SQLModel engine and session management
│       ├── views.py                 # Materialized view definitions
│       ├── models/                  # Hand-written SQLModel classes
│       │   ├── __init__.py          # Central re-export of models
│       │   ├── base.py              # Base classes (BaseEntity, LookupBase, etc.)
│       │   ├── aim1_records/        # Aim 1 analytical records
│       │   ├── aim2_records/        # Aim 2 processing records
│       │   ├── auth/                # API users and authentication models
│       │   ├── core/                # ETL lineage and run tracking
│       │   ├── data_sources_metadata/ # Data source and dataset metadata
│       │   ├── experiment_equipment/  # Experiments and equipment
│       │   ├── external_data/       # LandIQ, USDA, Billion Ton records
│       │   ├── field_sampling/      # Field samples and collection methods
│       │   ├── general_analysis/    # Observations and analysis types
│       │   ├── infrastructure/      # Infrastructure facility records
│       │   ├── methods_parameters_units/ # Methods, parameters, units
│       │   ├── people/              # Contacts and providers
│       │   ├── places/              # Location and address models
│       │   ├── resource_information/ # Resources, availability, strains
│       │   └── sample_preparation/  # Prepared samples and methods
│       └── sql_schemas/             # Reference SQL files (for pgschema validation)
├── tests/
│   ├── __init__.py
│   ├── conftest.py                  # Pytest fixtures and configuration
│   ├── test_package.py              # Tests for package metadata
│   └── README.md                    # Test documentation
├── LICENSE                          # BSD License
├── README.md                        # This file
└── pyproject.toml                   # Package metadata and dependencies

Installation

This package is part of the CA Biositing namespace package structure and is designed to be shared across multiple components of the project.

As part of the full project

The recommended way to install is using Pixi (which manages all dependencies):

pixi install

Standalone installation (development)

For development of just the datamodels package:

cd src/ca_biositing/datamodels
pip install -e .

Usage

Importing Models

All models are re-exported from the models package for convenient imports:

from ca_biositing.datamodels.models import (
    Resource,
    ResourceClass,
    FieldSample,
    Place,
    LocationAddress,
)

# Create a model instance
sample = FieldSample(
    name="Sample-001",
    resource_id=1,
    amount_collected=50.5
)

You can also import from specific domain submodules:

from ca_biositing.datamodels.models.resource_information import Resource
from ca_biositing.datamodels.models.field_sampling import FieldSample
from ca_biositing.datamodels.models.places import Place

Database Operations

from sqlmodel import Session, select
from ca_biositing.datamodels.database import get_engine
from ca_biositing.datamodels.models import Resource

engine = get_engine()

with Session(engine) as session:
    # Query models
    statement = select(Resource)
    resources = session.exec(statement).all()

    # Add new records
    new_resource = Resource(name="Corn Stover")
    session.add(new_resource)
    session.commit()

Refreshing Materialized Views

from ca_biositing.datamodels.database import get_engine
from ca_biositing.datamodels.views import refresh_all_views

engine = get_engine()
refresh_all_views(engine)

Testing

The package includes a comprehensive test suite covering model instantiation, field validation, and database persistence.

Run all tests

pixi run pytest src/ca_biositing/datamodels -v

Run specific test files

pixi run pytest src/ca_biositing/datamodels/tests/test_package.py -v

Run with coverage

pixi run pytest src/ca_biositing/datamodels --cov=ca_biositing.datamodels --cov-report=html

See tests/README.md for detailed information about the test suite.

Model Categories

The models are organized into domain subdirectories under models/:

Core and Infrastructure

  • base.py: Base classes shared across all models (BaseEntity, LookupBase, Aim1RecordBase, Aim2RecordBase).
  • auth/: API authentication models (ApiUser).
  • core/: ETL run tracking and lineage (EtlRun, EntityLineage, LineageGroup).
  • infrastructure/: Infrastructure facility records (biodiesel plants, landfills, ethanol biorefineries, etc.).
  • places/: Location and address models (Place, LocationAddress, LocationResolution).
  • people/: Contact and provider information (Contact, Provider).
  • data_sources_metadata/: Data source tracking (DataSource, Dataset, FileObjectMetadata).

Resources and Sampling

  • resource_information/: Core resource entities (Resource, ResourceClass, ResourceSubclass, ResourceAvailability, Strain).
  • field_sampling/: Field sampling data (FieldSample, HarvestMethod, CollectionMethod, SoilType).
  • sample_preparation/: Sample processing (PreparedSample, PreparationMethod, ProcessingMethod).

Experiments and Analysis

  • experiment_equipment/: Experimental setup (Experiment, Equipment, ExperimentAnalysis).
  • methods_parameters_units/: Methods, parameters, and units (Method, Parameter, Unit, MethodCategory).
  • general_analysis/: Observations and analysis results (Observation, AnalysisType, PhysicalCharacteristic).
  • aim1_records/: Aim 1 analytical records (proximate, ultimate, compositional, ICP, XRD, XRF, etc.).
  • aim2_records/: Aim 2 processing records (autoclave, fermentation, gasification, pretreatment).

External Data

  • external_data/: Integration with external datasets (LandIQ, USDA Census, USDA Survey, Billion Ton 2023, USDA Market).

Dependencies

Core dependencies (defined in pyproject.toml):

  • SQLModel: Combines SQLAlchemy + Pydantic for typed ORM models
  • SQLAlchemy >= 2.0.0: SQL database interaction
  • GeoAlchemy2: PostGIS geometry column support
  • Alembic >= 1.13.2: Database migration tool
  • psycopg2 >= 2.9.9: PostgreSQL adapter for Python
  • Pydantic Settings >= 2.0.0: Settings management

Development

Code Quality

Before committing changes, run pre-commit checks:

pixi run pre-commit run --files src/ca_biositing/datamodels/**/*

Adding New Models

  1. Create Model: Add a new SQLModel class in the appropriate subdirectory under models/, or create a new subdirectory if needed.
  2. Re-Export: Add the import to models/__init__.py so the model is available from ca_biositing.datamodels.models.
  3. Generate Migration: Run pixi run migrate-autogenerate -m "Add new model".
  4. Review: Check the generated migration in alembic/versions/.
  5. Apply: Run pixi run migrate to update the database.

Modifying Existing Models

  1. Edit Model: Modify the SQLModel class in its domain subdirectory.
  2. Generate Migration: Run pixi run migrate-autogenerate -m "Describe change".
  3. Review: Check the migration script for accuracy.
  4. Apply: Run pixi run migrate.

Package Information

Contributing

See the main project's CONTRIBUTING.md for guidelines on contributing to this package.