Navigating Datamodels¶

Here we describe using Python to navigate the SDSS Datamodel system.

The SDSS Datamodel¶

The highest level entry point into is the datamodel is the SDSSDataModel. It essentially acts as a container object for all SDSS data products and metadata. The SDSSDataModel currently contains the following accessible properties

releases: a list of all public and internal data releases, plus the active “WORK” release
surveys: a list of all SDSS surveys
phases: a list of all SDSS phases
tags: a list of all SDSS software tags
products: a list of all SDSS data products
vacs: a list of SDSS VAC environment variables

>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> dm
<SDSS DataModel (n_releases=30, n_products=1, n_surveys=13, n_phases=5)>

You can access each property from the datamodel itself.

>>> # display the list of SDSS surveys
>>> dm.surveys
[Survey(name='MWM', long='Milky Way Mapper', description='A time-domain, optical+IR spectroscopic survey of Milky Way stars of all types.', phase=Phase(name='Phase-V', id=5, start=2020, end=None, active=True))
 Survey(name='BHM', long='Black Hole Mapper', description='An optical time-domain spectroscopic survey of quasars and X-ray sources', phase=Phase(name='Phase-V', id=5, start=2020, end=None, active=True))
 ...]

Much of this metadata are described as a list of models, and can be accessed independently of the SDSSDataModel or Products objects. See Metadata Models for more details.

Note

The datamodel examples on this page only have a single “mangaRss” datamodel product. These examples will be updated with more realistic results once more example datamodels have been generated.

Data Products¶

Each SDSS JSON datamodel is converted, and serialized, into a Product, with the complete list of data products collected into a DataProducts list. The list of data products is available from the datamodel, i.e. dm.products or separately.

>>> # import the list of data products
>>> from datamodel.products import DataProducts
>>> dp = DataProducts()
>>> dp
<DataProducts (n_products=1)>

DataProducts is a FuzzyList, so can indexed normally, or by a fuzzy name.

>>> # get item by index
>>> dp[0]
<Product ("mangaRss", summary="this is a manga rss")>

>>> # get item by name
>>> dp['mangarss']
<Product ("mangaRss", summary="this is a manga rss")>

>>> # get item by fuzzy name
>>> dp['mngars']
<Product ("mangaRss", summary="this is a manga rss")>

Loading Content¶

By default, DataProducts lazy-loads all data products. This means that the underlying JSON datamodel content will not be loaded upon instantiation. Only when a Product is retrieved from the DataProducts list, is when the JSON content is read in. This allows for efficient navigation of the list of data products for a large number of items. You can load a product manually by passing in load=True on instantiation, or call the Product’s load method.

>>> # create a new Product for the MaNGA RSS datamodel
>>> from datamodel.products import Product
>>> rss = Product('mangarss')
>>> rss
<Product ("mangarss", summary="")>

>>> # by default it is unloaded
>>> rss.loaded
False

>>> # load the JSON content
>>> rss.load()
>>> rss
<Product ("mangarss", summary="this is a manga rss")>

>>> rss.loaded
True

You can also list all available products by their “file species” name.

>>> # list all data products by name
>>> dp.list_products()
['mangaRss']

Retrieving Content¶

The underlying JSON ProductModel is available on each product, accessible via the _model attribute. A subset of the model attributes have been “extracted” up on the Product object itself, e.g. the general.releases, general.short, and general.description attributes. The _extract class attribute contains a list of general parameters to be included. Additional parameters can be included by adding them to this list, reinstantiating, and reloading the product.

>>> # list the product releases
>>> rss.releases
[Release(name='MPL5', description='SDSS MaNGA internal product release 5', public=False, release_date='2016-06-27'),
 Release(name='DR14', description='SDSS public data release 14', public=True, release_date='2017-07-31'),
 Release(name='DR15', description='SDSS public data release 15', public=True, release_date='2018-12-10'),
 Release(name='DR16', description='SDSS public data release 16', public=True, release_date='2019-12-09'),
 Release(name='MPL10', description='SDSS MaNGA internal product release 10', public=False, release_date='2020-07-13'),
 Release(name='WORK', description='SDSS unreleased data.  Represents any work-in-progress data.', public=False, release_date='unreleased')]

>>> # list the product short and long descriptions
>>> rss.short, rss.description
('A MaNGA Row-Stacked Spectra (RSS) product',
 "The MaNGA DRP provides summary row-stacked spectra (RSS; with both logarithmic and
 linear wavelength solutions) for each galaxy that combine individual fiber spectra of
 that galaxy across multiple exposures into a single row-stacked format. The RSS files are a
 two-dimensional array with horizontal size N_spec and vertical size N = \\sum N_fiber(i)
 where N_fiber(i) is the number of fibers in the IFU targeting this galaxy for the i''th
 exposure and the sum runs over all exposures."
)

The datamodel Product contains various convenience methods of returning content from the datamodel. You can return the entire datamodel content has a dictionary using get_content:

>>> # return the datamodel content
>>> rss.get_content()
{'general': {'name': 'mangaRss',
 'short': 'this is a manga rss',
 'description': 'longer description',
 'environments': ['MANGA_SPECTRO_REDUX'],
 'datatype': 'FITS',
 ...
}

You can return content specific to a release using get_release:

>>> # return the datamodel content for DR15
>>> rss.get_release("DR15")
Release(
 template='$MANGA_SPECTRO_REDUX/[DRPVER]/[PLATE]/stack/manga-[PLATE]-[IFU]-[WAVE]RSS.fits.gz',
 ...)

Note that get_release method returns the Release object, which can be converted to a dictionary through its own dict() method.

You can return either the example filepath, or a more general path location, for a given release.

>>> # return the default datamodel example for the WORK release
>>> rss.get_example()
'/Users/Brian/Work/sdss/sas/mangawork/manga/spectro/redux/v3_1_1/8485/stack/manga-8485-1901-LOGRSS.fits.gz'

>>> # return the file location for DR16
>>> rss.get_location(drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG', release='DR16')
'/Users/Brian/Work/sdss/sas/dr16/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGRSS.fits.gz'

Reorganizing¶

By default, DataProducts is a complete list of products organized by the “file species” datamodel name. To group data products by some other property, you can use the group_by method. Possible fields to group by are any attribute on the Product instance, or any field in the underlying _model JSON datamodel, i.e. ProductModel.

To group products by a Product attribute, pass in the attribute name. For example, to group products by data releases, use the releases attribute:

>>> # group the products by the releases attribute
>>> group = dm.products.group_by('releases')
>>> group
{'DR15': [<Product ("mangaRss", summary="this is a manga rss")>],
 'DR16': [<Product ("mangaRss", summary="this is a manga rss")>],
 'MPL10': [<Product ("mangaRss", summary="this is a manga rss")>],
 'WORK': [<Product ("mangaRss", summary="this is a manga rss")>]}

To group products by an attribute on the underlying JSON ProductModel, pass in a “dotted attribute chain” path to the field. For example, to group products by the SAS environment variable, which lives in the “environments” field of the GeneralSection of the JSON datamodel file, the full string path would be _model.general.environments:

>>> # group the products by the environments attribute
>>> group = dm.products.group_by('_model.general.environments')
>>> group
{'MANGA_SPECTRO_REDUX': [<Product ("mangaRss", summary="this is a manga rss")>]}

Data Levels¶

Each data product has a data level, a string that describes the data processing level of the product, with format x.y.z. A level consists of three components, each:

x: the product type - high level category of pipeline processing, e.g. raw, intermediate, final, VAC
y: the product subtype - sub-category of the data type, e.g. image, spectra, cubes, etc
z: the product variant - optional, more specific description of the product, e.g. extracted spectra

>>> # access a product and its level
>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> prod = dm.products['mangaRss']
>>> prod
<Product ("mangaRSS", summary="this is a manga rss", level="2.2.1")>
>>> print(prod.data_level)
2.2.1

The data_level is actually a descriptive object that can be expanded for more information.

>>> prod.data_level
DataLevel(x=<X.final: 2>, y=<Y.spectra: 2>, z=1)
>>> prod.describe()
{'product_type': 'final: Final science data product from a reduction or analysis pipeline',
 'product_subtype': 'spectra: A 1d or 2d spectral data product, or a set of spectral data',
 'product_variant': 'extracted_spectra: 1D extracted, wavelength-calibrated spectra'}

You can also get all products with a specific data level with the dm.products.get_level method.

>>> # get all products with at the specific data level 2.2.1
>>> dm.products.get_level('2.2.1')
{"2.2.1": [<Product ("mangaRSS", summary="this is a manga rss", level="2.2.1")>]}

>>> # get all products at data level 2
>>> dm.products.get_level('2')
{"2.2.1": [<Product ("mangaRSS", summary="this is a manga rss", level="2.2.1")>],
 "2.3.0": [<Product ("mangaDrpall", summary="the manga drpall summary", level="2.3.0")>]}

Metadata Models¶

The datamodel products contains SDSS metadata accessible for lookup, or for use within web applications or Python software. These metadata files are defined as YAML files, and serialized into Python objects using Pydantic. For example, the datamodel/releases.yaml file defines the list of all available public or internal SDSS data releases, and gets converted into datamodel.models.releases.Releases, a list of datamodel.models.releases.Release objects.

Each metadata YAML file is structured in the same way, with two parts: a schema section, and a “named” list section of objects, e.g. “releases”. The schema section defines the parameters attached to each object, while the named section defines the object themselves. For example:

schema:
  title: Release
  key: release
  description: SDSS data release versions
  properties:
    name:
      title: name
      description: the name of the data release
      type: string
      required: true
    description:
      title: description
      description: a short description of the data release
      type: string
      required: true
    public:
      title: release
      description: a flag whether it is public or not
      type: bool
      required: false
      default: false
    release_date:
      title: release_date
      description: the date the data was released to the public or the collaboration, in str isoformat
      type: str
      required: false
      default: unreleased

releases:
  - name: DR17
    description: SDSS public data release 17
    public: true
    release_date: '2021-12-06'
  - name: DR16
    description: SDSS public data release 16
    public: true
    release_date: '2019-12-09'
  ...

When the datamodel package reads in these files and serializes them, they become accessible as navigable objects. For example, to access the list of SDSS releases, you can do the following:

>>> # import the SDSS releases
>>> from datamodel.models import releases
>>> releases
[Release(name='DR17', description='SDSS public data release 17', public=True, release_date=datetime.date(2021, 12, 6))
 Release(name='DR16', description='SDSS public data release 16', public=True, release_date=datetime.date(2019, 12, 9))
 ...
 Release(name='WORK', description='SDSS unreleased data.  Represents any work-in-progress data.', public=False, release_date='unreleased')
 Release(name='MPL11', description='SDSS MaNGA internal product release 11.  Equivalent to DR17.', public=False, release_date=datetime.date(2021, 3, 1))
 Release(name='MPL10', description='SDSS MaNGA internal product release 10', public=False, release_date=datetime.date(2020, 7, 13))
 ...]

>>> # check for containment
>>> 'DR17' in releases
True

>>> # select a release by index or name
>>> releases[0]
Release(name='DR17', description='SDSS public data release 17', public=True, release_date=datetime.date(2021, 12, 6))
>>> releases["DR13"]
Release(name='DR13', description='SDSS public data release 13', public=True, release_date=datetime.date(2016, 7, 31))

All metadata objects subclass from datamodel.models.base.BaseList, and behave the same way. To list just the names of each item, use the list_names method.

>>> # list just the names of the releases
>>> releases.list_names()
['DR17',
'DR16',
'DR15',
...]

By default, the order of the items in each list is defined by the order in the YAML file. You can sort (in-place) the list of items by any attribute on the object. To sort the releases by release_date from most recent to oldest, do:

>>> # sort the releases by date in descending order
>>> releases.sort('release_date', reverse=True)
>>> releases
[Release(name='DR17', description='SDSS public data release 17', public=True, release_date=datetime.date(2021, 12, 6))
 Release(name='MPL11', description='SDSS MaNGA internal product release 11.  Equivalent to DR17.', public=False, release_date=datetime.date(2021, 3, 1))
 Release(name='MPL10', description='SDSS MaNGA internal product release 10', public=False, release_date=datetime.date(2020, 7, 13))
 Release(name='DR16', description='SDSS public data release 16', public=True, release_date=datetime.date(2019, 12, 9))
 Release(name='MPL9', description='SDSS MaNGA internal product release 9', public=False, release_date=datetime.date(2019, 12, 2))
 ...]

The same structure and behaviour is true for any of the other metadata files, e.g. SDSS Phases or Surveys.

>>> # import the SDSS phases
>>> from datamodel.models import phases
>>> phases
[Phase(name='Phase-V', id=5, start=2020, end=None, active=True)
 Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False)
 Phase(name='Phase-III', id=3, start=2008, end=2014, active=False)
 Phase(name='Phase-II', id=2, start=2005, end=2008, active=False)
 Phase(name='Phase-I', id=1, start=2000, end=2005, active=False)]

As a reminder, all metadata items are accessible on the main datamodel.products.product.SDSSDataModel.

>>> # access the list of phases from the datamodel
>>> dm.phases
[Phase(name='Phase-V', id=5, start=2020, end=None, active=True)
 Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False)
 Phase(name='Phase-III', id=3, start=2008, end=2014, active=False)
 Phase(name='Phase-II', id=2, start=2005, end=2008, active=False)
 Phase(name='Phase-I', id=1, start=2000, end=2005, active=False)]

Tags¶

The SDSS Tag model represents a software release tag. A specific tag is associated with a SDSS data release, a SDSS survey, and is commonly referenced by a specific version name.

>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> tag = dm.tags[0]
>>> tag
Tag(version=Version(name='drpver', description='software tag key for the MaNGA Data Reduction Pipeline (DRP)'), tag='v3_1_1', release=Release(name='DR17', description='SDSS public data release 17', public=True, release_date='2021-12-06'), survey=Survey(name='MaNGA', long='Mapping Nearby Galaxies at Apache Point Observatory', description='A wide-field optical spectroscopic IFU survey of extragalactic sources to study galaxy dynamics and kinematics', phase=Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False), id='manga'))

>>> # examine the tag release
>>> tag.release
Release(name='DR17', description='SDSS public data release 17', public=True, release_date='2021-12-06')

>>> # examine the tag survey
>>> tag.survey
Survey(name='MaNGA', long='Mapping Nearby Galaxies at Apache Point Observatory', description='A wide-field optical spectroscopic IFU survey of extragalactic sources to study galaxy dynamics and kinematics', phase=Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False), id='manga')

>>> # examine the tag version
>>> tag.version
Version(name='drpver', description='software tag key for the MaNGA Data Reduction Pipeline (DRP)')

You can reorganize the list of Tags into a nested dictionary, grouped by release or survey, using the group_by. The default ordering is by SDSS data release.

>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> dm.tags.group_by()
{'DR17':
    {'manga': {'drpver': 'v3_1_1', 'dapver': '3.1.0'},
     'mastar': {'drpver': 'v3_1_1'},
     'eboss': {'run2d': 'v5_13_2', 'run1d': 'v5_13_2'},
     'apogee2': {'apred_vers': 'dr17',
        'apstar_vers': 'stars',
        'aspcap_vers': 'synspec_rev1',
        'results_vers': 'synspec_rev1'},
     'legacy': {'run2d': [26, 103, 104]}},
'DR16':
    {'manga': {'drpver': 'v2_4_3', 'dapver': '2.2.1'},
     'mastar': {'drpver': 'v2_4_3'},
     'eboss': {'run2d': 'v5_13_0', 'run1d': 'v5_13_0'},
     'apogee2': {'apred_vers': 'r12',
        'apstar_vers': 'stars',
        'aspcap_vers': 'l33',
        'results_vers': 'l33'},
     'legacy': {'run2d': [26, 103, 104]}
     },
...
}

Or to reorder by SDSS survey, set order_by to survey.

>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> dm.tags.group_by('survey')
{'manga':
    {'DR17': {'drpver': 'v3_1_1', 'dapver': '3.1.0'},
     'DR16': {'drpver': 'v2_4_3', 'dapver': '2.2.1'},
     'DR15': {'drpver': 'v2_4_3', 'dapver': '2.2.1'},
     ...
     }
 'eboss':
    {'DR17': {'run2d': 'v5_13_2', 'run1d': 'v5_13_2'},
     'DR16': {'run2d': 'v5_13_0', 'run1d': 'v5_13_0'},
     'DR15': {'run2d': 'v5_10_0', 'run1d': 'v5_10_0'},
     ...
     },
...
}