Navigating Datamodels¶
Here we describe using Python to navigate the SDSS Datamodel system.
The SDSS Datamodel¶
The highest-level entry point into the datamodel is the SDSSDataModel. It
essentially acts as a container object for all SDSS data products and metadata. The SDSSDataModel
currently contains the following accessible properties:
releases: a list of all public and internal data releases, plus the active “WORK” release
surveys: a list of all SDSS surveys
phases: a list of all SDSS phases
tags: a list of all SDSS software tags
products: a list of all SDSS data products
vacs: a list of SDSS VAC environment variables
>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> dm
<SDSS DataModel (n_releases=30, n_products=1, n_surveys=13, n_phases=5)>
You can access each property from the datamodel itself.
>>> # display the list of SDSS surveys
>>> dm.surveys
[Survey(name='MWM', long='Milky Way Mapper', description='A time-domain, optical+IR spectroscopic survey of Milky Way stars of all types.', phase=Phase(name='Phase-V', id=5, start=2020, end=None, active=True))
Survey(name='BHM', long='Black Hole Mapper', description='An optical time-domain spectroscopic survey of quasars and X-ray sources', phase=Phase(name='Phase-V', id=5, start=2020, end=None, active=True))
...]
Much of this metadata is described as lists of models, and can be accessed independently of the
SDSSDataModel or Products objects. See Metadata Models for more details.
Note
The datamodel examples on this page only have a single “mangaRss” datamodel product. These examples will be updated with more realistic results once more example datamodels have been generated.
Data Products¶
Each SDSS JSON datamodel is converted and serialized into a Product,
with the complete list of data products collected into a DataProducts
list. The list of data products is available from the datamodel, i.e. dm.products, or separately.
>>> # import the list of data products
>>> from datamodel.products import DataProducts
>>> dp = DataProducts()
>>> dp
<DataProducts (n_products=1)>
DataProducts is a FuzzyList, so it can be indexed normally, or by a fuzzy name.
>>> # get item by index
>>> dp[0]
<Product ("mangaRss", summary="this is a manga rss")>
>>> # get item by name
>>> dp['mangarss']
<Product ("mangaRss", summary="this is a manga rss")>
>>> # get item by fuzzy name
>>> dp['mngars']
<Product ("mangaRss", summary="this is a manga rss")>
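As a rough illustration of the lookup behaviour above, the following minimal sketch implements a fuzzy-indexable list with the standard-library difflib module. The SimpleFuzzyList class and its cutoff value are assumptions made for this example; they are not the datamodel package's FuzzyList implementation, which may use a different matching algorithm.

```python
import difflib


class SimpleFuzzyList(list):
    """A list of strings that can be indexed by position or approximate name.

    Hypothetical stand-in for illustration only; not the datamodel FuzzyList.
    """

    def __getitem__(self, key):
        # integer and slice keys behave like a normal list
        if not isinstance(key, str):
            return super().__getitem__(key)

        # map lower-cased names back to the stored items for case-insensitive lookup
        names = {item.lower(): item for item in self}
        matches = difflib.get_close_matches(key.lower(), list(names), n=1, cutoff=0.6)
        if not matches:
            raise KeyError(f"no item matches {key!r}")
        return names[matches[0]]


products = SimpleFuzzyList(["mangaRss", "mangaDrpall"])
by_index = products[0]
by_name = products["mangarss"]
by_fuzzy_name = products["mngars"]
```

All three lookups resolve to the same "mangaRss" entry. A key that is too dissimilar to every name raises a KeyError rather than returning a poor match.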
Loading Content¶
By default, DataProducts lazy-loads all data products. This means that
the underlying JSON datamodel content is not loaded upon instantiation. The JSON content is read in
only when a Product is retrieved from the DataProducts list. This allows for
efficient navigation of a list containing a large number of data products. You can load a product
manually by passing in load=True on instantiation, or by calling the Product’s load method.
>>> # create a new Product for the MaNGA RSS datamodel
>>> from datamodel.products import Product
>>> rss = Product('mangarss')
>>> rss
<Product ("mangarss", summary="")>
>>> # by default it is unloaded
>>> rss.loaded
False
>>> # load the JSON content
>>> rss.load()
>>> rss
<Product ("mangarss", summary="this is a manga rss")>
>>> rss.loaded
True
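The lazy-loading pattern described above can be sketched in a few lines. The LazyProduct class, the JSON_CONTENT lookup table, and the in-memory JSON string are hypothetical stand-ins for illustration; the real Product reads its content from a JSON datamodel file.

```python
import json

# in-memory stand-in for JSON datamodel files on disk (assumption for this sketch)
JSON_CONTENT = {
    "mangarss": '{"general": {"name": "mangaRss", "short": "this is a manga rss"}}',
}


class LazyProduct:
    """Hypothetical product that defers reading its JSON content until asked."""

    def __init__(self, name, load=False):
        self.name = name
        self.loaded = False
        self.summary = ""
        if load:
            self.load()

    def load(self):
        # parse the JSON only once, the first time it is needed
        if self.loaded:
            return
        content = json.loads(JSON_CONTENT[self.name])
        self.summary = content["general"]["short"]
        self.loaded = True

    def __repr__(self):
        return f'<LazyProduct ("{self.name}", summary="{self.summary}")>'


rss = LazyProduct("mangarss")
before = (rss.loaded, repr(rss))
rss.load()
after = (rss.loaded, repr(rss))
```

Creating the object is cheap because nothing is parsed until load is called, which is what makes a long list of unloaded products quick to build.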
You can also list all available products by their “file species” name.
>>> # list all data products by name
>>> dp.list_products()
['mangaRss']
Retrieving Content¶
The underlying JSON ProductModel is available on each product, accessible via
the _model attribute. A subset of the model attributes have been “extracted” up onto the Product
object itself, e.g. the general.releases, general.short, and general.description
attributes. The _extract class attribute contains a list of general parameters to be included.
Additional parameters can be included by adding them to this list, reinstantiating, and reloading
the product.
>>> # list the product releases
>>> rss.releases
[Release(name='MPL5', description='SDSS MaNGA internal product release 5', public=False, release_date='2016-06-27'),
Release(name='DR14', description='SDSS public data release 14', public=True, release_date='2017-07-31'),
Release(name='DR15', description='SDSS public data release 15', public=True, release_date='2018-12-10'),
Release(name='DR16', description='SDSS public data release 16', public=True, release_date='2019-12-09'),
Release(name='MPL10', description='SDSS MaNGA internal product release 10', public=False, release_date='2020-07-13'),
Release(name='WORK', description='SDSS unreleased data. Represents any work-in-progress data.', public=False, release_date='unreleased')]
>>> # list the product short and long descriptions
>>> rss.short, rss.description
('A MaNGA Row-Stacked Spectra (RSS) product',
"The MaNGA DRP provides summary row-stacked spectra (RSS; with both logarithmic and
linear wavelength solutions) for each galaxy that combine individual fiber spectra of
that galaxy across multiple exposures into a single row-stacked format. The RSS files are a
two-dimensional array with horizontal size N_spec and vertical size N = \\sum N_fiber(i)
where N_fiber(i) is the number of fibers in the IFU targeting this galaxy for the i''th
exposure and the sum runs over all exposures."
)
The datamodel Product contains various convenience methods for
returning content from the datamodel. You can return the entire datamodel content as a
dictionary using get_content:
>>> # return the datamodel content
>>> rss.get_content()
{'general': {'name': 'mangaRss',
'short': 'this is a manga rss',
'description': 'longer description',
'environments': ['MANGA_SPECTRO_REDUX'],
'datatype': 'FITS',
...
}
You can return content specific to a release using get_release:
>>> # return the datamodel content for DR15
>>> rss.get_release("DR15")
Release(
template='$MANGA_SPECTRO_REDUX/[DRPVER]/[PLATE]/stack/manga-[PLATE]-[IFU]-[WAVE]RSS.fits.gz',
...)
Note that the get_release method returns the Release object, which can be
converted to a dictionary through its own dict() method.
You can return either the example filepath, or a more general path location, for a given release.
>>> # return the default datamodel example for the WORK release
>>> rss.get_example()
'/Users/Brian/Work/sdss/sas/mangawork/manga/spectro/redux/v3_1_1/8485/stack/manga-8485-1901-LOGRSS.fits.gz'
>>> # return the file location for DR16
>>> rss.get_location(drpver='v2_4_3', plate=8485, ifu=1901, wave='LOG', release='DR16')
'/Users/Brian/Work/sdss/sas/dr16/manga/spectro/redux/v2_4_3/8485/stack/manga-8485-1901-LOGRSS.fits.gz'
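To illustrate how keyword arguments such as drpver and plate could map onto the bracketed fields of a release template, here is a minimal sketch. The fill_template function and the example root directory are assumptions for illustration; this is not how get_location is implemented in the datamodel package, which resolves real SAS paths.

```python
import re

TEMPLATE = "$MANGA_SPECTRO_REDUX/[DRPVER]/[PLATE]/stack/manga-[PLATE]-[IFU]-[WAVE]RSS.fits.gz"


def fill_template(template, env=None, **kwargs):
    """Substitute [KEY] fields and $ENVVAR prefixes in a path template.

    Hypothetical helper; keyword names are matched case-insensitively.
    """
    values = {key.upper(): str(val) for key, val in kwargs.items()}

    def replace_field(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing value for template field [{key}]")
        return values[key]

    path = re.sub(r"\[([A-Z0-9_]+)\]", replace_field, template)

    # expand environment-variable style prefixes from an explicit mapping
    for name, root in (env or {}).items():
        path = path.replace(f"${name}", root)
    return path


location = fill_template(
    TEMPLATE,
    env={"MANGA_SPECTRO_REDUX": "/sas/dr16/manga/spectro/redux"},  # assumed root
    drpver="v2_4_3", plate=8485, ifu=1901, wave="LOG",
)
```

Raising an error for a missing field, rather than leaving the bracketed placeholder in the path, makes an incomplete set of keywords obvious to the caller.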
Reorganizing¶
By default, DataProducts is a complete list of products organized by the “file species” datamodel
name. To group data products by some other property, you can use the group_by method.
Possible fields to group by are any attribute on the Product instance, or any field in the
underlying _model JSON datamodel, i.e. ProductModel.
To group products by a Product attribute, pass in the attribute name. For example, to group
products by data release, use the releases attribute:
>>> # group the products by the releases attribute
>>> group = dm.products.group_by('releases')
>>> group
{'DR15': [<Product ("mangaRss", summary="this is a manga rss")>],
'DR16': [<Product ("mangaRss", summary="this is a manga rss")>],
'MPL10': [<Product ("mangaRss", summary="this is a manga rss")>],
'WORK': [<Product ("mangaRss", summary="this is a manga rss")>]}
To group products by an attribute on the underlying JSON ProductModel, pass in a “dotted attribute
chain” path to the field. For example, to group products by the SAS environment variable, which lives
in the “environments” field of the GeneralSection of the JSON datamodel file, the full string path
would be _model.general.environments:
>>> # group the products by the environments attribute
>>> group = dm.products.group_by('_model.general.environments')
>>> group
{'MANGA_SPECTRO_REDUX': [<Product ("mangaRss", summary="this is a manga rss")>]}
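One way to resolve a dotted attribute chain is to walk it one name at a time with getattr. The sketch below is a simplified, standalone version of that idea; the group_by function and the SimpleNamespace stand-ins are assumptions for illustration, and releases are plain strings here rather than Release objects.

```python
from collections import defaultdict
from functools import reduce
from types import SimpleNamespace


def group_by(items, field):
    """Group items by a dotted attribute path, e.g. '_model.general.environments'.

    Hypothetical helper for illustration. List-valued fields place the item
    under every value in the list.
    """
    groups = defaultdict(list)
    for item in items:
        # walk the attribute chain one name at a time
        value = reduce(getattr, field.split("."), item)
        keys = value if isinstance(value, (list, tuple)) else [value]
        for key in keys:
            groups[key].append(item)
    return dict(groups)


# minimal stand-ins for Product objects
rss = SimpleNamespace(
    name="mangaRss",
    releases=["DR15", "DR16"],
    _model=SimpleNamespace(general=SimpleNamespace(environments=["MANGA_SPECTRO_REDUX"])),
)

by_release = group_by([rss], "releases")
by_environment = group_by([rss], "_model.general.environments")
```

Because a product can belong to several releases, the same product appears under each matching key.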
Data Levels¶
Each data product has a data level, a string with format x.y.z that describes the data processing
level of the product. A level consists of three components:
x: the product type - high level category of pipeline processing, e.g. raw, intermediate, final, VAC
y: the product subtype - sub-category of the data type, e.g. image, spectra, cubes, etc
z: the product variant - optional, more specific description of the product, e.g. extracted spectra
>>> # access a product and its level
>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> prod = dm.products['mangaRss']
>>> prod
<Product ("mangaRSS", summary="this is a manga rss", level="2.2.1")>
>>> print(prod.data_level)
2.2.1
The data_level is actually a descriptive object that can be expanded for more information.
>>> prod.data_level
DataLevel(x=<X.final: 2>, y=<Y.spectra: 2>, z=1)
>>> prod.describe()
{'product_type': 'final: Final science data product from a reduction or analysis pipeline',
'product_subtype': 'spectra: A 1d or 2d spectral data product, or a set of spectral data',
'product_variant': 'extracted_spectra: 1D extracted, wavelength-calibrated spectra'}
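Because a level is a dotted string, it can be compared component by component, which is one way to select every product that falls under a broader level such as 2. The sketch below assumes that matching rule; the is_under_level and select_levels helpers are hypothetical and are not part of the datamodel package.

```python
def is_under_level(level, query):
    """Return True if a level string such as '2.2.1' falls under the query."""
    level_parts = level.split(".")
    query_parts = query.split(".")
    # every component given in the query must match the same component of the level
    return level_parts[: len(query_parts)] == query_parts


def select_levels(levels_by_product, query):
    """Group product names by data level, keeping only levels under the query."""
    result = {}
    for name, level in levels_by_product.items():
        if is_under_level(level, query):
            result.setdefault(level, []).append(name)
    return result


levels = {"mangaRss": "2.2.1", "mangaDrpall": "2.3.0"}
exact = select_levels(levels, "2.2.1")
broad = select_levels(levels, "2")
```

Comparing split components rather than raw string prefixes avoids accidental matches, for example a query of 2 matching a level that starts with 20.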
You can also get all products with a specific data level with the dm.products.get_level method.
>>> # get all products at the specific data level 2.2.1
>>> dm.products.get_level('2.2.1')
{"2.2.1": [<Product ("mangaRSS", summary="this is a manga rss", level="2.2.1")>]}
>>> # get all products at data level 2
>>> dm.products.get_level('2')
{"2.2.1": [<Product ("mangaRSS", summary="this is a manga rss", level="2.2.1")>],
"2.3.0": [<Product ("mangaDrpall", summary="the manga drpall summary", level="2.3.0")>]}
Metadata Models¶
The datamodel product contains SDSS metadata accessible for lookup, or for use within web
applications or Python software. These metadata are defined in YAML files, and serialized
into Python objects using Pydantic. For example,
the datamodel/releases.yaml file defines the list of all available public or internal SDSS
data releases, and gets converted into datamodel.models.releases.Releases, a list of
datamodel.models.releases.Release objects.
Each metadata YAML file is structured in the same way, with two parts: a schema section, and
a “named” list section of objects, e.g. “releases”. The schema section defines the parameters
attached to each object, while the named section defines the objects themselves. For example:
schema:
  title: Release
  key: release
  description: SDSS data release versions
  properties:
    name:
      title: name
      description: the name of the data release
      type: string
      required: true
    description:
      title: description
      description: a short description of the data release
      type: string
      required: true
    public:
      title: release
      description: a flag whether it is public or not
      type: bool
      required: false
      default: false
    release_date:
      title: release_date
      description: the date the data was released to the public or the collaboration, in str isoformat
      type: str
      required: false
      default: unreleased

releases:
  - name: DR17
    description: SDSS public data release 17
    public: true
    release_date: '2021-12-06'
  - name: DR16
    description: SDSS public data release 16
    public: true
    release_date: '2019-12-09'
  ...
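To make the role of the schema section concrete, the sketch below applies the required and default settings from the example above to a single entry. It uses only the standard library: the schema properties are written as a Python dict rather than parsed from YAML, and the apply_schema helper is hypothetical. The datamodel package itself performs this step with Pydantic.

```python
# the schema "properties" section, written as a Python dict instead of YAML
PROPERTIES = {
    "name": {"type": "string", "required": True},
    "description": {"type": "string", "required": True},
    "public": {"type": "bool", "required": False, "default": False},
    "release_date": {"type": "str", "required": False, "default": "unreleased"},
}


def apply_schema(entry, properties):
    """Check required fields and fill in defaults for one named-list entry."""
    unknown = set(entry) - set(properties)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    result = {}
    for key, spec in properties.items():
        if key in entry:
            result[key] = entry[key]
        elif spec.get("required"):
            raise ValueError(f"missing required field: {key}")
        else:
            result[key] = spec.get("default")
    return result


work = apply_schema(
    {"name": "WORK", "description": "SDSS unreleased data. Represents any work-in-progress data."},
    PROPERTIES,
)
```

An entry that omits public and release_date picks up the defaults false and unreleased, which matches how the WORK release appears in the examples on this page.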
When the datamodel package reads in these files and serializes them, they become accessible as
navigable objects. For example, to access the list of SDSS releases, you can do the following:
>>> # import the SDSS releases
>>> from datamodel.models import releases
>>> releases
[Release(name='DR17', description='SDSS public data release 17', public=True, release_date=datetime.date(2021, 12, 6))
Release(name='DR16', description='SDSS public data release 16', public=True, release_date=datetime.date(2019, 12, 9))
...
Release(name='WORK', description='SDSS unreleased data. Represents any work-in-progress data.', public=False, release_date='unreleased')
Release(name='MPL11', description='SDSS MaNGA internal product release 11. Equivalent to DR17.', public=False, release_date=datetime.date(2021, 3, 1))
Release(name='MPL10', description='SDSS MaNGA internal product release 10', public=False, release_date=datetime.date(2020, 7, 13))
...]
>>> # check for containment
>>> 'DR17' in releases
True
>>> # select a release by index or name
>>> releases[0]
Release(name='DR17', description='SDSS public data release 17', public=True, release_date=datetime.date(2021, 12, 6))
>>> releases["DR13"]
Release(name='DR13', description='SDSS public data release 13', public=True, release_date=datetime.date(2016, 7, 31))
All metadata objects subclass from datamodel.models.base.BaseList, and behave the same way. To list
just the names of each item, use the list_names method.
>>> # list just the names of the releases
>>> releases.list_names()
['DR17',
'DR16',
'DR15',
...]
By default, the order of the items in each list is defined by the order in the YAML file. You
can sort (in-place) the list of items by any attribute on the object. To sort the releases
by release_date from most recent to oldest, do:
>>> # sort the releases by date in descending order
>>> releases.sort('release_date', reverse=True)
>>> releases
[Release(name='DR17', description='SDSS public data release 17', public=True, release_date=datetime.date(2021, 12, 6))
Release(name='MPL11', description='SDSS MaNGA internal product release 11. Equivalent to DR17.', public=False, release_date=datetime.date(2021, 3, 1))
Release(name='MPL10', description='SDSS MaNGA internal product release 10', public=False, release_date=datetime.date(2020, 7, 13))
Release(name='DR16', description='SDSS public data release 16', public=True, release_date=datetime.date(2019, 12, 9))
Release(name='MPL9', description='SDSS MaNGA internal product release 9', public=False, release_date=datetime.date(2019, 12, 2))
...]
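The list behaviours shown above (containment by name, lookup by index or name, list_names, and in-place sorting by attribute) can be sketched with a small list subclass. The NamedList class and the simplified two-field Release dataclass are assumptions for illustration, not the BaseList implementation, and dates are kept as ISO-format strings so that they sort correctly without further parsing.

```python
from dataclasses import dataclass


@dataclass
class Release:
    name: str
    release_date: str  # ISO-format date string, so string order equals date order


class NamedList(list):
    """Hypothetical list whose items can be looked up by their name attribute."""

    def __contains__(self, value):
        if isinstance(value, str):
            return value in self.list_names()
        return super().__contains__(value)

    def __getitem__(self, key):
        if isinstance(key, str):
            for item in self:
                if item.name == key:
                    return item
            raise KeyError(key)
        return super().__getitem__(key)

    def list_names(self):
        return [item.name for item in self]

    def sort(self, field, reverse=False):
        # sort in place by the named attribute
        super().sort(key=lambda item: getattr(item, field), reverse=reverse)


releases = NamedList([
    Release("DR17", "2021-12-06"),
    Release("DR16", "2019-12-09"),
    Release("MPL10", "2020-07-13"),
])
has_dr17 = "DR17" in releases
dr16 = releases["DR16"]
releases.sort("release_date", reverse=True)
ordered = releases.list_names()
```

A list that mixes real dates with a placeholder such as unreleased would need a sort key that handles both kinds of value; this sketch sidesteps that by using only dated entries.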
The same structure and behaviour apply to any of the other metadata files, e.g. SDSS Phases or Surveys.
>>> # import the SDSS phases
>>> from datamodel.models import phases
>>> phases
[Phase(name='Phase-V', id=5, start=2020, end=None, active=True)
Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False)
Phase(name='Phase-III', id=3, start=2008, end=2014, active=False)
Phase(name='Phase-II', id=2, start=2005, end=2008, active=False)
Phase(name='Phase-I', id=1, start=2000, end=2005, active=False)]
As a reminder, all metadata items are accessible on the main datamodel.products.product.SDSSDataModel.
>>> # access the list of phases from the datamodel
>>> dm.phases
[Phase(name='Phase-V', id=5, start=2020, end=None, active=True)
Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False)
Phase(name='Phase-III', id=3, start=2008, end=2014, active=False)
Phase(name='Phase-II', id=2, start=2005, end=2008, active=False)
Phase(name='Phase-I', id=1, start=2000, end=2005, active=False)]
Tags¶
The SDSS Tag model represents a software release tag. A specific tag is associated with an SDSS
data release and an SDSS survey, and is commonly referenced by a specific version name.
>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> tag = dm.tags[0]
>>> tag
Tag(version=Version(name='drpver', description='software tag key for the MaNGA Data Reduction Pipeline (DRP)'), tag='v3_1_1', release=Release(name='DR17', description='SDSS public data release 17', public=True, release_date='2021-12-06'), survey=Survey(name='MaNGA', long='Mapping Nearby Galaxies at Apache Point Observatory', description='A wide-field optical spectroscopic IFU survey of extragalactic sources to study galaxy dynamics and kinematics', phase=Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False), id='manga'))
>>> # examine the tag release
>>> tag.release
Release(name='DR17', description='SDSS public data release 17', public=True, release_date='2021-12-06')
>>> # examine the tag survey
>>> tag.survey
Survey(name='MaNGA', long='Mapping Nearby Galaxies at Apache Point Observatory', description='A wide-field optical spectroscopic IFU survey of extragalactic sources to study galaxy dynamics and kinematics', phase=Phase(name='Phase-IV', id=4, start=2014, end=2020, active=False), id='manga')
>>> # examine the tag version
>>> tag.version
Version(name='drpver', description='software tag key for the MaNGA Data Reduction Pipeline (DRP)')
You can reorganize the list of Tags into a nested dictionary, grouped by release or survey, using
the group_by method. The default ordering is by SDSS data release.
>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> dm.tags.group_by()
{'DR17':
{'manga': {'drpver': 'v3_1_1', 'dapver': '3.1.0'},
'mastar': {'drpver': 'v3_1_1'},
'eboss': {'run2d': 'v5_13_2', 'run1d': 'v5_13_2'},
'apogee2': {'apred_vers': 'dr17',
'apstar_vers': 'stars',
'aspcap_vers': 'synspec_rev1',
'results_vers': 'synspec_rev1'},
'legacy': {'run2d': [26, 103, 104]}},
'DR16':
{'manga': {'drpver': 'v2_4_3', 'dapver': '2.2.1'},
'mastar': {'drpver': 'v2_4_3'},
'eboss': {'run2d': 'v5_13_0', 'run1d': 'v5_13_0'},
'apogee2': {'apred_vers': 'r12',
'apstar_vers': 'stars',
'aspcap_vers': 'l33',
'results_vers': 'l33'},
'legacy': {'run2d': [26, 103, 104]}
},
...
}
Or, to reorder by SDSS survey, set order_by to survey.
>>> from datamodel.products import SDSSDataModel
>>> dm = SDSSDataModel()
>>> dm.tags.group_by('survey')
{'manga':
{'DR17': {'drpver': 'v3_1_1', 'dapver': '3.1.0'},
'DR16': {'drpver': 'v2_4_3', 'dapver': '2.2.1'},
'DR15': {'drpver': 'v2_4_3', 'dapver': '2.2.1'},
...
},
'eboss':
{'DR17': {'run2d': 'v5_13_2', 'run1d': 'v5_13_2'},
'DR16': {'run2d': 'v5_13_0', 'run1d': 'v5_13_0'},
'DR15': {'run2d': 'v5_10_0', 'run1d': 'v5_10_0'},
...
},
...
}
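The two nestings shown above differ only in which key sits on the outside. The sketch below builds either layout from a flat list of tag records; the group_tags function and the tuple representation of a tag are assumptions for illustration, with values taken from the example output above.

```python
def group_tags(tags, order_by="release"):
    """Nest (release, survey, version, tag) records by release or by survey."""
    if order_by not in ("release", "survey"):
        raise ValueError("order_by must be 'release' or 'survey'")

    grouped = {}
    for release, survey, version, tag in tags:
        outer, inner = (release, survey) if order_by == "release" else (survey, release)
        grouped.setdefault(outer, {}).setdefault(inner, {})[version] = tag
    return grouped


# flat tag records taken from the example output above
TAGS = [
    ("DR17", "manga", "drpver", "v3_1_1"),
    ("DR17", "manga", "dapver", "3.1.0"),
    ("DR16", "manga", "drpver", "v2_4_3"),
    ("DR16", "manga", "dapver", "2.2.1"),
    ("DR17", "eboss", "run2d", "v5_13_2"),
]

by_release = group_tags(TAGS)
by_survey = group_tags(TAGS, order_by="survey")
```

This sketch keeps one tag per version name, so it does not cover cases such as the legacy run2d entry, where a single version name maps to a list of values.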