# Data files
The Explorer uses custom HDF5, parquet, and JSON files for its application. Data files for the SDSS Explorer are stored in this directory on CHPC Utah.
## Data file types
### `explorerAllVisit` and `explorerAllStar` (HDF5)
The `explorerAllVisit.hdf5` and `explorerAllStar.hdf5` files are stacked versions of the `astraAllVisit` and `astraAllStar` FITS summary files produced by sdss/astra after each run.

All data within these files are stored as Apache Arrow datatypes to ensure maximum access and filtering speed, except for 2D columns, such as `sdss5_target_flags` or similar nested lists. Nested lists are instead stored as numpy arrays for efficient filtering and data access (because `vaex` doesn't support 2D Arrow data yet).
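As a quick illustration (a sketch only; the scalar column name here is hypothetical), you can inspect how each column is stored directly from `vaex`:

```python
import vaex as vx

df = vx.open("explorerAllStar.hdf5")

# 2D nested columns like sdss5_target_flags are numpy-backed,
# while scalar columns report Arrow datatypes
print(df.data_type("sdss5_target_flags"))
print(df.data_type("snr"))  # hypothetical scalar column name
```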
#### Accessing each dataset
To differentiate each dataset, one can use the `pipeline` column.
```python
import vaex as vx

df = vx.open("explorerAllStar.hdf5")  # a vaex DataFrame
dff = df[df["pipeline"] == "aspcap"]  # dff has the exact same data as astraAllStarASPCAP.fits.gz
```
In Explorer, this filtering step is done first, and an extracted DataFrame object is then stored with each Subset, which improves performance by restricting new filters to the rows of the selected dataset. This process is managed by a `task` in SubsetUI (TODO).
```python
# continuing from above; this is what the app does under the hood
Subset.df = dff.extract()  # new filters are now applied only to the rows in dff, not all of df
```
### Columns JSON
The `columnsAllStar` and `columnsAllVisit` JSON files assist with guardrailing users and with downloads by allowing the app to efficiently select columns that have no NaN values.
These `columns` JSON files are loaded as `dict[str]` into `State.columns`, which is then accessed to fill `Subset.columns` on each `dataset`/`pipeline` update.
Computing which columns are all NaN on each `dataset` switch, live within the app, was inefficient. Precomputing them is a trivial operation, and the resulting files take minimal memory/CPU to load.
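The exact schema of these JSON files isn't documented here, but as a rough sketch of the precomputation (assuming a flat pipeline-to-columns mapping and the file names above):

```python
import json

import vaex as vx

df = vx.open("explorerAllStar.hdf5")
columns = {}
for pipeline in df["pipeline"].unique():
    dff = df[df["pipeline"] == pipeline].extract()
    # keep only the columns that aren't entirely NaN/missing for this pipeline
    columns[pipeline] = [
        name for name in dff.get_column_names() if dff.count(name) > 0
    ]

with open("columnsAllStar.json", "w") as f:
    json.dump(columns, f)
```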
### Mappings parquet
`mappings.parquet` is a compiled datafile of all the SDSS targeting cartons and programs, which is used by the `filter_carton_mapper` function for selecting cartons and mapper programs.
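As a small usage sketch (the column names `carton` and `bit` are assumptions about the schema, not documented fact):

```python
import pandas as pd

mappings = pd.read_parquet("mappings.parquet")
# look up the flag bit positions for a chosen set of cartons,
# assuming "carton" and "bit" columns exist
bits = mappings.loc[mappings["carton"].isin(["mwm_snc_100pc"]), "bit"]
print(bits.tolist())
```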
**Warning:** `mappings.parquet` is not generated via the datafile generation described below. It must be generated manually (trivial via `pandas`) from any updated `bitmappings.csv` file in sdss/semaphore.
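A minimal sketch of that manual step (file paths are placeholders):

```python
import pandas as pd

# regenerate mappings.parquet from an updated bitmappings.csv
# taken from sdss/semaphore; paths are placeholders
pd.read_csv("bitmappings.csv").to_parquet("mappings.parquet", index=False)
```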
## Generating new data files
Generating new datafiles requires a machine with a large amount of memory, because the merge operation requires the entire dataset to fit into memory. Generation can be done with the files in the sdss/explorer-filegen repository and takes the following steps.
1. Source all current `astra` summary data files + relevant `spAll` files and place them into a subdirectory corresponding to the `astra` version.
    - Place these in some working directory with enough space to store upwards of 100GB of total data.
    - For `spall`, we use the BOSS pipeline summary files: `spAll-lite` for visit and `spAll-lite_multimjd` for star, since many BOSS-only sources (like quasars) have no real "star" coadd equivalent.
2. Convert all `astra` summary files to Arrow (parquet), then HDF5 (a sketch of this step follows the list).
    - `vaex` can't load `fits` files since `vaex.astro` hasn't been updated in some time, so we use Bristol's `STILTS` (part of TOPCAT) to convert them to `parquet`.
        - This requires the `topcat-extra.jar` file.
        - Note that this process drops the `tags` and `carton_0` columns.
    - This also automagically ensures that ALL datatypes are encoded as Apache Arrow ones.
    - When we reconvert back to HDF5, we additionally ensure that any nested `pa.ChunkedArray` (list of list) datatypes are converted back into `numpy` datatypes to ensure compatibility.
    - NOTE: you must convert the `spAll-lite` files used in the stacks manually using `spall_convert.py`.
3. Merge all files together and output `columns*.json`.
    - Columns files are used for guardrailing; see here.
4. Generate custom datamodels for the Column Glossary.
    - TODO
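As referenced in step 2, here is a rough sketch of the parquet-to-HDF5 reconversion with the nested-column handling; the input file name is hypothetical and this is not the actual generator script:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import vaex as vx

table = pq.read_table("astraAllStarASPCAP.parquet")  # hypothetical converted file
arrays = {}
for name in table.column_names:
    col = table.column(name)
    if pa.types.is_list(col.type):
        # vaex can't filter 2D Arrow (list-of-list) columns,
        # so convert them back into 2D numpy arrays
        arrays[name] = np.array(col.to_pylist())
    else:
        arrays[name] = col.combine_chunks()

df = vx.from_arrays(**arrays)
df.export_hdf5("explorerAllStar.hdf5")
```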
Within the repository, you will find additional slurm commands and scripts to run the generators (only steps 2 and 3) via `sbatch`.