This repository is a ready-to-run starter kit for the semester project described in galaxy_classification_guide.md. It walks you through getting the data from SDSS, exploring key columns, and downloading/visualising galaxy images (DESI Legacy Survey) and spectra (SDSS DR19).
- Run the provided SQL on SDSS CasJobs to build your project table
- Export the table to CSV and place it in this repo
- Use the notebooks to validate data, plot histograms, and fetch/preview cutouts and spectra
- `galaxy_classification_guide.md`: Project background and goals (read this first)
- `input/queries/DATA7901_DR19_casjobs.sql`: SQL to run on SDSS CasJobs
- `input/tables/`: Put your exported CSV here (expected filename: `DATA7901_DR19.csv`)
- `input/images/`: JPEG cutouts downloaded by the notebook
- `input/spectra/`: FITS spectra downloaded by the notebook
- `notebooks/explore_tables.ipynb`: Main walkthrough: load the CSV, validate fields, plot histograms, download and visualise images and spectra
- `src/`: All Python scripts for the project live here; any `*.py` file mentioned in this guide is stored in `src/`
- `models/`: Trained models (saved checkpoints/weights, experiment outputs)
- `.gitignore`: Important! This file prevents large data files and sensitive configuration from being pushed to GitHub. It should include:
  - Data directories (`input/images/`, `input/spectra/`, `input/tables/*.csv`)
  - Model files (`models/*.pkl`, `models/*.h5`, `models/*.pth`)
  - Personal configuration files with local paths (`config.py`, `local_settings.py`)
  - System files (`__pycache__/`, `.DS_Store`, `*.pyc`)
- Python 3.10+ (tested with 3.12)
- Jupyter (Lab or Notebook)
- Packages: `numpy`, `pandas`, `matplotlib`, `astropy`
- Command-line `wget` (recommended) for downloads
  - macOS: `brew install wget`
  - Ubuntu/Debian: `sudo apt-get install wget`
  - Windows: use WSL or install wget; the notebook also includes a Python fallback for images
Suggested environment setup:

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install jupyter numpy pandas matplotlib astropy
```

Alternative using conda:

```bash
conda create -n data7901 python=3.12 -y
conda activate data7901
pip install jupyter numpy pandas matplotlib astropy
```

CasJobs portal for running SDSS SQL queries: https://casjobs.sdss.org/casjobs/
- Go to the SDSS CasJobs website and sign in (create an account if needed). You should see a page similar to this:
- Click "Login". You can create a new SciServer account or use Globus to authenticate:
- After confirming your email and completing sign-in, you should see the CasJobs query workspace. This is where you'll paste and run the SQL provided in this repo:
- Open a new query window and paste the contents of `input/queries/DATA7901_DR19_casjobs.sql`.
- Select DR19 from the dropdown menu under Context.
- The query creates `mydb.DATA7901_DR19` with the following (key) columns:
  - `objid`, `ra`, `dec`, Galactic `l`, `b`
  - Spectra identifiers: `specObjID`, `plate`, `mjd`, `fiberid`, `class`, `programname`, `sdssPrimary`
  - Galaxy Zoo votes (counts; `nvote_*`):
    - `nvote_tot`: total votes
    - `nvote_std`: votes for the standard classification
    - `nvote_mr1`: votes for the vertically mirrored classification
    - `nvote_mr2`: votes for the diagonally mirrored classification
    - `nvote_mon`: votes for the monochrome classification
  - Galaxy Zoo vote fractions (`p_*`; values in [0,1]):
    - `p_el`: elliptical
    - `p_cw`: clockwise spiral
    - `p_acw`: anticlockwise spiral
    - `p_edge`: edge-on disk
    - `p_dk`: don't know
    - `p_mg`: merger
    - `p_cs`: combined spiral (cw + acw + edge-on)
Visual guide to Galaxy Zoo class buttons used in the project (reproduced from Lintott et al. 2011):
Source: Lintott, C. et al. (2011), “Galaxy Zoo 1: data release of morphological classifications for nearly 900,000 galaxies,” MNRAS, 410, 166. ADS link: https://ui.adsabs.harvard.edu/abs/2011MNRAS.410..166L/abstract
SDSS schema references (useful while building and inspecting your table):
- SDSS Table Descriptions: [https://skyserver.sdss.org/dr7/en/help/docs/tabledesc.asp](https://skyserver.sdss.org/dr7/en/help/docs/tabledesc.asp)
- TABLE PhotoObj: [https://skyserver.sdss.org/dr7/en/help/browser/browser.asp?n=PhotoObj&t=U](https://skyserver.sdss.org/dr7/en/help/browser/browser.asp?n=PhotoObj&t=U)
- TABLE SpecObj: [https://skyserver.sdss.org/dr7/en/help/browser/browser.asp?n=SpecObj&t=U](https://skyserver.sdss.org/dr7/en/help/browser/browser.asp?n=SpecObj&t=U)
- TABLE zooVotes (Galaxy Zoo): [https://skyserver.sdss.org/dr8/en/help/browser/description.asp?n=zooVotes&t=U](https://skyserver.sdss.org/dr8/en/help/browser/description.asp?n=zooVotes&t=U)
- The filters in the SQL (magnitude and redshift cuts, and `zWarning = 0`) keep the result manageable.
- Submit the query. Within a few seconds to minutes (depending on load), the job status should be "Finished" with the message "Query Complete". That confirms your table was created in `MyDB` without errors.
- When it completes, export the results from `mydb.DATA7901_DR19` as CSV.
- Save the CSV locally as `DATA7901_DR19.csv` and place it at `input/tables/DATA7901_DR19.csv`.
Notes:

- Some tools may rename duplicate column names (e.g., `ra` and `dec` appear in multiple joined tables). The provided notebooks expect the CSV format produced by CasJobs; the examples here already work with the CSV used during development.
Open `notebooks/explore_tables.ipynb` and run the cells in order:

- Load the CSV from `input/tables/DATA7901_DR19.csv`.
- Validate completeness and ranges for all Galaxy Zoo fields for rows where `class == 'GALAXY'`:
  - `p_*` (fractions in [0,1]): `p_el`, `p_cw`, `p_acw`, `p_edge`, `p_dk`, `p_mg`, `p_cs`
  - `nvote_*` (non-negative integers): `nvote_tot`, `nvote_std`, `nvote_mr1`, `nvote_mr2`, `nvote_mon`
- Plot histograms for all `p_*` and all `nvote_*` columns (see the sketch below).
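A minimal sketch of these steps, assuming the column names above survive the CasJobs export unchanged:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("input/tables/DATA7901_DR19.csv")
galaxies = df[df["class"] == "GALAXY"]

p_cols = ["p_el", "p_cw", "p_acw", "p_edge", "p_dk", "p_mg", "p_cs"]
nvote_cols = ["nvote_tot", "nvote_std", "nvote_mr1", "nvote_mr2", "nvote_mon"]

# Fractions must lie in [0, 1]; vote counts must be non-negative.
# Note: a NaN fails these checks, which also flags incomplete rows.
assert galaxies[p_cols].ge(0).all().all() and galaxies[p_cols].le(1).all().all()
assert galaxies[nvote_cols].ge(0).all().all()

# One histogram per column.
galaxies[p_cols + nvote_cols].hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```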
In the same notebook:

- A cell prepares "valid galaxies" and builds the Legacy Survey cutout URLs.
- By default, it prints commands and limits downloads (e.g., first 10). You can increase or decrease `num_to_download`.
- Images are saved as `input/images/<objid>.jpeg`.
If you don’t have wget, either install it or use the Python fallback cell (already included) that uses urllib to fetch the same URLs.
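If you want the same logic outside the notebook, a urllib-based downloader looks roughly like this. The `layer`, `pixscale`, and `size` values below are illustrative assumptions; match the URL and parameters to what the notebook actually builds:

```python
import urllib.request
from pathlib import Path

def fetch_cutout(ra, dec, objid, out_dir="input/images",
                 layer="ls-dr10", pixscale=0.262, size=256):
    """Download one Legacy Survey JPEG cutout to input/images/<objid>.jpeg."""
    url = ("https://www.legacysurvey.org/viewer/jpeg-cutout"
           f"?ra={ra}&dec={dec}&layer={layer}&pixscale={pixscale}&size={size}")
    out_path = Path(out_dir) / f"{objid}.jpeg"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, out_path)  # same URL wget would fetch
    return out_path
```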
- The notebook includes a cell that shows the first 10 downloaded JPEGs side-by-side, with titles taken from the filename (`objid`). A sketch of that preview follows.
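A minimal version of the preview cell:

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from pathlib import Path

paths = sorted(Path("input/images").glob("*.jpeg"))[:10]
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for ax, path in zip(axes.ravel(), paths):
    ax.imshow(mpimg.imread(path))
    ax.set_title(path.stem, fontsize=8)  # filename stem is the objid
    ax.axis("off")
plt.show()
```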
This step mirrors Step 1, but targets the full photometric catalogue. Because the table is large, start by downloading only the first 100 rows to validate your workflow.
- Open a new CasJobs query window and use the query in `input/queries/DATA7901_DR19_casjobs_photo.sql`. To limit output for testing, adapt the select to `TOP (100)` (see comments inside the SQL file for guidance).
- Run the query. Export the result to CSV and examine the columns to confirm they match expectations. Only after you are confident, consider exporting the full photometric catalogue; be mindful that this can be a very large file (GB-scale), so plan storage and bandwidth accordingly.
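A quick pandas sanity check of the test export (the filename `DATA7901_DR19_photo.csv` is a placeholder; use whatever name you saved under `input/tables/`):

```python
import pandas as pd

photo = pd.read_csv("input/tables/DATA7901_DR19_photo.csv")
print(photo.shape)   # expect ~100 rows for a TOP (100) test query
print(photo.dtypes)  # confirm numeric columns parsed as numbers
print(photo.head())
```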
- The notebook includes a cell to download the first N spectra using `plate`, `mjd`, and `fiberid` into `input/spectra/`.
- It then plots a few spectra using `astropy.io.fits` to read common SDSS formats (it prefers table HDUs with `loglam`/`flux`, and falls back to image HDUs with `COEFF0`/`COEFF1`).
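A minimal reading sketch for the table-HDU case (the filename is an example; downloaded files follow the `spec-<plate>-<mjd>-<fiberid>.fits` pattern):

```python
import matplotlib.pyplot as plt
from astropy.io import fits

with fits.open("input/spectra/spec-0266-51602-0003.fits") as hdul:
    coadd = hdul[1].data                # first extension holds the coadded spectrum
    wavelength = 10 ** coadd["loglam"]  # loglam = log10(wavelength / Angstrom)
    flux = coadd["flux"]                # units: 1e-17 erg/s/cm^2/Angstrom

plt.plot(wavelength, flux, lw=0.5)
plt.xlabel("Wavelength [Å]")
plt.ylabel("Flux [$10^{-17}$ erg s$^{-1}$ cm$^{-2}$ Å$^{-1}$]")
plt.show()
```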
As with the photometric sample, begin with a small spectroscopic extract to validate the pipeline.
- Open a new CasJobs query window and use the query in `input/queries/DATA7901_DR19_casjobs_spectra.sql`. To keep the output manageable for testing, adapt the select to `TOP (100)` (see notes in the SQL file).
- Run the query. Export to CSV and inspect. If/when you decide to export the full spectroscopic set, note that the files will be large; plan storage and versioning appropriately.
- “File not found”: confirm your CSV is named `DATA7901_DR19.csv` and placed under `input/tables/`.
- Missing `wget`: install it or use the Python fallback image downloader cell.
- Missing `astropy`: `pip install astropy`.
- Spectra 404s: not every `plate`/`mjd`/`fiberid` exists at the hard-coded path. Try a few, or adjust the base URL.
- Duplicate columns in CSV: CasJobs (and pandas) may rename duplicates; the provided notebook uses the columns as exported during development.
After you’ve verified the data flows end-to-end:
- Feature engineering from tables (e.g., thresholds on `p_el`, vote counts)
- Image models (CNNs) and spectral models
- Model evaluation and reporting
Tips:
- Be gentle with external services. Keep download limits small (e.g., 10–50) while testing.
- Think carefully about data volume before mass downloads (cutouts/spectra can be many large files):
  - Start small; download a handful first and verify your pipeline end-to-end.
  - Estimate storage needs (files × average size) and ensure you have space and bandwidth.
  - Save to the intended locations (`input/images/`, `input/spectra/`) and keep a tidy directory structure.
  - If pushing to GitHub, remember to add large files and folders to the `.gitignore` file.
  - Consider caching, checkpoints, or manifests to avoid repeated downloads (see the sketch below).
  - If you need everything, parallelize cautiously and be respectful of rate limits.
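One simple caching pattern: skip files that already exist, so re-running the download cell never re-fetches. This continues the `fetch_cutout` sketch above; `valid_galaxies` is an assumed variable name for the notebook's "valid galaxies" table, and `num_to_download` is the notebook's own parameter:

```python
from pathlib import Path

for _, row in valid_galaxies.head(num_to_download).iterrows():
    target = Path("input/images") / f"{row['objid']}.jpeg"
    if target.exists():
        continue  # already downloaded; re-running the cell costs nothing
    fetch_cutout(row["ra"], row["dec"], row["objid"])
```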
- Exploratory Data Analysis (EDA) is crucial (a sketch follows this item):
  - Visualize distributions of key features (redshift, magnitude, colors)
  - Check class imbalance in galaxy types
  - Identify correlations between features
  - Understand missingness patterns (not random in astronomy!)
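  A quick imbalance and missingness check, reusing the `galaxies` frame from the validation sketch above (the majority-label rule is a simplifying assumption, not the project's definition):

  ```python
  # Crude per-galaxy label: whichever of three broad classes got the top fraction.
  labels = galaxies[["p_el", "p_cs", "p_mg"]].idxmax(axis=1)
  print(labels.value_counts(normalize=True))  # how imbalanced are the classes?

  # Columns with the highest fraction of missing values.
  print(galaxies.isna().mean().sort_values(ascending=False).head(10))
  ```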
- Domain-specific considerations:
  - Photometric errors are not uniform (fainter objects = larger errors)
  - Selection effects: your sample may be biased by survey limitations
  - Physical relationships exist (e.g., color-magnitude diagrams)
- Feature engineering opportunities (a sketch follows this item):
  - Color indices (e.g., g-r, r-i)
  - Morphological parameters from images
  - Spectral line ratios and equivalent widths
  - Photometric redshift estimates
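  Color indices are one-liners once the photometric table is loaded. Assuming your photometric export includes model magnitudes named `g`, `r`, and `i` (check your actual column names), and reusing the `photo` frame from the sanity-check sketch above:

  ```python
  # Differences of magnitudes are colors; ellipticals tend to be redder (larger g-r).
  photo["g_r"] = photo["g"] - photo["r"]
  photo["r_i"] = photo["r"] - photo["i"]
  ```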
- Handling measurement uncertainties (a sketch follows this item):
  - Consider using error-weighted loss functions
  - Propagate uncertainties through your pipeline
  - Bootstrap/Monte Carlo for uncertainty quantification
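  A Monte Carlo propagation sketch: perturb each magnitude by its reported error many times and measure the spread of a derived quantity. The error column names `g_err` and `r_err` are assumptions; use whatever your export provides:

  ```python
  import numpy as np

  rng = np.random.default_rng(42)  # fixed seed for reproducibility
  n_trials = 1000

  # Draw n_trials perturbed copies of each magnitude: Gaussian around the
  # measured value, with the catalogued per-object error as the width.
  g = rng.normal(photo["g"].to_numpy(), photo["g_err"].to_numpy(), (n_trials, len(photo)))
  r = rng.normal(photo["r"].to_numpy(), photo["r_err"].to_numpy(), (n_trials, len(photo)))

  color_std = (g - r).std(axis=0)  # per-object 1-sigma uncertainty on g-r
  ```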
- Memory management (a sketch follows this item):
  - Use data generators/loaders for large datasets
  - Consider chunking strategies for processing
  - Profile memory usage before scaling up
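  For a GB-scale CSV (using the same hypothetical filename as above), pandas can stream the file in chunks so the full table never sits in memory at once:

  ```python
  import pandas as pd

  n_rows = 0
  # chunksize controls peak memory; each chunk is an ordinary DataFrame.
  for chunk in pd.read_csv("input/tables/DATA7901_DR19_photo.csv", chunksize=100_000):
      n_rows += len(chunk)  # replace with your real per-chunk processing
  print(n_rows)
  ```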
- Model complexity vs. performance trade-off:
  - Start with simple, fast models for a baseline
  - Document training time and inference speed
  - Consider model size for deployment scenarios
- Version control best practices:
  - Don't commit large data files (use `.gitignore`)
  - Document random seeds for reproducibility
  - Keep a changelog of experiments
- Documentation requirements:
  - README with clear setup instructions
  - `requirements.txt` or `environment.yml`
  - Jupyter notebooks with markdown explanations
  - Final report linking to industry applications
- First and foremost, this is your project — organize it in a way that works for you and be innovative.
- Break work into small tasks:
  - Clarify the objective (what is success?).
  - Choose the problem type: classification vs regression.
  - Prepare your data: handle missing values, outliers, scaling; encode categoricals.
  - Choose and implement cross-validation.
  - Select candidate models; train baselines and iterate.
  - Evaluate with appropriate metrics; fine-tune hyperparameters.
- Cross-validation options (pick what fits your data; a sketch follows this item):
  - Stratified K-fold: recommended for imbalanced galaxy classes
  - Hold-out: good for large datasets (70/15/15 or 60/20/20 split)
  - Time-based split: if using time-series spectral features
  - K-fold: standard choice for balanced datasets
  - Group K-fold: if galaxies are grouped (e.g., by survey region)
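  A stratified K-fold skeleton, assuming scikit-learn is installed. The label rule (majority among three broad classes) and `feature_cols` are placeholders; be careful not to include the `p_*` columns you derive labels from among the features, which would be data leakage:

  ```python
  from sklearn.model_selection import StratifiedKFold

  y = galaxies[["p_el", "p_cs", "p_mg"]].idxmax(axis=1)  # crude labels (assumption)
  X = galaxies[feature_cols]  # feature_cols: your engineered features, not the p_* labels

  skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
  for train_idx, test_idx in skf.split(X, y):
      # Each fold preserves the class proportions of y.
      X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
      y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
      # fit and evaluate your model here
  ```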
- Data and model types:
  - Supervised: has one or more targets
    - Classification: predict a category
    - Regression: predict a continuous value
  - Unsupervised: no target (e.g., clustering, dimensionality reduction)
- Tabulated data — common model choices:
  - Decision Trees
  - Random Forests
  - Logistic Regression
  - Gradient Boosting (e.g., XGBoost/LightGBM/CatBoost)
  - Symbolic Regression (PySR/gplearn) — discovers interpretable equations
  - Neural Networks (use judiciously; simpler models may match their performance with less complexity)
- Evaluation metrics (select per objective; a sketch follows this item):
  - Classification:
    - Confusion matrix (essential for multi-class)
    - Per-class precision/recall (identify weak classes)
    - Weighted/macro/micro F1 scores
    - ROC curves for each class (one-vs-rest)
  - Regression (if predicting continuous properties):
    - MAE, MSE, RMSE
    - R² score
    - Residual plots
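  With any fitted classifier, the core classification metrics are a few lines (scikit-learn assumed; `model`, `X_test`, and `y_test` come from your own training loop, e.g., the fold skeleton above):

  ```python
  from sklearn.metrics import classification_report, confusion_matrix

  y_pred = model.predict(X_test)
  print(confusion_matrix(y_test, y_pred))                 # rows: true class, cols: predicted
  print(classification_report(y_test, y_pred, digits=3))  # per-class precision/recall/F1
  ```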
- Deep Learning considerations:
  - CNNs for galaxy images (consider transfer learning from pre-trained models)
  - 1D CNNs or LSTMs for spectra
  - Attention mechanisms for identifying important features
  - Start with smaller architectures; deep ≠ better
  - Monitor for overfitting (use early stopping, dropout)
- Data leakage: Ensure no target information in features
- Overfitting to small samples: Use proper validation
- Ignoring class imbalance: Consider SMOTE, class weights
- Not checking data quality: Remove artifacts, bad pixels
- Forgetting to scale features: Especially mixing images/tabular data
- Training on incomplete data: Handle NaNs before modeling
- SDSS CasJobs and data services
- Legacy Survey image cutouts