Astronomical Images for Machine Learning Applications

Carter Rhea
4 min read · Nov 14, 2020


Hubble Space Telescope

An example of the power of the Hubble Space Telescope. This image alone contains several thousand galaxies!

The Hubble Space Telescope (HST) has revealed an enormous wealth of astronomical information over the past several decades. That being said, this article is not going to focus on the HST’s scientific prowess. Instead, I will describe how to query the Hubble Legacy Archive for data to use in statistical or machine learning applications. Although we won’t be doing any machine learning in this tutorial, we will set up the problem here.

Imagine that you want to write a machine learning algorithm to classify galaxies based on their morphology. Well, you’ll need a well-rounded training set of galaxies! Let’s explore how we can go about doing this :)

First off, we need to choose a sample of galaxies. Since we are choosing galaxies for a supervised machine learning problem, we want galaxies that are already labeled with their morphological classification. Hello Galaxy Zoo!! Galaxy Zoo is an impressive citizen-science initiative aimed (in part) at classifying galaxies based on their morphology; check out the project’s website for more. We will be using a specific dataset known as CANDELS: Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey. From the data that Galaxy Zoo makes available, we will be using the schawinski_GZ_2010_catalogue.fits file.

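from astropy.io import fits
from astropy.table import Table

# Open the Galaxy Zoo catalog; the table lives in the first extension HDU
GZ_catalog = fits.open('schawinski_GZ_2010_catalogue.fits')
catalog = Table(GZ_catalog[1].data)
# [0] takes the first table row; the loops below index into these values per galaxy
ObjID = catalog['OBJID'][0]
Morphology = catalog['GZ1_MORPHOLOGY'][0]
RA = catalog['RA'][0]
DEC = catalog['DEC'][0]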

Although a wealth of information exists in the GZ_catalog, we will concern ourselves with the Object IDs, Morphology, and coordinates (RA, DEC).
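
Before downloading anything, it can be worth checking how balanced the labels are, since a classifier trained on a lopsided sample tends to favor the majority class. Below is a minimal sketch using only the standard library. The label values are made up for illustration, and class_fractions is a hypothetical helper; in practice you would pass in the Morphology values loaded above.

```python
from collections import Counter

def class_fractions(labels):
    """Return the fraction of the sample that falls in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels, not taken from the real catalog
example_labels = ['spiral', 'elliptical', 'spiral', 'uncertain', 'spiral']
fractions = class_fractions(example_labels)
for label, fraction in sorted(fractions.items()):
    print(f"{label}: {fraction:.0%}")
```

If one class dominates, you may want to subsample it or weight the classes when you get to training.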

Now that we have the location and morphology of our galaxies, we need to download the Hubble data. To do this, we will be using the astroquery package, which allows us to query several astronomical archives that are diligently kept up to date by a dedicated team of astronomers. We will be querying the Barbara A. Mikulski Archive for Space Telescopes (MAST), where we can search for specific HST observations. For demonstrative purposes, we will select observations taken by the Wide Field Camera 3 (WFC3) in the F160W band. More specifically, we are only going to download science-ready FITS files.

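from astropy import coordinates as coords
from astropy import units as u
from astroquery.mast import Observations

for gal_ct in range(len(ObjID)):
    id_ = ObjID[gal_ct]
    morph_ = Morphology[gal_ct]
    ra_ = RA[gal_ct]
    dec_ = DEC[gal_ct]
    # Sky position of the galaxy (RA and DEC taken to be in degrees)
    pos = coords.SkyCoord(ra=ra_*u.degree, dec=dec_*u.degree, frame='fk5')
    # Use the MAST catalog to query for Hubble observations at this position
    # (the instrument name is an assumed value for the WFC3 infrared channel)
    obs = Observations.query_criteria(coordinates=pos, project='HST',
                                      instrument_name='WFC3/IR', filters='F160W')
    if len(obs) > 0:  # If there exists Hubble F160W data
        for ob in obs:
            # Get the list of data products for this observation
            data_products = Observations.get_product_list(ob['obsid'])
            # Download only the science-ready drizzled (DRZ) FITS files
            out_dir = 'HST/' + str(ob['obsid']) + '-' + str(ob['filters'])
            manifest = Observations.download_products(
                data_products, download_dir=out_dir, productType='SCIENCE',
                extension='fits', productSubGroupDescription='DRZ')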
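
The three filter arguments passed to download_products narrow the product list down to the science-ready images. As a rough illustration of that selection logic (not astroquery's actual implementation), here is a plain-Python sketch over a few hypothetical product records. The filenames are invented, select_products is a hypothetical helper, and the productFilename field name is an assumption about how the records are laid out.

```python
def select_products(products, product_type, extension, subgroup):
    """Keep the records that match all three criteria."""
    return [
        p for p in products
        if p['productType'] == product_type
        and p['productFilename'].endswith('.' + extension)
        and p['productSubGroupDescription'] == subgroup
    ]

# Hypothetical records for illustration only
example_products = [
    {'productFilename': 'example_drz.fits', 'productType': 'SCIENCE',
     'productSubGroupDescription': 'DRZ'},
    {'productFilename': 'example_flt.fits', 'productType': 'SCIENCE',
     'productSubGroupDescription': 'FLT'},
    {'productFilename': 'example_drz.jpg', 'productType': 'PREVIEW',
     'productSubGroupDescription': 'DRZ'},
]
selected = select_products(example_products, 'SCIENCE', 'fits', 'DRZ')
print([p['productFilename'] for p in selected])
```

Only the record that satisfies all three conditions survives, which is why each observation folder ends up holding far fewer files than the full product list.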

Note that this step will take quite some time since we are looking at nearly 50,000 galaxies! Once it has finished, let’s take a look at one of these FITS files.

An example of the HST image downloaded in the form of a fits file. The image has been rendered using the astronomical software DS9.

Since the WFC3 takes, as the name suggests, a wide-field image, we need to cut out the part of the image containing our galaxy :) First, we will do a little bit of cleaning and move all the FITS files into a single folder.

import os
import shutil

# Gather every downloaded FITS file into a single folder
os.makedirs('HST-Final', exist_ok=True)
for root, dirs, files in os.walk("HST"):
    for file in files:
        if file.endswith('.fits'):
            print(os.path.join(root, file))
            shutil.copy(os.path.join(root, file), os.path.join('HST-Final', file))
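
Once the files are collected, each wide-field image still has to be trimmed to a thumbnail around the galaxy. One possible way to think about that step is sketched below: given the galaxy's pixel position, compute the row and column bounds of a square cutout, clipped to the image edges. The function name and the example numbers are hypothetical, and for real images the conversion from RA and DEC to pixel coordinates has to come from the WCS stored in the FITS header.

```python
def cutout_bounds(center_xy, size, image_shape):
    """Return (y0, y1, x0, x1) for a size x size box around center_xy.

    center_xy is (x, y) in pixels; image_shape is (n_rows, n_cols).
    The box is clipped so that it never extends past the image edges.
    """
    x, y = center_xy
    n_rows, n_cols = image_shape
    half = size // 2
    x_start = int(round(x)) - half
    y_start = int(round(y)) - half
    x0 = max(x_start, 0)
    y0 = max(y_start, 0)
    x1 = min(x_start + size, n_cols)
    y1 = min(y_start + size, n_rows)
    return y0, y1, x0, x1

# Hypothetical 1000x1000 image
inside = cutout_bounds((500, 300), 40, (1000, 1000))    # galaxy well inside the image
near_edge = cutout_bounds((10, 995), 40, (1000, 1000))  # galaxy near a corner
print(inside, near_edge)
```

With a NumPy array, the thumbnail would then be image[y0:y1, x0:x1]. Cutouts that come back smaller than size x size sit near an edge and may be worth discarding.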

Sloan Digital Sky Survey

Another amazing tool in the astronomer’s toolkit is the Sloan Digital Sky Survey (SDSS), which is well worth a look on the project’s website.

I’m not going to go into the amazingness of the SDSS here, but that shouldn’t stop everyone from investigating it more! So let’s get down to business. How do we go about getting images we can use for future machine learning projects? Note that we are using the same catalog and lists we made earlier :) We are going to make 40x40 pixel thumbnails.

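import os
import pickle
from astropy import coordinates as coords
from astropy import units as u
from astropy import wcs
from astropy.nddata import Cutout2D
from astroquery.sdss import SDSS

pix = 40  # Thumbnail size in pixels (40x40)
arcsec = 0.396*pix  # Thumbnail width in arcseconds (SDSS pixels are 0.396 arcsec)
os.makedirs('SDSS', exist_ok=True)
for gal_ct in range(len(ObjID)):
    id_ = ObjID[gal_ct]
    ra_ = RA[gal_ct]
    dec_ = DEC[gal_ct]
    # Get sky coordinates
    pos = coords.SkyCoord(ra=ra_*u.degree, dec=dec_*u.degree, frame='fk5')
    # Fetch the r-band SDSS frame(s) covering this position and keep the first
    im = SDSS.get_images(pos, band='r')
    image_fits = im[0]
    data = image_fits[0].data
    header = image_fits[0].header
    w = wcs.WCS(header)
    # Cut out the thumbnail (Cutout2D is an assumption for how cutout is built)
    cutout = Cutout2D(data, pos, (pix, pix), wcs=w).data
    with open('SDSS/'+str(id_)+"_image.p", "wb") as f:
        pickle.dump(cutout, f)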
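
To get a feel for how much sky such a thumbnail covers, the pixel count can be converted to an angular size. The sketch below assumes the SDSS pixel scale of 0.396 arcseconds per pixel quoted in the code above; the function and variable names are illustrative.

```python
PIXEL_SCALE_ARCSEC = 0.396  # SDSS pixel scale in arcseconds per pixel

def thumbnail_size_arcsec(n_pixels, pixel_scale=PIXEL_SCALE_ARCSEC):
    """Angular width of an n_pixels-wide thumbnail, in arcseconds."""
    return n_pixels * pixel_scale

width_arcsec = thumbnail_size_arcsec(40)
width_arcmin = width_arcsec / 60.0
print(f"{width_arcsec:.2f} arcsec = {width_arcmin:.3f} arcmin")
```

A 40-pixel thumbnail therefore spans roughly 16 arcseconds on a side, so galaxies that appear much larger than that on the sky will be cropped.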

And that's it! Now we have a set of images (47,675 to be exact) from HST and SDSS for further projects :D
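
When you come back to these thumbnails later, they need to be read off disk again. Here is a minimal sketch of one way to do that, assuming the files follow the ID + "_image.p" naming used in the loop above. The load_thumbnails helper, the stand-in thumbnail, and the object ID are all hypothetical, and the demonstration writes to a temporary folder rather than the real SDSS directory.

```python
import os
import pickle
import tempfile

def load_thumbnails(folder):
    """Load every pickled thumbnail in folder, keyed by object ID."""
    thumbnails = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith('_image.p'):
            obj_id = name[:-len('_image.p')]
            with open(os.path.join(folder, name), 'rb') as f:
                thumbnails[obj_id] = pickle.load(f)
    return thumbnails

# Round-trip demonstration with a stand-in thumbnail in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    fake_thumbnail = [[0.0, 1.0], [2.0, 3.0]]  # placeholder for a 40x40 array
    with open(os.path.join(tmp, '12345_image.p'), 'wb') as f:
        pickle.dump(fake_thumbnail, f)
    loaded = load_thumbnails(tmp)
print(list(loaded))
```

Since unpickling can execute arbitrary code, only load pickle files that you created yourself.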

Example r-band thumbnail

In this short tutorial, I’ve demonstrated one way to download HST and SDSS images for future use :) In the next part of the series, I will discuss ways to prepare these data for use in machine learning applications!



Carter Rhea

PhD Student in Astrophysics at the University of Montreal working on machine learning in astronomy. Co-founder of