Instructor: Professor Gordon Richards | Lecture: MW 9:30-10:50am; Disque 919 |
e-mail: gtr25 (on the standard Drexel domain) | Office Hours: R 2:30-3:30pm (but please e-mail ahead) |
Text: Statistics, Data Mining, and Machine Learning in Astronomy, Ivezic, Connolly, VanderPlas, and Gray | http://www.physics.drexel.edu/~gtr/teaching/phys_440_540/ |
Course Description: Course Purpose: Prerequisites: Expected Learning Outcomes: Course Materials:
Grading: Lecture: For 2024, the class will be in person. However, if there is interest, I can make the videos that I recorded for the 2020 class available as appropriate. These include some pre-lecture videos (about 15-20 minutes each) and the lectures themselves. I haven't updated them, so they aren't a substitute for the actual lecture, but they will provide a way to get more help.
The lecture notebooks will be available on GitHub
at github.com/gtrichards/PHYS_440_540/.
Your "exit ticket" for each lecture (due by the end of the day) will be the completed notebook from lecture. If you attended the lecture, you should be able to submit this at the end of class. If you were unable to attend the lecture, submitting the completed notebook is how you will receive partial credit (80%) for that day. I'll allow you 2 "free passes" during the quarter, as I'd prefer that you not come to lecture if you are sick. Late exit tickets will lose 20% credit per day.
Homework: Final Paper/Project:
The grading rubric for the Paper/Project is as follows:
Machine learning jargon/mathematics can be somewhat "inaccessible".
Your primary goal for these projects is to take something complicated
and find a simpler way to explain it (in your own words). Your goal is not to simply regurgitate information from
your resources; it is to distill it.
It is not
necessary for undergraduate students to do any data analysis as this is not intended as a
quarter-long research project. However, data analysis could be one
aspect of the project and some data analysis is required for graduate students (ideally using data from your field of study).
Some suggestions for possible projects include:
Describe one particular method (or family of methods) in detail. Give anecdotes, make animations, etc. to illustrate the mathematical principles.
Academic Policies:
Collaboration is encouraged (e.g., you may show your broken code to a colleague and seek their advice), but they may not share their actual working code with you.
Students may not copy one another's exams,
homeworks, or code. All of these are considered cheating
and will be dealt with in the following manner. The first infraction
will result in a zero for all parties involved. The second infraction
will result in an F for the course and a report to the office of
academic affairs.
Students with disabilities requesting accommodations and services at
Drexel University need to present a current accommodation verification
letter (AVL) to faculty before accommodations can be made. AVLs are
issued by the Office of Equity and Diversity (OED). For additional
information, see drexel.edu/disability-resources/support-accommodations/student-family-resources/.
For Health and Counseling needs, students can find further information at drexel.edu/counselingandhealth/student-health-center/overview/ and drexel.edu/counselingandhealth/counseling-center/overview/.
Appropriate Use of Course Materials:
Briefly, this policy states that all course materials, including recordings provided by the course instructor, may not be copied, reproduced, distributed, or re-posted. Doing so may be considered a breach of this policy and will be investigated and addressed as possible academic dishonesty, among other potential violations. Improper use of such materials may also constitute a violation of the University Code of Conduct and will be investigated as such.
Finally, changes to the parameters of the course may need to be made during
the quarter. In the case of such events, students will be notified by
the instructor through their official Drexel e-mail.
Last Modified: 13 September 2024
Many Physics disciplines today involve data volumes and/or data rates
that are solidly in the realm of "Big Data", for which knowledge of
modern machine learning tools is crucial. This course provides the
framework for physics students at all levels to begin interacting with
large data sets in physics. Data analysis will be done using Python
tools, including
the Scikit-Learn library for
machine learning. We concentrate on practical application of
classification and regression techniques for both unsupervised and
supervised data in addition to dimensionality reduction techniques and
time-domain analysis. An introduction to statistical methods,
Bayesian inference, and Markov chain Monte Carlo methods provides a
foundation for application of machine learning tools.
Big Data Physics is a topical course that is designed to give Physics
students an introduction to methods for getting the most information
out of massive data sets using modern machine learning techniques.
The core of the course will be based on astronomical data as it is a
particularly rich dataset for exploration with machine learning. However, we will use as broad a range of data sets from physics (and perhaps industry)
as possible, and this course should be useful to all students
interested in data analysis in the physical sciences. This course
counts for 3 credits within
the Physics
"methods" requirements.
Students should be familiar with the Python programming language; completion of the Contemporary Physics sequence (113, 114, 115) and/or CS 171 is sufficient. Experience with Anaconda Python, SciPy, NumPy, as well as GitHub would be ideal, but is not required.
This course will address the Drexel Learning Priorities: Communication, Critical Thinking, Information
Literacy, Self-Directed Learning, and Technology Use. Upon completion of this course, students will be able to:
Graduate students are further expected to be able to demonstrate the
application of one or more machine learning techniques specifically to
data sets within their research areas.
The text for this class is Statistics,
Data Mining, and Machine Learning in Astronomy: A Practical Python
Guide for the Analysis of Survey Data by Ivezic, Connolly,
VanderPlas, and Gray (ISBN: 9780691198309).
Other excellent references include:
Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas, which serves as an excellent basic reference for this class and is available online.
A more advanced book is Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Geron. Note that O'Reilly books can be accessed for free with your Drexel ID.
Those interested in more details about the algorithms might look to: The Elements of Statistical Learning by
Hastie, Tibshirani, and Friedman. Note that the PDF is available online through the link.
Students will also be required to gain practical experience with broad types of data sets using the software learning environment at datacamp.com, which can also serve as a refresher for Python programming.
Finally, for students looking for an introduction to (or refresher on) basic statistics, one of the classics is
Data Reduction and Error Analysis for the Physical Sciences by Bevington and Robinson.
Grading will be based upon lecture attendance, class participation,
homework, and a paper/project. We will use a 10-point grading scale
(80-82=B-, 83-86=B, 87-89=B+; and similarly for A, C, D). The final
course grades will be weighted as follows:
20% Lecture Attendance/Class Participation
60% Homework
20% Final Paper/Project
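As a rough illustration of how the weights and the 10-point scale combine, here is a minimal Python sketch. The function names, the rounding to the nearest integer, the F cutoff below 60, and the literal extension of the +/- pattern to A, C, and D are assumptions made for illustration; they are not official course policy.

```python
# Component weights as stated above (fractions of the final grade).
WEIGHTS = {"lecture": 0.20, "homework": 0.60, "project": 0.20}


def weighted_score(scores):
    """Combine per-component percentages (0-100) into one course percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a percentage to a letter grade on the 10-point scale.

    Assumptions (hypothetical, not stated in the syllabus): the score is
    rounded to the nearest integer, anything below 60 is an F, and the
    B pattern (x0-x2 minus, x3-x6 plain, x7-x9 plus) applies unchanged
    to A, C, and D, with 100 treated like 99.
    """
    points = min(round(score), 99)
    if points < 60:
        return "F"
    tens, ones = divmod(points, 10)
    letter = {6: "D", 7: "C", 8: "B", 9: "A"}[tens]
    if ones <= 2:
        return letter + "-"
    if ones >= 7:
        return letter + "+"
    return letter


if __name__ == "__main__":
    example = {"lecture": 90, "homework": 85, "project": 80}
    total = weighted_score(example)
    print(total, letter_grade(total))
```

With the example scores (90, 85, 80), the weighted total is 0.2*90 + 0.6*85 + 0.2*80 = 85, which falls in the 83-86 band for a B.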
We will meet twice a week for 1.5 hours. The class period will be
roughly evenly divided between lecture and hands-on practice following
the examples from the book.
Every week there will be a programming/reading assignment for homework to help synthesize the topics covered in lectures.
Some of the homework will be in the form of DataCamp assignments. You will receive an e-mail with a link to join our classroom on their site.
Find out more here:
datacamp.com/groups/education.
Within DataCamp you are welcome (and encouraged) to work through any
lessons of interest -- even if we aren't using them for this class.
Starter lessons that can be explored on your own (even before class
begins) are included, as are more advanced lessons in the "Data
Scientist with Python" track (e.g., intermediate Python, Pandas, or
Bokeh). Homework assignments outside of the DataCamp environment
should be submitted as a Jupyter notebook via GitHub.
Instead of a final exam, you will complete a capstone project on a topic of your choosing. You will write a
4+ page paper (not including figures [which are strongly suggested], references [required], or code).
By the end of Week 8, please e-mail me with your topic. (10%)
By the end of Week 9, please submit a 1-page draft of your paper to Blackboard. (20%)
Paper (due by 11:59pm on Sunday 8 December, submitted to Blackboard; 70%, covering background/motivation, accessibility of explanations, appropriateness of figures/references, quality of writing, [and data analysis for graduate students])
Illustrate the Pros and Cons of classical vs. Bayesian methods discussed in Section 5.9
Illustrate the Pros and Cons of the algorithms discussed in Section 6.6
Illustrate the Pros and Cons of the algorithms discussed in Section 7.7
Illustrate the Pros and Cons of the algorithms discussed in Section 8.12
Illustrate the Pros and Cons of the algorithms discussed in Section 9.9
Enter a Kaggle competition: www.kaggle.com
Students are expected to be familiar with Drexel's policies on
Academic Integrity, Plagiarism, Dishonesty and Cheating: drexel.edu/provost/policies-calendars/policies/academic-integrity/,
Course Adding/Dropping: drexel.edu/provost/policies-calendars/policies/course-add-drop,
Course Withdrawal: drexel.edu/provost/policies-calendars/policies/course-coop-withdrawal,
Incomplete Grades: drexel.edu/provost/policies-calendars/policies/incomplete_grades/, and
Grade Appeals: drexel.edu/provost/policies-calendars/policies/grade-appeals/.
drexel.edu/counselingandhealth/student-health-center/overview/
drexel.edu/counselingandhealth/counseling-center/overview/
It is important to recognize that some or all of the course materials provided to you may be the intellectual property of Drexel University, the course instructor, or others. Use of this intellectual property is governed by Drexel University policies, including the IT-1 Policy.
Tentative Course Schedule
Lecture | Subject | Notebook | Reading |
1 (9/23, Monday) | Introduction | Motivation, InitialSetup | Hastie Chapter 1; Ivezic 1.1, 1.3, 1.4, 1.5.1 |
2 (9/25, Wednesday or 09/27, Friday???) | Introduction | HistogramExample | Ivezic 4.8 |
3 (9/30, Monday) | Basic Stats and Bayes Rule | BasicStats | Ivezic 1.2, 3.0, 3.1, 3.2 |
4 (10/2, Wednesday) | More Basic Stats | BasicStats2 | Ivezic 3.3, 3.4, 3.5 |
5 (10/7, Monday) | Classical and Bayesian Inference | Inference | Ivezic 4.0, 4.1, 4.2, 4.3, 4.5, 5.0, 5.1, 5.2 |
6 (10/9, Wednesday or 10/11, Friday???) | Bayesian Inference and MCMC | Inference2 | Ivezic 4.6, 5.3, 5.4, 5.6, 5.8 |
X (10/14, Monday) | University Holiday (but there will be online assignments for Wed) | | |
7 (10/16, Wednesday) | Intro to Scikit-Learn (N.B. Pre-lecture assignments) | Scikit-Learn-Intro | Ivezic 1.5, 1.6.1, 1.6.2, 1.7 |
8 (10/21, Monday) | Finding Structure in Data | DensityEstimation | Ivezic 6.0, 6.1, 6.2, 2.2, 2.4, 2.5 |
9 (10/23, Wednesday or 10/25, Friday???) | Clustering | DensityEstimation2andClustering | Ivezic 4.4, 6.3, 6.4, 6.6 |
10 (10/28, Monday) | Dimensional Reduction, PCA, NMF, and ICA | DimensionReduction | Ivezic 7.0, 7.1, 7.3, 7.4 |
11 (10/30, Wednesday) | Nonlinear Dimensional Reduction, Manifold Learning, and tSNE | NonlinearDimensionReduction | Ivezic 7.5 and tSNE references |
12 (11/4, Monday) | Intro to Regression | Regression | Ivezic 8.0, 8.1, 8.2, 8.3, 8.5 |
13 (11/6, Wednesday) | Advanced Topics in Regression | Regression2 | Ivezic 8.6-8.11 |
14 (11/11, Monday) | Generative Classification | Classification | Ivezic 9.0-9.4, 9.8 |
15 (11/13, Wednesday or 11/15, Friday???) | Discriminative Classification | Classification2 | Ivezic 9.5-9.9 |
16 (11/18, Monday) | Advanced Topics in Classification | Classification3 | Ivezic |
17 (11/20, Wednesday) | Neural Networks | NeuralNetworksIntegrated | |
18 (11/25, Monday) | Neural Networks | NeuralNetworksIntegrated2 | |
X (11/27, Wednesday) | No Lecture | | |
19 (12/2, Monday) | Time Series | TimeSeries | Ivezic 10.0-10.4 |
20 (12/4, Wednesday) | Time Series | TimeSeries2 | Ivezic 10.5 |