PHYS 440/540

Big Data Physics

Fall 2024-2025

Instructor: Professor Gordon Richards
Lecture: MW 9:30-10:50am; Disque 919
e-mail: gtr25 (on the standard Drexel domain)
Office Hours: R 2:30-3:30pm (but please e-mail ahead)
Text: Statistics, Data Mining, and Machine Learning in Astronomy, by Ivezic, Connolly, VanderPlas, and Gray
http://www.physics.drexel.edu/~gtr/teaching/phys_440_540/
NOT YET FULLY UPDATED FOR FALL 2024. STAY TUNED.

Course Description:
Many Physics disciplines today involve data volumes and/or data rates that are solidly in the realm of "Big Data", for which knowledge of modern machine learning tools is crucial. This course provides the framework for physics students at all levels to begin interacting with large data sets in physics. Data analysis will be done using Python tools, including the Scikit-Learn library for machine learning. We concentrate on practical application of classification and regression techniques for both unsupervised and supervised data, in addition to dimensionality reduction techniques and time-domain analysis. An introduction to statistical methods, Bayesian inference, and Markov chain Monte Carlo methods provides a foundation for application of machine learning tools.

Course Purpose:
Big Data Physics is a topical course designed to give Physics students an introduction to methods for getting the most information out of massive data sets using modern machine learning techniques. The core of the course will be based on astronomical data, which are particularly rich for exploration with machine learning. However, we will use as broad a range of data sets from physics (and perhaps industry) as possible, and this course should be useful to all students interested in data analysis in the physical sciences. This course counts for 3 credits within the Physics "methods" requirements.

Prerequisites:
Students should be familiar with the Python programming language; completion of the Contemporary Physics sequence (113, 114, 115) and/or CS 171 is sufficient. Experience with Anaconda Python, SciPy, NumPy, as well as GitHub would be ideal, but is not required.

Expected Learning Outcomes:
This course will address the Drexel Learning Priorities: Communication, Critical Thinking, Information Literacy, Self-Directed Learning, and Technology Use. Upon completion of this course, students will be able to:

Graduate students are further expected to be able to demonstrate the application of one or more machine learning techniques specifically to data sets within their research areas.

Course Materials:
The text for this class is Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data by Ivezic, Connolly, VanderPlas, and Gray (ISBN: 9780691198309).

Other excellent references include:
Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas, which serves as an excellent basic reference for this class and is available online.

A more advanced book is Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Geron. Note that O'Reilly books can be accessed for free with your Drexel ID.

Those interested in more details about the algorithms might look to: The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Note that the PDF is available online through the link.

Students will also be required to gain practical experience with broad types of data sets using the software learning environment at datacamp.com, which can also serve as a refresher for Python programming.


Finally, for students looking for an introduction (or refresher) for basic statistics, one of the classics is Data Reduction and Error Analysis for the Physical Sciences by Bevington and Robinson.

Grading:
Grading will be based upon lecture attendance, class participation, homework, and a paper/project. We will use a 10-point grading scale (80-82=B-, 83-86=B, 87-89=B+; and similarly for A, C, D). The final course grades will be weighted as follows:
20% Lecture Attendance/Class Participation
60% Homeworks
20% Final Paper/Project
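
As a rough illustration of how these weights combine, here is a minimal Python sketch. The function names and the example scores are hypothetical. Only the B band of the letter scale is stated explicitly above, so the A, C, and D bands, the F cutoff below 60, and the truncation of fractional scores are all assumptions rather than official policy.

```python
# Weights taken from the grading breakdown above; scores are on a 0-100 scale.
WEIGHTS = {"attendance": 0.20, "homework": 0.60, "project": 0.20}


def weighted_score(scores):
    """Return the weighted course score (0-100) from per-component scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter on a 10-point scale.

    Stated in the syllabus: 80-82=B-, 83-86=B, 87-89=B+.
    Assumed here: A, C, and D follow the same pattern, anything below 60
    is an F, and fractional scores are truncated (not rounded up).
    """
    points = int(score)
    if points < 60:
        return "F"
    decade = min(points // 10, 9)  # 6, 7, 8, or 9 (a score of 100 stays in the A band)
    letter = {9: "A", 8: "B", 7: "C", 6: "D"}[decade]
    ones = points - 10 * decade    # position within the 10-point band
    if ones <= 2:
        return letter + "-"
    if ones <= 6:
        return letter
    return letter + "+"


# Hypothetical component scores for one student.
example = {"attendance": 95, "homework": 88, "project": 80}
total = weighted_score(example)
print(f"{total:.1f} -> {letter_grade(total)}")
```

Because homework carries 60% of the weight, a 10-point change in the homework average moves the course score by 6 points, while the same change in either other component moves it by 2.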

Lecture:
We will meet twice a week for 1.5 hours. The class period will be roughly evenly divided between lecture and hands-on practice following the examples from the book.

For 2024, the class will be in person. However, if there is interest I can make the videos that I recorded for the 2020 class available as appropriate. These include some pre-lecture videos (about 15-20 minutes) and the lectures themselves. I haven't updated them, so they aren't a substitute for the actual lecture, but will provide a way to get more help.

The lecture notebooks will be available on github at github.com/gtrichards/PHYS_440_540/. Your "exit ticket" for each lecture (due by the end of the day) will be the completed notebook from lecture. If you attended the lecture, you should be able to submit this at the end of class. If you were unable to make the lecture, submitting the completed notebook is how you will receive partial credit (80%) for that day. I'll allow you 2 "free passes" during the quarter, as I'd prefer you not come to lecture if sick. Late exit tickets will lose 20% credit per day.

Homework:
Every week there will be a programming/reading assignment for homework to help synthesize the topics covered in lectures. Some of the homework will be in the form of DataCamp assignments. You will receive an e-mail with a link to join our classroom on their site. Find out more here: datacamp.com/groups/education. Within DataCamp you are welcome (and encouraged) to work through any lessons of interest, even if we aren't using them for this class. Starter lessons that can be explored on your own (even before class begins) are included, as are more advanced lessons in the "Data Scientist with Python" track (e.g., intermediate Python, Pandas, or Bokeh). Homework assignments outside of the DataCamp environment should be submitted as a Jupyter notebook via GitHub.

Final Paper/Project:
Instead of a final exam, you will complete a capstone project on a topic of your choosing. You will write a 4+ page paper (not including figures [which are strongly suggested], references [required], or code).

The grading rubric for the Paper/Project is as follows:
By the end of Week 8, please e-mail me with your topic. (10%)
By the end of Week 9, please submit a 1-page draft of your paper to Blackboard. (20%)
Paper (due by 11:59pm on Sunday 8 December, submitted to Blackboard; 70%, covering background/motivation, accessibility of explanations, appropriateness of figures/references, quality of writing, [and data analysis for graduate students])

Machine learning jargon/mathematics can be somewhat "inaccessible". Your primary goal for these projects is to take something complicated and find a simpler way to explain it in your own words. Your goal is not to simply regurgitate information from your resources; it is to distill it.

It is not necessary for undergraduate students to do any data analysis as this is not intended as a quarter-long research project. However, data analysis could be one aspect of the project and some data analysis is required for graduate students (ideally using data from your field of study).

Some suggestions for possible projects include:

Describe one particular method (or family of methods) in detail. Give anecdotes, make animations, etc. to illustrate the math principles.
Illustrate the Pros and Cons of classical vs. Bayesian methods discussed in Section 5.9
Illustrate the Pros and Cons of the algorithms discussed in Section 6.6
Illustrate the Pros and Cons of the algorithms discussed in Section 7.7
Illustrate the Pros and Cons of the algorithms discussed in Section 8.12
Illustrate the Pros and Cons of the algorithms discussed in Section 9.9
Enter a Kaggle competition: www.kaggle.com

Academic Policies:
Students are expected to be familiar with Drexel's policies on
Academic Integrity, Plagiarism, Dishonesty and Cheating: drexel.edu/provost/policies-calendars/policies/academic-integrity/,
Course Adding/Dropping: drexel.edu/provost/policies-calendars/policies/course-add-drop,
Course Withdrawal: drexel.edu/provost/policies-calendars/policies/course-coop-withdrawal,
Incomplete Grades: drexel.edu/provost/policies-calendars/policies/incomplete_grades/, and
Grade Appeals: drexel.edu/provost/policies-calendars/policies/grade-appeals/.

Collaboration is encouraged (e.g., you may show your broken code to a colleague and seek their advice), but they may not share their actual working code with you. Students may not copy one another's exams, homeworks, or code. All of these are considered cheating and will be dealt with in the following manner. The first infraction will result in a zero for all parties involved. The second infraction will result in an F for the course and a report to the office of academic affairs.

Students with disabilities requesting accommodations and services at Drexel University need to present a current accommodation verification letter (AVL) to faculty before accommodations can be made. AVLs are issued by the Office of Equity and Diversity (OED). For additional information, see drexel.edu/disability-resources/support-accommodations/student-family-resources/.

For Health and Counseling needs, students can find further information at
drexel.edu/counselingandhealth/student-health-center/overview/
drexel.edu/counselingandhealth/counseling-center/overview/

Appropriate Use of Course Materials:
It is important to recognize that some or all of the course materials provided to you may be the intellectual property of Drexel University, the course instructor, or others. Use of this intellectual property is governed by Drexel University policies, including the IT-1 Policy.

Briefly, this policy states that all course materials including recordings provided by the course instructor may not be copied, reproduced, distributed or re-posted. Doing so may be considered a breach of this policy and will be investigated and addressed as possible academic dishonesty, among other potential violations. Improper use of such materials may also constitute a violation of the University Code of Conduct and will be investigated as such.

Finally, changes to the parameters of the course may need to be made during the quarter. In the case of such events, students will be notified by the instructor through their official Drexel e-mail.


Tentative Course Schedule

Lecture | Subject | Notebook | Reading
1 (9/23, Monday) | Introduction | Motivation, InitialSetup | Hastie Chapter 1; Ivezic 1.1, 1.3, 1.4, 1.5.1
2 (9/25, Wednesday or 09/27, Friday???) | Introduction | HistogramExample | Ivezic 4.8
3 (9/30, Monday) | Basic Stats and Bayes Rule | BasicStats | Ivezic 1.2, 3.0, 3.1, 3.2
4 (10/2, Wednesday) | More Basic Stats | BasicStats2 | Ivezic 3.3, 3.4, 3.5
5 (10/7, Monday) | Classical and Bayesian Inference | Inference | Ivezic 4.0, 4.1, 4.2, 4.3, 4.5, 5.0, 5.1, 5.2
6 (10/9, Wednesday or 10/11, Friday???) | Bayesian Inference and MCMC | Inference2 | Ivezic 4.6, 5.3, 5.4, 5.6, 5.8
X (10/14, Monday) | University Holiday (but there will be online assignments for Wed)
7 (10/16, Wednesday) | Intro to Scikit-Learn (N.B. Pre-lecture assignments) | Scikit-Learn-Intro | Ivezic 1.5, 1.6.1, 1.6.2, 1.7
8 (10/21, Monday) | Finding Structure in Data | DensityEstimation | Ivezic 6.0, 6.1, 6.2, 2.2, 2.4, 2.5
9 (10/23, Wednesday or 10/25 Friday???) | Clustering | DensityEstimation2andClustering | Ivezic 4.4, 6.3, 6.4, 6.6
10 (10/28, Monday) | Dimensional Reduction, PCA, NMF, and ICA | DimensionReduction | Ivezic 7.0, 7.1, 7.3, 7.4
11 (10/30, Wednesday) | Nonlinear Dimensional Reduction, Manifold Learning, and tSNE | NonlinearDimensionReduction | Ivezic 7.5 and tSNE references
12 (11/4, Monday) | Intro to Regression | Regression | Ivezic 8.0, 8.1, 8.2, 8.3, 8.5
13 (11/6, Wednesday) | Advanced Topics in Regression | Regression2 | Ivezic 8.6-8.11
14 (11/11, Monday) | Generative Classification | Classification | Ivezic 9.0-9.4, 9.8
15 (11/13, Wednesday or 11/15, Friday???) | Discriminative Classification | Classification2 | Ivezic 9.5-9.9
16 (11/18, Monday) | Advanced Topics in Classification | Classification3 | Ivezic
17 (11/20, Wednesday) | Neural Networks | NeuralNetworksIntegrated
18 (11/25, Monday) | Neural Networks | NeuralNetworksIntegrated2
X (11/27, Wednesday) | No Lecture
19 (12/2, Monday) | Time Series | TimeSeries | Ivezic 10.0-10.4
20 (12/4, Wednesday) | Time Series | TimeSeries2 | Ivezic 10.5

Last Modified: 13 September 2024