CS 4641 B: Machine Learning (Fall 2020)

Course Information

Instructor: Rodrigo Borela (rborelav@gatech.edu)
Head TA: Nimisha Roy
TA: Danrong Zhang
TA: Prithvi Alva Suresh
TA: Rafael Hanashiro
TA: Bo Zhao
TA: Xueyu Wang
TA: Vidisha Goyal
TA: Tanvi Bhagwat
TA: Gnanaguruparan Aishvaryaadevi
TA: Yuening Tang

Course Overview

This course introduces techniques in machine learning with an emphasis on algorithms and their applications to real-world data. We will investigate the following question: how can we computationally extract useful knowledge from data for decision making and task support? We will focus on machine learning methods, organized into three parts:

  1. Basic math for data science and machine learning

    • Linear algebra
    • Probability and statistics
    • Information theory
    • Optimization
  2. Unsupervised machine learning for data exploration

    • Clustering analysis
    • Dimensionality reduction
    • Kernel density estimation
  3. Supervised learning for predictive data analysis

    • Tree-based models
    • Support vector machines
    • Linear classification and regression
    • Neural networks

Prerequisites for this course include basic knowledge of probability, statistics, and linear algebra, as well as basic programming experience in Python.

In addition to the technical content, this class includes the following learning objectives:

  • Structuring a task into a machine learning workflow
  • Collaborating effectively on team projects in a remote environment
  • Conducting peer evaluation in a constructive format
  • Communicating technical content in a concise and effective manner

Schedule

Date Class Assignments Project Quizzes Readings
Aug 17, 2020 Course overview;
Piazza signup GT Honor Code
Aug 19, 2020 Project information;
Heilmeier catechism;
Visual Information Theory by Chris Olah;
GitHub Pages;
YAML Configuration;
NumPy Tutorial;
Matplotlib Tutorial;
Project Examples;
seaborn: statistical data visualization;
Overleaf for GT students;
Aug 21, 2020 Q0: Getting to know you
Aug 24, 2020 Foundations: Linear algebra;
Focus video: SVD
Notes: SVD
A1 out Correlation vs Covariance
Linear Algebra Review by Zico Kolter
Aug 26, 2020 Foundations: Probability and statistics;
Focus video: MLE;
Notes: MLE
Probability Theory Review by Andrew Moore;
The Differences Between Data, Information and Knowledge;
Aug 27, 2020 Project seminar 1
Aug 28, 2020 Q1: Linear algebra and Probability
Aug 31, 2020 Foundations: Information Theory;
Class Notes;
Focus video: Entropy
Notes: Entropy
The Differences Between Data, Information and Knowledge;
Sep 02, 2020 Foundations: Optimization;
Data analysis toolbox;
KKT for inequality constrained optimization;
Sep 03, 2020 Project seminar 2
Sep 04, 2020 Office hours, Sep 3rd Q2: Information theory and optimization
Sep 07, 2020 Labor day (No class)
Sep 08, 2020 Project team composition due
Sep 09, 2020 Clustering Analysis and K-Means;
K-Means Class Notes;
Focus video: K-Means;
Notes: K-Means;
A1 due | A2 out Curse of dimensionality (Euclidean space example);
Jupyter Notebook (K-means, GMM and DBSCAN);
Sep 10, 2020 Project seminar 3
Sep 11, 2020 Office hours, Sep 10th Q3: K-Means
Sep 14, 2020 Gaussian Mixture Model;
Class Notes;
GitHub Student Application;
Jupyter Notebook (K-means, GMM and DBSCAN);
Sep 16, 2020 Density-Based Clustering;
Class Notes;
Focus video: GMM;
Notes: GMM;
GitHub Student Application;
Jupyter Notebook (K-means and DBSCAN);
Understanding the concept of Hierarchical clustering Technique;
Sep 17, 2020 Project seminar 4
Sep 18, 2020 Q4: GMM
Sep 21, 2020 Hierarchical Clustering;
Evaluation of Clustering Algorithms;
Sep 23, 2020 Evaluation of Clustering Algorithms (cont.);
Focus video: Cluster eval - entropy measures;
Notes: Cluster eval - entropy measures;
Focus video: Cluster eval - internal measures;
Notes: Cluster eval - internal measures;
Sep 24, 2020 Project seminar 5
Sep 25, 2020 Q5: Cluster evaluation + density estimation
Sep 28, 2020 Density Estimation;
Focus video: Kernel density estimation;
Notes: Kernel density estimation;
Touch-point 1 deliverables due KDE interactive visualization;
KDE sampling;
KDE SKLearn and sampling;
Jupyter Notebook Kernel Density Example;
Sep 30, 2020 Touch-point 1: Project proposal
Oct 02, 2020 Project proposal due Q6: Density estimation
Oct 05, 2020 Dimensionality reduction;
A2 due (deadline extended to Oct 07) Image reconstruction using PCA;
Feature extraction using PCA;
PCA for images;
PCA as linear combination of features;
PCA and Linear Discriminant Analysis;
Simple Linear Regression in Matrix Format;
Adding Noise to Regression Predictors;
Oct 07, 2020 Linear Regression;
A2 due (originally Oct 5)| A3 out
Oct 09, 2020 Q7: PCA and Linear regression
Oct 12, 2020 Regularization and Linear Regression;
Oct 14, 2020 Naïve Bayes and Logistic Regression;
Focus video: Gaussian Naive Bayes;
Notes: Gaussian Naive Bayes;
Oct 16, 2020 Q8: Regularization and Naïve Bayes
Oct 19, 2020 Decision Tree;
Oct 21, 2020 Decision Tree;
Ensemble Learning and Random Forest;
Oct 23, 2020 Q9: Decision trees
Oct 26, 2020 Support Vector Machine;
A3 due | A4 out KKT and SVM;
Oct 28, 2020 Kernel Method \ SVM;
Oct 30, 2020 Touch-point 2 deliverables due Q10: SVM and kernel method
Nov 02, 2020 Touch-point 2: Unsupervised learning
Nov 04, 2020 No lecture (use the time to work on your project!) NN Playground;
The role of a hidden layer;
Back propagation numerical example;
Nov 06, 2020 Project midpoint report
Nov 09, 2020 Introduction to neural networks;
NN Playground;
The role of a hidden layer;
Back propagation numerical example;
More detailed introduction;
Nov 11, 2020 Neural Networks (Forward pass and Back propagation);
A4 due CNN Live Demo;
A guide to an efficient way to build CNN and optimize its hyper-parameters;
Back Propagation in CNN;
Transfer learning in CNN;
Project Scoring Guidance;
Nov 13, 2020 Q11: Neural networks
Nov 16, 2020 Convolutional neural networks;
Nov 18, 2020 Practical advice;
Nov 20, 2020 Q12: Practical advice
Nov 22, 2020 Touch-point 3 deliverables due (originally Nov 20)
Nov 23, 2020 Touch-point 3: Predictive model
Dec 7, 2020 Final project report + presentation due

Course policies

  • Attendance: Our class will be offered in a hybrid "touch point" mode. We will not meet in person for our regular lectures, which will be live-streamed and recorded. The recordings will be made available to all students after class time. On-campus attendance for the in-person touch points is not mandatory, and remote participation will be available. Attendance at the live-streamed lectures is mandatory and will count towards your participation grade.
  • Class deliverables: All class deliverables will be handled via Gradescope. The time span offered to complete the course projects and assignments is plentiful and deadlines will not be extended under any circumstances. To ensure the class is fair for all students, you will receive zero credit for work submitted after the deadline. Regrade requests should be submitted directly on Gradescope within one week of grade publication. Should you find yourself at an impasse with the TA responsible for your grading, feel free to contact the head TA or course instructor.
  • Exceptional circumstances: Any request for exceptions to these policies should be made in advance when at all possible. Requests should be due to incapacitating illness, personal emergencies, or similarly serious events. Your request should be accompanied by a supporting letter issued by the Dean of Students.
  • Communication: The most effective means of communication with the TAs and instructor in this course is via private notes on Piazza. While the instructor diligently checks the Piazza page for communication from students, please allow a 24-hour period for any special requests to be processed. Also note that messages submitted after 9:00pm EST or during the weekend will most likely receive an answer on the following business day.

Diversity and inclusion

Just as machine learning algorithms cannot accomplish complex tasks if trained on datasets of limited variability, our course cannot be successful without appreciating the diversity of our students. In this class we aim to create an environment where all voices are valued, respecting the diversity of gender, sexuality, age, socioeconomic status, ability, ethnicity, race, and culture. We always welcome suggestions that can help us achieve this goal. Additionally, if any of our scheduled class activities conflict with religious events, please inform the instruction team so that we can make appropriate arrangements for you.

Students with disabilities: your access to this course is extremely important to us. The institute has policies regarding disability accommodation, which are administered through the Office of Disability Services. Please request your accommodation letter as early in the semester as possible, so that we have adequate time to arrange your approved academic accommodations.

In-person meetings

In line with institute policies effective July 15th, 2020, face coverings will be required for all in-person meetings and every effort will be made to maintain the CDC recommended social distancing guidelines of six feet or more. In case of inclement weather, the student and instructor will meet in a mutually agreed-upon indoor location where all necessary precautions can be taken. You can find more information on Georgia Tech’s policies and the CDC guidelines here and here.

Office hours and questions

We are very happy to offer you one-on-one office hours starting in the second week of classes. Please follow the instructions on this Excel sheet to sign up for a ten-minute slot with one of the TAs. You just need to add your name, question of interest, and your BlueJeans meeting link. Please do not change the other parts of the Excel sheet. If you require more than ten minutes, please advise the TAs; they will return to your BlueJeans meeting once they have completed their appointments with other students. The TA meetings are designed to be one-on-one, so please do not join another student’s BlueJeans meeting. The sole exception to this policy is discussions about the project, which your fellow team members can also join. In addition to the one-on-one meetings, open office hours with the instructor will be held weekly, where you can ask general questions about the topics covered in class. In-person office hours are available by appointment only and will likely be held outdoors, in line with the aforementioned Georgia Tech and CDC guidelines for preventing the spread of the coronavirus.

Time Monday Tuesday Wednesday Thursday Friday
09:00am - 10:00am Tanvi Xueyu Yuening
11:00am - 12:00pm Prithvi Aishvaryaa
02:00pm - 03:00pm Rafael
03:00pm - 04:00pm Bo
03:30pm - 04:30pm Vidisha
05:00pm - 06:00pm Danrong Nimisha
07:00pm - 08:00pm Instructor

Grading

  • Assignments (50%)

  • Project (35%)

    • Proposal (10%)

    • Midterm report (10%)

    • Final report (15%)

  • Quizzes (10%)

  • Class participation (5%)

  • Bonus points (up to 7%)

    • About bonus points: Bonus points will only ever be counted in a way that benefits your final grade. More information on bonus points for assignments will be provided as the semester progresses. If it becomes necessary to curve grades, bonus points will be applied after curving, not before.
    • Piazza participation: Piazza provides statistics on how involved each student has been in Piazza activities such as viewing posts, answering questions, and asking questions. We use these statistics for a minor part of the class participation score, and we will also use them to award bonus points. Bonus points will go to students who answer other students' questions correctly. At the end of the semester, we will define minimum and maximum levels of involvement across all students, and based on those, some students will receive at most 1% bonus points.
    • Distribution: Bonus points will also be available for the project and assignments. You can achieve 1% bonus points on the project, and an additional 5% by correctly completing the bonus questions on the assignments.
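
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights and the 7% bonus cap come from the list above; the function name, the `curve` argument, and the choice to express component scores as fractions are assumptions made for illustration only, and the actual grade calculation used by the instruction team may differ.

```python
# Weights taken from the Grading list (percent of the final grade).
WEIGHTS = {"assignments": 50, "project": 35, "quizzes": 10, "participation": 5}
MAX_BONUS = 7  # "up to 7%": 1% Piazza + 1% project + 5% assignments


def final_grade(scores, bonus=0.0, curve=lambda grade: grade):
    """Combine component scores into a percentage.

    scores: component name -> fraction earned, between 0 and 1.
    bonus:  bonus percentage points earned (capped at MAX_BONUS).
    curve:  hypothetical curving function; identity (no curve) by default.
    """
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Per the policy above, bonus points are added after curving, not before.
    return curve(base) + min(bonus, MAX_BONUS)


example = {"assignments": 0.9, "project": 0.8, "quizzes": 1.0, "participation": 1.0}
print(round(final_grade(example, bonus=3), 2))
```

In this sketch a student with 90% on assignments, 80% on the project, and full quiz and participation credit earns 45 + 28 + 10 + 5 = 88 points before any curve, and the 3 bonus points are added on top of the curved result.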

Assignments

  • There will be four assignments. Each one is designed to improve and test your understanding of the materials. Assignments will have both programming and written analysis components.
  • You will need to submit all your assignments using Gradescope. Instructions on how to submit your code and written portions will follow with every assignment. Handwritten solutions WILL NOT BE ACCEPTED and you will not receive credit for a handwritten submission.
  • You are required to use Markdown, LaTeX, or word-processing software to generate your solutions to the written questions. Again, handwritten solutions WILL NOT BE ACCEPTED.
  • All assignments follow the “no-late” policy. Assignments received after the due date and time will receive zero credit.
  • All students are expected to follow the Georgia Tech Academic Honor Code.
  • You can easily export your Jupyter Notebook to a Python file and import that into your preferred Python IDE to debug your code for assignments.
  • You are NOT allowed to share any assignment code or answers with other students. Piazza is the best place to discuss assignments and course topics. Discussions are meant to improve understanding of the questions and should not directly answer them.
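
The notebook export mentioned above is normally done from Jupyter itself (the "Download as" menu, or the `jupyter nbconvert --to script` command). For those curious about what that export amounts to, here is a minimal standard-library sketch. It assumes the version 4 notebook JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`); the function names and file paths are hypothetical.

```python
import json


def notebook_code(nb):
    """Join the code cells of a parsed notebook (a dict) into one source string."""
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        src = cell.get("source", "")
        if isinstance(src, list):  # source may be stored as a list of lines
            src = "".join(src)
        chunks.append(src)
    return "\n\n".join(chunks) + "\n"


def export_notebook(ipynb_path, py_path):
    """Read a .ipynb file and write its code cells to a .py file."""
    with open(ipynb_path, encoding="utf-8") as f:
        nb = json.load(f)
    with open(py_path, "w", encoding="utf-8") as f:
        f.write(notebook_code(nb))
```

Note that IPython-only lines such as `%matplotlib inline` or `!pip install ...` are copied verbatim by this sketch and would need to be removed before the file runs as plain Python.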

Project

  • In order for you to obtain hands-on experience applying the topics covered in this course, you are expected to complete a term project utilizing real-world data. The project will encompass both unsupervised and supervised learning. In the first five weeks of the semester, to motivate and inspire you, we will have TA-led seminars in which two TAs will each present a project in which they applied machine learning and discuss strategies for working on a remote project. You are expected to watch at least three of the seminars and post questions/discussions on the corresponding Piazza thread. Your participation in the thread discussion will count towards your participation grade. The seminar schedule is as follows.
  • Date Speaker Title
    08/27 Rafael Hanashiro PUBG placement prediction (CS 7641 project)
    08/27 Vidisha Goyal DREAM6 – FlowCAP2 Molecular Classification of Acute Myeloid Leukemia Challenge
    09/03 Gnanaguruparan Aishvaryaa Prediction of Hard Drive Failures (CS 7641 project)
    09/03 Danrong Zhang Soil type prediction
    09/10 Bo Zhao Combining Randomization with Jacobian Regularization for Robust Learning
    09/10 Prithvi Alva Suresh Movie Revenue Prediction (CS 7641 Project)
    09/17 Tanvi Bhagwat Text Classifiers
    09/24 Xueyu Wang Patent Grant Time Analysis
  • Each project needs to be completed in a team of four people. You will form your team on your own; if you cannot find a team, we will randomly assign you to one. Team members need to clearly state their contributions in the project report. Once your teams have been formed and you have selected a topic, you will be assigned a mentor, who will provide you with general guidance on your project. It is important to note that your team will lead the project effort: obtaining the data, researching data-driven approaches to accomplish your project goal, and coordinating your own activities. The role of the mentor is solely to advise you, should you find yourself stuck and unable to make progress.
  • You will create a GitHub page for your project, which you will use to publish your main deliverables. There will be three main deliverables published to your GitHub: a proposal, a midterm checkpoint, and a final report. For the final report, you will also submit a seven-minute project presentation, where you go over your final outcome while scrolling through your GitHub page. Your GitHub page should have the following structure at the time of the proposal submission:
    • Summary figure: one infographic prepared by your team that summarizes your project goal;
    • Introduction/Background: discussing the problem you aim to address, your motivation and goal;
    • Methods: outlining the dataset you are planning on utilizing and the techniques you intend to apply;
    • Results: describing the results your team is trying to achieve;
    • Discussion: explaining what would be the best outcome, what it would mean, what is next, etc.;
    • References: list containing at least three references, preferably peer reviewed.
  • To help you conduct your project successfully, we will have three touchpoints during the semester. In these sessions, you will meet remotely with your mentor and other teams working on related projects to discuss your progress, debate different approaches, and learn from your peers. For each touchpoint, you will submit the following deliverables:
    • Touch-point 1 deliverables: (1) Single-slide presentation of your project proposal; (2) Three-minute pre-recorded presentation with your project proposal pitch.
    • Touch-point 2 deliverables: (1) Single-slide presentation outlining progress highlights and current challenges; (2) Three-minute pre-recorded presentation with your progress and current challenges.
    • Touch-point 3 deliverables: (1) Single-slide presentation outlining progress highlights and current challenges; (2) Three-minute pre-recorded presentation with your progress and current challenges.
    An equivalent in-person session will take place in parallel with the remote versions and will be moderated by the instructor. To attend the in-person session, you will sign up in advance in order to ensure social distancing and facilitate contact tracing. The touchpoints are scheduled a few days before the deadline for each deliverable so that your team has time to incorporate the feedback you receive during these sessions.
  • The world is currently facing a number of big challenges and machine learning can help! Students will get bonus points for working on topics that address such challenges, including but not restricted to: COVID-19 (anything ranging from healthcare, education, travel, economic impacts, the ways in which work and business have changed), social inequality, spread of misinformation, and climate crisis.
  • Refer to Project hints for your project's template, instructions for creating a GitHub page, and some general hints to improve the accuracy of your predictive model.
  • Google Colaboratory allows free access to run your Jupyter Notebook. I strongly suggest you use it for your project, especially if your team is going to employ deep learning.

Quizzes

  • There will be 13 quizzes throughout the semester.
  • Only your top 10 quiz scores will count. Each counted quiz is worth 1% of your final grade.
  • The topic of each quiz will coincide roughly with the content covered in class that week.
  • Each quiz lasts seven minutes and has five multiple-choice questions. Quizzes will be available from 6:00am EST on the day they are scheduled until 6:00am EST the following day.
  • Quizzes measure your understanding of the topics, so the questions will be mostly conceptual.
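
The "top 10 of 13" rule above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, scores are assumed to be fractions between 0 and 1, and a missed quiz is assumed to count as 0.

```python
def quiz_contribution(quiz_scores, counted=10, weight_each=1.0):
    """Return the percentage points the quizzes add to the final grade.

    quiz_scores: one fraction in [0, 1] per quiz (missed quizzes entered as 0).
    counted:     how many of the best scores are kept (10 per the policy above).
    weight_each: percentage points each counted quiz is worth (1%).
    """
    best = sorted(quiz_scores, reverse=True)[:counted]
    return weight_each * sum(best)


# Thirteen quizzes: three missed, ten with full marks.
scores = [1.0] * 10 + [0.0] * 3
print(quiz_contribution(scores))
```

Under these assumptions, up to three low or missed quizzes do not reduce the 10% quiz component.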

Resources

No textbook will be required for this course; however, you are strongly encouraged to complete the readings indicated for each class. You may also find the following books very helpful:

Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.

Dataset Ideas (may need API, or scraping) - Thanks to everyone who contributed with suggestions to these datasets