CS 4641 B: Machine Learning (Fall 2020)
Course Information
- Lecture time: Mondays and Wednesdays, 3:30pm-4:45pm
- Location: BlueJeans
- Piazza: https://piazza.com/gatech/fall2020/cs4641b/home
Course Overview
This course introduces techniques in machine learning with an emphasis on algorithms and their applications to real-world data. We will investigate the following question: how can we computationally extract useful knowledge from data for decision making and task support? The machine learning methods we cover are organized into three parts:
Basic math for data science and machine learning
- Linear algebra
- Probability and statistics
- Information theory
- Optimization
Unsupervised machine learning for data exploration
- Clustering analysis
- Dimensionality reduction
- Kernel density estimation
Supervised learning for predictive data analysis
- Tree-based models
- Support vector machines
- Linear classification and regression
- Neural networks
Prerequisites for this course include basic knowledge of probability, statistics, linear algebra, and basic programming experience in Python.
In addition to the technical content, this class includes the following learning objectives:
- Structuring a task into a machine learning workflow
- Collaborating effectively on team projects in a remote environment
- Conducting peer evaluation in a constructive format
- Communicating technical content in a concise and effective manner
Schedule
Course policies
- Attendance: Our class will be offered in a hybrid "touch point" mode. We will not meet in person for our regular lectures, which will be live-streamed and recorded. The recordings will be made available to all students after class time. On-campus attendance for the in-person touch points is not mandatory and remote participation will be available. Attendance at the live-streamed lectures is mandatory and will count towards your participation grade.
- Class deliverables: All class deliverables will be handled via Gradescope. The time span offered to complete the course projects and assignments is plentiful and deadlines will not be extended under any circumstances. To ensure the class is fair for all students, you will receive zero credit for work submitted after the deadline. Regrade requests should be submitted directly on Gradescope within one week of grade publication. Should you find yourself at an impasse with the TA responsible for your grading, feel free to contact the head TA or course instructor.
- Exceptional circumstances: Any request for exceptions to these policies should be made in advance when at all possible. Requests should be due to incapacitating illness, personal emergencies, or similarly serious events. Your request should be accompanied by a supporting letter issued by the Dean of Students.
- Communication: The most effective means of communication with the TAs and instructor in this course is via private notes on Piazza. While the instructor diligently checks the Piazza page for communication from students, please allow a 24-hour period for any special requests to be processed. Also note that messages submitted after 9:00pm EST or during the weekend will most likely receive an answer on the following business day.
Diversity and inclusion
Just as machine learning algorithms cannot accomplish complex tasks if trained on datasets of limited variability, our course cannot be successful without appreciating the diversity of our students. In this class we aim to create an environment where all voices are valued, respecting the diversity of gender, sexuality, age, socioeconomic status, ability, ethnicity, race, and culture. We always welcome suggestions that can help us achieve this goal. Additionally, if any of our scheduled class activities conflict with religious events, please inform the instruction team so that we can make appropriate arrangements for you.
Students with disabilities: your access to this course is extremely important to us. The institute has policies regarding disability accommodation, which are administered through the Office of Disability Services. Please request your accommodation letter as early in the semester as possible, so that we have adequate time to arrange your approved academic accommodations.
In-person meetings
In line with institute policies effective July 15th 2020, face coverings will be required for all in-person meetings and every effort will be made to maintain the CDC recommended social distancing guidelines of six feet or more. In case of inclement weather, the student and instructor will meet in a mutually agreed-upon indoor location where all necessary precautions can be taken. You can find more information on Georgia Tech’s policies and the CDC guidelines here and here.
Office hours and questions
We are very happy to offer you one-on-one office hours starting in the second week of classes. Please follow the instructions on this Excel sheet to sign up for a ten-minute slot with one of the TAs. You just need to add your name, question of interest, and your BlueJeans meeting link; please do not change any other part of the Excel sheet. If you require more than ten minutes, please advise the TAs, who will return to your BlueJeans meeting once they have completed their appointments with other students. The TA meetings are designed to be one-on-one, so please do not join another student’s BlueJeans meeting. The sole exception to this policy is discussions about the project, which your fellow team members can also join. In addition to the one-on-one meetings, open office hours with the instructor will be held weekly, where you can ask general questions about the topics covered in class. In-person office hours are available by appointment only and will likely be held outdoors, in line with the aforementioned Georgia Tech and CDC guidelines for preventing the spread of the coronavirus.
Time | Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|---|
09:00am - 10:00am | Tanvi | Xueyu | Yuening | ||
11:00am - 12:00pm | Prithvi | Aishvaryaa | |||
02:00pm - 03:00pm | Rafael | ||||
03:00pm - 04:00pm | Bo | 03:30pm - 04:30pm | Vidisha | ||
05:00pm - 06:00pm | Danrong | Nimisha | |||
07:00pm - 08:00pm | Instructor |
Grading
Assignments (50%)
Project (35%)
Proposal (10%)
Midterm report (10%)
Final report (15%)
Quizzes (10%)
Class participation (5%)
Bonus points (up to 7%)
- About bonus points: Bonus points are counted so that they can only help your final grade. More information on bonus points for assignments will be provided as the semester progresses. If it becomes necessary to curve grades, bonus points will be applied after curving, not before.
- Piazza participation: Piazza provides statistics on how involved each student has been in Piazza activities such as viewing posts, answering questions, and asking questions. We use these statistics for a minor part of the class participation score, and also to award bonus points to students who correctly answer other students' questions. At the end of the semester, we will set minimum and maximum involvement thresholds based on the activity of all students; based on those thresholds, some students will receive up to 1% in bonus points.
- Distribution: Bonus points will also be available for the project and assignments. You can earn 1% in bonus points on the project, and an additional 5% by correctly completing the bonus questions on the assignments.
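The grading arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights and bonus caps are copied from the breakdown above, but the function name `final_grade`, the dictionary keys, and the `curve` argument are hypothetical, and the course does not specify how a curve (if any) would be computed.

```python
# Weights copied from the grading breakdown above (percent of the final grade).
WEIGHTS = {
    "assignments": 50.0,
    "proposal": 10.0,
    "midterm_report": 10.0,
    "final_report": 15.0,
    "quizzes": 10.0,
    "participation": 5.0,
}

# Bonus caps from the bonus-point bullets (percentage points, 7 in total).
BONUS_CAPS = {"piazza": 1.0, "project": 1.0, "assignments": 5.0}


def final_grade(scores, bonus, curve=lambda grade: grade):
    """Return the final grade in percent.

    scores: component name -> fraction earned, between 0 and 1
    bonus:  bonus source -> percentage points earned
    curve:  hypothetical curving function; the identity by default
    """
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    curved = curve(base)
    # Bonus points are added after curving, each source clipped to its cap.
    extra = sum(
        min(max(bonus.get(source, 0.0), 0.0), cap)
        for source, cap in BONUS_CAPS.items()
    )
    return curved + extra
```

For instance, a student earning 90% on every component, 0.5 Piazza bonus points, and the full 5 assignment bonus points would end up at roughly 95.5% with no curve.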
Assignments
- There will be four assignments. Each one is designed to improve and test your understanding of the materials. Assignments will have both programming and written analysis components.
- You will need to submit all your assignments using Gradescope. Instructions on how to submit your code and written portions will follow with every assignment. Handwritten solutions WILL NOT BE ACCEPTED and you will not receive credit for a handwritten submission.
- You are required to use Markdown, LaTeX, or word-processing software to generate your solutions to the written questions. Again, handwritten solutions WILL NOT BE ACCEPTED.
- All assignments follow the “no-late” policy. Assignments received after the due date and time will receive zero credit.
- All students are expected to follow the Georgia Tech Academic Honor Code.
- You can easily export your Jupyter Notebook to a Python file and import that into your preferred Python IDE to debug your code for assignments.
- You are NOT allowed to share any assignment code or answers with other students. Piazza is the best place to discuss assignments and course topics. Discussions are meant to improve understanding of the questions and should not directly answer them.
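As a rough illustration of the notebook export mentioned above, the sketch below pulls the code cells out of a notebook using only the Python standard library. It assumes the notebook follows the nbformat 4 JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`); the function names and file paths are hypothetical. In practice, `jupyter nbconvert --to script` or the export menu in Jupyter does the same job.

```python
import json
from pathlib import Path


def notebook_code(notebook: dict) -> str:
    """Join the source of every code cell in a parsed .ipynb document."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


def export_notebook(ipynb_path: str, py_path: str) -> None:
    """Write the code cells of ipynb_path to py_path (placeholder paths)."""
    notebook = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    Path(py_path).write_text(notebook_code(notebook), encoding="utf-8")
```

Note that IPython magics (lines beginning with `%` or `!`) are copied verbatim and would need to be removed before the resulting file runs as plain Python.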
Project
- In order for you to obtain hands-on experience applying the topics covered in this course, you are expected to complete a term project utilizing real-world data. The project will encompass both unsupervised and supervised learning. In the first five weeks of the semester, to motivate and inspire you, we will have TA-led seminars in which two TAs will each present a project in which they applied machine learning and discuss strategies for working on a remote project. You are expected to watch at least three of the seminars and post questions/discussions on the corresponding Piazza thread. Your participation in the thread discussion will count towards your participation grade. The seminar schedule appears in the table at the end of this section.
- Each project needs to be completed in a team of four people. You will form your team on your own; if you cannot find a team, we will randomly assign you to one. Team members need to clearly state their contributions in the project report. Once your team has been formed and you have selected a topic, you will be assigned a mentor, who will provide you with general guidance on your project. It is important to note that your team will lead the project effort: obtaining the data, researching data-driven approaches to accomplish your project goal, and coordinating your own activities. The role of the mentor is solely to advise you should you find yourself stuck and unable to make progress.
- You will create a GitHub page for your project, which you will use to publish your main deliverables. There will be three main deliverables published to your GitHub: a proposal, a midterm checkpoint, and a final report. For the final report, you will also submit a seven-minute project presentation, where you go over your final outcome while scrolling through your GitHub page. Your GitHub page should have the following structure at the time of the proposal submission:
- Summary figure: one infographic prepared by your team that summarizes your project goal;
- Introduction/Background: discussing the problem you aim to address, your motivation and goal;
- Methods: outlining the dataset you are planning on utilizing and the techniques you intend to apply;
- Results: describing the results your team is trying to achieve;
- Discussion: explaining what would be the best outcome, what it would mean, what is next, etc.;
- References: list containing at least three references, preferably peer reviewed.
- To help you conduct your project successfully, we will have three touchpoints during the semester. In these sessions, you will meet remotely with your mentor and other teams working on related projects to discuss your progress, debate different approaches, and learn from your peers. For each touchpoint, you will submit the following deliverables:
- Touch-point 1 deliverables: (1) Single-slide presentation of your project proposal; (2) Three-minute pre-recorded presentation with your project proposal pitch.
- Touch-point 2 deliverables: (1) Single-slide presentation outlining progress highlights and current challenges; (2) Three-minute pre-recorded presentation with your progress and current challenges.
- Touch-point 3 deliverables: (1) Single-slide presentation outlining progress highlights and current challenges; (2) Three-minute pre-recorded presentation with your progress and current challenges.
- The world is currently facing a number of big challenges and machine learning can help! Students will get bonus points for working on topics that address such challenges, including but not restricted to: COVID-19 (anything ranging from healthcare, education, travel, and economic impacts to the ways in which work and business have changed), social inequality, the spread of misinformation, and the climate crisis.
- Refer to Project hints for your project's template, instructions for creating a GitHub page, and some general hints to improve the accuracy of your predictive model.
- Google Colaboratory lets you run your Jupyter Notebooks for free. I strongly suggest you use it for your project, especially if your team is going to employ deep learning.
Date | Speaker | Title |
---|---|---|
08/27 | Rafael Hanashiro | PUBG placement prediction (CS 7641 project) |
08/27 | Vidisha Goyal | DREAM6 – FlowCAP2 Molecular Classification of Acute Myeloid Leukemia Challenge |
09/03 | Gnanaguruparan Aishvaryaa | Prediction of Hard Drive Failures (CS 7641 project) |
09/03 | Danrong Zhang | Soil type prediction |
09/10 | Bo Zhao | Combining Randomization with Jacobian Regularization for Robust Learning |
09/10 | Prithvi Alva Suresh | Movie Revenue Prediction (CS 7641 Project) |
09/17 | Tanvi Bhagwat | Text Classifiers |
09/24 | Xueyu Wang | Patent Grant Time Analysis |
Quizzes
- There will be 13 quizzes throughout the semester.
- Only your top 10 quiz scores will count. Each counted quiz is worth 1% of your final grade.
- The topic of each quiz will coincide roughly with the content covered in class that week.
- Quizzes will have a duration of seven minutes with five multiple-choice questions. They will be available from 6:00am EST on the day they are scheduled until 6:00am EST the following day.
- Quizzes measure your understanding of the topics, so the questions will be mostly conceptual.
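The quiz rule above (13 quizzes, the best 10 counted, 1% each) amounts to a short calculation. The sketch below is illustrative only: the function name `quiz_component` and its parameters are hypothetical, and quiz scores are assumed to be expressed as fractions between 0 and 1.

```python
def quiz_component(quiz_scores, counted=10, points_each=1.0):
    """Percentage points earned from quizzes.

    quiz_scores: one fraction in [0, 1] per quiz; a missed quiz can be
                 entered as 0.0 or simply left out.
    counted:     how many of the best scores count (10 per the rule above)
    points_each: percentage points each counted quiz is worth
    """
    best = sorted(quiz_scores, reverse=True)[:counted]
    return points_each * sum(best)
```

Under these assumptions, a student with eight perfect quizzes, two half-credit quizzes, and three missed quizzes would earn 9 of the 10 available quiz points, since the three zeros are the scores that get dropped.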
Resources
No textbook will be required for this course; however, you are strongly encouraged to complete the readings indicated for each class. You may also find the following books very helpful:
- Learning from data, by Yaser S. Abu-Mostafa
- Pattern recognition and machine learning, by Christopher Bishop
- Machine learning, by Tom Mitchell
- Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.
Dataset ideas (may need an API or scraping). Thanks to everyone who contributed suggestions for these datasets.
- Google Dataset Search
- Google public datasets.
- Kaggle public datasets
- Awesome Public Datasets.
- NYC Taxi data for 2013: trip data (11.0GB), fare data (7.7GB), and a visualization of a day's trips.
- Large datasets publicly available.
- Georgia Tech's campus data (has APIs): bus info, directory, building, T-square, room reservation, building facilities usage (e.g., electricity, lights, A/C, etc.), Oscar/course info/registration, etc.
- Yahoo WebScope
- Data.gov: U.S. Government's open data
- IPEDS data: Postsecondary education data from the National Center for Education Statistics
- Bureau of Labor Statistics data
- Uber data: Anonymized data from over 2 billion trips
- Freebase
- Yelp
- Microsoft Academic Graph
- Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
- Zillow: real estate listing site
- Numerous graph datasets (large and small): SNAP, Konect
- Movies data: IMDB
- DREAM Challenges: Crowdsourcing challenges in biology and medicine
- ENCODE: Encyclopedia of DNA elements
- Human Cell Atlas
- List of lists of datasets for recommendations.
- FlowRepository: database of flow cytometry experiments
- Million song dataset by Echo Nest. It contains not only basic information about songs (artist, genre, year, length, etc.), but also some musical features (like tempo, pitch, key, brightness).
- Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
- The Free 'Big Data' Sources Everyone Should Know
- Quandl: a dataset search engine for time-series data.
- UCI also has a collection of links to various datasets sorted by task (classification, regression, etc.)
- Amazon AWS Public Data Sets
- KDD Cup: annual competition in data mining, like Kaggle
- Academic domain: Microsoft Academic Search, DBLP
- Retrosheet: MLB statistics (Game/Play logs)
- Classification datasets
- Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc.).
- Social trends
- Beer data (website offline; older version at web.archive.org)
- Academic torrents (terabytes)
- Article Search API from the New York Times (all the way back to 1851!)
- Civil Engineering Dataset
- Kayak: flight, hotel, car, etc.
- Data Science Initiative - Microsoft Research has various datasets and access to tools that can aid in data science research