NOTE: Some links in this syllabus page may only be accessible to currently enrolled students.
This course is primarily for students in the Department of Statistics pursuing a graduate degree in data science, and some CS students pursuing a data science track. It teaches foundational programming concepts and tools that are important in big data management.
Participation in Research
During this course students will be invited to participate in a research study on cybersecurity behavior and receive extra credit in return for completing the study. Students who elect not to participate will have an alternative activity to complete for extra credit. Students may only complete one extra credit activity (either the research or the alternative).
Hi, my name is Dr. Michael Curry and I am the instructor for CS 512, Data Science Tools and Programming.
This course was originally developed by Justin Wolford. I am an instructor in Computer Science and my background in data science has been using operational data sources to design analytical datasets and generate custom reports primarily using tools like SQL, programming languages like Python and visualization methods like Power BI. I have also been the instructor of this course for multiple terms and I really enjoy getting to interact with the students who take this course and seeing how they apply the tools we cover in their own projects. The goal of this class is to teach graduate students pursuing a master's in data science and statistics about some of the programming tools that are helpful and even necessary to do modern statistical analysis on large datasets. So we will spend about half the course covering traditional tools such as Python and SQL, and then the second half of the course covering Google’s cloud suite tools for very large datasets that are just too big to work with on a local computer. We have developed a very solid set of introductory assignments that should give you a firm introduction on how to program with those tools. But mastery of those tools will really depend upon your programming interests combined with the amount of time you put into doing independent work with those tools.
I encourage you to participate a lot on the Ed discussion board to get help from me and from your classmates. When everyone gets stuck then the instructor can help the class get unstuck. But the value of the class comes in your ability to get help from other people learning the same thing at the same time and having an instructor to help guide that learning. Students who are less active on the Ed discussion board frequently learn less in this class, while students who participate do better in the class and also have a more positive class experience.
Well that’s enough about me and the class for now. I really look forward to getting to know each of you this term and reviewing your course projects which are always very fascinating.
There are several ways to communicate in an online class, and they all have different purposes.
Ed Discussion Board
We will use Ed Discussion Board as the primary way to ask questions of the instructor and teaching staff. This is where you should post well-prepared questions, both technical and theoretical, about the content and assignments. You should follow similar steps to Getting Help With Tools when asking questions here.
If you want to ask a question that requires posting a fair bit of your code, you must create a private thread on Ed. Such threads are viewable by you, the Instructor and the GTA, but not by other students. Once I have reviewed the posting, if I feel that it does not reveal important code and that other students could benefit from the discussion I may ask that you mark the discussion public.
We will use Microsoft Teams primarily to hold office hours. Using Teams you can communicate via chat, audio, and video, and also share your screen so we can see what you are seeing. Please join the Teams channel.
If you need to communicate about anything personal please send email. All emails to the instructor or the GTA should have CS512 in the subject line. This makes sure it goes to the right email box. If an email does not have this tag at the beginning, it may get overlooked. You must use your OSU email instead of Canvas email, as we do not check Canvas email as frequently.
You should email the instructor about general class concerns, extension requests or about issues you are unable to resolve with the GTA. In general, questions about the assignment or help with troubleshooting should go to Ed Discussion where they can help other students. If you email the instructor such questions, they might just ask you to post them to Ed Discussion.
The TA may be responsible for grading and also hold office hours. You can reach out to them directly when you have a question. If, after emailing the TA, your issue hasn't been resolved, then you should email the instructor and forward your exchange with the TA.
The Instructor and the TA will hold regular office hours for general troubleshooting and questions on Teams. This is often a good place to get help working through error messages. The times for the office hours are listed on the Course Homepage.
Expected Response Time
These times only apply to the weekday. The instructor is unlikely to be available over the weekend.
As an example if you email on Friday at 4pm, you might not get a response till Tuesday at 4pm. This is an upper bound. Responses will usually come more quickly but sometimes things happen and it takes a while to get back to you.
Grades usually come back within a week but some of these assignments can be complex so you might see a couple of assignments which take longer. If you have a question about your grade please email your instructor or the TA who graded your assignment.
If you experience any errors or problems while in your online course, contact 24-7 Canvas Support through the Help link within Canvas. If you experience computer difficulties, need help downloading a browser or plug-in, or need assistance logging into a course, contact the IS Service Desk for assistance. You can call (541) 737-8787 or visit the IS Service Desk online.
Accommodations for students with disabilities are determined and approved by Disability Access Services (DAS). If you, as a student, believe you are eligible for accommodations but have not obtained approval please contact DAS immediately at 541-737-4098 or at http://ds.oregonstate.edu. DAS notifies students and faculty members of approved academic accommodations and coordinates implementation of those accommodations. While not required, students and faculty members are encouraged to discuss details of the implementation of individual accommodations.
You can download the full PDF version of this syllabus.
Course Name: Data Science Tools and Programming
Course Number: CS 512
Credits: 4 Credits
Instructor name: Michael Curry
Instructor email: email@example.com
Teaching Assistant: Jing Wang firstname.lastname@example.org
Lab Assistant: Lavin, Sonja email@example.com
Accessing and distributing data in the cloud; relational and non-relational databases; map reduction; cloud data processing; load balancing; types of data-stores used in the cloud. This course is primarily for Department of Statistics graduate students working toward MS and PhD degrees in Statistics and in some cases CS students pursuing a data science track.
Please post all course-related questions in the Ed Discussion Forum so that the whole class may benefit from our conversation. Please contact me privately for matters of a personal nature. I will attempt to reply to course-related questions within 48 business hours. I will strive to return your assignments and grades for course activities to you within one week of the due date.
On average this course combines approximately 150 hours of instruction, online activities and assignments for 4 credits. The majority of the time will be spent completing the activities and assignments. Assignments can be challenging, so start early, and if you have questions, ask them on the Ed Discussion board. Waiting until the last minute is not an effective strategy.
Measurable Student Learning Outcomes
- Use appropriate tools to access and store distributed data
- Design programs to analyze distributed data
- Evaluate various storage and access options for large data sets
- Evaluate existing programs to find and remove bottlenecks in data processing
- Understand differences between relational and non-relational databases
- Use tools to clean and transform datasets for use in database systems
- Create visualizations of large data sets
There is no course text, and instead you will be assigned to read documentation online. It's very important to read this material which has instructions on how to use the programming tools covered. Some of the material is very well documented such as the Conda documentation which is well established. However other toolsets are newer and less complete like the Google Cloud toolset and may have changed since the videos were recorded or the assignment was created. Consequently, students are expected to supplement these resources by doing their own investigation (e.g. searching online forums, posting to Ed Discussion and consulting the documentation).
Note: Each student will need a Google Cloud account, and as a class we have credits that are usually sufficient to cover students' computing costs for the term. However, if you use all your credits you will have to purchase additional ones.
Evaluation of Student Performance
We provide grading rubrics for assignments. These rubrics are a guide to help students understand what is expected, and you are encouraged to review them carefully before submitting an assignment. Rubrics also make grading efficient and fair. However, the rubrics are subject to interpretation, and students may not always agree with how the rubrics are applied. Grading deductions help students learn where they can improve. They are not punitive, so arguing over how a rubric was applied may be counterproductive and, if handled unprofessionally, can violate the syllabus policy on civility. See the Harassment of the Instructional Staff policy for details.
Here is an approximate breakdown of how I expect to assess your work.
- Homework 70%
- Final Project 15%
- Midterm Project 15%
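As a rough illustration of how these weights combine, the sketch below computes a weighted course percentage. The weights come from the breakdown above; the function name and the example scores are hypothetical, and the actual grade calculation in Canvas may differ since the breakdown is approximate.

```python
# Weights taken from the approximate breakdown above.
WEIGHTS = {"homework": 0.70, "final_project": 0.15, "midterm_project": 0.15}


def course_grade(scores):
    """Return the weighted course percentage from per-category percentages (0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up example scores, for illustration only.
example = {"homework": 90.0, "final_project": 80.0, "midterm_project": 100.0}
print(round(course_grade(example), 2))
```

Because homework carries 70% of the weight, a consistent homework record moves the final percentage far more than either project does.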
Each module will generally be available at 12 am on Wednesday one and a half weeks in advance.
- Overview and Tools
- SQL and Relational Databases
- Data Formats and Data Wrangling
- Start Mid Term Project
- Complete Mid Term Project
- Reflections on Weeks 1 to 5 Learning
- Introduction to Big Data
- Non-Relational Databases and Spark
- Spark Wrap Up
- Start Final Project
- Odds and Ends at the End
- Reflections on Weeks 6 to 10 Learning
- Complete Final Project
There will generally be an assignment every week. They are graded on correctness only. A program that fails to run will typically get no credit, even if all you had to do was add a : to make it run. If you want feedback on specific choices you made in your program you can post the code and the specific questions to the discussion board 72 hours after the assignment is due and the instructor and TAs can provide additional feedback there that you and other classmates can benefit from.
Unless otherwise specified, all code files should be submitted as .py files. Include the .py files you wrote for a project even if we will not be running them, so we have a record of your program if we need to come back and look at it later. Files should not be zipped because that makes them hard to view in the Canvas grading tool. If we want a zipped file set we will ask for it specifically.
This class will have a final project. This will require you to do calculations, visualization and analysis of a very large data set, ideally in the millions of records. More details will be released on this later in the course.
This course was developed for data science students who have completed CS 511 or have equivalent experience in Python programming. You will be expected to use many third-party programs and will need to be comfortable doing basic tasks using a command line. The emphasis of this course is on the data science tools and on programming. There should be many opportunities to use more advanced statistical methods, but they will not be covered in this course.
Difference with On Campus Classes
In an on campus class about 1⁄2 to 1⁄3 of the class would be the instructor lecturing at you. The rest of the time would be used to do in class examples and work as small groups. This is where you would test your understanding of the topics and ask your group mates for help. It is also where you would help your fellow students. If you were totally stuck the instructor would help.
In principle online learning should be about the same. But instead of a classroom you have a message board. You should be active on the message board to help and get help from your classmates. When everyone gets stuck the instructor can help get the class unstuck. But the value of the class comes in your ability to get help from other people learning the same thing at the same time and having an instructor to help guide that learning. If you are not active on the discussion board you will get much less out of the class.
This course will primarily be about using Google's suite of cloud and big data tools. These tools are generally based on Python 2.7; however, other tools will likely leverage Python 3.6. This isn’t a fun place to be, but we can deal with it.
Ideally, if you can get Anaconda installed and running with two different environments, one running Python 2.7 and the other running Python 3.6 you will be ahead of the curve.
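Once you have two environments, it is easy to lose track of which interpreter is active. The following is a minimal sketch, not a required course tool, of a script that reports the running interpreter; the function name `interpreter_label` is hypothetical, and the script is written to run unchanged under both Python 2.7 and Python 3.

```python
import sys


def interpreter_label(version_info=sys.version_info):
    """Return 'major.minor' for the given version tuple (default: this interpreter)."""
    return "{}.{}".format(version_info[0], version_info[1])


if __name__ == "__main__":
    # Single-argument print calls behave the same on Python 2.7 and 3.x.
    print("Active Python: " + interpreter_label())
    print("Interpreter path: " + sys.executable)
```

Running it inside each environment should show a different version and a different interpreter path, which is a quick way to confirm the environments are really separate.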
You may also find it useful to install the Google Cloud SDK but you should be aware this can be a little challenging.
Real World Stuff
The Google set of tools are professional tools used in industry. This is awesome because it means that employers will be excited you know how to use it. It also means that Google requires cash money to use it. As part of the class you will get $50 in Google cloud credit. This should be plenty to get through the course. But you need to be careful. If you leave a service running, get something stuck in an infinite loop or accidentally request to use a cluster of 1000 servers instead of 10 you might quickly burn through that budget.
The instructor will do what they can to help in that sort of situation, but if you do something wrong and it eats up all your credit you may need to pay for additional computation time if Google is unable to provide additional credit.
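To make the budget risk concrete, here is a back-of-the-envelope sketch of how long the $50 credit would last at different cluster sizes. The hourly rate used below is a hypothetical placeholder, not a real Google Cloud price, and the function name is made up for illustration; check current pricing before relying on any estimate.

```python
CREDIT_DOLLARS = 50.0  # course credit stated in the syllabus


def hours_until_credit_exhausted(servers, dollars_per_server_hour, credit=CREDIT_DOLLARS):
    """Return how many hours `servers` machines can run before `credit` is spent."""
    if servers <= 0 or dollars_per_server_hour <= 0:
        raise ValueError("servers and hourly rate must be positive")
    return credit / (servers * dollars_per_server_hour)


# HYPOTHETICAL rate for illustration only; real prices vary by machine type and region.
ASSUMED_RATE = 0.10

for n in (10, 1000):
    hours = hours_until_credit_exhausted(n, ASSUMED_RATE)
    print(f"{n} servers: about {hours:.1f} hours of credit")
```

Whatever the real rate is, the relationship is linear: asking for 100 times as many servers spends the credit 100 times faster.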
Getting Help With Tools
Speaking of challenges, when you run into trouble, please seek help. With software configuration issues like installing Anaconda or the Google SDK, doing the wrong thing to solve a problem can make it harder to fix.
Try to follow the online documentation. If it does not work, don’t improvise unless you are confident in your skills. It is better to come to the message board and ask for help. When you do please post:
- The steps you have taken
- The last thing you did that seemed to work correctly
- The full error message, which might be very long. If it is more than a page, there are sites like Pastebin that are designed for posting and sharing code snippets and errors.
- Where else you have looked for solutions so I don’t point you to something you already looked at.
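If the error comes from your own Python program, one way to capture the full error message for a post is the standard library's `traceback` module. This is only a sketch: `run_and_capture` and `failing_step` are hypothetical names, and `failing_step` stands in for whatever step breaks in your program.

```python
import traceback


def run_and_capture(func, *args, **kwargs):
    """Call func; return (result, None) on success or (None, traceback text) on failure."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        return None, traceback.format_exc()


def failing_step():
    # Hypothetical stand-in for the step that breaks in your own program.
    return {}["missing_key"]


result, error_text = run_and_capture(failing_step)
if error_text is not None:
    print(error_text)  # paste this whole text into your post, not just the last line
```

The complete traceback shows which line failed and how the program got there, which is usually what a helper needs first.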
I encourage you to participate a lot on the discussion boards to get help from me and from your classmates. When everyone gets stuck then the instructor can help the class get unstuck. But the value of the class comes in your ability to get help from other people learning the same thing at the same time and having an instructor to help guide that learning. If you are not active on the discussion board you will get much less out of the class. Students who do participate tend to do better in the class and have a more positive experience than those who don’t.
Late Work Policy
Assignments are generally due on Monday and Wednesday. Each assignment will remain open until three working days after the due date, and submissions made after the due date will be marked late. So if an assignment is due on Wednesday, you can still submit it until end of day on the following Monday, and it will be marked late.
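The three-working-day window can be checked with a short calculation. The sketch below counts only Monday through Friday and ignores holidays, which is an assumption on my part; the function name and the example date are hypothetical, and the dates shown in Canvas are authoritative.

```python
from datetime import date, timedelta


def add_working_days(start, working_days):
    """Return the date `working_days` weekdays (Mon-Fri) after `start`."""
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


due = date(2024, 1, 17)  # an arbitrary Wednesday chosen for illustration
closes = add_working_days(due, 3)
print(due.strftime("%A"), "->", closes.strftime("%A %Y-%m-%d"))
```

For a Wednesday due date the three working days are Thursday, Friday and Monday, which matches the example in the policy.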
If you need an extension for any reason, the request must be submitted at least 48 hours in advance. The maximum extension is typically 72 hours, and no documentation is required. Late assignments are deducted 10% except in extenuating circumstances. If you need an emergency extension, it must be requested within 48 hours after the assignment was due and must be accompanied by documentation supporting the request (e.g., doctor's note, visit summary from the ER, police report, etc.). As long as there is supporting documentation, these are generally granted for up to a 72-hour extension. Longer extensions are handled on a case-by-case basis.
The only exception to this is the final project which cannot be turned in after the assignment closing date. If for any reason you believe that you may have difficulty turning in work on time, please contact your instructor as soon in advance as possible.
Sometimes you need to make changes to a program or an assignment to get it to run correctly. If you are stuck, please add a comment saying you had difficulty and that you plan to submit a revision. You will have 7 days from when an assignment is graded to resubmit the assignment for a maximum grade of 70%. These can only be minor changes to get a program to run. You can’t write a program from scratch and still get 70% credit. The only exception to this is the final project which cannot be turned in after the assignment closing date.
Statement Regarding Religious Accommodation
Oregon State University is required to provide reasonable accommodations for employees' and students' sincerely held religious beliefs. It is incumbent on the student making the request to make the faculty member aware of the request as soon as possible prior to the need for the accommodation. See the Religious Accommodation Process for Students.
Guidelines for a Productive and Effective Online Classroom
(Adapted from Dr. Susan Shaw, Oregon State University)
Students are expected to conduct themselves in the course (e.g., on discussion boards, email) in compliance with the university’s regulations regarding civility. Civility is an essential ingredient for academic discourse. All communications for this course should be conducted constructively, civilly, and respectfully. Differences in beliefs, opinions, and approaches are to be expected. In all you say and do for this course, be professional. Please bring any communications you believe to be in violation of this class policy to the attention of your instructor.
Active interaction with peers and your instructor is essential to success in this online course, paying particular attention to the following:
- Unless indicated otherwise, please complete the readings and view other instructional materials for each week before participating in the discussion board.
- Read your posts carefully before submitting them.
- Be respectful of others and their opinions, valuing diversity in backgrounds, abilities, and experiences.
- Challenging the ideas held by others is an integral aspect of critical thinking and the academic process. Please word your responses carefully, and recognize that others are expected to challenge your ideas. A positive atmosphere of healthy debate is encouraged.
Establishing a Positive Community
It is important you feel safe and welcome in this course. If somebody is making discriminatory comments against you, sexually harassing you, or excluding you in other ways, contact the instructor, your academic advisor, and/or report what happened at https://studentlife.oregonstate.edu/studentconduct/reporting so we can connect you with resources.
Expectations for Student Conduct
Student conduct is governed by the university’s policies, as explained in the Student Conduct Code. Students are expected to conduct themselves in the course (e.g., on discussion boards, email postings) in compliance with the university's regulations regarding civility.
Integrity is a character-driven commitment to honesty, doing what is right, and guiding others to do what is right. Oregon State University Ecampus students and faculty have a responsibility to act with integrity in all of our educational work, and that integrity enables this community of learners to interact in the spirit of trust, honesty, and fairness across the globe.
Academic misconduct, or violations of academic integrity, can fall into seven broad areas, including but not limited to: cheating; plagiarism; falsification; assisting; tampering; multiple submissions of work; and unauthorized recording and use.
It is important that you understand what student actions are defined as academic misconduct at Oregon State University. The OSU Libraries offer a tutorial on academic misconduct, and you can also refer to the OSU Student Code of Conduct and the Office of Student Conduct and Community Standard’s website for more information. More importantly, if you are unsure if something will violate our academic integrity policy, ask your professors, GTAs, academic advisors, or academic integrity officers.
Your citation of others' work (including code provided in class instructional material) should include:
- URL of your source
- Date you retrieved your source
- Title of the program or application you are using
- Type (e.g. source code, application, full program, etc.)
- Author name(s) if available
- Code version if available
Here are two methods for citing code from a snippet used from http://www.oregonstate.edu/mysource retrieved on December 2, 2022:
- You can cite your code in your written report or your README file. Here is an example on how to cite code in a written document:
Example: Safonte, D (December 2022) Citing source code (Version 1.0) [Source code] http://www.oregonstate.edu/mysource
- The second method is to cite your source right within your code using comments.
Example: a good citation can be observed on lines 1-5 below:
1 # Citation for the following function:
2 # Date: 12/02/2022
3 # Copied from /OR/ Adapted from /OR/ Based on
4 # (Explain degree of originality)
5 # Source URL: http://www.oregonstate.edu/mysource
6
Keeping with the idea of needing to communicate with your classmates, you will need to share code with each other to communicate ideas and to explain problems. That said, you should only share what is needed. This is good practice in general. Having to look through an entire file to find a single function is not helpful when you could have just posted the functions.
You will also get great help from your classmates. If you use a code fix or suggestion from them you need to do two things.
- You need to cite who it came from, that means you need to, in your code, add a comment saying that the code was not your original code and say what student or 3rd party provided it.
- You need to document what the code is doing. You should write roughly a sentence per line of code explaining what the snippet is doing. So if you were to use, say, an entire 100-line file provided by a student, you would need to write something around a 2,000-word essay describing what the code is doing.
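As one possible illustration of these two requirements, the sketch below shows a short helper with an attribution comment and roughly one explanatory comment per line. The attribution is hypothetical and only shows the form; in your own work, name the actual student or third party who provided the code.

```python
def dedupe_preserving_order(items):
    # Citation: approach suggested by a classmate on Ed Discussion.
    # (HYPOTHETICAL attribution for illustration; name the real source and date.)
    # Adapted from their suggestion; this is not my original code.
    seen = set()  # tracks every value already emitted so membership checks are fast
    unique = []  # collects the first occurrence of each value in input order
    for item in items:  # walk the input once, left to right
        if item not in seen:  # only act the first time a value appears
            seen.add(item)  # remember the value so later repeats are skipped
            unique.append(item)  # keep the value in its original position
    return unique  # hand back the de-duplicated list


print(dedupe_preserving_order([3, 1, 3, 2, 1]))
```

Writing the explanations yourself is the point of the exercise: it shows that you understand the borrowed code and are not just pasting it.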
When giving help, don’t give more help than is needed. Try to keep answers deep but narrow in focus. Address the problem the student is asking about and address it well, but keep from giving away too much about topics they might not have run into yet.
Your instructor may ask you to submit one or more of your writings to Turnitin, a plagiarism prevention service. Your assignment content will be checked for potential plagiarism against Internet sources, academic journal articles, and the papers of other OSU students, for common or borrowed content. Turnitin generates a report that highlights any potentially unoriginal text in your paper. The report may be submitted directly to your instructor or your instructor may elect to have you submit initial drafts through Turnitin, and you will receive the report allowing you the opportunity to make adjustments and ensure that all source material has been properly cited. Papers you submit through Turnitin for this or any class will be added to the OSU Turnitin database and may be checked against other OSU paper submissions. You will retain all rights to your written work. For further information, visit Academic Integrity for Students: Turnitin – What is it?
Harassment of the Instructional Staff
Statement Regarding Students with Disabilities
Accommodations for students with disabilities are determined and approved by Disability Access Services (DAS). If you, as a student, believe you are eligible for accommodations but have not obtained approval, please contact DAS immediately at 541-737-4098 or at the DAS website. DAS notifies students and faculty members of approved academic accommodations and coordinates implementation of those accommodations. While not required, students and faculty members are encouraged to discuss details of the implementation of individual accommodations.
Accessibility of Course Materials
All materials used in this course are accessible. If you require accommodations please contact Disability Access Services (DAS).
Additionally, Canvas, the learning management system through which this course is offered, provides a vendor statement certifying how the platform is accessible to students with disabilities.
Course Policy on the use of AI tools (e.g. ChatGPT or similar software).
You are expected to create your own work in this class. When you submit assignments (including assignments, projects, quizzes, or discussions), you are asserting that you have generated and written the text unless you indicate otherwise by the use of quotation marks and proper attribution for the source. You should know that AI tools are based on large collections of text that may have numerous errors, be highly biased and generate answers that are verifiably wrong. Submitting content obtained from other sources (e.g. ChatGPT, Grammarly, Chegg, and even Google’s autocomplete) as your own (e.g. without clear citations) is cheating and a violation of the Student Conduct Code. You may use simple editing tools to help with syntax, spelling and grammar, but you may not use AI tools to draft your work, even if you edit, revise, or paraphrase it. Do not underestimate our ability to detect the use of AI in your submissions! The penalty for violating the policy is severe. Your work may be flagged as potentially AI generated and reported per the syllabus policy on Academic Integrity for any of the following:
- Evidence of a different writing style from previous work submitted in the course.
- The topics discussed in the submission are not topics covered in class or part of the course.
- The answer for the submission does not directly address the prompt.
- The answer for the submission is vastly outside of the guidelines of the prompt (e.g., word count too high, writing style inconsistent with expectations, etc.).
- The submission is highly similar to a response generated by AI when the assignment prompt is entered in AI.
- Citations provided are fictitious.
On the other hand, AI tools are a valuable resource to help troubleshoot code and can also be used responsibly to supplement and enhance your learning and understanding of course materials. For example, you may ask ChatGPT questions that you would otherwise ask the instructor after completing a lecture, e.g. “how is relational algebra used in modern database systems?” or "Why does the following code generate the error 'Cannot add or update a child row: a foreign key constraint fails'?" As with other sources (e.g. the course text, Stack Overflow, etc.), your use of these is guided by the Academic Integrity Policy and anything you submit that is not your own must be clearly cited.
Ecampus Reach Out for Success
University students encounter setbacks from time to time. If you encounter difficulties and need assistance, it’s important to reach out. Consider discussing the situation with an instructor or academic advisor. Learn about resources that assist with wellness and academic success.
Ecampus students are always encouraged to discuss issues that impact your academic success with the Ecampus Success Team. Email firstname.lastname@example.org to identify strategies and resources that can support you in your educational goals.
For mental health
Learn about counseling and psychological resources for Ecampus students. If you are in immediate crisis, please contact the Crisis Text Line by texting OREGON to 741-741 or call the National Suicide Prevention Lifeline at 1-800-273-TALK (8255).
For financial hardship
Any student whose academic performance is impacted due to financial stress or the inability to afford groceries, housing, and other necessities for any reason is urged to contact the Director of Care for support (541-737-8748).
Student Evaluation of Courses
During Fall, Winter, and Spring term the online Student Evaluation of Teaching system opens to students the Wednesday of week 8 and closes the Sunday before Finals Week. Students receive notification, instructions and the link through their ONID. They may also log into the system via Online Services. Course evaluation results are extremely important and used to help improve courses and the hybrid learning experience for future students. Responses are anonymous (unless a student chooses to “sign” their comments, agreeing to relinquish anonymity) and unavailable to instructors until after grades have been posted. The results of scaled questions and signed comments go to both the instructor and their unit head/supervisor. Anonymous (unsigned) comments go to the instructor only.