Decode Data Science Speak With This Glossary for Higher Ed

By Mark Milliron and David Kil     May 5, 2016

Colleges and universities are awash in data and under increased pressure to quantify and improve student outcomes. As a result, higher-education practitioners and policymakers have welcomed the application of data science in higher education. Pernicious challenges, they hope, can be better understood by data scientists who can bring disparate data sources together to help tell better stories around student progression, challenges, and success. Products and services that claim to bring “predictive analytics” and algorithmic analyses to bear on institutional efficiency and student services abound. And therein lies the challenge.

The application of predictive analytics is relatively nascent in higher education. The lexicon is still evolving. Presidents and provosts, enthusiastic about the potential, battle initiative fatigue and remain wary of the next big thing. In such a dynamic marketplace, it can be hard for institutional researchers and higher-education leaders to see past hand-wavy generalizations and understand how data can demystify black-box problems or identify solutions that scale. How much data do we need to identify patterns over time? How do we know that predictive analytics is, in fact, predictive?

As this discipline evolves, institutions are finding that working with data for analytics offers powerful opportunities to learn together and collaborate across institutions to refine practice. But that first requires a common language. Last year, we developed a data science glossary with the goal of putting more substance behind the hype and demystifying the work happening across higher education. What follows is a modified version of the most popular, useful terms for higher education.

DESCRIPTIVE ANALYTICS: Examines historical data and identifies trends or patterns over time from known facts to inform future decisions.

Why it matters: Descriptive analytics allows institutions to understand trends, such as enrollment, retention and course selection, and to use quantitative data analysis to understand the underlying factors that influence those outcomes.
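
As a rough illustration of a descriptive question such as "how has retention moved from cohort to cohort?", the sketch below computes a retention rate per entering cohort. The record layout (a cohort year paired with a retained flag) and the function name are assumptions made for this example, not part of any particular product.

```python
from collections import defaultdict

def retention_by_cohort(records):
    """Return {cohort_year: share of students retained}.

    `records` is an iterable of (cohort_year, retained) pairs; this layout
    is a hypothetical simplification of a student information system export.
    """
    totals = defaultdict(int)
    kept = defaultdict(int)
    for cohort_year, retained in records:
        totals[cohort_year] += 1
        if retained:
            kept[cohort_year] += 1
    return {year: kept[year] / totals[year] for year in sorted(totals)}
```

Comparing the rates across years is the "trend over time" part; explaining why they moved is where the other techniques in this glossary come in.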

PREDICTIVE ANALYTICS: Encompasses multiple techniques for learning the relationships between historical events and what happened afterward, so that those relationships can be used to predict future outcomes based on current events.

Why it matters: Predictive analytics helps colleges and universities understand the unique challenges and opportunities for individual students, rather than just cohorts or trends, so that they can identify the right supports and influence each student’s trajectory.

PRESCRIPTIVE ANALYTICS: Examines the relationship between descriptive analytics and predictive analytics to determine the best way to achieve a desired outcome.

Why it matters: Prescriptive analytics informs the decision-making process, allowing institutions to weigh the impacts and effects of certain decisions that can lead to desired outcomes.

CANONICALIZATION: Sometimes called standardization. A process for translating raw data into a consistent and homogeneous representation for analysis.

Why it matters: This process creates a common language for data scientists to compare and evaluate seemingly unrelated data features across institutions. It can be used to help institutions evaluate students’ course-taking patterns to determine whether they are taking the right courses to graduate on time, or wandering around taking courses unrelated to their majors without regard to their degree requirements.
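
A minimal sketch of the idea, assuming course codes are the raw data being standardized: the same course may appear as "engl-101", "ENGL 101" or "Engl101" depending on the source system, and analysis is easier once all three map to one form. The target format (upper-case subject followed by the number) and the function name are illustrative choices, not a standard.

```python
import re

# Subject letters, optional separator, digits, optional one-letter suffix.
_COURSE_PATTERN = re.compile(r"\s*([A-Za-z]+)[\s\-_.]*(\d+)([A-Za-z]?)\s*")

def canonical_course_code(raw: str) -> str:
    """Map variants such as 'engl-101' or ' Engl 101 ' to 'ENGL101'."""
    match = _COURSE_PATTERN.fullmatch(raw)
    if match is None:
        # Refuse to guess: unknown formats should be reviewed by a person.
        raise ValueError(f"unrecognized course code: {raw!r}")
    subject, number, suffix = match.groups()
    return f"{subject.upper()}{number}{suffix.upper()}"
```

Raising an error on unrecognized input, rather than passing it through, keeps unexpected formats from silently creating a second "language" in the data.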

REGRESSION: Predicts the outcomes of a continuous variable, such as the time it takes to master a topic over the semester in competency-based learning, or salary achieved after graduation.

Why it matters: Regression allows institutions to understand in greater depth the success factors that have continuous, not discrete, values. On a related note, classification refers to predicting one of a discrete set of student success outcomes, such as persisting or not persisting.
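
The difference between the two shows up in what comes back: a number or a label. The sketch below fits a straight line by ordinary least squares, one of the simplest forms of regression, and contrasts it with a threshold rule that returns a label. The data are made up for illustration, and the threshold rule stands in for a trained classifier; it is not one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    """Regression output: a continuous value."""
    return intercept + slope * x

def classify_persistence(probability, threshold=0.5):
    """Classification output: one of a fixed set of labels."""
    return "persisting" if probability >= threshold else "not persisting"

# Invented numbers: GPA vs. starting salary in thousands of dollars.
gpas = [2.0, 2.5, 3.0, 3.5, 4.0]
salaries = [40, 45, 50, 55, 60]
intercept, slope = fit_line(gpas, salaries)
```

With these invented points the fitted line predicts a salary for any GPA, while the classifier can only ever answer with one of its two labels.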

LEARNING ALGORITHMS: Sophisticated mathematical equations that researchers use to predict future student success outcomes using historical data, and that continuously evolve according to accumulating data over time.

Why it matters: Just as any student changes and adapts over time based on the environment, learning algorithms also change and adapt to make predictions about the future. They process ongoing, real-time inputs to predict students’ future success accurately and promptly, to score the likely impact of multiple outreach choices so that the most effective one can be delivered, and to assess whether nudges were effective. Learning algorithms are retrained as new data becomes available so that advisors and faculty have the right talking points to drive student success and can personalize nudges to unique student needs and preferences.

FEATURES: Data elements that describe a student’s academic standing or behavior. Features can either be raw features: independent data elements like GPA, ACT or SAT score; or derived features: descriptive data points that illustrate where a student stands in relation to his or her peers, such as level of LMS usage or course-taking patterns.

Why it matters: Features allow for a more contextual approach to analysis, assessing not only post-mortem outcomes like GPA, but also more timely behavioral data to answer the question of who is at risk and why. Derived features allow researchers to spot trends in time to intervene. For example, a GPA might show a student remaining a B student from term to term. But with derived features we can see that the student had a 3.8 GPA two terms ago and is now trending downward at a 3.1 GPA.
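
The GPA example can be sketched as a derived feature computed from raw term GPAs. The function name and the middle-term value of 3.4 are assumptions added for illustration; only the 3.8 and 3.1 come from the example above.

```python
def gpa_change(term_gpas, lookback=2):
    """Derived feature: latest GPA minus the GPA `lookback` terms earlier.

    `term_gpas` is ordered oldest to newest. A negative result means the
    student is trending downward even if the letter grade has not changed.
    """
    if len(term_gpas) <= lookback:
        raise ValueError("not enough terms for the requested lookback")
    return round(term_gpas[-1] - term_gpas[-1 - lookback], 2)

# 3.8 two terms ago, 3.1 now; the 3.4 in between is a hypothetical value.
change = gpa_change([3.8, 3.4, 3.1])
```

Here `change` comes out negative, which is the early signal that the raw feature alone would hide.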

PROPENSITY SCORE MATCHING: In the absence of randomization, allows institutions to use existing learning data to create control groups that most closely match the pilot or test group being studied. The more closely the control group is matched to the test group, the more statistically accurate the analysis becomes.

Why it matters: Randomized control groups are the gold standard in statistics, but rarely exist in the real world, or in observational data. Propensity-score matching allows for the next best thing: the creation of a control group that shares characteristics of the test group, and which can be used to demonstrate the counterfactual (e.g., how individuals would fare in the absence of the pilot/intervention). This allows institutions to account for confounding factors like selection bias. For example, if more students excel in an 8 a.m. class, does that mean the 8 a.m. class is superior to the noon class, or simply that students who choose to enroll in an 8 a.m. class share other qualities that lead to better academic outcomes, and would be successful in any class?
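
The matching step itself can be sketched in a few lines. The example assumes each student already has a propensity score, that is, an estimated probability of ending up in the pilot group given background characteristics (often produced by a logistic regression, which is not shown). The greedy one-to-one matching and the `caliper` tolerance are common choices used here for illustration; real studies often use more careful variants.

```python
def match_controls(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity score.

    `treated` and `controls` map student id -> propensity score.
    Returns {treated_id: control_id}; each control is used at most once,
    and pairs whose scores differ by more than `caliper` are not matched.
    """
    available = dict(controls)
    matches = {}
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
            del available[c_id]
    return matches

# Hypothetical scores for three pilot students and four non-pilot students.
pilot = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
others = {"c1": 0.31, "c2": 0.60, "c3": 0.10, "c4": 0.33}
pairs = match_controls(pilot, others)
```

Student `t3` is left unmatched because no remaining control is close enough; dropping such cases is the price of comparing like with like.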

FEATURE RANKING AND OPTIMIZATION: Uses optimization algorithms to analyze features, and find the best combination of features to maximize prediction accuracy without over-fitting data.

Why it matters: Feature ranking and optimization enables data scientists to identify the most relevant and powerful features in predicting student success. Through feature curation, data scientists ensure that these top features are insightful and — where possible — actionable.
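
One very simple way to rank features, shown here only as an illustration, is to order them by the strength of their correlation with the outcome. This filter-style ranking is a stand-in for the optimization algorithms described above; it scores features one at a time and does nothing on its own to guard against over-fitting. The feature names and values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return cov / sqrt(sxx * syy)

def rank_features(features, outcome):
    """Feature names ordered from strongest to weakest |correlation|."""
    scores = {name: abs(pearson(vals, outcome)) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented data for four students; outcome 1 = persisted, 0 = did not.
features = {
    "lms_logins": [1, 2, 3, 4],
    "late_submissions": [4, 2, 3, 1],
    "id_last_digit": [1, 0, 1, 0],
}
persisted = [0, 0, 1, 1]
ranking = rank_features(features, persisted)
```

The arbitrary `id_last_digit` column lands at the bottom, which is the kind of sanity check a ranking should pass before anyone acts on it.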

COMPLEX EVENT PROCESSING: A process for tracking events, inferring patterns by linking them, and responding to them in an appropriate manner.

Why it matters: Complex event processing underpins the design of real-time engagement triggers for nudging. It draws on the many real-world events in students’ lives that significantly influence their success outcomes. For example, if a student exhibits withdrawal patterns after an adverse event, this could be an opportune moment for mindset coaching.
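
The withdrawal example can be written as a rule that links two kinds of events: an adverse event followed by silence. The sketch below flags students who received a low grade and have not logged in to the LMS since, once a chosen number of days has passed. The event names, the seven-day window and the batch style are all assumptions for illustration; production systems typically evaluate such rules on a live event stream.

```python
from datetime import date, timedelta

def withdrawal_triggers(events, as_of, quiet_days=7):
    """Return sorted ids of students with a low grade and no later login.

    `events` is an iterable of (day, student_id, kind) tuples, where kind
    is "low_grade" or "lms_login" (hypothetical event names).
    """
    last_adverse = {}
    last_login = {}
    for day, student, kind in events:
        if kind == "low_grade":
            last_adverse[student] = max(day, last_adverse.get(student, day))
        elif kind == "lms_login":
            last_login[student] = max(day, last_login.get(student, day))
    flagged = []
    for student, adverse_day in last_adverse.items():
        login_day = last_login.get(student)
        # A login on or before the adverse day does not count as re-engagement.
        silent = login_day is None or login_day <= adverse_day
        if silent and as_of - adverse_day >= timedelta(days=quiet_days):
            flagged.append(student)
    return sorted(flagged)
```

The output is a short list of students for whom a timely nudge may be worth considering, rather than a verdict about any of them.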

TIME-SERIES FEATURE EXTRACTION: Turns complex time-stamped events into regularly sampled historical data from which data scientists can extract meaningful features.

Why it matters: Time-series feature extraction allows institutions to identify complex trends and changes over time and collect analytic insights into how interventions and outreach by faculty and advisors are influencing these time-series features. For example, using time-series feature extraction from the LMS, an institution could identify times when students are cramming or studying consistently, or identify changes in student sentiment over time.
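
The cramming example can be sketched in two steps: turn irregular LMS timestamps into a regular daily series, then extract a feature from that series. The "cramming ratio" below (the share of all activity that falls in the final days before a deadline) is a hypothetical feature invented for this illustration.

```python
from datetime import date, datetime

def daily_counts(timestamps, start, end):
    """Resample event timestamps into one count per day, start to end inclusive."""
    n_days = (end - start).days + 1
    counts = [0] * n_days
    for ts in timestamps:
        offset = (ts.date() - start).days
        if 0 <= offset < n_days:  # ignore events outside the window
            counts[offset] += 1
    return counts

def cramming_ratio(counts, window=2):
    """Share of total activity in the last `window` days of the series."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[-window:]) / total

# Invented LMS activity in the five days leading up to an exam.
clicks = [
    datetime(2016, 5, 1, 9, 0),
    datetime(2016, 5, 4, 20, 15), datetime(2016, 5, 4, 22, 40),
    datetime(2016, 5, 5, 1, 5), datetime(2016, 5, 5, 7, 30), datetime(2016, 5, 5, 8, 10),
]
series = daily_counts(clicks, date(2016, 5, 1), date(2016, 5, 5))
ratio = cramming_ratio(series)
```

A ratio near 1 suggests last-minute studying, while a student who works steadily would show a value closer to the window's share of the period.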

STUDENT ENGAGEMENT PREDICTION: Uses influenceable activity and behavioral factors to predict student engagement, which is highly associated with student success.

Why it matters: By knowing which influenceable factors are directly correlated with student success, and when students are most receptive to outreach, institutions can identify which students are at risk, know when to intervene proactively, and personalize interventions based on students’ level of engagement and academic performance.

INTERVENTION AND INSPIRATION SCIENCE: Examines how to engage and motivate students to succeed, and measures the impact of current interventions daily using dynamic propensity score matching for faster, actionable intervention and inspiration insights based on exposure to treatment or student touch points.

Why it matters: Intervention and inspiration science gives real-time feedback for institutions to determine the best way to re-engage each student, and shapes the likely impact of intervention and inspiration efforts.

IMPACT PREDICTION: Creates a forecast of which interventions will be most impactful for which students in which situations.

Why it matters: Impact prediction is the understanding of what micro-interventions, inspirations and nudges are the most impactful for each student, based on their needs and the institutional resources available. These predictions help institutions match the right student with the right support at the right time, enabling them to use resources effectively and provide support where and when it is needed most.
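
At its simplest, matching a student to an intervention means comparing predicted outcomes with and without each option and choosing the largest gain. The sketch below assumes those predictions already exist (how they are produced is not shown), and the intervention names and probabilities are invented. It also ignores cost and capacity, which a real allocation would need to weigh.

```python
def best_intervention(baseline, predicted):
    """Pick the option with the largest predicted gain over doing nothing.

    `baseline` is the predicted probability of persisting with no outreach;
    `predicted` maps intervention name -> predicted probability with it.
    Returns (name, uplift), or (None, 0.0) if nothing is predicted to help.
    """
    if not predicted:
        return None, 0.0
    name, probability = max(predicted.items(), key=lambda kv: kv[1])
    uplift = probability - baseline
    if uplift <= 0:
        return None, 0.0
    return name, uplift

choice, gain = best_intervention(0.60, {"tutoring": 0.72, "email_nudge": 0.63})
```

Returning no recommendation when nothing beats the baseline matters as much as the recommendation itself, since outreach that does not help still consumes staff time.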

Mark D. Milliron (@markmilliron) is cofounder and chief learning officer at Civitas Learning, and David Kil is the company’s chief data scientist.
