Careers in STEM: Why Should I Study Data Science?

Careers in data science are in high demand and offer high salaries and advancement opportunities. Learn five reasons to consider a career in the field.

Valerie Kirk

Data is often referred to as the new gold because it has become an essential raw material.

From smartphones to traffic cameras to weather satellites, modern technology devices are collecting massive amounts of data that support everything from cancer research to city planning.

The importance of data in today’s world spans all industries, including healthcare, education, travel, and government. Businesses make decisions based on data, and improving customer experiences relies on it. Data is also critical to our national defense. Simply put, today’s world runs on data.

But unlike gold, data does not have value in its raw state. To tap into the power of data to make smart, data-driven decisions, it has to be collected, cleaned, organized, and analyzed.

That’s why data is also called the new oil: like oil, it must be extracted and refined before it has value.

That’s where the field of data science comes in.

What is Data Science?

Data science is the study of data to extract meaningful insights for business and government.

People who pursue a degree in data science study math and computer science. Their career path includes jobs where they handle, organize, and interpret massive volumes of information with the goal of discerning patterns. They also construct complex algorithms to build predictive models. Data science tasks include data processing, data analytics, and data visualization.

Data scientists are on the leading edge of innovation and emerging technology, including machine learning and artificial intelligence, which rely on significant amounts of digital data to generate insights.

Careers in data science are growing fast. Data science jobs are in high demand and can be found in nearly every industry. A few of the most common data science jobs include:

  • Chief Data Officer
  • Artificial Intelligence Engineer
  • Data Scientist
  • Data Engineer
  • Machine Learning Engineer
  • Software Engineer
  • Data Modeler
  • Data Analyst
  • Big Data Engineer

Why is Data Science Important?

Just as data is the new gold and the new oil, data is also the new currency. For businesses, the insights derived from data science are essential for data-driven decision-making. They guide everything from the product lifecycle to fulfillment to office or warehouse locations. Data scientists provide information that’s critical to a company’s growth.

The benefits of data science extend beyond business. Government agencies from the federal level down to state and local entities also rely on data insights for emergency planning and response, public safety, city planning, intelligence gathering, national defense, and many other services.

Another reason why data science is important? It taps into the potential of artificial intelligence, which can improve productivity and efficiencies, provide stronger cybersecurity, and personalize customer experiences. To be effective, artificial intelligence relies on a lot of data, which is often pulled from massive data repositories and organized and analyzed by data scientists.


5 Reasons to Study Data Science

The field of data science is a great career choice that offers high salaries, opportunities across several industries, and long-term job security. Here are five reasons to consider a career in data science.

1. Data Scientists Are in High Demand

According to the United States Bureau of Labor Statistics, data scientist jobs are projected to grow 36% by 2031, which is much faster than the average for all occupations. Data science careers also offer significant potential for advancement, with the relatively new role of chief data officer becoming a key C-suite position across all types of businesses.

Because the high-demand field requires a special skill set, professionals with data science degrees or certificates are more likely to land a desired position in a top company and enjoy more job security.

2. Careers in Data Science Have High Earning Potential

That high demand also leads to higher salaries relative to other careers. According to Glassdoor, the estimated total pay for a data scientist in the United States is $126,200 per year.

New data scientists can expect starting salaries of around $100,000 per year, with experienced data scientists earning more than $200,000 per year. The average annual salary for chief data officers is $636,000, with top data executives clearing more than $1 million a year.

The salary potential is only expected to grow as data drives artificial intelligence innovations.

3. Data Science Skills Are Going to Grow in Value

Think about this — smartphones, drones, satellites, sensors, security cameras, and other devices collect data 24 hours a day, seven days a week. Data is also being generated by organizations from every project, product launch, customer sale, employee action, and other business activities.

Then think about data that comes from every financial transaction, healthcare interaction, scholarly research project, and other initiatives outside of the business world. Data is continuously being generated from multiple sources for multiple uses — and that isn’t going to stop.

Turning all of that data into actionable insights is a unique, high-demand skill that will only grow in value as more data is generated. As technology advances, data scientists will be at the forefront of new breakthroughs and innovations. It’s an exciting and evolving career.

4. Data Science Provides a Wide Range of Job Opportunities

Every business, government agency, and educational institution generates data. They all need support in gaining insights from that data. Having a degree or certificate in data science gives people the flexibility to work in the industry that interests and inspires them.

5. Data Scientists Can Make the World a Better Place

While data scientists can offer insights to help businesses grow, they can also offer insights to help humanity. Data science careers include unique opportunities to make an impact on the world. Consider these initiatives where data science is playing a significant role:

  • Climate change. To support climate measures that could lower carbon dioxide emissions, the California Air Resources Board, Planet Labs, and the Environmental Defense Fund are working together on a Climate Data Partnership to track climate change from space.
  • Medical research. The National Institutes of Health is working to improve biomedical research through its NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability Initiative, which enables access to rich datasets and breaks down data silos to support medical researchers.
  • Rural planning. The U.S. Department of Agriculture launched a Rural Data Gateway to help farmers and ranchers access the resources they need for everything from sustainable farming practices to lowering energy costs.

Other data-driven service-oriented initiatives include making cities safer for pedestrians and bikers, supporting affordable housing for underserved communities, and improving access to social services. Hear about other initiatives that are tapping into the power of AI for good in this fireside chat with Harvard Extension School’s director of IT programs, Bruce Huang.

Study Data Science at Harvard Extension School

If you are ready to start, advance, or pivot to a career in this exciting and growing field, Harvard Extension School offers a Data Science Master’s Degree Program.

The program focuses on mastering the technical, analytical, and practical skills needed to solve real-world, data-driven problems. The program covers predictive modeling, data mining, machine learning, artificial intelligence, data visualization, and big data. You will also learn how to apply data science and analytical methods to address data-rich problems and develop the skills for quantitative thought leadership, including the ethical and legal dimensions of data analytics.

The program includes 11 courses that can be taken online and one on-campus course, in which you develop a plan for a capstone project with peers and faculty. In the final capstone course, you will apply your new skills to a real-world challenge. Capstone project teams collaborate with industry, government, or academic institutions to explore the possibilities of using data science and analytics for good. Recent capstone projects include:

  • Improving the climate change model used by NASA.
  • Developing a tool that combines aerial imagery and advanced georeferencing techniques to assess damage in disaster-stricken areas.
  • Using computer vision and video classification to develop a crime detection system for analyzing surveillance videos and identifying suspicious activities, contributing to enhanced public safety and crime prevention efforts.
  • Predicting patient MRI scans in a hospital system to optimize resource allocation and ensure efficient patient care delivery.
  • Streamlining the medical coding process to reduce errors and improve efficiencies.

You can also earn a Data Science Graduate Certificate through the Harvard Extension School. In this certificate program, you will:

  • Master key facets of data investigation, including data wrangling, cleaning, sampling, management, exploratory analysis, regression and classification, prediction, and data communication.
  • Implement foundational concepts of data computation, such as data structure, algorithms, parallel computing, simulation, and analysis.
  • Leverage your knowledge of key subject areas, such as game theory, statistical quality control, exponential smoothing, seasonally adjusted trend analysis, or data visualization.

Four courses are required for the program and vary based on the data science career path you are interested in pursuing.

If you are thinking about advancing your career or making a career change into the growing data science field, learn more about the Data Science Master’s Degree program or the Data Science Graduate Certificate program, including class requirements, tuition, and how to apply.

About the Author

Valerie Kirk is a freelance writer and corporate storyteller specializing in customer and community outreach and topics and trends in education, technology, and healthcare. Based in Maryland near the Chesapeake Bay, she spends her free time exploring nature by bike, paddleboard, or on long hikes with her family.


Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.

The accelerating volume of data sources, and subsequently data, has made data science one of the fastest-growing fields across every industry. As a result, it is no surprise that the role of the data scientist was dubbed the “sexiest job of the 21st century” by Harvard Business Review. Organizations are increasingly reliant on data scientists to interpret data and provide actionable recommendations to improve business outcomes.

The data science lifecycle involves various roles, tools, and processes that enable analysts to glean actionable insights. Typically, a data science project undergoes the following stages:

  • Data ingestion: The lifecycle begins with data collection—gathering both raw structured and unstructured data from all relevant sources using a variety of methods. These methods can include manual entry, web scraping, and real-time streaming from systems and devices. Data sources can include structured data, such as customer data, along with unstructured data like log files, video, audio, pictures, the Internet of Things (IoT), social media, and more.
  • Data storage and data processing: Since data can have different formats and structures, companies need to consider different storage systems based on the type of data that needs to be captured. Data management teams help to set standards around data storage and structure, which facilitate workflows around analytics, machine learning and deep learning models. This stage includes cleaning data, deduplicating, transforming and combining the data using ETL (extract, transform, load) jobs or other data integration technologies. This data preparation is essential for promoting data quality before loading into a data warehouse, data lake, or other repository.
  • Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases, patterns, ranges, and distributions of values within the data. This exploratory analysis drives hypothesis generation for A/B testing. It also allows analysts to determine the data’s relevance for use within modeling efforts for predictive analytics, machine learning, and/or deep learning. Depending on a model’s accuracy, organizations can become reliant on these insights for business decision making, allowing them to drive more scalability.
  • Communicate: Finally, insights are presented as reports and other data visualizations that make the insights—and their impact on business—easier for business analysts and other decision-makers to understand. A data science programming language such as R or Python includes components for generating visualizations; alternatively, data scientists can use dedicated visualization tools. A minimal end-to-end sketch of these four stages follows this list.
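To make the stages concrete, here is a minimal sketch in Python. The file name `sales.csv` and its columns (`order_date`, `region`, `amount`) are hypothetical, and the pandas and Matplotlib calls stand in for the much richer tooling a real project would use.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data ingestion: load raw structured data (hypothetical file and columns).
raw = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Data storage and processing: deduplicate, drop incomplete rows, derive a field.
clean = raw.drop_duplicates().dropna(subset=["region", "amount"])
clean["month"] = clean["order_date"].dt.to_period("M")

# Data analysis: explore counts, ranges, and distributions per region.
print(clean.groupby("region")["amount"].agg(["count", "mean", "std"]))

# Communicate: visualize monthly revenue for decision-makers.
clean.groupby("month")["amount"].sum().plot(kind="bar", title="Monthly revenue")
plt.tight_layout()
plt.show()
```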


Data science is considered a discipline, while data scientists are the practitioners within that field. Data scientists are not necessarily directly responsible for all the processes involved in the data science lifecycle. For example, data pipelines are typically handled by data engineers—but the data scientist may make recommendations about what sort of data is useful or required. While data scientists can build machine learning models, scaling these efforts at a larger level requires more software engineering skills to optimize a program to run more quickly. As a result, it’s common for a data scientist to partner with machine learning engineers to scale machine learning models.

Data scientist responsibilities can commonly overlap with those of a data analyst, particularly with exploratory data analysis and data visualization. However, a data scientist’s skillset is typically broader than that of the average data analyst. Comparatively speaking, data scientists leverage common programming languages, such as R and Python, to conduct more statistical inference and data visualization.

To perform these tasks, data scientists require computer science and pure science skills beyond those of a typical business analyst or data analyst. The data scientist must also understand the specifics of the business, such as automobile manufacturing, eCommerce, or healthcare.

In short, a data scientist must be able to:

  • Know enough about the business to ask pertinent questions and identify business pain points.
  • Apply statistics and computer science, along with business acumen, to data analysis.
  • Use a wide range of tools and techniques for preparing and extracting data—everything from databases and SQL to data mining to data integration methods.
  • Extract insights from big data using predictive analytics and artificial intelligence (AI), including machine learning models, natural language processing, and deep learning.
  • Write programs that automate data processing and calculations.
  • Tell—and illustrate—stories that clearly convey the meaning of results to decision-makers and stakeholders at every level of technical understanding.
  • Explain how the results can be used to solve business problems.
  • Collaborate with other data science team members, such as data and business analysts, IT architects, data engineers, and application developers.

These skills are in high demand, and as a result, many individuals breaking into a data science career explore a variety of data science programs, such as certification programs, data science courses, and degree programs offered by educational institutions.


It may be easy to confuse the terms “data science” and “business intelligence” (BI) because they both relate to an organization’s data and analysis of that data, but they do differ in focus.

Business intelligence (BI) is typically an umbrella term for the technology that enables data preparation, data mining, data management, and data visualization. Business intelligence tools and processes allow end users to identify actionable information from raw data, facilitating data-driven decision-making within organizations across various industries. While data science tools overlap in much of this regard, business intelligence focuses more on data from the past, and the insights from BI tools are more descriptive in nature. It uses data to understand what happened before to inform a course of action. BI is geared toward static (unchanging) data that is usually structured. While data science uses descriptive data, it typically utilizes it to determine predictive variables, which are then used to categorize data or to make forecasts.

Data science and BI are not mutually exclusive—digitally savvy organizations use both to fully understand and extract value from their data.

Data scientists rely on popular programming languages to conduct exploratory data analysis and statistical regression. These open source tools support pre-built statistical modeling, machine learning, and graphics capabilities. These languages include the following (read more at "Python vs. R: What's the Difference?"):

  • R: An open source programming language and environment for statistical computing and graphics, commonly used through the RStudio interface.
  • Python: A dynamic and flexible programming language with numerous libraries, such as NumPy, pandas, and Matplotlib, for analyzing data quickly.

To facilitate sharing code and other information, data scientists may use GitHub and Jupyter notebooks.

Some data scientists may prefer a user interface, and two common enterprise tools for statistical analysis include:

  • SAS: A comprehensive tool suite, including visualizations and interactive dashboards, for analyzing, reporting, data mining, and predictive modeling.
  • IBM SPSS: Offers advanced statistical analysis, a large library of machine learning algorithms, text analysis, open source extensibility, integration with big data, and seamless deployment into applications.

Data scientists also gain proficiency in using big data processing platforms, such as Apache Spark, the open source framework Apache Hadoop, and NoSQL databases. They are also skilled with a wide range of data visualization tools, including simple graphics tools included with business presentation and spreadsheet applications (like Microsoft Excel), built-for-purpose commercial visualization tools like Tableau and IBM Cognos, and open source tools like D3.js (a JavaScript library for creating interactive data visualizations) and RAW Graphs. For building machine learning models, data scientists frequently turn to frameworks like PyTorch, TensorFlow, MXNet, and Spark MLlib.
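As a flavor of model building in one of these frameworks, the sketch below fits a one-parameter linear model in PyTorch on synthetic data. It is a minimal illustration of the common training-loop pattern (forward pass, loss, backpropagation, update), not a production pipeline; the learning rate and epoch count are arbitrary choices.

```python
import torch
from torch import nn

# Synthetic regression data: y = 3x + 1 plus a little noise.
torch.manual_seed(0)
X = torch.rand(100, 1)
y = 3 * X + 1 + 0.1 * torch.randn(100, 1)

model = nn.Linear(1, 1)                          # one weight, one bias
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                  # forward pass
    loss.backward()                              # backpropagation
    optimizer.step()                             # gradient update

print(model.weight.item(), model.bias.item())    # should approach 3 and 1
```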

Given the steep learning curve in data science, many companies are seeking to accelerate their return on investment for AI projects, yet they often struggle to hire the talent needed to realize data science projects’ full potential. To address this gap, they are turning to multipersona data science and machine learning (DSML) platforms, giving rise to the role of the “citizen data scientist.”

Multipersona DSML platforms use automation, self-service portals, and low-code/no-code user interfaces so that people with little or no background in digital technology or expert data science can create business value using data science and machine learning. These platforms also support expert data scientists by offering a more technical interface. Using a multipersona DSML platform encourages collaboration across the enterprise.

Cloud computing scales data science by providing access to additional processing power, storage, and other tools required for data science projects.

Since data science frequently leverages large data sets, tools that can scale with the size of the data are incredibly important, particularly for time-sensitive projects. Cloud storage solutions, such as data lakes, provide access to storage infrastructure capable of ingesting and processing large volumes of data with ease. These storage systems provide flexibility to end users, allowing them to spin up large clusters as needed. They can also add incremental compute nodes to expedite data processing jobs, allowing the business to make short-term tradeoffs for a larger long-term outcome. Cloud platforms typically have different pricing models, such as per-use or subscriptions, to meet the needs of their end user—whether they are a large enterprise or a small startup.

Open source technologies are widely used in data science tool sets. When they’re hosted in the cloud, teams don’t need to install, configure, maintain, or update them locally. Several cloud providers, including IBM Cloud®, also offer prepackaged tool kits that enable data scientists to build models without coding, further democratizing access to technology innovations and data insights. 

Enterprises can unlock numerous benefits from data science. Common use cases include process optimization through intelligent automation and enhanced targeting and personalization to improve the customer experience (CX). Here are a few more specific, representative use cases for data science and artificial intelligence:

  • An international bank delivers faster loan services with a mobile app using machine learning-powered credit risk models and a hybrid cloud computing architecture that is both powerful and secure.
  • An electronics firm is developing ultra-powerful 3D-printed sensors to guide tomorrow’s driverless vehicles. The solution relies on data science and analytics tools to enhance its real-time object detection capabilities.
  • A robotic process automation (RPA) solution provider developed a cognitive business process mining solution that reduces incident handling times between 15% and 95% for its client companies. The solution is trained to understand the content and sentiment of customer emails, directing service teams to prioritize those that are most relevant and urgent.
  • A digital media technology company created an audience analytics platform that enables its clients to see what’s engaging TV audiences as they’re offered a growing range of digital channels. The solution employs deep analytics and machine learning to gather real-time insights into viewer behavior.
  • An urban police department created statistical incident analysis tools to help officers understand when and where to deploy resources in order to prevent crime. The data-driven solution creates reports and dashboards to augment situational awareness for field officers.
  • Shanghai Changjiang Science and Technology Development used IBM® Watson® technology to build an AI-based medical assessment platform that can analyze existing medical records to categorize patients based on their risk of experiencing a stroke and that can predict the success rate of different treatment plans.



Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). The knowledge or useful insights extracted from these data can be used for smart decision-making in various application domains. In the area of data science, advanced analytics methods including machine learning modeling can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains including business, healthcare, cybersecurity, urban and rural data science, and so on, taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data [ 17 ]. Thus the current electronic world is a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc [ 112 ]. The data can be structured, semi-structured, or unstructured, and its volume increases day by day [ 105 ]. Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze the actual phenomena with data. According to Cao et al. [ 17 ], “data science is the science of data” or “data science is the study of data”, where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of “Data science” is increasing day by day, as shown in Fig. 1 according to Google Trends data over the last 5 years [ 36 ]. In addition to data science, the figure also shows the popularity trends of the relevant areas “Data analytics”, “Data mining”, “Big data”, and “Machine learning”. According to Fig. 1, the popularity indication values for these data-driven domains, particularly “Data science” and “Machine learning”, are increasing day by day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study “Data science” and machine-learning-based “Advanced analytics” in this paper.

Fig. 1: The worldwide popularity score of data science compared with relevant areas, on a scale of 0 (min) to 100 (max) over time; the x-axis represents the timestamp and the y-axis the corresponding score.

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a description of data in general, while advanced analytics is a step forward in offering a deeper understanding of data and helping to analyze granular data, which we are interested in. In the field of data science, several types of analytics are popular, such as “Descriptive analytics”, which answers the question of what happened; “Diagnostic analytics”, which answers the question of why it happened; “Predictive analytics”, which predicts what will happen in the future; and “Prescriptive analytics”, which prescribes what action should be taken, discussed briefly in “Advanced analytics methods and smart computing”. Such advanced analytics and decision-making based on machine learning techniques [ 105 ], a major part of artificial intelligence (AI) [ 102 ], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to its learning capability for smart computing as well as automation [ 121 ].

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data, and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from the artificial neural network, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “Advanced analytics methods and smart computing”. Thus, it’s important to understand the principles of the various advanced analytics methods mentioned above and their applicability in various real-world application areas. For instance, in our earlier paper Sarker et al. [ 114 ], we have discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account the data science application areas and real-world problems in ten potential domains including the area of business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “Real-world application domains”.

Based on the importance of machine learning modeling in extracting useful insights from the data mentioned above and enabling data-driven smart decision-making, in this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytics methods from a solution perspective, and their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for those in academia and industry who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in real-world life. We also make a brief discussion on the concept of data science modeling from business problems to data products and automation, to understand its applicability and to provide intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confusing. In the following, we define these terms and differentiate them with the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [ 17 ]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [ 17 ]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which has a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [ 38 ], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [ 38 ]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. “Big data” is another popular term nowadays, which may change the statistical and data analysis approaches as it has the unique features of “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [ 74 ]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [ 129 ]. Several unique features including volume, velocity, variety, veracity, value (5Vs), and complexity are used to understand and describe big data [ 69 ].

In terms of analytics, basic analytics provides a summary of data, whereas “Advanced analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights, predict or generate recommendations, typically beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics, as it can automate analytical model building [ 112 ]. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [ 38 , 115 ]. “Deep learning” is a subfield of machine learning that concerns algorithms inspired by the structure and function of the human brain, called artificial neural networks [ 38 , 139 ].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [ 17 ], Cao et al. defined data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “Understanding data science modeling”, we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working on a particular real-world problem domain within the area of data science and analytics.

Related Work

In the area, several papers have been reviewed by the researchers based on data science and its significance. For example, the authors in [ 19 ] identify the evolving field of data science and its importance in the broader knowledge environment and some issues that differentiate data science and informatics issues from conventional approaches in information sciences. Donoho et al. [ 27 ] present 50 years of data science including recent commentary on data science in mass media, and on how/whether data science varies from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [ 53 ] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [ 17 ], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [ 17 , 112 , 114 ]. The data can be in different types such as (i) Structured—that has a well-defined data structure and follows a standard order, examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—has no pre-defined format or organization, examples are sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—has elements of both the structured and unstructured data containing certain organizational properties, examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—that represents data about the data, examples are author, file type, file size, creation date and time, last modification date and time, etc. [ 38 , 105 ].
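A short Python illustration of these four categories, with invented values; the point is only how differently each type is represented and accessed.

```python
import json, os, time

# (i) Structured: fixed fields that follow a standard order.
customer = {"name": "Alice", "signup_date": "2021-04-01", "credit_limit": 5000}

# (ii) Unstructured: free text with no pre-defined format or organization.
note = "Patient reports mild headache and fatigue since Tuesday."

# (iii) Semi-structured: JSON with nested and optional elements.
event = json.loads('{"device": "phone-7", "readings": [21.4, 21.9], "tags": {"room": "lab"}}')
print(event["tags"]["room"])

# (iv) Metadata: data about the data, e.g., file size and modification time.
info = os.stat(__file__)
print("size:", info.st_size, "modified:", time.ctime(info.st_mtime))
```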

In the area of data science, researchers use various widely used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ], etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “Real-world application domains”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in “Background and related work”. In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves getting a clear understanding of the problem to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, the data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, is the behavior unrealistic/abnormal, which option should be taken, what action, etc. could be relevant questions depending on the nature of the problems. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge enables organizations to enhance their decision-making process, which is known as “Business Intelligence” [ 65 ]. Identifying the relevant data sources that can help to answer the formulated questions, and what kinds of actions should be taken from the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem.
  • Understanding data: Data science is largely driven by the availability of data [ 114 ]. Thus, a sound understanding of the data is needed for a data-driven model or system. The reason is that real-world datasets are often noisy, contain missing values and inconsistencies, or have other data issues, which need to be handled effectively [ 101 ]. To gain actionable insights, the appropriate data or the quality of the data must be sourced and cleansed, which is fundamental to any data science engagement. For this, data assessment that evaluates what data is available and how it aligns to the business problem could be the first step in data understanding. Several aspects such as data type/format, the quantity of data and whether it is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc. need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would be best needed and the best ways to acquire it.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [ 135 ]. This examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus data exploration is typically used to figure out the gist of data and to develop a first step assessment of its quality, quantity, and characteristics. A statistical model can be used or not, but primarily it offers tools for creating hypotheses by generally visualizing and interpreting the data through graphical representation such as a chart, plot, histogram, etc [ 72 , 91 ]. Before the data is ready for modeling, it’s necessary to use data summarization and visualization to audit the quality of the data and provide the information needed to process it. To ensure the quality of the data, the data pre-processing technique, which is typically the process of cleaning and transforming raw data [ 107 ] before processing and analysis, is important. It also involves reformatting information, making data corrections, and merging data sets to enrich data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, search for outliers or anomalies in data and dealing with them, ensuring data quality, etc. could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models, to address the business problem. Model building is dependent on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “Advanced analytics methods and smart computing”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models, summarized in our earlier paper Sarker et al. [ 105 ], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually in an 80:20 ratio, or split the data using the popular k-folds method [ 38 ] (a minimal split-and-evaluate sketch follows this list). This is to observe whether the model performs well on the data and to maximize the model performance. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [ 38 , 115 ] are used to measure model performance, which can guide the data scientists to choose or design the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced analytics such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms to improve the ultimate data-driven model to solve a particular business problem through smart decision making.
  • Data product and automation: A data product is typically the output of any data science activity [ 17 ]. A data product, in general terms, is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “Real-world application domains”, where various data products can play a significant role in relevant business problems to make them smart and automated.
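As referenced in the modeling and evaluation step above, here is a minimal split-and-evaluate sketch using scikit-learn, a common Python toolkit (not one prescribed by this paper); the dataset is synthetic and the random forest is just one plausible model choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared business dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out a test subset in the 80:20 ratio mentioned above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model assessment metrics from the list above.
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f-score  :", f1_score(y_test, y_pred))
```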

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. The interesting part of the data science process is having a deeper understanding of the business problem to solve. Without that, it would be much harder to gather the right data and extract the most useful information from it for making decisions to solve the problem. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today’s computing process smarter and more intelligent, as discussed briefly in the following section.

Fig. 2: An example of data science modeling from real-world data to a data-driven system and decision-making.

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “Background and related work”, basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following section, we first summarize various types of analytics and the outcomes that are needed to solve the associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, and “What action should be taken?” are common and important. Based on these questions, in this paper, we categorize and highlight the analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: It is the interpretation of historical data to better understand the changes that have occurred in a business. Thus descriptive analytics answers the question, “what happened in the past?” by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, Linkedin or Facebook, etc. For instance, using descriptive analytics through analyzing trends, patterns, and anomalies, etc., customers’ historical shopping data can be used to predict the probability of a customer purchasing a product. Thus, descriptive analytics can play a significant role to provide an accurate picture of what has occurred in a business and how it relates to previous times utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business, and eventually can take more effective management strategies and business decisions.
  • Diagnostic analytics: It is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help to find the root cause of the problem. For example, the human resource management department of a business organization may use these diagnostic analytics to find the best applicant for a position, select them, and compare them to other similar positions to see how well they perform. In a healthcare example, it might help to figure out whether the patients’ symptoms such as high fever, dry cough, headache, fatigue, etc. are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes such as to assess business risks, anticipate potential market patterns, and decide when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, “what will happen in the future?” Thus, the primary goal of predictive analytics is to identify and typically answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. Companies, for example, can use predictive analytics to minimize costs by better anticipating future demand and changing output and inventory, banks and other financial institutions to reduce fraud and risks by predicting suspicious activity, medical specialists to make effective decisions through predicting patients who are at risk of diseases, retailers to increase sales and customer satisfaction through understanding and predicting customer preferences, manufacturers to optimize production capacity through predicting maintenance requirements, and many more. Thus predictive analytics can be considered as the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability; it typically answers the question “What action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. It is therefore related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring, in contrast to descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed decisions that drive the most successful business outcomes.

In summary, both descriptive analytics and diagnostic analytics look at the past to clarify what happened and why it happened. Predictive analytics and prescriptive analytics use historical data to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we summarize these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes and improvements in business processes. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from data.

Table 1: Various types of analytical methods with examples

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of a machine learning-based predictive model, considering both the training and testing phases. In the following, we discuss a wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, and log analysis, within the scope of our study.

Figure 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Regression Analysis

In data science, one of the most common statistical approaches used for predictive modeling and data mining tasks is regression [38]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict a continuous-valued output [105, 117]. Equations 1, 2, and 3 [85, 105] represent simple, multiple (or multivariate), and polynomial regression respectively, where $x$ denotes the independent variable(s), $y$ is the predicted/target output, $a$ is the intercept, $b_i$ are the regression coefficients, and $n$ is the number of predictors or the polynomial degree:

$$y = a + bx \qquad (1)$$

$$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n \qquad (2)$$

$$y = a + b_1 x + b_2 x^2 + \cdots + b_n x^n \qquad (3)$$

Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of an explanatory variable on the dependent variable, i.e., to find a causal relationship between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem; in that case, polynomial regression performs better, although it increases model complexity. Regularization techniques such as Ridge, Lasso, and Elastic-Net [85, 105] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression [85, 105] can be used to build effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting, cost estimation, trend analysis, marketing, time-series estimation, and drug response modeling are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
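To make this concrete, the following is a minimal sketch, assuming scikit-learn and synthetic data (not taken from the paper), that contrasts a plain linear model in the form of Eq. 1 with a polynomial model in the spirit of Eq. 3 and a Ridge-regularized variant:

```python
# A hedged illustration: linear vs. polynomial vs. Ridge-regularized regression
# on synthetic non-linear data, using scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=200)  # quadratic target

linear = LinearRegression().fit(X, y)                       # Eq. 1 style model
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)          # Eq. 3 style model
ridge = make_pipeline(PolynomialFeatures(degree=2),
                      Ridge(alpha=1.0)).fit(X, y)           # regularized variant

print("linear R^2:", round(linear.score(X, y), 3))  # underfits the curve
print("poly   R^2:", round(poly.score(X, y), 3))
print("ridge  R^2:", round(ridge.score(X, y), 3))
```

Here the linear model underfits the quadratic pattern, illustrating the point above that polynomial or regularized regression may be preferable for non-linear tasks.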

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [38]. Spam identification, such as labeling email as “spam” or “not spam” in email service providers, is an example of a classification problem. Several forms of classification analysis exist in the area: binary classification, which refers to predicting one of two classes; multi-class classification, which involves predicting one of more than two classes; and multi-label classification, a generalization of multi-class classification in which multiple class labels may be assigned to a single example [105].

Several popular classification techniques exist to solve classification problems, such as k-nearest neighbors [5], support vector machines [55], naive Bayes [49], adaptive boosting [32], extreme gradient boosting [85], logistic regression [66], the decision tree algorithms ID3 [92] and C4.5 [93], and random forests [13]. A tree-based classification technique such as random forest, which considers multiple decision trees, often performs better than others on real-world problems due to its capability of producing logic rules [103, 115]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [109], and IntrudTree [106] can be used for building effective classification or prediction models for relevant tasks within the domain of data science and analytics.

Figure 4: An example of a random forest structure considering multiple decision trees
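As an illustration of the training and testing phases described above, the following is a minimal sketch, assuming scikit-learn and its bundled Iris dataset, of a random forest classifier built from multiple decision trees:

```python
# A hedged illustration: a random forest (an ensemble of decision trees)
# trained and tested on a small labeled dataset with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees
clf.fit(X_train, y_train)      # training phase
pred = clf.predict(X_test)     # testing phase
print("accuracy:", accuracy_score(y_test, pred))
```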

Cluster Analysis

Clustering is a form of unsupervised machine learning that is well-known in many data science application areas for statistical data analysis [38]. Usually, clustering techniques search for structure inside a dataset and, when class labels are not previously known, identify homogeneous groups of cases. This means that data points are similar to each other within a cluster and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [105]. Clustering is often used to gain insight into how data is distributed in a given dataset or as a preprocessing phase for other algorithms. For example, data clustering assists retail businesses with analyzing customer shopping behavior, planning sales campaigns, retaining consumers, and detecting anomalies.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [98, 138, 141]. In our earlier paper, Sarker et al. [105], we summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical methods, and model-based methods. In the literature, the popular K-means [75], K-medoids [84], and CLARA [54] are known as partitioning methods; DBSCAN [30] and OPTICS [8] are known as density-based methods; and single linkage [122] and complete linkage [123] are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [134] and CLIQUE [2]; model-based clustering, such as neural network learning [141], GMM [94], and SOM [18, 104]; and constraint-based methods, such as COP K-means [131] and CMWK-Means [25], are used in the area. Recently, Sarker et al. [111] proposed BOTS, a hierarchical clustering method based on a bottom-up agglomerative technique for capturing users’ similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant data science application areas.
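The following is a minimal sketch, assuming scikit-learn and synthetic data, that contrasts a partitioning method (K-means) with a hierarchical method (agglomerative clustering with complete linkage), both mentioned above:

```python
# A hedged illustration: K-means (partitioning) vs. agglomerative clustering
# (hierarchical, complete linkage) on synthetic data with scikit-learn.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3,
                                       linkage="complete").fit_predict(X)

print("K-means cluster sizes:      ", [(kmeans_labels == i).sum() for i in range(3)])
print("Agglomerative cluster sizes:", [(agglo_labels == i).sum() for i in range(3)])
```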

Association Rule Analysis

Association rule learning is a rule-based, unsupervised machine learning method typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets to discover interesting relationships or patterns. The association rule learning technique’s main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [138].

Association rules allow a data scientist to identify trends, associations, and co-occurrences inside large data collections. In a supermarket, for example, associations reveal knowledge about the buying behavior of consumers for different items, which helps to adjust marketing and sales plans. In healthcare, physicians may use association rules to better diagnose patients: by comparing symptom associations in data from previous cases, doctors can assess the conditional likelihood of a given illness. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, and so on.

Several types of association rules have been proposed in the area, such as frequent pattern based [4, 47, 73], logic-based [31], tree-based [39], fuzzy rules [126], and belief rules [148]. Rule learning techniques such as AIS [3], Apriori [4], Apriori-TID and Apriori-Hybrid [4], FP-Tree [39], Eclat [144], and RARM [24] exist to solve the relevant business problems. Among association rule learning techniques, Apriori [4] is the most commonly used algorithm for discovering association rules from a given dataset [145]. The recent association rule learning technique ABC-RuleMiner, proposed in our earlier paper by Sarker et al. [113], can give significant results in terms of generating non-redundant rules that can be used for smart decision-making according to human preferences within the area of data science applications.
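To illustrate the support and confidence constraints mentioned above, the following is a minimal sketch, assuming the third-party mlxtend library, of mining association rules from a toy market-basket dataset with Apriori:

```python
# A hedged illustration: Apriori-based association rule mining on a toy
# market-basket dataset, assuming the mlxtend library is installed.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

transactions = [["milk", "bread", "butter"],
                ["bread", "butter"],
                ["milk", "bread"],
                ["milk", "butter"],
                ["bread", "butter", "jam"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

itemsets = apriori(onehot, min_support=0.4, use_colnames=True)   # minimum support
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```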

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [111]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), in relevant domains.

A mathematical method dealing with such time-series data, i.e., the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [130] learns the behavioral trends or patterns of past data. The moving average (MA) model [40] is another simple and common form of smoothing used in time-series analysis and forecasting that uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) model [12, 120] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [12, 120]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model also covers the non-stationary case. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average with exogenous inputs (ARMAX) models are also used as time-series models [120].
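As a brief illustration of the ARIMA family discussed above, the following is a minimal sketch, assuming the statsmodels library and a synthetic trending series, of fitting an ARIMA(1, 1, 1) model and forecasting a few steps ahead:

```python
# A hedged illustration: fitting an ARIMA(p, d, q) model to a synthetic
# univariate series and forecasting five steps ahead with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
t = np.arange(120)
series = pd.Series(0.1 * t + rng.normal(scale=1.0, size=120))  # trend + noise

model = ARIMA(series, order=(1, 1, 1))  # AR order 1, differencing 1, MA order 1
fitted = model.fit()
print(fitted.forecast(steps=5))         # next five predicted values
```

The differencing term (d = 1) is what lets ARIMA handle the non-stationary trend that a plain ARMA model cannot.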

In addition to stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [111] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users [111]. The authors in [118] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting that outperforms traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any applied science and engineering domain with temporal measurements. Thus, it covers a wide range of application areas in data science.

Figure 5: An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, and topics, and their attributes [71]. Sentiments are commonly grouped into three kinds, positive, negative, and neutral, along with stronger feelings such as angry, happy, and sad, or degrees such as interested or not interested. More refined sentiments for evaluating the feelings of individuals in various situations can also be defined according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain the opinions of the public or its customers about its products and services in order to refine business policy and make better business decisions. Sentiment analysis can thus help a business understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think of a service or product before they use or purchase it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [45].

Several popular techniques are used in sentiment analysis-related tasks, such as lexicon-based methods, including dictionary-based and corpus-based approaches; machine learning, including supervised and unsupervised learning; deep learning; and hybrid methods [70]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, sentiment analysis incorporates statistics, natural language processing (NLP), machine learning, and deep learning methods. It is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus, sentiment analysis has a big influence in many data science applications where public sentiment is involved in real-world issues.
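The following is a minimal sketch, assuming scikit-learn and a tiny hand-labeled corpus (far smaller than any real application would need), of the supervised machine learning approach to sentiment classification mentioned above, using TF-IDF features and logistic regression:

```python
# A hedged illustration: supervised sentiment classification with TF-IDF
# features and logistic regression, using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; real tasks need far more labeled data.
texts = ["great product, works perfectly",
         "terrible service, very disappointed",
         "absolutely love it",
         "worst purchase I have made",
         "happy with the quality",
         "awful, do not buy"]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the service was great", "very disappointed with this"]))
```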

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many other areas [112]. Behavioral analysis aims to understand how and why consumers or users behave, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics uses the large quantities of raw user event data gathered during sessions in which people use apps, games, or websites, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness. In our earlier papers, Sarker et al. [101, 111, 113], we have discussed how to extract users’ phone usage behavioral patterns from real-life phone log data for various purposes.

In real-world scenarios, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization towards particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [111], behavioral decision tree classification [109], and behavioral association rules [113], can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [108], which takes into account recent behavioral patterns, can be effective when analyzing behavioral data, as such data is not static and changes over time in the real world.
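To make the idea of cohort analysis concrete, the following is a minimal sketch, assuming pandas and a hypothetical table of user activity events, that groups users by their first-activity month and counts how many remain active in later months:

```python
# A hedged illustration: a simple retention-style cohort analysis on
# hypothetical user activity events, using pandas.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "date": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-01-20",
                            "2021-03-02", "2021-02-01", "2021-02-15",
                            "2021-03-20"]),
})

events["month"] = events["date"].dt.to_period("M")
events["cohort"] = events.groupby("user_id")["month"].transform("min")
events["offset"] = (events["month"] - events["cohort"]).apply(lambda d: d.n)

# Distinct active users per cohort and month offset since first activity
cohorts = (events.groupby(["cohort", "offset"])["user_id"]
           .nunique().unstack(fill_value=0))
print(cohorts)
```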

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, or exceptions [63, 114]. Anomaly detection techniques may flag new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraudulent or irregular transactions in finance is an example of anomaly detection.

Anomaly detection is often used in preprocessing tasks to remove anomalous or inconsistent values from real-world data collected from various sources, including user logs, devices, networks, and servers. Several machine learning techniques can be used for anomaly detection, such as k-nearest neighbors, isolation forests, and cluster analysis [105]. Excluding anomalous data from a dataset can also result in a statistically significant improvement in accuracy during supervised learning [101]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior, the sparse occurrence of abnormal events, and environmental variations can be challenging in the anomaly detection process. Anomaly detection is applicable in a variety of domains, such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more. It can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
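The following is a minimal sketch, assuming scikit-learn and synthetic data, of the isolation forest technique mentioned above for flagging irregular points:

```python
# A hedged illustration: flagging outliers in synthetic 2D data with an
# isolation forest, using scikit-learn.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # regular points
outliers = rng.uniform(low=6, high=9, size=(5, 2))      # a few irregular points
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.02, random_state=3).fit(X)
labels = iso.predict(X)  # +1 = normal, -1 = anomaly
print("anomalies flagged:", int((labels == -1).sum()))
```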

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [23]. It is usually used to organize variables into a small number of clusters based on their common variance, using mathematical or statistical procedures. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, to calculate the degree to which each variable is associated with the factors, and to learn more about the existence of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [143].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex trends by analyzing the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [143]. Factor analysis is one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analysis are principal component analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [48]. Correlation analysis methods, such as Pearson correlation and canonical correlation, may also be useful in the field, as they can quantify the statistical relationship or association between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
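The following is a minimal sketch, assuming scikit-learn and synthetic data generated from two latent factors, that compares FactorAnalysis with PCA, the most common method noted above:

```python
# A hedged illustration: recovering a low-dimensional factor structure from
# six correlated observed variables with FactorAnalysis and PCA (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(5)
latent = rng.normal(size=(200, 2))                 # two underlying factors
loading = rng.normal(size=(2, 6))                  # how factors drive variables
X = latent @ loading + rng.normal(scale=0.1, size=(200, 6))  # observed data

fa = FactorAnalysis(n_components=2).fit(X)
pca = PCA(n_components=2).fit(X)
print("FA loadings shape:", fa.components_.shape)  # (2 factors, 6 variables)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
```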

Log Analysis

Logs are commonly used in system management, as they are often the only data available that record detailed system runtime activities or behaviors in production [44]. Log analysis can thus be considered the process of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, and so on. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, and software. Phone call logs [88, 110], SMS logs [28], mobile app usage logs [124, 149], notification logs [77], game logs [82], context logs [16, 149], web logs [37], and smartphone life logs [95] are some examples of log data for smartphone devices. The main characteristic of these log data is that they contain users’ actual behavioral activities with their devices. Other similar log data include search logs [50, 133], application logs [26], server logs [33], network logs [57], event logs [83], and network and security logs [142].

Several techniques, such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, and machine learning modeling [105], can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by supporting the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors, and Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take machine learning modeling into account can play a significant role in extracting insightful patterns from these log data, which can be used for building automated and smart applications; log analysis can thus be considered a key working area in data science.
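As a small example of the classification-and-tagging style of log analysis described above, the following is a minimal sketch, using only the Python standard library and hypothetical web-server log lines, that parses entries with a regular expression and tallies HTTP status codes:

```python
# A hedged illustration: parsing hypothetical web-server log lines with a
# regular expression and counting status codes (Python standard library only).
import re
from collections import Counter

log_lines = [
    '192.168.0.1 - - [18/Jan/2024:10:00:01] "GET /index.html HTTP/1.1" 200 512',
    '192.168.0.2 - - [18/Jan/2024:10:00:03] "GET /missing HTTP/1.1" 404 128',
    '192.168.0.1 - - [18/Jan/2024:10:00:05] "POST /login HTTP/1.1" 200 256',
]

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)')

status_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        status_counts[match.group("status")] += 1

print(status_counts)  # e.g. Counter({'200': 2, '404': 1})
```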

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are the multi-layer perceptron (MLP) [85], the convolutional neural network (CNN or ConvNet) [67], and the long short-term memory recurrent neural network (LSTM-RNN) [34]. Figure 6 shows the structure of an artificial neural network model with multiple processing layers. The backpropagation technique [38] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [67] improve on the design of traditional artificial neural networks (ANNs) by adding convolutional layers, pooling layers, and fully connected layers. CNNs are commonly used in a variety of fields, including natural language processing, speech recognition, and image processing, and for other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [60], Xception [21], Inception [125], Visual Geometry Group (VGG) [42], ResNet [43], and other advanced deep learning models based on CNNs are also used in the field.

Figure 6: A structure of an artificial neural network model with multiple processing layers

In addition to CNNs, the recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and making predictions based on time-series data. Therefore, when the data is in a sequential format, such as time series or sentences, LSTM can be used, and it is widely applied in time-series analysis, natural language processing, speech recognition, and so on.

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [104] exist in the field for various purposes. The self-organizing map (SOM) [58], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, thereby reducing dimensionality. Another learning technique commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [10]. Restricted Boltzmann machines (RBMs) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling [46]. A deep belief network (DBN) is usually made up of a backpropagation neural network (BPNN) and unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders [136]. A generative adversarial network (GAN) [35] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, which usually means the re-use of a pre-trained model on a new problem, is currently widespread because it can train deep neural networks with a small amount of data [137]. These deep learning methods can perform well, particularly when learning from large-scale datasets [105, 140]. In our previous article, Sarker et al. [104], we have summarized a brief discussion of the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
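The following is a minimal sketch, assuming scikit-learn and its bundled digits dataset, of a multi-layer perceptron with two hidden layers whose weights are adjusted by backpropagation, the technique noted above; heavier frameworks would be used for CNNs, LSTMs, or the other architectures discussed:

```python
# A hedged illustration: a small multi-layer perceptron (input -> two hidden
# layers -> output) trained by backpropagation, using scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32),  # two hidden layers
                    max_iter=500, random_state=0)
mlp.fit(X_train, y_train)   # weights updated internally via backpropagation
print("test accuracy:", round(mlp.score(X_test, y_test), 3))
```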

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “data science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, healthcare, government policy, and virtually every industry where data is generated. In the following, we discuss ten of the most popular application areas based on data science and analytics.

  • Business or financial data science: In general, business data science can be considered the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making and high-quality actions [90]. Data scientists can develop algorithms or data-driven models that predict customer behavior and identify patterns and trends based on historical business data, which can help companies reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [76], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [105]. In finance, historical data helps financial institutions make high-stakes business decisions; it is mostly used for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, and so on. Overall, data science methodologies can play a key role in the next-generation business and finance industry, particularly in terms of business automation, intelligence, smart decision-making, and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in such revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling thus can be used to maximize production, reduce costs and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves extrapolating actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys, and then analyzed. In reality, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, prevent avoidable diseases, and generally improve the quality of life [81, 119]. The average human lifespan is growing across the global population, presenting new challenges to today’s methods of care delivery. Thus, health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and better monitor the spread of diseases. Eventually, it may lead to new approaches that improve patient care, clinical expertise, diagnosis, and management.
  • IoT data science: The Internet of Things (IoT) [9] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [112]. One of IoT’s main fields of application is the smart city, which uses technology to improve city services and citizens’ living experiences. For example, using the relevant data, data science methods can be used for traffic prediction in smart cities or to estimate citizens’ total energy usage over a particular period. Deep learning-based models in data science can be built on large-scale IoT datasets [7, 104]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, healthcare, and industry, among many others.
  • Cybersecurity data science: Cybersecurity, the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [114, 121]. Data science techniques, particularly machine learning, have become crucial cybersecurity technologies that continually learn to identify trends by analyzing data, in order to better detect malware in encrypted traffic, find insider threats, predict where “bad neighborhoods” are online, keep people safe while browsing, or protect information in the cloud by uncovering suspicious user activity [114]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [103, 106]. To generate security policy rules, association rule learning can play a significant role in building rule-based systems [102]. Deep learning-based security models can perform better when utilizing large-scale security datasets [140]. Thus, data science modeling can enable cybersecurity professionals to be more proactive in preventing threats and reacting in real time to active attacks by extracting actionable insights from security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as PCs, tablets, or smartphones [112]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems are all common sources of behavioral data. Behavioral data is much more than just static data [108]. Advanced analytics of such data, including machine learning modeling, can facilitate several areas: predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences for future releases in online gaming; determining how users use an application in order to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; and detecting compromised credentials and insider threats by locating anomalous behavior, or making suggestions. Overall, behavioral data science modeling typically enables making the right offers to the right consumers at the right time on common platforms such as e-commerce platforms, online games, and web, mobile, and IoT applications. In the social context, analyzing human behavioral data using advanced analytics methods, and the insights extracted from social data, can be used for data-driven intelligent social services, which can be considered social data science.
  • Mobile data science: Today’s smart mobile phones are considered “next-generation, multi-functional cell phones that facilitate data processing as well as enhanced wireless connectivity” [146]. In our earlier paper [112], we have shown that users’ interest in “mobile phones” has in recent years grown well beyond that in other platforms such as the “desktop computer”, “laptop computer”, or “tablet computer”. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, LinkedIn, and Twitter, and various IoT services such as smart city, health, and transportation services, among many others. Intelligent apps are based on insights extracted from the relevant datasets, depending on app characteristics such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and cross-platform [112]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.
  • Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as images, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered valuable sources of multimedia big data [89]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image and video processing, computer vision, audio and speech processing, and database management are among the available solutions for a range of applications, including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world’s population lives in urban areas or cities [80], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [96, 116]. In addition to cities, “urban area” can refer to surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is recorded; such data is loosely categorized into personal data (e.g., household, education, employment, health, immigration, crime), proprietary data (e.g., banking, retail, online platform data), government data (e.g., citywide crime statistics or data from government institutions), open and public data (e.g., data.gov, Ordnance Survey), and organic and crowdsourced data (e.g., user-generated web data, social media, Wikipedia) [29]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective by extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [105] can facilitate the efficient management of urban areas, including real-time management (e.g., traffic flow management), evidence-based planning decisions pertaining to the longer-term strategic role of forecasting for urban planning (e.g., crime prevention, public safety, and security), and framing the future (e.g., political decision-making) [29]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities, within a data-rich urban environment through data-driven smart decision-making and policies, leading to smart cities and an improved quality of human life.
  • Smart villages or rural data science: Rural areas, or the countryside, are the opposite of urban areas and include villages, hamlets, and agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, supporting agriculture, and fostering economic development, from a data-driven perspective, by extracting knowledge and actionable insights from collected rural data. Advanced analytics of rural data, including machine learning modeling [105], can provide new opportunities for rural communities to build insights and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [105] can help farmers improve their decisions to adopt sustainable agriculture by utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT) and mobile technologies and devices [1, 51, 52]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, and other services, leading to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector of real-world life where the relevant data is available to analyze. Gathering the right data and extracting useful knowledge or actionable insights from that data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion of the above ten potential real-world application domains, taking into account data-driven smart computing and decision-making, we can say that the prospects of data science and the role of data scientists are huge for the future world. Data scientists typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical models, algorithms, data-driven tools, and solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study of data science and analytics, particularly data science modeling in “Understanding data science modeling”, advanced analytics methods and smart computing in “Advanced analytics methods and smart computing”, and real-world application areas in “Real-world application domains”, opens several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced, as well as the potential research opportunities and future directions for building data-driven products.

  • Understanding the real-world business problems and associated data, including their nature (e.g., form, type, size, labels), is the first challenge in data science modeling, discussed briefly in “Understanding data science modeling”. This means identifying, specifying, representing, and quantifying the domain-specific business problems and data according to the requirements. For a data-driven, effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured or unstructured data, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, which is typically the process of categorizing, tagging, or labeling raw data for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is extracting relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [101]. The advanced analytics methods, including machine and deep learning modeling, discussed in “Advanced analytics methods and smart computing”, depend heavily on the quality and availability of the data. Thus, it is important to understand the real-world business scenario and the associated data, including whether, how, and why they are insufficient, missing, or problematic, and then to extend or redevelop existing methods, such as large-scale hypothesis testing and learning under inconsistency and uncertainty, to address the complexities in the data and the business problems. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.
  • Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of the data, while advanced analytics is a step forward, offering a deeper understanding of the data and supporting granular data analysis. Thus, understanding advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “Advanced analytics methods and smart computing” may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [4] may produce redundant rules from the data, which makes the decision-making process complex and ineffective [113]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how robust or fragile the techniques are to input data is needed. Therefore, a deeper understanding of the strengths and drawbacks of existing machine and deep learning methods [38, 105] for solving a particular business problem is needed; consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms or techniques with higher accuracy, becomes a significant challenge for the next generation of data scientists.
  • Traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, recent trends are more likely to be interesting and useful for modeling and predicting the future than older ones, for example, in smartphone user behavior modeling, IoT services, stock market forecasting, health or transport services, job market analysis, and other related areas where time series and actual human interests or preferences evolve over time. Thus, rather than relying on traditional data analysis, the concept of RecencyMiner, i.e., recent pattern-based extracted insight or knowledge, proposed in our earlier paper Sarker et al. [108], might be effective. Therefore, proposing new techniques that take into account recent data patterns, and consequently building recency-based data-driven models for solving real-world problems, is another significant challenge in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports data science modeling, as discussed in “Understanding data science modeling”. Advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal context, spatial context, social context, and environmental context [100] can support building an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, along with experimental evaluation, is a very important direction for effectively solving a business problem in a particular domain, as well as a big challenge for data scientists.
  • In several important application areas, such as autonomous cars, criminal justice, healthcare, recruitment, housing, human resource management, and public safety, decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [104]. If we can explain the results in a meaningful way, the model can be better trusted by the end user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

Above, we have summarized and discussed several challenges and potential research opportunities and directions within the scope of our study in the area of data science and advanced analytics. Data scientists in academia and industry, and researchers in the relevant areas, have the opportunity to contribute to each issue identified above and to build effective data-driven models or systems that enable smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view of data science, including the various types of advanced analytical methods that can be applied to enhance the intelligence and capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from related terms used in the area, in order to position this paper. We have presented a thorough study of data science modeling, with the various processing modules that are needed to extract actionable insights from data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. To this end, we have also summarized the various types of advanced analytical methods and outcomes, as well as the machine learning modeling, that are needed to solve the associated business problems. The key contribution of this study is thus the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, and urban and rural data science, taking into account data-driven smart computing and decision-making.

Finally, within the scope of our study, we have outlined and discussed the challenges we face, as well as possible research opportunities and future directions. The challenges identified provide promising research opportunities in the field, which can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods points in a positive direction and can be used as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Everything you need to know about Data Science


Checked: Soha K., Eddy L.

Latest update: 18 Jan 2024

Table of contents

  • What is Data Science?
  • Are Data Science and Business Analytics the Same?
  • Why Use Data Science?
  • The Data Science Process
    1) Predictive causal analysis
    2) Prescriptive analysis
    3) Machine learning to make predictions
  • The Main Phases of the Data Science Process
    1) Knowledge and analysis of the problem
    2) Data preparation
    3) Model planning
    4) The realization of the model
    5) Communicating the results
  • Conclusions

We commonly talk about Data Science because today data is a competitive advantage for companies, but what exactly does it mean? We will try to explore this theme in this essential guide.

What is Data Science?

Data Science is the study concerned with the retrieval and analysis of data sets, with the aim of identifying information and correspondences hidden in unprocessed, or raw, data. Data Science, in other words, is the science that combines programming skills with mathematical and statistical knowledge to extract meaningful information from data.

Data Science consists of the application of machine learning algorithms to numerical and textual data, as well as images, video, and audio content. The algorithms perform specific tasks concerning the extraction, cleaning, and processing of data, generating, in turn, data that is transformed into real value for the organization.

Are Data Science and Business Analytics the Same?

Often the terms Data Science and Business Analytics are considered synonymous. After all, both Business Analytics and Data Science deal with data: its acquisition, the development of models, and information processing.

What, then, is the difference between Data Science and Business Analytics? As the name suggests, Business Analytics is focused on processing business or sectorial data to extract information useful to the company, centered on its own market and that of its competitors.

Data Science instead responds to questions about the influence of customer behavior on the company's business results. It combines the potential of data with the creation of algorithms and the use of technology to answer a series of questions. Recently, the functions of machine learning and artificial intelligence have evolved and will bring data science to levels that are still difficult to imagine. Business Analytics, on the other hand, continues to be a form of business data analysis that uses statistical concepts to obtain solutions and in-depth analyses by relating past data to the present.

Data Science aims to identify the most significant datasets for answering the questions companies ask, and to process them to extract new data related to the behaviors, needs, and trends that underpin managers' data-driven decisions.

The data thus identified can help an organization contain costs, increase efficiency, recognize new market opportunities and increase competitive advantage.

Why Use Data Science?

Can data produce other useful data? Of course! Data Science was created to understand data and its relationships and to analyze it, but above all to extract value and to ensure that, when properly interrogated and correlated, the data generates information useful not only for understanding phenomena but above all for orienting them.

Data Science is indispensable for companies dealing with digital transformation because it allows them to orient their products or services towards the customer and their purchasing behavior, and to respond to their needs. Leading companies in the global market, such as Netflix, Amazon, and Spotify, use applications developed by Data Scientists. Thanks to artificial intelligence, these applications create recommendation engines that suggest what to buy, what to listen to, and which films to watch based on the tastes of the individual user. Thanks to machine learning, these algorithms can also evaluate which suggestions did not attract the user's interest, refining the proposals more and more, and thus increasing conversions and optimizing ROI.

Data Science is mainly used to provide forecasts and trends. It is also used to support decisions through tools for predictive analysis, prescriptive analysis, and machine learning.

If the goal of the analysis is to predict whether a certain event will occur in the future, predictive causal analysis is the right tool. Suppose a bank that provides loans wants to estimate the likelihood that customers will repay in the future. In this case, Data Science uses a model that performs predictive analysis on each customer's payment history to predict whether future payments will be made on time.
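
As a sketch of what such a predictive model could look like, the following snippet trains a logistic regression on synthetic payment-history features; the feature names and the labeling rule are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic payment-history features: [late payments, card utilization, years as customer].
rng = np.random.default_rng(0)
X = rng.random((500, 3)) * [10, 1.0, 20]
# Invented rule for the toy labels: more late payments and higher utilization -> default.
y = (X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, 500) > 8).astype(int)

model = LogisticRegression().fit(X, y)

new_customer = [[2, 0.3, 5]]               # 2 late payments, 30% utilization, 5 years
print(model.predict_proba(new_customer))   # [P(repays), P(defaults)]
```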

If, on the other hand, you want to create a model that applies AI to make decisions autonomously and continuously updates itself through dynamic self-learning, you need a prescriptive analysis model. This relatively recent area of Data Science goes beyond advice: it can directly take the consequent action.

In other words, this kind of model does not only predict; it suggests or applies a series of prescribed actions. The best example is the self-driving car: the data collected by vehicles is used to optimize the software that drives the car without human intervention. The model can make decisions independently, establishing when to turn, which route to take, and when to slow down or brake sharply.

If, for example, you have transactional data from a credit card company and need to build a model that determines the future trend, you should use machine learning algorithms trained through supervised learning. It is called supervised because the data on which the algorithm is trained, complete with known outcomes, is already available. The continuous optimization of voice recognition in the Alexa or Google voice assistants is another example.
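
A minimal supervised-learning sketch of the trend idea, assuming toy monthly spend figures as the labeled history:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy monthly card spend for one customer (invented numbers).
months = np.arange(1, 13).reshape(-1, 1)
spend = np.array([210, 220, 235, 228, 250, 262,
                  270, 268, 290, 301, 315, 330], dtype=float)

trend = LinearRegression().fit(months, spend)   # supervised: labels are the past spend
print(trend.predict([[13], [14]]))              # extrapolated trend for the next months
```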

The concrete application of Data Science involves a series of sequential phases, now codified in a sort of process.

Before starting an analysis project, it is essential to understand the objectives, the context, the priorities, and the available budget. In this phase the Data Scientist must identify the needs of whoever commissioned the analysis, the questions the project must answer, the data sets already available, and those still to be found to make the analysis more effective. Finally, initial hypotheses must be formulated, within a research framework that remains open to the answers generated by relating the data, whose combinations can hold surprises.

In this phase, data coming from various, generally inhomogeneous sources is extracted and cleaned so that it can be analyzed. An analytical sandbox is needed in which analyses can be run for the entire duration of the project. Scripts, often written in R, are used to clean, transform, and display the data; this helps identify outliers and establish relationships between variables. Once the data has been cleaned and prepared, the analysis can proceed by loading it into a data warehouse.
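
Although the text mentions R, the same cleaning steps can be sketched in Python with pandas; the column names and toy records below are invented:

```python
import pandas as pd

# Toy raw extract from two inhomogeneous sources (column names invented).
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "amount": ["100.5", "200", "200", "180", None, "9999"],
    "country": ["IT", "it ", "it ", "FR", "IT", "DE"],
})

df = (raw.drop_duplicates()                       # remove duplicate records
         .assign(amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
                 country=lambda d: d["country"].str.strip().str.upper())
         .dropna(subset=["amount"]))              # drop rows that cannot be repaired

# Flag outliers with the interquartile-range rule.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = ~df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df)
```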

Next, we determine the methods and techniques for identifying relationships between the variables. These relationships will form the basis of the algorithms implemented later. In this phase R is commonly used, since it offers a complete set of modeling features and a good environment for building interpretative models. SQL analysis services that perform processing with data mining functions and basic predictive models are also useful. Although many tools are on the market, R is among the most used programming languages for these activities.
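
A quick way to inspect relationships between variables during model planning is a correlation matrix; a minimal sketch with invented variables:

```python
import pandas as pd

# Toy prepared dataset (variables invented for illustration).
df = pd.DataFrame({
    "visits":  [10, 12, 9, 20, 18, 25, 30, 28],
    "emails":  [1, 2, 1, 4, 3, 5, 6, 5],
    "revenue": [100, 130, 90, 210, 180, 260, 330, 300],
})

# Pairwise correlations suggest which variables deserve a place in the model.
print(df.corr(numeric_only=True).round(2))
```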

After investigating the nature of the available data and designing the algorithms, it is time to apply the model. It is tested with data sets specifically set aside for training and validating the algorithm. We evaluate whether the existing tools are sufficient to execute the models or whether a more structured elaboration is needed; the model is then optimized and the processing launched.
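
A sketch of this testing step, using a synthetic stand-in dataset and a held-out test set to check the model before launch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's prepared dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Hold out data the model never saw, to test generalization before launch.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```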

This is the moment when Data Science must make the relationships identified in the data, and the answers to the questions posed by the project, understandable. In this phase the objective of the analysis is reached. One or more reports must therefore be prepared for the managers of the various business functions, making the findings of the data science process easy to grasp by adopting graphic elements such as infographics and charts. The text should be understandable even to those without much experience with data, simplifying its interpretation. This is also useful for those involved in product design and marketing management, as well as for top managers, who can then make data-driven decisions.
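
A minimal example of turning results into a chart a manager can read at a glance (the headline figures are invented):

```python
import matplotlib.pyplot as plt

# Invented headline figures for a management report.
segments = ["New", "Returning", "Churn-risk"]
revenue = [120, 310, 80]

fig, ax = plt.subplots()
ax.bar(segments, revenue)
ax.set_ylabel("Revenue (k$)")
ax.set_title("Revenue by customer segment")   # one message per chart
for i, v in enumerate(revenue):
    ax.annotate(f"{v}", (i, v), ha="center", va="bottom")  # label bars for non-experts
plt.savefig("revenue_by_segment.png", dpi=150)
```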

Data Science is revolutionizing many sectors. At its heart is knowing your customers and analyzing their behavior, identifying relationships in the data that can turn into predictive insights about market trends and orientations. Today we are at an early stage that already delivers results, but the growth of the IoT, sensors, and other data collection tools will make possible developments that today we can only imagine.


Data science: a game changer for science and innovation

Regular Paper · Open access · Published: 19 April 2021 · Volume 11, pages 263–278 (2021)


Valerio Grossi, Fosca Giannotti, Dino Pedreschi, Paolo Manghi, Pasquale Pagano, and Massimiliano Assante


Abstract

This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We discuss how data science will impact science and society at large in the coming years, including the ethical problems raised by managing human behavior data, and we consider the quantitative expectations of data science's economic impact. We introduce concepts such as open science and e-infrastructure as useful tools for supporting ethical data science and for training new generations of data scientists. Finally, this work outlines the SoBigData Research Infrastructure as an easy-to-access platform for executing complex data science processes. The services proposed by SoBigData are aimed at using data science to understand the complexity of our contemporary, globally interconnected society.


1 Introduction: from data to knowledge

Data science is an interdisciplinary and pervasive paradigm where different theories and models are combined to transform data into knowledge (and value). Experiments and analyses over massive datasets are functional not only to the validation of existing theories and models but also to the data-driven discovery of patterns emerging from data, which can help scientists design better theories and models, yielding a deeper understanding of the complexity of social, economic, biological, technological, cultural, and natural phenomena. The products of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. All these aspects are producing a change in the scientific method, in research, and in the way our society makes decisions [2].

Data science emerges from three concurring facts: (i) the advent of big data, which provides the critical mass of actual examples to learn from; (ii) the advances in data analysis and learning techniques that can produce predictive models and behavioral patterns from big data; and (iii) the advances in high-performance computing infrastructures that make it possible to ingest and manage big data and perform complex analyses [16].

Paper organization. Section 2 discusses how data science will impact science and society at large in the coming years. Section 3 outlines the main ethical problems in studying human behaviors that data science introduces. In Sect. 4, we show how concepts such as open science and e-infrastructure are effective tools for supporting and disseminating ethical uses of data and for training new generations of data scientists; we illustrate the importance of an open data science with examples provided later in the paper. Finally, we show some use cases of data science through thematic environments that bind datasets to social mining methods.

2 Data science for society, science, industry and business

Figure 1. Data science as an ecosystem: on the left, the main components enabling data science (data, analytical methods, and infrastructures); on the right, the impact of data science on society, science, and business. All data science activities should be carried out under strict ethical principles.

The quality of business decision making, government administration, and scientific research can potentially be improved by analyzing data. Data science offers important insights into many complicated issues, in many instances, with remarkable accuracy and timeliness.

Figure 2. The data science pipeline: raw data are transformed into data used for analytics; analytical methods then transform these data into knowledge, providing results and evaluation measures.

As shown in Fig. 1, data science is an ecosystem where the following scientific, technological, and socioeconomic factors interact:

  • Data: availability of data and access to data sources;
  • Analytics & computing infrastructures: availability of high-performance analytical processing and open-source analytics;
  • Skills: availability of highly and rightly skilled data scientists and engineers;
  • Ethical & legal aspects: availability of regulatory environments for data ownership and usage, data protection and privacy, security, liability, cybercrime, and intellectual property rights;
  • Applications: business- and market-ready applications;
  • Social aspects: focus on major societal global challenges.

Data science, envisioned as the intersection of data mining, big data analytics, artificial intelligence, statistical modeling, and complex systems, is capable of transparently monitoring data quality and the results of analytical processes. If we want data science to face the global challenges and become a determining factor of sustainable development, it is necessary to push toward an open global ecosystem for scientific, industrial, and societal innovation [48]. We need to build an ecosystem of socioeconomic activities in which each new idea, product, and service creates opportunities for further ideas, products, and services. An open data strategy, innovation, interoperability, and suitable intellectual property rights can catalyze such an ecosystem and boost economic growth and sustainable development. This strategy also requires "networked thinking" and a participatory, inclusive approach.

Data are relevant in almost all scientific disciplines, and a data-dominated science could lead to the solution of problems currently considered hard or impossible to tackle. It is impossible to cover all the scientific sectors where a data-driven revolution is ongoing; here we provide just a few examples.

The Sloan Digital Sky Survey has become a central resource for astronomers all over the world. Astronomy is being transformed from a discipline where taking pictures of the sky was a large part of an astronomer's job to one where the images are already in a database, and the astronomer's task is to find interesting objects and phenomena in that database. In the biological sciences, data are stored in public repositories, and an entire discipline, bioinformatics, is devoted to their analysis. Data-centric approaches based on personal behaviors can also support medical applications, analyzing data both at the level of human behavior and at lower, molecular levels: for example, integrating genomic data on drug reactions with users' habits enables a computational drug science for high-precision personalized medicine. In humans, as in other organisms, most cellular components exert their functions through interactions with other cellular components. The totality of these interactions (the human "interactome") is a network with hundreds of thousands of nodes and a much larger number of links. A disease is rarely a consequence of an abnormality in a single gene; rather, the disease phenotype reflects various pathological processes that interact in a complex network. Network-based approaches can have multiple biological and clinical applications, especially in revealing the mechanisms behind complex diseases [6].

Now we illustrate the typical data science pipeline [50]. People, machines, systems, factories, organizations, communities, and societies produce data. Data are collected in every aspect of our life: when we submit a tax declaration; when a customer orders an item online; when a social media user posts a comment; when an X-ray machine takes a picture; when a traveler reviews a restaurant; when a sensor in a supply chain sends an alert; or when a scientist conducts an experiment. This huge and heterogeneous quantity of data needs to be extracted, loaded, understood, transformed, and, in many cases, anonymized before it can be used for analysis. Analysis results include routines, automated decisions, predictions, and recommendations, as well as outcomes that need to be interpreted to produce actions and feedback. Furthermore, this scenario must also consider the ethical problems of managing social data. Figure 2 depicts the data science pipeline. Ethical aspects are important in the application of data science in several sectors and are addressed in Sect. 3.
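
The pipeline concept can be made concrete with a small sketch: scikit-learn's Pipeline chains repair, transformation, and analysis steps in the order just described. The dataset and the specific steps are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Each stage mirrors a step of the pipeline in Fig. 2: repair, transform, analyze.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fix missing raw values
    ("scale", StandardScaler()),                    # transform to comparable units
    ("model", LogisticRegression()),                # analytical method
])
pipe.fit(X, y)
print(pipe.score(X, y))
```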

2.1 Impact on society

Data science is an opportunity for improving our society and boosting social progress. It can support policymaking; it offers novel ways to produce high-quality and high-precision statistical information; and it empowers citizens with self-awareness tools. Furthermore, it can help to promote ethical uses of big data.

Modern cities are perfect environments densely traversed by large data flows. Using traffic monitoring systems, environmental sensors, GPS individual traces, and social information, we can manage cities as collectively shared resources that must be optimized, continuously monitored, and promptly adjusted when needed. The potential of data science is easy to grasp from terms such as urban planning, public transportation, reduction of energy consumption, ecological sustainability, safety, and management of mass events. These represent only the front line of topics that can benefit from the awareness that big data can give to city stakeholders [22, 27, 29]. Several methods for human mobility analysis and prediction are available in the literature. MyWay [47] exploits individual systematic behaviors to predict future human movements by combining individually and collectively learned models. Carpooling [22] builds, from the mobility data of travelers in a given territory, a network of potential carpooling users, exploiting topological properties to highlight sub-populations with higher chances of forming a carpooling community and the propensity of users to be either drivers or passengers in a shared car. Event attendance prediction [13] analyzes users' call habits and classifies people into behavioral categories, dividing them among residents, commuters, and visitors, making it possible to observe the variety of behaviors of city users and the attendance of big events in cities.

Electric mobility is expected to gain importance worldwide. The impact of a complete switch to electric mobility is still under investigation, and what appears critical is the intensity of flows due to charging (and fast recharging) systems, which may challenge the stability of the power network. To avoid instabilities in the charging infrastructure, an accurate prediction of the power flows associated with mobility is needed. Personal mobility data can be used to estimate mobility flows and to simulate the impact of different charging behavioral patterns, predicting power flows and optimizing the placement of charging infrastructure [25, 49]. Lorini et al. [26] present an example of urban flood prediction that integrates data provided by the CEM system with Twitter data; the Twitter data are processed using massive multilingual approaches for classification. The model is supervised and requires careful data collection and validation of the ground truth about confirmed floods from multiple sources.

Another example of data science for society can be found in the development of applications with functions aimed directly at the individual. In this context, concepts such as personal data stores and personal data analytics aim to implement a new deal on personal data, providing a user-centric view where data are collected, integrated, and analyzed at the individual level, giving users better awareness of their own behavioral, health, and consumer profiles. Within this user-centric perspective there is room for an even broader market of business applications, such as high-precision real-time targeted marketing, self-organizing decision making that preserves desired global properties, and sustainability of the transportation or healthcare systems. Such contexts emphasize two essential aspects of data science: the need for creativity in exploiting and combining the several data sources in novel ways, and the need to give awareness and control of personal data to the users who generate them, to sustain a transparent, trust-based, crowd-sourced data ecosystem [19].

The impact of online social networks on our society has changed the mechanisms behind information spreading and news production. The transformation of media ecosystems and news consumption has consequences in several fields. A relevant example is the impact of misinformation on society, as in the Brexit referendum, where the massive diffusion of fake news has been considered one of the most relevant factors in the outcome of that political event. Achievements in this area include results on the influence of external news media on polarization in online social networks, which indicate that users are highly polarized towards news sources: they cite (and tend to cite) sources that they identify as ideologically similar to themselves. Other results concern echo chambers and the role of social media users: there is a strong correlation between the orientation of the content produced and that of the content consumed. In other words, an opinion "echoes" back to the user when others share it in the "chamber" (i.e., the social network around the user) [36]. Also worth mentioning are efforts devoted to uncovering spam and bot activity in stock microblogs on Twitter: taking inspiration from biological DNA, the idea is to model online users' behavior through strings of characters representing sequences of their online actions. Using this approach, [11, 12] report that 71% of suspicious users were classified as bots; furthermore, 37% of them were suspended by Twitter a few months after the investigation. Several other approaches can be found in the literature, but they generally display some limitations: some address only certain features of misinformation diffusion (bot detection, segregation of users by opinion, or other social analyses), and comprehensive frameworks for interpreting results are lacking. While the former limitation is understandable given how young the research field is, the latter points to a more fundamental need: without strict statistical validation, it is hard to state which crucial elements permit a well-grounded description of a system. To counter the diffusion of fake news, building a comprehensive fake news dataset providing all the information about publishers, shared contents, and user engagement over space and time, together with profile histories, can support the development of innovative and effective learning models. Unsupervised and supervised methods will work together to identify misleading information. Multidisciplinary teams of journalists, linguists, behavioral scientists, and similar experts will be needed to identify what amounts to information warfare campaigns. Cyberwarfare and information warfare will be two of the biggest threats the world will face in the 21st century.
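
The DNA-inspired idea can be illustrated with a toy sketch: encode each user's actions as a string and flag accounts whose behavioral sequences are near-identical. The action alphabet, the user names, and the threshold below are invented; the cited papers use far richer models.

```python
from difflib import SequenceMatcher

# Each character encodes one online action: T=tweet, R=retweet, L=like, F=follow.
users = {
    "alice": "TLRTLFTLRT",
    "bot_1": "RRRRRRRRRR",
    "bot_2": "RRRRRRRRRL",
    "carol": "TFLTRLTFLL",
}

def similarity(a, b):
    """Ratio of matching subsequences between two behavioral 'DNA' strings."""
    return SequenceMatcher(None, a, b).ratio()

# Accounts whose action sequences are near-identical are bot candidates.
names = list(users)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        s = similarity(users[u], users[v])
        if s > 0.8:
            print(f"{u} ~ {v}: {s:.2f}")
```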

Social sensing methods collect data produced by digital citizens through either opportunistic or participatory crowd-sensing, depending on users' awareness of their involvement. These approaches present a variety of technological and ethical challenges. An example is Twitter Monitor [10], a crowd-sensing tool designed to access Twitter streams through the Twitter Streaming API. It allows launching parallel listening processes that collect different sets of data, and it supports listening campaigns around relevant events such as political elections, natural and human-made disasters, and popular national events [11]. A campaign can be configured by specifying keywords, accounts, and geographical areas of interest.

Nowcasting financial and economic indicators focuses on the potential of data science as a proxy for well-being and socioeconomic applications. Innovative research methods have demonstrated that poverty indicators can be approximated by social and behavioral mobility metrics extracted from mobile phone data and GPS data [34], and that Gross Domestic Product can be accurately nowcasted using retail supermarket data [18]. Furthermore, nowcasting demographic aspects of a territory based on Twitter data [1] can support official statistics through the estimation of location, occupation, and semantics. Networks are a convenient way to represent the complex interactions among the elements of a large system. In economics, networks are gaining increasing attention because the underlying topology of a networked system affects aggregate output, the propagation of shocks, and financial distress, and because the topology allows us to learn something about a node by looking at the properties of its neighbors. Among the most investigated financial and economic networks, we cite work analyzing interbank systems, payment networks between firms, banks-firms bipartite networks, and trading networks between investors [37]. Another interesting phenomenon is the advent of blockchain technology, which led to the Bitcoin cryptocurrency [31].

Data science is an excellent opportunity for policy, data journalism, and marketing. The online media arena is now available as a real-time experimental society for understanding social mechanisms such as harassment, discrimination, hate, and fake news. In our vision, data science approaches are necessary for better governance. These new approaches complement and change Official Statistics, providing a cheaper and more timely way of computing them. The impact of data-science-driven applications can be particularly significant when the applications help to build new infrastructures or new services for the population.

The availability of massive data portraying soccer performance has facilitated recent advances in soccer analytics. Rossi et al. [42] proposed an innovative machine learning approach to forecasting non-contact injuries for professional soccer players. In [3], we find the definition of quantitative measures of pressing in defensive phases in soccer. Pappalardo et al. [33] outlined an automatic, data-driven evaluation of performance in soccer and a ranking system for soccer teams. Sports data science is attracting much interest and is now leading to the release of large public datasets of sports events.

Finally, data science has unveiled a shift from population statistics to statistics of interlinked entities connected by mutual interactions. This change of perspective reveals universal patterns underlying complex social, economic, technological, and biological systems. It helps us understand the dynamics of how opinions, epidemics, or innovations spread in our society, as well as the mechanisms behind complex systemic diseases, such as cancer and metabolic disorders, revealing hidden relationships between them. For diffusive models and dynamic networks, NDlib [40] is a Python package for the description, simulation, and observation of diffusion processes in complex networks. It collects diffusive models from epidemics and opinion dynamics and allows a scientist to compare simulations over synthetic systems. For community discovery, two tools are available for studying the structure of a community and understanding its habits: Demon [9] extracts ego networks (i.e., the set of nodes connected to an ego node) and identifies real communities by adopting a democratic, bottom-up merging of such structures; Tiles [41] is dedicated to dynamic network data and extracts overlapping communities, tracking their evolution in time with an online iterative procedure.
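
As an example of NDlib in use, the following runs a classic SIR diffusion model over a synthetic network; the parameter values are arbitrary, and the API shown is the one documented for recent NDlib releases.

```python
import networkx as nx
import ndlib.models.ModelConfig as mc
import ndlib.models.epidemics as ep

g = nx.erdos_renyi_graph(1000, 0.01)            # synthetic contact network

model = ep.SIRModel(g)                          # susceptible-infected-removed dynamics
cfg = mc.Configuration()
cfg.add_model_parameter("beta", 0.01)           # infection probability
cfg.add_model_parameter("gamma", 0.005)         # removal probability
cfg.add_model_parameter("fraction_infected", 0.05)
model.set_initial_status(cfg)

iterations = model.iteration_bunch(200)         # simulate 200 steps
print(iterations[-1]["node_count"])             # final S/I/R counts
```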

2.2 Impact on industry and business

Data science can create an ecosystem of novel data-driven business opportunities. As a general trend across all sectors, massive quantities of data will be made accessible to everybody, allowing entrepreneurs to recognize and rank shortcomings in business processes and to spot potential threats and win-win situations. Ideally, every citizen could derive new business ideas from these patterns. Co-creation enables data scientists to design innovative products and services. By sharing data of various natures and provenance, the value of joined datasets becomes much larger than the sum of the values of the separate datasets.

Gains from data science are expected across all sectors, from industry and production to services and retail. Here we cite several macro-areas where data science applications are especially promising. In energy and environment, the digitization of energy systems (from production to distribution) enables the acquisition of real-time, high-resolution data. Coupled with other sources, such as weather data, usage patterns, and market data, and with advanced analytics, efficiency levels can be increased immensely. The positive environmental impact is also enhanced by geospatial data, which help us understand how our planet and its climate are changing and confront major issues such as global warming, the preservation of species, and the role and effects of human activities.

The manufacturing and production sector, with its growing investments in Industry 4.0 and smart factories whose sensor-equipped machinery is both intelligent and networked (see the internet of things and cyber-physical systems), will be one of the major producers of data in the world. Applying data science in this sector will bring efficiency gains and predictive maintenance. Entirely new business models are expected, since the mass production of individualized products becomes possible and consumers may have direct access to influence and control.

As already stated in Sect. 2.1, data science will contribute to increasing efficiency in public administration processes and healthcare. Security will be enhanced in both the physical and the cyber domain. From financial fraud to public security, data science will contribute to a framework that enables a safe and secure digital economy. Big data exploitation will open up opportunities for innovative, self-organizing ways of managing logistical business processes. Deliveries could be based on predictive monitoring, using data from stores, semantic product memories, internet forums, and weather forecasts, leading to both economic and environmental savings. Consider also the impact of personalized services that create real experiences for tourists: the analysis of real-time, context-aware data (with the help of historical and cultural heritage data) will provide customized information to each tourist and contribute to better and more efficient management of the whole tourism value chain.

3 Data science ethics

Data science creates great opportunities but also new risks. The use of advanced tools for data analysis could expose sensitive knowledge about individual persons and could invade their privacy. Data science approaches require access to digital records of personal activities that contain potentially sensitive information. Personal information can be used to discriminate against people based on their presumed characteristics. Data-driven algorithms yield classification and prediction models of behavioral traits of individuals, such as credit score, insurance risk, health status, personal preferences, and religious, ethnic, or political orientation, based on personal data disseminated in the digital environment by users (with, or often without, their awareness). The achievements of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. For example, mobile phone call records are initially collected by telecom operators for billing and operational aims, but they can be used for accurate and timely demography and human mobility analysis at a country or regional scale. This re-purposing of data clearly shows the importance of legal compliance and of data ethics technologies and safeguards: to protect privacy and anonymity, to secure data, to engage users, to avoid discrimination and misuse, and to account for transparency, so as to seize the opportunities of data science while controlling the associated risks.

Several aspects should be considered to avoid harming individual privacy. Ethical elements should include: (i) monitoring the compliance of experiments, research protocols, and applications with ethical and juridical standards; (ii) developing big data analytics and social mining tools with value-sensitive design and privacy-by-design methodologies; and (iii) boosting the excellence and international competitiveness of Europe's big data research in the safe and fair use of big data. It is essential to highlight that data scientists who use personal and social data, including through infrastructures, have the responsibility to become acquainted with the fundamental ethical aspects of acting as a "data controller." This should inform the design of courses that train data scientists in the responsibilities, the possibilities, and the boundaries they have in data manipulation.

Recalling Fig. 2, it is crucial to inject into the data science pipeline the ethical values of fairness (how to avoid unfair and discriminatory decisions), accuracy (how to provide reliable information), confidentiality (how to protect the privacy of the people involved), and transparency (how to make models and decisions comprehensible to all stakeholders). This value-sensitive design must aim at boosting widespread social acceptance of data science without inhibiting its power. Finally, it is essential to consider the impact of the General Data Protection Regulation (GDPR) on (i) companies' duties, and how European companies should comply with the limits on data manipulation the Regulation requires; and (ii) researchers' duties, highlighting the articles and recitals that specifically explain how research is treated in the GDPR's legal system.

Figure 3. The relationship between big and open data and how they relate to the broad concept of open government.

We complete this section with another important aspect, open data: accessible public data that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems. All definitions of open data include two features: (i) the data must be publicly available for anyone to use, and (ii) the data must be licensed in a way that allows reuse. All over the world, government agencies and public organizations are launching initiatives to make data open; listing them all is impossible, but one UN initiative deserves mention: Global Pulse, meant to implement the vision of a future in which big data is harnessed safely and responsibly as a public good.

Figure 3 shows the relationships between open data and big data. Currently, the problem is not only that government agencies (and some businesses) are collecting personal data about us, but also that we do not know what data are being collected and we do not have access to the information about ourselves. As reported by the World Economic Forum in 2013, it is crucial to understand the value of personal data so that users can make informed decisions. A new branch of philosophy and ethics is emerging to handle issues related to personal data. On the one hand, in all cases where data might be used for social good (medical research, improvement of public transport, contrasting epidemics), understanding the value of personal data means correctly evaluating the balance between public benefits and personal loss of protection. On the other hand, when data are to be used for commercial purposes, that value might instead translate into a simple price for the personal information that a user might sell to a company. In this context, discrimination discovery consists of searching for a-priori unknown contexts of suspect discrimination against protected-by-law social groups by analyzing datasets of historical decision records. Machine learning and data mining approaches may absorb discriminatory rules, and these rules may be deeply hidden within obscure artificial intelligence models; discrimination discovery thus consists of understanding whether a predictive model makes direct or indirect discrimination. DCube [43] is a tool for data-driven discrimination discovery, a library of methods for fairness analysis.
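
A toy illustration of indirect discrimination discovery is the disparate impact ratio computed over historical decision records; the protected attribute and outcomes below are invented, and DCube's actual methods are far richer.

```python
import pandas as pd

# Toy historical decision records (protected attribute and outcomes invented).
df = pd.DataFrame({
    "group":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "granted": [1, 1, 1, 0, 1, 0, 0, 0],
})

rates = df.groupby("group")["granted"].mean()
disparate_impact = rates["B"] / rates["A"]       # ratio of positive-outcome rates
print(rates.to_dict(), f"DI = {disparate_impact:.2f}")
# A common rule of thumb flags DI below 0.8 as suspect indirect discrimination.
```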

It is important to evaluate how a mining model or algorithm takes its decisions. The growing field of explainable machine learning provides and continuously expands a set of comprehensive toolkits [21]. For example, X-Lib is a library containing state-of-the-art explanation methods organized within a hierarchical structure and wrapped in a uniform way so that different users can easily access and use them. The library supports explaining classification on tabular data and images and explaining the logic of complex decision systems. X-Lib collects, among others, the following explanation methods: LIME [38], Anchor [39], and DeepExplain, which includes saliency maps [44], Gradient*Input, Integrated Gradients, and DeepLIFT [46]. The saliency library contains code for SmoothGrad [45], as well as implementations of several other saliency techniques: Vanilla Gradients, Guided Backpropagation, and Grad-CAM. Another improvement in this context is the use of robotics and AI in data preparation and curation, and in detecting bias in data, information, and knowledge, as well as in the misuse and abuse of these assets with respect to legal, privacy, and ethical issues, transparency, and trust. We cannot rely on human beings alone to do these tasks; we need to exploit the power of robotics and AI to help provide the required protections. Data and information lawyers will play a key role in legal and privacy issues, in the ethical use of these assets, and in the problem of bias in both algorithms and the data, information, and knowledge used to develop analytics solutions. Finally, data science can help to fill the gap between legislators and technology.
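
As an example of one such method, LIME's tabular explainer can be applied to a black-box classifier. The sketch below uses a stock dataset and LIME's documented API directly, independent of X-Lib's own wrappers.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    discretize_continuous=True,
)
# Explain one prediction of the otherwise opaque model.
exp = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=3)
print(exp.as_list())   # local feature weights behind this single decision
```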

4 Big data ecosystem: the role of research infrastructures

Research infrastructures (RIs) play a crucial role in the advent and development of data science. A social mining experiment exploits the main components of data science depicted in Fig. 1 (data, infrastructures, analytical methods) to enable multidisciplinary scientists and innovators to extract knowledge and to make the experiment reusable by the scientific community and by innovators, producing an impact on science and society.

Resources such as data and methods help domain and data scientists transform a research or innovation question into a responsible, data-driven analytical process. This process is executed on the platform, supporting experiments that yield scientific output, policy recommendations, or innovative proofs of concept. Furthermore, the stewardship of an operational ethical board is a critical success factor for an RI.

An infrastructure typically offers easy-to-use means of defining complex analytical processes and workflows, thus bridging the gap between domain experts and analytical technology. In many instances, domain experts may become a reference for their scientific communities, facilitating the engagement of new users in RI activities. As a collateral feedback effect, experiments generate new relevant data, methods, and workflows that data scientists can integrate into the platform, contributing to the expansion of the RI's resources. An experiment designed in one node of the RI and executed on the platform returns its results to the entire RI community.

Well-defined thematic environments amplify the achievements of new experiments towards vertical scientific communities (and potential stakeholders) by activating appropriate dissemination channels.

4.1 The SoBigData Research Infrastructure

The SoBigData Research Infrastructure is an ecosystem of human and digital resources, comprising data scientists, analytics, and processes. As shown in Fig. 4, SoBigData is designed to enable multidisciplinary scientists and innovators to carry out social mining experiments and to make them reusable by the scientific communities. All its components support data science from raw data management to knowledge extraction, with particular attention to the legal and ethical aspects reported in Fig. 1. SoBigData serves a cross-disciplinary community of data scientists studying all the elements of societal complexity from a data- and model-driven perspective.

Currently, SoBigData includes scientific, industrial, and other stakeholders: data analysts and researchers (35.6%), companies (33.3%), and policy and lawmakers (20%). The following sections provide a short but comprehensive overview of the services provided by the SoBigData RI, with special attention to supporting ethical and open data science [15, 16].

4.1.1 Resources, facilities, and access opportunities

Over the past decade, Europe has developed world-leading expertise in building and operating e-infrastructures: large-scale, federated, and distributed online research environments through which researchers can share access to scientific resources (including data, instruments, computing, and communications), regardless of their location. They are meant to support unprecedented scales of international collaboration in science, both within and across disciplines, investing in economy of scale and in common behavior, policies, best practices, and standards. They shape a common environment where scientists can create, validate, assess, compare, and share the digital results of science, such as research data and research methods, using a common "digital laboratory" consisting of agreed-on services and tools.

Figure 4. The SoBigData Research Infrastructure: an ecosystem of human and digital resources, comprising data scientists, analytical methods, and processes. SoBigData enables multidisciplinary scientists and innovators to carry out experiments and to make them reusable by the community.

However, the implementation of workflows following the Open Science principles of reproducibility and transparency is hindered by a multitude of real-world problems. One of the most prominent is that the e-infrastructures available to research communities today are far from being well-designed, consistent digital laboratories, neatly organized to share and reuse resources according to common policies, data models, standards, language platforms, and APIs. They are instead "patchworks of systems" assembling online tools, services, and data sources, evolving to match the requirements of the scientific process and to include new solutions. This degree of heterogeneity excludes the adoption of uniform workflow management systems, standard service-oriented approaches, and routine monitoring and accounting methods. Scientific workflows are typically realized by writing ad hoc code, manipulating data on desktops, alternating the execution of online web services, and sharing software libraries that implement research methods in different languages, desktop tools, and web-accessible execution engines (e.g., Taverna, Knime, Galaxy).

The SoBigData e-infrastructure is based on D4Science services, which provide researchers and practitioners with a working environment where open science practices are transparently promoted and data science practices can be implemented while minimizing the technological integration costs highlighted above.

D4Science is a deployed instance of the gCube technology [4], software conceived to facilitate the integration of web services, code, and applications as resources of different types in a common framework, which in turn enables the construction of Virtual Research Environments (VREs) [7] as combinations of such resources (Fig. 5). As no common framework can be trusted and sustained enough to convince resource providers that converging to it would be a worthwhile effort, D4Science implements a "system of systems": resources are integrated at minimal cost, gaining scalability, performance, accounting, provenance tracking, seamless integration with other resources, and visibility to all scientists. The principle is that the cost of "participation" in the framework falls on the infrastructure rather than on resource providers. The infrastructure provides the necessary bridges to include and combine resources that would otherwise be incompatible.

Figure 5. D4Science: resources from external systems, virtual research environments, and communities.

More specifically, via D4Science, SoBigData scientists can integrate and share resources such as datasets, research methods, web services (via APIs), and web applications (via portlets). Resources can then be combined and accessed via VREs, web-based working environments tailored to the needs of their designated communities, each working on a research question. Research methods are integrated as executable code implementing WPS APIs in different programming languages (e.g., Java, Python, R, Knime, Galaxy), which the Data Miner analytics platform can execute in parallel, transparently to the users, over powerful and extensible clusters, via simple VRE user interfaces. Scientists using Data Miner in the context of a VRE can select and execute the available methods and share the results with other scientists, who can repeat or reproduce the experiment with a simple click.

D4Science VREs are equipped with core services supporting data analysis and collaboration among their users: (i) a shared workspace to store and organize any version of a research artifact; (ii) a social networking area for discussing any topic (including working versions and released artifacts) and staying informed on happenings; (iii) a Data Miner analytics platform to execute processing tasks (research methods), either natively provided by VRE users or borrowed from other VREs, applied to VRE users' cases and datasets; and (iv) a catalogue-based publishing platform to make the existence of a given artifact public and disseminated. Scientists operating within VREs use such facilities continuously, transparently tracking the record of their research activities (actions, authorship, provenance) as well as the products and the links between them (lineage) resulting from every phase of the research life cycle, thus facilitating the publication of science according to the Open Science principles of transparency and reproducibility [5].

Today, SoBigData integrates the resources in Table 1. By means of such resources, SoBigData scientists have created VREs that deliver the so-called SoBigData exploratories: Explainable Machine Learning, Sports Data Science, Migration Studies, Societal Debates, Well-being & Economy, and City of Citizens. Each exploratory includes the resources required to perform data science workflows in a controlled and shared environment. Resources range from data to methods, described in more detail below, together with their use within the exploratories.

All the resources and instruments integrated into the SoBigData RI are structured so as to operate within the confines of current data protection law, with a focus on the General Data Protection Regulation (GDPR), and within an ethical analysis of the fundamental values involved in social mining and AI. Each item in the catalogue has specific fields for managing ethical issues (e.g., whether a dataset contains personal information) and fields for describing and managing intellectual property.

4.1.2 Data resources: social mining and big data ecosystem

The SoBigData RI defines policies that support users in the collection, description, preservation, and sharing of their data sets. It makes such data available for collaborative research through various strategies, ranging from sharing open data sets with the scientific community at large to sharing data under disclosure restrictions that allow access only within secure environments.

Several big data sets are available through the SoBigData RI, including network graphs from mobile phone call data; networks crawled from many online social networks, including Facebook and Flickr; transaction micro-data from diverse retailers; query logs from both search engines and e-commerce; society-wide mobile phone call records; GPS tracks from personal navigation devices; survey data about customer satisfaction or market research; extensive web archives; billions of tweets; and data from location-aware social networks.

4.1.3 Data science through SoBigData exploratories

Exploratories are thematic environments built on top of the SoBigData RI. An exploratory binds datasets to social mining methods, providing the research context for specific data science applications by: (i) providing the scientific context for performing the application, a container binding specific methods, applications, services, and datasets; and (ii) stimulating communities around the effectiveness of the analytical process, promoting scientific dissemination, result sharing, and reproducibility. Exploratories promote the effectiveness of data science through research infrastructure services. The following sections give a short description of the six SoBigData exploratories; Fig. 6 shows the main thematic areas covered by each. Due to its nature, the Explainable Machine Learning exploratory can be applied to every sector where a black-box machine learning approach is used. The list of exploratories (and the data and methods inside them) is updated continuously and keeps growing over time.

Figure 6. SoBigData covers six thematic areas, listed horizontally; each exploratory covers more than one thematic area.

City of citizens. This exploratory collects data science applications and methods related to geo-referenced data describing the movements of citizens in a city, a territory, or an entire region. The scientific literature contains several studies and methods that employ a wide variety of data sources to build models of people's mobility and city characteristics [30, 32]. Like ecosystems, cities are open systems that live and develop by means of flows of energy, matter, and information. What distinguishes a city from a colony is the human component, i.e., the process of transformation by cultural and technological evolution. Through this combination, cities are evolutionary systems that develop and co-evolve continuously with their inhabitants [24]. Cities are kaleidoscopes of information generated by a myriad of digital devices woven into the urban fabric. The inclusion of tracking technologies in personal devices has enabled the analysis of large sets of mobility data, such as GPS traces and call detail records.

Data science applied to human mobility is one of the critical topics investigated in SoBigData, thanks to the partners' decade-long experience in European projects. The study of human mobility has led to the integration into SoBigData of unique Global Positioning System (GPS) and call detail record (CDR) datasets of people and vehicle movements, geo-referenced social network data, and several mobility services: O/D (origin-destination) matrix computation, Urban Mobility Atlas (a visual interface to city mobility patterns), GeoTopics (for exploring patterns of urban activity from Foursquare), and predictive models such as MyWay (trajectory prediction) and TripBuilder (supporting tourists in building personalized tours of a city). In human mobility, research questions come from geographers, urbanists, complexity scientists, data scientists, policymakers, and big data providers, as well as from innovators aiming to provide applications for the smart city ecosystem. One idea is to investigate the impact of political events on the well-being of citizens: this exploratory supports the development of "happiness" and "peace" indicators through a text mining/opinion mining pipeline applied to repositories of online news. These indicators reveal that the level of crime in a territory can be well approximated by analyzing the news related to that territory. More generally, we study the impact of the economy on well-being and vice versa, considering, for example, that the propagation of shocks of financial distress in an economic or financial system crucially depends on the topology of the network interconnecting its elements.
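
As a small illustration of the first of these services, an origin-destination matrix can be computed from trip records with a single pandas operation; the zones and trips below are invented.

```python
import pandas as pd

# Toy trip records, e.g., reconstructed from GPS traces (zone names invented).
trips = pd.DataFrame({
    "origin":      ["Centro", "Centro", "Porto", "Stazione", "Porto"],
    "destination": ["Porto", "Stazione", "Centro", "Centro", "Stazione"],
})

# Origin-destination matrix: trip counts between every pair of zones.
od = pd.crosstab(trips["origin"], trips["destination"])
print(od)
```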

Well-being and economy. This exploratory tests the hypothesis that well-being is correlated with the business performance of companies. The idea is to combine statistical methods and traditional (typically low-frequency) economic data with high-frequency data from non-traditional sources, such as the web and supermarkets, to nowcast economic, socioeconomic, and well-being indicators. These indicators allow us to study and measure real-life costs via price variation and socioeconomic status inference. Furthermore, this activity supports studies on the correlation between people's well-being and their social and mobility data. Some basic hypotheses can be summarized as follows: (i) the age- and gender-based segregation distributions in company boards are characteristic of the mean credit risk of companies in a region; (ii) a low mean credit risk of companies in a region correlates positively with well-being; (iii) systemic risk correlates highly with well-being indices at the national level. The final aim is to provide national governments with a set of guidelines, methods, and indices for decision making on regulations affecting companies, improving well-being in the country while also considering effective policies to reduce operational risks such as credit risk and external threats to companies [17].

Big data, analyzed through the lenses of data science, provides means to understand our complex socioeconomic and financial systems. On the one hand, this offers new opportunities to measure the patterns of well-being and poverty at local and global scales, empowering governments and policymakers with the unprecedented opportunity to nowcast relevant economic quantities and compare countries, regions, and cities. On the other hand, it allows us to investigate the networks underlying the complex systems of economy and finance, and how they affect aggregate output, the propagation of shocks or financial distress, and systemic risk.

Societal debates. This exploratory employs data science to answer research questions such as: who participates in public debates? What is the "big picture" response from citizens to a policy, election, referendum, or other political event? This kind of analysis allows scientists, policymakers, and citizens to understand the online discussion surrounding polarized debates [14]. The personal perception of online discussions on social media is often biased by the so-called filter bubble, in which the automatic curation of content and of relationships between users reduces the diversity of opinions available to them. A thorough analysis of online polarized debates enables citizens to be better informed and prepared for political outcomes. By analyzing content and conversations on social media and in newspaper articles, data scientists study public debates and assess public sentiment around debated topics, opinion diffusion dynamics, echo chamber formation, polarized discussions, fake news, and propaganda bots. Misinformation is often the result of a distorted perception of concepts that, although unrelated, suddenly appear together in the same narrative. Understanding the details of this process at an early stage may help to prevent the birth and diffusion of fake news. The fight against misinformation includes the development of dynamical models of misinformation diffusion (possibly in contrast to the spread of mainstream news), as well as models of how attention cycles are accelerated and amplified by the infrastructures of online media.

Another important topic covered by this exploratory is how social bot activity affects the diffusion of fake news. Determining whether a user account is controlled by a human or a bot is a complex task. To the best of our knowledge, the only openly accessible solution for detecting social bots is Botometer, an API that allows us to interact with an underlying machine learning system. Although Botometer has proven accurate in detecting social bots, it has limitations due to the features of the Twitter API; hence, an algorithm overcoming the barriers of current recipes is needed.

The resources related to the Societal Debates exploratory, especially in the domain of media ecology and the fight against online misinformation, provide easy-to-use services to public bodies, media outlets, and social and political scientists. Furthermore, SoBigData supports new simulation models and experimental processes to validate in vivo the algorithms for fighting misinformation, curbing the pathological acceleration and amplification of online attention cycles, breaking the bubbles, and exploring alternative media and information ecosystems.

Migration studies. Data science is also useful for understanding the migration phenomenon. Knowledge about the number of immigrants living in a particular region is crucial for devising policies that maximize the benefits for both locals and immigrants. These numbers can vary rapidly in space and time, especially in periods of crisis such as wars or natural disasters.

This exploratory provides a set of data and tools for answering questions about migration flows. Through it, a data scientist can study economic models of migration and observe how migrants choose their destination countries. A scientist can investigate what “opportunities” a country offers to migrants, and whether there are correlations between the number of incoming migrants and the opportunities in host countries [8]. Furthermore, this exploratory seeks to understand how the public perception of migration is changing, using opinion mining. For example, social network analysis enables us to analyze migrants’ social networks and discover their structure for people who have decided to start a new life in a different country [28].

Finally, we can also evaluate current integration indices based on official statistics and survey data, which can be complemented by Big Data sources. This exploratory aims to build combined integration indices that take multiple data sources into account to evaluate integration on various levels. Such sources include mobile phone data, to understand patterns of communication between immigrants and natives; social network data, to assess sentiment towards immigrants and immigration; professional network data (such as LinkedIn), to understand labor market integration; and local data, to understand to what extent moving across borders is associated with a change in migrants’ cultural norms. These indices are fundamental for evaluating the overall social and economic effects of immigration. The new integration indices can be applied at various spatial and temporal resolutions (small area methods) to obtain a complete picture of integration and complement official indices.
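
A combined index of this kind can be as simple as a weighted average of normalized sub-indices, one per data source. The sketch below is purely illustrative: the sub-index names, values, and weights are invented, and real indices would be estimated from the data sources listed above.

```python
# A hedged sketch of a combined integration index: a weighted average of
# hypothetical per-source sub-indices, each normalized to [0, 1].
sub_indices = {
    "communication_mixing": 0.62,        # from mobile phone data
    "sentiment_towards_migrants": 0.55,  # from social network data
    "labor_market_integration": 0.71,    # from professional network data
}
weights = {
    "communication_mixing": 0.4,
    "sentiment_towards_migrants": 0.3,
    "labor_market_integration": 0.3,
}

combined = sum(weights[k] * v for k, v in sub_indices.items())
print(f"Combined integration index: {combined:.2f}")
```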

Sports data science. The proliferation of new sensing technologies that provide high-fidelity data streams extracted from every game is changing the way scientists, fans, and practitioners think about sports performance. Combining these (big) data with the tools of data science makes it possible to unveil the complex models underlying sports performance and to tackle many challenging tasks: automatic tactical analysis, data-driven performance ranking, game outcome prediction, and injury forecasting. The idea is to foster research on sports data science in several directions. The application of explainable AI and deep learning techniques can be hugely beneficial to sports data science. For example, by using adversarial learning, we can modify the training plans of players associated with high injury risk and develop training plans that maximize players’ fitness while minimizing their injury risk. Gaming, simulation, and modeling form another set of tools that coaching staff can use to test tactics against a competitor. Furthermore, by using deep learning on time series, we can forecast the evolution of players’ performance and scout young talent.
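
Injury forecasting, for instance, can be framed as supervised classification over workload features. The following is a minimal sketch under that framing; the features and data are invented, and the model choice (a plain logistic regression rather than the deep or adversarial methods mentioned above) is an assumption made for illustration.

```python
# Hypothetical injury-forecasting sketch from training-workload features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per player-week: [km_run, high_speed_sprints,
# acute/chronic workload ratio]
X = np.array([
    [25.0,  40, 0.9],
    [42.0,  90, 1.6],
    [30.0,  55, 1.1],
    [48.0, 110, 1.9],
])
y = np.array([0, 1, 0, 1])  # 1 = injury within the following week

model = LogisticRegression().fit(X, y)
print("P(injury) =", model.predict_proba([[38.0, 70, 1.4]])[0, 1])
```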

This exploratory examines the factors influencing sports success and how to build simulation tools for boosting both individual and collective performance. Furthermore, it characterizes performance through data, statistics, and models, allowing coaches, fans, and practitioners to understand (and boost) sports performance [42].

Explainable machine learning. Artificial Intelligence, increasingly based on Big Data analytics, is a disruptive technology of our times. This exploratory provides a forum for studying the effects of AI on future society. In this context, SoBigData studies the future of labor and the workforce, also through data- and model-driven analysis, simulations, and the development of methods that construct human-understandable explanations of black-box AI models [20].

Black-box systems for automated decision making map a user’s features into a class that predicts individual behavioral traits, such as credit risk or health status, without exposing the reasons why. Most of the time, the internal reasoning of these algorithms is obscure even to their developers. For this reason, the last decade has witnessed the rise of a black box society. This exploratory is developing a set of techniques and tools that allow data analysts to understand why an algorithm produces a given decision. These approaches are designed not only to address the lack of transparency but also to discover possible biases that algorithms inherit from human prejudices and from artifacts hidden in the training data, which may lead to unfair or wrong decisions [35].
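
One widely used technique in this family is LIME (Ribeiro et al., cited in the references), which fits a local, interpretable surrogate around a single prediction. The sketch below is a hedged illustration of its standard usage on tabular data; the classifier, feature names, and data are stand-ins, not the exploratory's actual models.

```python
# A minimal post hoc explanation sketch with LIME, assuming a trained
# scikit-learn classifier over tabular credit data. Feature names and
# data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["income", "debt", "age", "late_payments"]
X_train = np.random.rand(200, 4)               # stand-in training data
y_train = (X_train[:, 1] > 0.5).astype(int)    # stand-in labels
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["accept", "reject"], mode="classification")
explanation = explainer.explain_instance(
    X_train[0], clf.predict_proba, num_features=3)
print(explanation.as_list())  # per-feature contributions to this decision
```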

5 Conclusions: individual and collective intelligence

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [23]. Since 2012, 2.5 exabytes (\(2.5 \times 10^{18}\) bytes) of data have been created every day; as of 2014, 2.3 zettabytes (\(2.3 \times 10^{21}\) bytes) of data were generated every day by large high-tech corporations worldwide. Soon, zettabytes of useful public and private data will be widely and openly available. In the coming years, smart applications such as smart grids, smart logistics, smart factories, and smart cities will be widely deployed across the continent and beyond. Ubiquitous broadband access, mobile technology, social media, services, and the Internet of Things on billions of devices will have contributed to an explosion of generated data, with a total global estimate of 40 zettabytes.

In this work, we have introduced data science as a new challenge and opportunity for the coming years. In this context, we have tried to summarize concisely several aspects of data science applications and their impacts on society, considering both the new services available and the new job perspectives. We have also introduced the issues in managing data that represent human behavior, and showed how difficult it is to preserve personal information and privacy. With the introduction of the SoBigData RI and its exploratories, we have provided virtual environments where it is possible to understand the potential of data science in different research contexts.

In conclusion, we can state that social dilemmas occur when there is a conflict between individual and public interest. Such problems also appear in the ecosystem of distributed AI systems (based on data science tools) and humans, with additional difficulties due, on the one hand, to the relative rigidity of trained AI systems and the necessity of achieving social benefit and, on the other hand, to the necessity of keeping individuals interested. What are the principles and solutions for individual versus social optimization using AI, and how can an optimal balance be achieved? The answer is still open, but these complex systems have to work toward fulfilling collective goals and requirements, with the challenge that human needs change over time and shift from one context to another. Every AI system should operate within an ethical and social framework, in an understandable, verifiable, and justifiable way. Such systems must, in any case, work within the bounds of the rule of law, incorporating the protection of fundamental rights into the AI infrastructure. In other words, the challenge is to develop mechanisms that lead the system to converge to an equilibrium that complies with European values and social objectives (e.g., social inclusion) without unnecessary losses of efficiency.

Interestingly, data science can play a vital role in enhancing desirable behaviors in the system, e.g., by supporting the coordination and cooperation that are, more often than not, crucial to achieving any meaningful improvement. Our ultimate goal is to build the blueprint of a sociotechnical system in which AI not only cooperates with humans but, if necessary, helps them learn how to collaborate, as well as other desirable behaviors. In this context, it is also essential to understand how to achieve robustness of human and AI ecosystems with respect to various types of malicious behavior, such as abuse of power and exploitation of AI’s technical weaknesses.

We conclude by paraphrasing Stephen Hawking in his Brief Answers to the Big Questions: the availability of data on its own will not take humanity to the future, but its intelligent and creative use will.

http://www.sdss3.org/collaboration/ .

e.g., https://www.nature.com/sdata/policies/repositories .

Responsible Data Science program: https://redasci.org/ .

https://emergency.copernicus.eu/ .

Nowcasting in economics is the prediction of the present, the very near future, and the very recent past state of an economic indicator.

https://www.unglobalpulse.org/ .

http://sobigdata.eu .

https://www.gcube-system.org/ .

https://sobigdata.d4science.org/catalogue-sobigdata .

http://www.sobigdata.eu/content/urban-mobility-atlas .

http://data.d4science.org/ctlg/ResourceCatalogue/geotopics_-_a_method_and_system_to_explore_urban_activity .

http://data.d4science.org/ctlg/ResourceCatalogue/myway_-_trajectory_prediction .

http://data.d4science.org/ctlg/ResourceCatalogue/tripbuilder .

Abitbol, J.L., Fleury, E., Karsai, M.: Optimal proxy selection for socioeconomic status inference on Twitter. Complexity 2019, 6059673:1–6059673:15 (2019). https://doi.org/10.1155/2019/6059673


Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., Giannotti, F., Monreale, A., Nanni, M., Pagano, P., Pappalardo, L., Pedreschi, D., Pratesi, F., Rabitti, F., Rinzivillo, S., Rossetti, G., Ruggieri, S., Sebastiani, F., Tesconi, M.: How data mining and machine learning evolved from relational data base to data science. In: Flesca, S., Greco, S., Masciari, E., Saccà, D. (eds.) A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, Studies in Big Data, vol. 31, pp. 287–306. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-61893-7_17


Andrienko, G.L., Andrienko, N.V., Budziak, G., Dykes, J., Fuchs, G., von Landesberger, T., Weber, H.: Visual analysis of pressure in football. Data Min. Knowl. Discov. 31 (6), 1793–1839 (2017). https://doi.org/10.1007/s10618-017-0513-2


Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Marioli, V., Pagano, P., Panichi, G., Perciante, C., Sinibaldi, F.: The gcube system: delivering virtual research environments as-a-service. Future Gener. Comput. Syst. 95 , 445–453 (2019). https://doi.org/10.1016/j.future.2018.10.035

Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Pagano, P., Panichi, G., Sinibaldi, F.: Enacting open science by d4science. Future Gener. Comput. Syst. (2019). https://doi.org/10.1016/j.future.2019.05.063

Barabasi, A.L., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nature reviews. Genetics 12 , 56–68 (2011). https://doi.org/10.1038/nrg2918

Candela, L., Castelli, D., Pagano, P.: Virtual research environments: an overview and a research agenda. Data Sci. J. 12 , GRDI75–GRDI81 (2013). https://doi.org/10.2481/dsj.GRDI-013

Coletto, M., Esuli, A., Lucchese, C., Muntean, C.I., Nardini, F.M., Perego, R., Renso, C.: Sentiment-enhanced multidimensional analysis of online social networks: perception of the Mediterranean refugees crisis. In: Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM’16, pp. 1270–1277. IEEE Press, Piscataway, NJ, USA (2016). http://dl.acm.org/citation.cfm?id=3192424.3192657

Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Uncovering hierarchical and overlapping communities with a local-first approach. TKDD 9 (1), 6:1–6:27 (2014). https://doi.org/10.1145/2629511

Cresci, S., Minutoli, S., Nizzoli, L., Tardelli, S., Tesconi, M.: Enriching digital libraries with crowdsensed data. In: P. Manghi, L. Candela, G. Silvello (eds.) Digital Libraries: Supporting Open Science—15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, 31 Jan–1 Feb 2019, Proceedings, Communications in Computer and Information Science, vol. 988, pp. 144–158. Springer (2019). https://doi.org/10.1007/978-3-030-11226-4_12

Cresci, S., Petrocchi, M., Spognardi, A., Tognazzi, S.: Better safe than sorry: an adversarial approach to improve social bot detection. In: P. Boldi, B.F. Welles, K. Kinder-Kurlanda, C. Wilson, I. Peters, W.M. Jr. (eds.) Proceedings of the 11th ACM Conference on Web Science, WebSci 2019, Boston, MA, USA, June 30–July 03, 2019, pp. 47–56. ACM (2019). https://doi.org/10.1145/3292522.3326030

Cresci, S., Pietro, R.D., Petrocchi, M., Spognardi, A., Tesconi, M.: Social fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling. IEEE Trans. Dependable Sec. Comput. 15(4), 561–576 (2018). https://doi.org/10.1109/TDSC.2017.2681672

Furletti, B., Trasarti, R., Cintia, P., Gabrielli, L.: Discovering and understanding city events with big data: the case of Rome. Information 8(3), 74 (2017). https://doi.org/10.3390/info8030074

Garimella, K., De Francisci Morales, G., Gionis, A., Mathioudakis, M.: Reducing controversy by connecting opposing views. In: Proceedings of the 10th ACM International Conference on Web Search and Data Mining, WSDM’17, pp. 81–90. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018703

Giannotti, F., Trasarti, R., Bontcheva, K., Grossi, V.: SoBigData: social mining & big data ecosystem. In: P. Champin, F.L. Gandon, M. Lalmas, P.G. Ipeirotis (eds.) Companion of The Web Conference 2018, WWW 2018, Lyon, France, April 23–27, 2018, pp. 437–438. ACM (2018). https://doi.org/10.1145/3184558.3186205

Grossi, V., Rapisarda, B., Giannotti, F., Pedreschi, D.: Data science at SoBigData: the European research infrastructure for social mining and big data analytics. I. J. Data Sci. Anal. 6(3), 205–216 (2018). https://doi.org/10.1007/s41060-018-0126-x

Grossi, V., Romei, A., Ruggieri, S.: A case study in sequential pattern mining for it-operational risk. In: W. Daelemans, B. Goethals, K. Morik (eds.) Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, 15–19 Sept 2008, Proceedings, Part I, Lecture Notes in Computer Science, vol. 5211, pp. 424–439. Springer (2008). https://doi.org/10.1007/978-3-540-87479-9_46

Guidotti, R., Coscia, M., Pedreschi, D., Pennacchioli, D.: Going beyond GDP to nowcast well-being using retail market data. In: A. Wierzbicki, U. Brandes, F. Schweitzer, D. Pedreschi (eds.) Advances in Network Science—12th International Conference and School, NetSci-X 2016, Wroclaw, Poland, 11–13 Jan 2016, Proceedings, Lecture Notes in Computer Science, vol. 9564, pp. 29–42. Springer (2016). https://doi.org/10.1007/978-3-319-28361-6_3

Guidotti, R., Monreale, A., Nanni, M., Giannotti, F., Pedreschi, D.: Clustering individual transactional data for masses of users. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 Aug 2017, pp. 195–204. ACM (2017). https://doi.org/10.1145/3097983.3098034

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5), 93:1–93:42 (2019). https://doi.org/10.1145/3236009

Guidotti, R., Monreale, A., Turini, F., Pedreschi, D., Giannotti, F.: A survey of methods for explaining black box models. CoRR abs/1802.01933 (2018). arxiv: 1802.01933

Guidotti, R., Nanni, M., Rinzivillo, S., Pedreschi, D., Giannotti, F.: Never drive alone: boosting carpooling with network analysis. Inf. Syst. 64 , 237–257 (2017). https://doi.org/10.1016/j.is.2016.03.006

Hilbert, M., Lopez, P.: The world’s technological capacity to store, communicate, and compute information. Science 332 (6025), 60–65 (2011)

Kennedy, C.A., Stewart, I., Facchini, A., Cersosimo, I., Mele, R., Chen, B., Uda, M., Kansal, A., Chiu, A., Kim, K.g., Dubeux, C., Lebre La Rovere, E., Cunha, B., Pincetl, S., Keirstead, J., Barles, S., Pusaka, S., Gunawan, J., Adegbile, M., Nazariha, M., Hoque, S., Marcotullio, P.J., González Otharán, F., Genena, T., Ibrahim, N., Farooqui, R., Cervantes, G., Sahin, A.D., : Energy and material flows of megacities. Proc. Nat. Acad. Sci. 112 (19), 5985–5990 (2015). https://doi.org/10.1073/pnas.1504315112

Korjani, S., Damiano, A., Mureddu, M., Facchini, A., Caldarelli, G.: Optimal positioning of storage systems in microgrids based on complex networks centrality measures. Sci. Rep. (2018). https://doi.org/10.1038/s41598-018-35128-6

Lorini, V., Castillo, C., Dottori, F., Kalas, M., Nappo, D., Salamon, P.: Integrating social media into a pan-european flood awareness system: a multilingual approach. In: Z. Franco, J.J. González, J.H. Canós (eds.) Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management, València, Spain, 19–22 May 2019. ISCRAM Association (2019). http://idl.iscram.org/files/valeriolorini/2019/1854-_ValerioLorini_etal2019.pdf

Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., Ricci, L.: Scalable and flexible clustering solutions for mobile phone-based population indicators. Int. J. Data Sci. Anal. 4 (4), 285–299 (2017). https://doi.org/10.1007/s41060-017-0065-y

Moise, I., Gaere, E., Merz, R., Koch, S., Pournaras, E.: Tracking language mobility in the Twitter landscape. In: C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R.A. Baeza-Yates, Z. Zhou, X. Wu (eds.) IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, 12–15 Dec 2016, Barcelona, Spain, pp. 663–670. IEEE Computer Society (2016). https://doi.org/10.1109/ICDMW.2016.0099

Nanni, M.: Advancements in mobility data analysis. In: F. Leuzzi, S. Ferilli (eds.) Traffic Mining Applied to Police Activities—Proceedings of the 1st Italian Conference for the Traffic Police (TRAP-2017), Rome, Italy, 25–26 Oct 2017, Advances in Intelligent Systems and Computing, vol. 728, pp. 11–16. Springer (2017). https://doi.org/10.1007/978-3-319-75608-0_2

Nanni, M., Trasarti, R., Monreale, A., Grossi, V., Pedreschi, D.: Driving profiles computation and monitoring for car insurance CRM. ACM Trans. Intell. Syst. Technol. 8(1), 14:1–14:26 (2016). https://doi.org/10.1145/2912148

Pappalardo, G., di Matteo, T., Caldarelli, G., Aste, T.: Blockchain inefficiency in the Bitcoin peers network. EPJ Data Sci. 7(1), 30 (2018). https://doi.org/10.1140/epjds/s13688-018-0159-3

Pappalardo, L., Barlacchi, G., Pellungrini, R., Simini, F.: Human mobility from theory to practice: Data, models and applications. In: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J.J. McAuley, R.A. Baeza-Yates, L. Zia (eds.) Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019., pp. 1311–1312. ACM (2019). https://doi.org/10.1145/3308560.3320099

Pappalardo, L., Cintia, P., Ferragina, P., Massucco, E., Pedreschi, D., Giannotti, F.: PlayeRank: data-driven performance evaluation and player ranking in soccer via a machine learning approach. ACM TIST 10(5), 59:1–59:27 (2019). https://doi.org/10.1145/3343172

Pappalardo, L., Vanhoof, M., Gabrielli, L., Smoreda, Z., Pedreschi, D., Giannotti, F.: An analytical framework to nowcast well-being using mobile phone data. CoRR abs/1606.06279 (2016). arxiv: 1606.06279

Pasquale, F.: The Black Box Society: The Secret Algorithms That Control Money and Information. Harvard University Press, Cambridge (2015)


Piškorec, M., Antulov-Fantulin, N., Miholić, I., Šmuc, T., Šikić, M.: Modeling peer and external influence in online social networks: case of the 2013 referendum in Croatia. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) Complex Networks & Their Applications VI. Springer, Cham (2018)


Ranco, G., Aleksovski, D., Caldarelli, G., Mozetic, I.: Investigating the relations between Twitter sentiment and stock prices. CoRR abs/1506.02431 (2015). arxiv: 1506.02431

Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. In: B. Krishnapuram, M. Shah, A.J. Smola, C.C. Aggarwal, D. Shen, R. Rastogi (eds.) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 Aug 2016, pp. 1135–1144. ACM (2016). https://doi.org/10.1145/2939672.2939778

Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: high-precision model-agnostic explanations. In: S.A. McIlraith, K.Q. Weinberger (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, 2–7 Feb 2018, pp. 1527–1535. AAAI Press (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16982

Rossetti, G., Milli, L., Rinzivillo, S., Sîrbu, A., Pedreschi, D., Giannotti, F.: Ndlib: a python library to model and analyze diffusion processes over complex networks. Int. J. Data Sci. Anal. 5 (1), 61–79 (2018). https://doi.org/10.1007/s41060-017-0086-6

Rossetti, G., Pappalardo, L., Pedreschi, D., Giannotti, F.: Tiles: an online algorithm for community discovery in dynamic social networks. Mach. Learn. 106 (8), 1213–1241 (2017). https://doi.org/10.1007/s10994-016-5582-8

Rossi, A., Pappalardo, L., Cintia, P., Fernández, J., Iaia, M.F., Medina, D.: Who is going to get hurt? Predicting injuries in professional soccer. In: J. Davis, M. Kaytoue, A. Zimmermann (eds.) Proceedings of the 4th Workshop on Machine Learning and Data Mining for Sports Analytics co-located with ECML PKDD 2017, Skopje, Macedonia, 18 Sept 2017, CEUR Workshop Proceedings, vol. 1971, pp. 21–30. CEUR-WS.org (2017). http://ceur-ws.org/Vol-1971/paper-04.pdf

Ruggieri, S., Pedreschi, D., Turini, F.: DCUBE: discrimination discovery in databases. In: A.K. Elmagarmid, D. Agrawal (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, 6–10 June 2010, pp. 1127–1130. ACM (2010). https://doi.org/10.1145/1807167.1807298

Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1312.html#SimonyanVZ13

Smilkov, D., Thorat, N., Kim, B., Viégas, F.B., Wattenberg, M.: Smoothgrad: removing noise by adding noise. CoRR abs/1706.03825 (2017). arxiv: 1706.03825

Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 3319–3328. PMLR, International Convention Centre, Sydney, Australia (2017). http://proceedings.mlr.press/v70/sundararajan17a.html

Trasarti, R., Guidotti, R., Monreale, A., Giannotti, F.: Myway: location prediction via mobility profiling. Inf. Syst. 64 , 350–367 (2017). https://doi.org/10.1016/j.is.2015.11.002

Traub, J., Quiané-Ruiz, J., Kaoudi, Z., Markl, V.: Agora: Towards an open ecosystem for democratizing data science & artificial intelligence. CoRR abs/1909.03026 (2019). arxiv: 1909.03026

Vazifeh, M.M., Zhang, H., Santi, P., Ratti, C.: Optimizing the deployment of electric vehicle charging stations using pervasive mobility data. Transp Res A Policy Practice 121 (C), 75–91 (2019). https://doi.org/10.1016/j.tra.2019.01.002

Vermeulen, A.F.: Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, 1st edn. Apress, New York (2018)


Acknowledgements

This work is supported by the European Community’s H2020 Program under the scheme ‘INFRAIA-1-2014-2015: Research Infrastructures’, grant agreement #654024 ‘SoBigData: Social Mining and Big Data Ecosystem’, and the scheme ‘INFRAIA-01-2018-2019: Research and Innovation action’, grant agreement #871042 ‘SoBigData\(_{++}\): European Integrated Infrastructure for Social Mining and Big Data Analytics’.

Open access funding provided by Università di Pisa within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, KDDLab, Pisa, Italy

Valerio Grossi & Fosca Giannotti

Department of Computer Science, University of Pisa, Pisa, Italy

Dino Pedreschi

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, NeMIS, Pisa, Italy

Paolo Manghi, Pasquale Pagano & Massimiliano Assante


Corresponding author

Correspondence to Dino Pedreschi .

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Grossi, V., Giannotti, F., Pedreschi, D. et al. Data science: a game changer for science and innovation. Int J Data Sci Anal 11 , 263–278 (2021). https://doi.org/10.1007/s41060-020-00240-2


Received : 13 July 2019

Accepted : 15 December 2020

Published : 19 April 2021

Issue Date : May 2021

DOI : https://doi.org/10.1007/s41060-020-00240-2


Keywords: Responsible data science, Research infrastructure, Social mining


What Is Data Science? 5 Applications in Business


14 Jan 2021

At a time when 1.7 megabytes of data are generated every second for every person on Earth, it’s crucial to know how to wade through information, and structure, interpret, and present it in a meaningful way.

This enormous volume of data, known as big data , has prompted greater demand for skilled data science professionals. According to the US Bureau of Labor Statistics, employment of data scientists is expected to rise 15 percent by 2029—far faster than the four percent average for all occupations. Yet, to harness the power of big data, it isn’t necessary to be a data scientist.

Anyone with access to data can reap its benefits. Data science can be used to gain knowledge about behaviors and processes, write algorithms that process large amounts of information quickly and efficiently, increase security and privacy of sensitive data, and guide data-driven decision-making .

In a business world with no shortage of data, knowing how to make sense of it, the terminology used to navigate it, and ways to leverage it to make a positive impact can be invaluable tools in your career. Here’s a primer on what data science is and how you can use it in business.


What Is Data Science?

Data science is the process of building, cleaning, and structuring datasets to analyze and extract meaning. It’s not to be confused with data analytics , which is the act of analyzing and interpreting data. These processes share many similarities and are both valuable in the workplace.

Data science requires you to:

  • Form hypotheses
  • Run experiments to gather data
  • Assess data’s quality
  • Clean and streamline datasets
  • Organize and structure data for analysis

Data scientists often write algorithms—in coding languages like SQL and R—to collect and analyze big data. When designed correctly and tested thoroughly, algorithms can catch information or trends that humans miss. They can also significantly speed up the processes of gathering and analyzing data.

For example, an algorithm created by researchers at the Massachusetts Institute of Technology can be used to detect differences between 3D medical images—such as MRI scans—more than one thousand times faster than a human. Because of this time saved, doctors can respond to urgent issues revealed in the scans and potentially save patients’ lives.

In the Harvard Online course Data Science Principles , Professor Dustin Tingley stresses the importance of both the human and machine aspects of data science.

“With this new world of possibility, there also comes a greater need for critical thinking,” Tingley says. “Without human thought and guidance throughout the entire process, none of these seemingly fantastical machine-learning applications would be possible.”

If you want to make sense of big data and leverage it to make an impact, here are five applications for data science to harness at your organization.

5 Business Applications for Data Science

1. Gain Customer Insights

Data about your customers can reveal details about their habits, demographic characteristics, preferences, aspirations, and more. With so many potential sources of customer data, a foundational understanding of data science can help make sense of it.

For instance, you may gather data about a customer each time they visit your website or brick-and-mortar store, add an item to their cart, complete a purchase, open an email, or engage with a social media post. After ensuring the data from each source is accurate, you need to combine it in a process called data wrangling . This might involve matching a customer’s email address to their credit card information, social media handles, and purchase identifications. By aggregating the data, you can draw conclusions and identify trends in their behaviors.
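
As a hedged sketch of that wrangling step: the pandas code below joins hypothetical website events to purchase records on a shared email key, then aggregates per customer. All column names and values are invented for illustration.

```python
# Join web activity to purchase history by a shared customer key,
# then aggregate per customer. Data are hypothetical.
import pandas as pd

web_visits = pd.DataFrame({
    "email": ["a@x.com", "b@y.com", "a@x.com"],
    "page": ["home", "pricing", "cart"],
})
purchases = pd.DataFrame({
    "email": ["a@x.com", "b@y.com"],
    "amount": [120.0, 35.0],
})

# Match each customer's web activity to their purchase record.
combined = web_visits.merge(purchases, on="email", how="left")
per_customer = combined.groupby("email").agg(
    visits=("page", "count"), total_spent=("amount", "first"))
print(per_customer)
```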

Understanding who your customers are and what motivates them can help ensure your product meets their job to be done and your marketing and sales efforts are working. Having and understanding reliable customer data can also inform retargeting efforts, personalized experiences for specific users, and improvements to your website and product’s user experience.

2. Increase Security

You can also use data science to increase the security of your business and protect sensitive information. For example, banks use complex machine-learning algorithms to detect fraud based on deviations from a user’s typical financial activities. These algorithms can catch fraud faster and with greater accuracy than humans, simply because of the sheer volume of data generated every day.
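
A minimal version of this deviation-based idea can be sketched without any machine learning at all: flag a transaction whose amount sits far outside a user's historical distribution. The numbers and the three-standard-deviation threshold below are illustrative assumptions, not a production fraud rule.

```python
# Deviation-based fraud flagging for a single user's transaction
# amounts. History, new amount, and threshold are invented.
import numpy as np

history = np.array([23.0, 41.5, 18.2, 35.0, 27.9, 44.1])  # past amounts
new_amount = 950.0

mu, sigma = history.mean(), history.std()
z = (new_amount - mu) / sigma
if abs(z) > 3:  # more than 3 standard deviations from typical spending
    print(f"Flag for review: z-score {z:.1f}")
```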

Even if you don’t work at a bank, algorithms can be used to protect sensitive information through the process of encryption . Learning about data privacy can ensure your company doesn’t misuse or share customers’ sensitive information, including credit card details, medical information, Social Security numbers, and contact information.
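
As a small, hedged illustration of encryption in practice, the sketch below uses the Python cryptography package's Fernet recipe for symmetric encryption; the article does not name a specific library, so this choice is an assumption.

```python
# Symmetric encryption of a sensitive value with Fernet.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store securely, e.g., in a vault
cipher = Fernet(key)

token = cipher.encrypt(b"4111-1111-1111-1111")  # sensitive card number
print(cipher.decrypt(token))         # recoverable only with the same key
```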

“As organizations become more and more data-centric, the need for ethical treatment of individual data becomes equally urgent,” Tingley says in Data Science Principles .

It’s the combination of algorithms and human judgment that can move businesses closer to a higher level of security and ethical use of data.


3. Inform Internal Finances

Your organization’s financial team can utilize data science to create reports, generate forecasts, and analyze financial trends. Data on a company’s cash flows, assets, and debts are constantly gathered, which financial analysts can use to manually or algorithmically detect trends in financial growth or decline.

For example, if you’re a financial analyst tasked with forecasting revenue, you can use predictive analysis to do so. This would require calculating the predicted average selling price per unit for future periods and multiplying it by the number of units expected to be sold during those periods. You can estimate both the average selling price and number of expected units sold by finding trends in historic company and industry data, which must be qualified, cleaned, and structured. This is data science at work.
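
The forecasting arithmetic described above is simple enough to show directly; the prices and volumes below are invented for the example.

```python
# Forecast revenue = predicted price per unit x predicted units sold.
predicted_price_per_unit = 49.0   # estimated from historical price trends
predicted_units_sold = 12_000     # estimated from historical volume trends

forecast_revenue = predicted_price_per_unit * predicted_units_sold
print(f"Forecast revenue: ${forecast_revenue:,.0f}")  # $588,000
```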

Additionally, risk management analysis can be used to calculate whether certain business decisions are worth the potential downsides. Each of these financial analyses can offer valuable insights and drive business decisions.

4. Streamline Manufacturing

Another way you can use data science in business is to identify inefficiencies in manufacturing processes. Manufacturing machines gather data from production processes at high volumes. In cases where the volume of data collected is too high for a human to manually analyze it, an algorithm can be written to clean, sort, and interpret it quickly and accurately to gather insights.
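
As a hedged sketch of this kind of analysis, the code below scans a hypothetical stream of normalized machine readings for the most efficient window, loosely analogous to the "golden run" idea in the example that follows; real pipelines would first clean and validate the data.

```python
# Find the highest-efficiency window in a stream of machine readings.
# Readings are invented, normalized units of output per minute.
import pandas as pd

readings = pd.Series([0.82, 0.85, 0.91, 0.95, 0.94, 0.78, 0.80],
                     name="units_per_minute_normalized")
rolling = readings.rolling(window=3).mean()
best_window_end = rolling.idxmax()
print(f"Highest 3-reading efficiency ends at index {best_window_end}: "
      f"{rolling.max():.2f}")
```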

For example, industrial automation company Oden Technologies created a machine-learning tool called Golden Run, which collects manufacturing data, identifies times of highest efficiency, and provides recommendations for replicating that high-efficiency state. As the algorithm gathers more data, it provides better recommendations for improvement.

By using data science to become more efficient, companies can cut costs and produce more goods.

5. Predict Future Market Trends

Collecting and analyzing data on a larger scale can enable you to identify emerging trends in your market. Tracking purchase data, celebrities and influencers, and search engine queries can reveal what products people are interested in.

For instance, clothing upcycling has been on the rise as an environmentally conscious way to refresh a wardrobe. According to research by Nielsen, 81 percent of consumers feel strongly that companies should help improve the environment. Clothing retailer Patagonia, which has been using recycled plastic polyester since 1993, leaned into this emerging trend by launching Worn Wear, a site that’s specifically designed to help customers upcycle used Patagonia products.

By staying up-to-date on the behaviors of your target market, you can make business decisions that allow you to get ahead of the curve.


Using Data Science at Your Organization

When critical thinking meets machine-learning algorithms, data can offer insights, guide efficiency efforts, and inform predictions.

Even if you aren’t a data scientist, understanding how to qualify data sources, clean and structure information, and extrapolate conclusions can be valuable skills in your career.

Are you interested in furthering your data literacy? Download our Beginner’s Guide to Data & Analytics to learn how you can leverage the power of data for professional and organizational success.



Data Science for Undergraduates: Opportunities and Options (2018)

Chapter 6: Conclusions

Data science education is well into its formative stages of development; it is evolving into a self-supporting discipline and producing professionals with distinct and complementary skills relative to professionals in the computer, information, and statistical sciences. However, regardless of its potential eventual disciplinary status, the evidence points to robust growth of data science education that will indelibly shape the undergraduate students of the future. In fact, fueled by growing student interest and industry demand, data science education will likely become a staple of the undergraduate experience. There will be an increase in the number of students majoring, minoring, earning certificates, or just taking courses in data science as the value of data skills becomes even more widely recognized. The adoption of a general education requirement in data science for all undergraduates will endow future generations of students with the basic understanding of data science that they need to become responsible citizens. Continuing education programs such as data science boot camps, career accelerators, summer schools, and incubators will provide another stream of talent. This constitutes the emerging watershed of data science education that feeds multiple streams of generalists and specialists in society; citizens are empowered by their basic skills to examine, interpret, and draw value from data.

Today, the nation is in the formative phase of data science education, where educational organizations are pioneering their own programs, each with different approaches to depth, breadth, and curricular emphasis (e.g., business, computer science, engineering, information science, mathematics, social science, or statistics). It is too early to expect consensus to emerge on certain best practices of data science education. However, it is not too early to envision the possible forms that such practices might take. Nor is it too early to make recommendations that can help the data science education community develop strategic vision and practices. The following is a summary of the findings and recommendations discussed in the preceding four chapters of this report.

Finding 2.1: Data scientists today draw largely from extensions of the “analyst” of years past trained in traditional disciplines. As data science becomes an integral part of many industries and enriches research and development, there will be an increased demand for more holistic and more nuanced data science roles.

Finding 2.2: Data science programs that strive to meet the needs of their students will likely evolve to emphasize certain skills and capabilities. This will result in programs that prepare different types of data scientists.

Recommendation 2.1: Academic institutions should embrace data science as a vital new field that requires specifically tailored instruction delivered through majors and minors in data science as well as the development of a cadre of faculty equipped to teach in this new field.

Recommendation 2.2: Academic institutions should provide and evolve a range of educational pathways to prepare students for an array of data science roles in the workplace.

Finding 2.3: A critical task in the education of future data scientists is to instill data acumen. This requires exposure to key concepts in data science, real-world data and problems that can reinforce the limitations of tools, and ethical considerations that permeate many applications. Key concepts involved in developing data acumen include the following:

  • Mathematical foundations,
  • Computational foundations,
  • Statistical foundations,
  • Data management and curation,
  • Data description and visualization,
  • Data modeling and assessment,
  • Workflow and reproducibility,
  • Communication and teamwork,
  • Domain-specific considerations, and
  • Ethical problem solving.

Recommendation 2.3: To prepare their graduates for this new data-driven era, academic institutions should encourage the development of a basic understanding of data science in all undergraduates.

Recommendation 2.4: Ethics is a topic that, given the nature of data science, students should learn and practice throughout their education. Academic institutions should ensure that ethics is woven into the data science curriculum from the beginning and throughout.

Recommendation 2.5: The data science community should adopt a code of ethics; such a code should be affirmed by members of professional societies, included in professional development programs and curricula, and conveyed through educational programs. The code should be reevaluated often in light of new developments.

Finding 3.1: Undergraduate education in data science can be experienced in many forms. These include the following:

  • Integrated introductory courses that can satisfy a general education requirement;
  • A major in data science, including advanced skills, as the primary field of study;
  • A minor or track in data science, where intermediate skills are connected to the major field of study;
  • Two-year degrees and certificates;
  • Other certificates, often requiring fewer courses than a major but more than a minor;
  • Massive open online courses, which can engage large numbers of students at a variety of levels; and
  • Summer programs and boot camps, which can serve to supplement academic or on-the-job training.

Recommendation 3.1: Four-year and two-year institutions should establish a forum for dialogue across institutions on all aspects of data science education, training, and workforce development.

Finding 4.1: The nature of data science is such that it offers multiple pathways for students of different backgrounds to engage at levels ranging from basic to expert.

Finding 4.2: Data science would particularly benefit from broad participation by underrepresented minorities because of the many applications to problems of interest to diverse populations.

Recommendation 4.1: As data science programs develop, they should focus on attracting students with varied backgrounds and degrees of preparation and preparing them for success in a variety of careers.

Finding 4.3: Institutional flexibility will involve the development of curricula that take advantage of current course availability and will potentially be constrained by the availability of teaching expertise. Whatever organizational or infrastructure model is adopted, incentives are needed to encourage faculty participation and to overcome barriers.

Finding 4.4: The economics of developing programs has recently changed with the shift to cloud-based approaches and platforms.

Finding 5.1: The evolution of data science programs at a particular institution will depend on the particular institution’s pedagogical style and the students’ backgrounds and goals, as well as the requirements of the job market and graduate schools.

Recommendation 5.1: Because these are early days for undergraduate data science education, academic institutions should be prepared to evolve programs over time. They should create and maintain the flexibility and incentives to facilitate the sharing of courses, materials, and faculty among departments and programs.

Finding 5.2: There is a need for broadening the perspective of faculty who are trained in particular areas of data science to be knowledgeable of the breadth of approaches to data science so that they can more effectively educate students at all levels.

Recommendation 5.2: During the development of data science programs, institutions should provide support so that the faculty can become more cognizant of the varied aspects of data science through discussion, co-teaching, sharing of materials, short courses, and other forms of training.

Finding 5.3: The data science community would benefit from the creation of websites and journals that document and make available best

practices, curricula, education research findings, and other materials related to undergraduate data science education.

Finding 5.4: The evolution of undergraduate education in data science can be driven by data science. Exploiting administrative records, in conjunction with other data sources such as economic information and survey data, can enable effective transformation of programs to better serve their students.

Finding 5.5: Data science methods applied both to individual programs and comparatively across programs can be used for both evaluation and evolution of data science program components. It is essential that both processes are sustained as new pathways emerge at institutions.

Recommendation 5.3: Academic institutions should ensure that programs are continuously evaluated and should work together to develop professional approaches to evaluation. This should include developing and sharing measurement and evaluation frameworks, data sets, and a culture of evolution guided by high-quality evaluation. Efforts should be made to establish relationships with sector-specific professional societies to help align education evaluation with market impacts.

Finding 5.6: As professional societies adapt to data science, improved coordination could offer new opportunities for additional collaboration and cross-pollination. A group or conference with bridging capabilities would be helpful. Professional societies may find it useful to collaborate to offer such training and networking opportunities to their joint communities.

Recommendation 5.4: Existing professional societies should coordinate to enable regular convening sessions on data science among their members. Peer review and discussion are essential to share ideas, best practices, and data.


Data science is emerging as a field that is revolutionizing science and industries alike. Work across nearly all domains is becoming more data driven, affecting both the jobs that are available and the skills that are required. As more data and ways of analyzing them become available, more aspects of the economy, society, and daily life will become dependent on data. It is imperative that educators, administrators, and students begin today to consider how to best prepare for and keep pace with this data-driven era of tomorrow. Undergraduate teaching, in particular, offers a critical link in offering more data science exposure to students and expanding the supply of data science talent.

Data Science for Undergraduates: Opportunities and Options offers a vision for the emerging discipline of data science at the undergraduate level. This report outlines some considerations and approaches for academic institutions and others in the broader data science communities to help guide the ongoing transformation of this field.



How To Write An Appealing Personal Statement For Masters Programme In Data Science


Published on July 31, 2020, by Sejuti Das


Besides submitting test scores, recommendation letters, and an undergraduate transcript, one essential requirement when applying to a data science master's programme is the application essay, also known as the personal statement. The personal statement is where applicants need to convince the professors of their ability and of their worthiness to be selected for the master's programme.

In fact, a personal statement often acts as a deciding factor in being chosen for a prestigious master's programme in the field of data science. Thus, one needs to be extremely careful while writing a personal statement for a master's application. Not only does it help the university authorities determine the applicant's sincere interest in enrolling in the course, but it also gives students a chance to stand out from the crowd by highlighting their skills and relevance.

That said, data scientists are experts in mathematics, but writing might not always be their forte, and a personal statement is usually longer than you think and needs to be well crafted to grab the attention of professors and administrators. So, if you have decided to pursue higher studies in the data science field and have narrowed down the universities to apply to, this article can help you write the winning personal statement required for a data science master's programme.


Planning Is The Key: Highlight The Reason To Study A Data Science Master's

Although it is called 'personal,' a personal statement doesn't require applicants to share the intimate details of their life; instead, it needs to highlight the applicant's intentions for the particular master's programme. To avoid confusion or mistakes, the first step in writing a personal statement for a data science master's programme is to brainstorm and plan before actually starting to write. It is helpful to make notes and use bullet points when planning, which can later be referred to while writing the personal statement. One should research the course requirements and the university thoroughly, and prepare a list of achievements and goals that can come in handy while writing the essay.

Most universities expect their applicants to adhere to a specific word limit for the personal statement, and thus good brainstorming will help applicants keep their essay relevant and to the point. Planning helps in setting the context, creating a structure, and forming the narrative of the piece, all of which are critical for drafting a compelling statement.


Have A Killer Intro & A Concise Conclusion: Relevant To The Passion For The Field

An attention-grabbing introduction and a hard-hitting conclusion are also critical for writing a compelling personal statement. The first paragraph creates the applicant's first impression in front of the professors, and a sharp ending will help them remember that candidate among the crowd. The readers of the personal statement are experts from the data science industry and academia; thus, they expect the write-up to be genuinely intriguing in terms of content.

Personal statements are usually lengthy but must be extremely clear in conveying their message. Rather than starting the essay with clichés, data science applicants should begin their personal statement by highlighting their passion for the stream and their domain proficiency. And to end on a definite note, applicants must convey their genuine interest in pursuing the master's programme and show how their skills are relevant to the stream.


Be A Good Story Teller: Highlight Experiences & Skill Sets

Thirdly, data science applicants must showcase their skill sets and experience in their personal statements without repeating information already mentioned in their application form. That is why it is critical to be a good storyteller, highlighting one's skills by talking about, say, a particular data science research project that helped solve a real-life problem. Applicants can also point out their experiences and knowledge, and quantify their expertise in the field, to support their case for pursuing further studies.

The job of the personal statement is to let the administrators and professors know that the applicant is qualified for the master's programme. Data scientists can also mention theses, publications, journals, or any relevant activities that can help them get selected. A well-crafted personal statement avoids clichés, jargon, and excessive detail, and is presented formally with a clear narrative.


Focus On Your Domain & The Programme

Unlike undergraduate courses, master's programmes are more specific and require applicants to understand the domain they are pursuing. Consequently, while writing a personal statement, one needs to align one's interests with the requirements of the programme and emphasise the specific skills that match the area of expertise. One can also network with relevant faculty members and seniors to get a better understanding of the programme's requirements.

Many universities also run ongoing data science projects, and citing one that corresponds to the applicant's interests can be a great addition to the personal statement. Furthermore, applicants can write about what inspired them to pursue this particular domain and how their work will contribute to the field. One can also share personal experiences and explain how they have motivated the decision to pursue this course.


Don’t Be Generic: Customise The Write Up For The Course

Lastly, it is critical that the essay is unique, and it therefore needs to be customised to each university and its requirements. Applicants don't have to start from scratch every time they apply to a university, but they must ensure that each application still receives a unique draft of the personal statement. Professors and administrators read huge numbers of personal statements every day, and therefore, to be unique, applicants cannot rely on generic content to build their essay.

Applicants can also make the personal statement unique by adding personal experiences relevant to the field, which not only makes for an interesting read but also allows the readers to empathise with the applicant. Mentioning a few failures can make the statement sound genuine and relatable. Since master’s programmes usually do not conduct face-to-face interviews, the personal statement plays a vital role in the admissions process.

Other Things To Keep In Mind

  • Personal statements are not university applications, so don’t be repetitive.
  • Highlight why this university is the right choice for the career you are planning to pursue.
  • Although a personal statement draws on life experiences, it still needs to be professional and to the point.
  • Avoid grammar, spelling, and punctuation errors.

Computer Science > Computers and Society

Title: Data Science: A Comprehensive Overview

Abstract: The twenty-first century has ushered in the age of big data and data economy, in which data DNA, which carries important knowledge, insights and potential, has become an intrinsic constituent of all data-based organisms. An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics. Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This paper provides a comprehensive survey and tutorial of the fundamental aspects of data science: the evolution from data analysis to data science, the data science concepts, a big picture of the era of data science, the major challenges and directions in data innovation, the nature of data analytics, new industrialization and service opportunities in the data economy, the profession and competency of data education, and the future of data science. This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons and thinking about data science and analytics.

Essay Questions

Our essay questions for the 2024-25 admissions cycle are changing and will be posted in late summer or early fall 2024, once finalized.

These short essays (TBA) are an opportunity to articulate your candidacy for the Master of Science in Data Science program at the University of Washington. The best essays are clear, succinct, thoughtful, well-written, and engaging. Your essays play an important role in our holistic admissions process, and we expect that they are your own original work.

Please check back in late summer/early fall for more information.

Admissions Timelines

Applications for Autumn 2024 admissions are now closed.

Information about Autumn 2025 applications will be available in October.

37 Research Topics In Data Science To Stay On Top Of

Stewart Kaplan

  • February 22, 2024

As a data scientist, staying on top of the latest research in your field is essential.

The data science landscape changes rapidly, and new techniques and tools are constantly being developed.

To keep up with the competition, you need to be aware of the latest trends and topics in data science research.

In this article, we will provide an overview of 37 hot research topics in data science.

We will discuss each topic in detail, including its significance and potential applications.

These topics could be an idea for a thesis or simply topics you can research independently.

Stay tuned – this is one blog post you don’t want to miss!

37 Research Topics in Data Science

1.) Predictive Modeling

Predictive modeling is a significant portion of data science and a topic you must be aware of.

Simply put, it is the process of using historical data to build models that can predict future outcomes.

Predictive modeling has many applications, from marketing and sales to financial forecasting and risk management.

As businesses increasingly rely on data to make decisions, predictive modeling is becoming more and more important.

While it can be complex, predictive modeling is a powerful tool that gives businesses a competitive advantage.
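
To give a flavor of what this looks like in practice, here is a minimal sketch with scikit-learn; the data is synthetic and purely illustrative, not a recipe for any particular business problem:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split

  # Synthetic "historical" data: 4 features, binary outcome.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 4))
  y = (X[:, 0] + X[:, 1] > 0).astype(int)

  # Train on the past, evaluate on held-out data.
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  model = LogisticRegression().fit(X_train, y_train)
  print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))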

2.) Big Data Analytics

These days, it seems like everyone is talking about big data.

And with good reason – organizations of all sizes are sitting on mountains of data, and they’re increasingly turning to data scientists to help them make sense of it all.

But what exactly is big data? And what does it mean for data science?

Simply put, big data is a term used to describe datasets that are too large and complex for traditional data processing techniques.

Big data typically refers to datasets of a few terabytes or more.

But size isn’t the only defining characteristic – big data is also characterized by its high Velocity (the speed at which data is generated), Variety (the different types of data), and Volume (the sheer amount of information).

Given the enormous scale of big data, it’s not surprising that organizations are struggling to make sense of it all.

That’s where data science comes in.

Data scientists use various methods to wrangle big data, including distributed computing and other decentralized technologies.

With the help of data science, organizations are beginning to unlock the hidden value in their big data.

By harnessing the power of big data analytics, they can improve their decision-making, better understand their customers, and develop new products and services.
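
As a small taste of the distributed tooling involved, here is a PySpark sketch; the file name events.csv and the column event_type are hypothetical placeholders, and this assumes a working Spark installation:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("big-data-demo").getOrCreate()
  # "events.csv" and the "event_type" column are hypothetical examples.
  df = spark.read.csv("events.csv", header=True, inferSchema=True)
  df.groupBy("event_type").count().orderBy("count", ascending=False).show()
  spark.stop()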

3.) Auto Machine Learning

Auto machine learning is a research topic in data science concerned with developing algorithms that can automatically learn from data without intervention.

This area of research is vital because it spares data scientists from hand-writing modeling code for every new dataset.

This allows us to focus on other tasks, such as model selection and validation.

Auto machine learning algorithms can learn from data in a hands-off way for the data scientist – while still providing incredible insights.

This makes them a valuable tool for data scientists who lack deep modeling expertise or are short on time.
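
Full AutoML systems go much further, but automated hyperparameter search gives a small taste of the idea. A minimal sketch using scikit-learn’s GridSearchCV on a toy dataset:

  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import GridSearchCV

  X, y = load_iris(return_X_y=True)
  # Automatically try each hyperparameter combination with cross-validation.
  search = GridSearchCV(
      RandomForestClassifier(random_state=0),
      param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
      cv=5,
  )
  search.fit(X, y)
  print(search.best_params_, round(search.best_score_, 3))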

4.) Text Mining

Text mining is a research topic in data science that deals with extracting information from text data.

This area of research is important because it allows us to get as much information as possible from the vast amount of text data available today.

Text mining techniques can extract information from text data, such as keywords, sentiments, and relationships.

This information can be used for various purposes, such as model building and predictive analytics.
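
For instance, keyword extraction can be roughly approximated with TF-IDF weights; a toy sketch with scikit-learn, where the two-document corpus is invented for illustration:

  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = [
      "Data science extracts insight from raw data",
      "Text mining pulls keywords and sentiment out of documents",
  ]
  vec = TfidfVectorizer(stop_words="english")
  tfidf = vec.fit_transform(docs)
  terms = vec.get_feature_names_out()
  for row in tfidf.toarray():
      # The highest-weighted terms act as crude keywords for each document.
      print([terms[i] for i in row.argsort()[::-1][:3]])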

5.) Natural Language Processing

Natural language processing is a data science research topic that analyzes human language data.

This area of research is important because it allows us to understand and make sense of the vast amount of text data available today.

Natural language processing techniques can build predictive and interactive models from any language data.

Natural language processing is pretty broad, and recent advances like GPT-3 have pushed the topic to the forefront.
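
As a tiny illustration, a pretrained sentiment model can be applied in a few lines, assuming the Hugging Face transformers package and a backend such as PyTorch are installed (the first call downloads a default model):

  from transformers import pipeline

  classifier = pipeline("sentiment-analysis")  # downloads a default model
  print(classifier("Data science is a fantastic career choice!"))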

6.) Recommender Systems

Recommender systems are an exciting topic in data science because they allow us to make better products, services, and content recommendations.

Businesses can better understand their customers and their needs by using recommender systems.

This, in turn, allows them to develop better products and services that meet the needs of their customers.

Recommender systems are also used to recommend content to users.

This can be done on an individual level or at a group level.

Think about Netflix, for example, always knowing what you want to watch!

Recommender systems are a valuable tool for businesses and users alike.
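
A minimal item-based recommender can be sketched with cosine similarity over a toy ratings matrix; all numbers are illustrative, and a real system would also mask items the user has already rated:

  import numpy as np
  from sklearn.metrics.pairwise import cosine_similarity

  # Rows are users, columns are items; 0 means "not rated".
  ratings = np.array([
      [5, 4, 0, 1],
      [4, 5, 1, 0],
      [1, 0, 5, 4],
  ])
  item_sim = cosine_similarity(ratings.T)  # item-to-item similarity
  scores = ratings @ item_sim              # predicted affinity per user/item
  print("item ranking for user 0:", np.argsort(scores[0])[::-1])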

7.) Deep Learning

Deep learning is a research topic in data science that deals with artificial neural networks.

These networks are composed of multiple layers, and each layer is formed from various nodes.

Deep learning networks can learn rich patterns directly from raw data, much as humans do, with relatively few assumptions about how that data is distributed.

This makes them a valuable tool for data scientists looking to build models that can learn from data independently.

Deep learning has become very popular in recent years because of its ability to achieve state-of-the-art results on a wide variety of tasks.

There seems to be a new SOTA deep learning algorithm research paper on  https://arxiv.org/  every single day!
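
As a minimal sketch of the layered idea, here is a small two-hidden-layer network using scikit-learn; serious deep learning work typically uses PyTorch or TensorFlow, but the structure of stacked layers of nodes is the same:

  from sklearn.datasets import make_moons
  from sklearn.neural_network import MLPClassifier

  X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
  # Two hidden layers of 16 nodes each, trained on a nonlinear toy problem.
  net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
  net.fit(X, y)
  print("training accuracy:", net.score(X, y))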

8.) Reinforcement Learning

Reinforcement learning is a research topic in data science that deals with algorithms that learn from interactions with their environment.

This area of research is essential because it allows us to develop algorithms that learn non-greedy approaches to decision-making, helping businesses and companies win in the long term rather than just the short term.
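
A classic illustration is tabular Q-learning; the five-state chain below is a toy environment invented for this sketch, where the agent must learn that stepping right eventually pays off:

  import numpy as np

  n_states, n_actions = 5, 2  # actions: 0 = step left, 1 = step right
  Q = np.zeros((n_states, n_actions))
  alpha, gamma = 0.5, 0.9
  rng = np.random.default_rng(0)

  for _ in range(300):  # episodes
      s = 0
      while s != n_states - 1:
          a = rng.integers(n_actions)             # explore at random
          s2 = max(0, s - 1) if a == 0 else s + 1
          r = 1.0 if s2 == n_states - 1 else 0.0  # reward only at the goal
          # Q-learning update: bootstrap from the best next-state value.
          Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
          s = s2

  # Learned policy: action 1 (right) in every non-terminal state.
  print(Q.argmax(axis=1))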

9.) Data Visualization

Data visualization is an excellent research topic in data science because it allows us to see our data in a way that is easy to understand.

Data visualization techniques can be used to create charts, graphs, and other visual representations of data.

This allows us to see the patterns and trends hidden in our data.

Data visualization is also used to communicate results to others.

This allows us to share our findings with others in a way that is easy to understand.

There are many ways to contribute to and learn about data visualization.

Some ways include attending conferences, reading papers, and contributing to open-source projects.
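
A minimal matplotlib sketch of the idea, plotting noisy synthetic observations against the underlying trend they hide:

  import matplotlib.pyplot as plt
  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0, 10, 200)
  y = 0.5 * x + rng.normal(scale=1.0, size=x.size)  # noisy observations

  plt.scatter(x, y, s=8, alpha=0.5, label="observations")
  plt.plot(x, 0.5 * x, color="red", label="underlying trend")
  plt.xlabel("x")
  plt.ylabel("y")
  plt.legend()
  plt.savefig("trend.png")  # or plt.show() in an interactive session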

10.) Predictive Maintenance

Predictive maintenance is a hot topic in data science because it allows us to prevent failures before they happen.

This is done using data analytics to predict when a failure will occur.

This allows us to take corrective action before the failure actually happens.

While this sounds simple, avoiding false positives while keeping recall high is challenging, and the area is wide open for advancement.
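
To see that trade-off concretely, here is an illustrative threshold sweep over synthetic failure scores; the numbers are made up, but the precision/recall tension is real:

  import numpy as np
  from sklearn.metrics import precision_score, recall_score

  rng = np.random.default_rng(0)
  y_true = rng.random(1000) < 0.05                # ~5% of machines fail
  scores = y_true * 0.3 + rng.random(1000) * 0.7  # imperfect failure scores

  for threshold in (0.3, 0.5, 0.7):
      y_pred = scores > threshold  # alert when the score exceeds the threshold
      print(threshold,
            "precision:", round(precision_score(y_true, y_pred), 2),
            "recall:", round(recall_score(y_true, y_pred), 2))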

11.) Financial Analysis

Financial analysis is an older topic that has been around for a while, but it is still a field where new contributions can make a real difference.

Current researchers are focused on analyzing macroeconomic data to make better financial decisions.

This is done by analyzing the data to identify trends and patterns.

Financial analysts can use this information to make informed decisions about where to invest their money.

Financial analysis is also used to predict future economic trends.

This allows businesses and individuals to prepare for potential financial hardship, and enables companies to build cash reserves during good economic conditions.

Overall, financial analysis is a valuable tool for anyone looking to make better financial decisions.

12.) Image Recognition

Image recognition is one of the hottest topics in data science because it allows us to identify objects in images.

This is done using artificial intelligence algorithms that can learn from data and understand what objects you’re looking for.

This allows us to build models that can accurately recognize objects in images and video.

This is a valuable tool for businesses and individuals who want to be able to identify objects in images.

Think about security, identification, routing, traffic, etc.

Image recognition has gained a ton of momentum recently – for good reason.

13.) Fraud Detection

Fraud detection is a great topic in data science because it allows us to identify fraudulent activity before it happens.

This is done by analyzing data to look for patterns and trends that may be associated with fraud.

Once our machine learning model recognizes some of these patterns in real time, it immediately detects fraud.

This allows us to take corrective action before the fraud actually happens.

Fraud detection is a valuable tool for anyone who wants to protect themselves from potential fraudulent activity.
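
One common unsupervised approach is an isolation forest; a minimal sketch on synthetic transactions, where the data and contamination rate are purely illustrative:

  import numpy as np
  from sklearn.ensemble import IsolationForest

  rng = np.random.default_rng(0)
  normal = rng.normal(loc=50, scale=10, size=(980, 2))  # typical transactions
  fraud = rng.normal(loc=150, scale=5, size=(20, 2))    # outlying transactions
  X = np.vstack([normal, fraud])

  detector = IsolationForest(contamination=0.02, random_state=0).fit(X)
  flags = detector.predict(X)  # -1 = anomaly, 1 = normal
  print("transactions flagged:", int((flags == -1).sum()))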

14.) Web Scraping

Web scraping is a controversial topic in data science because it allows us to collect data from the web, which is usually data you do not own.

This is done by extracting data from websites using scraping tools that are usually custom-programmed.

This allows us to collect data that would otherwise be inaccessible.

For obvious reasons, web scraping is a unique tool – giving you data your competitors would have no chance of getting.

I think there is an excellent opportunity to create new and innovative ways to make scraping accessible for everyone, not just those who understand Selenium and Beautiful Soup.
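
A minimal scraping sketch with requests and Beautiful Soup; https://example.com is a placeholder target, and any real site’s terms of service and robots.txt should be checked before scraping:

  import requests
  from bs4 import BeautifulSoup

  # example.com is a placeholder; substitute a page you are allowed to scrape.
  html = requests.get("https://example.com", timeout=10).text
  soup = BeautifulSoup(html, "html.parser")
  for heading in soup.find_all(["h1", "h2"]):
      print(heading.get_text(strip=True))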

15.) Social Media Analysis

Social media analysis is not new; many people have already created exciting and innovative algorithms to study this.

However, it is still a great data science research topic because it allows us to understand how people interact on social media.

This is done by analyzing data from social media platforms to look for insights, bots, and recent societal trends.

Once we understand these practices, we can use this information to improve our marketing efforts.

For example, if we know that a particular demographic prefers a specific type of content, we can create more content that appeals to them.

Social media analysis is also used to understand how people interact with brands on social media.

This allows businesses to understand better what their customers want and need.

Overall, social media analysis is valuable for anyone who wants to improve their marketing efforts or understand how customers interact with brands.

16.) GPU Computing

GPU computing is a fun new research topic in data science because it allows us to process data much faster than traditional CPUs.

Due to how GPUs are made, they’re incredibly proficient at intense matrix operations, outperforming traditional CPUs by very high margins.

While the computation is fast, the coding is still tricky.

There is an excellent research opportunity in bringing these innovations to libraries beyond deep learning, allowing the rest of data science to take advantage of GPU computing.
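
As a sketch of the CPU-to-GPU move, here is the same matrix multiply in NumPy and CuPy; this assumes the cupy package and a CUDA-capable GPU are available:

  import cupy as cp  # assumes cupy and a CUDA-capable GPU
  import numpy as np

  a = np.random.rand(2000, 2000)
  b = np.random.rand(2000, 2000)
  c_cpu = a @ b  # matrix multiply on the CPU

  a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)  # copy data to GPU memory
  c_gpu = a_gpu @ b_gpu                        # the same multiply on the GPU
  print(bool(cp.allclose(c_gpu, cp.asarray(c_cpu))))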

17.) Quantum Computing

Quantum computing is a new research topic in data science and physics because it allows us to process data much faster than traditional computers.

It also opens the door to new types of data.

There are simply some problems that cannot be solved with a classical computer.

For example, if you wanted to model how a single atom moves around, a classical computer couldn’t handle the problem.

You’ll need to utilize a quantum computer to handle quantum mechanics problems.

This may be the “hottest” research topic on the planet right now, with some of the top researchers in computer science and physics worldwide working on it.

You could be too.

18.) Genomics

Genomics may be the only research topic that can compete with quantum computing regarding the “number of top researchers working on it.”

Genomics is a fantastic intersection of data science because it allows us to understand how genes work.

This is done by sequencing the DNA of different organisms to look for insights into our own and other species.

Once we understand these patterns, we can use this information to improve our understanding of diseases and create new and innovative treatments for them.

Genomics is also used to study the evolution of different species.

Genomics is the future and a field begging for new and exciting research professionals to take it to the next step.

19.) Location-based services

Location-based services are an old and time-tested research topic in data science.

Since GPS and 4G cell phone reception became a thing, we’ve been trying to stay informed about how humans interact with their environment.

This is done by analyzing data from GPS tracking devices, cell phone towers, and Wi-Fi routers to look for insights into how humans interact.

Once we understand these practices, we can use this information to improve our geotargeting efforts, improve maps, find faster routes, and improve cohesion throughout a community.

Location-based services are used to understand the user, something every business could always use a little bit more of.

While a seemingly “stale” field, location-based services have seen a revival period with self-driving cars.

20.) Smart City Applications

Smart city applications are all the rage in data science research right now.

By harnessing the power of data, cities can become more efficient and sustainable.

But what exactly are smart city applications?

In short, they are systems that use data to improve city infrastructure and services.

This can include anything from traffic management and energy use to waste management and public safety.

Data is collected from various sources, including sensors, cameras, and social media.

It is then analyzed to identify tendencies and habits.

This information can make predictions about future needs and optimize city resources.

As more and more cities strive to become “smart,” the demand for data scientists with expertise in smart city applications is only growing.

21.) Internet Of Things (IoT)

The Internet of Things, or IoT, is an exciting new research topic in data science and sustainability.

IoT is a network of physical objects embedded with sensors and connected to the internet.

These objects can include everything from alarm clocks to refrigerators; they’re all connected to the internet.

That means that they can share data with computers.

And that’s where data science comes in.

Data scientists are using IoT data to learn everything from how people use energy to how traffic flows through a city.

They’re also using IoT data to predict when an appliance will break down or when a road will be congested.

Really, the possibilities are endless.

With such a wide-open field, it’s easy to see why IoT is being researched by some of the top professionals in the world.

22.) Cybersecurity

Cybersecurity is a relatively new research topic in data science and in general, but it’s already garnering a lot of attention from businesses and organizations.

After all, with the increasing number of cyber attacks in recent years, it’s clear that we need to find better ways to protect our data.

While most cybersecurity work focuses on infrastructure, data scientists can leverage historical events to find potential exploits and protect their companies.

Sometimes, looking at a problem from a different angle helps, and that’s what data science brings to cybersecurity.

Also, data science can help to develop new security technologies and protocols.

As a result, cybersecurity is a crucial data science research area and one that will only become more important in the years to come.

23.) Blockchain

Blockchain is an incredible new research topic in data science for several reasons.

First, it is a distributed database technology that enables secure, transparent, and tamper-proof transactions.

Did someone say transmitting data?

This makes it an ideal platform for tracking data and transactions in various industries.

Second, blockchain is powered by cryptography, which not only makes it highly secure – but is a familiar foe for data scientists.

Finally, blockchain is still in its early stages of development, so there is much room for research and innovation.

As a result, blockchain is a great new research topic in data science, one that promises to revolutionize how we store, transmit, and manage data.

24.) Sustainability

Sustainability is a relatively new research topic in data science, but it is gaining traction quickly.

To keep up with this demand, The Wharton School of the University of Pennsylvania has started to offer an MBA in Sustainability.

This demand isn’t shocking, and some of the reasons include the following:

  • Sustainability is an important issue that is relevant to everyone.
  • Datasets on sustainability are constantly growing and changing, making it an exciting challenge for data scientists.
  • There hasn’t been a “set way” to approach sustainability from a data perspective, making it an excellent opportunity for interdisciplinary research.

As data science grows, sustainability will likely become an increasingly important research topic.

25.) Educational Data

Education has always been a great topic for research, and with the advent of big data, educational data has become an even richer source of information.

By studying educational data, researchers can gain insights into how students learn, what motivates them, and what barriers these students may face.

In addition, data science can be used to develop educational interventions tailored to individual students’ needs.

Imagine being the researcher that helps that high schooler pass mathematics; what an incredible feeling.

With the increasing availability of educational data, data science has enormous potential to improve the quality of education.

26.) Politics

As data science continues to evolve, so does the scope of its applications.

Originally used primarily for business intelligence and marketing, data science is now applied to various fields, including politics.

By analyzing large data sets, political scientists (data scientists with a cooler name) can gain valuable insights into voting patterns, campaign strategies, and more.

Further, data science can be used to forecast election results and understand the effects of political events on public opinion.

With the wealth of data available, there is no shortage of research opportunities in this field.

As data science evolves, so does our understanding of politics and its role in our world.

27.) Cloud Technologies

Cloud technologies are a great research topic.

Cloud computing allows for the outsourcing and sharing of computing resources and applications over the internet.

This lets organizations save money on hardware and maintenance costs while providing employees access to the latest and greatest software and applications.

I believe there is an argument that AWS could be the greatest and most technologically advanced business ever built (Yes, I know it’s only part of the company).

Besides, cloud technologies can improve collaboration among team members by allowing them to share files and work on projects together in real time.

As more businesses adopt cloud technologies, data scientists must stay up-to-date on the latest trends in this area.

By researching cloud technologies, data scientists can help organizations to make the most of this new and exciting technology.

28.) Robotics

Robotics has recently become a household name, and it’s for a good reason.

First, robotics deals with controlling and planning physical systems, an inherently complex problem.

Second, robotics requires various sensors and actuators to interact with the world, making it an ideal application for machine learning techniques.

Finally, robotics is an interdisciplinary field that draws on various disciplines, such as computer science, mechanical engineering, and electrical engineering.

As a result, robotics is a rich source of research problems for data scientists.

29.) Healthcare

Healthcare is an industry that is ripe for data-driven innovation.

Hospitals, clinics, and health insurance companies generate a tremendous amount of data daily.

This data can be used to improve the quality of care and outcomes for patients.

This is perfect timing, as the healthcare industry is undergoing a significant shift towards value-based care, which means there is a greater need than ever for data-driven decision-making.

As a result, healthcare is an exciting new research topic for data scientists.

There are many different ways in which data can be used to improve healthcare, and there is a ton of room for newcomers to make discoveries.

30.) Remote Work

There’s no doubt that remote work is on the rise.

In today’s global economy, more and more businesses are allowing their employees to work from home or anywhere else they can get a stable internet connection.

But what does this mean for data science? Well, for one thing, it opens up a whole new field of research.

For example, how does remote work impact employee productivity?

What are the best ways to manage and collaborate on data science projects when team members are spread across the globe?

And what are the cybersecurity risks associated with working remotely?

These are just a few of the questions that data scientists will be able to answer with further research.

So if you’re looking for a new topic to sink your teeth into, remote work in data science is a great option.

31.) Data-Driven Journalism

Data-driven journalism is an exciting new field of research that combines the best of both worlds: the rigor of data science with the creativity of journalism.

By applying data analytics to large datasets, journalists can uncover stories that would otherwise be hidden.

And telling these stories compellingly can help people better understand the world around them.

Data-driven journalism is still in its infancy, but it has already had a major impact on how news is reported.

In the future, it will only become more important as data becomes increasingly accessible to journalists.

It is an exciting new topic and research field for data scientists to explore.

32.) Data Engineering

Data engineering is a staple in data science, focusing on efficiently managing data.

Data engineers are responsible for developing and maintaining the systems that collect, process, and store data.

In recent years, there has been an increasing demand for data engineers as the volume of data generated by businesses and organizations has grown exponentially.

Data engineers must be able to design and implement efficient data-processing pipelines and have the skills to optimize and troubleshoot existing systems.

If you are looking for a challenging research topic whose impact would be felt around the world, then improving on or inventing new approaches in data engineering is a good place to start.

33.) Data Curation

Data curation has been a hot topic in the data science community for some time now.

Curating data involves organizing, managing, and preserving data so researchers can use it.

Data curation can help to ensure that data is accurate, reliable, and accessible.

It can also help to prevent research duplication and to facilitate the sharing of data between researchers.

Data curation is a vital part of data science. In recent years, there has been an increasing focus on data curation, as it has become clear that it is essential for ensuring data quality.

As a result, data curation is now a major research topic in data science.

There are numerous books and articles on the subject, and many universities offer courses on data curation.

Data curation is an integral part of data science and will only become more important in the future.

34.) Meta-Learning

Meta-learning is gaining a ton of steam in data science. It’s learning how to learn.

So, if you can learn how to learn, you can learn anything much faster.

Meta-learning is mainly used in deep learning, as applications outside of this are generally pretty hard.

In deep learning, many parameters need to be tuned for a good model, and there’s usually a lot of data.

You can save time and effort if you can automatically and quickly do this tuning.

In machine learning, meta-learning can improve models’ performance by sharing knowledge between different models.

For example, if you have a bunch of different models that all solve the same problem, you can use meta-learning to share knowledge between them and improve the overall performance of the group.

I don’t know how anyone looking for a research topic could stay away from this field; it’s what the  Terminator  warned us about!

35.) Data Warehousing

A data warehouse is a system used for data analysis and reporting.

It is a central data repository created by combining data from multiple sources.

Data warehouses are often used to store historical data, such as sales data, financial data, and customer data.

This data type can be used to create reports and perform statistical analysis.

Data warehouses also store data that the organization is not currently using.

This type of data can be used for future research projects.

Data warehousing is an incredible research topic in data science because it offers a variety of benefits.

Data warehouses help organizations to save time and money by reducing the need for manual data entry.

They also help to improve the accuracy of reports and provide a complete picture of the organization’s performance.

Data warehousing feels like one of the weakest parts of the Data Science Technology Stack; if you want a research topic that could have a monumental impact – data warehousing is an excellent place to look.
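
As a toy sketch of the warehouse idea, Python’s built-in sqlite3 module can stand in for a central repository that combines records and answers aggregate queries; the table and values here are invented for illustration:

  import sqlite3

  con = sqlite3.connect(":memory:")  # stand-in for a real warehouse
  con.execute("CREATE TABLE sales (region TEXT, year INT, amount REAL)")
  con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
      ("north", 2022, 120.0), ("north", 2023, 150.0),
      ("south", 2022, 90.0), ("south", 2023, 110.0),
  ])
  # Historical data from multiple sources can now be queried in one place.
  for row in con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
      print(row)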

36.) Business Intelligence

Business intelligence aims to collect, process, and analyze data to help businesses make better decisions.

Business intelligence can improve marketing, sales, customer service, and operations.

It can also be used to identify new business opportunities and track competition.

At its core, BI is another tool in your company’s toolbox for staying ahead in your market.

Data science is the perfect tool for business intelligence because it combines statistics, computer science, and machine learning.

Data scientists can use business intelligence to answer questions like, “What are our customers buying?” or “What are our competitors doing?” or “How can we increase sales?”

Business intelligence is a great way to improve your business’s bottom line and an excellent opportunity to dive deep into a well-respected research topic.
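
Questions like “What are our customers buying?” often reduce to simple aggregations; a pandas sketch over a hypothetical sales table:

  import pandas as pd

  sales = pd.DataFrame({  # hypothetical sales records
      "customer": ["a", "a", "b", "b", "c"],
      "product": ["widget", "gadget", "widget", "widget", "gadget"],
      "revenue": [10.0, 25.0, 10.0, 10.0, 25.0],
  })
  print(sales.groupby("product")["revenue"].agg(["count", "sum"]))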

37.) Crowdsourcing

One of the newest areas of research in data science is crowdsourcing.

Crowdsourcing is a process of sourcing tasks or projects to a large group of people, typically via the internet.

This can be done for various purposes, such as gathering data, developing new algorithms, or even just for fun (think: online quizzes and surveys).

But what makes crowdsourcing so powerful is that it allows businesses and organizations to tap into a vast pool of talent and resources they wouldn’t otherwise have access to.

And with the rise of social media, it’s easier than ever to connect with potential crowdsource workers worldwide.

Imagine if you could shape that process, finding innovative ways to improve how people work together.

That would have a huge effect.

Final Thoughts: Are These Research Topics In Data Science For You?

Thirty-seven different research topics in data science are a lot to take in, but we hope you found a research topic that interests you.

If not, don’t worry – there are plenty of other great topics to explore.

The important thing is to get started with your research and find ways to apply what you learn to real-world problems.

We wish you the best of luck as you begin your data science journey!

Other Data Science Articles

We love talking about data science; here are a couple of our favorite articles:

  • Why Are You Interested In Data Science?

Essay: Data science and its applications

With vast amounts of data now available, organizations in almost every industry are focused on exploiting data for competitive advantage. The broad availability of data has also led to increasing interest in new methods for extracting useful information and knowledge from it. A new discipline, data science, has arisen as a paradigm for tackling this vast accumulation of data. Today it is applied in almost every field, notably security, health care, business, agriculture, transport, education, prediction, and telecommunication, and each area garners a different return on its data science investment. This review provides an overview of data science, how some of these fields currently use it, and how they could leverage it in their favor in the future.

What is Data Science?

According to Dhar, V. (2012), the fact that we now have vast amounts of data should not in and of itself justify the need for a new term, “data science”. It is well known that statistics has been used to extract information from data for decades. Nevertheless, there are good reasons to consider data science a new field. First, the raw material, the “data” part of data science, is increasingly diverse and unstructured: text, images, and video, frequently arising from networks with complex relationships among their entities. Dhar cites the relative expected volumes of unstructured and structured data between 2008 and 2015, projecting a difference of almost 200 petabytes in 2015 compared with a difference of 50 petabytes in 2012. Second, markup languages, tags, and the like are designed to let computers interpret data automatically, making them active agents in decision making (Dhar, V. (2012)); in other words, computers increasingly do the background work for each other.

In their proceedings paper, Zhu, Y. & Xiong, Y. (2015) argue that the research objects, goals, and techniques of data science are essentially different from those of computer science, information science, or knowledge science. Throughout the paper they use a comparative method to discuss how data science differs from existing technologies and established sciences. In their view, data science supports both natural science and social science, and dealing with data is one of its driving forces; hence they describe data science as a data-intensive science. It is evident that data science should be considered a new science, and that new techniques and methods are needed to deal with its vast amounts of data. Dhar, V. (2012) explains clearly why traditional database methods are not suited to knowledge discovery: they are optimized for summarizing data when the user knows what to ask, not for discovering patterns in massive amounts of data when the user has no well-formulated query. Unlike database querying, which asks “what data satisfy this pattern?”, discovery asks “what patterns satisfy the given data?”. The ultimate goal of data science is to find interesting and robust patterns that satisfy the data.

As new technologies emerged, they led to research on the data themselves. Fields such as health care and business took advantage of this: they were able to discover growth patterns in data and predict the scale of data in cyberspace ten years into the future (Zhu, Y. & Xiong, Y. (2015)). This has led to the discovery of new theories and inventions that had gone uncovered for years; many health-related issues, for example, have been identified and solved using big data analytics.

Applications of Data Science

1) Health care

A key contemporary trend emerging in big data science is the quantified self (QS): individuals engaged in the self-tracking of any kind of biological, physical, behavioral, or environmental information, as n = 1 individuals or in groups (Swan, M. (2013)). Swan emphasizes that quantified-self projects are becoming an interesting data management and manipulation challenge for big data science in the areas of data collection, integration, and analysis. At the same time, she notes that as much larger QS data sets are generated, the quantified self, and health and biology more generally, are becoming full-fledged big data problems in many ways. A variety of self-tracking projects have been conducted recently, including food visualization, investigations of media consumption and reading habits, multilayer investigations into diabetes and heart disease risk, and idea tracking. These projects demonstrate the range of topics, depth of problem solving, and variety of methodologies of QS projects. Big health data streams are the main data stream in QS, and the most difficult task is integrating them, especially blending genomic and environmental data. Genetics has been found to contribute roughly one third of the outcome for diseases like cancer and heart disease. Projects such as DIY genomics studies, 4P personalized medicine, Crohn’s disease tracking with microbiomic sequencing, the lactoferrin analysis project, and the thyroid hormone testing project are well-known examples of QS applications in genomics. These findings could not have been made if data science did not exist, or if tools capable of handling such large volumes of data had not been invented. Swan, M. (2013) further suggests that QS data streams need to be linked to longitudinal self-tracking of healthy populations more generally, as these are the healthy cohorts corresponding to patient cohorts in clinical trials, and she predicts that eye tracking and emotion measurement could be coming in the future.

X. Shi and S. Wang (2015) provide an overview of the theoretical background for applying the cyberGIS approach (geographic information systems and science based on advanced cyberinfrastructure) to spatial analysis for health studies. As spatial analysis is a tool for analyzing big data, it is widely used in medical fields. According to them, reviews of the literature find that a majority of methods use only geographically local information and generate non-parametric measurements. There are multiple cases where computational and data sciences are central to solving challenging problems in the health-GIS framework. Disease mapping is one of the major areas; it is used to measure the intensity of a disease in a particular area. Data aggregation is a method developed to deal with cancer-registry and birth-defect databases. These applications are not limited to disease-based assessments; they also cover environmental factors associated with health, such as disparities in geographic access to health care. X. Shi and S. Wang (2015) mention a study that estimated distances or travel times from patients’ locations represented by polygon-level data. In conclusion, the two articles above show that health care is a field with a lot of untapped potential, and that the use of big data and data science is not limited to finding remedies for diseases but also extends to factors such as the efficiency of health care delivery.

2) Social media and networks

Swan, M. (2013) points out that having large data quantities continues to enable new methods and discovery. As Google has proved, finally having large enough data sets was the key moment for progress in many venues, where simple machine-learning algorithms run over large amounts of data could produce significant results. She illustrates this with Google’s spelling correction and language translation, image recognition, and cultural anthropology via word searches over a database of 5 million digitally scanned books. Dhar, V. (2012) also addresses this point, explaining that Google’s language translator does not “understand” language, nor do its algorithms know the contents of webpages. Such efficient and accurate systems were built with machine-learning algorithms that do not tackle the problem through exhaustive enumeration of possibilities but rather “train” a computer to interpret questions correctly based on large numbers of examples. He adds that knowledge of text processing, or “text mining”, is becoming essential in light of the explosion of text and other unstructured data in healthcare systems, social networks, and other sectors.

3) Education

Data science has been attracting a great deal of attention in academia and in environments that deal with theories and formulas. It improves current methods for scientific research, forming new methods and improving specific theories, methods, and technologies in various fields (Zhu, Y. & Xiong, Y. (2015)). The vast accumulation of data provides the opportunity to filter out the considerable portion that is useful for a particular purpose, and it provides a great platform for researching rare and important questions in any field. At the same time, Zhu and Xiong argue that data science itself requires more fundamental theories and new methods and techniques: for example, the existence of data, the measurement of data, time in cyberspace, data algebra, data similarity and the theory of clusters, and data classification. New action plans, conferences, workshops, data science journals, institutes dedicated to data science, and university courses in data science will all increase awareness and understanding of this new science.

4) Business

Business is one of the major sectors that benefits from data science principles and data mining techniques. Data mining is widely used in marketing for tasks such as targeted marketing, online advertising, and recommendations for cross-selling (Provost, F. & Fawcett, T. (2013)). According to them, data science is mainly used in business with the objective of improving decision making. They identify two types of decisions: (1) decisions for which “discoveries” need to be made within data, and (2) decisions that repeat, especially at massive scale. Provost, F. & Fawcett, T. (2013) explain clearly how companies try to grow their customer base using a data science approach, giving the example of a company called Target, which sells baby-related products. To attract more customers, Target wanted to predict in advance which people were expecting a baby, so it could make offers to them before its competitors. Most birth records are public, so retailers obtain that information and tell new parents about their offers; a retailer that learns a baby is on the way before the birth therefore gains an advantage in its marketing campaign. Using data science techniques, Target analyzed historical data on customers and identified groups of customers who were later revealed to have been pregnant; pregnancy can be predicted from changes in a mother’s diet, vitamin regimens, and so on. According to Provost, F. & Fawcett, T. (2013), the banking sector also takes advantage of data science: banks have been able to do more sophisticated predictive modeling on pricing, credit limits, low-initial-rate balance transfers, cash back, and loyalty points. In particular, the credit card system is itself an outcome of big data analytics. They further claim that banks with bigger data assets may have an important strategic advantage over their smaller competitors: the net result is increased acceptance of the bank’s products, decreased cost of customer acquisition, or both.

5) Telecommunication

Customer churn, customers switching from one company to another, is the most critical problem that service providers face (Provost, F. & Fawcett, T. (2013)). They state that attracting new customers is much more expensive than retaining existing ones, so every service provider tries to prevent churn by making retention offers. Data mining techniques are widely used to identify customers who are likely to churn.

Challenges and barriers

In this process, a great deal of personal data is stored about each individual, in every sector. With such vast amounts of personal data, there are boundaries and issues that researchers and data scientists must consider. In health data especially, many patients are not comfortable with sharing their data publicly (Swan, M. (2013)). In Swan’s opinion, it is necessary to think proactively about personal data privacy rights, and even neural data privacy rights, to facilitate humanity’s future directions in a mature, comfortable, and empowering way. Dhar, V. (2012) also addresses this matter: as technology develops, the computer becomes the decision maker, unaided by humans, and this raises a multitude of issues, such as the cost of incorrect decisions and ethical concerns.

Conclusion

It is evident that data science is a newly emerged science requiring broad knowledge of computational science, statistics, and mathematics. New technologies are emerging to deal with massive amounts of data in every field, and the benefits are many, ranging from health care to telecommunication. At the same time, data should be handled cautiously to ensure that respondents’ information is not exploited. In the near future, data science will yield many discoveries that help humans improve their lifestyle in every aspect.

  • CAREER FEATURE
  • 08 May 2024

Illuminating ‘the ugly side of science’: fresh incentives for reporting negative results

  • Rachel Brazil

Rachel Brazil is a freelance journalist in London, UK.

The editor-in-chief of the Journal of Trial & Error, Sarahanne Field wants to publish the messy, null and negative results sitting in researchers’ file drawers. Credit: Sander Martens

Editor-in-chief Sarahanne Field describes herself and her team at the Journal of Trial & Error as wanting to highlight the “ugly side of science — the parts of the process that have gone wrong”.

She clarifies that the editorial board of the journal, which launched in 2020, isn’t interested in papers in which “you did a shitty study and you found nothing. We’re interested in stuff that was done methodologically soundly, but still yielded a result that was unexpected.” These types of result — which do not prove a hypothesis or could yield unexplained outcomes — often simply go unpublished, explains Field, who is also an open-science researcher at the University of Groningen in the Netherlands. Along with Stefan Gaillard, one of the journal’s founders, she hopes to change that.

Calls for researchers to publish failed studies are not new. The ‘file-drawer problem’ — the stacks of unpublished, negative results that most researchers accumulate — was first described in 1979 by psychologist Robert Rosenthal. He argued that this leads to publication bias in the scientific record: the gap of missing unsuccessful results leads to overemphasis on the positive results that do get published.

Over the past 30 years, the proportion of negative results being published has decreased further. A 2012 study showed that, from 1990 to 2007, there was a 22% increase in positive conclusions in papers; by 2007, 85% of papers published had positive results [1]. “People fail to report [negative] results, because they know they won’t get published — and when people do attempt to publish them, they get rejected,” says Field. A 2022 survey of researchers in France in chemistry, physics, engineering and environmental sciences showed that, although 81% had produced relevant negative results and 75% were willing to publish them, only 12.5% had the opportunity to do so [2].

One factor that is leading some researchers to revisit the problem is the growing use of predictive modelling using machine-learning tools in many fields. These tools are trained on large data sets that are often derived from published work, and scientists have found that the absence of negative data in the literature is hampering the process. Without a concerted effort to publish more negative results that artificial intelligence (AI) can be trained on, the promise of the technology could be stifled.

“Machine learning is changing how we think about data,” says chemist Keisuke Takahashi at Hokkaido University in Japan, who has brought the issue to the attention of the catalysis-research community. Scientists in the field have typically relied on a mixture of trial and error and serendipity in their experiments, but there is hope that AI could provide a new route for catalyst discovery. Takahashi and his colleagues mined data from 1,866 previous studies and patents to train a machine-learning model to predict the best catalyst for the reaction between methane and oxygen to form ethane and ethylene, both of which are important chemicals used in industry [3]. But, he says, “over the years, people have only collected the good data — if they fail, they don’t report it”. This led to a skewed model that, in some cases, enhanced the predicted performance of a material, rather than realistically assessing its properties.
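The skew Takahashi describes is easy to reproduce in a few lines of code. Below is a minimal Python sketch (synthetic data and a plain linear regression, not the study's actual catalyst model) showing that a model fitted only to the "good" outcomes overestimates yields across the board:

    # Minimal sketch (synthetic data): what happens when a model only ever
    # sees the successes. Not the catalyst model from the study.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=(500, 1))            # a made-up reaction descriptor
    y = 40 + 3 * x.ravel() + rng.normal(0, 15, 500)  # simulated yield (%) plus noise

    full = LinearRegression().fit(x, y)              # trained on every outcome
    good = y > 60                                    # keep only "publishable" results
    biased = LinearRegression().fit(x[good], y[good])

    x_new = np.array([[2.0], [5.0], [8.0]])
    print("full-data model:    ", full.predict(x_new).round(1))
    print("positive-only model:", biased.predict(x_new).round(1))

The positive-only model predicts higher yields everywhere, because the low-yield experiments it never saw are exactly the ones that would have pulled its estimates down.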

Synthetic organic chemist Felix Strieth-Kalthoff found that published data were too heavily biased toward positive results to effectively train an AI model to optimize chemical reaction yields. Credit: Cindy Huang

Alongside the flawed training of AI models, the huge gap of negative results in the scientific record continues to be a problem across all disciplines. In areas such as psychology and medicine, publication bias is one factor exacerbating the ongoing reproducibility crisis — in which many published studies are impossible to replicate. Without sharing negative studies and data, researchers could be doomed to repeat work that led nowhere. Many scientists are calling for changes in academic culture and practice — be it the creation of repositories that include positive and negative data, new publication formats or conferences aimed at discussing failure. The solutions are varied, but the message is the same: “To convey an accurate picture of the scientific process, then at least one of the components should be communicating all the results, [including] some negative results,” says Gaillard, “and even where you don’t end up with results, where it just goes wrong.”

Science’s messy side

Synthetic organic chemist Felix Strieth-Kalthoff, who is now setting up his own laboratory at the University of Wuppertal, Germany, has encountered positive-result bias when using data-driven approaches to optimize the yields of certain medicinal-chemistry reactions. His PhD work with chemist Frank Glorius at the University of Münster, Germany, involved creating models that could predict which reactants and conditions would maximize yields. Initially, he relied on data sets that he had generated from high-throughput experiments in the lab, which included results from both high- and low-yield reactions, to train his AI model. “Our next logical step was to do that based on the literature,” says Strieth-Kalthoff. This would allow him to curate a much larger data set to be used for training.

But when he incorporated real data from the reactions database Reaxys into the training process, he says, “[it] turned out they don’t really work at all”. Strieth-Kalthoff concluded the errors were due to the lack of low-yield reactions [4]: “All of the data that we see in the literature have average yields of 60–80%.” Without learning from the messy ‘failed’ experiments with low yields that were present in the initial real-life data, the AI could not model realistic reaction outcomes.

Although AI has the potential to spot relationships in complex data that a researcher might not see, encountering negative results can give experimentalists a gut feeling, says molecular modeller Berend Smit at the Swiss Federal Institute of Technology Lausanne. The usual failures that every chemist experiences at the bench give them a ‘chemical intuition’ that AI models trained only on successful data lack.

Smit and his team attempted to embed something similar to this human intuition into a model tasked with designing a metal-organic framework (MOF) with the largest known surface area for this type of material. A large surface area allows these porous materials to be used as reaction supports or molecular storage reservoirs. “If the binding [between components] is too strong, it becomes amorphous; if the binding is too weak, it becomes unstable, so you need to find the sweet spot,” Smit says. He showed that training the machine-learning model on both successful and unsuccessful reaction conditions created better predictions and ultimately led to one that successfully optimized the MOF [5]. “When we saw the results, we thought, ‘Wow, this is the chemical intuition we’re talking about!’” he says.

According to Strieth-Kalthoff, AI models are currently limited because “the data that are out there just do not reflect all of our knowledge”. Some researchers have sought statistical solutions to fill the negative-data gap. Techniques include oversampling, which means supplementing data with several copies of existing negative data or creating artificial data points, for example by including reactions with a yield of zero. But, he says, these types of approach can introduce their own biases.
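To make the oversampling idea concrete, here is a rough, self-contained Python sketch of the two fixes mentioned above (the numbers and the 'yield' framing are invented for illustration; this is not Strieth-Kalthoff's pipeline). Each fix rebalances the training distribution, but each also encodes its own guess about what the unreported failures looked like, which is where the new biases creep in:

    # Generic sketch of two oversampling fixes for missing negative data.
    # All numbers are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    published = rng.uniform(55, 95, size=200)  # literature-like yields, mostly "good"
    failures = rng.uniform(0, 20, size=10)     # the few failures that were reported
    data = np.concatenate([published, failures])

    # Fix 1: duplicate the scarce negatives until they are roughly 30% of the set.
    n_extra = int(0.3 * len(data) / 0.7) - len(failures)
    oversampled = np.concatenate([data, rng.choice(failures, size=n_extra)])

    # Fix 2: pad the set with artificial zero-yield "reactions" instead.
    zero_padded = np.concatenate([data, np.zeros(n_extra)])

    for name, d in [("raw", data), ("oversampled", oversampled), ("zero-padded", zero_padded)]:
        print(f"{name:12s} mean yield = {d.mean():5.1f}%  ({len(d)} records)")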

Computer scientist Ella Peltonen helped to organize the first International Workshop on Negative Results in Pervasive Computing in 2022 to give researchers an opportunity to discuss failed experiments. Credit: University of Oulu

Capturing more negative data is now a priority for Takahashi. “We definitely need some sort of infrastructure to share the data freely.” His group has created a website for sharing large amounts of experimental data for catalysis reactions. Other organizations are trying to collect and publish negative data — but Takahashi says that, so far, they lack coordination, so data formats aren’t standardized. In his field, Strieth-Kalthoff says, there are initiatives such as the Open Reaction Database, launched in 2021 to share organic-reaction data and enable training of machine-learning applications. But, he says, “right now, nobody’s using it, [because] there’s no incentive”.

Smit has argued for a modular open-science platform that would directly link to electronic lab notebooks to help to make different data types extractable and reusable. Through this process, publication of negative data in peer-reviewed journals could be skipped, but the information would still be available for researchers to use in AI training. Strieth-Kalthoff agrees with this strategy in theory, but thinks it’s a long way off in practice, because it would require analytical instruments to be coupled to a third-party source to automatically collect data — which instrument manufacturers might not agree to, he says.

Publishing the non-positive

In other disciplines, the emphasis is still on peer-reviewed journals that will publish negative results. Gaillard, a science-studies PhD student at Radboud University in Nijmegen, the Netherlands, co-founded the Journal of Trial & Error after attending talks on how science can be made more open. Gaillard says that, although everyone whom they approached liked the idea of the journal, nobody wanted to submit articles at first. He and the founding editorial team embarked on a campaign involving cold calls and publicity at open-science conferences. “Slowly, we started getting our first submissions, and now we just get people sending things in [unsolicited],” he says. Most years the journal publishes one issue of about 8–14 articles, and it is starting to publish more special issues. It focuses mainly on the life sciences and data-based social sciences.

In 2008, David Alcantara, then a chemistry PhD student at the University of Seville in Spain who was frustrated by the lack of platforms for sharing negative results, set up The All Results journals, which were aimed at disseminating results regardless of the outcome. Of the four disciplines included at launch, only the biology journal is still being published. “Attracting submissions has always posed a challenge,” says Alcantara, now president at the consultancy and training organization the Society for the Improvement of Science in Seville.

But Alcantara thinks there has been a shift in attitudes: “More established journals [are] becoming increasingly open to considering negative results for publication.” Gaillard agrees: “I’ve seen more and more journals, like PLoS ONE, for example, that explicitly mentioned that they also publish negative results.” (Nature welcomes submissions of replication studies and those that include null results, as described in this 2020 editorial.)

Journals might be changing their publication preferences, but there are still significant disincentives that stop researchers from publishing their file-drawer studies. “The current academic system often prioritizes high-impact publications and ground-breaking discoveries for career advancement, grants and tenure,” says Alcantara, noting that negative results are perceived as contributing little to nothing to these endeavours. Plus, there is still a stigma associated with any kind of failure. “People are afraid that this will look negative on their CV,” says Gaillard. Smit describes reporting failed experiments as a no-win situation: “It’s more work for [researchers], and they don’t get anything in return in the short term.” And, jokes Smit, what’s worse is that they could be providing data for an AI tool to take over their role.

Ultimately, most researchers conclude that publishing their failed studies and negative data is just not worth the time and effort — and there’s evidence that they judge others’ negative research more harshly than positive outcomes. In a study published in August, 500 researchers from top economics departments around the world were randomized to two groups and asked to judge a hypothetical research paper. Half of the participants were told that the study had a null conclusion, and the other half were told the results were sizeable and statistically significant. The null results were perceived to be 25% less likely to be published, of lower quality and less important than were the statistically significant findings [6].

Some researchers have had positive experiences sharing their unsuccessful findings. For example, in 2021, psychologist Wendy Ross at London Metropolitan University published her negative results from testing a hypothesis about human problem-solving in the Journal of Trial & Error [7], and says the paper was “the best one I have published to date”. She adds, “Understanding the reasons for null results can really test and expand our theoretical understanding.”

Fields forging solutions

The field of psychology has introduced one innovation that could change publication biases — registered reports (RRs). These peer-reviewed reports, first published in 2014, came about largely as a response to psychology’s replication crisis, which began in around 2011. RRs set out the methodology of a study before the results are known, to try to prevent selective reporting of positive results. Daniël Lakens, who studies science-reward structures at Eindhoven University of Technology in the Netherlands, says there is evidence that RRs increase the proportion of negative results in the psychology literature.

In a 2021 study, Lakens analysed the proportion of published RRs whose results eventually support the primary hypothesis. In a random sample of hypothesis-testing studies from the standard psychology literature, 96% of the results were positive. In RRs, this fell to only 44% [8]. Lakens says the study shows “that if you offer this as an option, many more null results enter the scientific literature, and that is a desirable thing”. At least 300 journals, including Nature, are now accepting RRs, and the format is spreading to journals in biology, medicine and some social-science fields.

Yet another approach has emerged from the field of pervasive computing, the study of how computer systems are integrated into physical surroundings and everyday life. About four years ago, members of the community started discussing reproducibility, says computer scientist Ella Peltonen at the University of Oulu in Finland. Peltonen says that researchers realized that, to avoid the repetition of mistakes, there was a need to discuss the practical problems with studies and failed results that don’t get published. So in 2022, Peltonen and her colleagues held the first virtual International Workshop on Negative Results in Pervasive Computing (PerFail), in conjunction with the field’s annual conference, the International Conference on Pervasive Computing and Communications.

Peltonen explains that PerFail speakers first present their negative results and then have the same amount of time for discussion afterwards, during which participants tease out how failed studies can inform future work. “It also encourages the community to showcase that things require effort and trial and error, and there is value in that,” she adds. The workshop is now an annual event, and the organizers invite students to attend so they can see that failure is a part of research and that “you are not a bad researcher because you fail”, says Peltonen.

In the long run, Alcantara thinks a continued effort to persuade scientists to share all their results needs to be coupled with policies at funding agencies and journals that reward full transparency. “Criteria for grants, promotions and tenure should recognize the value of comprehensive research dissemination, including failures and negative outcomes,” he says. Lakens thinks funders could be key to boosting the RR format, as well. Funders, he adds, should say, “We want the research that we’re funding to appear in the scientific literature, regardless of the significance of the finding.”

There are some positive signs of change about sharing negative data: “Early-career researchers and the next generation of scientists are particularly receptive to the idea,” says Alcantara. Gaillard is also optimistic, given the increased interest in his journal, including submissions for an upcoming special issue on mistakes in the medical domain. “It is slow, of course, but science is a bit slow.”

doi: https://doi.org/10.1038/d41586-024-01389-7

References

1. Fanelli, D. Scientometrics 90, 891–904 (2012).

2. Herbet, M.-E., Leonard, J., Santangelo, M. G. & Albaret, L. Learned Publishing 35, 16–29 (2022).

3. Fujima, J., Tanaka, Y., Miyazato, I., Takahashi, L. & Takahashi, K. Reaction Chem. Eng. 5, 903–911 (2020).

4. Strieth-Kalthoff, F. et al. Angew. Chem. Int. Edn 61, e202204647 (2022).

5. Moosavi, S. M. et al. Nature Commun. 10, 539 (2019).

6. Chopra, F., Haaland, I., Roth, C. & Stegmann, A. Econ. J. 134, 193–219 (2024).

7. Ross, W. & Vallée-Tourangeau, F. J. Trial Error https://doi.org/10.36850/e4 (2021).

8. Scheel, A. M., Schijen, M. R. M. J. & Lakens, D. Adv. Methods Pract. Psychol. Sci. https://doi.org/10.1177/25152459211007467 (2021).


When authoritative sources hold onto bad data: A legal scholar explains the need for government databases to retract information


Janet Freilich, Associate Professor of Law, Fordham University

Disclosure statement

Janet Freilich does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.


In 2004, Hwang Woo-suk was celebrated for his breakthrough discovery creating cloned human embryos, and his work was published in the prestigious journal Science. But the discovery was too good to be true; Dr. Hwang had fabricated the data. Science publicly retracted the article and assembled a team to investigate what went wrong.

Retractions are frequently in the news. The high-profile discovery of a room-temperature superconductor was retracted on Nov. 7, 2023. A series of retractions toppled the president of Stanford University on July 19, 2023. Major early studies on COVID-19 were found to have serious data problems and were retracted on June 4, 2020.

Retractions are generally framed as a negative: as science not working properly, as an embarrassment for the institutions involved, or as a flaw in the peer review process. They can be all those things. But they can also be part of a story of science working the right way: finding and correcting errors, and publicly acknowledging when information turns out to be incorrect.

A far more pernicious problem occurs when information is not, and cannot be, retracted. There are many apparently authoritative sources that contain flawed information. Sometimes the flawed information is deliberate, but sometimes it isn’t – after all, to err is human. Often, there is no correction or retraction mechanism, meaning that information known to be wrong remains on the books without any indication of its flaws.

As a patent and intellectual property legal scholar, I’ve found that this is a particularly harmful problem with government information, which is often considered a source of trustworthy data but is prone to error and often lacks any means to retract the information.

Patent fictions and fraud

Consider patents, documents that contain many technical details that can be useful to scientists. There is no way to retract a patent. And patents contain frequent errors: Although patents are reviewed by an expert examiner before being granted, examiners do not check whether the scientific data in the patent is correct.

In fact, the U.S. Patent and Trademark Office permits patentees to include fictional experiments and data in patents. This practice, called prophetic examples, is common; about 25% of life sciences patents contain fictional experiments. The patent office requires that prophetic examples be written in the present or future tense while real experiments can be written in the past tense. But this is confusing to nonspecialists, including scientists, who tend to assume that a phrase like “X and Y are mixed at 300 degrees to achieve a 95% yield rate” indicates a real experiment.

Almost a decade after Science retracted the journal article claiming cloned human cells, Dr. Hwang received a U.S. patent on his retracted discovery. Unlike the journal article, this patent has not been retracted. The patent office did not investigate the accuracy of the data – indeed, it granted the patent long after the data’s inaccuracy had been publicly acknowledged – and there is no indication on the face of the patent that it contains information that has been retracted elsewhere.

This is no anomaly. In a similar example, Elizabeth Holmes, the former – now imprisoned – CEO of Theranos, holds patents on her thoroughly discredited claims for a small device that could rapidly run many tests on a small blood sample. Some of those patents were granted long after Theranos’ fraud headlined major newspapers.


Long-lived bad information

This sort of under-the-radar wrong data can be deeply misleading to readers. The system of retractions in scientific journals is not without its critics, but it compares favorably to the alternative of no retractions. Without retractions, readers don’t know when they are looking at incorrect information.

My colleague Soomi Kim and I conducted a study of patent-paper pairs. We looked at cases where the same information was published in a journal article and in a patent by the same scientists, and the journal paper had subsequently been retracted. We found that while citations to papers dropped steeply after the paper was retracted, there was no reduction in citations to patents with the very same incorrect information.

This probably happened because scientific journals paint a big red “retracted” notice on retracted articles online, informing the reader that the information is wrong. By contrast, patents have no retraction mechanism, so incorrect information continues to spread.

There are many other instances where authoritative-looking information is known to be wrong. The Environmental Protection Agency publishes emissions data supplied by companies but not reviewed by the agency. Similarly, the Food and Drug Administration disseminates official-looking information about drugs that is generated by drug manufacturers and posted without an evaluation by the FDA.

Consequences of nonretractions

There are also economic consequences when incorrect information can’t be easily corrected. The Food and Drug Administration publishes a list of patents that cover brand-name drugs. The FDA won’t approve a generic drug unless the generic manufacturer has shown that each patent that covers the drug in question is expired, not infringed or invalid.

The problem is that the list of patents is generated by the brand-name drug manufacturers, who have an incentive to list patents that don’t actually cover their drugs. Doing so increases the burden on generic drug manufacturers. The list is not checked by the FDA or anyone else, and there are few mechanisms for anyone other than the brand-name manufacturer to tell the FDA to remove a patent from the list.

Even when retractions are possible, they are effective only when readers pay attention to them. Financial data is sometimes retracted and corrected, but the revisions are not timely. “Markets don’t tend to react to revisions,” Paul Donovan, chief economist of UBS Global Wealth Management, told the Wall Street Journal, referring to governments revising gross domestic product figures.

Misinformation is a growing problem. There are no easy answers to solve it. But there are steps that would almost certainly help. One relatively straightforward step is for trusted data sources like those from the government to follow the lead of scientific journals and create a mechanism to retract erroneous information.


Photo Essay: My Spring 2024 Semester at CDS


Hi there! My name is Isabella Boncser, and I'm currently a sophomore in the six-year Accelerated BS/DPT program in Boston University's Sargent College (2026/2028). In addition to my academic pursuits, I have a passion for photography, and am currently the CDS student event photographer. I love capturing student life within CDS, whether that be students studying in the building on a rainy day, a 24-hour civic tech hackathon on the 17th floor, or a faculty and staff appreciation event. Over this past semester, I had the honor of working with the CDS communications team, led by Maureen McCarthy, director, and Alessandra Augusto, events & communications manager.

I was asked to highlight some of my best and brightest work from the semester. The following images were captured this spring and are some of my favorites. They showcase the versatility of student life within CDS and BU Spark!


BU Spark! hosted a Tech For Change Civic Tech Hackathon, where students spent 24 hours at BU developing new projects with teamwork and technical skills at the forefront. I had the opportunity to meet students from 19 different schools, all of whom spent (literally) day and night on the 17th floor of the Center for Computing & Data Sciences working together and using their hacking skills to create a difference in the world. Pictured here are two students celebrating after discussing their individual projects and asking for some advice regarding their presentations.


CDS serves as home for a variety of people and their furry friends! This image shows Miss Belle, the beautiful English Setter (who loves birds) who shares office space with her owner, Chris DeVits, CDS Director of Administration.


The Center for Computing & Data Sciences truly has a place for everyone at BU. The main level has become the campus living room, where students can meet to chat over coffee, or catch up on emails on the staircase. On a rainy day, students can find a "cozy corner" and focus on their work in a relaxing environment. This is a glimpse of the "sit steps", the large staircase with over two dozen conversation spaces that has become popular for students to relax and get some work done between classes.


You may have heard people refer to the Center for Computing & Data Sciences as the "Jenga Building" because of its Jenga-like architecture. The building, which is home to the Faculty of Computing & Data Sciences, the Departments of Mathematics & Statistics and Computer Science, and the renowned Rafik B. Hariri Institute for Computing and Computational Science & Engineering, embraces its beautiful yet fun architecture while focusing on community! Next time you are craving a fun study break, join the CDS Events Team for a night of Jenga and try some delicious popcorn!


Driving down Commonwealth Avenue, the building stands out amongst its peers and shines bright along the Boston city skyline. Illuminating the streets during dusk, the building is one of my favorites to photograph. The 17th floor is home to many events hosted by CDS faculty and staff, as well as the general BU community.


The students pictured had been working tirelessly on their TFC Civic Tech Hackathon project. This photo exemplifies teamwork, collaboration, and partnership. Although students were working on their projects for 24 hours on the 17th floor of CDS, they were all smiles for the camera during final presentations!


Yoga at the Top of BU has become a staple for students to come and enjoy a one-hour yoga session. The class is open to all students across BU, and is a great way to take a study break and get your body moving. If you are a zen master, or have never taken a yoga class before, come join us for the next session!


The BU Spark! team gathered for a group picture during the Civic Tech Hackathon which took place on the 17th floor in February. Over the Spring 2024 semester, I've had the pleasure of getting to know the ambassadors from each track, and their passion for their work within the BU community is truly inspiring. BU Spark! hosts numerous events, talks, and community-building programs like Cookie O'clock, town halls, and much more. Visit the BU Spark! space on the second floor to learn more about their involvement on campus!


Computational Humanities, Arts & Social Sciences (CHASS) hosted a variety of tutorials ranging from "An Analysis on Emerson's Work" to large language model discussions throughout the Spring 2024 semester. These sessions are a great way to learn about the data science industry and how your skills will be used in the real world. Check out the CHASS video tutorial library on YouTube.

I am heading to Dublin, Ireland to live and study abroad for the Fall 2024 semester! I am so thankful to Maureen McCarthy, who gave me the opportunity to work with and celebrate the CDS community. I would also like to give a shout-out to Sebastian Bak (QST'25), who recommended the position to me and spoke so highly of the CDS community!

