Ever Wondered What Data Scientists Actually Do? Let's Dive In!
In simple terms, a data scientist is a person who helps businesses achieve their goals and find appropriate answers to their questions through data. To do this, they need to gather data, analyze and explore it, and then make decisions based on the results.
As mentioned, data science is a multidisciplinary field, so a data scientist needs solid background knowledge of the relevant tools, technologies, and fields. Drew Conway's well-known Venn diagram below shows the different areas of expertise a data scientist draws on. As depicted, a data scientist is someone who develops their skills in computer science, mathematics, and domain expertise. The latter means understanding the overall aspects of the field in which you work. But what is computer science?
Computer science is a broad field that encompasses the theoretical and practical aspects of modern computing. At its core, computer science is concerned with understanding how computers work and how they can be used to solve complex problems. This covers everything from software, to the operating systems it runs on, to the underlying hardware that the OS interacts with. One of the key skills a computer scientist possesses is the ability to code and program, which involves writing instructions that a computer can understand and execute. Keep in mind that a data scientist does not necessarily need to be a computer scientist; rather, they need to possess certain skills and knowledge essential for the job.
You might ask what the difference between a statistician and a data scientist is. Data scientists go beyond just analyzing and exploring data; they also use specific algorithms, called machine learning algorithms, for more advanced tasks such as detecting patterns in data and predicting future trends. Some of the important tasks of a data scientist are listed below:
- Asking the right questions to identify the problem or goal of the project.
- Conducting exploratory data analysis, which may include statistical analysis or other techniques to understand the data.
- Visualizing and presenting the data to communicate insights or findings to stakeholders.
- Modeling the data using machine learning algorithms to gain further insights or make predictions.
- Detecting patterns and anomalies, which can help identify important trends or outliers in the data (see the short sketch after this list).
- Understanding the past and present by analyzing historical data and current trends.
- Making data-driven decisions based on the insights and predictions generated by the analysis.
- Predicting the future by using machine learning models to forecast future trends or outcomes.
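To make the pattern- and anomaly-detection task above more concrete, here is a minimal sketch using scikit-learn's IsolationForest. The data is synthetic and the contamination rate is a guess; in a real project, the dataset and thresholds would come from your own problem.

```python
# A minimal, hypothetical sketch of anomaly detection with scikit-learn.
# The data here is synthetic; in practice you would load your own dataset.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # typical observations
outliers = rng.uniform(low=6.0, high=9.0, size=(5, 2))   # a few extreme points
X = np.vstack([normal, outliers])

# contamination is the assumed share of anomalies in the data (a guess here)
model = IsolationForest(contamination=0.03, random_state=0)
labels = model.fit_predict(X)          # -1 marks anomalies, 1 marks normal points

print(f"Detected {np.sum(labels == -1)} potential anomalies out of {len(X)} rows")
```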
As you might guess, the demand for skilled data scientists is growing quickly. There are many reasons behind this rising demand; here are some of them:
- Organizations struggle to handle the huge amounts of data they collect.
- Since a broad range of knowledge is required to master data science, skilled data scientists are hard to find.
- Salaries are higher than in many other careers.
- In addition to tech giants, many small and mid-sized companies need to hire data scientists.
- The data scientist position is not limited to tech companies; various industries, from manufacturing and agriculture to healthcare, need to fill their vacancies.
Generally, there is no silver bullet for solving a data science problem; each project calls for its own steps and strategies. However, we can see patterns and similarities among those steps. A data science workflow refers to the steps and phases that a data science project should go through. Selecting an appropriate workflow helps to organize and implement a project clearly. Moreover, by dividing a project into phases, we know what the output of each stage should look like, and we can assign each phase to the relevant experts on our team.
Many frameworks have been introduced to define data science workflows; some are better known than others. In this section, we will explain two of them, CRISP-DM and OSEMN, but feel free to explore the others on your own.
CRISP-DM
CRISP-DM, or Cross Industry Standard Process for Data Mining, was first introduced in the late 90s and has been developed over time. It is a popular and user-friendly workflow that defines the life cycle of a data science project, from planning and organizing to implementation.
As shown in the above figure, this workflow contains six iterative phases. In each phase, tasks and deliverables are defined, and any phase can be repeated when needed. The six phases are described below:
- Business Understanding: First of all, we should understand what we want to accomplish in our business. In other words, we should ask ourselves what questions we are trying to answer.
- Data Understanding: In this stage, we should gather and collect the data related to the questions we asked before. Loading the data into the relevant tools can also be done at this stage.
- Data Preparation: After gathering data, we should prepare the raw data for further analysis in the next stages. This means the data should be cleaned and put into the right format so that the analysis algorithms can take this well-formatted data as their input.
- Modeling: Regardless of whether we ultimately want to predict the future or analyze past data, we usually want to create a model that generalizes the patterns in our data. With this in mind, different tools, techniques, and algorithms are used in this stage to model our data, and we assess the models against our defined metrics (a minimal sketch of the preparation, modeling, and evaluation phases follows this list).
- Evaluation: In the previous step, we evaluated our models technically; in this stage, we assess how good they are in the bigger picture and whether they appropriately address our business needs.
- Deployment: After picking the best model, we should present it in a suitable way so that our end users can use it easily. In this stage, we deploy our model into a production environment.
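To give a feel for how the Data Preparation, Modeling, and Evaluation phases fit together, here is a minimal sketch using pandas and scikit-learn. The file name customers.csv, the churned target column, and the choice of logistic regression and accuracy are hypothetical assumptions for illustration, not part of CRISP-DM itself.

```python
# A minimal, hypothetical sketch of the Data Preparation, Modeling, and
# Evaluation phases of CRISP-DM. File and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data Preparation: clean the raw data and put it in the right format
df = pd.read_csv("customers.csv")           # hypothetical dataset
df = df.dropna()                            # drop rows with missing values
X = df.drop(columns=["churned"])            # features (assumed numeric columns)
y = df["churned"]                           # hypothetical binary target

# Modeling: fit a model that generalizes the patterns in the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluation (technical side): assess the model on a held-out set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Hold-out accuracy: {accuracy:.2f}")
```

In a real project, the business-level Evaluation phase would then judge whether this accuracy actually meets the project's needs before moving on to Deployment.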
OSEMN
The OSEMN workflow, which stands for Obtain, Scrub, Explore, Model, and iNterpret, was first introduced by Hilary Mason and Chris Wiggins in 2010. Let's describe its five key components:
- Obtain: In the first step, we should find out what our data is and how we can access it. Although there are many free and ready-to-use datasets, most of the time we need techniques such as web scraping or querying our databases to gather the right data. After collecting it, the data should be stored in a suitable format.
- Scrub: Almost always, the collected data is messy, so it should be cleaned before further use. In this step, we take care of tasks like handling missing values, removing irrelevant data, and identifying errors and corrupt records. Although this step may be underestimated by some data scientists, bear in mind that good data cleaning leads to good inputs for the next phases and, in turn, better results (a short sketch of the Scrub and Explore steps follows this list).
- Explore: In this step, exploratory data analysis, or EDA for short, should be done. By using data visualization (a graphical representation of data characteristics) and statistical testing methods, we can build intuition about the data and spot patterns without using any artificial intelligence tools.
- Model: After cleaning and analyzing data, we can use different algorithms to generalize the patterns within our data and make a model to describe them. Based on what metrics we care about the most, we can evaluate the models and pick the best one.
- Interpret: In the end, we should present what we have achieved to users and stakeholders. Reporting just a bunch of numbers is not our goal; what matters is explaining the hidden insights we have found behind the raw data.
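As a small, concrete example of the Scrub and Explore steps, the sketch below uses pandas on a tiny, made-up dataset; the columns and values are purely illustrative.

```python
# A minimal, hypothetical sketch of the Scrub and Explore steps of OSEMN.
# The tiny inline dataset and its columns are made up for illustration.
import pandas as pd

raw = pd.DataFrame({
    "age":    [34, 29, None, 41, 38, 29],
    "income": [52000, 48000, 61000, None, 75000, 48000],
    "city":   ["Austin", "Denver", "Austin", "Boston", "Denver", "Denver"],
})

# Scrub: remove duplicate rows and fill missing values with column medians
clean = raw.drop_duplicates()
clean = clean.assign(
    age=clean["age"].fillna(clean["age"].median()),
    income=clean["income"].fillna(clean["income"].median()),
)

# Explore: summary statistics and simple group-level patterns
print(clean.describe())                          # distribution of numeric columns
print(clean.groupby("city")["income"].mean())    # average income per city
```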
For more insightful explanations of data scientists and what they do, join our hands-on and immersive Introduction to Python for Data Science Online Training.