What is Data Science?
Dealing with both unstructured and structured data, Data Science is a field that comprises everything related to data cleansing, preparation, and analysis.
Data Science is the combination of statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning the data.
In simple terms, it is the umbrella of techniques used when trying to extract insights and information from data.
Data Science Features
Data Science is an interdisciplinary area that uses scientific methods, processes, and algorithms to extract insights from data.
- A Machine Learning algorithm builds a model out of sample data, known as training data.
- The model is then used to make predictions on new, unseen data.
- For example, machine learning algorithms are used in applications such as filtering spam emails: the trained model predicts whether an incoming email is spam or not.
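The spam example can be sketched in a few lines. The training emails and the word-count scoring below are illustrative assumptions (a real filter would use a proper learning algorithm such as naive Bayes), but the shape is the same: build a model from labelled training data, then predict labels for unseen emails.

```python
from collections import Counter

def train(samples):
    """Count how often each word appears in spam vs. ham training emails."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in samples:
        counts[label].update(text.lower().split())
    return counts

def predict(model, text):
    """Label an email by which class its words were seen in more often."""
    words = text.lower().split()
    spam_score = sum(model["spam"][w] for w in words)
    ham_score = sum(model["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

# Invented sample emails standing in for real training data.
training_data = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

model = train(training_data)
print(predict(model, "free prize money"))       # prints "spam"
print(predict(model, "monday project meeting")) # prints "ham"
```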
DevOps for Data Science
- The machine learning model built by a data scientist has to be deployed into production for usage.
- When a new version of a model is available, then the old version in production has to be replaced with a new one.
- This process of deploying machine learning models is very similar to the deployment of software code.
- For this reason, DevOps practices and habits can be followed while deploying machine learning models into production.
Steps in Machine Learning in Data Science
As you know, Machine Learning gives computers the ability to perform tasks such as classification and prediction.
The major steps involved in Machine Learning are:
- Data Collection
- Data Munging
- Feature Engineering
- Model Building
- Testing and Validating Model
- Running Models
- The first part of any Data Science project is Data Collection. It is the first step of the data science workflow and enables an organization to derive answers to specific questions.
- The data can be collected from multiple sources, such as log files and other sources.
- Collecting high-quality data is essential for getting correct insights from the data.
- In reality, most of the raw data is redundant and can be structured, unstructured or semi-structured.
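The six steps above can be sketched end to end on toy data. The raw "log file" lines and the simple least-squares model below are illustrative assumptions; real projects substitute their own data sources and algorithms.

```python
# 1. Data Collection: raw records, e.g. lines read from a log file.
raw = ["5,55", "10,105", "bad line", "15,155", "20,205"]

# 2. Data Munging: parse the lines and drop malformed records.
rows = []
for line in raw:
    parts = line.split(",")
    if len(parts) == 2 and all(p.strip().isdigit() for p in parts):
        rows.append((float(parts[0]), float(parts[1])))

# 3. Feature Engineering: split into input feature x and target y.
xs = [x for x, _ in rows]
ys = [y for _, y in rows]

# 4. Model Building: fit y = a*x + b by ordinary least squares.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# 5. Testing and Validating: check the model on a held-out point (25 -> 255).
assert abs((a * 25 + b) - 255) < 1e-6

# 6. Running the Model: predict for new, unseen data.
print(a * 30 + b)  # prints 305.0
```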
What is Big Data?
Big Data refers to humongous volumes of data that cannot be processed effectively with the traditional applications that exist. The processing of Big Data begins with the raw data that isn’t aggregated and is most often impossible to store in the memory of a single computer.
A buzzword that is used to describe immense volumes of data, both unstructured and structured, Big Data inundates a business on a day-to-day basis. Big Data is something that can be used to analyze insights that can lead to better decisions and strategic business moves.
The definition of Big Data, given by Gartner, is, “Big data is high-volume, and high-velocity or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
What is Data Analytics?
Data Analytics is the science of examining raw data to draw conclusions from that information.
Data Analytics involves applying an algorithmic or mechanical process to derive insights, for example by running through several data sets to look for meaningful correlations between them.
It is used in several industries to allow organizations and companies to make better decisions as well as verify and disprove existing theories or models. The focus of Data Analytics lies in inference, which is the process of deriving conclusions that are solely based on what the researcher already knows.
Now, let us move to applications of Data Science, Big Data, and Data Analytics.
Types of Data Analytics
Data analytics is a broad field. There are four primary types of data analytics: descriptive, diagnostic, predictive and prescriptive analytics. Each type has a different goal and a different place in the data analysis process. These are also the primary data analytics applications in business.
- Descriptive analytics helps answer questions about what happened. These techniques summarize large datasets to describe outcomes to stakeholders. By developing key performance indicators (KPIs), these strategies can help track successes or failures. Metrics such as return on investment (ROI) are used in many industries. Specialized metrics are developed to track performance in specific industries. This process requires the collection of relevant data, processing of the data, data analysis, and data visualization. This process provides essential insight into past performance.
- Diagnostic analytics helps answer questions about why things happened. These techniques supplement more basic descriptive analytics. They take the findings from descriptive analytics and dig deeper to find the cause. The performance indicators are further investigated to discover why they got better or worse. This generally occurs in three steps:
- Identify anomalies in the data. These may be unexpected changes in a metric or a particular market.
- Data that is related to these anomalies is collected.
- Statistical techniques are used to find relationships and trends that explain these anomalies.
- Predictive analytics helps answer questions about what will happen in the future. These techniques use historical data to identify trends and determine whether they are likely to recur. Predictive analytics tools provide valuable insight into what may happen in the future, using a variety of statistical and machine learning techniques such as neural networks, decision trees, and regression.
- Prescriptive analytics helps answer questions about what should be done. By using insights from predictive analytics, data-driven decisions can be made. This allows businesses to make informed decisions in the face of uncertainty. Prescriptive analytics techniques rely on machine learning strategies that can find patterns in large datasets. By analyzing past decisions and events, the likelihood of different outcomes can be estimated.
These types of data analytics provide the insight that businesses need to make effective and efficient decisions. Used in combination, they provide a well-rounded understanding of a company's needs and opportunities.
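As a minimal sketch of the first two types, the toy monthly revenue figures below (invented for illustration) show descriptive analytics summarising what happened and diagnostic analytics flagging an anomaly to investigate further.

```python
from statistics import mean, stdev

# Invented monthly revenue figures for illustration.
revenue = {"Jan": 100, "Feb": 104, "Mar": 98, "Apr": 102, "May": 60, "Jun": 101}

# Descriptive: summarise outcomes as a simple KPI.
avg = mean(revenue.values())
print(f"average monthly revenue: {avg:.1f}")

# Diagnostic, step 1: identify anomalies (here, via a z-score threshold).
sd = stdev(revenue.values())
anomalies = [month for month, v in revenue.items() if abs(v - avg) / sd > 1.5]
print("months to investigate:", anomalies)  # prints ['May']
```

Steps 2 and 3 of diagnostic analytics would then collect data related to May and use statistical techniques to explain the drop.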
Applications of Data Science
- Internet Search: Search engines make use of data science algorithms to deliver the best results for search queries in a fraction of a second.
- Digital Advertisements: The entire digital marketing spectrum uses data science algorithms, from display banners to digital billboards. This is the main reason digital ads get a higher CTR than traditional advertisements.
- Recommender Systems: Recommender systems not only make it easy to find relevant products among the billions available but also add a lot to the user experience. Many companies use these systems to promote their products and make suggestions in accordance with the user's demands and the relevance of the information. The recommendations are based on the user's previous search results.
Applications of Big Data
- Big Data for Financial Services: Credit card companies, retail banks, private wealth management advisories, insurance firms, venture funds, and institutional investment banks use big data for their financial services. The common problem among them all is the massive amount of multi-structured data living in multiple disparate systems, which big data can solve. Big data is therefore used in several ways, such as:
- Customer analytics
- Compliance analytics
- Fraud analytics
- Operational analytics
- Big Data in Communications: Gaining new subscribers, retaining customers, and expanding within current subscriber bases are top priorities for telecommunication service providers. The solutions to these challenges lie in the ability to combine and analyze the masses of customer-generated and machine-generated data being created every day.
- Big Data for Retail: Whether a brick-and-mortar store or an online e-tailer, the answer to staying in the game and being competitive is understanding the customer well enough to serve them. This requires the ability to analyze all the disparate data sources that companies deal with every day, including weblogs, customer transaction data, social media, store-branded credit card data, and loyalty program data.
Skills Required to Become a Data Scientist
- Education: 88% have a Master’s Degree, and 46% have PhDs
- In-depth knowledge of SAS or R: For Data Science, R is generally preferred.
- Python coding: Python is the most common coding language that is used in data science, along with Java, Perl, C/C++.
- Hadoop platform: Although not always a requirement, knowing the Hadoop platform is still preferred for the field. Having a bit of experience in Hive or Pig is also a huge selling point.
- SQL database/coding: Though NoSQL and Hadoop have become a significant part of the Data Science background, it is still preferred if you can write and execute complex queries in SQL.
- Working with unstructured data: It is essential that a Data Scientist can work with unstructured data, be it on social media, video feeds, or audio.
Skills Required to Become a Big Data Specialist
- Analytical skills: The ability to make sense of the piles of data you get. With analytical skills, you will be able to determine which data is relevant to your solution; it is much like problem-solving.
- Creativity: You need the ability to create new methods to gather, interpret, and analyze a data strategy. This is an extremely valuable skill to possess.
- Mathematics and statistical skills: Good, old-fashioned “number crunching.” This is extremely necessary, be it in data science, data analytics, or big data.
- Computer science: Computers are the workhorses behind every data strategy. Programmers will have a constant need to come up with algorithms to process data into insights.
- Business skills: Big Data professionals will need to have an understanding of the business objectives that are in place, as well as the underlying processes that drive the growth of the business as well as its profit.
What is Data?
Data is the quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
What is Big Data?
Big Data is also data, but of huge size. Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
Examples Of Big Data
Following are some of the examples of Big Data:
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated in the form of photo and video uploads, message exchanges, posting comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
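A back-of-envelope check of the jet-engine figure: at 10 terabytes per flight, an assumed 25,000 flights per day (the flight count is an illustrative assumption) already lands in the petabyte range.

```python
# Rough daily data volume from jet engines, per the figures above.
tb_per_flight = 10          # terabytes generated in one 30-minute flight
flights_per_day = 25_000    # assumed global flight count, for illustration

total_tb = tb_per_flight * flights_per_day
print(total_tb / 1_000, "petabytes per day")  # prints 250.0 petabytes per day
```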
Types Of Big Data
Big Data can be found in three forms: structured, unstructured, and semi-structured.
Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value from it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte, or one billion terabytes.
Looking at these figures one can easily understand why the name Big Data is given and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of 'structured' data.
Examples Of Structured Data
An ‘Employee’ table in a database is an example of Structured Data
Unstructured Data
Any data with an unknown form or structure is classified as unstructured data. In addition to being huge in size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but, unfortunately, don't know how to derive value from it, since this data is in its raw, unstructured form.
Examples Of Un-structured Data
The output returned by ‘Google Search’
Semi-structured Data
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
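Records like these can be parsed with Python's standard library. The rec/name/sex/age structure comes straight from the sample above; the records are wrapped in a root element here, since XML parsers require one.

```python
import xml.etree.ElementTree as ET

# A subset of the sample records, wrapped in a root element for well-formedness.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)
records = [
    {"name": r.findtext("name"), "sex": r.findtext("sex"), "age": int(r.findtext("age"))}
    for r in root.findall("rec")
]
print(records[0]["name"])  # prints Prashant Rao
```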
Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction history files etc. OLTP systems are built to work with structured data wherein data is stored in relations (tables).
Characteristics Of Big Data
(i) Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that the data can show at times, hampering the process of handling and managing the data effectively.
Benefits of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
- Businesses can utilize outside intelligence while making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
- Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.
- Early identification of risk to the product/services, if any
- Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and data warehouse helps an organization to offload infrequently accessed data.