ASSESSMENT 1 BRIEF
Subject Code and Title BDA601—Big Data and Analytics
Assessment Design Data Pipeline
Individual/Group Individual
Length 1,500 words (+/–10%)
Learning Outcomes The Subject Learning Outcomes demonstrated by the successful completion of the task below include:
a) Explain and evaluate the V’s of Big Data (volume, velocity, variety, veracity, valence, and value);
b) Identify best practices in data collection and storage, including data security and privacy principles; and
e) Effectively report and communicate findings to an appropriate audience.
Submission Due by 11.55 pm AEST on the Sunday at the end of Module 4
Weighting 30%
Total Marks 100 marks
Task Summary
Critically analyse the online retail business case (see below) and write a 1,500-word report that:
a) Identifies various sources of data to build an effective data pipeline;
b) Identifies challenges in integrating the data from the sources and formulates a strategy to address those challenges; and
c) Describes a design for a storage and retrieval system for the data lake that uses commercial and/or open-source big data tools.
Please refer to the Task Instructions (below) for details on how to complete this task.
Context
A modern data-driven organisation must be able to collect and process large volumes of data and perform analytics at scale on that data. Thus, the establishment of a data pipeline is an essential first step in building a data-driven organisation. A data pipeline ingests data from various sources, integrates that data and stores that data in a ‘data lake’, making that data available to everyone in the organisation.
This Assessment prepares you to identify potential sources of data, address challenges in integrating data and design an efficient ‘data lake’ using the big data principles, practices and technologies covered in the learning materials.
Case Study
Big Retail is an online retail shop in Adelaide, Australia. Its website, at which its users can explore different products and promotions and place orders, has more than 100,000 visitors per month. During checkout, each customer has three options: 1) to log in to an existing account; 2) to create a new account if they have not already registered; or 3) to check out as a guest. Customers’ account information is maintained by both the sales and marketing departments in their separate databases. The sales department maintains records of the transactions in their database. The information technology (IT) department maintains the website.
Every month, the marketing team releases a catalogue and promotions, which are made available on the website and emailed to the registered customers. The website is static; that is, all the customers see the same content, irrespective of their location, login status or purchase history.
Recently, Big Retail has experienced a significant slump in sales, despite its having a cost advantage over its competitors. A significant reduction in the number of visitors to the website and the conversion rate (i.e., the percentage of visitors who ultimately buy something) has also been observed. To regain its market share and increase its sales, the management team at Big Retail has decided to adopt a data-driven strategy. Specifically, the management team wants to use big data analytics to enable a customised customer experience through targeted campaigns, a recommender system and product association.
The first step in moving towards the data-driven approach is to establish a data pipeline. The essential purpose of the data pipeline is to ingest data from various sources, integrate the data and store the data in a ‘data lake’ that can be readily accessed by both the management team and the data scientists.
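The ingest–integrate–store flow described above can be sketched in miniature. The sketch below is illustrative only: the sources, field names and join key are assumptions made for the example, not part of the case study, and a real pipeline would use big data tooling rather than in-memory Python.

```python
# Illustrative sketch of the three pipeline stages: ingest, integrate, store.
# All sources and field names are hypothetical, not Big Retail's actual schema.

# Ingest: raw records from two assumed sources.
web_events = [  # semi-structured clickstream from the website
    {"session": "s1", "customer_id": "C001", "page": "/promo"},
]
sales_rows = [  # structured rows from the sales department's database
    {"order_id": 1, "customer_id": "C001", "amount": 59.9},
]

def integrate(events, sales):
    """Attach each visitor's transactions to their clickstream events."""
    sales_by_customer = {}
    for row in sales:
        sales_by_customer.setdefault(row["customer_id"], []).append(row)
    return [{**e, "orders": sales_by_customer.get(e["customer_id"], [])}
            for e in events]

def store(records, lake):
    """Land the integrated records under a zone path in the 'data lake'."""
    lake.setdefault("integrated/web_sales", []).extend(records)

lake = {}  # stand-in for distributed storage such as HDFS or object storage
store(integrate(web_events, sales_rows), lake)
```

Here the dictionary key stands in for a lake zone (e.g., a directory of files in distributed storage), and the join on `customer_id` stands in for the integration step that a framework such as Spark would perform at scale.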
Task Instructions
Critically analyse the above case study and write a 1,500-word report. In your report, ensure that you:
• Identify the potential data sources that align with the objectives of the organisation’s data-driven strategy. You should consider both the internal and external data sources. For each data source identified, describe its characteristics. Make reasonable assumptions about the fields and format of the data for each of the sources;
• Identify the challenges that will arise in integrating the data from different sources and that must be resolved before the data are stored in the ‘data lake.’ Articulate the steps necessary to address these issues;
• Describe the ‘data lake’ that you designed to store the integrated data and make the data available for efficient retrieval by both the management team and data scientists. The system should be designed using commercial and/or open-source databases, tools and frameworks. Demonstrate how the ‘data lake’ meets the big data storage and retrieval requirements; and
• Provide a schematic of the overall data pipeline. The schematic should clearly depict the data sources, data integration steps, the components of the ‘data lake’ and the interactions among all the entities.
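To make the second instruction concrete: the sales and marketing departments keep customer accounts in separate databases, so the same customer can appear twice under different column names. The sketch below shows schema alignment and de-duplication in minimal Python; every column name and record is a hypothetical assumption for illustration, not data taken from the brief.

```python
# Hypothetical extracts from the two departmental customer databases.
sales_db = [
    {"cust_id": "C001", "email": "ann@example.com", "name": "Ann Lee"},
    {"cust_id": "C002", "email": "bob@example.com", "name": "Bob Wu"},
]
marketing_db = [
    {"customer_no": "M-17", "email_address": "ann@example.com", "full_name": "Ann Lee"},
]

# Schema alignment: map each source's column names onto one canonical schema.
CANONICAL = {
    "cust_id": "customer_id", "email": "email", "name": "name",                    # sales
    "customer_no": "customer_id", "email_address": "email", "full_name": "name",   # marketing
}

def align(record):
    """Rename a source record's fields to the canonical schema."""
    return {CANONICAL[k]: v for k, v in record.items()}

def deduplicate(records):
    """Collapse duplicates, using the email address as the matching key."""
    seen, unique = set(), []
    for r in records:
        if r["email"] not in seen:
            seen.add(r["email"])
            unique.append(r)
    return unique

customers = deduplicate([align(r) for r in sales_db + marketing_db])
```

A full report would typically go further (e.g., fuzzy matching when no shared key such as an email address exists), but the two steps shown — map each source onto one canonical schema, then collapse duplicates on a matching key — are the core of the integration strategy the task asks for.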
Submission Instructions
• Submit this task via the Assessment link in the main navigation menu in BDA601—Big Data and Analytics.
The Learning Facilitator will provide feedback via the Grade Centre in the LMS portal. Feedback can be viewed in My Grades.
Academic Integrity Declaration
I declare that except where referenced, the work I am submitting for this assessment task is my own work. I have read and am aware of the Academic Integrity Policy and Procedure of Torrens University, Australia, viewable online at http://www.torrens.edu.au/policies-and-forms.
I am also aware that I need to keep a copy of all submitted material and any drafts and I agree to do so.
Assessment Rubric
Grading bands:
• Fail (Yet to Achieve Minimum Standard): 0–49%
• Pass (Functional): 50–64%
• Credit (Proficient): 65–74%
• Distinction (Advanced): 75–84%
• High Distinction (Exceptional): 85–100%
Identifies the potential data sources aligned with the objectives of the organisation’s data-driven strategy (25%)

Fail: Demonstrates partial or unsatisfactory knowledge and understanding in identifying the data sources aligned with the objectives of the organisation’s data-driven strategy.
• Does not distinguish between external and internal data sources.
• Does not distinguish among structured, semi-structured and unstructured data sources.
• Provides an unsatisfactory description of the characteristics, formats and structures of the sources.

Pass: Demonstrates functional knowledge and understanding in identifying the data sources aligned with the objectives of the organisation’s data-driven strategy.
• Does not distinguish between external and internal data sources.
• Identifies only structured, semi-structured or unstructured data source(s).
• Provides a satisfactory description of the characteristics, formats and structures of the source(s).

Credit: Demonstrates solid knowledge and understanding in identifying the data sources aligned with the objectives of the organisation’s data-driven strategy.
• Identifies both external and internal data sources.
• Identifies two different types of data sources among structured, semi-structured and unstructured data sources.
• Provides a good-quality description of the characteristics, formats and structures of the sources.

Distinction: Demonstrates advanced knowledge and understanding in identifying the data sources aligned with the objectives of the organisation’s data-driven strategy.
• Identifies both external and internal data sources.
• Identifies structured, semi-structured and unstructured data sources.
• Provides a high-quality description of the characteristics, formats and structures of the sources.

High Distinction: Demonstrates exceptional knowledge and understanding in identifying the data sources aligned with the objectives of the organisation’s data-driven strategy.
• Identifies both external and internal data sources.
• Identifies structured, semi-structured and unstructured data sources.
• Provides an exceptional-quality description of the characteristics, formats and structures of the sources.
Identifies and resolves data integration challenges (30%)

Fail: Demonstrates partial or unsatisfactory knowledge and understanding in identifying and resolving data integration challenges.
• The identification of the integration issues is unsatisfactory.
• Unsatisfactory demonstration of skills in resolving the data integration issues.

Pass: Demonstrates satisfactory knowledge and understanding in identifying and resolving data integration challenges.
• Partially identifies the major integration issues, including schema alignment and duplicates.
• Satisfactory demonstration of skills in resolving the data integration issues.

Credit: Demonstrates solid knowledge and understanding in identifying and resolving data integration challenges.
• Articulates all the major integration issues, including schema alignment and duplicates, with good clarity and accuracy.
• Good demonstration of skills in resolving the data integration issues.

Distinction: Demonstrates advanced knowledge and understanding in identifying and resolving data integration challenges.
• Articulates all the major integration issues, including schema alignment and duplicates, with high clarity, accuracy and completeness.
• Very good demonstration of skills in resolving the data integration issues.

High Distinction: Demonstrates exceptional knowledge and understanding in identifying and resolving data integration challenges.
• Articulates all the major integration issues, including schema alignment and duplicates, with exemplary clarity, accuracy and completeness.
• Exemplary demonstration of skills in resolving the data integration issues.
The design of the ‘data lake’ satisfies the requirements for big data storage and retrieval (30%)

Fail: Demonstrates partial or unsatisfactory knowledge and understanding in designing the ‘data lake’ and thus does not satisfy the requirements for big data storage and retrieval. The designed ‘data lake’ does not meet any of the following requirements:
• Can store structured data;
• Can store semi-structured and unstructured data;
• Supports efficient searches; and
• Supports low-latency retrieval.

Pass: Demonstrates satisfactory knowledge and understanding in designing the ‘data lake’ and thus satisfies the requirements for big data storage and retrieval. The designed ‘data lake’ meets one of the following requirements:
• Can store structured data;
• Can store semi-structured and unstructured data;
• Supports efficient searches; or
• Supports low-latency retrieval.

Credit: Demonstrates solid knowledge and understanding in designing the ‘data lake’ and thus satisfies the requirements for big data storage and retrieval. Good-quality demonstration that the designed ‘data lake’ meets two of the following requirements:
• Can store structured data;
• Can store semi-structured and unstructured data;
• Supports efficient searches; and/or
• Supports low-latency retrieval.

Distinction: Demonstrates advanced knowledge and understanding in designing the ‘data lake’ and thus satisfies the requirements for big data storage and retrieval. High-quality demonstration that the designed ‘data lake’ meets three of the following requirements:
• Can store structured data;
• Can store semi-structured and unstructured data;
• Supports efficient searches; and/or
• Supports low-latency retrieval.

High Distinction: Demonstrates exceptional knowledge and understanding in designing the ‘data lake’ and thus satisfies the requirements for big data storage and retrieval. Exemplary demonstration that the designed ‘data lake’ meets all of the following requirements:
• Can store structured data;
• Can store semi-structured and unstructured data;
• Supports efficient searches; and
• Supports low-latency retrieval.
Clarity of data pipeline diagram and quality of overall presentation of the report (15%)

Fail:
• The diagram for the data pipeline is incomplete and of an unsatisfactory quality.
• Lacks overall organisation. Very difficult to follow.
• Grammatical and spelling errors make it difficult for the reader to interpret the text in many places.

Pass:
• The diagram for the data pipeline is mostly complete and is of a satisfactory quality.
• Not well organised for the most part. Ambiguous and very basic.
• Choice of words needs to be improved. Grammatical errors impede the flow of communications.

Credit:
• The diagram for the data pipeline is mostly complete and is of a good quality.
• Organised for the most part. Partly cohesive and easy to follow.
• Words are well chosen, with some minor improvements needed. Sentences are mostly grammatically correct and contain few spelling errors.

Distinction:
• The diagram for the data pipeline is complete and is of a remarkably high quality.
• Well organised. Cohesive and easy to follow.
• Words are well chosen. Sentences are grammatically correct and free of spelling errors.

High Distinction:
• The diagram for the data pipeline is complete and is of an exceptional quality.
• Exceptionally organised. Highly cohesive and easy to follow.
• Words are carefully chosen to precisely express the intended meaning and support reader comprehension. Sentences are grammatically correct and free of spelling errors.
The following Subject Learning Outcomes are addressed in this assessment
SLO a) Explain and evaluate the V’s of Big Data (volume, velocity, variety, veracity, valence and value).
SLO b) Identify best practices in data collection and storage, including data security and privacy principles.
SLO e) Effectively report and communicate findings to an appropriate audience.