[Jan 26, 2022] Databricks Databricks-Certified-Professional-Data-Scientist Real Exam Questions and Answers FREE
Pass Databricks Databricks-Certified-Professional-Data-Scientist Exam Info and Free Practice Test
Databricks Databricks-Certified-Professional-Data-Scientist Exam Syllabus Topics:
| Topic | Details |
|---|---|
| Topic 1 | |
| Topic 2 | |
| Topic 3 | |
| Topic 4 | |
| Topic 5 | |
| Topic 6 | |
NEW QUESTION 37
Under which circumstance do you need to implement N-fold cross-validation after creating a regression model?
- A. There are categorical variables in the model.
- B. The data is unformatted.
- C. There are missing values in the data.
- D. There is not enough data to create a test set.
Answer: D
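As a rough illustration of why N-fold cross-validation helps when data is scarce, the sketch below splits a small dataset into folds so that every record is held out for testing exactly once. The function name `k_fold_indices` and the sample sizes are assumptions made for this example, not part of the exam material.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical tiny dataset of 10 records split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the scores averaged, so no separate test set has to be carved out.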
NEW QUESTION 38 
The figure below shows a plot of the data of a data matrix M that is 1000 x 2. Which line represents the first principal component?
- A. Neither
- B. blue
- C. yellow
Answer: B
Explanation:
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
The first principal component corresponds to the greatest variance in the data. The blue line is evidently this first principal component, because if we project the data onto the blue line, the data is more spread out (higher variance) than if projected onto any other line, including the yellow one.
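The variance argument can be checked numerically. The following is a minimal sketch on synthetic 1000 x 2 data (the data, seed, and function names are assumptions, since the original figure is not available): it computes the first principal component of a 2-D point cloud in closed form and compares the projected variance along it with the variance along the perpendicular direction.

```python
import math
import random


def first_principal_component(points):
    """Return the unit vector of maximum variance for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


def projected_variance(points, direction):
    """Sample variance of the points projected onto a unit direction."""
    ux, uy = direction
    proj = [x * ux + y * uy for x, y in points]
    m = sum(proj) / len(proj)
    return sum((p - m) ** 2 for p in proj) / (len(proj) - 1)


# Synthetic, correlated data standing in for the matrix M (assumption).
rng = random.Random(0)
data = []
for _ in range(1000):
    x = rng.gauss(0, 3)
    data.append((x, 0.5 * x + rng.gauss(0, 0.5)))

pc1 = first_principal_component(data)
pc2 = (-pc1[1], pc1[0])  # perpendicular direction
var_pc1 = projected_variance(data, pc1)
var_pc2 = projected_variance(data, pc2)
print("PC1 direction:", pc1)
print("variance along PC1:", var_pc1, "along PC2:", var_pc2)
```

No other direction yields a larger projected variance than `pc1`, which is the property used to pick the blue line in the question.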
NEW QUESTION 39
Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether or not the candidate is an incumbent.
Above is an example of
- A. Hierarchical linear models
- B. Linear Regression
- C. Logistic Regression
- D. Maximum likelihood estimation
- E. Recommendation system
Answer: C
Explanation:
Logistic regression fits here because the response variable is binary (win or lose).
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values
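To make the idea concrete, here is a small sketch of how a fitted logistic regression would turn the three predictors into a win probability. All coefficient values and units below are hypothetical and chosen only for illustration; a real model would estimate them from election data.

```python
import math

# Hypothetical coefficients (assumptions, not estimated from any data).
INTERCEPT = -1.0
COEF_MONEY = 0.8       # per unit of money spent (assumed unit)
COEF_NEGATIVE = -0.3   # per unit of time spent campaigning negatively
COEF_INCUMBENT = 1.2   # applied when the candidate is an incumbent


def win_probability(money, negative_time, incumbent):
    """Logistic model: squash a linear score into a probability in (0, 1)."""
    z = (
        INTERCEPT
        + COEF_MONEY * money
        + COEF_NEGATIVE * negative_time
        + COEF_INCUMBENT * incumbent
    )
    return 1.0 / (1.0 + math.exp(-z))


p_challenger = win_probability(1.0, 2.0, 0)
p_incumbent = win_probability(1.0, 2.0, 1)
print("challenger:", p_challenger, "incumbent:", p_incumbent)
```

The output is always between 0 and 1, which is why this model suits a 0/1 outcome where plain linear regression does not.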
NEW QUESTION 40
In which lifecycle stage are test and training data sets created?
- A. Discovery
- B. Data preparation
- C. Model building
- D. Model planning
Answer: C
Explanation:
Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. ELT and ETL are sometimes abbreviated together as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Model building: In Phase 4, the team develops datasets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
NEW QUESTION 41
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?
- A. Try different variables
- B. Define the process to maintain the model
- C. Transform existing variables
- D. Try different analytical techniques
Answer: B
Explanation:
Operationalize: In the final phase, the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users. In Phase 4, the team scored the model in the analytics sandbox.
NEW QUESTION 42
Which of the following problems can you solve using the binomial distribution?
- A. A life insurance salesman sells on average 3 life insurance policies per week. Use Poisson's law to calculate the probability that in a given week he will sell some policies.
- B. It was found that the mean length of 100 parts produced by a lathe was 20.05 mm with a standard deviation of 0.02 mm. Find the probability that a part selected at random would have a length between 20.03 mm and 20.08 mm.
- C. A manufacturer of metal pistons finds that on average 12% of his pistons are rejected because they are either oversize or undersize. What is the probability that a batch of 10 pistons will contain no more than 2 rejects?
- D. Vehicles pass through a junction on a busy road at an average rate of 300 per hour. Find the probability that none passes in a given minute.
Answer: C
Explanation:
Each problem maps to a distribution as follows:
Binomial: A manufacturer of metal pistons finds that on average 12% of his pistons are rejected because they are either oversize or undersize. What is the probability that a batch of 10 pistons will contain no more than 2 rejects?
Poisson: A life insurance salesman sells on average 3 life insurance policies per week. Use Poisson's law to calculate the probability that in a given week he will sell some policies.
Poisson: Vehicles pass through a junction on a busy road at an average rate of 300 per hour. Find the probability that none passes in a given minute.
Normal: It was found that the mean length of 100 parts produced by a lathe was 20.05 mm with a standard deviation of 0.02 mm. Find the probability that a part selected at random would have a length between 20.03 mm and 20.08 mm.
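The four problems can be worked through with a few lines of standard-library Python. This is a sketch under two stated assumptions: "some policies" is read as "at least one policy", and the sample mean and standard deviation of the lathe parts are treated as the parameters of a normal distribution. The helper names are illustrative.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd)."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


# Binomial: at most 2 rejects in a batch of 10 with a 12% reject rate.
p_rejects = binomial_cdf(2, 10, 0.12)

# Poisson: at least one policy sold in a week, mean 3 per week (assumed reading).
p_some_sales = 1 - poisson_pmf(0, 3)

# Poisson: no vehicle in one minute, 300 per hour = 5 per minute.
p_no_vehicle = poisson_pmf(0, 300 / 60)

# Normal: part length between 20.03 mm and 20.08 mm.
p_length = normal_cdf(20.08, 20.05, 0.02) - normal_cdf(20.03, 20.05, 0.02)

print(round(p_rejects, 4), round(p_some_sales, 4),
      round(p_no_vehicle, 4), round(p_length, 4))
```

Only the piston problem has a fixed number of independent trials with a constant success probability, which is what makes it the binomial case.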
NEW QUESTION 43
Which statements correctly apply to unsupervised learning?
- A. It does not have a target variable
- B. Instead of telling the machine "Predict Y for our data X", we're asking "What can you tell me about X?"
- C. We tell the machine "Predict Y for our data X"
Answer: A,B
Explanation:
In unsupervised learning we don't have a target variable as we did in classification and regression.
Instead of telling the machine "Predict Y for our data X", we're asking "What can you tell me about X?"
Things we ask the machine to tell us about X may be "What are the six best groups we can make out of X?" or "What three features occur together most frequently in X?"
NEW QUESTION 44
You are working as a data science consultant for a gaming company. You have a three-member team, and all other stakeholders, such as the project managers, project sponsors, and data team, are from the company itself.
During a discussion, a project manager asks when you can tell whether the model you are using is robust enough. After which phase can you answer this question?
- A. Discovery
- B. Operationalize
- C. Data Preparation
- D. Model building
- E. Model planning
Answer: D
Explanation:
To answer whether the model you are building is robust enough, you need answers to at least the following questions:
- Does the model perform as expected on the test data?
- Have the hypotheses defined in the initial phase been tested?
- Do we need more data?
- Are the domain experts convinced by the model?
All of these can be answered once you have built the model and tested it with the test data sets. Hence, the correct option is Model building.
NEW QUESTION 45
Which of the following are point estimation methods?
- A. MAP
- B. MLE
- C. MMSE
Answer: A,B,C
Explanation:
Point estimators
* minimum-variance mean-unbiased estimator (MVUE), minimizes the risk (expected loss) of the squared-error loss-function.
* best linear unbiased estimator (BLUE)
* minimum mean squared error (MMSE)
* median-unbiased estimator, minimizes the risk of the absolute-error loss function
* maximum likelihood (ML)
* method of moments, generalized method of moments
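The three estimators named in the answer options can be compared on a simple case. The sketch below assumes Bernoulli trials with a Beta(a, b) prior on the success probability; the prior values and the observed counts are hypothetical. Under that assumption MLE is the sample proportion, MAP is the posterior mode, and MMSE is the posterior mean.

```python
def bernoulli_point_estimates(successes, trials, prior_a=2.0, prior_b=2.0):
    """Return (MLE, MAP, MMSE) estimates of a Bernoulli success probability.

    Assumes a Beta(prior_a, prior_b) prior with prior_a, prior_b > 1,
    so the posterior is Beta(successes + prior_a, failures + prior_b).
    """
    mle = successes / trials
    # Posterior mode of the Beta posterior.
    map_est = (successes + prior_a - 1) / (trials + prior_a + prior_b - 2)
    # Posterior mean, which minimizes expected squared error.
    mmse = (successes + prior_a) / (trials + prior_a + prior_b)
    return mle, map_est, mmse


# Hypothetical data: 7 successes in 10 trials.
mle, map_est, mmse = bernoulli_point_estimates(7, 10)
print("MLE:", mle, "MAP:", map_est, "MMSE:", mmse)
```

Each method returns a single number rather than an interval or a full distribution, which is what makes all three point estimation methods.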
NEW QUESTION 46
You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most appropriate for this study?
- A. Linear regression
- B. K-means clustering
- C. Association rules
- D. Decision trees
Answer: B
Explanation:
kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the minimization using several optional input parameters to kmeans, including ones for the initial values of the cluster centroids, and for the maximum number of iterations.
Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to more focused analysis or decision processes. Some specific applications of k-means are image processing, medical, and customer segmentation. Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
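The iterative procedure described above can be sketched in plain Python. This is a minimal version of the standard assign-and-update loop, not the implementation of any particular library; the function signature, the toy points, and the starting centroids are assumptions. It exposes the two controls mentioned in the explanation: initial centroids and a maximum number of iterations.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Assign points to the nearest centroid and recompute means until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Toy data with two obvious groups (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
print(assignments)
```

For the question's scenario, one option would be to seed the centroids with the four individuals of interest, though the centroids will drift toward cluster means as the loop runs.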
NEW QUESTION 47
Suppose you have built a model for a rating system that rates items from 1 to 5 stars, and you calculated an RMSE value of 1.0. Which of the following is correct?
- A. It means that your predictions are on average four stars off of what people really think
- B. It means that your predictions are on average two stars off of what people really think
- C. It means that your predictions are on average one star off of what people really think
- D. It means that your predictions are on average three stars off of what people really think
Answer: C
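A short sketch shows how an RMSE of 1.0 arises. The ratings below are made-up values in which every prediction misses by exactly one star. Strictly speaking, RMSE is the square root of the mean squared error, so "one star off on average" is a loose reading that is exact only when all errors have the same size.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical star ratings: each prediction is off by exactly one star.
actual = [5, 3, 4, 1, 2]
predicted = [4, 4, 3, 2, 3]
error = rmse(predicted, actual)
print(error)
```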
NEW QUESTION 48
Which is an example of supervised learning?
- A. PCA
- B. SVM
- C. SVD
- D. k-means clustering
- E. EM
Answer: B
Explanation:
SVMs can be used to solve various real world problems:
* SVMs are helpful in text and hypertext categorization as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
* Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
* SVMs are also useful in medical science to classify proteins with up to 90% of the compounds classified correctly.
* Hand-written characters can be recognized using SVMs.
NEW QUESTION 49
Classification and regression are examples of ___________.
- A. supervised learning
- B. unsupervised learning
- C. Clustering
- D. Density estimation
Answer: A
Explanation:
In classification, our job is to predict what class an instance of data should fall into. Another task in machine learning is regression. Regression is the prediction of a numeric value. Most people have probably seen an example of regression with a best-fit line drawn through some data points to generalize the data points.
Classification and regression are examples of supervised learning. This set of problems is known as supervised because we're telling the algorithm what to predict.
NEW QUESTION 50
Select the choice for which regression algorithms are not the best fit.
- A. The weight of a person
- B. The temperature in the atmosphere
- C. Employee status
- D. The dimensions of an object
Answer: C
Explanation:
Regression algorithms are usually employed when the data points are inherently numerical variables (such as the dimensions of an object, the weight of a person, or the temperature in the atmosphere), but unlike Bayesian algorithms, they're not very good for categorical data (such as employee status or credit score description).
NEW QUESTION 51
Your customer provided you with 2,000 unlabeled records to be separated into three groups. What is the correct analytical method to use?
- A. Linear regression
- B. Logistic regression
- C. K-means clustering
- D. Naive Bayesian classification
- E. Semi Linear Regression
Answer: C
Explanation:
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
The algorithm has nothing to do with and should not be confused with k-nearest neighbor, another popular machine learning technique.
NEW QUESTION 52
Digit recognition is an example of ...
- A. Unsupervised learning
- B. None of the above
- C. Clustering
- D. Classification
Answer: D
Explanation:
Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. Digit recognition, once again, is a common example of classification learning. More generally, classification learning is appropriate for any problem where deducing a classification is useful and the classification is easy to determine. In some cases, it might not even be necessary to give pre-determined classifications to every instance of a problem if the agent can work out the classifications for itself. This would be an example of unsupervised learning in a classification context.