Description

For the final project for this class, I chose to work with the Heart Disease dataset from UC Irvine. This is a multivariate dataset, meaning it contains multiple separate mathematical or statistical variables. Multivariate numerical data analysis is a statistical technique used to analyze and understand the relationships among those variables.

The goal of the project was to develop a machine-learning model that predicts whether a patient has heart disease. The dataset contains 76 different attributes, but most studies focus on a subset of 14 of them.

These attributes are age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, and thalassemia.

By analyzing these features, the machine learning model can identify patterns and relationships that may be useful in predicting whether a patient has heart disease.

In addition to the classification task, researchers may also be interested in using this dataset to explore and gain insights into the factors that contribute to heart disease. They may use statistical analysis and visualization techniques to identify patterns and correlations between the different attributes and the occurrence of heart disease.

Data Dictionary

  1. age – Age of the patient – Numeric
  2. sex – Sex of the patient – Boolean (1 = male; 0 = female)
  3. cp – Type of chest pain – Numeric: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
  4. trestbps – Resting blood pressure (in mm Hg) on admission to the hospital – Numeric
  5. chol – Serum cholesterol in mg/dl – Numeric
  6. fbs – Fasting blood sugar > 120 mg/dl – Boolean (1 = true; 0 = false)
  7. restecg – Resting electrocardiographic measurement – Numeric: 0 = normal; 1 = ST-T wave abnormality (T-wave inversions and/or ST elevation or depression of > 0.05 mV); 2 = probable or definite left ventricular hypertrophy
  8. thalach – Maximum heart rate achieved – Numeric
  9. exang – Exercise-induced angina – Boolean (1 = yes; 0 = no)
  10. oldpeak – ST depression induced by exercise relative to rest (“ST” refers to positions on the ECG plot) – Numeric
  11. slope – The slope of the peak exercise ST segment – Numeric: 1 = upsloping; 2 = flat; 3 = downsloping
  12. ca – Number of major vessels colored by fluoroscopy – Numeric (0–3)
  13. thal – A blood disorder called thalassemia – Numeric: 3 = normal; 6 = fixed defect; 7 = reversible defect
  14. target – Heart disease – Boolean (0 = no; 1 = yes)
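As a sketch, the 14-attribute subset can be loaded in R directly from the processed Cleveland file distributed by the UCI repository (the filename and the “?” missing-value marker follow the repository's conventions; adjust the path for your setup):

```r
# Load the processed Cleveland data with the 14 attributes above.
# "?" marks missing values in the raw UCI file.
cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
heart <- read.csv("processed.cleveland.data",
                  header = FALSE, col.names = cols, na.strings = "?")

str(heart)      # column types and a preview of values
summary(heart)  # per-column summaries, including NA counts
```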

The dataset was inspected using the “str” and “summarizeColumns” functions, with the output shown in Table 1. The following anomalies were identified in the dataset:

  1. The target feature, which was designated as a binary target feature, had a cardinality of 5 instead of 2. This means that there were more than two possible values for the target feature, which is not consistent with its designation as a binary feature.
  2. Ten out of the 14 features in the dataset contained missing values. Among these features, Slope, CA, and Thal had a particularly high number of missing values, with 309, 611, and 486 missing values, respectively. Missing values can be problematic for machine learning models, as they can lead to biased or inaccurate predictions.
  3. The features “trestbps” (resting blood pressure) and “chol” (serum cholesterol) contained several data entries with values of 0, which is not possible for these diagnostic tests. This suggests that there may be errors or anomalies in the data that need to be corrected.
  4. The target column contained values ranging from 0 to 4, indicating different degrees of heart disease. However, the researchers chose to group the data into two categories of “no heart disease” (value of 0) and “displaying heart disease” (value of 1) to make it binary. This simplification may have implications for the accuracy of the machine learning models that are built using this dataset.
  5. The dataset was found to be unbalanced, with a higher proportion of people diagnosed with heart disease. This may require additional parameter-tuning when building machine-learning models to ensure that they can accurately predict the presence or absence of heart disease in patients.
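The fixes implied by anomalies 2–4 can be sketched in base R. The toy data frame below is illustrative only (the real dataset has all 14 columns); the cleaning steps are the ones described above:

```r
# Toy stand-in for the real dataset (values are illustrative only)
heart <- data.frame(
  trestbps = c(130, 0, 140, 120),
  chol     = c(250, 230, 0, 210),
  target   = c(0, 2, 1, 0)      # raw target ranges over 0-4
)

# Zero blood pressure / cholesterol is impossible -> treat as missing
heart$trestbps[heart$trestbps == 0] <- NA
heart$chol[heart$chol == 0] <- NA

# Collapse the 0-4 target into a binary label
heart$target <- ifelse(heart$target > 0, 1, 0)

# Count missing values per feature and check class balance
colSums(is.na(heart))
table(heart$target)
```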

Overall, these findings highlight the importance of careful data preprocessing and cleaning in machine learning projects, as well as the need to consider the specific characteristics of the dataset when building models.

The data analysis, preprocessing, and model generation were performed using the R programming language. R is a popular tool for data analysis and statistical computing, providing a wide range of functions and libraries for working with data, building machine learning models, and visualizing results, which made it an appropriate choice for this project.

Data Visualization

Histogram of diseased vs. non-diseased patients. In the given data, we seem to have more healthy people than people with heart disease
Comparison of Cholesterol across pain type
Correlation plotting
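The three plots listed above could be produced with ggplot2 and corrplot along these lines (a sketch; it assumes a cleaned data frame `heart` with a binary `target` column):

```r
library(ggplot2)
library(corrplot)

# Histogram of diseased vs. non-diseased patients
ggplot(heart, aes(x = factor(target))) +
  geom_bar() +
  labs(x = "Heart disease (0 = no, 1 = yes)", y = "Number of patients")

# Cholesterol compared across chest-pain types
ggplot(heart, aes(x = factor(cp), y = chol)) +
  geom_boxplot() +
  labs(x = "Chest pain type", y = "Serum cholesterol (mg/dl)")

# Correlation plot of all features
corrplot(cor(heart, use = "pairwise.complete.obs"))
```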

Business Questions

  • What age groups are more vulnerable to heart disease?
    Based on the data above, no specific age group stands out as especially vulnerable to heart disease.
  • What are the cholesterol levels that contribute toward heart diseases?
    Cholesterol levels between 200 and 250 mg/dl seem to contribute most toward heart disease.
  • Which sex is more at risk for heart disease?
    Based on the ratio, women appear to be more at risk than men.
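One way to back these answers with numbers is a few simple aggregations over the cleaned data (a sketch; the data frame `heart` and the 10-year age bands are assumptions):

```r
# Proportion with heart disease by sex (1 = male, 0 = female)
aggregate(target ~ sex, data = heart, FUN = mean)

# Proportion with heart disease by 10-year age band
heart$age_band <- cut(heart$age, breaks = seq(20, 80, by = 10))
aggregate(target ~ age_band, data = heart, FUN = mean)

# Typical cholesterol among diseased vs. healthy patients
aggregate(chol ~ target, data = heart, FUN = median)
```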

Model Generation

Logistic regression was chosen as the modeling technique because it returns a binary output, which is suitable for the binary classification problem at hand. By using logistic regression, the model is able to predict whether a patient has heart disease or not, based on the 14 attributes provided in the dataset.

The final accuracy of the model is 84.74%.
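A minimal version of this workflow in base R might look like the following (a sketch: the 70/30 split, the seed, and the 0.5 threshold are illustrative assumptions, and `heart` is the cleaned data frame with a binary `target`):

```r
set.seed(42)                                  # illustrative seed
idx   <- sample(seq_len(nrow(heart)), size = 0.7 * nrow(heart))
train <- heart[idx, ]
test  <- heart[-idx, ]

# Logistic regression on the remaining 13 attributes
fit  <- glm(target ~ ., data = train, family = binomial)

# Predicted probabilities -> class labels at a 0.5 threshold
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

mean(pred == test$target, na.rm = TRUE)       # test-set accuracy
```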

Reflection and Learning Goal

The course covered the essential concepts and characteristics of data and how to manage it using R and R-Studio. It also taught me principles and practices in data screening, cleaning, and linking, as well as how to communicate the results to decision-makers.

The course helped me identify problems and understand the data needed to address them. I learned to perform basic computational scripting using R and other optional tools. I also learned how to transform data through processing, linking, aggregation, summarization, and searching. Additionally, I learned how to organize and manage data at various stages of a project life cycle.

Overall, the course provided me with a solid foundation in data management and analysis using R and related tools. The skills and knowledge gained from this course will be useful in a variety of fields where data analysis and management are essential.

Learning Objectives

  1. Understand essential concepts and characteristics of data: This refers to gaining a fundamental understanding of data, including what data is, the different types of data, and the characteristics of good quality data. This includes understanding key concepts like variables, observations, data types, and data distributions.

  2. Understand scripting/code development for data management using R and R-Studio: This involves learning how to use R and R-Studio to manage and analyze data, including writing scripts and code to automate data processing tasks. This includes learning how to read data into R, manipulate and clean data, perform basic statistical analyses, and visualize data.

  3. Understand principles and practices in data screening, cleaning, and linking: This involves learning how to screen, clean, and link data to ensure that it is of high quality and suitable for analysis. This includes learning how to identify missing or erroneous data, deal with outliers, and merge data from different sources.

  4. Understand communication of results to decision makers: This involves learning how to communicate data analysis results to decision makers, including presenting results in a clear and concise manner using visualizations, tables, and charts.

  5. Identify a problem and the data needed for addressing the problem: This involves identifying a specific problem or research question, and then determining what data is needed to address the problem. This includes learning how to design data collection instruments and how to identify and obtain existing data sources.

  6. Perform basic computational scripting using R and other optional tools: This involves learning how to write basic scripts and code to automate data processing tasks using R and other optional tools. This includes learning how to read data into R, manipulate and clean data, perform basic statistical analyses, and visualize data.

  7. Transform data through processing, linking, aggregation, summarization, and searching: This involves learning how to transform data through various data processing techniques, such as linking, aggregation, summarization, and searching. This includes learning how to merge data from different sources, summarize data using statistical measures, and search for patterns in large datasets.

Full details of the project can be found here