Learn R

Topics

Learn R: Introduction
Learn R: Data Frames
Learn R: Data Cleaning
Learn R: Fundamentals of Data Visualization with ggplot2
Learn R: Aggregates
Learn R: Joining Tables
Learn R: Mean, Median, and Mode
Learn R: Variance and Standard Deviation
Learn R: Quartiles, Quantiles, and Interquartile Range
Learn R: Hypothesis Testing

Learn R: Introduction

R Logical Data Type

The R logical data type has two possible values: TRUE or FALSE.

NA Data Type in R

In R, NA represents missing or undefined data. Write NA in uppercase and do not wrap it in quotes; "NA" in quotes is an ordinary character string.

Mathematical Operations in R

R supports mathematical operations such as addition, subtraction, multiplication, and division. These operations follow the standard order of operations (PEMDAS).

# Results in 500
573 - 74 + 1
# Results in 50
25 * 2
# Results in 2
10 / 5

R Data Types

R has various data types, including numeric, integer, logical, and character. Each type suits a different kind of data.

# var1 has the numeric data type
var1 <- 3
# var2 has the character data type
var2 <- "happiness"
# var3 has the logical data type
var3 <- TRUE

R Conditional Statements

Conditional statements in R evaluate a condition and run a code block only when the condition is TRUE.

n <- 5
if (n == 5) {
  print('n is equal to five')
}

R Comments

In R, comments start with the # symbol. Everything following # on that line is ignored during execution.

# This is a comment!

R’s Character Data Type

The character data type in R stores text. Text strings are enclosed in double or single quotes.

Assignment Operator in R

The assignment operator <- assigns values to variables. For example, x <- 10 assigns the value 10 to the variable x.

example <- 4
another_example <- 15.2

R Numeric Data Type

The numeric data type represents numbers. R handles both integers and real numbers, and performs arithmetic operations on them.
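
As a rough illustration of the data types covered in this section, the sketch below checks each type with class() and shows how NA behaves. It uses only base R, and the variable names are made up for the example.

```r
# Illustrative values; the variable names are invented for this sketch.
count <- 3          # numeric (stored as a double by default)
whole <- 3L         # the L suffix creates an integer
label <- "happiness"
flag  <- count > 2  # comparisons produce logical values

class(count)  # "numeric"
class(whole)  # "integer"
class(label)  # "character"
class(flag)   # "logical"

# NA marks a missing value; "NA" in quotes is just text.
missing <- NA
is.na(missing)  # TRUE
is.na("NA")     # FALSE
missing + 1     # NA: arithmetic with a missing value stays missing
```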

Learn R: Data Frames

dplyr package

The dplyr package provides functions for data manipulation. Key functions include select(), filter(), mutate(), arrange(), and summarize(), which allow for efficient data exploration and manipulation.

Loading and Saving CSVs with R

The read_csv() and write_csv() functions from the readr package read from and write to CSV files. read_csv() returns a tibble, a modern version of the data frame.

filter with logical operators

The filter() function subsets rows based on conditions. Conditions are built with comparison operators (e.g., <, ==, >, !=) and can be combined with the logical operators & (and), | (or), and ! (not).

data frame object

A data frame in R is a two-dimensional structure with rows and columns. Each column can hold a different type of data, and each row represents a single observation.

Excluding Columns with select() in dplyr

The select() function excludes columns whose names are prefixed with a - sign. The result is a new data frame without the excluded columns.

artists %>% select(-genre, -spotify_monthly_listeners, -year_founded)

rename-dplyr

The rename() function changes column names in a data frame, using the form new_name = old_name. Variants like rename_if(), rename_at(), and rename_all() provide more flexibility; in dplyr 1.0.0 and later they are superseded by rename_with().

dplyr’s filter()

The filter() function selects rows that meet certain criteria. Conditions are specified as arguments to the function, and a row is kept only if it satisfies all of them.

filter(artists, genre == 'Rock', spotify_monthly_listeners > 20000000)

data frames primary information

Functions like head() and summary() provide insights into data frames. head() shows the first six rows by default, while summary() offers summary statistics for each column.

dplyr arrange()

The arrange() function sorts rows based on column values. Rows are sorted in ascending order by default; wrap a column in desc() to sort it in descending order.

arrange(data_frame, desc(column_name))

mutate() dplyr

The mutate() function adds new columns to a data frame or modifies existing ones based on transformations.

mutate(data_frame, new_column = existing_column * 2)
# Add a cm column computed from the inches column
mutate(heights, cm = inches * 2.54)

Comma Separated Values (CSV)

CSV files store data in a plain text format, where values are separated by commas. These files are compatible with many applications and can be imported into R for analysis.

pipes

The pipe operator %>% passes the result of one function as the first argument of the next. This allows for cleaner and more readable code.

data_frame %>% select(column) %>% filter(condition)

transmute() dplyr

The transmute() function creates new columns from transformations and keeps only those new columns, dropping the existing ones.

transmute(data_frame, new_column = transformation(existing_column))
# Ratio of each row's total_population to the previous row's
transmute(population, increase = total_population / lag(total_population))

dplyr’s select()

The select() function extracts specified columns from a data frame, while dropping others. Columns can be given by name or by position.

# Keep the first two columns of weather
weather %>% select(1:2)
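
The verbs in this section are often chained with the pipe. The following is a minimal sketch, assuming dplyr is installed; the data is invented for illustration and only loosely echoes the artists examples in this section.

```r
library(dplyr)

# Made-up data for illustration only.
artists <- tibble(
  group = c("A", "B", "C", "D"),
  genre = c("Rock", "Pop", "Rock", "Jazz"),
  spotify_monthly_listeners = c(30000000, 45000000, 5000000, 800000)
)

popular_rock <- artists %>%
  # keep rows that satisfy both conditions
  filter(genre == "Rock" & spotify_monthly_listeners > 20000000) %>%
  # add a derived column
  mutate(listeners_millions = spotify_monthly_listeners / 1e6) %>%
  # sort from most to fewest listeners
  arrange(desc(listeners_millions)) %>%
  # keep only the columns of interest
  select(group, listeners_millions)

popular_rock
```

Each step receives the data frame produced by the previous one, so the pipeline reads top to bottom in the order the operations happen.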

Learn R: Data Cleaning

gsub() R Function

The gsub() function replaces occurrences of a pattern in a string with a replacement value. It can handle single strings or vectors of strings.

# Replace '1' with an empty string in a vector
teams <- c("Fal1cons", "Cardinals", "Seah1awks", "Vikings", "Bro1nco", "Patrio1ts") 
teams_clean <- gsub("1", "", teams)
print(teams_clean)
# Output: "Falcons" "Cardinals" "Seahawks" "Vikings" "Bronco" "Patriots"

distinct() dplyr

The distinct() function removes duplicate rows from a data frame. It can also be used to select unique values from specific columns.

# Keep unique rows in a data frame
distinct(data_frame)
# Keep unique values in a specific column
distinct(data_frame, column_name)

str() Function

The str() function provides a compact, human-readable summary of the structure of an R object. It displays the type, size, and content of the object.

str(data_frame)

Combining Data with R

Multiple files can be combined into a single data frame using functions like lapply() and bind_rows(). This is useful for aggregating data from different sources.

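# Example pattern: every CSV file in the working directory
files <- list.files(pattern = "\\.csv$")
df_list <- lapply(files, read_csv)
df_combined <- bind_rows(df_list)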

R as.numeric() Function

The as.numeric() function converts data to numeric type, which is useful for performing numerical operations on data that may be stored as characters.

numeric_data <- as.numeric(character_data)
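
A small sketch of what the conversion does when some values are not numbers. The input vector is invented for the example; the point is that text that cannot be parsed becomes NA.

```r
# Illustrative input: numbers stored as text, plus one entry that is not a number.
character_data <- c("10", "2.5", "abc")

# Values that cannot be parsed become NA. R also emits a warning,
# which is silenced here so the example runs quietly.
numeric_data <- suppressWarnings(as.numeric(character_data))

numeric_data                     # 10.0  2.5   NA
sum(numeric_data, na.rm = TRUE)  # 12.5
```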

str_sub() function

The str_sub() function from the stringr package extracts or replaces substrings within a string, using specified start and end positions.

# Extract the first five characters from a string
str_sub('Marya1984', start = 1, end = 5)
# Results in "Marya"

Handling Missing Data

Missing data in R is represented by NA. Handling missing data involves techniques such as imputation, exclusion, or using functions like is.na() to identify missing values.

# Keep only the rows where column_name is not missing
data_without_na <- data_frame[!is.na(data_frame$column_name), ]
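
The sketch below walks through identifying, excluding, and imputing missing values with base R. The scores vector is made up, and mean imputation is shown only as one simple option, not a recommendation.

```r
# Illustrative data with two missing values.
scores <- c(90, NA, 75, NA, 60)

is.na(scores)               # FALSE TRUE FALSE TRUE FALSE
sum(is.na(scores))          # 2 missing values
mean(scores)                # NA, because missing values propagate
mean(scores, na.rm = TRUE)  # 75, computed from the observed values only

# Exclusion: drop the missing entries.
observed <- scores[!is.na(scores)]

# Imputation: replace each missing entry with the mean of the observed values.
imputed <- ifelse(is.na(scores), mean(scores, na.rm = TRUE), scores)
imputed                     # 90 75 75 75 60
```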

tidyr Functions

The tidyr package provides functions for reshaping data, such as pivot_longer() and pivot_wider(), which transform data between wide and long formats.

# Convert wide format to long format
pivot_longer(data_frame, cols = c(column1, column2))
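
A minimal round trip between the two formats, assuming tidyr is installed. The data frame and the names measure and value are invented for this sketch.

```r
library(tidyr)

# Made-up wide data: one row per id, one column per measurement.
wide <- data.frame(id = c(1, 2), column1 = c(10, 20), column2 = c(30, 40))

# Wide to long: the column names go into "measure", their values into "value".
long <- pivot_longer(
  wide,
  cols = c(column1, column2),
  names_to = "measure",
  values_to = "value"
)

# Long back to wide: one column per distinct value of "measure".
back <- pivot_wider(long, names_from = measure, values_from = value)
```

Here long has one row per id and measurement (four rows), and back has the same shape as wide.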

regex

Regular expressions (regex) are patterns used for string matching and manipulation. Functions like gsub() and str_replace() use regex to perform replacements.

# Remove all non-numeric characters
cleaned_string <- gsub("[^0-9]", "", original_string)
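
A few more patterns, sketched with base R functions only. The input strings are invented for illustration.

```r
# Illustrative input.
phone <- "Call (555) 123-4567 now"

# [^0-9] matches any character that is not a digit.
digits_only <- gsub("[^0-9]", "", phone)              # "5551234567"

# ^ and $ anchor the pattern to the start and end of the string.
all_digits <- grepl("^[0-9]+$", c("2024", "20a4"))    # TRUE FALSE

# sub() replaces only the first match; \\s+$ is trailing whitespace.
trimmed <- sub("\\s+$", "", "trailing   ")            # "trailing"
```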

R’s substr() Function

The substr() function extracts or replaces substrings from a string based on specified starting and ending positions.

substr('example_string', 1, 7)
# Results in "example"

Learn R: Fundamentals of Data Visualization with ggplot2

ggplot2 Overview

ggplot2 is a powerful package for creating data visualizations. It follows the Grammar of Graphics to build plots by combining various components.

# Basic scatter plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) + geom_point()

ggplot2 Functions

Key ggplot2 functions include ggplot() to initialize the plot, aes() for aesthetics, and geom_* functions to add geometric layers like points, lines, or bars.

# Adding a smoothed trend line to a scatter plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) + geom_point() + geom_smooth()

Mapping Aesthetics

In ggplot2, aesthetics (aes()) control visual properties such as color, size, and shape. Mapping aesthetics allows you to represent data values visually.

# Scatter plot with color mapping
ggplot(data_frame, aes(x = x_variable, y = y_variable, color = category)) + geom_point()

geom_bar() Function

The geom_bar() function creates bar charts for categorical data. By default, the height of each bar is the number of rows in that category; to plot values already stored in a column, use geom_col() instead.

# Basic bar chart
ggplot(data_frame, aes(x = category)) + geom_bar()
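
The difference between counting rows and plotting stored values can be sketched as follows, assuming ggplot2 is installed. Both data frames are made up for the example.

```r
library(ggplot2)

# Made-up data: one row per pet, and a pre-computed table of totals.
pets <- data.frame(category = c("cat", "dog", "dog", "fish", "dog", "cat"))
totals <- data.frame(category = c("cat", "dog", "fish"), value = c(2, 3, 1))

# geom_bar() counts the rows in each category.
count_plot <- ggplot(pets, aes(x = category)) + geom_bar()

# geom_col() uses a y value that is already in the data.
value_plot <- ggplot(totals, aes(x = category, y = value)) + geom_col()
```

With this data the two plots show the same bar heights, reached in two different ways.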

ggtitle() Function

The ggtitle() function adds titles to plots. It allows customization of plot titles, including main titles and subtitles.

# Adding a title
ggplot(data_frame, aes(x = x_variable, y = y_variable)) + geom_point() + ggtitle("Main Title")

ggplot2 Themes

Themes in ggplot2 control the non-data elements of plots, such as background color, gridlines, and text size. Themes can be customized or replaced with built-in options.

# Applying a theme
ggplot(data_frame, aes(x = x_variable, y = y_variable)) + geom_point() + theme_minimal()

geom_histogram() Function

The geom_histogram() function creates histograms to show the distribution of a single variable. It groups data into bins and displays frequencies.

# Basic histogram
ggplot(data_frame, aes(x = variable)) + geom_histogram(bins = 30)

geom_point() Function

The geom_point() function creates scatter plots, plotting data points with specified aesthetics such as color and size.

# Scatter plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) + geom_point()

geom_line() Function

The geom_line() function adds line layers to plots, useful for visualizing trends over time or continuous variables.

# Line plot
ggplot(data_frame, aes(x = x_variable, y = y_variable)) + geom_line()

Learn R: Aggregates

R’s Aggregation Functions

Aggregation functions in R, such as sum(), mean(), and sd(), reduce a set of values to a single summary value. Combined with grouping, they compute that summary separately for each group.

# Calculating mean
mean(data_frame$column_name)

Group By with dplyr

The group_by() function groups data based on specified columns. It is often used in conjunction with summarization functions to analyze data by groups.

data_frame %>% group_by(group_variable) %>% summarize(mean_value = mean(numeric_variable))
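
A small worked version of this pattern, assuming dplyr is installed. The sales data and column names are invented for illustration.

```r
library(dplyr)

# Made-up data: five sales across two regions.
sales <- tibble(
  region = c("east", "east", "west", "west", "west"),
  amount = c(10, 20, 5, 15, 40)
)

by_region <- sales %>%
  group_by(region) %>%
  summarize(
    total = sum(amount),        # east: 30, west: 60
    mean_amount = mean(amount), # east: 15, west: 20
    n = n()                     # east: 2,  west: 3
  )

by_region
```

The result has one row per group, with one column for each summary.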

Learn R: Joining Tables

inner_join() Function

The inner_join() function combines rows from two data frames where there are matching values in specified columns.

inner_join(df1, df2, by = "common_column")

left_join() Function

The left_join() function combines rows from two data frames, retaining all rows from the left data frame and matching rows from the right data frame.

left_join(df1, df2, by = "common_column")

right_join() Function

The right_join() function retains all rows from the right data frame and matches rows from the left data frame, filling in with NA where no match is found.

right_join(df1, df2, by = "common_column")

full_join() Function

The full_join() function combines rows from two data frames, retaining all rows from both data frames, with NA in places where matches are not found.

full_join(df1, df2, by = "common_column")
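
To compare the four joins side by side, here is a minimal sketch assuming dplyr is installed. The two tables are made up; they share keys 2 and 3, while key 1 appears only in df1 and key 4 only in df2.

```r
library(dplyr)

# Made-up tables for illustration.
df1 <- tibble(common_column = c(1, 2, 3), left_value = c("a", "b", "c"))
df2 <- tibble(common_column = c(2, 3, 4), right_value = c("x", "y", "z"))

inner <- inner_join(df1, df2, by = "common_column")  # keys 2, 3
left  <- left_join(df1, df2, by = "common_column")   # keys 1, 2, 3
right <- right_join(df1, df2, by = "common_column")  # keys 2, 3, 4
full  <- full_join(df1, df2, by = "common_column")   # keys 1, 2, 3, 4

# Unmatched rows are filled with NA, e.g. key 1 has no right_value.
left$right_value  # NA "x" "y"
```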

Learn R: Mean, Median, and Mode

Mean

The mean is the average value of a dataset, calculated by summing all values and dividing by the number of values.

mean(c(2, 4, 6, 8))

Median

The median is the middle value of a dataset when it is ordered from least to greatest. If there is an even number of values, the median is the average of the two middle values.

median(c(2, 4, 6, 8))

Mode

The mode is the value that appears most frequently in a dataset. R does not have a built-in function for mode, but it can be calculated using table() and which.max().

counts <- table(c(1, 2, 2, 3, 4))
names(counts)[which.max(counts)]  # Results in "2"
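
Because which.max() returns only the first maximum, it reports a single mode even when several values are tied. One way to return every mode is sketched below; get_modes is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper: returns every value that shares the highest count.
# The result is a character vector because table() stores values as names.
get_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

get_modes(c(1, 2, 2, 3, 4))            # "2"
get_modes(c("a", "b", "a", "b", "c"))  # "a" "b"
```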

Learn R: Variance and Standard Deviation

Variance

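Variance measures the spread of data points around the mean. The population variance is the average of the squared differences from the mean. R's var() function computes the sample variance, which divides the sum of squared differences by n - 1 instead of n.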

var(c(1, 2, 3, 4, 5))
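
The calculation can be done by hand to see where the result of var() comes from. This sketch uses base R only.

```r
x <- c(1, 2, 3, 4, 5)

# Squared differences from the mean (the mean is 3).
squared_diffs <- (x - mean(x))^2            # 4 1 0 1 4

# Dividing by n gives the population variance.
population_variance <- sum(squared_diffs) / length(x)    # 2

# Dividing by n - 1 gives the sample variance, which is what var() returns.
sample_variance <- sum(squared_diffs) / (length(x) - 1)  # 2.5

var(x)                                      # 2.5
```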

Standard Deviation

Standard deviation is the square root of the variance, providing a measure of spread in the same units as the data.

sd(c(1, 2, 3, 4, 5))

Learn R: Quartiles, Quantiles, and Interquartile Range

Quartiles

Quartiles divide a dataset into four equal parts. Q1 (first quartile) is the value below which 25% of the data falls, Q2 (median) divides the data in half, and Q3 (third quartile) is the value below which 75% of the data falls.

quantile(c(1, 2, 3, 4, 5), probs = 0.25)

Quantiles

Quantiles are values that divide a dataset into equal parts. Common quantiles include quartiles, deciles, and percentiles.

quantile(c(1, 2, 3, 4, 5), probs = c(0.1, 0.5, 0.9))

Interquartile Range (IQR)

The IQR measures the range within which the middle 50% of the data falls. It is calculated as Q3 minus Q1.

IQR(c(1, 2, 3, 4, 5))
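
The relationship between the quartiles and the IQR can be checked directly. This sketch uses base R with its default quantile method.

```r
x <- c(1, 2, 3, 4, 5)

# The three quartiles in one call.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q                                   # 25%: 2, 50%: 3, 75%: 4

# Q3 minus Q1 matches the result of IQR().
iqr_by_hand <- unname(q["75%"] - q["25%"])  # 2
IQR(x)                                      # 2
```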

Learn R: Hypothesis Testing

Introduction to Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves formulating a null hypothesis (H0) and an alternative hypothesis (H1), and then using statistical tests to determine whether there is enough evidence to reject the null hypothesis.

# Formulating hypotheses
H0 <- "The mean is equal to 0"
H1 <- "The mean is not equal to 0"

t-test

The t-test compares means. A one-sample t-test compares a sample mean to a hypothesized value, an independent two-sample t-test compares the means of two groups, and a paired t-test compares two sets of matched measurements.

# One-sample t-test
result <- t.test(data_vector, mu = 0)
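
A runnable sketch of the one-sample and two-sample forms, using base R. All the numbers are made up for illustration, and group_a and group_b are hypothetical names.

```r
# Made-up sample for a one-sample test against a mean of 0.
data_vector <- c(0.8, -0.3, 1.2, 0.5, 0.9, -0.1, 0.7, 1.1)
result <- t.test(data_vector, mu = 0)

result$estimate   # sample mean
result$statistic  # t statistic
result$p.value    # p-value

# Two made-up groups for an independent two-sample test.
# t.test() uses Welch's test (unequal variances) by default.
group_a <- c(5.1, 4.9, 5.3, 5.0, 5.2)
group_b <- c(6.2, 6.0, 6.4, 6.1, 5.9)
two_sample <- t.test(group_a, group_b)
two_sample$p.value
```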

Chi-Square Test

The chi-square test assesses the association between categorical variables. It compares observed frequencies to expected frequencies to determine if there is a significant difference.

# Chi-square test for independence
result <- chisq.test(table(data_frame$variable1, data_frame$variable2))
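
A self-contained version of this test on made-up survey data, using base R. The data frame and its values are invented; note that chisq.test() applies a continuity correction to 2 x 2 tables by default.

```r
# Made-up data: 60 responses cross-classified by two categorical variables.
survey <- data.frame(
  variable1 = c(rep("yes", 30), rep("no", 30)),
  variable2 = c(rep("a", 20), rep("b", 10), rep("a", 10), rep("b", 20))
)

# Observed frequencies as a contingency table.
counts <- table(survey$variable1, survey$variable2)

result <- chisq.test(counts)

result$expected  # frequencies expected if the variables were independent
result$p.value   # p-value of the test
```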

ANOVA (Analysis of Variance)

ANOVA tests if there are significant differences among the means of three or more groups. It extends the t-test for comparing multiple groups.

# One-way ANOVA
result <- aov(response_variable ~ factor_variable, data = data_frame)
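
A small end-to-end sketch with made-up data, using base R. An aov object does not carry a p.value element, so the p-value is read from the summary table instead.

```r
# Made-up data: three groups with four measurements each.
data_frame <- data.frame(
  factor_variable = factor(rep(c("a", "b", "c"), each = 4)),
  response_variable = c(5.1, 4.9, 5.3, 5.0,
                        6.2, 6.0, 6.4, 6.1,
                        5.0, 5.2, 4.8, 5.1)
)

result <- aov(response_variable ~ factor_variable, data = data_frame)

# The ANOVA table is the first element of the summary.
anova_table <- summary(result)[[1]]
anova_table

# p-value for the factor (first row of the table).
p_value <- anova_table[["Pr(>F)"]][1]
```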

p-value

The p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true. A small p-value (typically < 0.05) is taken as evidence against the null hypothesis.

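# Extract p-value from a t.test() or chisq.test() result
p_value <- result$p.value
# For an aov() result, read it from the summary table instead
p_value <- summary(result)[[1]][["Pr(>F)"]][1]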

Confidence Intervals

A confidence interval provides a range of plausible values for the true population parameter. At a 95% confidence level, about 95% of intervals constructed this way from repeated samples would contain the true value.

# Calculate 95% confidence interval
result <- t.test(data_vector)
conf_interval <- result$conf.int
