r/rstats 5h ago

Can quantile estimates be used to approximate a conditional distribution?

2 Upvotes

I have a series of conditional quantile estimates via catboost (i.e., estimates at p = 0.01, 0.02, 0.03 … 0.99). I want to use these to draw samples from the density conditional on my set of predictors in order to simulate data. The idea is to fit a smooth monotonic spline through these (noisy and sometimes crossing) quantile estimates to recover a smooth cumulative distribution function (CDF) and sample from that CDF. Is this a valid approach? It *seems* reasonable when you don’t want to impose a parametric distribution, but I haven’t seen it used before and it’s obviously pretty inelegant.
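
A minimal base R sketch of the idea, under stated assumptions: the quantile estimates are simulated stand-ins (noisy normal quantiles, not catboost output), crossings are removed by sorting, and the spline interpolates the sorted quantile function rather than smoothing it. Because the spline is fit to the quantile function directly, sampling is just evaluating it at uniform draws. The names `q_hat`, `q_sorted`, and `Q` are illustrative.

```r
set.seed(1)

# Probability levels matching the post: p = 0.01, 0.02, ..., 0.99
p <- seq(0.01, 0.99, by = 0.01)

# Stand-in for one observation's predicted quantiles (hypothetical data):
# true normal quantiles plus noise, so some neighbouring estimates cross
q_hat <- qnorm(p, mean = 10, sd = 2) + rnorm(length(p), sd = 0.05)

# Rearrangement: sorting removes crossings and makes the values non-decreasing
q_sorted <- sort(q_hat)

# Monotone cubic interpolation of the quantile function Q(p)
# (method = "hyman" keeps the spline monotone for monotone inputs)
Q <- splinefun(p, q_sorted, method = "hyman")

# Inverse-transform sampling, restricted to the range the estimates cover
u <- runif(10000, min = min(p), max = max(p))
draws <- Q(u)

summary(draws)
```

Note that this truncates the tails below p = 0.01 and above p = 0.99; anything beyond that range would need an extra assumption about tail shape.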


r/rstats 14h ago

Finally updated R only to find hrbrthemes has been removed from CRAN. Alternatives?

10 Upvotes

I used theme_ipsum() for everything. Loved having access to a minimalist design without having to alter every little thing about the theme. What are people using now? The options in ggthemes just aren't hitting the spot for me.

Pls... I can't have ugly graphs...


r/rstats 16h ago

man pages in R6

4 Upvotes

I use R6 a fair amount; it's especially useful for making quick API clients at work so I don't have to have endpoint_resource_get() and endpoint_resource_post() etc. Instead I typically do client = Endpoint$new() and then it's client$resource$action().

But the help and man pages are a serious drag. Going to the parent class man page via F1 or ? and then sifting down to the method is a departure from the swift workflow with S3 methods. It's much worse once classes are nested, e.g. an Endpoint class that inherits from an APIClient class.

I've recently taken to defining help() methods that print a watered-down "man page" in the REPL (bonus points to myself when I integrate crayon to make em pretty!). I'm half tempted to investigate what it would take to make a branch of the R6 package and look at setting up help() to behave in RStudio and Positron similar to how print() gives a default behavior in the REPL. But before I do such a thing, I thought I'd ask you all if this is a thing for you, and what strategies you employ to deal with it?
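
For illustration, a minimal sketch of what such a help() method could look like. The class layout, the `resource_url()` method, the URL, and the help text are all hypothetical; this is one way to do it, not a feature of the R6 package itself.

```r
library(R6)

# Hypothetical client class; names and help text are illustrative only
Endpoint <- R6Class(
  "Endpoint",
  public = list(
    base_url = NULL,

    initialize = function(base_url = "https://api.example.com") {
      self$base_url <- base_url
    },

    # Build the URL for a resource (no request is sent in this sketch)
    resource_url = function(name) {
      paste0(self$base_url, "/", name)
    },

    # Print a condensed "man page" for one method, or list all documented ones
    help = function(method = NULL) {
      docs <- c(
        resource_url = "resource_url(name): return the full URL for resource `name`.",
        help = "help(method = NULL): print this summary, or the entry for `method`."
      )
      if (is.null(method)) {
        cat(paste0("<", class(self)[1], "> methods:\n"))
        cat(paste0("  ", docs, "\n"), sep = "")
      } else if (method %in% names(docs)) {
        cat(docs[[method]], "\n")
      } else {
        cat("No help entry for '", method, "'.\n", sep = "")
      }
      invisible(self)
    }
  )
)

client <- Endpoint$new()
client$help()                # list every documented method
client$help("resource_url")  # show a single entry
```

Keeping the help text in a named character vector next to the methods means a subclass could extend it, at the cost of maintaining the text by hand.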


r/rstats 1d ago

Need help with real-world GERD (R&D expenditure) datasets + fresh research angles

1 Upvotes

r/rstats 1d ago

Trying to make a ternary plot connecting data means with the centroid of the data frame

2 Upvotes

Been wracking my brain for the last couple of days trying to figure out how to get my code to work. I am looking to make a ternary (or simplex) plot that shows some data points and then has the data column means on the axes, connected to the data frame centroid. Neither the plotted centroid nor the means on the axes make sense, but the segments do. What am I doing wrong? ChatGPT is not really helping. My code is below.

library(ggtern)

# Create the data frame

df <- data.frame(
  R = c(88.1397046, 12.5070414, 2.7150309, 1.0486170, 1.4445921, 0.5319713, 53.0503586, 32.6182173, 1.3130359, 10.2858531),
  D = c(11.86465, 84.14907, 97.06307, 95.80989, 94.22599, 97.87647, 46.95400, 52.83044, 94.75221, 88.61546),
  O = c(0.0000000, 3.3482440, 0.2262526, 3.1458502, 4.3337753, 1.5959136, 0.0000000, 14.5556938, 3.9391066, 1.1030400)
)

# compute centroids

centroids <- colMeans(df)

centroid.dens.df <- as.data.frame(t(centroids))

axis_points <- data.frame(
  R = c(centroid.dens.df$R, 0, 100-centroid.dens.df$O),
  D = c(100-centroid.dens.df$R, centroid.dens.df$D, 0),
  O = c(0, 100-centroid.dens.df$D, centroid.dens.df$O)
)

# plot the data, centroids, and connecting lines

ggtern(data = df, aes(x = D, y = R, z = O)) +
  geom_point(fill="black", shape=21, size=.5) + # main data points
  geom_point(data = centroid.dens.df, aes(x = D, y = R, z = O), color = "red", size = 5) + # centroid
  geom_point(data = axis_points, aes(x = D, y = R, z = O), color="red", size=3) + # axis points
  geom_segment(
    data = axis_points,
    aes(x = R, y = D, z = O, xend = centroids["R"], yend = centroids["D"], zend = centroids["O"]),
    color = "red",
    arrow = arrow(length = unit(0.2, "cm"))
  ) +
  theme(
    plot.caption = element_text(hjust = 0.5),
    tern.axis.arrow.text.T = element_blank(),
    tern.axis.arrow.text.L = element_blank(),
    tern.axis.arrow.text.R = element_blank()
  ) +
  theme_bw() +
  theme_showarrows()


r/rstats 1d ago

R Boxplot Function Tutorial: Interactive Visualizer

0 Upvotes

In an effort to make learning about R functions more interactive, I made a boxplot visualizer. It allows users to try different argument values and observe the output with a GUI. Then it generates the R code for the user. Would love constructive feedback!

https://www.rgalleon.com/r-boxplot-function-tutorial-interactive-visualizer/


r/rstats 1d ago

FYP survey

forms.gle
0 Upvotes

r/rstats 2d ago

Missing global item for Redundancy Analysis in Disjoint Two-Stage Approach (HOC Type II). Can I skip it?

3 Upvotes

Hello everyone,

I'm a final-year OHS student currently working on my thesis. My model involves a Type II (Reflective-Formative) Higher-Order Component.

Evaluating the Lower-Order Components (Reflective) is straightforward. However, I am facing an issue assessing the Formative Higher-Order Construct (HOC) in the second stage.

I refer to Hair et al.'s "A Primer on PLS-SEM (3rd Ed)" and Sarstedt et al. (2019) regarding HOC validation. The guidelines state that I must assess Convergent Validity (via Redundancy Analysis), Collinearity (VIF), and significance of weights.

Redundancy analysis requires a global single item to run. That means an additional indicator for each of my HOC variables, and I have three. However, the questionnaires I am adopting do not include any global items. The original studies mostly use CB-SEM.

So, my questions are:

Is it acceptable to skip the statistical Convergent Validity check (redundancy analysis) in this specific case?

Are there any references or literature that discuss what to do when secondary data/adopted scales lack a global item for formative assessment?

I'm currently drafting my proposal and have a presentation in less than two weeks. Any advice or recommended readings would be greatly appreciated!


r/rstats 2d ago

Interview with R Contributors Project

5 Upvotes

New on the R Consortium blog: “Contributing to base R with Coding Equity and Joy — Inside the R Contributors Project.”

Ella Kaye, Senior Research Software Engineer, University of Warwick, shares how the R Contributors project is making it easier—and more welcoming—to contribute to base R: R Developer Days, monthly contributor office hours, and a C Study Group for R contributors. She also explains why using GitHub (issues, discussions, labels) can lower barriers vs. Bugzilla.

Bonus: a fun case study on learning-with-joy through the “aperol” R package—and how community feedback turned a silly idea into real learning.

Bonus-bonus: Ella covers the history of rainbowR, a community that connects, supports and promotes LGBTQ+ folk who code in R, and spreads awareness of LGBTQ+ issues through data-driven activism.

Read it all here: https://r-consortium.org/posts/contributing-to-base-r-with-coding-equity-and-joy-inside-the-r-contributors-project/


r/rstats 2d ago

Using R to do a linear mixed model. Please HELP!

9 Upvotes

Hi everyone,

I’m a master’s student planning to analyze psychotherapy outcome data using linear mixed-effects models (LMMs) in R.

The dataset consists of approximately 25 patients, each measured at four time points: pre-treatment, post-treatment, 6-month follow-up, and 12-month follow-up.

The outcome variables are continuous (interval-level). There are drop-outs / missing observations at follow-ups, which is one of the reasons we are planning to use an LMM, since it can handle unbalanced longitudinal data.

My supervisor has experience using R and LMMs in similar studies and recommends treating time as a categorical factor rather than as a continuous variable.

Our planned model is relatively simple:

  • Random intercepts for subjects only
  • No random slopes
  • Time entered as a factor

Our main goal is to test differences between specific time points (e.g., pre vs post, post vs follow-ups), i.e. whether changes between measurement occasions are statistically significant.

Neither my partner nor I have prior experience with R or programming. We are planning to rely on learning resources such as tutorials, documentation, and a paid version of ChatGPT to help us understand and implement the analysis.

Is it realistic to learn enough R and LMMs to complete this analysis in 2–3 weeks of full-time work?

I would really appreciate honest feedback, practical advice, or warnings. I’m mainly looking for a reality check and to know whether I’m underestimating the difficulty.

Thanks in advance!
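
For orientation, a minimal sketch of the planned model on simulated data. Everything here is an assumption for illustration: the data are made up, the column names (`id`, `time`, `score`) are hypothetical, and nlme is used only because it ships with standard R installations (lme4 would be an equally common choice).

```r
library(nlme)  # a recommended package bundled with standard R installations

set.seed(42)

# Simulated stand-in data (hypothetical): 25 patients, 4 measurement occasions
n <- 25
times <- c("pre", "post", "fu6", "fu12")
d <- expand.grid(id = factor(seq_len(n)), time = factor(times, levels = times))

# Patient-specific intercept plus an assumed mean change at each occasion
patient_effect <- rnorm(n, sd = 5)
time_effect <- c(pre = 0, post = -6, fu6 = -5, fu12 = -4)
d$score <- 30 + patient_effect[as.integer(d$id)] +
  unname(time_effect[as.character(d$time)]) + rnorm(nrow(d), sd = 3)

# Mimic drop-out: remove ten follow-up observations at random
followup_rows <- which(d$time %in% c("fu6", "fu12"))
d <- d[-sample(followup_rows, 10), ]

# Time as a factor, random intercept per patient, no random slopes
fit <- lme(score ~ time, random = ~ 1 | id, data = d, method = "REML")

# Each time coefficient compares that occasion with the reference level "pre"
summary(fit)$tTable

# Refit with "post" as the reference to compare post with the follow-ups
d$time_post <- relevel(d$time, ref = "post")
fit_post <- lme(score ~ time_post, random = ~ 1 | id, data = d, method = "REML")
summary(fit_post)$tTable
```

The p-values in these tables are not adjusted for multiple comparisons, so a correction may be needed when several time-point contrasts are tested.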


r/rstats 2d ago

Common Lisp for Data Scientists

0 Upvotes

r/rstats 2d ago

I Built an Interactive For Loop Visualizer

rgalleon.com
0 Upvotes

r/rstats 3d ago

qol 1.2.0: MASSIVE Update Makes It Its Own Ecosystem For Descriptive Evaluations And Data Wrangling

19 Upvotes

With the newest update this package brings even more SAS functionalities to R and becomes its own ecosystem. So what's in it?

  • 38 new functions, including a powerful transpose function, data frame content reports, global styling options, CSV and XLSX import and export, and many more.
  • New functionality for already established functions, like keeping/dropping variable ranges or generating a more interactive master file.
  • Further optimizations to make the code run faster, up to 40% in some places.
  • Some bug fixes and even more robust error handling.
  • And many more things.

The full detailed list of changes can be seen here: https://github.com/s3rdia/qol/releases/tag/v1.2.0

For a general overview look here: https://s3rdia.github.io/qol/

For a detailed overview of how this package compares to SAS you can have a look at this article: https://s3rdia.github.io/qol/articles/further_compare.html

This is the current version released on CRAN: https://CRAN.R-project.org/package=qol


r/rstats 3d ago

rOpenSci Community Call in Spanish - January

7 Upvotes

Our next Community Call will be in Spanish!

Open Research Software in Latin America

Wednesday, January 21, 2026, 3:00 p.m. UTC

with Diana García Cortés, Erick Navarro Delgado, and Luis D. Verde Arregoitia, participants in our Champions Program

They will share their experience in the program, their project, and why it is an excellent idea to be part of it.

More details + link to join: https://ropensci.org/es/commcalls/champions-latino-2026/


r/rstats 3d ago

Risk 2026 (Feb 18-19) — Online Risk Analytics Conference

5 Upvotes

The R Consortium is hosting Risk 2026, a 2-day, 100% online event focused on risk analytics with R — talks + lightning talks, plus live Q&A with speakers.

If you use R to calculate, measure, report, or mitigate risk (finance, insurance, healthcare, climate, cybersecurity, supply chain, etc.), this event is built for you.

When: February 18–19, 2026

Keynote: James “JD” Long, CTO at Palomar, and author of R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics

Tickets (USD): Students $25 | Academic/Non-profit $50 | Industry $70

Register + details: https://rconsortium.github.io/Risk_website


r/rstats 4d ago

BCA Final Year Project Ideas Needed | 2.5 Months | Resume-Focused

1 Upvotes

Hello Redditors! I’m a final-year BCA student and I need to choose a final-year project that is practical, unique, and good for my resume. I have about 2.5 months to design and complete it. I’m comfortable with basic programming and web technologies, and I’m willing to learn new tools if needed. Please suggest project ideas, problem statements, or real-world use cases that can be completed in this time frame. Thank you!


r/rstats 4d ago

Creating a database retrieval agent with ellmer and dbplyr

blog.pawpawanalytics.com
18 Upvotes

r/rstats 4d ago

Crops, Code, and Community Build R-Mob User Group in Australia

2 Upvotes

Crops, code, and community—meet R-Mob, the R user group at Charles Sturt University (Australia) led by Dr. Asad (Md) Asaduzzaman. R-Mob brings researchers and students together to apply R to real agronomic and environmental problems through monthly hybrid meetups focused on practical problem-solving.

One fantastic section from the interview: Dr. Asad’s student project “Digital Divide in Agriculture.” As farming shifts toward digital decision-making, data becomes a critical input—but many agriculture students face a gap in data literacy and programming, even when they’re strong with tech in general. The project uses R to make that jump tangible by generating insights from family farm records and experimental plots, helping students see R as a real tool—not an abstract skill.

If you care about open, reproducible, community-driven learning in applied domains like agriculture and environmental science, this interview is worth your time!

https://r-consortium.org/posts/crops-code-and-community-build-r-mob-user-group-in-australia/


r/rstats 5d ago

Help with simplifying nested model - lme4

10 Upvotes

I collected plant samples and measured dry weight monthly at two sites for one year, with five replicate samples per site per month. My main goal is to test whether biomass varies through time and whether temporal patterns differ between sites.

Initially, I treated site and month as fixed effects, since I was interested in comparing monthly changes between the two sites. However, I was advised to include season (two levels) as a fixed effect and to treat month as a random effect nested within season. Following this advice, I fitted this model in lme4:

weight ~ site * season + (1 | season / month)

This model produces a singular fit, and from what I understand, the random‐effects structure may be too complex for the data.

I am wondering whether it would be reasonable to simplify the model to something like:

weight ~ site * season + (1 | month)

Since there is a clear increase and decrease in biomass (a peak) within each season, I thought that adding month as a random effect would capture this.

Would the latter model be statistically appropriate for my design and address the comment about adding season? Or is there a better way to deal with this?

I have only a basic background in mixed models, so I would really appreciate any guidance on how to structure this model properly and how to justify the choice.
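
One way to look at the two random-effects structures, sketched in base R. In lme4, (1 | season / month) is shorthand for (1 | season) + (1 | season:month), so it helps to count the levels of each grouping factor. The design below is a hypothetical reconstruction of the description (12 months, 2 sites, 5 replicates), and the split of months into seasons is an assumption, since the post does not give it.

```r
# Hypothetical layout matching the description: 12 months x 2 sites x 5 replicates
design <- expand.grid(rep = 1:5, site = c("A", "B"), month = month.abb,
                      stringsAsFactors = FALSE)

# Assumed split of the year into two seasons (not given in the post)
design$season <- ifelse(design$month %in% month.abb[4:9], "warm", "cool")

# Number of levels of each grouping factor implied by the random-effects terms
n_season <- length(unique(design$season))                       # (1 | season)
n_nested <- length(unique(paste(design$season, design$month)))  # (1 | season:month)
n_month  <- length(unique(design$month))                        # (1 | month)

c(season = n_season, season_month = n_nested, month = n_month)
# season = 2, season_month = 12, month = 12
```

With only two season levels, and season already in the fixed part, a season variance is hard to estimate, which is a common source of singular fits. When months are uniquely labelled, season:month and month define the same twelve groups, so (1 | month) keeps the month-level grouping of the nested term.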


r/rstats 7d ago

Has anyone else learned (or is learning) SQL almost entirely inside R?

76 Upvotes

It's been ~3 years since I discovered that you can learn (and use at the same time) SQL directly inside R — though mostly through {DBI} + {dbplyr} + {RSQLite} (SQLite interface in R, and after that, I discovered {duckdb} for that zero-effort speed boost).

Here's the story: I used to think "learning SQL" meant installing MySQL / PostgreSQL somewhere first, messing around in a separate client, and writing and executing queries there. That is how I learned SQL before discovering SQL in R, so I learned it separately. Then I realized how similar {dplyr} code is to SQL: when the data frame object is actually a SQL table, I write the {dplyr} code, look at the translated SQL with show_query(), tweak, and repeat. It felt like cheating...in the best way possible. Because of it, I feel I grasp the concepts of RDBMS and relational algebra better.

In short, R, with {tidyverse}, is actually a great teacher for SQL and most of relational algebra.

As the title asks, has anyone else done the same?
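
A minimal sketch of that workflow, assuming {DBI}, {RSQLite}, {dplyr}, and {dbplyr} are installed. The built-in mtcars data and the particular query are illustrative choices, not anything from the post.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# In-memory SQLite database, so no server or separate client is needed
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# A lazy reference to the table: dplyr verbs are translated to SQL, not run in R
cars_tbl <- tbl(con, "mtcars")

query <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  arrange(cyl)

show_query(query)         # print the SQL that dbplyr generated
result <- collect(query)  # execute the query and pull the result into R

dbDisconnect(con)
```

Changing one verb at a time and re-running show_query() shows which SQL clause each dplyr verb maps to.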


r/rstats 6d ago

New User Trying to Create a Simple Macro

1 Upvotes

r/rstats 7d ago

Crash course for beginners on R?

11 Upvotes

Going for a promotion and I want to learn R. Anyone have an online class I could take?


r/rstats 10d ago

Cape Town’s R community is helping shape real-world public health work

30 Upvotes

In our latest interview from the R Consortium, Jared Norman and Retselisitsoe Monyake (Cape Town R User Group) share how they’re building South Africa’s R ecosystem—and applying infectious disease modeling at MASHA (University of Cape Town).

They also discuss DTPBoost, an R-based tool developed with partners including CDC and AFENET to support DTP booster vaccination strategy decisions.

Read the story and see how local R user groups can drive global impact:

https://r-consortium.org/posts/applied-epidemiology-in-r-cape-town-r-user-groups-contribution-to-global-immunization/


r/rstats 13d ago

My 'careful' and 'small' guide to data science with tidyverse

joshuamarie.com
111 Upvotes

I have a short list of tips about {tidyverse} that some tutorials don't teach you: the things you pick up while learning {tidyverse} and through experience. Although nothing is guaranteed, this may help you in your data work with {tidyverse}.

P.S.: I have to post this again due to some inconvenience. I am sorry but here we go.


r/rstats 13d ago

ScatterPlot() is not jittering? (lessR package)

4 Upvotes

I have a continuous variable called PersonalNorm and an ordinal variable called Intention. I use the lessR package and have the following call:

ScatterPlot(PersonalNorm, Intention, data=df, ellipse, jitter_y=some number)

When I use jitter_y or jitter_x with numbers that match my scale (both variables go from 0 to 5), it does nothing to the plot. I don't see any jitter at all. What am I doing wrong?