Publishing notebook analysis 🍫

Practicing DatScy
Feb 10, 2023

I finally learned R and discovered R Markdown! I find R Markdown most useful for writing reports or documents that include your analysis.

The most common ways to create a written report with data analysis notebooks are:

1. Code analysis (Python, R, SQL) in a Jupyter notebook with Markdown notation, exported to a document through File > Download as (pdf, html, docx, etc.).

2. R Markdown (.rmd) scripts that can be transformed into other file formats (pdf, html, docx, odt, rtf) and are configured through a header written in a notation called YAML.

The second option differs from the first because .rmd files can be compiled with LaTeX format templates. Instead of the plain pdf document that option 1 produces, a template is applied to your analysis according to the specifications in the YAML header. The idea is the same as selecting a template in Word or Google Sheets before you enter your data, except that with R Markdown you can automatically try out many different templates on the same data report.

Most recently, report documents appear to be obsolete! The “Download as” button does NOT even exist anymore on most cloud platforms. Reporting appears to be slowly moving to functional applications (Apps) in HTML/JavaScript format. So, YES, it appears that everyone who analyzes data needs to learn HTML/JavaScript just to present their data analysis. Users like to see the data directly and manipulate it with published programs like Tableau or Power BI, or on whatever URL site you select (it could be a Google Sheet or a GitHub repository transformed into a GitHub page).

WARNING that some code snippets MIGHT not work 😢

I found R to be “buggy”. I spent a LOT of extra time looking for the correct function or syntax; Stack Overflow was a savior! Unfortunately, the syntax was also not robust: I worked on Kaggle for several days, and each day when I came back I could not run the same code that had worked the day before. The pipe (%>%), which is one of the best attributes of R, was unreliable and stopped working, so I had to recode everything with direct assignments (<-). In addition, the syntax of the classification functions was extremely difficult to use. Many blogs gave classification notation and examples that did not work; RandomForest could only be used for Regression despite the many different syntactical constructions that I tried. I happily managed to run two of the Perceptron Neural Network models correctly, but with the syntax reproducibility problems, hyperparameter testing was a long process. I was really surprised that R was so difficult to use for Machine Learning when the graphical and pre-processing packages appeared to work quite well. I hope this post helps those who need clear notation for some basic Machine Learning functions!
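
For readers who hit the same pipe problem: one common cause of the error could not find function "%>%" is that the package providing the pipe (magrittr, or dplyr, which re-exports it) is not attached in the current session. Whether that was the cause here is not known. Below is a minimal sketch of two package-free ways to write the same computation, assuming R 4.1 or later for the native pipe |>. The vector x is made-up example data.

```r
# Made-up example data
x <- c(4, 9, 16)

# Nested calls with direct assignment: works in any R version, no packages
nested <- round(mean(sqrt(x)), 1)

# Native pipe (R >= 4.1): same steps, read left to right, no packages
piped <- x |> sqrt() |> mean() |> round(1)

# Both forms compute the same value
print(identical(nested, piped))
```

If the magrittr pipe is preferred, calling library(dplyr) or library(magrittr) at the top of every new session makes %>% available again.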

R Classification

The full notebook is located at :

However, below I show some main code snippets that might help people:

Grouping DataFrame rows by similarity (I call it pd_dummies_similarity)

This code snippet takes an X matrix (df_num2) and a list of the unique words (unique_words) from one column of that matrix, and condenses the list to a smaller set of unique words. I was interested in condensing the number of unique words in the column bean_type (the eventual y vector); it had 8 unique words, and with this code I reduced it to 4. The code simply looks for exact partial matches within the list of unique_words; when two words are similar, the shortest word is used to represent both. Replace and mutate do the hard work of updating the DataFrame.
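
The matching idea described above can be sketched in base R. This is not the notebook's code (which uses replace and mutate); the word list, the data frame df, and the helper shortest_match are hypothetical stand-ins chosen for illustration.

```r
# Hypothetical stand-in for the unique values of a column such as bean_type
unique_words <- c("Criollo", "Criollo (Porcelana)", "Forastero",
                  "Forastero (Nacional)", "Trinitario", "Blend")

# For one word, return the shortest candidate that appears inside it.
# Every word contains itself, so there is always at least one hit.
shortest_match <- function(word, candidates) {
  is_hit <- vapply(candidates,
                   function(p) grepl(p, word, fixed = TRUE),
                   logical(1))
  hits <- candidates[is_hit]
  hits[which.min(nchar(hits))]
}

# Named lookup vector: original word -> condensed word
mapping <- vapply(unique_words, shortest_match, character(1),
                  candidates = unique_words)

# Apply the lookup to a hypothetical data frame column
df <- data.frame(
  bean_type = c("Criollo (Porcelana)", "Blend",
                "Forastero (Nacional)", "Criollo"),
  stringsAsFactors = FALSE
)
df$bean_type <- unname(mapping[df$bean_type])
print(df)
```

Plain substring matching can merge categories that only share a fragment by coincidence, so it is worth printing the mapping and checking it by eye before overwriting the column.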

Main pipeline (Normalize → Train-test split)

The mapping functions were very helpful for time-consuming computations, and the initial_split function was as straightforward as the train_test_split function in scikit-learn.
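
A minimal sketch of such a pipeline, assuming the rsample package is installed. The built-in iris data, the min_max helper, and the 80/20 proportion are illustrative assumptions, and base R lapply stands in for the mapping functions used in the notebook.

```r
library(rsample)  # provides initial_split(), training(), testing()

set.seed(42)      # make the random split reproducible
df <- iris        # stand-in data, not the chocolate data set

# Min-max normalization helper (hypothetical name)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Map the helper over every numeric column
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], min_max)

# 80/20 split, comparable to train_test_split(test_size = 0.2)
split <- initial_split(df, prop = 0.8)
train <- training(split)
test  <- testing(split)

print(dim(train))
print(dim(test))
```

This follows the order in the heading (normalize, then split). Normalizing after the split with statistics computed on the training rows only avoids leaking test information, at the cost of a few more lines.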

Neural network X and y syntax

I tried two neural network models (nnet and neuralnet). The neuralnet library was better than nnet because I could use more model evaluation functions with the resulting model. The neuralnet model allowed me to predict an estimate of y, show the prediction probability, and calculate accuracy.
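
As a hedged illustration of the two ways to pass X and y, here is a sketch with nnet on the built-in iris data. The data, the feature names, and size = 4 are assumptions made for the example, not the notebook's settings. neuralnet() also takes a formula as its first argument, so the formula built in the first notation can be reused there.

```r
library(nnet)  # recommended package, included in standard R installations

set.seed(1)
data <- iris   # stand-in data; Species is the response (y)
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# 1) String notation: build the formula as text, then convert it
f <- as.formula(paste("Species ~", paste(features, collapse = " + ")))
model_f <- nnet(f, data = data, size = 4, trace = FALSE)

# 2) Data frame notation: pass X and y separately.
#    class.ind() one-hot encodes the factor response.
X <- data[, features]
y <- class.ind(data$Species)
model_xy <- nnet(X, y, size = 4, softmax = TRUE, trace = FALSE)

# Predicted classes and training accuracy from the formula model
pred <- predict(model_f, newdata = data, type = "class")
accuracy <- mean(pred == data$Species)

# Class probabilities from the X/y model (one column per class)
probs <- predict(model_xy, X)

print(f)
print(accuracy)
```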

The two main notations are shown above (string and dataframe notation). I found that a good way to check that the notation was correct was to print(model) and verify the response (y) and the covariates/features; the data matrix includes both the X and y information.

I did not have much luck building a reliable neural network, but I think there must be other settings that I have to select because an accuracy of 0.046 is extremely low even for a bad model.

Random Forest syntax

The random forest model syntax was the same as the neural network syntax, but unfortunately the warning message said that I was performing Regression and not Classification. I tried more combinations of notations that are not listed below, and I still was unable to perform Classification. If anyone knows the correct syntax for classification I would love to know it.

Despite only being able to perform Regression, I rounded the estimated y output and was able to obtain a training accuracy of 0.88. This accuracy result seems reasonable because I received a training accuracy of about 0.5 for only 3 features; adding 4 additional categorical features should increase training accuracy, as it did.
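
On the open question of classification syntax: in the randomForest package the mode is chosen from the type of the response. A numeric y gives regression (with a warning when y has five or fewer unique values), and a factor y gives classification. Whether a numeric-coded y was the cause in the notebook is an assumption. A minimal sketch on the built-in iris data, with ntree = 100 chosen only to keep it fast:

```r
library(randomForest)

set.seed(7)
df <- iris  # stand-in data; Species is a factor with 3 levels

# Numeric-coded response: randomForest falls back to regression
df_num <- df
df_num$Species <- as.numeric(df_num$Species)
rf_reg <- suppressWarnings(
  randomForest(Species ~ ., data = df_num, ntree = 100)
)
print(rf_reg$type)

# Factor response: the same formula now performs classification
df$Species <- as.factor(df$Species)
rf_cls <- randomForest(Species ~ ., data = df, ntree = 100)
print(rf_cls$type)

# Training accuracy without any rounding step
train_acc <- mean(predict(rf_cls, df) == df$Species)
print(train_acc)
```

So converting the y column with as.factor() before fitting may be all that is needed.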

Knit to output html

Finally, I wanted to try “Knitting” a .rmd file to see the html output. Knitting a .rmd file means using the R Knit renderer to convert an R Markdown file to various outputs (html, pdf, docx, etc.).

To obtain the .rmd file, I downloaded the notebook (.ipynb) from Kaggle, opened it on my PC, and converted it to a markdown (.md) file using File > “Download as”. In a text editor, I replaced each ```R marker in the .md file with ```{r} and then saved the file again as a .rmd.
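
The find-and-replace step can also be scripted. A minimal base R sketch, assuming hypothetical file names; a tiny .md file is written first so the example is self-contained:

```r
# Hypothetical file names in a temporary folder
md_path  <- file.path(tempdir(), "notebook.md")
rmd_path <- file.path(tempdir(), "notebook.rmd")

# Tiny stand-in for the .md file exported from the notebook
writeLines(c("# Title", "```R", "x <- 1 + 1", "```", "Some text"), md_path)

# Rewrite only the opening fences; closing fences stay unchanged
lines <- readLines(md_path)
lines <- sub("^```R[[:space:]]*$", "```{r}", lines)
writeLines(lines, rmd_path)

print(readLines(rmd_path))
```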

YAML header of the rmd file.
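
For readers without the original header at hand, here is a hedged sketch of a minimal YAML header, written from R so it can be placed at the top of a .rmd file. The title, the file name, and the body lines are hypothetical; title and output are standard R Markdown header fields.

```r
# Minimal YAML header: delimited by lines of three dashes
header <- c(
  "---",
  'title: "Chocolate bean analysis"',  # hypothetical title
  "output: html_document",             # format produced when knitting
  "---"
)

# Hypothetical body of the .rmd file
body <- c("## Analysis", "", "```{r}", "summary(cars)", "```")

# Write the header followed by the body
rmd_path <- file.path(tempdir(), "report.rmd")
writeLines(c(header, "", body), rmd_path)

print(readLines(rmd_path))
```

Changing the output field (for example to pdf_document or word_document) is how a different output format is selected.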

I uploaded the .rmd file to the Kaggle notebook, copied the PATH into the code snippet, and ran the following code snippet.
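
A minimal sketch of such a render call, assuming the rmarkdown package and pandoc are available. The path is hypothetical (on Kaggle it would be the PATH copied from the uploaded file), and a tiny .rmd file is created first so the example is self-contained.

```r
# Hypothetical path; replace with the PATH of the uploaded .rmd file
rmd_path <- file.path(tempdir(), "report.rmd")
writeLines(c("---", 'title: "Demo"', "output: html_document", "---",
             "", "```{r}", "1 + 1", "```"), rmd_path)

# Rendering needs the rmarkdown package and a pandoc installation
can_knit <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_knit) {
  # Runs every {r} chunk and writes the html file next to tempdir()
  out <- rmarkdown::render(rmd_path,
                           output_format = "html_document",
                           output_dir = tempdir(),
                           quiet = TRUE)
  print(out)  # path of the generated html file
} else {
  message("rmarkdown or pandoc not available; skipping render")
}
```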

It took 10–15 minutes for the renderer to run the code between the ```{r} and ``` marks; the marks tell the interpreter to execute the code between the start and end mark. The output html document was fairly similar to the one obtained by exporting from a Jupyter notebook, but the font was softer and additional html could be added to the .rmd file.

Html file exported from a Jupyter notebook using File > Download as.
Html file created by R Knit render.

Talking about slightly different html output files sounds a little silly, but the concept of the R Knitter was interesting to try out. Advanced html and JavaScript could be used with .rmd files to create complex functional documents, like those made with Tableau and Power BI, that explain data analysis.

If you are curious about how to create a GitHub page from a GitHub repository, follow the directions in Reference 1. It was very easy to set up!

Happy Practicing! 👋 🍫

References

1. Explanation of how to set up a GitHub page: https://christianheilmann.com/2022/01/13/turning-a-github-page-into-a-progressive-web-app/

2. Explanation of how to Knit on Kaggle: https://www.kaggle.com/code/mutindafestus/how-to-knit-r-markdown-file-to-pdf-in-kaggle
