In this notebook I want to play with Jupyter and show, step by step, how to create a MEVN application from a notebook. Normally I would do this in a Linux terminal with a text editor, but since a notebook can combine code, explanation and shell commands, I want to tell the whole story here, which will hopefully be of help to people experimenting with full-stack development. I will create the application, use some simple Linux tricks, and use Selenium to test the application.
Create a lagged column in a PySpark dataframe:
from pyspark.sql.functions import monotonically_increasing_id, lag
from pyspark.sql.window import Window

# Add an ID column to be used by the window function
df = df.withColumn('id', monotonically_increasing_id())

# Set the window
w = Window.orderBy("id")

# Create the lagged value
value_lag = lag('value').over(w)

# Add the lagged values to a new column
df = df.withColumn('prev_value', value_lag)
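To see the effect end to end, here is a small self-contained sketch; the sample data and the 'value' column are placeholders I made up to match the snippet above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lag
from pyspark.sql.window import Window

spark = SparkSession.builder.appName('lag-example').getOrCreate()

# Hypothetical sample data with a single 'value' column
df = spark.createDataFrame([(10,), (20,), (30,)], ['value'])

df = df.withColumn('id', monotonically_increasing_id())
w = Window.orderBy('id')
df = df.withColumn('prev_value', lag('value').over(w))

# prev_value is null for the first row and the previous 'value' after that.
# Note: a window ordered without partitionBy moves all rows to a single
# partition, so this pattern is only practical for small dataframes.
df.show()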
A simple trick to select columns from a dataframe:
# Create the filter condition
condition = lambda col: col not in DESIRED_COLUMNS

# Filter the dataframe
filtered_df = df.drop(*filter(condition, df.columns))
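For example, with a hypothetical dataframe and DESIRED_COLUMNS list (both made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe and column list, just for illustration
df = spark.createDataFrame([('Alice', 34, 'Paris')], ['name', 'age', 'city'])
DESIRED_COLUMNS = ['name', 'age']

condition = lambda col: col not in DESIRED_COLUMNS
filtered_df = df.drop(*filter(condition, df.columns))
filtered_df.show()  # only 'name' and 'age' remain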
In my previous posts I have already shown simple examples of using MapReduce and Spark with PySpark. A missing piece in the move from MapReduce to Spark is the use of Pig scripts. This post shows an example of how to use a Pig script.
This is a short explanation of how to set up a Truffle decentralized app using Docker containers.
Last time I started to experiment with Hadoop and simple scripts using MapReduce and Pig on a Cloudera Docker container. Now let's start playing with Spark, since it is the go-to framework for machine learning on Hadoop.
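To give a flavour of PySpark before diving in, a minimal word count looks something like this; the input path is a placeholder, not something from the post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wordcount').getOrCreate()

# 'input.txt' is a hypothetical input file, used only for illustration
lines = spark.sparkContext.textFile('input.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)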
This post describes my first experiment with the Cloudera environment: applying basic MapReduce to a simple dataset.
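As a reminder of the shape such an experiment takes, a Hadoop Streaming word count in Python looks roughly like the mapper/reducer pair below; the file names and the word-count task are my own generic sketch, not necessarily the exact example from the post:

# mapper.py -- emits a "word<TAB>1" pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

# reducer.py -- Hadoop sorts the mapper output by key, so identical words
# arrive consecutively and can be summed with a running counter
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))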
Using the Docker HDP image from Hortonworks, it is easy to spin up a Hadoop environment on your machine.