This is a blog entry I will use to share things I have found useful but that do not really warrant a dedicated, full-length post.
Think of it as the notes-to-self section of my website.
Many of these commands can also be found on this website.
If we messed something up, we can try to revert the changes.
To recover single files or folders from another branch (in case we accidentally deleted a file or something like that), we can use:
git fetch --all
git checkout origin/master -- <your_file_path>
git add <your_file_path>
git commit -m "<your_file_name> updated"
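If the problem is a whole commit rather than a single file, a minimal sketch (assuming you know the hash of the offending commit) would be:
git log --oneline        # find the hash of the commit to undo
git revert <COMMIT_HASH> # adds a new commit that reverses it, keeping the history intact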
Arguably, one of the things that helps the most when coding within a team is using automated tests and/or quick style fixes on our code. A quick example: spotting and fixing trailing white spaces
manually
(who has time for that???). The tool to use in this case is pre-commit, which runs when you launch a git commit
command and, if one of your predefined hooks fails, either blocks the commit or fixes your files.
After installing pre-commit via pip install pre-commit
you can run
pre-commit sample-config > .pre-commit-config.yaml
to create a basic first configuration file, .pre-commit-config.yaml (the command only prints the sample configuration, so it needs to be redirected into the file). This YAML file consists of a couple of sources (repo) with corresponding versions (rev) and a series of hooks to be used (hooks, with a list of identifiers id).
The file looks like:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
Which you can modify as you see fit. For instance, you may want to run some hooks with specific arguments or explicitly exclude given files from a particular hook (note that exclude takes a single regular expression):
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: trailing-whitespace
      - id: check-added-large-files
        args: ['--maxkb=1024']
        exclude: "notebooks/some_jupyter_notebook.ipynb|some_data.csv"
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
        args: ['--line-length=100']
Now that you have a YAML file you want to work with, you can install the pre-commit hooks:
pre-commit install
Now, even though these hooks are meant to run automatically right before each commit, it is nice to be able to run them directly as a sanity check while you are fixing the issues they report:
pre-commit run --all-files
pre-commit run --files <ONE_FILE_NAME> # to run the hooks on just one specific file
It is very important that, if one of your files fails one of the hooks, you tell git to add the modified version that will now pass the check; otherwise you will never be able to actually commit and push (unless you skip the pre-commit verification via git commit -m <MESSAGE> --no-verify,
which is not recommended).
git add <MODIFIED_FILE_NAME>
git commit -m <MESSAGE>
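As a related aside, pre-commit also honours a SKIP environment variable (a comma-separated list of hook ids), which lets you bypass a single hook for one commit instead of disabling all of them with --no-verify:
SKIP=black git commit -m "<MESSAGE>" # runs every configured hook except black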
Very often we need clarity in the terminal about the most important information of our current workspace: knowing which username, project, branch and Python version you are using at all times becomes paramount.
For a long time I manually modified the .bashrc
file, but that was a rather rudimentary way of implementing what Starship does in a split second (once you know how to use it).
Here I'll only show the steps I use for Ubuntu,
but most of the installation steps are the same or easier (with brew --cask) on macOS.
To install starship we only need to run:
curl -sS https://starship.rs/install.sh | sh
and then add a single line at the end of ~/.bashrc
(you can open it with gedit ~/.bashrc
):
# Starship
eval "$(starship init bash)"
To install the fonts, I found it easier to follow these steps:
Go to the location where you want the nerd-fonts
repo to be stored and clone it:
git clone --depth 1 https://github.com/ryanoasis/nerd-fonts.git # warning: takes a while
Install a particular font. For instance, for Fira Code NF
(which has most of the icons you may need), use:
cd nerd-fonts/
./install.sh FiraCode
Once we have installed the font, we need to configure the main terminal to use it.
To do so, open the terminal (Ctrl+Alt+T), go to Preferences, select your profile, tick Custom font and choose FiraCode Nerd Font.
I am a VSCode user, so to make things work there you may also need to open the settings (Ctrl + ,), search for
Terminal > Integrated > Font Family
and set it to
'FiraCode Nerd Font', monospace
. This makes use of Fira Code by Nerd Fonts and falls back to monospace if things break down. Now you can choose a template from this gallery. For example, to install the gruvbox-rainbow preset, just run
starship preset gruvbox-rainbow -o ~/.config/starship.toml
To check out in detail the steps to install it, or to modify the .toml
used to produce the template, simply click on the image.
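If you would rather keep the configuration somewhere else (for instance in a dotfiles repository), starship also reads the STARSHIP_CONFIG environment variable; the path below is just a hypothetical example:
export STARSHIP_CONFIG=~/dotfiles/starship.toml # add to ~/.bashrc so it persists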
Sometimes when importing data into pandas or reading from a .csv
file, one may encounter issues with the imported data.
To understand the reason behind such an issue, you sometimes need to look at the raw data directly before you can fix it.
Here is one command that does such a thing:
filename="path_to_csv_file.csv"
cat -n "$filename" | head -n 275211 | tail -n 3 # show lines 275209-275211 together with their line numbers
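Roughly the same thing can be done with sed (assuming the line of interest is 275211 and you want the two lines before it as context):
sed -n '275209,275211p' "$filename" # print lines 275209 to 275211 only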
When using Jupyter notebooks, IPython caches modules and skips imports of anything that is already loaded in memory. This is of course counterproductive when you keep the heavy code in a utils file so that the notebook itself stays clean for presentation purposes: saving the Python file you are working on and rerunning the cell containing the import will not be enough.
However, you can force the reload of functions f, g, h
from module X
with:
from importlib import reload
import X
reload(X)  # reload the module first...
from X import (f, g, h)  # ...then re-bind the names so they point to the fresh definitions
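An alternative that avoids calling reload by hand is IPython's autoreload extension, which re-imports modified modules before every cell execution:
%load_ext autoreload
%autoreload 2 # reload all modules (except those explicitly excluded) before running each cell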
In SQL, window functions let you compute values over subgroups of rows (a partition) while still returning one value per row. Pandas is slightly different in that these operations stem from the .groupby method. This website explains quite clearly how to use SQL-like window functions in pandas.
df['n_rows_same_column_value'] = df.groupby('column_to_check')['other_column_to_count'].transform('count')
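Other SQL window functions map in a similar way onto groupby plus cumulative or ranking helpers. A small sketch with the same (hypothetical) column names:
# ROW_NUMBER() OVER (PARTITION BY column_to_check ORDER BY other_column_to_count)
df['row_number'] = (
    df.sort_values('other_column_to_count')
    .groupby('column_to_check')
    .cumcount() + 1
)
# SUM(other_column_to_count) OVER (PARTITION BY column_to_check) as a running total
df['running_total'] = df.groupby('column_to_check')['other_column_to_count'].cumsum()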
from pyspark.sql import functions as F
from pyspark.sql.window import Window


def lagged_window(
    window_partition_columns: list,
    time_column: str,
    window_size: float = 7,
    lagging_days: float = 0,
):
    """Create a sliding window of constant width based on a single time column.

    Args:
        window_partition_columns (list of str): Names of the columns to be used as partition.
            These columns "reset" the windowing function.
        time_column (str): Column with the date or timestamp used to order the rows for the windowing function.
        window_size (int or float): Number of days or portion of a day to be used as the window size.
            This will be rounded to integer seconds. Defaults to 7.
        lagging_days (int or float): Number of days to shift the windowing function. Defaults to 0.

    Returns:
        pyspark.sql.window.Window
    """
    days_to_s = lambda x: int(24 * 60 * 60 * x)  # convert days to seconds
    win = (
        Window.partitionBy(window_partition_columns)
        .orderBy(F.col(time_column).cast("timestamp").cast("long"))
        .rangeBetween(-days_to_s(window_size + lagging_days), -days_to_s(lagging_days))
    )
    return win


# Example: df is an existing Spark dataframe with a numeric column 'col' and a timestamp column 'time_column'
part_cols = ['shop', 'city', 'customer', 'device']
df_with_moving_average = (
    df
    .withColumn('col_1D_MA', F.avg('col').over(lagged_window(part_cols, 'time_column', window_size=1)))
    .withColumn('col_30m_MA', F.avg('col').over(lagged_window(part_cols, 'time_column', window_size=(.5 / 24))))
)
def flatten_df(nested_df):
"""Flatten a spark dataframe
Args:
nested_df (pyspark.sql.dataframe.DataFrame): dataframe with possible struct-type columns
Returns:
pyspark.sql.dataframe.DataFrame with a column per 'key' of the nested_df struct-type columns
"""
flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
flat_df = nested_df.select(flat_cols +
[F.col(nc+'.'+c).alias(nc+'_'+c)
for nc in nested_cols
for c in nested_df.select(nc+'.*').columns])
return flat_df
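A quick toy example of the flattening, assuming an active Spark session (the schema is made up):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
nested = spark.createDataFrame([Row(id=1, address=Row(city="Paris", zip="75001"))])
flatten_df(nested).show()  # columns become: id, address_city, address_zip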
def normalise_units(column: str):
    """Normalise the units of a column by removing well-known metric prefixes.

    Args:
        column (str): name of the column with the values and units mixed

    Returns:
        pyspark.sql.column.Column: column with the value for the measure without prefixes.

    Example:
        285.3km -> 285300
        500.1Mm -> 500100000
    """
    units_col = F.split(column, "[0-9.]+")[1]
    multiplier = (
        F.when(units_col.rlike(r"^k\w"), F.lit(1e3))
        .when(units_col.rlike(r"^M\w"), F.lit(1e6))
        .when(units_col.rlike(r"^G\w"), F.lit(1e9))
        .when(units_col.rlike(r"^c\w"), F.lit(1e-2))
        .when(units_col.rlike(r"^m\w"), F.lit(1e-3))
        .when(units_col.rlike(r"^n\w"), F.lit(1e-9))
        .otherwise(F.lit(1))
    )
    # cast the numeric part to float and scale it by the prefix multiplier
    return multiplier * F.split(column, "[a-zA-Z]+")[0].cast("float")
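And a small usage sketch for it, again with made-up data and an active Spark session:
measurements = spark.createDataFrame([("285.3km",), ("500.1Mm",)], ["distance"])
measurements.withColumn("distance_si", normalise_units("distance")).show()
# distance_si should come out as roughly 285300.0 and 5.001E8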
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def gantt_chart(
    df: pd.DataFrame,
    tasks_column: str,
    initial_time_column: str,
    end_time_column: str,
    color_dict: dict = None,
) -> plt.Figure:
    """Gantt plot that shows where each individual task starts and ends.

    Each task can have multiple time intervals.

    Args:
        df (pd.DataFrame): pandas dataframe with one row per task time interval
        tasks_column (str): name of the column containing the (categorical) task names
        initial_time_column (str): start time of the task interval
        end_time_column (str): end time of the task interval
        color_dict (dict, optional): colors assigned to each task. Defaults to None.

    Returns:
        plt.Figure: Gantt plot
    """
    if color_dict:
        cd = color_dict
    else:
        # Assign a random hex color to each task
        cd = {}
        for task in df[tasks_column].unique():
            cd[task] = "#" + "".join(np.random.choice(list("0123456789ABCDEF"), size=6))
    fig, ax = plt.subplots(1, figsize=(25, 6))
    # Plot one horizontal bar per interval
    ax.barh(
        df[tasks_column],
        df[end_time_column] - df[initial_time_column],
        left=df[initial_time_column],
        color=df[tasks_column].apply(lambda x: cd[x]),
        height=0.9,
    )
    # Axes formatting
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    ax.xaxis.set_major_locator(mdates.MonthLocator(bymonthday=1, interval=3))
    fig.autofmt_xdate(which="both")
    plt.show()
    return fig
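For reference, this is how I would call it on a toy dataframe (data and column names are invented):
tasks = pd.DataFrame(
    {
        "task": ["A", "A", "B"],
        "start": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-02-01"]),
        "end": pd.to_datetime(["2023-02-01", "2023-04-15", "2023-03-15"]),
    }
)
fig = gantt_chart(tasks, tasks_column="task", initial_time_column="start", end_time_column="end")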
Hopefully a time saver when we just want to quickly see some characteristics of a time series.
import matplotlib.pyplot as plt
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose
def plot_time_series_decomposition(
df: pd.DataFrame,
value_column: str,
time_column: str,
model: str = "multiplicative",
show_autocorrelation_plot: bool = True,
lags_for_acf: int = 365 * 2,
):
"""
Show the time series decomposition of a time series pandas dataframe
Args:
df (pd.DataFrame): pandas dataframe to try and extract seasonalities.
value_column (str): name of the column with the values to analyse
time_column (str): name of the column used to extract seasonalities and trends.
model (str): one of 'additive' or 'multiplicative'. Default is 'multiplicative'
show_autocorrelation_plot (bool): True if an autocorrelation plot wants to be displayed. Default is True
lags_for_acf (int): number of lags to consider in the autocorrelation plot. Default is 365 * 2
Returns:
None
"""
data_orig = df[[time_column, value_column]].copy()
data_orig.set_index(pd.DatetimeIndex(df[time_column], freq="D"), inplace=True)
analysis = data_orig[value_column]
decompose_result_mult = seasonal_decompose(analysis, model=model)
trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid
fig = make_subplots(
rows=3,
cols=1,
vertical_spacing=0.05,
horizontal_spacing=0.15,
row_heights=[0.6, 0.2, 0.2],
subplot_titles=(["Original set and Trend", "Seasonal", "Residual"]),
shared_xaxes=True,
specs=[[{}], [{}], [{"secondary_y": True}]],
)
# Original
fig.add_trace(
go.Scattergl(
x=df[time_column],
y=df[value_column],
name="Original",
line_color="rgba(30,125,245,.5)",
mode="markers",
),
col=1,
row=1,
)
# Trend
fig.add_trace(
go.Scattergl(
x=analysis.index,
y=trend,
name="Trend",
line_color="black",
mode="lines",
),
col=1,
row=1,
)
# Seasonal
fig.add_trace(
go.Scattergl(
x=analysis.index,
y=seasonal,
name="Seasonal",
line_color="black",
mode="lines",
),
col=1,
row=2,
)
# Residual
fig.add_trace(
go.Scattergl(
x=analysis.index,
y=residual,
name="Residual",
line_color="black",
mode="lines",
),
col=1,
row=3,
secondary_y=False,
)
fig.update_layout(height=800, width=1400)
if show_autocorrelation_plot:
my_dpi = 96
fig_plt, ax = plt.subplots(figsize=(1400 / my_dpi, 200 / my_dpi), dpi=my_dpi)
plot_acf(analysis, lags=lags_for_acf, ax=ax)
fig_plt.show()
fig.show()
return None
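A minimal example with synthetic daily data (column names and values are arbitrary):
import numpy as np

dates = pd.date_range("2020-01-01", periods=3 * 365, freq="D")
demo = pd.DataFrame(
    {
        "date": dates,
        "value": 10 + np.sin(2 * np.pi * dates.dayofyear / 365) + 0.1 * np.random.rand(len(dates)),
    }
)
plot_time_series_decomposition(demo, value_column="value", time_column="date", model="additive")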
def plot_pareto(
df: pd.DataFrame, categorical_column: str, value_column: str, threshold_on_value: float = None
) -> go.Figure:
"""Plot Pareto's curve and compare to the usual 80/20 principle
Args:
df (pd.DataFrame): dataframe with at least a categorical and a value column
categorical_column (str): name of the categorical column to consider.
value_column (str): name of the values column
threshold_on_value (float, optional): cutoff at which categorical entities will be grouped under a unique group.
If None, all elements are shown independently. Defaults to None.
Returns:
go.Figure: Pareto plot
"""
df["n_elements"] = 1
df["elem_group"] = df.apply(
lambda x: x[categorical_column] if x[value_column] > threshold_on_value else "below_threshold", axis=1
)
df_grouped = df.groupby("elem_group").agg({value_column: "sum", "n_elements": "sum"}).reset_index()
df_grouped = df_grouped.sort_values(by=value_column, ascending=False)
is_below_thr = df_grouped["elem_group"] == "below_threshold"
df_grouped = pd.concat([df_grouped[~is_below_thr], df_grouped[is_below_thr]])
df_grouped["cumulative_pct_val"] = 100 * df_grouped[value_column].cumsum() / df_grouped[value_column].sum()
df_grouped["cat_order_per_value"] = df_grouped["n_elements"].cumsum()
df_grouped["cum_pct_categories"] = 100 * df_grouped["cat_order_per_value"] / df_grouped["n_elements"].sum()
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
go.Scatter(
x=df_grouped["cum_pct_categories"],
y=df_grouped["cumulative_pct_val"],
name=f"cumulative % of {value_column}",
mode="lines+markers",
marker_color="rgba(255,0,0,.5)",
customdata=df_grouped[["elem_group", "n_elements", value_column]],
hovertemplate="<br>".join(
[
"<b>%{y:0.2f}</b>",
"Element: <b>%{customdata[0]}</b>",
"Value: <b>%{customdata[2]:,}</b>",
"Cumulative % categories: <b>%{x}</b>",
"n_elements: <b>%{customdata[1]}</b>",
]
),
),
secondary_y=True,
)
fig.add_trace(
go.Bar(
x=df_grouped["cum_pct_categories"],
y=df_grouped[value_column],
name=value_column,
marker=dict(color="rgba(0,0,255,.5)"),
)
)
fig.add_hline(y=80, line=dict(color="rgba(0,0,0,.3)", dash="dash"), secondary_y=True)
fig.add_vline(x=20, line=dict(color="rgba(0,0,0,.3)", dash="dash"))
fig.update_layout(
hovermode="x unified", title=f"Pareto chart of {value_column} per {categorical_column}", height=600, width=1600
)
return fig
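Finally, a made-up example with random revenues per customer, just to show the call signature:
import numpy as np

revenue_df = pd.DataFrame(
    {
        "customer": [f"customer_{i}" for i in range(200)],
        "revenue": np.random.pareto(1.5, size=200) * 1000,
    }
)
plot_pareto(revenue_df, categorical_column="customer", value_column="revenue", threshold_on_value=100).show()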
There's a plethora of tips, guidelines and tutorials about MLOps, Data Science, Visualizations and Analysis out there. Some of my personal favourites are:
* MadeWithML by Goku Mohandas
* Khuyen Tran's tips
* Geographic Data Science by Sergio Rey, Dani Arribas and Levi Wolf
* Python Data Science Handbook by Jake VanderPlas