Example Analysis: Parallel Computing

This Jupyter notebook comes from a talk by Till Riedel at the "2. Smart Data und KI Tag". It shows how a data analysis can be parallelized.

How to speed up your data analysis at home or on a cluster

Parallelizing feature computation with Python, joblib, and dask

First we initialize our environment with a few packages. If errors occur, the missing packages should be installed from the command line with pip or conda.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import os
import time
from tqdm import tqdm


import sys 
os.environ["PATH"] += os.pathsep + sys.prefix+'/bin'
    
%matplotlib inline
%config InteractiveShell.ast_node_interactivity="last_expr_or_assign"
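
If packages are missing, a quick environment check can help. The following is a minimal sketch that maps import names to the pip package names (these differ for scikit-learn and dask-jobqueue) and prints an install command for anything missing:

import importlib.util

required = {
    "pandas": "pandas", "tqdm": "tqdm", "tsfresh": "tsfresh",
    "joblib": "joblib", "dask": "dask[distributed]",
    "dask_jobqueue": "dask-jobqueue", "sklearn": "scikit-learn",
}
# find_spec returns None if the module cannot be imported
missing = [pip_name for mod, pip_name in required.items()
           if importlib.util.find_spec(mod) is None]
if missing:
    print("pip install " + " ".join(missing))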

The following paths must be adjusted once the data has been downloaded (>800 MB compressed). The data can be downloaded from https://bwsyncandshare.kit.edu/s/NzrXCAnTHDWJZRk.

In [2]:
TRAIN_LABEL_PATH = "data/train_labels.csv"
TRAIN_PATH = "data/train/"
Out[2]:
'data/train/'
In [59]:
from IPython.display import Image
Image('images/data.PNG')
Out[59]:

The data comes from various wind turbines in China, on which 75 sensors were recorded every 10 minutes.

In [3]:
import pandas as pd
data= pd.read_csv("data/train/002/cbd192c9-5e59-3b3c-bae8-20f8ae9f2b36.csv")
Out[3]:
Wheel speed hub angle blade 1 angle blade 2 angle blade 3 angle pitch motor 1 current pitch motor 2 current Pitch motor 3 current overspeed sensor speed detection value 5 second yaw against wind average ... blade 3 inverter box temperature blade 1 super capacitor voltage blade 2 super capacitor voltage blade 3 super capacitor voltage drive 1 thyristor temperature Drive 2 thyristor temperature Drive 3 thyristor temperature Drive 1 output torque Drive 2 output torque Drive 3 output torque
0 14.63 154.01 0.24 0.31 0.22 12.48 13.58 14.00 14.91 2.6 ... 300 0 0 0 0 0 0 0 0 0
1 13.74 312.77 0.24 0.31 0.22 11.36 11.14 13.06 13.95 8.7 ... 300 0 0 0 0 0 0 0 0 0
2 13.55 73.76 0.24 0.31 0.22 11.74 11.90 14.64 13.81 5.4 ... 300 0 0 0 0 0 0 0 0 0
3 12.21 132.26 0.24 0.31 0.22 10.08 10.30 12.20 12.47 -7.1 ... 300 0 0 0 0 0 0 0 0 0
4 12.91 239.51 0.24 0.31 0.22 10.90 11.84 13.04 13.16 1.2 ... 300 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
445 11.37 93.24 0.24 0.31 0.22 10.10 8.84 11.28 11.62 -27.1 ... 300 0 0 0 0 0 0 0 0 0
446 14.37 194.51 0.24 0.31 0.22 12.20 13.46 14.64 14.65 1.6 ... 300 0 0 0 0 0 0 0 0 0
447 12.31 82.76 0.24 0.31 0.22 10.00 10.10 12.02 12.60 -1.5 ... 300 0 0 0 0 0 0 0 0 0
448 12.24 183.49 0.24 0.31 0.22 9.48 10.30 11.56 12.50 2.3 ... 300 0 0 0 0 0 0 0 0 0
449 12.89 340.74 0.24 0.31 0.22 10.30 9.80 11.50 13.14 -7.7 ... 300 0 0 0 0 0 0 0 0 0

450 rows × 75 columns

This is how many minutes the file covers:

In [4]:
data.shape[0]*10
Out[4]:
4500

The columns are time series:

In [5]:
data["Wheel speed"].plot()
Out[5]:
<AxesSubplot:>

Each multidimensional time series comes with a label: 1 if the wind turbine subsequently had a defect, 0 otherwise. The challenge will be to predict the defect from the time series.

In [6]:
label=pd.read_csv(TRAIN_LABEL_PATH)
Out[6]:
f_id file_name ret
0 95 dba63ee5-6603-300e-8071-8536afcbc2de.csv 0
1 95 0b8bfa51-cf28-35d0-94d2-7922f45120b2.csv 0
2 95 d7a64eee-165e-3d39-be67-adc82050bde3.csv 0
3 95 4da3314d-c5b0-3782-bdd6-27fb9e251261.csv 0
4 95 7d58a65f-af5a-3433-bcbb-a342b9468b71.csv 0
... ... ... ...
48334 11 d6e19de9-22a8-39e6-98c1-cc599c819a56.csv 1
48335 11 83895667-dc4e-303a-90e7-7dfc0725f476.csv 1
48336 11 a6ab9f83-4bea-323f-b08e-4a9fb4eab8d6.csv 1
48337 11 a19af894-a9c8-3127-87e4-39567f0a9e0c.csv 1
48338 11 861ce6ba-f676-3ea6-bfbb-16dfda24ac1a.csv 1

48339 rows × 3 columns

The labels are already balanced, which will make the prediction problem easier later on.

In [7]:
label["ret"].hist()
Out[7]:
<AxesSubplot:>

For a simple classification against the label, we can compute features on the time series and let a classification algorithm learn which feature values are critical. The Swiss army knife of feature extraction is https://github.com/blue-yonder/tsfresh by the Karlsruhe-based company Blue Yonder.

In [8]:
%%timeit -r1 -n1 -o
import tsfresh
data["id"]="a"
tsfresh.extract_features(data,n_jobs=1,column_id="id")
Feature Extraction: 100%|██████████| 5/5 [00:19<00:00,  4.00s/it]
29.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Out[8]:
<TimeitResult : 29.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

Processing all files on a single CPU would then take roughly this many hours:

In [9]:
label.shape[0] * _.average / 60 / 60  # "_" holds the TimeitResult captured by %%timeit -o above
Out[9]:
394.79091108894903

Fortunately, tsfresh already supports parallelization: try out different values for n_jobs. You will notice that the parallelization does not scale quite linearly.
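
A minimal sketch of such an experiment, assuming the data DataFrame from above (with its id column already set):

import time
import tsfresh

# Compare the wall-clock time of the feature extraction for several n_jobs values.
for n in (1, 2, 4, 8):
    start = time.time()
    tsfresh.extract_features(data, n_jobs=n, column_id="id", disable_progressbar=True)
    print("n_jobs={}: {:.1f} s".format(n, time.time() - start))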

By the way, to get the list of all files we still need to prepend the path and encode the folder name as a three-digit number.

In [10]:
files=label.apply(lambda row: os.path.abspath(TRAIN_PATH+"{:03d}/{}".format(row["f_id"],row["file_name"])),axis=1)
Out[10]:
0        /gpfs/smartdata/iu5681/src/Parallel_computing/...
1        /gpfs/smartdata/iu5681/src/Parallel_computing/...
2        /gpfs/smartdata/iu5681/src/Parallel_computing/...
3        /gpfs/smartdata/iu5681/src/Parallel_computing/...
4        /gpfs/smartdata/iu5681/src/Parallel_computing/...
                               ...                        
48334    /gpfs/smartdata/iu5681/src/Parallel_computing/...
48335    /gpfs/smartdata/iu5681/src/Parallel_computing/...
48336    /gpfs/smartdata/iu5681/src/Parallel_computing/...
48337    /gpfs/smartdata/iu5681/src/Parallel_computing/...
48338    /gpfs/smartdata/iu5681/src/Parallel_computing/...
Length: 48339, dtype: object

Simply loading all the data into memory does not work either, by the way. In total we are talking about this many gigabytes:

In [11]:
from pathlib import Path
sum(Path(f).stat().st_size  for f in files) /(1024**3)
Out[11]:
6.261525361798704

Now we copy the pieces from above into a function so that we can apply it to every file.

In [12]:
def get_features(file):
    data = pd.read_csv(file)
    data["id"] = file  # use the file path as the time-series id
    return tsfresh.extract_features(data, disable_progressbar=True, n_jobs=1, column_id="id")

To speed things up, we can use a few simpler features instead:

In [13]:
def get_features(file):
    data = pd.read_csv(file)
    data["path"] = file  # group key: one row of aggregates per file
    return data.groupby("path").agg(["mean", "var", "min", "max"])

To do this, we iterate over all files and concatenate the results into one DataFrame (only the first 100, for demo purposes). tqdm provides the progress bar.

In [14]:
%%time
features=pd.concat(get_features(f) for f in tqdm(files[0:100]))
100%|██████████| 100/100 [00:25<00:00,  3.95it/s]
CPU times: user 25.7 s, sys: 83.7 ms, total: 25.8 s
Wall time: 26.4 s

Out of the box, a Python process only keeps a single processor busy, so here we are not exploiting our processor. This is what joblib is for: with delayed, the computations are started asynchronously. If you have several processor cores, increasing n_jobs yields a slight speed-up, just as above (the bottleneck is usually the disk).

In [15]:
%%time
from joblib import Parallel, delayed
features=pd.concat(Parallel(n_jobs=4)(delayed(get_features)(f) for f in tqdm(files[0:100])))
100%|██████████| 100/100 [00:07<00:00, 13.34it/s]
CPU times: user 7.12 s, sys: 264 ms, total: 7.39 s
Wall time: 9.18 s
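
Because the workload is dominated by file I/O, it can be worth trying joblib's thread-based backend, which avoids process start-up and serialization overhead. A sketch, using the same get_features as above:

# Threads share memory and skip pickling, which helps when the disk is the
# bottleneck; note that CPU-bound parts still serialize on the GIL.
features = pd.concat(
    Parallel(n_jobs=4, prefer="threads")(
        delayed(get_features)(f) for f in tqdm(files[0:100])
    )
)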

Here, too, the limits are the CPUs and the RAM of a single machine. Very large machines are usually extremely expensive; a cluster is cheaper. HTCondor (https://research.cs.wisc.edu/htcondor/) is a cluster scheduler that is supported by the Python library dask (https://dask.org/) for distributed computing. This makes it easy to spin up your own dask cluster inside a high-performance cluster, but you can also combine many machines on a company network this way (it works via Kubernetes or Yarn in the cloud, too). We now request a few machines with 8 processor cores and 32 GB RAM each, plus a small disk (we only want to load the data into memory).

In [16]:
import dask.dataframe as dd
from dask_jobqueue import HTCondorCluster
from distributed import Client
from dask.distributed import progress


os.environ["_condor_SCHEDD_HOST"]="login-l.sdil.kit.edu"
cluster= HTCondorCluster(cores=8, memory= "32GB", disk="400MB")
client=Client(cluster)
cluster
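
If no HTCondor scheduler is available (the "at home" case from the title), the same dask API works on a single machine with a LocalCluster. A sketch with illustrative worker counts:

# Alternative to the HTCondorCluster above: use only the cores of this machine.
from dask.distributed import LocalCluster
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)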

We can now scale this configuration up as far as we like.

In [17]:
cluster.scale(160)
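
Instead of a fixed worker count, the cluster can also scale adaptively with the pending work; a sketch:

# Request up to 160 workers from HTCondor while tasks are pending,
# and release them again when the cluster is idle.
cluster.adapt(minimum=0, maximum=160)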

Our example from above can be accelerated trivially. If you have the Python package bokeh installed, you can watch on the dashboard (link above) how the function is executed in parallel on the cluster. (The progress bar is meaningless here, since all jobs are sent to the cluster in parallel.)

In [18]:
%%time
from joblib import parallel_backend
with parallel_backend('dask'):
    features=pd.concat(Parallel()(delayed(get_features)(f) for f in tqdm(files[0:100])))
100%|██████████| 100/100 [00:08<00:00, 11.55it/s]
CPU times: user 6.8 s, sys: 306 ms, total: 7.1 s
Wall time: 11.6 s

It can be done even more simply. The idea of big data is that computation graphs are executed on large, distributed data sources.

In [19]:
ddf=dd.read_csv(TRAIN_PATH+"006/02*.csv",include_path_column=True)
Out[19]:
Dask DataFrame Structure:
[column listing omitted: the 75 float64 sensor columns plus a categorical path column, npartitions=9]
Dask Name: read-csv, 9 tasks

We have loaded only a small number of files for now. In the background, however, dask has done nothing except automatically set up a data structure. We can nevertheless inspect the first rows very quickly (this reads only a small excerpt of a single file).

In [20]:
ddf.head()
Out[20]:
Wheel speed hub angle blade 1 angle blade 2 angle blade 3 angle pitch motor 1 current pitch motor 2 current Pitch motor 3 current overspeed sensor speed detection value 5 second yaw against wind average ... blade 1 super capacitor voltage blade 2 super capacitor voltage blade 3 super capacitor voltage drive 1 thyristor temperature Drive 2 thyristor temperature Drive 3 thyristor temperature Drive 1 output torque Drive 2 output torque Drive 3 output torque path
0 1.77 339.01 21.0 21.01 21.0 1.88 2.64 1.76 1.78 -18.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 /gpfs/smartdata/iu5681/src/Parallel_computing/...
1 1.82 123.01 21.0 21.01 21.0 1.10 2.54 1.58 1.82 -14.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 /gpfs/smartdata/iu5681/src/Parallel_computing/...
2 1.82 230.00 21.0 21.01 21.0 1.56 2.70 1.40 1.82 1.8 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 /gpfs/smartdata/iu5681/src/Parallel_computing/...
3 1.73 33.98 21.0 21.01 21.0 0.80 2.70 0.86 1.74 -12.9 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 /gpfs/smartdata/iu5681/src/Parallel_computing/...
4 1.75 82.01 21.0 21.01 21.0 1.64 2.70 1.82 1.78 -14.6 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 /gpfs/smartdata/iu5681/src/Parallel_computing/...

5 rows × 76 columns

We can specify our feature extraction exactly as above. Here we no longer need to read the data inside each function.

In [21]:
dfeatures=ddf.groupby(['path']).agg(["mean","var","min","max"])
Out[21]:
Dask DataFrame Structure:
[column listing omitted: mean/var/min/max aggregates for each of the 75 float64 sensor columns (300 columns), npartitions=1]
Dask Name: aggregate-agg, 21 tasks

Again, hardly anything has happened in the background. dask has built a computation graph and allocated the data structure for the result.

In [22]:
dfeatures.visualize()
Out[22]:

To demonstrate the computation, we venture on to somewhat more data. (The graph then becomes a bit too large to still display in the notebook; otherwise everything stays the same.)

In [23]:
ddf=dd.read_csv(TRAIN_PATH+"006/*.csv",include_path_column=True)
dfeatures=ddf.groupby(['path']).agg(["mean","var","min","max"])
Out[23]:
Dask DataFrame Structure:
[column listing omitted: mean/var/min/max aggregates for each of the 75 float64 sensor columns (300 columns), npartitions=1]
Dask Name: aggregate-agg, 2895 tasks

We trigger the actual computation with compute. Now is a good time to switch back to the cluster dashboard, ideally to the graph view. There you can watch the data being read in parallel, piece by piece, and the result being aggregated incrementally.

In [24]:
features=dfeatures.compute()
Out[24]:
Wheel speed hub angle blade 1 angle ... Drive 1 output torque Drive 2 output torque Drive 3 output torque
mean var min max mean var min max mean var ... min max mean var min max mean var min max
path
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/00273039-d989-3811-a90c-3ea5281a863d.csv 11.652135 0.995131 0.00 12.23 175.732668 11396.391191 0.00 357.01 0.248260 4.330114e-04 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/004539e0-0349-3410-8603-3d7e3918975e.csv 7.316793 0.121421 0.00 7.47 179.592517 10355.845285 0.00 360.00 0.267996 1.727589e-04 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/00a233ba-2567-3f7e-9aeb-9b599de7d9f1.csv 14.688076 1.125021 0.00 15.35 179.179799 10784.273071 0.00 359.75 0.238926 2.571403e-04 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/00ddf709-bb2e-3444-88b7-61dc04b3bf13.csv 10.150067 1.033120 7.99 12.18 184.451317 10813.880820 0.00 357.98 4.434978 1.894467e+01 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/015d4b9f-77c5-340d-bd57-024fb53f3480.csv 7.151622 0.003039 7.02 7.26 180.776289 10975.712335 1.01 360.00 0.266978 2.113536e-05 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/ff37d10d-bca3-37e9-8f5b-9c6abf77b862.csv 1.312825 6.887868 -0.05 7.02 77.193371 7435.980567 0.00 357.98 70.065740 1.047322e+03 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/ff4dbb6a-b846-3914-b6f8-031d498e3be6.csv 7.355244 0.008396 7.13 7.64 179.541178 10948.732273 0.25 359.75 0.265311 2.495867e-05 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/ff56b7ba-9fc2-306d-8d6e-fd75c31079da.csv 6.825356 0.590923 4.80 7.83 175.144556 10654.435697 0.00 359.24 0.269600 3.848552e-06 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/ff9791d0-aec7-3271-a2d6-942457e42f91.csv 8.051719 0.869359 0.00 9.93 176.688996 11072.518108 0.00 358.49 0.358393 5.772771e-04 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/006/ffff4a7c-ee26-3517-94f6-6e1734348276.csv 7.366778 0.002350 7.20 7.47 179.519911 10488.953408 2.02 360.00 0.270000 9.811503e-16 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1350 rows × 300 columns

Because dask reads the files incrementally, the automatic inference of the column types can run into problems:

In [25]:
try: 
    features=dd.read_csv(TRAIN_PATH+"095/*.csv",include_path_column=True).\
            groupby(['path']).\
            agg(["mean","var","min","max"]).\
            compute()
except ValueError as e:
    print(e)
Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------------------+---------+----------+
| Column                            | Found   | Expected |
+-----------------------------------+---------+----------+
| Drive 1 output torque             | float64 | int64    |
| Drive 2 output torque             | float64 | int64    |
| Drive 2 thyristor temperature     | float64 | int64    |
| Drive 3 output torque             | float64 | int64    |
| Fan current status value          | float64 | int64    |
| Inverter INU RMIO temperature     | float64 | int64    |
| Inverter INU temperature          | float64 | int64    |
| Inverter ISU temperature          | float64 | int64    |
| Inverter grid side active power   | float64 | int64    |
| Pitch motor 1 power estimation    | float64 | int64    |
| Pitch motor 2 power estimation    | float64 | int64    |
| Pitch motor 3 current             | float64 | int64    |
| Pitch motor 3 power estimation    | float64 | int64    |
| Rated hub speed                   | float64 | int64    |
| Wheel control cabinet temperature | float64 | int64    |
| Wheel temperature                 | float64 | int64    |
| atmospheric pressure              | float64 | int64    |
| blade 1 battery box temperature   | float64 | int64    |
| blade 1 inverter box temperature  | float64 | int64    |
| blade 2 battery box temperature   | float64 | int64    |
| blade 2 inverter box temperature  | float64 | int64    |
| blade 2 pitch motor temperature   | float64 | int64    |
| blade 3 battery box temperature   | float64 | int64    |
| blade 3 inverter box temperature  | float64 | int64    |
| blade 3 pitch motor temperature   | float64 | int64    |
| drive 1 thyristor temperature     | float64 | int64    |
| generator power limit value       | float64 | int64    |
| generator torque                  | float64 | int64    |
| hub angle                         | float64 | int64    |
| hub current status value          | float64 | int64    |
| inverter generator side power     | float64 | int64    |
| inverter grid side current        | float64 | int64    |
| inverter grid side reactive power | float64 | int64    |
| inverter grid side voltage        | float64 | int64    |
| pitch motor 1 current             | float64 | int64    |
| pitch motor 2 current             | float64 | int64    |
| reactive power control status     | float64 | int64    |
| reactive power set value          | float64 | int64    |
| vane 1 pitch motor temperature    | float64 | int64    |
| wind direction absolute value     | float64 | int64    |
| yaw request value                 | float64 | int64    |
| yaw state value                   | float64 | int64    |
+-----------------------------------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Drive 1 output torque': 'float64',
       'Drive 2 output torque': 'float64',
       'Drive 2 thyristor temperature': 'float64',
       'Drive 3 output torque': 'float64',
       'Fan current status value': 'float64',
       'Inverter INU RMIO temperature': 'float64',
       'Inverter INU temperature': 'float64',
       'Inverter ISU temperature': 'float64',
       'Inverter grid side active power': 'float64',
       'Pitch motor 1 power estimation': 'float64',
       'Pitch motor 2 power estimation': 'float64',
       'Pitch motor 3 current': 'float64',
       'Pitch motor 3 power estimation': 'float64',
       'Rated hub speed': 'float64',
       'Wheel control cabinet temperature': 'float64',
       'Wheel temperature': 'float64',
       'atmospheric pressure': 'float64',
       'blade 1 battery box temperature': 'float64',
       'blade 1 inverter box temperature': 'float64',
       'blade 2 battery box temperature': 'float64',
       'blade 2 inverter box temperature': 'float64',
       'blade 2 pitch motor temperature': 'float64',
       'blade 3 battery box temperature': 'float64',
       'blade 3 inverter box temperature': 'float64',
       'blade 3 pitch motor temperature': 'float64',
       'drive 1 thyristor temperature': 'float64',
       'generator power limit value': 'float64',
       'generator torque': 'float64',
       'hub angle': 'float64',
       'hub current status value': 'float64',
       'inverter generator side power': 'float64',
       'inverter grid side current': 'float64',
       'inverter grid side reactive power': 'float64',
       'inverter grid side voltage': 'float64',
       'pitch motor 1 current': 'float64',
       'pitch motor 2 current': 'float64',
       'reactive power control status': 'float64',
       'reactive power set value': 'float64',
       'vane 1 pitch motor temperature': 'float64',
       'wind direction absolute value': 'float64',
       'yaw request value': 'float64',
       'yaw state value': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.

Here it helps to set the types manually (for simplicity we just use floating-point numbers throughout).

In [26]:
%time features=dd.read_csv(TRAIN_PATH+"095/*.csv",include_path_column=True, dtype='float64').\
            groupby(['path']).\
            agg(["mean","var","min","max"])
%time features=features.persist()
CPU times: user 2.56 s, sys: 568 ms, total: 3.12 s
Wall time: 2.98 s
CPU times: user 31 s, sys: 686 ms, total: 31.7 s
Wall time: 31.5 s
In [27]:
progress(features)

Unlike compute, persist keeps the computation result on the cluster and is asynchronous. The first time shown is the time it takes to build the graph; the second is the time it takes to send the tasks to the cluster.
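
As a rule of thumb, a sketch using the dfeatures from above:

# compute() blocks and pulls the full result into the local process:
local_result = dfeatures.compute()

# persist() returns immediately; the result is materialized on the workers
# and can be monitored with progress() or awaited with wait().
remote_result = dfeatures.persist()
progress(remote_result)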

In [33]:
from dask.distributed import wait
wait(features, timeout=30)
client.cancel(features)

Unfortunately, the dask graph here grows with the number of files. Note that nothing is computed at this point; the graph is merely built locally. Accordingly, it takes a long time just to start the computation on the cluster (during which the scheduler does nothing at all). Best to leave the lines below commented out: 16 GB of memory are not enough... 🙂

In [37]:
%time features=dd.read_csv(TRAIN_PATH+"*/*.csv",include_path_column=True, dtype='float64').\
            groupby(['path']).\
            agg(["mean","var","min","max"])
#%time features=features.persist()
CPU times: user 58.5 s, sys: 8.8 s, total: 1min 7s
Wall time: 1min 6s
In [38]:
#progress(features)
In [39]:
#from dask.distributed import wait
#wait(features, timeout=30)
#client.cancel(features)

Alternatively, in this case we can use classic map-reduce with our get_features function from above. That keeps the graph manageably small.

In [44]:
import dask.bag as db

%time features=db.from_sequence(files).map(get_features).\
reduction(pd.concat,pd.concat)
%time features=features.persist()
CPU times: user 83.7 ms, sys: 6.03 ms, total: 89.8 ms
Wall time: 88.6 ms
CPU times: user 111 ms, sys: 2.99 ms, total: 114 ms
Wall time: 114 ms
In [45]:
progress(features)

To wrap up the project, we can now begin with the actual machine learning. To do so, we attach the labels from above to the dataset and remove the IDs as well as any null entries.

In [46]:
features=features.compute()
train=features.join(
    label.set_index(files)
).drop(['file_name','f_id'],axis=1).dropna()
Out[46]:
(Wheel speed, mean) (Wheel speed, var) (Wheel speed, min) (Wheel speed, max) (hub angle, mean) (hub angle, var) (hub angle, min) (hub angle, max) (blade 1 angle, mean) (blade 1 angle, var) ... (Drive 1 output torque, max) (Drive 2 output torque, mean) (Drive 2 output torque, var) (Drive 2 output torque, min) (Drive 2 output torque, max) (Drive 3 output torque, mean) (Drive 3 output torque, var) (Drive 3 output torque, min) (Drive 3 output torque, max) ret
path
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/095/dba63ee5-6603-300e-8071-8536afcbc2de.csv 13.394722 0.498932 0.00 14.14 172.347439 10709.343303 0.00 356.00 0.259421 0.000151 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/095/0b8bfa51-cf28-35d0-94d2-7922f45120b2.csv 14.962244 0.079155 14.32 15.51 174.595556 10632.927374 0.00 356.00 0.260000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/095/d7a64eee-165e-3d39-be67-adc82050bde3.csv 15.089200 0.287379 13.41 15.86 176.764444 10886.184925 0.00 356.00 0.260000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/095/4da3314d-c5b0-3782-bdd6-27fb9e251261.csv 13.077089 0.221067 11.87 14.15 170.177778 10579.442712 0.00 356.00 0.260000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/095/7d58a65f-af5a-3433-bcbb-a342b9468b71.csv 16.138867 0.037577 15.73 16.59 179.893333 10386.946281 0.00 356.00 0.261156 0.000574 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/011/d6e19de9-22a8-39e6-98c1-cc599c819a56.csv 0.066200 0.009512 -0.19 0.31 161.388044 11954.801024 0.50 359.64 92.240000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/011/83895667-dc4e-303a-90e7-7dfc0725f476.csv 0.055000 0.004598 -0.14 0.22 191.891689 3728.019529 109.87 304.74 92.240000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/011/a6ab9f83-4bea-323f-b08e-4a9fb4eab8d6.csv 0.008853 0.006387 -0.25 0.22 333.971078 430.780434 0.00 355.14 92.028440 19.514261 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/011/a19af894-a9c8-3127-87e4-39567f0a9e0c.csv 0.008089 0.004167 -0.15 0.19 200.854242 404.828171 0.00 218.63 91.594965 59.219997 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
/gpfs/smartdata/iu5681/src/Parallel_computing/data/train/011/861ce6ba-f676-3ea6-bfbb-16dfda24ac1a.csv 0.000756 0.002392 -0.17 0.14 215.658711 46.816874 202.25 228.74 92.240000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1

48339 rows × 301 columns

To train the model and verify it later, we split the data into training and test sets.

In [48]:
import sklearn as sk
import sklearn.model_selection 

X_train, X_test, y_train, y_test = sk.model_selection.train_test_split( 
    train.drop('ret', axis=1), train["ret"], test_size=0.33, random_state=42)

Now we can train the classifier.

In [49]:
import sklearn.ensemble 

classifier = sk.ensemble.RandomForestClassifier()
Out[49]:
RandomForestClassifier()

Even though we could use the cluster for parallelization here, training is so fast in this case that it is hardly worth it; for that we would have to use all the data and compute many more features. If you want to know more, read https://ml.dask.org/.
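
For reference, scikit-learn's own joblib-based parallelism can be pointed at the cluster with the same parallel_backend mechanism as earlier; a sketch:

# Fit the forest's trees (n_jobs=-1) on the cluster workers via the dask
# joblib backend registered by the distributed import above.
from joblib import parallel_backend
with parallel_backend('dask'):
    classifier = sk.ensemble.RandomForestClassifier(n_jobs=-1)
    classifier.fit(X_train, y_train)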

In [50]:
classifier.fit(X_train, y_train)
Out[50]:
RandomForestClassifier()

By the way, the results are not bad at all:

In [51]:
import sklearn.metrics
print(sk.metrics.classification_report(y_test, classifier.predict(X_test)))
              precision    recall  f1-score   support

           0       0.98      0.97      0.97      8143
           1       0.97      0.98      0.97      7809

    accuracy                           0.97     15952
   macro avg       0.97      0.97      0.97     15952
weighted avg       0.97      0.97      0.97     15952

If you feel like exploring what the various features contribute to the classification, you can also take a look at a decision tree.

In [52]:
from dtreeviz.trees import *
import sklearn.tree
classifier = sk.tree.DecisionTreeClassifier(max_depth=6)  # limit depth of tree

classifier.fit(X_train, y_train)

dtreeviz(classifier, 
               X_train, 
               y_train,
               target_name='ret',
               feature_names=X_train.columns, 
               class_names=["1.0", "0.0"]  # need class_names for classifier
              )  
Out[52]:
[dtreeviz output: SVG rendering of the decision tree]