One thing I thought might be useful, and built a proof-of-concept for, is a machine learning/deep learning model called sequence to sequence (seq2seq). It is usually used for language translation, text summarization, speech recognition and automated question answering. My idea is to use it in a so-called autoencoder configuration.

As a bit of ML/DL background, at a very high level, sequence models learn sequences (of numbers, text, audio...) and are able to predict the next item in a sequence. A seq2seq model has two parts: the encoder learns to represent a sequence as a lower-dimensional vector, while the decoder learns to generate another sequence from that vector. A translation engine, for example, is trained on pairs of sentences (sequences), one in each language. For an autoencoder we give it the same sentence as both source and target, so it basically learns to recreate sentences similar to the ones seen during training.

So how is this useful for our log files? The emphasis in the previous sentence is that it can recreate sequences it has come across during training. So what if we train the model on log/trace files captured while the system was in normal operation? Line by line, it will learn to recreate those lines and lines similar to them. If all of a sudden it comes across a line that is very different from anything seen during training, it will not be able to recreate it accurately, i.e. it will generate a sentence (sequence) that is vastly different from the original line in the log file.
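To make the "vastly different" idea concrete, here is a minimal sketch of how an original line could be scored against its reconstruction. The post does not say which similarity measure is used for this comparison, so the token-level `difflib` ratio, the `0.5` threshold, the function names and the sample reconstructions below are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(original: str, reconstructed: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    return SequenceMatcher(None, original.split(), reconstructed.split()).ratio()


def is_anomaly(original: str, reconstructed: str, threshold: float = 0.5) -> bool:
    """Flag a line whose reconstruction falls below the (assumed) threshold."""
    return similarity(original, reconstructed) < threshold


# Hypothetical (original, reconstruction) pairs, not real model output.
pairs = [
    ("Thread init successful", "Thread init successful"),
    ("Received state LABEL change for resource", "Thread init successful done"),
]
for original, reconstructed in pairs:
    score = similarity(original, reconstructed)
    print(f"{score:.2f} anomaly={is_anomaly(original, reconstructed)} {original}")
```

The threshold would need tuning on real data: too low and genuine anomalies slip through, too high and every slightly unusual line gets reported.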
So this is, in a way, anomaly detection in text. We can compare the source (original) line from the log file to the one generated by the model, and if the two are very different we report the line as an anomaly. Since the model was trained on a normal state of system operation, such an anomaly should be something out of the ordinary and possibly useful for troubleshooting. Some of those log lines will have nothing to do with an issue or a problem; they will just be different from the log lines seen during model training. So we may need to filter them further with supervised classification, which would of course require someone to go through such extracts from log files and label them as important or not.

To make the output more useful, I used an ML clustering algorithm called Mean Shift to automatically group log lines that happened around the same time (using the timestamp from the log). Grouped like that, we can treat each cluster as a separate incident that happened around a certain time (the central timestamp of that cluster of log lines). The grouping could be improved further by also grouping on specific words or phrases with the same algorithm.
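As a rough illustration of grouping anomalous lines by time, the sketch below implements a small one-dimensional, flat-kernel mean shift from scratch using only the standard library. It is not the implementation used for this post; the 60-second `bandwidth`, the function names and the choice of sample timestamps are assumptions.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def to_seconds(stamp: str) -> float:
    """Convert a trace timestamp such as '2018-01-02 17:47:21.379' to seconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return (parsed - EPOCH).total_seconds()


def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift on 1-D data; returns (labels, centers)."""
    modes = []
    for point in points:
        x = float(point)
        for _ in range(max_iter):
            # Shift x to the mean of all points inside the current window.
            window = [q for q in points if abs(q - x) <= bandwidth]
            if not window:
                break
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)

    # Points whose modes lie within one bandwidth share a cluster.
    centers, labels = [], []
    for mode in modes:
        for index, center in enumerate(centers):
            if abs(mode - center) <= bandwidth:
                labels.append(index)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels, centers


stamps = [
    "2018-01-02 17:47:21.379",
    "2018-01-02 17:47:21.396",
    "2018-02-02 13:00:44.538",
    "2018-02-02 13:00:46.551",
]
labels, centers = mean_shift_1d([to_seconds(s) for s in stamps], bandwidth=60.0)
for stamp, label in zip(stamps, labels):
    print(f"Cluster {label}: {stamp}")
```

The bandwidth decides how far apart two lines can be and still count as the same incident, so it is the main knob to tune.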
First, split the trace file into a training set and a test set:
head -n 32750 ohasd.trc > ohasd.trc.train
tail -n 8188 ohasd.trc > ohasd.trc.test
Then tokenize the trace files (split them into separate words and convert the words to numbers) so they can be ingested into the seq2seq model. AWS provides a Python script for that, but I have modified it so it keeps only tokens that don't contain special characters or numbers. That makes the model easier to train: the vocabulary has fewer items and the sentences are shorter, which makes them easier for the seq2seq model to recreate:
python3 create_vocab_proto.py
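The filtering idea can be sketched roughly as below. This is not the modified AWS script itself; the alphabetic-only rule, the reserved symbols and the function names are assumptions chosen to show how dropping tokens with digits or special characters shrinks both the vocabulary and the sentences.

```python
import re
from collections import Counter

# Keep only purely alphabetic tokens (assumed filtering rule).
ALPHA_ONLY = re.compile(r"^[A-Za-z]+$")


def tokenize(line):
    """Split on whitespace and drop tokens with digits or special characters."""
    return [token for token in line.split() if ALPHA_ONLY.match(token)]


def build_vocab(lines, reserved=("<pad>", "<unk>", "<s>", "</s>")):
    """Map each kept token to an integer id, most frequent tokens first."""
    counts = Counter(token for line in lines for token in tokenize(line))
    vocab = {symbol: index for index, symbol in enumerate(reserved)}
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)
    return vocab


def encode(line, vocab):
    """Convert a line to token ids, using <unk> for unseen tokens."""
    unknown = vocab["<unk>"]
    return [vocab.get(token, unknown) for token in tokenize(line)]


sample = "2018-01-02 17:47:21.387 : OCRAPI:57868352: a_init:18: Thread init successful"
vocab = build_vocab([sample])
print(tokenize(sample))
print(encode(sample, vocab))
```

With this rule, timestamps, thread ids and component prefixes disappear and only the message words remain.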
We load the files to S3 so SageMaker can use them. A training job with these parameters can then be created and run: ...
create_training_params = {
    ..
    "HyperParameters": {
        "max_seq_len_source": "20",
        "max_seq_len_target": "20",
        "optimized_metric": "bleu",
        "batch_size": "256",
        "checkpoint_frequency_num_batches": "1000",
        "rnn_num_hidden": "512",
        "num_layers_encoder": "3",
        "num_layers_decoder": "3",
        "num_embed_source": "512",
        "num_embed_target": "512",
        "checkpoint_threshold": "3"
    ...
As seen, we are optimizing for BLEU, a metric used to compare two sentences for similarity. After creating and running the training job, an endpoint can be created so the model can be used over an HTTP request from anywhere, and we would not need to install Python ML libraries on the actual database server. The endpoint runs on an EC2 instance in the background (which we don't manage), but we are billed for that instance for as long as the endpoint is up and running.
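For readers unfamiliar with BLEU, the following is a simplified sentence-level sketch: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It uses a single reference and no smoothing, so it is not necessarily what SageMaker computes internally; the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified BLEU for one reference and one candidate sentence."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, any empty n-gram order gives a score of zero.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "state change received from node for resource"
print(sentence_bleu(reference, reference))
print(sentence_bleu(reference, "state change received from host for resource"))
print(sentence_bleu(reference, "thread init successful"))
```

Because there is no smoothing, sentences shorter than four tokens always score zero here, which is one reason production implementations usually add smoothing or score at corpus level.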
Cluster 0: |ORIG: | 2018-01-02 17:47:21.379 :OHASDMAIN:57868352: OHASD params []
Cluster 0: |ORIG: | 2018-01-02 17:47:21.379 :OHASDMAIN:57868352: Socket cleanup:0x49bde30
Cluster 0: |ORIG: | 2018-01-02 17:47:21.379 :OHASDMAIN:57868352: Got [0] potential names
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 : OCRRAW:57868352: proprioo: for disk 0 (/u01/grid/cdata/localhost/ip-172-31-86-123.olr), id match (1), total id sets, (1) need recover (0), my votes (0), total votes (0), commit_lsn (1), lsn (1)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 : OCRRAW:57868352: proprioo: my id set: (1777565868, 1028247821, 0, 0, 0)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 : OCRRAW:57868352: proprioo: 1st set: (1777565868, 1028247821, 0, 0, 0)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.383 : OCRRAW:57868352: proprioo: 2nd set: (0, 0, 0, 0, 0)
Cluster 0: |ORIG: | 2018-01-02 17:47:21.387 : OCRAPI:57868352: a_init:18: Thread init successful
Cluster 0: |ORIG: | 2018-01-02 17:47:21.387 : OCRAPI:57868352: a_init:19: Client init successful
Cluster 0: |ORIG: | 2018-01-02 17:47:21.388 :OHASDMAIN:57868352: Version compatibility check passed: Software Version: 12.2.0.1.0 Release Version: 12.2.0.1.0 Active Version: 12.2.0.1.0
Cluster 0: |ORIG: | 2018-01-02 17:47:21.392 : CRSMAIN:57868352: Logging level for Module: GIPCBASE 0
Cluster 0: |ORIG: | 2018-01-02 17:47:21.396 : CRSPE:57868352: ...done : 0
Cluster 0: |ORIG: | 2018-01-02 17:47:21.396 :OHASDMAIN:57868352: Initializing ubglm...
Cluster 2: |ORIG: | 2018-02-02 13:00:44.538 : AGFW:2113812224: {0:7:16} Verifying msg rid = ora.asm ip-172-31-86-123 1
Cluster 2: |ORIG: | 2018-02-02 13:00:44.538 : AGFW:2113812224: {0:7:16} Received state LABEL change for ora.asm ip-172-31-86-123 1 [old label = Started, new label = Abnormal Termination]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.538 : CRSPE:2101204736: {0:7:16} State change received from ip-172-31-86-123 for ora.asm ip-172-31-86-123 1
Cluster 2: |ORIG: | 2018-02-02 13:00:44.539 : CRSPE:2101204736: {0:7:16} Processing unplanned state change for [ora.asm ip-172-31-86-123 1]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.539 : CRSPE:2101204736: {0:7:16} Scheduled local recovery for [ora.asm ip-172-31-86-123 1]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 : CRSPE:2101204736: {0:7:16} RI [ora.asm ip-172-31-86-123 1] new internal state: [CLEANING] old value: [STABLE]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 : CRSPE:2101204736: {0:7:16} state change vers moved to 6 for RI:ora.asm ip-172-31-86-123 1
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 : CRSPE:2101204736: {0:7:16} Sending message to agfw: id = 50979
Cluster 2: |ORIG: | 2018-02-02 13:00:44.540 : CRSPE:2101204736: {0:7:16} CRS-2679: Attempting to clean 'ora.asm' on 'ip-172-31-86-123'
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 : AGFW:2113812224: {0:7:16} ora.orcl.db 1 1 received state from probe request. Old state = ONLINE, New state = ONLINE
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 : AGFW:2113812224: {0:7:16} ora.orcl.db 1 1 received state from probe request. Old state = ONLINE, New state = ONLINE
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 : CRSPE:2101204736: {0:7:17} ora.DATA.dg ip-172-31-86-123 1: uptime exceeds uptime threshold , resetting restart count
Cluster 2: |ORIG: | 2018-02-02 13:00:44.547 : CRSPE:2101204736: {0:7:17} Scheduled local recovery for [ora.DATA.dg ip-172-31-86-123 1]
Cluster 2: |ORIG: | 2018-02-02 13:00:44.552 : AGFW:2113812224: {0:7:17} ora.orcl.db 1 1 received state from probe request. Old state = ONLINE, New state = ONLINE
Cluster 2: |ORIG: | 2018-02-02 13:00:46.551 : AGFW:2113812224: {0:7:18} Verifying msg rid = ora.orcl.db 1 1
Cluster 2: |ORIG: | 2018-02-02 13:00:46.551 : AGFW:2113812224: {0:7:18} Received state LABEL change for ora.orcl.db 1 1 [old label = Open,HOME=/u01/app/oracle/product/12.2.0/db, new label = Abnormal Termination,HOME=/u01/app/oracle/product/12.2.0/db]
Cluster 2: |ORIG: | 2018-02-02 13:00:46.552 : CRSPE:2101204736: {0:7:18} State change received from ip-172-31-86-123 for ora.orcl.db 1 1