> Structure the data

Should we use both 'high frequency' and 'production' data? Considering that the two datasets do not share the same columns, once this information is merged we would have 26 variables available to determine our answer in the wellfailure column.

Can the 3 events be used for training, considering that a NaN in the wellfailure column means the well is operating normally?

For the final part, when saving the zip file, should our answers match the well name and date? Or how is this score being evaluated? The zip file we save is a random sample of all the predicted test data, so each time we extract a sample we get different results.

Posted by: carlosv0410 @ Sept. 13, 2021, 5:21 p.m.

Hello,

You can use whichever features you consider are helping your model increase accuracy. If you find that only high_frequency data works well for your model, that's fine; nevertheless, if you also want to feed in some parameters from production data, you first need to do some coding to structure your dataset. We are going to release some code these days to give a little insight about it.
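In the meantime, here is a minimal sketch of what that structuring could look like, assuming hypothetical file names and join keys (`well_name`, `date`); the real column names in the challenge data may differ:

```python
import pandas as pd

# Placeholder file names for the two data sources released with the challenge
high_freq = pd.read_csv("high_frequency.csv", parse_dates=["date"])
production = pd.read_csv("production.csv", parse_dates=["date"])

# Production data is usually coarser than 20-minute telemetry, so merge_asof
# attaches the most recent production record to each telemetry row per well.
high_freq = high_freq.sort_values("date")
production = production.sort_values("date")

merged = pd.merge_asof(
    high_freq,
    production,
    on="date",
    by="well_name",        # join within each well
    direction="backward",  # take the last known production values
)
```

A plain merge on well name and a resampled date would work just as well; the point is only to align the two sources before training.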

With regards to labeling the missing values in the target column: in fact you can do it. The main thing to consider here is class imbalance. That's the challenging part of this dataset: since the telemetry period is 20 minutes, most observations have a missing value as the output, except for those with disruptive events.
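As a rough illustration (assuming the merged frame sketched above and that `wellfailure` is the target column), labeling the NaNs and checking the imbalance could look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Treat missing wellfailure values as normal operation
merged["wellfailure"] = merged["wellfailure"].fillna("normal")

# Inspect the class imbalance before training
print(merged["wellfailure"].value_counts(normalize=True))

# One common mitigation is class weighting in the classifier
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
```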

Finally, please take a look at the "Evaluation Criteria". In short: since the test data only contains 15 wells, we're not only concerned about which type of "Failure event" belongs to each of them, but also about when that event will occur. That's why the type of failure is scored with "accuracy_score" from scikit-learn, and this score gets a penalty depending on how far your prediction is from the real date. So the date is also part of your prediction.

Posted by: bluemirrors @ Sept. 14, 2021, 12:45 a.m.

Thanks for your answers. I have more questions: in addition to wellfailure, would we also be predicting the date?
I made predictions based on the high-frequency variables, holding out a blind dataset, and the tests had correct hits, but it is still not clear how the zip file uploaded to CodaLab is being evaluated.

Posted by: carlosv0410 @ Sept. 14, 2021, 2:22 a.m.

Sure thing,
It is worth mentioning that for "Predictive Maintenance" the main target output you want to find is the "Failure Event Date", so that you can prevent your system from failing. So yes, you're right.
Ok, so the evaluation is quite simple: you have to predict whether the failure event is Manual-off (Reconditioning event) or yes (Failure Event). This classification is scored with the "accuracy_score" method from scikit-learn. Now, since we are also concerned about when these failure events are going to happen, your predicted date is penalized based on how far it is from the real date. For example, let's say the failure event is May 26, 2021 and your predicted date is May 24, 2021; then the penalization will be -0.05 for that classification.
If your prediction differs by, say, a month, the penalization will be -0.5. The closer the prediction is to the real date, the lower the penalty.
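The exact scoring formula isn't spelled out in this thread, so the following is only an assumption-laden sketch that combines "accuracy_score" with a date penalty loosely calibrated to the two examples above (the hypothetical `date_penalty` function is not the organizers' actual scorer):

```python
from sklearn.metrics import accuracy_score

def date_penalty(days_off: int) -> float:
    # Assumed shape only: ~2 days off -> ~0.05, ~a month off -> 0.5 (capped)
    return min(0.5, 0.017 * abs(days_off))

def challenge_score(y_true, y_pred, true_dates, pred_dates):
    # Classification accuracy minus the average date penalty
    acc = accuracy_score(y_true, y_pred)
    penalties = [date_penalty((p - t).days) for t, p in zip(true_dates, pred_dates)]
    return acc - sum(penalties) / len(penalties)
```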

Posted by: bluemirrors @ Sept. 14, 2021, 4:30 a.m.

Hello everyone,
Let me get this straight: are we supposed to predict the date too?

I mean, that's not what the challenge says. Or is it just a comparison between the real results and our results? Our results will differ a lot because the results dataframe is a random sample of all the predictions.

I think you need to clarify this so we can achieve better results.

Posted by: joseeduardou98 @ Sept. 14, 2021, 4:50 a.m.

Ok dear people,

A couple of details to highlight:

1.- You might have noticed that the Challenge description in the GitLab repository, where you probably downloaded the data, specifies the goals of the challenge, the same as the description on this platform.

2.- As a matter of fact, the term "Predictive Maintenance" itself involves an issue of time. The classification alone just tells you which kind of event you are dealing with, but for "Prevention" purposes you need to predict "when" (the date) these events are going to happen.

3.- The workflow you mention was used strictly to demonstrate how to do the submission; in fact, the tutorial clearly states this, and it also points out that you should not expect a failure event for every single row. That's why it randomly selects only 15 rows for the submission.

To be clear, you're free to use any machine learning model to win this competition, and since the most challenging part besides the data structuring is the class imbalance, you should not be doing random selection of the failure events.
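For instance (purely an illustration; the `preds` frame, its `proba` column, and the names are assumptions, not the tutorial's code), one deterministic alternative to random row selection is to keep, per well, the timestamp where the model is most confident:

```python
# preds: one row per telemetry timestamp with columns
# well_name, date, predicted_event, proba (model's failure probability)
idx = preds.groupby("well_name")["proba"].idxmax()
submission = preds.loc[idx, ["well_name", "date", "predicted_event"]]
```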

Cheers.

Posted by: bluemirrors @ Sept. 14, 2021, 5:18 a.m.

Considering that it is a random sample, all uploads to CodaLab that have been made are penalized; the only difference is that some samples happened to be not so far from the real date. The predicted class could be wrong, but the date seems to carry more weight.
I think there are many questions that need to be answered in a team session; the idea and objective of the prediction is still not clear to us.
On what date could that session be held? There are only 10 days left to finish the challenge, and all this time has gone into structuring the data and removing noise to obtain a good model, yet the penalty is lowering our score.

Posted by: carlosv0410 @ Sept. 14, 2021, 4:11 p.m.

Yeah, good point.

The evaluation criteria are clear enough, I think. I didn't mention that the deadline will be extended after the "structuring code" is released. As part of the goals of this challenge, we want you to develop your own workflows; as you know, data structuring takes up to 80% of the time in a project. However, we will release our code to give some insight that participants may adapt into their own workflows as they see fit.

Another important thing: you may not want to replicate the workflow that uses random selection of dates, for the reasons explained above. You're free to develop your own models to win this challenge; that workflow was strictly used to demonstrate the submission on the platform, not as the only workflow for making the predictions.

Posted by: bluemirrors @ Sept. 14, 2021, 5:05 p.m.

Getting the event dates is quite obvious for a predictive maintenance task, I agree with that, but what about the normal ranges of the parameters?
I mean, the data is full of outliers and it's difficult to settle on a standard method for cleaning it. It would be helpful to know the normal ranges of the parameters. As far as I know, these ranges can vary depending on the location of the wells, but it would be useful to know the general behavior of the parameter limits throughout the reservoir.
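In the meantime, here is a generic, assumption-heavy way to tame outliers without knowing the true field ranges (IQR fences per well; column names are hypothetical):

```python
import pandas as pd

def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip a parameter to Tukey fences (q1 - k*IQR, q3 + k*IQR).
    Only a stand-in until the real normal ranges are published."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical usage: clip a pressure reading within each well separately,
# since normal ranges may vary with well location.
# merged["pressure"] = merged.groupby("well_name")["pressure"].transform(clip_iqr)
```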

Posted by: rebeca_mald96 @ Sept. 15, 2021, 5:35 p.m.

Hi, sorry for the delay; we think your point is well worth addressing.
I will include the normal ranges of the field in the next notebook.

Posted by: bluemirrors @ Sept. 16, 2021, 5:54 p.m.