I tried several models and now get accuracy close to 1. If any of you want to see my code, please contact me and we can arrange a time to meet.
I ran into errors while formatting, zipping, and submitting the files, and I have no time to sort this out right now.
I hope the website can be made more user-friendly and faster. The file upload process is just not pleasant to work with.
I just want to share some ideas in case you want to dig deeper and get better results:
You might need solid domain knowledge and good intuition to pick good features.
Models:
A neural network with a deep and wide structure works best. Random forest and SVC are OK.
Poorest-performing classifiers: Perceptron, K-Nearest Neighbors
Average-performing classifiers: Random Forest
Good-performing classifiers: SVM, Logistic Regression, Naïve Bayes
Some Statistics:
Total wells in dataset: 7,012,320
If no spud date and open hole: must be active (100%).
If cased + completed: 23677 in total; 31 suspended (0.131%), 8392 active (35.444%), 15254 abandoned (64.425%).
If it has max production: 27 in total; 3 abandoned (11.111%), 23 active (85.185%), 1 suspended (3.704%).
If it has a spud date and open hole: 4839 wells; 30 abandoned, 1565 suspended, 3244 active (67.039%).
If it has a spud date and cased + completed: 560078 in total; 232843 abandoned, 222517 active, 104718 suspended.
If the rig release date is within the last 10 years (2008 to now):
Total 259
Abandoned 22
Suspended 49
Active 188: 72.586%
If before 1900: 92 in total; 1 suspended (1.1%), 91 abandoned (99%).
If older than 30 years (1900-1988):
Total 64
Abandoned 28
Active 21
Suspended 15
If it has a surface abandonment date: 149148 total
Active 512
Suspended: 853
Abandoned: 147774, 99.748%
If no fracture: 431868 total
Abandoned 100334
Suspended 102403
Active 229131
If it has a fracture: 7657 in total
Abandoned 19, 0.25%
Active 4580, 59.81%
Suspended 3058, 39.94%
If more than 100 stages: 31 in total; 4 active, the remaining 27 all suspended.
If only 1 stage, 1225 in total
9 abandoned
98 suspended
1118 active
If fewer than 10 stages: 2391 in total
14 abandoned
221 suspended
2156 active
If more than 10 and fewer than 20 stages: 1568 in total
1 abandoned
764 suspended
803 active
If 20 to 50 stages: 3366 in total
4 abandoned
1895 suspended
1497 active
If bitumen and SAGD: 24586 total
7183 suspended: 29.22%
4146 abandoned: 16.86%
12627 active: 51.36%
If bitumen cyclical: 4700 total
333 abandoned
462 suspended
3905 active
If SAGD: 3229 total
298 suspended
2801 active: 86.75%
130 abandoned
If gas only: 137318 total
16864 abandoned
35759 suspended
84695 active: 61.68%
If salt water disposal: 168 total
70 abandoned
1 suspended
97 active
If water: 1059 total
151 abandoned
593 suspended
351 active
If water source: 25 total
13 abandoned
12 active
If unspecified: 55009 total
34784 abandoned
16431 suspended
3791 active
If storage gas and LPG: 251 total
6 abandoned
28 suspended
181 active
If coal bed methane: 15700 total
121 abandoned
655 suspended
14924 active
Hope you can get some tips from the numbers.
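If you want to recompute breakdowns like the ones above from raw counts, here is a minimal Python sketch. It uses the cased + completed counts as input; the function name `status_shares` is only an illustrative choice, not part of any competition code.

```python
def status_shares(counts):
    """Return each status's share of the group total, in percent (3 decimals)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no wells in this group")
    return {status: round(100 * n / total, 3) for status, n in counts.items()}


# Counts for the cased + completed group quoted above.
cased_completed = {"suspended": 31, "active": 8392, "abandoned": 15254}

shares = status_shares(cased_completed)
print(sum(cased_completed.values()))  # 23677
print(shares["suspended"])            # 0.131
print(shares["active"])               # 35.444
```

A group where one status takes almost the whole share is a good hint that the condition behind it is worth turning into a feature. Summing the per-status counts and comparing against the stated total is also a cheap way to catch typing mistakes.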
Files:
Header train should be joined with Well_class_train; each file has 588657 wells in total.
These files are used for training.
Header-validation should be joined with Well_class_validate; each file has 132133 wells.
These files are used for testing.
Header-test has 118078 wells and is used for prediction.
The file naming seems confusing, because we normally name such files train data, test data, and predict data.
I hope this can be made clearer in the future.
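As a rough illustration of the join described above, here is a minimal sketch using only Python's standard library. The column names `well_id`, `spud_date`, and `status`, and the sample rows, are assumptions made up for the example; the real files may use different names.

```python
import csv
import io


def join_on_key(header_rows, class_rows, key="well_id"):
    """Inner-join two lists of dict rows on `key` (a hypothetical column name)."""
    labels = {row[key]: row for row in class_rows}
    joined = []
    for row in header_rows:
        match = labels.get(row[key])
        if match is not None:
            # Merge the header columns with the matching class columns.
            joined.append({**row, **match})
    return joined


# Tiny made-up stand-ins for a header file and a well-class file.
header_csv = "well_id,spud_date\nA1,2010-05-01\nA2,\nA3,1975-03-12\n"
class_csv = "well_id,status\nA1,Active\nA3,Abandoned\n"

header_rows = list(csv.DictReader(io.StringIO(header_csv)))
class_rows = list(csv.DictReader(io.StringIO(class_csv)))

train = join_on_key(header_rows, class_rows)
print(len(train))          # 2
print(train[1]["status"])  # Abandoned
```

Since the header and class files are said to contain the same number of wells, comparing the row count before and after the join is a quick check that no wells were dropped.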
Thanks for sharing, Song Li. I am wondering whether you have submitted to the leaderboard? Also, you said your accuracy is close to 1; can you give a more precise number? For me, it was just ensemble learning with ~5 different tree-based algorithms plus some basic feature engineering.
Posted by: whitehair @ Nov. 1, 2019, 5:20 p.m.
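For readers unfamiliar with ensembling, one simple way to combine several classifiers is hard (majority) voting over their predicted labels. The sketch below assumes that scheme purely for illustration; the reply does not say how the tree-based models were actually combined, and the three prediction lists are made up.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per well by hard voting.

    `predictions` holds one list of labels per model, all of equal length.
    On a tie, the label from the earliest model in the list wins.
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # First label (in model order) that reaches the top vote count.
        combined.append(next(l for l in labels if counts[l] == best))
    return combined


# Made-up predictions from three hypothetical models for three wells.
model_a = ["Active", "Abandoned", "Suspended"]
model_b = ["Active", "Active", "Suspended"]
model_c = ["Suspended", "Abandoned", "Active"]

print(majority_vote([model_a, model_b, model_c]))
# ['Active', 'Abandoned', 'Suspended']
```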