Hey there.
I've just started the competition.
After reading the data files, I found that the number of rows in train.solution does not match the number of rows in train.data ((8077793, 11) vs (8151524, 1)).
I also get "b'Skipping line 3609902: expected 11 fields, saw 12\n'" when reading train.solution.
Is this a dataset error, or do we have to deal with this somehow?
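One way to tell a dataset error from a parsing artifact is to count the raw lines of both files without any CSV parsing. The sketch below is only an illustration: `count_lines` is a hypothetical helper, and the two throwaway files stand in for train.data and train.solution, which are not reproduced here.

```python
import os
import tempfile


def count_lines(file_path):
    # Count raw lines in binary mode, so no quoting or decoding rules apply.
    with open(file_path, "rb") as f:
        return sum(1 for _ in f)


# Demo on two throwaway files (stand-ins for the real competition files).
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "demo.data")
    solution_path = os.path.join(tmp, "demo.solution")
    with open(data_path, "w", encoding="utf-8") as f:
        f.write('a\t1\n"b\t2\nc"\t3\n')
    with open(solution_path, "w", encoding="utf-8") as f:
        f.write("0\n1\n0\n")
    n_data = count_lines(data_path)
    n_solution = count_lines(solution_path)
    print(n_data, n_solution, n_data == n_solution)
```

If the raw line counts agree while the loaded row counts differ, the mismatch is more likely introduced by the parser than by the files themselves.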
I still haven't received any response.
Please help me get started with this CodaLab competition.
I just want to make a first successful submission for my baseline model.
I just started with Pandas and encountered the same problem.
There may be a better way, but I used the following method to load it.
(It may be difficult to read. It's hard to write code in this forum.)
'''
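import pandas as pd
from tqdm import tqdm

# INPUT_DIR is assumed to be set to the folder holding the competition files.

def read_data(file_path, feature_names):
    # Split every line on tabs by hand, bypassing the pandas CSV parser.
    rows = []
    with open(file_path, encoding="utf-8") as f:
        for line in tqdm(f.readlines()):
            line = line.strip()
            line = line.split('\t')
            rows.append(line)
    df = pd.DataFrame(rows, columns=feature_names)
    return df

# Column names come from the header row of feature.name.
feature_names = pd.read_csv(f'{INPUT_DIR}/feature.name', sep='\t').columns.tolist()
train = read_data(f'{INPUT_DIR}/train.data', feature_names)
valid = read_data(f'{INPUT_DIR}/validation.data', feature_names)
# test.data is read without the first column name.
test = read_data(f'{INPUT_DIR}/test.data', feature_names[1:])
label = pd.read_csv(f'{INPUT_DIR}/train.solution', header=None).T.values[0]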
'''
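A possible explanation for why splitting lines by hand gives a different row count than a CSV parser is quote handling: a double quote at the start of a field opens a quoted field, which can swallow tabs and newlines until the next quote. Whether that is the actual cause in this dataset is an assumption, not something confirmed in this thread. The sketch below shows the effect with Python's standard `csv` module on a made-up sample; `parse` is a hypothetical helper.

```python
import csv
import io

# Made-up tab-separated sample: three physical lines, with a double quote
# opening a field on line 2 and another one closing it on line 3.
sample = 'a\t1\n"b\t2\nc"\t3\n'


def parse(text, quoting):
    # newline='' lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(text, newline=''), delimiter='\t', quoting=quoting)
    return list(reader)


quoted_rows = parse(sample, csv.QUOTE_MINIMAL)  # default: '"' starts a quoted field
raw_rows = parse(sample, csv.QUOTE_NONE)        # '"' is treated as an ordinary character

print(len(quoted_rows), quoted_rows)
print(len(raw_rows), raw_rows)
```

`pandas.read_csv` accepts the same `quoting` argument, so passing `quoting=csv.QUOTE_NONE` together with `sep='\t'` may be worth trying before falling back to a manual loader.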
Yes, please clarify why train.data and train.solution have a different number of entries.
Posted by: pasqlisena @ July 15, 2020, 12:07 p.m.
For any future reader: I solved this by following https://github.com/pandas-dev/pandas/issues/21695
Posted by: pasqlisena @ July 16, 2020, 10:02 a.m.
Hey,
I'm still getting the same mismatch in the number of rows. Can anyone please help?
Hi, have you tried applying this? https://github.com/pandas-dev/pandas/issues/21695#issuecomment-401594317
Posted by: pasqlisena @ Aug. 3, 2020, 7:47 a.m.