COVID-19 Retweet Prediction Challenge Forum


> train.solution number of rows mismatch train.data number of rows

Hey there.
I've just started the competition.
After reading the data files, I found that the row counts of train.data and train.solution do not match: (8077793, 11) vs (8151524, 1).
I also get "b'Skipping line 3609902: expected 11 fields, saw 12\n'" for train.solution.
Is this a dataset error, or do we have to deal with it somehow?
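
One way to tell a malformed file from a parser quirk is to count the physical lines and the tab-separated fields per line without pandas. The following is only a minimal sketch: `count_fields` is a hypothetical helper name, and it assumes the file is UTF-8 and tab-separated.

```python
from collections import Counter


def count_fields(path, sep="\t", encoding="utf-8"):
    """Return (number of physical lines, Counter mapping field count -> number of lines).

    Hypothetical helper: assumes a UTF-8 text file with one record per line.
    """
    n_lines = 0
    widths = Counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            n_lines += 1
            # Drop only the newline so that empty trailing fields are still counted.
            widths[len(line.rstrip("\n").split(sep))] += 1
    return n_lines, widths
```

If every line has the expected number of fields and the line count equals the number of labels, the file itself is consistent and the mismatch comes from how it is parsed.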

Posted by: JMourad @ July 8, 2020, 7:53 a.m.

I still haven't received any response.
Please help me get started with this CodaLab competition.
I just want to make a first successful submission with my baseline model.

Posted by: JMourad @ July 11, 2020, 12:35 p.m.

I just started with Pandas and encountered the same problem.
There may be a better way, but I used the following method to load it.
(It may be difficult to read. It's hard to write code in this forum.)

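'''
import pandas as pd
from tqdm import tqdm

# INPUT_DIR must point to the folder that contains the competition files.

def read_data(file_path, feature_names):
    # Split every line on tabs by hand so that no line is skipped or merged by the parser.
    rows = []
    with open(file_path, encoding="utf-8") as f:
        for line in tqdm(f):
            # Remove only the newline; strip() would also drop empty leading or trailing fields.
            rows.append(line.rstrip('\n').split('\t'))
    # All values are kept as strings; convert column types afterwards as needed.
    df = pd.DataFrame(rows, columns=feature_names)
    return df

# The column names are taken from the header line of feature.name.
feature_names = pd.read_csv(f'{INPUT_DIR}/feature.name', sep='\t').columns.tolist()

train = read_data(f'{INPUT_DIR}/train.data', feature_names)
valid = read_data(f'{INPUT_DIR}/validation.data', feature_names)
# test.data is read without the first column name.
test = read_data(f'{INPUT_DIR}/test.data', feature_names[1:])

# train.solution has one label per line and no header.
label = pd.read_csv(f'{INPUT_DIR}/train.solution', header=None).T.values[0]
'''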

Posted by: myaunraitau @ July 15, 2020, 8:32 a.m.

Yes, please clarify why train.data and train.solution have different numbers of entries.

Posted by: pasqlisena @ July 15, 2020, 12:07 p.m.

For any future readers: I solved this by following https://github.com/pandas-dev/pandas/issues/21695
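
A common cause of this kind of row-count mismatch is a stray double quote in a text field: with its default settings, pandas treats the quote as the start of a quoted field and can merge several lines into one row. The sketch below reproduces that on a made-up in-memory sample and uses `quoting=csv.QUOTE_NONE` as one possible workaround. That this is what happens in train.data is an assumption, and `read_tsv_raw` is a hypothetical helper name, not something taken from the linked issue.

```python
import csv
import io

import pandas as pd

# Made-up sample: 4 tab-separated lines, with a '"' opening on line 2 and closing on line 3.
SAMPLE = 'a\t1\n"b\t2\nc"\t3\nd\t4\n'


def read_tsv_raw(source, **kwargs):
    """Read a tab-separated source while treating quote characters as ordinary text."""
    return pd.read_csv(source, sep="\t", header=None, quoting=csv.QUOTE_NONE, **kwargs)


# Default quoting: lines 2 and 3 are merged into a single row.
default_df = pd.read_csv(io.StringIO(SAMPLE), sep="\t", header=None)

# QUOTE_NONE: one row per physical line.
raw_df = read_tsv_raw(io.StringIO(SAMPLE))

print(len(default_df), len(raw_df))
```

With real files, the same call would take a path such as train.data instead of the `io.StringIO` buffer.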

Posted by: pasqlisena @ July 16, 2020, 10:02 a.m.

Hey,
I'm still getting the same mismatch in the number of rows. Can anyone please help?

Posted by: sammy786 @ Aug. 2, 2020, 3:45 a.m.

Hi, have you tried applying this? https://github.com/pandas-dev/pandas/issues/21695#issuecomment-401594317

Posted by: pasqlisena @ Aug. 3, 2020, 7:47 a.m.