What should be done first, handling missing data or dealing with data types?


In data science, which process should come first: handling missing data or handling data types? I am asking because I run into a problem either way:

1) Handling missing data first, then data types: it is difficult to handle missing data in categorical columns.

2) Handling data types first, then missing data: if we create dummy variables for categorical values, then while handling missing data we might assign two different categories to the same row.

asked 9 hours ago

Kiran

data-cleaning

2 Answers

Handle data types first, then perform multiple imputation.

Several solid implementations of multiple imputation using chained equations (MICE) that I can think of permit contingent imputation, where:

Specific data types produce specific imputation models, so the quality of your imputation depends on handling data types correctly first.

Interdependence between variables (e.g., mutually exclusive categories) can be modeled explicitly (e.g., using ordered logit or unordered multinomial logit).

Hard dependencies can be respected (e.g., do not impute both $x$ and $x^2$; instead impute only $x$ using chained equations and then calculate $x^2$ from the imputed values of $x$, or vice versa).

In this way you can handle missing data for categorical variables alongside continuous or interval variables.
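The ordering above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the `SCHEMA` mapping, and the column names `age`, `city`, and `age_sq` are hypothetical, and a single median or mode fill stands in for the per-type models that a real MICE implementation would fit. It is not multiple imputation itself.

```python
from statistics import median, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": "34", "city": "Paris"},
    {"age": None, "city": "Lyon"},
    {"age": "51", "city": None},
    {"age": "40", "city": "Paris"},
]

# Step 1: settle the data types first. Missing entries stay None.
SCHEMA = {"age": float, "city": str}

def cast(row):
    """Convert each observed value to its declared type."""
    return {k: (None if v is None else SCHEMA[k](v)) for k, v in row.items()}

typed = [cast(r) for r in rows]

# Step 2: pick the imputation rule from the declared type.
# Median for numeric and mode for categorical are simple stand-ins,
# not the regression models a MICE implementation would use.
def fill_value(column):
    observed = [r[column] for r in typed if r[column] is not None]
    return median(observed) if SCHEMA[column] is float else mode(observed)

fills = {c: fill_value(c) for c in SCHEMA}
imputed = [
    {c: (fills[c] if r[c] is None else r[c]) for c in SCHEMA}
    for r in typed
]

# Step 3: hard dependency. Derive the square from the imputed value
# instead of imputing it as a separate variable.
for r in imputed:
    r["age_sq"] = r["age"] ** 2
```

Because the types are fixed before any value is filled in, the categorical column is never touched by a numeric rule, and the derived column always agrees with the variable it depends on.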

References

Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49.

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.

edited 3 hours ago

answered 7 hours ago

Alexis


I don't think it matters which one you do first, as long as your methods are valid. Yes, imputing missing categorical data is difficult. However, if you create your dummy variables first and then try to impute, you haven't made your problem any easier, because you now have C (the number of categories) variables to determine.

Personally, I would take the first approach and impute what I can (if imputation is the route you are taking). MICE is a very flexible method for imputing data of different types.
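To make the point about the C dummy variables concrete, here is a small sketch comparing the two orders. The category levels, the `colour` column, and the helper names are hypothetical, and a per-column mode fill is used purely as a simple stand-in for an imputation model. With this particular rule the missing row ends up with no category at all; other column-by-column rules could just as easily switch on two categories at once.

```python
from statistics import mode

CATEGORIES = ["red", "green", "blue"]  # hypothetical levels, so C = 3
colour = ["red", "green", "blue", "green", "red", "green", "blue", None]

def one_hot(value):
    """Encode one category as C dummy values; a missing value stays missing in all C."""
    if value is None:
        return [None] * len(CATEGORIES)
    return [int(value == c) for c in CATEGORIES]

def fill_with_mode(values):
    """Replace None with the most common observed value (a simple stand-in rule)."""
    most_common = mode([v for v in values if v is not None])
    return [most_common if v is None else v for v in values]

def is_valid(row):
    """A one-hot row must switch on exactly one category."""
    return sum(row) == 1

# Order A: encode first, then fill each of the C dummy columns on its own.
dummy_rows = [one_hot(v) for v in colour]
dummy_cols = [fill_with_mode(list(col)) for col in zip(*dummy_rows)]
encoded_first = [list(row) for row in zip(*dummy_cols)]

# Order B: fill the single categorical column, then encode.
imputed_first = [one_hot(v) for v in fill_with_mode(colour)]

print(encoded_first[-1], is_valid(encoded_first[-1]))
print(imputed_first[-1], is_valid(imputed_first[-1]))
```

In order A the one missing cell turns into three missing cells that are filled without regard to each other, so nothing enforces mutual exclusivity. In order B there is only one value to determine, and the encoding step guarantees a valid row.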

answered 8 hours ago

Emma Jean
