What should be done first: handling missing data or dealing with data types?
In data science, which step should come first: handling missing data or handling data types? I am asking because I run into a problem either way:
1) Handling missing data first, then data types: it is difficult to handle missing data in categorical columns.
2) Handling data types first, then missing data: if we create dummies for categorical values, then while handling missing data we might assign two different categories to the same row.
data-cleaning
asked 9 hours ago
Kiran
2 Answers
Handle data types first, then perform multiple imputation.
Several solid implementations of multiple imputation using chained equations (MICE) that I can think of permit contingent imputation, where:
- Specific data types produce specific models, so the quality of your imputation depends on handling data types first
- Interdependence between variables (e.g., mutually exclusive categories) can be explicitly modeled (e.g., using ordered logit or unordered multinomial logit)
- Hard dependencies can be respected (e.g., do not impute both $x$ and $x^2$; instead impute only $x$ using chained equations and simply calculate $x^2$ from the imputed values of $x$, or vice versa)
In this way you can handle missing data for categorical variables alongside continuous or interval variables.
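To make the ordering concrete, here is a minimal sketch, not a MICE implementation: the mean and mode stand in for the type-specific models that a real chained-equations routine would fit, and the column names (`x`, `colour`) and the `impute_column` helper are made up for illustration. It only shows the order of operations: fix the types, impute according to type, then derive $x^2$ from the imputed $x$ instead of imputing it.

```python
from statistics import mean, mode

# Toy records; None marks a missing value. Column names are hypothetical.
rows = [
    {"x": 1.0, "colour": "red"},
    {"x": None, "colour": "blue"},
    {"x": 3.0, "colour": None},
    {"x": 5.0, "colour": "red"},
]

# Step 1: settle the data types, which decides the imputation model per column.
column_types = {"x": "numeric", "colour": "categorical"}


def impute_column(values, kind):
    """Fill None with a type-appropriate stand-in: mean for numeric, mode otherwise."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in values]


# Step 2: impute each column with the model its type calls for.
imputed = {
    name: impute_column([r[name] for r in rows], kind)
    for name, kind in column_types.items()
}

# Step 3: hard dependency. x_squared is never imputed; it is recomputed from x.
imputed["x_squared"] = [x ** 2 for x in imputed["x"]]
```

With a real MICE package the stand-ins in step 2 would be replaced by the regression models appropriate to each type, but step 3 stays the same: the derived column is calculated, not imputed.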
References
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.
edited 3 hours ago
answered 7 hours ago
Alexis
I don't think it matters which one you do first, as long as your methods are valid. Yes, imputing categorical missing data is difficult. However, if you create your dummy variables and then try to impute, you haven't made your problem any easier, because you now have C variables to determine (where C is the number of categories).
I would personally take the first approach and try to impute what I can (if that is the route you are taking). MICE is a very flexible method for imputing data of different types.
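A small sketch of the "C variables" point, using only the standard library. The `colour` column and the `one_hot` helper are hypothetical, and the mode is just a stand-in for whatever imputation model you actually use. Encoding first turns one missing cell into C missing cells; imputing first means one decision per row, so the dummies cannot disagree.

```python
from statistics import mode

# Hypothetical categorical column; None marks a missing value.
colour = ["red", "blue", None, "red", "green"]
categories = sorted({c for c in colour if c is not None})


def one_hot(value):
    """Encode one value as C dummies; a missing value stays missing in all C."""
    if value is None:
        return {c: None for c in categories}
    return {c: int(value == c) for c in categories}


# Encoding first: the single missing cell becomes C missing cells that a
# column-by-column imputer would have to fill consistently.
encoded_first = [one_hot(v) for v in colour]
missing_cells = sum(v is None for row in encoded_first for v in row.values())

# Imputing first (mode as a stand-in model), then encoding: one decision per
# row, so every row ends up with exactly one category set.
fill = mode([c for c in colour if c is not None])
imputed_first = [one_hot(fill if v is None else v) for v in colour]
row_sums = [sum(row.values()) for row in imputed_first]
```

Checking that every entry of `row_sums` equals 1 is a cheap sanity test for the "two categories in the same row" problem raised in the question.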
answered 8 hours ago
Emma Jean