What should be done first: handling missing data or dealing with data types?


In data science, which step should come first: handling missing data or handling data types? I am asking because each order runs into a problem:

1) Handling missing data first, then data types: it is difficult to handle missing data in categorical columns.

2) Handling data types first, then missing data: if we create dummy variables for categorical values, then handling the missing data afterwards might assign two different categories to the same row.
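To make case 2 concrete, here is a minimal sketch in plain Python. The column `colors`, the helper names, and the "fill with the column mean, thresholded at 0.5" rule are all assumptions chosen for illustration, not a recommended imputation method. It shows how imputing each dummy column on its own can switch on two categories in the same row, while imputing the original categorical column before encoding cannot.

```python
from collections import Counter
from statistics import mean

# Hypothetical categorical column; None marks a missing value.
colors = ["a", "b", "a", "b", None]
levels = sorted({c for c in colors if c is not None})

def one_hot(values, levels):
    """One 0/1 column per level; a missing entry stays None in every column."""
    return {
        lvl: [None if v is None else int(v == lvl) for v in values]
        for lvl in levels
    }

def impute_dummy(column, threshold=0.5):
    """Fill gaps in a single 0/1 column from its own observed mean (illustrative rule)."""
    fill = int(mean(x for x in column if x is not None) >= threshold)
    return [fill if x is None else x for x in column]

# Order 2: encode first, then impute each dummy column independently.
dummies = one_hot(colors, levels)
imputed = {lvl: impute_dummy(col) for lvl, col in dummies.items()}
rows_encode_first = list(zip(*(imputed[lvl] for lvl in levels)))

# Order 1: impute the categorical column (most frequent level), then encode.
observed = [c for c in colors if c is not None]
fill_level = Counter(observed).most_common(1)[0][0]
filled = [fill_level if c is None else c for c in colors]
encoded = one_hot(filled, levels)
rows_impute_first = list(zip(*(encoded[lvl] for lvl in levels)))

print(rows_encode_first[-1])   # the previously missing row, encoded first
print(rows_impute_first[-1])   # the same row, imputed first
```

In the encode-first version the formerly missing row ends up with both dummy columns set, which is exactly the "two categories in one row" problem; the impute-first version always yields exactly one category per row.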







      data-cleaning






asked 9 hours ago by Kiran





























2 Answers


















Handle data types first, then perform multiple imputation.

Several solid implementations of multiple imputation using chained equations (MICE) that I can think of permit contingent imputation, where:

• Each data type gets its own imputation model, so the quality of your imputation depends on handling data types correctly.

• Interdependence between variables (e.g., mutually exclusive categories) can be modeled explicitly (e.g., using ordered logit or multinomial logit).

• Hard dependencies can be respected (e.g., do not impute both $x$ and $x^2$; instead impute only $x$ using chained equations and then calculate $x^2$ from the imputed values of $x$, or vice versa).

In this way you can handle missing data for categorical variables alongside continuous or interval variables.
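The hard-dependency point can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `mean_impute` is a hypothetical stand-in for whatever model a real MICE implementation would fit, and the data are made up. The aim is to show why imputing $x$ and $x^2$ independently breaks the relationship between them, while imputing $x$ and recomputing the square preserves it.

```python
from statistics import mean

# Hypothetical variable with one missing value (None) and its derived square.
x = [1.0, 2.0, None, 4.0]
x_sq = [None if v is None else v * v for v in x]

def mean_impute(column):
    """Stand-in for a real imputation model: fill gaps with the observed mean."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]

# Imputing both columns independently: the filled square no longer matches x.
x_independent = mean_impute(x)
x_sq_independent = mean_impute(x_sq)

# Imputing x only, then deriving the square from the imputed values.
x_imputed = mean_impute(x)
x_sq_derived = [v * v for v in x_imputed]

consistent_independent = all(
    s == v * v for v, s in zip(x_independent, x_sq_independent)
)
consistent_derived = all(s == v * v for v, s in zip(x_imputed, x_sq_derived))
print(consistent_independent, consistent_derived)
```

The same idea carries over when the fill comes from a regression on other variables rather than a mean: only one of the two columns is treated as unknown, and the other is recomputed from it.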



                  References



                  Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49.



                  White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.















answered 7 hours ago by Alexis (edited 3 hours ago)


























I don't think it matters which one you do first, as long as your methods are valid. Yes, imputing missing categorical data is difficult. However, if you create your dummy variables first and then try to impute, you haven't made the problem any easier, because you now have C variables (one per category) to determine.

Personally, I would take the first approach and try to impute what I can (if imputation is the route you are taking). MICE is a very flexible method for imputing data of different types.
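As a rough illustration of "different types, different imputation rules", here is a small sketch in plain Python. It is not MICE: it fills each column once, from that column alone, whereas MICE conditions each variable on the others and produces several completed datasets. The column names, the `column_types` mapping, and the `impute` helper are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing value.
columns = {
    "age": [34.0, None, 29.0, 41.0],
    "city": ["Oslo", "Bergen", None, "Oslo"],
}

# Declaring the type of each column tells the imputer which rule applies.
column_types = {"age": "continuous", "city": "categorical"}

def impute(values, kind):
    """Fill missing entries using a rule chosen by the declared column type."""
    observed = [v for v in values if v is not None]
    if kind == "continuous":
        fill = mean(observed)                          # numeric: observed mean
    elif kind == "categorical":
        fill = Counter(observed).most_common(1)[0][0]  # category: most frequent level
    else:
        raise ValueError(f"unknown column type: {kind}")
    return [fill if v is None else v for v in values]

completed = {
    name: impute(values, column_types[name]) for name, values in columns.items()
}
print(completed)
```

Because the categorical column is imputed as a single variable, any dummy variables created afterwards have exactly one category per row.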

















answered 8 hours ago by Emma Jean























