Philosophical question on logistic regression: why isn't the optimal threshold value trained?



Usually in logistic regression, we fit a model and get some predictions on the training set. We then cross-validate on those training predictions (something like here) and decide the optimal threshold value based on something like the ROC curve.



Why don't we incorporate cross-validation of the threshold INTO the actual model, and train the whole thing end-to-end?
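The two-stage procedure described above can be sketched as follows. This is a minimal illustration with made-up numbers: in a real pipeline, `probs` would be held-out or cross-validated predictions from an already-fitted logistic model, and the second stage then grid-searches the cutoff after the fact.

```python
# Stage 2 of the usual pipeline: given held-out predicted probabilities
# from an already-fitted logistic model, grid-search the cutoff that
# maximizes held-out accuracy. (Toy numbers; names are illustrative.)

def best_threshold(probs, labels):
    """Return the cutoff among `probs` that maximizes accuracy, and that accuracy."""
    best_t, best_acc = 0.5, -1.0
    for t in sorted(set(probs)):
        acc = sum((p >= t) == bool(y) for p, y in zip(probs, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

probs = [0.10, 0.30, 0.35, 0.60, 0.80, 0.90]   # cross-validated p-hats
labels = [0, 0, 1, 0, 1, 1]                     # true classes
t, acc = best_threshold(probs, labels)          # t = 0.35, accuracy 5/6
```

The point of the question is that this second stage sits entirely outside the likelihood being optimized in stage one.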










  • Possible duplicate of Classification probability threshold – EdM, 1 hour ago






  • That thread is certainly related, but I wouldn't call it a duplicate. – gung, 1 hour ago

















Tags: logistic, cross-validation, optimization, roc, threshold






asked 1 hour ago, edited 1 min ago – StatsSorceress



3 Answers

It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.
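To make the "direct probability model" point concrete, here is a minimal sketch (the coefficients are made up, not fitted): the model's output is an estimate of the Bernoulli parameter $p$, and no threshold appears anywhere.

```python
import math

def predict_p(x, beta0=-1.0, beta1=2.0):
    """Estimated Bernoulli parameter: p(x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# The model answers "how probable?", not "which class?":
p_mid = predict_p(0.5)    # linear predictor is 0, so p = 0.5 exactly
p_low = predict_p(-3.0)   # strongly negative predictor, so p is near 0
```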






answered 1 hour ago – gung
  • Okay, I understand that part of the theory (thank you for that eloquent explanation!), but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss? – StatsSorceress, 1 hour ago






  • You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather an ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note, by the way, that Frank Harrell has pointed out that this process will lead to what might be considered an inferior model by many standards. – gung, 1 hour ago











  • Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types; we just care about "correct classification". In that case, could you train end-to-end as I describe? – StatsSorceress, 36 mins ago


















Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error rate trade-off.



A ROC curve is a little bit deceptive because the only thing you control is the threshold; the plot, however, displays TPR and FPR, which are functions of the threshold. Moreover, TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say, by cross-validation), you could come up with a different FPR and TPR at some specific threshold value.



However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.
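One way to make this operational — a sketch using a simple normal-approximation (Wald) interval, chosen for brevity rather than taken from the book cited below — is to attach a confidence interval to the FPR estimated at a candidate threshold and accept the threshold only if the interval's upper endpoint clears a pre-specified ceiling:

```python
import math

def rate_with_ci(hits, n, z=1.96):
    """Binomial point estimate with a normal-approximation 95% CI."""
    p = hits / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical cross-validated counts at one fixed threshold:
# 90 of 100 positives flagged (TPR), 8 of 200 negatives flagged (FPR).
tpr, tpr_lo, tpr_hi = rate_with_ci(90, 100)
fpr, fpr_lo, fpr_hi = rate_with_ci(8, 200)

# Keep this threshold only if the FPR is plausibly below a 10% ceiling:
acceptable = fpr_hi < 0.10
```

With small counts or rates near 0 or 1, an exact or Wilson interval would be a better choice than the Wald approximation used here.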



Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



For more information, see ROC Curves for Continuous Data
by Wojtek J. Krzanowski and David J. Hand.






answered 1 hour ago, edited 1 hour ago – Sycorax
  • This doesn't really answer my question, but it's a very nice description of ROC curves. – StatsSorceress, 1 hour ago

  • In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification? – Sycorax, 1 hour ago

  • I was asking why we don't train the threshold instead of choosing it after training the model. – StatsSorceress, 1 hour ago

  • How would you train a threshold? – Sycorax, 1 hour ago

  • Couldn't you find the optimal threshold for each minibatch, and take an average or something? I have a related question here if you're curious: stackoverflow.com/questions/55788153/… – StatsSorceress, 1 hour ago


















It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).
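For readers who do know their costs, the textbook decision-theoretic result (a standard fact, not something specific to this answer) is that with a cost c_fp for acting on a true negative and c_fn for failing to act on a true positive — correct decisions costing nothing — expected cost is minimized by acting whenever the predicted probability exceeds c_fp / (c_fp + c_fn). The cost figures below are purely illustrative:

```python
def act_cutoff(cost_fp, cost_fn):
    """Expected-cost-minimizing probability cutoff (zero cost when correct).
    Act when (1 - p) * cost_fp < p * cost_fn, i.e. p > cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

# Illustrative costs only: cheap aspirin vs. a major intervention.
cold = act_cutoff(cost_fp=1.0, cost_fn=2.0)         # 1/3: prescribe readily
surgery = act_cutoff(cost_fp=50.0, cost_fn=1000.0)  # ~0.048: a catastrophic miss
                                                    # justifies acting at small p
```

Note how the cutoff moves with the cost ratio alone, which is exactly why no single trained threshold can serve every decision built on the same probability model.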



Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.






    Your Answer








    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "65"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f405041%2fphilosophical-question-on-logistic-regression-why-isnt-the-optimal-threshold-v%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2












    $begingroup$

    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






    share|cite|improve this answer









    $endgroup$












    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      1 hour ago






    • 1




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      1 hour ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      36 mins ago















    2












    $begingroup$

    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






    share|cite|improve this answer









    $endgroup$












    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      1 hour ago






    • 1




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      1 hour ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      36 mins ago













    2












    2








    2





    $begingroup$

    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






    share|cite|improve this answer









    $endgroup$



    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.







    share|cite|improve this answer












    share|cite|improve this answer



    share|cite|improve this answer










    answered 1 hour ago









    gunggung

    110k34268539




    110k34268539











    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      1 hour ago






    • 1




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      1 hour ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      36 mins ago
















    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      1 hour ago






    • 1




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      1 hour ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      36 mins ago















    $begingroup$
    Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
    $endgroup$
    – StatsSorceress
    1 hour ago




    $begingroup$
    Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
    $endgroup$
    – StatsSorceress
    1 hour ago




    1




    1




    $begingroup$
    You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
    $endgroup$
    – gung
    1 hour ago





    $begingroup$
    You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
    $endgroup$
    – gung
    1 hour ago













    $begingroup$
    Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
    $endgroup$
    – StatsSorceress
    36 mins ago




    $begingroup$
    Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
    $endgroup$
    – StatsSorceress
    36 mins ago













    2












    $begingroup$

    Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error rate trade-off.



    A ROC curve is a little bit deceptive because the only thing that you control is the threshold, however the plot displays TPR and FPR, which are functions of the threshold. Moreover, the TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say by cross-validation), you could come up with a different FPR and TPR at some specific threshold value.



    However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.



    Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



    For more information, see ROC Curves for Continuous Data
    by Wojtek J. Krzanowski and David J. Hand.






    share|cite|improve this answer











    $endgroup$












    • $begingroup$
      This doesn't really answer my question, but it's a very nice description of ROC curves.
      $endgroup$
      – StatsSorceress
      1 hour ago










    • $begingroup$
      In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification?
      $endgroup$
      – Sycorax
      1 hour ago










    • $begingroup$
      I was asking why we don't train the threshold instead of choosing it after training the model.
      $endgroup$
      – StatsSorceress
      1 hour ago










    • $begingroup$
      How would you train a threshold?
      $endgroup$
      – Sycorax
      1 hour ago










    • $begingroup$
      Couldn't you find the optimal threshold for each minibatch, and take an average or something? I have a related question here if you're curious: stackoverflow.com/questions/55788153/…
      $endgroup$
      – StatsSorceress
      1 hour ago















    2












    $begingroup$

    Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error rate trade-off.



    A ROC curve is a little bit deceptive because the only thing that you control is the threshold, however the plot displays TPR and FPR, which are functions of the threshold. Moreover, the TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say by cross-validation), you could come up with a different FPR and TPR at some specific threshold value.



    However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.



    Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



    For more information, see ROC Curves for Continuous Data
    by Wojtek J. Krzanowski and David J. Hand.






    share|cite|improve this answer











    $endgroup$












    • $begingroup$
      This doesn't really answer my question, but it's a very nice description of ROC curves.
      $endgroup$
      – StatsSorceress
      1 hour ago










    • $begingroup$
      In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification?
      $endgroup$
      – Sycorax
      1 hour ago










    • $begingroup$
      I was asking why we don't train the threshold instead of choosing it after training the model.
      $endgroup$
      – StatsSorceress
      1 hour ago










    • $begingroup$
      How would you train a threshold?
      $endgroup$
      – Sycorax
      1 hour ago










    • $begingroup$
      Couldn't you find the optimal threshold for each minibatch, and take an average or something? I have a related question here if you're curious: stackoverflow.com/questions/55788153/…
      $endgroup$
      – StatsSorceress
      1 hour ago













    2












    2








    2





    $begingroup$

    Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error rate trade-off.



    A ROC curve is a little bit deceptive because the only thing that you control is the threshold, however the plot displays TPR and FPR, which are functions of the threshold. Moreover, the TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say by cross-validation), you could come up with a different FPR and TPR at some specific threshold value.



    However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.



    Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



    For more information, see ROC Curves for Continuous Data
    by Wojtek J. Krzanowski and David J. Hand.
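
    The confidence-interval idea above can be sketched in a few lines. Everything here is an assumption for illustration: the scores are synthetic normal draws, the 10% FPR ceiling is an arbitrary researcher-specified target, and a simple bootstrap stands in for whatever interval estimator you prefer.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical classifier scores: 500 negatives, 500 positives (synthetic).
    neg = rng.normal(0.3, 0.15, 500)
    pos = rng.normal(0.7, 0.15, 500)

    def rates(threshold, pos, neg):
        # TPR and FPR are statistics computed at a fixed threshold.
        return np.mean(pos >= threshold), np.mean(neg >= threshold)

    def fpr_ci(threshold, neg, n_boot=500, alpha=0.05):
        # Bootstrap the FPR at a fixed threshold to characterize its
        # sampling variability, as described in the answer above.
        fprs = [np.mean(rng.choice(neg, size=neg.size, replace=True) >= threshold)
                for _ in range(n_boot)]
        lo, hi = np.quantile(fprs, [alpha / 2, 1 - alpha / 2])
        return lo, hi

    # Pick the smallest threshold whose FPR is plausibly below 10%,
    # i.e. the upper CI endpoint sits under the target.
    for t in np.linspace(0, 1, 101):
        lo, hi = fpr_ci(t, neg)
        if hi <= 0.10:
            tpr, fpr = rates(t, pos, neg)
            print(f"threshold={t:.2f}  FPR CI=({lo:.3f}, {hi:.3f})  TPR={tpr:.3f}")
            break
    ```

    The point of the sketch is that no repeated ROC construction is needed: the interval at each candidate threshold already tells you whether the error-rate target is plausibly met.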

















    $endgroup$











    edited 1 hour ago

























    answered 1 hour ago









    Sycorax











    • $begingroup$
      This doesn't really answer my question, but it's a very nice description of ROC curves.
      $endgroup$
      – StatsSorceress
      1 hour ago










    • $begingroup$
      In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification?
      $endgroup$
      – Sycorax
      1 hour ago










    • $begingroup$
      I was asking why we don't train the threshold instead of choosing it after training the model.
      $endgroup$
      – StatsSorceress
      1 hour ago










    • $begingroup$
      How would you train a threshold?
      $endgroup$
      – Sycorax
      1 hour ago










    • $begingroup$
      Couldn't you find the optimal threshold for each minibatch, and take an average or something? I have a related question here if you're curious: stackoverflow.com/questions/55788153/…
      $endgroup$
      – StatsSorceress
      1 hour ago







































    $begingroup$

    It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



    If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



    Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



    See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.
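
    The cost-structure argument above can be sketched directly: instead of thresholding a probability, each action is scored by its expected cost and the cheapest one is chosen. The actions and cost numbers below are invented for illustration and are not from any real clinical source.

    ```python
    # Hypothetical cost table: for each action, the cost if the patient is
    # actually healthy vs. actually sick. Purely illustrative numbers.
    costs = {
        "send home":      (0.0, 50.0),
        "run more tests": (5.0, 10.0),
        "treat now":      (20.0, 2.0),
    }

    def best_action(p_sick):
        """Pick the action minimizing expected cost, given P(sick) from a
        probabilistic model such as logistic regression."""
        expected = {action: (1 - p_sick) * c_healthy + p_sick * c_sick
                    for action, (c_healthy, c_sick) in costs.items()}
        return min(expected, key=expected.get)

    # Any 'thresholds' between actions fall out of the costs; nothing is
    # tuned or trained separately, and more than two actions are allowed.
    for p in (0.05, 0.3, 0.9):
        print(p, best_action(p))
    ```

    With these numbers, low probabilities lead to "send home", intermediate ones to "run more tests", and high ones to "treat now"; changing the cost table moves those boundaries without touching the probability model.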















    $endgroup$



























        answered 57 mins ago









        Stephan Kolassa


























